Assignment 1C: Control, Sampling, and Measurement

This assignment consists of answers to several questions about the following paper:

Ardern, J., & Henry, B. (2019). Testing writing on computers: An experiment comparing student performance on tests conducted via computer and via paper-and-pencil. Journal of Research in Digital Education, 20(3), 1-20.

Control

Blinding was used in the scoring process. All paper-and-pencil writing performance responses were transcribed into the computer and intermixed with the computer responses, so raters could not tell whether they were scoring responses from the control group or the experimental group. Whether they realized it or not, any or all of the raters may have expected either computer or paper-and-pencil responses to score higher, and without blinding that expectancy could have introduced bias into the ratings.

Constancy was used in the design of the computer-based assessments. Care was taken to make each page on the computer screen look as similar as possible to the paper version of the exam, keeping the number of items per page, the position of headers and footers, the order of the response options, and other layout features the same. The researchers noted that previous studies had reported that changes in a test's appearance could alter performance, so without this control the performance of the experimental group could have been influenced by the appearance of the exam rather than by the mode of administration.

Sampling

Random selection and random assignment into groups are important for neutralizing threats that could bias the study. By randomly assigning students to either the control or the experimental group, the researchers could assume that roughly the same number of students in each group would be affected by any extraneous variable, preventing such variables from having more of an effect on one group than the other.
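
To illustrate, here is a minimal sketch of how such random assignment could be carried out. The roster size and seed are hypothetical, not taken from the study:

```python
import random

# Hypothetical roster; the study's actual recruitment lists are not available
students = [f"student_{i}" for i in range(1, 121)]

random.seed(42)  # fixed seed so this sketch is reproducible
random.shuffle(students)

midpoint = len(students) // 2
experimental = students[:midpoint]  # would take the computer-based test
control = students[midpoint:]       # would take the paper-and-pencil test
```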

Sample Sizes

  • Experimental (computer) group: 46 – originally recruited 50
  • Control (paper-and-pencil) group: 68 – originally recruited 70

Rule of thumb: a minimum group size of 30, with 40 often recommended, to create comparable groups.

A power analysis suggests at least 63 participants per group are needed to detect a medium effect size (d = 0.50) as statistically significant, and at least 25 per group to detect a large effect size (d = 0.80).
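
These figures can be reproduced with a standard power analysis. Below is a minimal sketch using statsmodels, assuming a two-tailed independent-samples t-test with α = 0.05 and power = 0.80; the exact results differ slightly from the figures above depending on rounding conventions:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group for a medium effect (d = 0.50)
n_medium = analysis.solve_power(effect_size=0.50, alpha=0.05, power=0.80)

# Required sample size per group for a large effect (d = 0.80)
n_large = analysis.solve_power(effect_size=0.80, alpha=0.05, power=0.80)

print(f"Medium effect: {n_medium:.1f} per group")  # ~63.8, consistent with "at least 63"
print(f"Large effect: {n_large:.1f} per group")    # ~25.5, consistent with "at least 25"
```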

Standard Deviation

In this entry, SD refers to the standard deviation of scores on the open-ended (OE) writing exam. It indicates how student scores were dispersed around the mean. Across the 114 assessments, the mean OE exam score was 7.87 out of a possible 14 points, and the standard deviation was 2.96.

Assuming the scores are approximately normally distributed, this indicates that about 68% of the students scored within plus or minus 1 standard deviation of the mean, which when calculated equals between 4.91 (7.87 – 2.96) and 10.83 (7.87 + 2.96).

Approximately 95% of students scored within plus or minus 2 standard deviations of the mean: in other words, between 1.95 (7.87 – 5.92) and 13.79 (7.87 + 5.92).
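
As a quick check, these ranges follow directly from the reported mean and standard deviation:

```python
mean, sd = 7.87, 2.96  # values reported for the OE writing exam

low1, high1 = mean - sd, mean + sd          # ~68% of scores
low2, high2 = mean - 2 * sd, mean + 2 * sd  # ~95% of scores

print(f"Within 1 SD: {low1:.2f} to {high1:.2f}")  # 4.91 to 10.83
print(f"Within 2 SD: {low2:.2f} to {high2:.2f}")  # 1.95 to 13.79
```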

Effect Size

I understand why the researchers interpreted the effect size as both statistically and practically significant. With an effect size of 0.94, the mean of the experimental group sits 0.94 standard deviations above the mean of the control group, placing it at approximately the 83rd percentile of the control group's distribution.
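
The 83rd-percentile figure follows from the normal distribution: the proportion of the control distribution falling below a point 0.94 standard deviations above its mean is Φ(0.94). A quick check with scipy, assuming approximately normal scores:

```python
from scipy.stats import norm

d = 0.94  # reported effect size (Cohen's d)
percentile = norm.cdf(d)  # proportion of control scores below the experimental mean
print(f"{percentile:.1%}")  # ~82.6%, i.e., roughly the 83rd percentile
```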

Measurement

The modest level of inter-rater reliability reported (0.44 to 0.62) indicates that the scores assigned to a student's response often differed among the three raters. Modest or low inter-rater reliability could be considered a source of measurement error and could render the data unusable. However, the researchers in this study attempted to compensate for the modest inter-rater reliability by using the average of the three scores for each student response. Measures with modest or low reliability are undesirable in research because they may yield scores or data that are not as close to the “true” value.
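
As a rough sketch of this averaging strategy, the code below uses made-up rater scores (not data from the study) to show how per-response averages and pairwise rater correlations could be computed:

```python
import numpy as np

# Hypothetical scores from three raters for five responses (0-14 scale)
scores = np.array([
    [8, 7, 9],
    [5, 6, 5],
    [11, 10, 12],
    [7, 7, 6],
    [9, 8, 8],
])

final_scores = scores.mean(axis=1)          # average across raters, as the researchers did
rater_correlations = np.corrcoef(scores.T)  # pairwise correlations between the three raters
print(final_scores)
print(rater_correlations)
```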

Content validity would have been most relevant to this study. Because the goal was to measure student writing performance, it would be important to ensure that the assessment's tasks and scoring criteria actually represent that construct.
