How much difference does it make? Notes on understanding, using, and calculating effect sizes for schools
A good way of presenting differences between groups or changes over time in test scores or other measures is by ‘effect sizes’, which allow us to compare things happening in different classes, schools or subjects regardless of how they are measured. This booklet is designed to help school staff to understand and use effect sizes, and includes handy tips and warnings as well as useful tables to calculate effect size values from change scores on standardised tests.
Author(s): Ian Schagen, Research Division, Ministry of Education and Edith Hodgen, New Zealand Council for Educational Research.
Date Published: March 2009
Section 2: Getting a standard deviation
If we have a set of data and want to estimate the standard deviation, the easiest way is probably to put the data into a spreadsheet and use its built-in functions. To calculate it by hand, follow these steps:
- Calculate the mean of the data by adding up all the values and dividing by the number of cases.
- Subtract the mean from each value to get a "deviation" (positive or negative).
- Square these deviations and add them all up.
- Divide the result by the number of cases minus 1.
- Take the square root to get the standard deviation.
Here is a worked example with the following values: 10, 13, 19, 24, 6, 23, 15, 18, 22, 17.
- Mean = 167/10 = 16.7.
- Deviations: -6.7, -3.7, 2.3, 7.3, -10.7, 6.3, -1.7, 1.3, 5.3, 0.3.
- Squared deviations: 44.89, 13.69, 5.29, 53.29, 114.49, 39.69, 2.89, 1.69, 28.09, 0.09. Sum of these = 304.1.
- Divide by 10 - 1 = 9: 304.1/9 = 33.79.
- Square root: 5.81.
Therefore the standard deviation is estimated as 5.81. However, if we gave the same test to a different group of 10 students we would almost certainly get a different estimate of the standard deviation, so estimating it in this way is not ideal. If the value we use to standardise our results depends on the exact sample of students tested, our effect size measure has an extra element of variability which needs to be taken into account.
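As a check on the worked example, the hand calculation can be reproduced in a few lines of Python. This is only a sketch: the function name `sample_sd` is our own, and the standard library's `statistics.stdev` is shown alongside because it uses the same "divide by the number of cases minus 1" rule.

```python
import math
import statistics

# The ten values from the worked example.
scores = [10, 13, 19, 24, 6, 23, 15, 18, 22, 17]


def sample_sd(values):
    """Standard deviation following the hand-calculation steps (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n                                   # mean of the data
    squared_deviations = [(v - mean) ** 2 for v in values]   # square each deviation
    variance = sum(squared_deviations) / (n - 1)             # divide by cases minus 1
    return math.sqrt(variance)                               # take the square root


print(round(sample_sd(scores), 2))         # 5.81
print(round(statistics.stdev(scores), 2))  # 5.81, same rule built into Python
```

Most spreadsheets offer an equivalent built-in function, so the choice of tool matters less than using the same rule each time.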
Another issue arises when we test and retest students. Which standard deviation do we use: the pre-test one, the post-test one, or some kind of "pooled" standard deviation? If we use the pre-test, then it may be that all students start from the same low state of understanding and the standard deviation is quite small (or even zero). Because the effect size is calculated by dividing by the standard deviation, a very small value will grossly inflate the result, and a value of zero makes the calculation impossible. The same might happen with the post-test, if we've brought everyone up to the same level. The "pooled" standard deviation is basically an average of the two, but this might also suffer from the same issues.
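For readers who want to see what a "pooled" value looks like in practice, here is a minimal sketch using the commonly used pooled standard deviation formula, which weights each group's variance by its number of cases minus 1. The text above does not specify a formula, so treat this choice as an assumption; the function name `pooled_sd` and the figures are hypothetical.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups, weighted by (n - 1).

    Assumes the common textbook formula; the result lies between sd1 and sd2,
    closer to the value from the larger group.
    """
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_variance)


# Hypothetical figures: 30 students at pre-test (SD 4.0), 10 at post-test (SD 6.0).
print(round(pooled_sd(4.0, 30, 6.0, 10), 2))  # 4.55, nearer to the larger group's SD
```

If both standard deviations are very small, the pooled value is small too, which is why pooling does not remove the problem described above.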
A better option is to use a value which is fixed for every different "outing" of the same test and which we can use regardless of which particular group of students is tested. If the test has been standardised on a large sample, then there should be data available on its overall standard deviation and this is the value we can use. If it's one we've constructed ourselves then we may need to wait for data on a fair few students to become available before calculating a standard deviation to be used for all effect size calculations.
Another option is to cheat. Suppose we have created a test which is designed to be appropriate over a range of abilities, with an average score we expect to be about 50 percent. We also expect about 95 percent of students to get scores between about 10 percent and 90 percent. The normal "bell-shaped" curve (see diagram above) has 95 percent of its values between about plus or minus twice the standard deviation from the mean.
So if 90 - 10 = 80 = 4 x standard deviation, then the estimated standard deviation is 80/4 = 20.
If we use 20 as the standard deviation for all effect size calculations, regardless of the group tested, then we have the merits of consistency and no worries about how to estimate this. We can check our assumption once we've collected enough data, and modify the value if required.
Another example: Suppose we monitored students on a rating scale, from 0 (knows nothing) to 8 (totally proficient). Then we might say that the nominal standard deviation was around 8/4 = 2.0, and use this value to compute effect sizes for all changes monitored using this scale.
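The two "nominal" standard deviations above follow the same rule of thumb, which can be sketched as below. The function names are our own, and the effect size line assumes the usual definition (mean change divided by the standard deviation); the 5-point gain is a made-up figure for illustration only.

```python
def nominal_sd(lowest, highest):
    """Nominal SD: the expected range of the middle 95 percent of scores, divided by 4."""
    return (highest - lowest) / 4


def effect_size(mean_change, sd):
    """Effect size, assumed here to be the mean change divided by the SD."""
    return mean_change / sd


print(nominal_sd(10, 90))  # 20.0 for the percentage test
print(nominal_sd(0, 8))    # 2.0 for the 0 to 8 rating scale

# Hypothetical: a class gains 5 percentage points on average on the percentage test.
print(effect_size(5, nominal_sd(10, 90)))  # 0.25
```

Because the same nominal value is used for every group, two classes with the same average gain get the same effect size, whatever the spread of scores in each class happened to be.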
- If possible, use the published standard deviation for a standardised test.
- For a test where there are no published norms, calculate the standard deviation for each set of data, and state whether you chose to take:
  - the standard deviation for the first set of scores
  - the standard deviation for the second set of scores
  - a pooled value that lies between the two (closer to the value from the set of scores that has more students)
  - an estimate (or "nominal SD") from the expected highest and lowest scores for the middle 95 percent of the students:
    - SD = (highest estimate - lowest estimate)/4.