How much difference does it make? Notes on understanding, using, and calculating effect sizes for schools
A good way of presenting differences between groups or changes over time in test scores or other measures is by ‘effect sizes’, which allow us to compare things happening in different classes, schools or subjects regardless of how they are measured. This booklet is designed to help school staff to understand and use effect sizes, and includes handy tips and warnings as well as useful tables to calculate effect size values from change scores on standardised tests.
Author(s): Ian Schagen, Research Division, Ministry of Education and Edith Hodgen, New Zealand Council for Educational Research.
Date Published: March 2009
Section 4: Uncertainty in effect sizes
As with any statistical calculation, effect sizes are subject to uncertainty. If we repeated the exercise with a randomly different bunch of students we would get a different answer. The question is: How different? And how do we estimate the likely magnitude of the difference?
The term "standard error" (SE for short) is used to refer to the standard deviation in the likely error around an estimated value. Generally, 95 percent of the time the "true" value will be within plus or minus two SEs of the estimated value, and 68 percent of the time it will be within plus or minus one SE of the estimated value. If we assume the standard deviation of the underlying scores has been fixed in some way, then the SE of an effect size is just the SE in the difference of two means divided by the standard deviation.
When we are trying to measure the average test score, we expect that an average based on five students is less likely to be very near the true score for those students than an average based on 500 students would be. A single student having a very bad or good day could affect a five-student average quite a lot, but would have very little effect on a 500-student average. In the same way, the uncertainty in estimates of effect size is much greater for small groups of students than it is for large ones. In fact, as we'll see later, the uncertainty in effect size can be well approximated just using the number of students involved.
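As a rough illustration of how group size drives uncertainty, the sketch below computes the standard error of a mean for groups of five and 500 students. The standard deviation of 2.0 is an assumed value chosen only for illustration, and the function name `se_of_mean` is hypothetical.

```python
import math


def se_of_mean(sd: float, n: int) -> float:
    """Standard error of a mean: standard deviation divided by sqrt(n)."""
    return sd / math.sqrt(n)


ASSUMED_SD = 2.0  # assumption for illustration only

se_small = se_of_mean(ASSUMED_SD, 5)
se_large = se_of_mean(ASSUMED_SD, 500)

print(f"5 students:   SE of mean = {se_small:.2f}")   # 0.89
print(f"500 students: SE of mean = {se_large:.2f}")   # 0.09
```

A hundred times as many students gives a standard error ten times smaller, because the standard error shrinks with the square root of the group size.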
The calculations work differently if we are dealing with two separate groups or with measurements at two points in time for the same group. Let us do an example calculation both ways, assuming an 8-point scale with nominal standard deviation 2.0. Here are some data, for two groups A and B:
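[Table: Data for Groups A and B and the Difference, with column means]
In scenario 1, groups A and B are two separate sets of students. The estimated effect size is the difference in group means divided by the nominal standard deviation: (4.5 - 3.1)/2.0 = 0.7. The SE for the mean of group A is calculated from the standard deviation of the group A scores divided by the square root of the number of cases (10), giving the value 0.46. A similar calculation for group B yields a value of SE equal to 0.45.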
To get the SE for the difference in group means we need to combine these two separate SEs, by squaring them, summing them, and then taking the square root.
This gives: SE of group mean difference = √(0.46² + 0.45²) = 0.64.
Therefore, SE of effect size = SE of group mean difference/(nominal SD) = 0.64/2.0 = 0.32.
A 95 percent confidence interval for the effect size is therefore 0.70 ± 1.96 × 0.32 = 0.07 to 1.33.
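The scenario 1 steps can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `effect_size_two_groups` is hypothetical, and the group A mean of 3.1 is inferred from the group B mean of 4.5 and the mean difference of 1.4 reported in this section.

```python
import math


def effect_size_two_groups(mean_a, mean_b, se_a, se_b, nominal_sd, z=1.96):
    """Effect size, its SE, and a 95% CI for two separate groups."""
    effect = (mean_b - mean_a) / nominal_sd
    # Combine the two SEs: square them, sum them, take the square root.
    se_difference = math.sqrt(se_a**2 + se_b**2)
    se_effect = se_difference / nominal_sd
    interval = (effect - z * se_effect, effect + z * se_effect)
    return effect, se_effect, interval


# Values from the worked example (group A mean inferred, see above).
effect, se_effect, (low, high) = effect_size_two_groups(
    mean_a=3.1, mean_b=4.5, se_a=0.46, se_b=0.45, nominal_sd=2.0
)
print(f"effect size {effect:.2f}, SE {se_effect:.2f}, "
      f"95% CI {low:.2f} to {high:.2f}")
# prints: effect size 0.70, SE 0.32, 95% CI 0.07 to 1.33
```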
In scenario 2, group B is just the same set of students as group A, but tested at a later point in time. In this case we are interested in the difference scores, in the last column of the table above. The mean difference is 1.4, with a standard deviation of 1.26 and SE 0.40 (= standard deviation divided by square root of number of cases). The estimated effect size is still 0.70, but now with a value of SE equal to 0.40/2.0 = 0.20. A 95 percent confidence interval for the effect size is therefore 0.70 ± 1.96 × 0.20 = 0.31 to 1.09.
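The scenario 2 calculation works from the difference scores alone. A minimal sketch, using the summary figures quoted above (mean difference 1.4, standard deviation 1.26, 10 students); the function name `effect_size_same_students` is hypothetical.

```python
import math


def effect_size_same_students(mean_diff, sd_diff, n, nominal_sd, z=1.96):
    """Effect size, its SE, and a 95% CI when the same students are retested."""
    effect = mean_diff / nominal_sd
    # SE of the mean difference: SD of the difference scores over sqrt(n).
    se_difference = sd_diff / math.sqrt(n)
    se_effect = se_difference / nominal_sd
    interval = (effect - z * se_effect, effect + z * se_effect)
    return effect, se_effect, interval


effect, se_effect, (low, high) = effect_size_same_students(
    mean_diff=1.4, sd_diff=1.26, n=10, nominal_sd=2.0
)
print(f"effect size {effect:.2f}, SE {se_effect:.2f}, "
      f"95% CI {low:.2f} to {high:.2f}")
# prints: effect size 0.70, SE 0.20, 95% CI 0.31 to 1.09
```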
If we look at the size of the two confidence intervals, why is the second (0.31 to 1.09) so much narrower than the first (0.07 to 1.33)? In scenario 1, we measured different students on the two occasions, so some of the differences in score will be due to differences between students, and some due to what happened between testing points. In scenario 2, we measured the same students on both occasions, so we expect the second scores to be relatively similar to the first, with the difference between scores being mainly due to what happened between testing points, which means that the effect size is measured with less error.
A simpler estimate of SEs
An even simpler way of estimating SE values makes use of the fact that the actual standard deviations approximately cancel out in the above formulae, so that all we need to know to calculate the standard error is the number of students. The simple formulae are:
- Two separate samples (scenario 1): SE = square root of (1 divided by number in first group + 1 divided by number in second group) = √(1/10 + 1/10) = 0.45.
- Same sample retested (scenario 2): SE = square root of (1 divided by number in sample) = √(1/10) = 0.32, assuming a moderate relationship between test scores (a correlation of r = 0.5; the general formula for other values of r is given in the notes at the end of this section).
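One way to see why the standard deviations drop out is sketched below. It assumes that both sets of scores share a common standard deviation, and that the same value is used to scale the effect size. The symbols s (common standard deviation), n₁ and n₂ (group sizes), n (number of retested students), r (correlation between the two sets of scores), and d (effect size) are introduced here for illustration.

```latex
% Scenario 1: two separate groups with a common standard deviation s
\mathrm{SE}(\bar{x}_B - \bar{x}_A)
  = \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}}
  = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\quad\Longrightarrow\quad
\mathrm{SE}(d) = \frac{\mathrm{SE}(\bar{x}_B - \bar{x}_A)}{s}
  = \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

% Scenario 2: the same n students tested twice, correlation r
\mathrm{Var}(x_B - x_A) = s^2 + s^2 - 2 r s^2 = 2 s^2 (1 - r)
\quad\Longrightarrow\quad
\mathrm{SE}(d) = \frac{1}{s}\sqrt{\frac{2 s^2 (1 - r)}{n}}
  = \sqrt{\frac{2(1 - r)}{n}}

% With r = 0.5 the second result reduces to \sqrt{1/n}
```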
The main reason why these "quick" estimates are different from those calculated earlier is that we have previously divided by a nominal standard deviation of 2.0 rather than the "pooled estimate" of 1.44. Had we used the pooled estimate we would have had 0.64/1.44 = 0.44 and 0.4/1.44 = 0.28.
This method for quickly estimating standard errors can be quite useful for judging the likely uncertainty in effect size calculations for particular sample sizes.
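To get a feel for the uncertainty at different sample sizes, the quick two-group formula can be tabulated. This is a sketch only: the function name `quick_se_two_groups` and the group sizes of 10, 30, and 100 are chosen for illustration.

```python
import math


def quick_se_two_groups(n_first: int, n_second: int) -> float:
    """Quick SE of an effect size for two separate groups of students."""
    return math.sqrt(1 / n_first + 1 / n_second)


# Illustrative group sizes (equal numbers in each group).
for n in (10, 30, 100):
    se = quick_se_two_groups(n, n)
    margin = 1.96 * se  # half-width of an approximate 95% confidence interval
    print(f"{n:>3} per group: SE about {se:.2f}, 95% margin about +/-{margin:.2f}")
```

With 10 students in each group the function returns about 0.45, matching the quick estimate given above for scenario 1.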
If different groups of students did the two tests, use:
SE = square root of (1/(number in first group) + 1/(number in second group)) or use Table 6.
If the same students did the two tests, use:
SE = square root of (2*(1-r)/number of students)
where r is the correlation between the first and second test scores or use Table 7.
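The formula for retested students can be sketched as follows, showing how a stronger correlation between the two sets of scores reduces the standard error. The function name `quick_se_same_students` and the correlations 0.7 and 0.9 are illustrative assumptions; only r = 0.5 comes from the text.

```python
import math


def quick_se_same_students(n_students: int, r: float) -> float:
    """Quick SE of an effect size when the same students sit both tests.

    r is the correlation between first and second test scores.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be a correlation between -1 and 1")
    return math.sqrt(2 * (1 - r) / n_students)


# Ten students, with increasingly strong correlation between the two tests.
for r in (0.5, 0.7, 0.9):
    print(f"r = {r}: SE about {quick_se_same_students(10, r):.2f}")
```

At r = 0.5 the result for 10 students is about 0.32, the same as the simplified √(1/10) estimate.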
To make calculating effect sizes and their confidence intervals easier, we have made some tables for the main standardised tests used in New Zealand (asTTle, STAR, and PAT); see Tables 1-7 on pages 15-23. These tables allow you to read off an approximate effect size for a test, given a mean difference or change score. They assume that the difference is over a year, and take into account the expected growth over that year (see the examples). Example 3 (p. 12) shows how the tables can be used if the scores are not measured a year apart.
- The value 2.0 in the formula is the "nominal" or assumed value for our scale. Instead of this, we could use an average or "pooled" standard deviation estimated from the data as 1.44. This would give higher estimates of effect size, but would change if we took a different sample of students.
- If the correlation, r, between scores is known, then the formula is SE = square root of (2(1 - r)/n), where n is the number of students.