How much difference does it make? Notes on understanding, using, and calculating effect sizes for schools
A good way of presenting differences between groups or changes over time in test scores or other measures is by ‘effect sizes’, which allow us to compare things happening in different classes, schools or subjects regardless of how they are measured. This booklet is designed to help school staff to understand and use effect sizes, and includes handy tips and warnings as well as useful tables to calculate effect size values from change scores on standardised tests.
Author(s): Ian Schagen, Research Division, Ministry of Education and Edith Hodgen, New Zealand Council for Educational Research.
Date Published: March 2009
Section 8: How easy is it to calculate effect sizes for New Zealand standardised tests?
What you need to know before you can calculate an effect size is:
- the two sample means
- the expected growth for students in the relevant year levels
- the standard deviation.
The two practical issues are how to work out the expected growth and which standard deviation to use. For standardised tests, it is best to use the published expected growth to correct any change in score, so that the result reflects only progress beyond expectation. Manuals or other reference material for the tests also give the standard deviation from the norming study, and this is the best value to use when calculating an effect size.
This means that, in fact, tables of effect sizes are easy to construct for a test, given the year level of the students (see Tables 1–5).
The examples below will walk you through both doing the actual calculations, and using tables to look up approximate values.
Example 1 (PAT Maths, one year between tests):
If a group of Year 4 students achieved a PAT Maths score of 28.2 at the start of one year, and at the start of the next the same students achieved a score of 39.9 (in Year 5), they had a mean difference of 11.7, which is a little in advance of the overall mean difference of 8.5 at that year level (Table 1). To look up the effect size for this difference in Table 1, find the difference of 11.7 down the left side of the table, and Year 5 under PAT Mathematics across the top. The nearest difference down the left-hand side is 11.5, and the matching effect size is 0.23. Had the difference been 12.0, the effect size would have been 0.27, so if we calculated the effect size, rather than looking it up, it would probably have come out at around 0.24, which we could take as our estimate.
Alternatively, the effect size could be calculated directly using the data in the table. Our difference of 11.7 needs to be "deflated" by the expected growth (40.3 – 31.8 = 8.5) and then divided by the Year 5 standard deviation of 13.2. This gives an effect size of (11.7 – 8.5)/13.2 = 0.24.
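The direct calculation can also be written as a few lines of code. The following is a minimal sketch only, using the figures quoted in Example 1; the function name `effect_size` and the variable names are illustrative and do not come from the booklet.

```python
def effect_size(mean_difference, expected_growth, sd):
    """Deflate the observed gain by the expected growth, then divide by the SD."""
    return (mean_difference - expected_growth) / sd


# Figures quoted in Example 1 (PAT Maths, Year 4 to Year 5).
observed_gain = 39.9 - 28.2      # mean difference between the two tests (11.7)
expected_growth = 40.3 - 31.8    # published expected growth for the year (8.5)
year5_sd = 13.2                  # published Year 5 standard deviation

d = effect_size(observed_gain, expected_growth, year5_sd)
print(round(d, 2))  # 0.24
```

The same function applies to any test for which the expected growth and standard deviation have been published.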
Once we have an effect size, it is easy to add a confidence interval. Suppose that in the example above there were 57 Year 4 students and a year later, 64 students took the test in Year 5, and individual students were not matched (because, say, the school was one where there is a very high transience rate). The standard error of the effect size can be read off Table 6. Both samples are around 60 students, and the matching standard error is 0.18. Had the samples been smaller, say both were of size about 50 (this was the nearest option in the table), the standard error would have been 0.20. If one sample was 60 and the other 50, the standard error would have been 0.19. So, taking all these options into account, 0.18 looks like a good estimate.
A 68 percent confidence interval for the effect size would be from 0.24 – 0.18 = 0.06 to 0.24 + 0.18 = 0.42, and a 95 percent confidence interval would be from about 0.24 – 2*0.18 = -0.12 to 0.60. Using the more stringent criterion, we cannot be sure that there was an effect.
What if we had more students? If we had 120 students in Year 4 and 140 in Year 5, the standard error would be somewhere between 0.12 and 0.14 (looking at the values for samples of 100 and 150, the nearest numbers in the table), so we can use 0.13. This would give a 68 percent confidence interval of 0.11 to 0.37, and a 95 percent confidence interval of -0.02 to 0.50, and we can still not be certain that there was an effect.
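As an alternative to reading Table 6, the standard error for unmatched samples can be approximated directly. The sketch below assumes the common large-sample approximation SE ≈ sqrt(1/n1 + 1/n2). This is an assumption rather than the formula necessarily used to build Table 6, although it gives 0.18, 0.20 and 0.19 for the sample sizes discussed in Example 1. The function names are illustrative.

```python
import math


def se_unmatched(n1, n2):
    """Approximate standard error of an effect size for two unmatched samples.

    Assumption: SE ~ sqrt(1/n1 + 1/n2); this may differ slightly from the
    values tabulated in the booklet.
    """
    return math.sqrt(1.0 / n1 + 1.0 / n2)


def confidence_interval(d, se, multiplier):
    """Return (lower, upper); multiplier 1 is about 68 percent, 2 about 95 percent."""
    return (d - multiplier * se, d + multiplier * se)


d = 0.24  # effect size from Example 1
for n1, n2 in [(57, 64), (120, 140)]:
    se = se_unmatched(n1, n2)
    low, high = confidence_interval(d, se, 2)
    includes_zero = low <= 0 <= high
    print(n1, n2, round(se, 2), round(low, 2), round(high, 2), includes_zero)
```

For the larger samples the direct calculation gives a standard error of about 0.12, close to the table-based estimate, and the approximate 95 percent interval still includes zero.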
Example 2 (PAT Reading, one year between tests):
A group of 60 Year 6 students had a mean PAT Reading comprehension score of 38.1, and when they were in Year 7 the mean score of the same students was 58.7. Their mean difference was 20.6, a great deal higher than the expected growth of 53.2 – 45.0 = 8.2. The effect size from Table 1 is 0.98 or 0.99 (PAT Reading comprehension, Year 7, difference 20.5, which is the nearest to 20.6).
The standard error for this effect size is 0.08 from Table 7, assuming a correlation of 0.8 and a sample of 60. This gives a 68 percent confidence interval of 0.99 – 0.08 = 0.91 to 0.99 + 0.08 = 1.07, and a 95 percent confidence interval of 0.99 – 2*0.08 = 0.83 to 0.99 + 2*0.08 = 1.15. Without doubt substantial progress was made in this case.
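When the same students are tested twice, the standard error depends on the correlation between the two sets of scores as well as on the number of students. The sketch below assumes the approximation SE ≈ sqrt(2(1 – r)/n), where r is the correlation and n the number of matched students. This is an assumption that reproduces the 0.08 quoted above, not necessarily the exact formula behind Table 7, and the function name is illustrative.

```python
import math


def se_matched(n, correlation):
    """Approximate standard error of an effect size for matched samples.

    Assumption: SE ~ sqrt(2 * (1 - r) / n), with r the correlation between
    the two test occasions and n the number of matched students.
    """
    return math.sqrt(2.0 * (1.0 - correlation) / n)


# Example 2: 60 matched students, assumed correlation of 0.8.
print(round(se_matched(60, 0.8), 2))  # 0.08

# A higher correlation between the two tests gives a smaller standard error.
print(se_matched(60, 0.9) < se_matched(60, 0.8))  # True
```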
Example 3 (asTTle Writing, two years between tests):
A school has had an intervention for two years. At the start, the 360 students in Year 4 at the school had an average asTTle Writing score of 390, and two years later the 410 students then in Year 6 had an average writing score of 521. Over the two years, they made a mean gain of 521 – 390 = 131. The tables are made for a single year's growth, so to use a table we need to "discount" the gain by the expected gain for the first year (the table will do the discounting for the second year).
The expected gain in asTTle Writing between Year 4 and Year 5 is 482 – 454 = 28 (from the top of Table 5), so our "discounted" gain is 131 – 28 = 103. In Table 5, the effect size for a mean difference of 103, for students now in Year 6, is between 0.78 and 0.83, so we can take a value of 0.81 (103 is a little closer to 105 than to 100).
The standard error of the effect size is between 0.06 and 0.08 (looking at the Table 6 values for n1 and n2 of 300 and 500, the nearest sample sizes in the table), so 0.07 looks like a good estimate.
The 95 percent confidence interval for the effect size is 0.81 ± 2*0.07, or 0.67 to 0.95.
We can say that the intervention appeared to be very effective.
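The two steps in Example 3, discounting the first year's expected gain and then reading between two table rows, can be sketched as follows. The code uses only the figures quoted above; the use of linear interpolation between the two nearest Table 5 entries is an assumption about how to "read between the lines" of the table, and the function name is illustrative.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between two table entries (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


# Two-year gain for the school in Example 3 (asTTle Writing).
raw_gain = 521 - 390

# Expected gain for the first year (Year 4 to Year 5), from the top of Table 5.
first_year_expected = 482 - 454

# The table itself discounts the second year's expected gain.
discounted_gain = raw_gain - first_year_expected
print(discounted_gain)  # 103

# Table 5 entries quoted in the text: difference 100 -> 0.78, difference 105 -> 0.83.
d = interpolate(discounted_gain, 100, 0.78, 105, 0.83)
print(round(d, 2))  # 0.81
```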
Example 4 (STAR, not quite one year between tests):
A group of 237 students had a mean STAR stanine score of 3.7 at the start of a year, and a score of 4.5 at the end of the year. STAR scores, and all other stanine scores, have a mean of 5 and standard deviation of 2. If a student progresses as expected, their stanine score will stay more or less the same over time.
Table 2 is provided for completeness, but effect sizes are very easily calculated for stanine scores (divide the mean difference by 2), and so long as the standardisation process was appropriate for each student's age and time of year, it doesn't matter how far apart in time the scores are (they do not need "discounting" for expected progress).
In this example:
Effect size = (4.5 – 3.7)/2 = 0.4 (or look up 4.5 – 3.7 = 0.8 in Table 2).
The standard error is about 0.03, as STAR tests tend to have a correlation of between 0.8 and 0.9 and the number of students is between 200 and 250. This gives a 95 percent confidence interval of 0.4 – 2*0.03 = 0.34 to 0.4 + 2*0.03 = 0.46. This would often be considered to indicate a moderate effect.
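Because stanine scores are already standardised against expected progress, the calculation in Example 4 needs no discounting step. A minimal sketch, with an illustrative function name and the stanine standard deviation of 2 taken from the text:

```python
STANINE_SD = 2.0  # standard deviation of any stanine scale


def stanine_effect_size(mean_before, mean_after):
    """Effect size for stanine scores: the mean difference divided by 2.

    No deflation for expected growth is applied, because stanines are
    standardised for the students' age and the time of year.
    """
    return (mean_after - mean_before) / STANINE_SD


d = stanine_effect_size(3.7, 4.5)
print(round(d, 2))  # 0.4
```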
- Effect sizes are a useful device for comparing results on different measures, or over time, or between groups, on a scale which does not depend on the exact measure being used.
- Effect size measures are useful for comparing results on different tests (a comparison of two scores on the same standardised test is not made much more meaningful by using effect sizes).
- Effect sizes can be used to compare different groups of students, but are most often used to measure progress over time.
- Effect sizes measured at two different time points need to be "deflated" to account for expected progress (unless both measures are standardised against expected progress – for example, stanine scores).
- Published standard deviations should be used for standardised tests.
- For other tests, either an approximate SD can be guessed from the spread of scores, or the SD of the sample data can be calculated.
- Effect sizes should be quoted with a confidence interval.
- How the confidence interval is calculated depends on whether the same students were measured at the two time points (matched samples) or not.
- The confidence interval can be used to judge whether the effect is large enough to be considered unlikely to be a lucky chance (the interval should not include zero).
- Regression to the mean can produce a spuriously large effect size if the group of students being measured was selected as being the lowest performing 10 or 20 percent.
- The effect size measure discussed here is the most commonly used one, but is only really suited to comparing two sets of scores. There are other measures for more complicated comparisons.