How much difference does it make? Notes on understanding, using, and calculating effect sizes for schools
A good way of presenting differences between groups or changes over time in test scores or other measures is by ‘effect sizes’, which allow us to compare things happening in different classes, schools or subjects regardless of how they are measured. This booklet is designed to help school staff to understand and use effect sizes, and includes handy tips and warnings as well as useful tables to calculate effect size values from change scores on standardised tests.
Author(s): Ian Schagen, Research Division, Ministry of Education and Edith Hodgen, NZCER.
Date Published: March 2009
Section 1: Introduction
Suppose you tested a class in a topic and then gave them some kind of learning experience before testing them again, and on average their scores increased by 12 points. Another teacher in another school using a different test and another learning experience found a rise in scores of 25 points on average. How would you try to judge which was the better learning experience, in terms of improvement in scores?
Well, you can't just compare the changes in scores because they're based on totally different tests. Let's say the first test was out of 30 and the second out of 100 - even that doesn't help us because we don't know the spread of scores, or any way of mapping the results on one test into results on the other. It's as if every time we drove a car the speed came up in different units: kilometres per hour, then feet per second, then poles per fortnight. Not very useful.
One of the important aspects of any test is the amount of "spread" in the scores it usually produces, and a conventional way of measuring this is the "standard deviation" (often called SD for short). Many test scores have a hump- or lump- or bell-shaped distribution, with most students scoring in the middle, and fewer scoring very high or low. The theoretical distribution usually known as the "normal distribution" often describes test scores well. This diagram shows what the standard deviation looks like for an idealised test with a "bell-shaped" or normal distribution of scores.
When scores have a distribution like this, about 68 percent of the scores lie within one standard deviation of the mean, and about 95 percent lie within two standard deviations of the mean. Almost all scores (about 99.7 percent) lie within three standard deviations of the mean.
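These percentages can be checked with a few lines of Python. This is only a sketch, assuming scores follow an exact normal distribution, which real test scores never quite do; the function name `share_within` is made up for this example.

```python
import math

def share_within(k: float) -> float:
    """Proportion of a normal distribution lying within k SDs of the mean."""
    # For a normal distribution, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

Run as written, this gives roughly 68.3, 95.4 and 99.7 percent for one, two and three standard deviations.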
This standard deviation measure is a good way of comparing the spreads of different tests and hence getting a direct comparison of what are sometimes called "change scores". A change score is the difference between two test scores, usually for the same kind of test taken at different times. A change score is a way to measure progress.
There are two ways of getting a new measure from the test scores that is easier to compare in a meaningful way:
- "Standardise" each test to have the same mean and standard deviation, so that you can compare score changes directly. For example, "IQ" tests tend to all have mean 100 and standard deviation 15; international studies (such as PISA and TIMSS) go for mean 500 and standard deviation 100.
- Divide the change score, or difference between scores over time, T2 - T1, for each test by the standard deviation to get a fraction which is independent of the test used - we shall call this fraction an "effect size".
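The two approaches above can be sketched in Python as follows. The scores are made up, the function names are hypothetical, and the choice of the first test's standard deviation as the divisor is an assumption for illustration only (how best to estimate the standard deviation is a separate question). The point of the sketch is that the effect size stays the same when the same results are expressed in different units.

```python
from statistics import mean, pstdev

def standardise(scores, new_mean=100.0, new_sd=15.0):
    """Approach 1: rescale scores to a chosen mean and standard deviation."""
    m, s = mean(scores), pstdev(scores)
    return [new_mean + new_sd * (x - m) / s for x in scores]

def effect_size_from_scores(t1, t2):
    """Approach 2: average change (T2 - T1) divided by the standard deviation.

    Assumption: the SD of the first test's scores is used as the divisor.
    """
    return (mean(t2) - mean(t1)) / pstdev(t1)

# Made-up scores for five students on a test marked out of 30.
t1 = [10, 14, 18, 22, 26]
t2 = [13, 17, 21, 25, 29]

# The same results re-expressed as percentages (a different "unit").
t1_pct = [x * 100 / 30 for x in t1]
t2_pct = [x * 100 / 30 for x in t2]

es_raw = effect_size_from_scores(t1, t2)
es_pct = effect_size_from_scores(t1_pct, t2_pct)
print(f"effect size, raw marks:   {es_raw:.2f}")
print(f"effect size, percentages: {es_pct:.2f}")

# Approach 1 applied to the first test: mean 100, SD 15, as for "IQ"-style scales.
iq_style = standardise(t1)
print(f"standardised mean = {mean(iq_style):.1f}, SD = {pstdev(iq_style):.1f}")
```

The average change of 3 marks becomes 10 percentage points, but both versions give the same effect size because the standard deviation is rescaled by the same factor.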
In this paper we focus on the second approach and try to show how to calculate, use and understand effect sizes in a variety of contexts. By using effect sizes we should be able to do the following:
- investigate differences between groups of students on a common scale (like using kilometres/hour all the time)
- see how much change a particular teaching approach makes, again on a common scale
- compare the effects of different approaches in different schools and classrooms
- know about the uncertainty in our estimates of differences or changes, and whether these are likely to be real or spurious.
To summarise the two key terms:
- The standard deviation is a measure of the average spread of scores about the mean (average) score; almost all scores lie within three standard deviations of the mean.
- An effect size is a measure that is independent of the original units of measurement; it can be a useful way to measure how much effect a treatment or intervention had.
Back to our example. Let's assume we know the following and we'll worry about how to get the standard deviation values later:
Class A: Test standard deviation = 10; average change in scores = 12; effect size = 1.2.
Class B: Test standard deviation = 30; average change in scores = 25; effect size = 0.83.
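The arithmetic behind these two effect sizes is just the average change divided by the standard deviation. A minimal sketch, using the figures given for Class A and Class B (the function and variable names are our own):

```python
def effect_size(average_change: float, standard_deviation: float) -> float:
    """Effect size = average change in scores / test standard deviation."""
    return average_change / standard_deviation

# (average change in scores, test standard deviation) for each class
classes = {"Class A": (12, 10), "Class B": (25, 30)}

for name, (change, sd) in classes.items():
    print(f"{name}: effect size = {effect_size(change, sd):.2f}")
# Class A: effect size = 1.20
# Class B: effect size = 0.83
```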
From these results it appears that there has been more progress in Class A than in Class B - but how do we know that this apparent difference is real, and not just due to random variations in the data?
So far we have introduced effect sizes and shown how they can be handy ways of comparing differences across different measuring instruments, but this now raises a number of questions, including:
- How do we estimate the standard deviation of the test, to divide the change score by?
- What other comparisons can we do using effect sizes?
- How do we estimate the uncertainty in our effect size calculations?
- How do we know that differences between effect sizes are real?
- How big should an effect size be to be "educationally meaningful"?
- What are the cautions and caveats in using effect sizes?
- How easy is it to calculate an effect size for New Zealand standardised tests?
Effect sizes can be used:
- To compare progress over time on the same test (most common use).
- To compare results measured on different tests.
- To compare different groups doing the same test (least common use).