# How much difference does it make? Notes on understanding, using, and calculating effect sizes for schools

## Publication Details

A good way of presenting differences between groups or changes over time in test scores or other measures is by ‘effect sizes’, which allow us to compare things happening in different classes, schools or subjects regardless of how they are measured. This booklet is designed to help school staff to understand and use effect sizes, and includes handy tips and warnings as well as useful tables to calculate effect size values from change scores on standardised tests.

Author(s): Ian Schagen, Research Division, Ministry of Education and Edith Hodgen, New Zealand Council for Educational Research.

Date Published: March 2009

## Section 5: How do we know effect sizes are real?

This is equivalent to asking if the results are "statistically significant": could we have got an effect size this big by random chance, even if there was really no difference between the groups or no real change over time? Usually we take a probability of 5 percent or less as marking the point where we decide that a difference is real.

This is actually quite easy to do using the 95 percent confidence intervals calculated as in the above example. If the interval is all positive (or all negative), it does not include zero. An effect size this far from zero would then arise by chance less than 5 percent of the time if there were really no effect, so we can conclude (with a fairly small chance of being wrong) that the effect size is really non-zero. A good way of displaying all this is graphically, especially if we are comparing effect sizes and their confidence intervals for different groups or different influencing factors. A "Star Wars" plot like the one below illustrates this.
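The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard error uses a common large-sample approximation for a standardised mean difference between two groups, which may not be the exact calculation used elsewhere in this booklet, and the function names `effect_size_ci` and `is_significant` and the example numbers are made up for this sketch.

```python
import math


def effect_size_ci(d, n1, n2, z=1.96):
    """Approximate 95 percent confidence interval for an effect size d.

    Assumption: d is a standardised mean difference between two groups of
    sizes n1 and n2, and the standard error is the usual large-sample
    approximation. z=1.96 gives a 95 percent interval.
    """
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


def is_significant(lower, upper):
    """True if the interval is all positive or all negative (excludes zero)."""
    return lower > 0 or upper < 0


# Hypothetical example: the same effect size of 0.30 with large and small groups.
for n in (100, 20):
    low, high = effect_size_ci(0.30, n, n)
    print(f"n per group = {n}: {low:.2f} to {high:.2f}, "
          f"significant = {is_significant(low, high)}")
```

With 100 students per group the interval runs from roughly 0.02 to 0.58 and excludes zero; with 20 per group it runs from roughly -0.32 to 0.92 and includes zero, so the same effect size would not be statistically significant.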

In this kind of plot, the diamond represents the estimated effect size for each factor relative to the outcome, and the length of the bar represents the 95 percent confidence interval. If the bar cuts the zero line, then we can say the factor is not statistically significant. In the above plot, this is true for factors C, D, E and F. Factors A, B and G have significant negative relationships, while Factor H has a significant positive one. Although the effect size for H is lower than for D, the latter is not significant and we should be cautious about ascribing any relationship here at all, whereas we can be fairly confident that H really does have a positive relationship with the outcome. From what we saw above, the estimates for Factors D and E, with wide confidence intervals, would be based on far fewer test scores than those for Factors A, B and G, with much narrower confidence intervals.
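The same reading of the plot can be done programmatically. The sketch below is illustrative only: the numbers are hypothetical values chosen to match the pattern described in the text, not values read from the actual plot, and the `classify` function is a name introduced here.

```python
def classify(lower, upper):
    """Classify a factor by whether its 95 percent interval cuts zero."""
    if lower > 0:
        return "significant positive"
    if upper < 0:
        return "significant negative"
    return "not significant"


# Hypothetical (estimate, lower, upper) values, NOT taken from the plot.
factors = {
    "A": (-0.30, -0.38, -0.22),
    "B": (-0.20, -0.27, -0.13),
    "C": (0.03, -0.05, 0.11),
    "D": (0.35, -0.15, 0.85),   # large estimate, but a wide interval
    "E": (-0.10, -0.60, 0.40),  # wide interval
    "F": (0.05, -0.06, 0.16),
    "G": (-0.15, -0.23, -0.07),
    "H": (0.25, 0.10, 0.40),    # smaller estimate than D, but excludes zero
}

results = {name: classify(low, high) for name, (_, low, high) in factors.items()}

for name, (estimate, low, high) in factors.items():
    print(f"Factor {name}: {estimate:+.2f} ({low:+.2f} to {high:+.2f}) "
          f"-> {results[name]}")
```

Note that the classification depends only on the ends of the interval, not on the size of the estimate, which is why a factor like D can have a larger effect size than H and still not be significant.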