Updated: Jul 16
When using research to examine continuous improvement efforts, many educators get stuck when it comes to interpreting the results of statistical tests. Don’t feel bad – most of us had only one statistics class in college, and it was probably focused solely on interpreting the results of standardized tests. While the lessons you learned in college are valuable, it can be challenging to apply those tests-and-measurements rules to other contexts. In this article, I want to explore the ideas of statistical significance and magnitude and offer some tips for using these measures to inform your continuous improvement decision making.
Statistical significance is the first measure often reported by researchers using a technique called null-hypothesis testing, or NHT. Broadly speaking, in an NHT the researcher has two or more groups of study participants who receive different experiences. At the end, the researcher runs statistical tests to determine whether the groups had statistically different outcomes. Common NHTs include the t-test and the ANOVA. When participants are assigned to groups at random, these tests can help researchers understand the impact of specific activities.
Statistical significance is reported using the p-value. Roughly speaking, the p-value is the probability of seeing a difference at least as large as the one observed if there were actually no real effect – in other words, if chance alone were at work. The p-value is an oddly controversial value, and possibly one that many scientists don’t truly understand. If you’re experiencing statistical imposter syndrome, take comfort in the fact that the nuances of the p-value continue to baffle people who use it every day.
When it comes to interpreting the p-value, there is one commonly accepted rule: the lower the p-value, the stronger the statistical significance. It is commonly accepted that the results of a statistical test are significant if the p-value is below 0.05 and strongly significant if it is below 0.01 or 0.001. You will almost always see the p-value reported alongside statistical outcomes in educational research papers. Memorize those three thresholds and you will be well on your way to understanding the results of statistical tests.
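If it helps to see those cutoffs written out, here is a small Python sketch (the function name and labels are my own, not a standard) that maps a p-value to the conventional descriptions:

```python
def significance_label(p_value: float) -> str:
    """Describe a p-value using the conventional thresholds:
    below 0.001 or 0.01 is often called strongly significant,
    below 0.05 is called significant."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-values must fall between 0 and 1")
    if p_value < 0.001:
        return "strongly significant (p < 0.001)"
    if p_value < 0.01:
        return "strongly significant (p < 0.01)"
    if p_value < 0.05:
        return "significant (p < 0.05)"
    return "not significant (p >= 0.05)"

print(significance_label(0.03))  # significant (p < 0.05)
print(significance_label(0.20))  # not significant (p >= 0.05)
```

Notice that the logic checks the strictest threshold first; a p-value of 0.0005 clears all three cutoffs, and we report the strongest one.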
In addition to statistical significance, many researchers will also report on the magnitude of the difference between their groups. Magnitude is sometimes called “practical significance” because it quantifies the difference between two groups in real numbers. The most common measure of magnitude is effect size, usually using a formula called Cohen’s d.
Effect size is calculated by dividing the difference between the means by the pooled standard deviation. If that formula sounds complicated, never fear. You can upload your data into the tools found in The Repository to instantly calculate the effect size, as well as other statistical tests, for your distributions of scores.
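For readers who like to see the arithmetic, here is a minimal Python sketch of that calculation using only the standard library (the function name and the example score lists are mine, invented purely for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Cohen's d: the difference between the two group means
    divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample standard deviations
    # The pooled standard deviation weights each group's variance
    # by its degrees of freedom (n - 1).
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled_sd

# Hypothetical post-test scores for two classrooms.
treatment = [78, 82, 85, 90, 88, 79, 84]
control = [72, 75, 80, 74, 78, 77, 73]
print(round(cohens_d(treatment, control), 2))
```

The sign of the result simply tells you which group scored higher; the size of the number is what carries the practical meaning.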
To put effect size into layman’s terms, the formula provides a standardized way to say that the average of group one is this far away from the average of group two. Effect size can also be used to examine the difference between pre-test and post-test scores. In that context, it provides a standardized way to say the group experienced this much change between test administrations.
Unlike statistical significance, the higher the effect size, the stronger the effect. It is commonly accepted that an effect size is small around d=0.20, medium around d=0.50, and large around d=0.80. Of course, you will rarely have an effect size that is exactly d=0.80, so these numbers provide a rough guide to help you think about effect size results on a spectrum.
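Those benchmarks can be written as a simple lookup. This Python sketch uses my own labels, and remember that the cutoffs are rough guideposts, not hard lines:

```python
def effect_size_label(d: float) -> str:
    """Describe an effect size using Cohen's rough benchmarks:
    small around 0.20, medium around 0.50, large around 0.80.
    The absolute value is used so the direction of the
    difference does not matter."""
    d = abs(d)
    if d < 0.20:
        return "negligible"
    if d < 0.50:
        return "small"
    if d < 0.80:
        return "medium"
    return "large"

print(effect_size_label(0.35))   # small
print(effect_size_label(-0.90))  # large
```

Taking the absolute value first means a d of -0.90 (where the second group outscored the first) is still reported as a large effect.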
What does all this really mean?
When using research to make decisions about new interventions, programs, or strategies for your school, properly interpreting the results of statistical tests is an important skill. Generally speaking, education leaders will want to focus their attention on research in which the results were statistically significant. In fact, the definition of an evidence-based practice under the Every Student Succeeds Act (ESSA) specifies that a practice is not evidence-based unless it is supported by statistically significant outcomes. If you are using research for ESSA compliance, check the p-value and set aside studies with a p-value greater than 0.05.