#BeyondTheMean

An Introduction to Exploratory Data Analysis

Updated: Jan 23



Welcome to #BeyondTheMean! Check out this post to see what this blog is all about.


In education, we have access to mountains of data. Think about all the data that we collect on a daily basis: attendance data, behavior data, achievement data, stability data, demographic data… the list could go on and on. The challenge for many of us is simple: what the heck do we do with it all? To make use of the data, we must adopt a procedure that allows us to take the data for what it is and draw out meaningful insights. Enter exploratory data analysis (EDA).


According to the renowned statistician John Tukey, EDA is a set of “procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data” (Tukey, 1961).


Essentially, data scientists use EDA to examine their data, summarize it, and identify key insights that can be later used to inform decision making, design research projects, or monitor system improvements.



My favorite thing about EDA, and why I teach it to educators, is that it is focused on simple mathematical principles rather than complex statistical theory. If you can identify the lowest and highest numbers in a column, count the number of variables in a data set, or calculate the average of a distribution of scores, you can implement a meaningful EDA process.
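To show how little math is involved, here is a minimal Python sketch of those three calculations. The scores and the two sample rows are made up for illustration, and the variable names are my own assumptions rather than part of any particular tool.

```python
# Hypothetical test scores for one class (made-up numbers for illustration).
scores = [72, 85, 91, 64, 78, 88, 95, 70]

lowest = min(scores)                 # lowest number in the column
highest = max(scores)                # highest number in the column
average = sum(scores) / len(scores)  # arithmetic mean of the scores

# A hypothetical data set stored as a list of rows; each key is one variable.
rows = [
    {"student": "A", "grade": 9, "score": 72},
    {"student": "B", "grade": 9, "score": 85},
]
n_variables = len(rows[0])           # number of variables (columns)

print(lowest, highest, average, n_variables)
```

The same three calculations are available in Excel as MIN, MAX, and AVERAGE, so nothing here requires programming; the code is just a compact way to see the steps side by side.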


A good EDA process has a few common traits. First, it is repetitive. Those who deploy EDA processes regularly have a step-by-step framework that they follow to perform their analysis and gain new insights. Every time they open a new data set, they follow the same set of procedures. This makes EDA easy to teach and easy to adopt because you don’t have to start from scratch every time you launch Excel. Most EDA processes begin by identifying and summarizing categorical variables. Then, the user calculates descriptive statistics for continuous variables and creates visualizations to help them better understand what their data says. Finally, simple inferential statistics are deployed to examine the strength of relationships between variables.
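The four steps above can be sketched as a small, reusable routine. This is one possible outline, not a definitive implementation: the student records are invented, the function names are hypothetical, the "visualization" is a rough text histogram, and a Pearson correlation coefficient stands in for the final step of gauging the strength of a relationship.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical student records (made-up values for illustration only).
students = [
    {"grade": "9", "days_absent": 2, "score": 88},
    {"grade": "9", "days_absent": 10, "score": 61},
    {"grade": "10", "days_absent": 0, "score": 93},
    {"grade": "10", "days_absent": 5, "score": 77},
    {"grade": "11", "days_absent": 7, "score": 70},
    {"grade": "11", "days_absent": 1, "score": 90},
]

def summarize_categorical(rows, column):
    """Step 1: count how many rows fall into each category."""
    return Counter(row[column] for row in rows)

def describe(values):
    """Step 2: basic descriptive statistics for a continuous variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation
    }

def text_histogram(values, bin_width):
    """Step 3: a rough text-only picture of the distribution."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}"
        for start in sorted(bins)
    ]

def pearson_r(xs, ys):
    """Step 4: strength of the linear relationship between two variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

scores = [s["score"] for s in students]
absences = [s["days_absent"] for s in students]

print(summarize_categorical(students, "grade"))
print(describe(scores))
print("\n".join(text_histogram(scores, 10)))
print(pearson_r(absences, scores))
```

Because each step is its own function, the same sequence can be rerun on every new data set, which is the point of having a repeatable framework.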


Second, EDA processes are cyclical and iterative, which means the analysis gets better with each pass. When I am performing EDA on my data, my inner monologue usually goes “hmm, wonder what that’s about?” Having posed the question, I restructure my data and cycle back through the process again to gain a better understanding. Inevitably, I will ask the question again and repeat the cycle. With each iteration, I learn something new about my data set that can help me uncover actionable information about my students. I like to compare EDA to wringing out a sponge. Each time you wring the sponge, you get new information, and you just keep wringing until you have pulled all the information out.


Many of us were trained to use data in a static way. We were trained to collect it, line it up into neat columns, and use statistics to answer a single question. This is called confirmatory data analysis, and it’s incredibly useful. As an educator, you can use these techniques to measure the impact of an intervention, understand the breadth of inequity, or examine changes to survey results. But confirmatory analysis requires you to ask a question. What happens if you ask the wrong question?


Since EDA begins with procedures instead of questions, it is an incredibly useful tool for school improvement specialists seeking to better understand their data. By engaging in an open-ended conversation with your data, you can begin to see anomalies that you never would have thought to ask about in the first place. Anomaly detection is a key component of successful EDA. When you do a bunch of calculations and line them up, it is easy to see which data points are off. You can then home in on these anomalies to better understand your students in context.
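One common rule of thumb for spotting values that are "off" is Tukey's fences: flag anything more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below assumes that rule, made-up absence counts, and a hypothetical function name; it is one way to do this, not the only one.

```python
from statistics import quantiles

# Hypothetical days absent per student (made-up values for illustration).
days_absent = {
    "A": 1, "B": 2, "C": 2, "D": 3, "E": 3,
    "F": 4, "G": 4, "H": 5, "I": 21,
}

def flag_outliers(values_by_id, k=1.5):
    """Return the entries that fall outside Tukey's fences."""
    q1, _, q3 = quantiles(values_by_id.values(), n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        key: value
        for key, value in values_by_id.items()
        if value < lower or value > upper
    }

flagged = flag_outliers(days_absent)
print(flagged)
```

A flagged value is a prompt for a closer look, not a verdict. The student with 21 absences might reflect a real attendance problem or simply a data entry error.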


In this way, EDA is also a hypothesis-building process. It helps you think abstractly about your data. Because you start with an open mind and a willingness to explore and engage with the data, you are more likely to uncover rich hypotheses that can be tested more rigorously later on. These EDA hypotheses can inform your improvement strategy, your 30-60-90 day plans, or your action research projects as you seek to improve teaching and learning conditions in your school.



Another way that data scientists use EDA is to monitor the quality of their data collection procedures. If your data collection is wonky, your data analysis will be too. Sometimes, the anomalies I described above reflect internal collection processes more than they reflect actual student outcomes. By performing EDA on your local data, you can identify gaps in your processes so that you can have more accurate data to inform your decision making.
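A few simple checks can surface collection problems before they distort an analysis. The sketch below assumes a hypothetical attendance export and looks for three things: values that were never entered, values that cannot be true, and student IDs that appear more than once. The field names and the `quality_report` function are illustrative assumptions.

```python
# Hypothetical attendance export; None marks a value that was never entered.
records = [
    {"student_id": 101, "days_enrolled": 180, "days_present": 171},
    {"student_id": 102, "days_enrolled": 180, "days_present": None},
    {"student_id": 103, "days_enrolled": 180, "days_present": 195},
    {"student_id": 101, "days_enrolled": 180, "days_present": 171},
]

def quality_report(rows):
    """List the student IDs affected by each kind of collection problem."""
    seen_ids = set()
    report = {"missing": [], "impossible": [], "duplicate_ids": []}
    for row in rows:
        sid = row["student_id"]
        present = row["days_present"]
        if sid in seen_ids:
            report["duplicate_ids"].append(sid)  # same ID entered twice
        seen_ids.add(sid)
        if present is None:
            report["missing"].append(sid)        # value never recorded
        elif not 0 <= present <= row["days_enrolled"]:
            report["impossible"].append(sid)     # present more days than enrolled
    return report

report = quality_report(records)
print(report)
```

Each list in the report points back to a step in the collection process worth reviewing, such as who enters attendance and how duplicate enrollments are handled.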


All in all, EDA is about a spirit of open-mindedness and a willingness to engage deeply with your data in an iterative way. If you are looking to get started with EDA, check out this paper I published in the International Journal of Education Policy and Leadership. Also, consider reviewing the tools and resources in The Repository. My free data analysis tools will instantly summarize your distributions and help you get started with EDA. Good luck on your journey, friends, and let me know how I can help.