Updated: Nov 10
Hi! Thanks for tuning into this video on the principles of tidy data. My name is Matthew Courtney. I am an education researcher and consultant specializing in data use for continuous improvement. In this video, I am going to help you get set up for success by following the principles of tidy data.
All of the information in this video comes from this amazing paper by Hadley Wickham. It was published in the Journal of Statistical Software in 2014. While this video will hit the highlights and put a classroom-oriented spin on the topic, there are many important nuances in this paper. I encourage you to download it and give it a read.
So let’s dive in. What is Tidy Data anyway? Tidy data is a term used to describe data that is neatly collected and ready for analysis. It is also a set of principles, which we will explore in a moment. Tidy data makes your data analysis efficient – allowing you to easily produce charts and graphs and calculate statistics.
It is important that you gain an understanding of these principles because most of the time, data is not going to come out of your database ready to go. You have to clean it. If you have control over your data collection system, such as a gradebook you keep in a spreadsheet, you can save yourself a lot of headache by setting up your collection process to be tidy from the start.
Let’s take a look at the five principles of tidy data. I should note, there are more than five things under this topic. I have simply chosen to isolate five concepts here that I think are most beneficial to educators.
Principle number one: Each row is an observation.
When you are looking through your data, you want to make sure that each row contains an observation. An observation is the individual unit that you are studying in your analysis. In education, observations are typically students, schools, districts, or states.
Let’s take a look at some student data. This is a tidy data set. On this table, our observation is the individual student. You can see that each student has their own row on our table and all of the information for that student is in the same row. If you have multiple rows per student, your data analysis will likely be inaccurate.
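If you keep your data in code rather than a spreadsheet, a quick check can confirm that each student appears on only one row. The following is a minimal sketch in Python, not a definitive method. The student names, column names, and the helper `repeated_observations` are all made up for illustration.

```python
from collections import Counter

# Hypothetical gradebook rows; names and values are invented for this example.
rows = [
    {"student": "Sarah", "grade_level": 7, "test_score": 287},
    {"student": "James", "grade_level": 7, "test_score": 301},
    {"student": "Sarah", "grade_level": 7, "test_score": 290},  # second row for Sarah
]

def repeated_observations(rows, key):
    """Return the values of `key` that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

duplicates = repeated_observations(rows, "student")
print(duplicates)  # ['Sarah']
```

An empty list means every student has exactly one row. Any name that comes back is a student whose information is split across rows and needs to be combined before analysis.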
Principle number two: Each column is a variable.
Variables are the various pieces of information that we have about our students, schools, and districts. In a tidy data set, each variable has its own column. Variables can be either dependent or independent and most school level data sets will include a variety of dependent and independent variables for each student.
Independent variables are those that do not depend on the outcomes you are studying; they are the characteristics you might use to explain or compare those outcomes. In education, they tend to be things like grade level, gender, or race/ethnicity of our students. Dependent variables are generally your outcome variables. They are things like test scores, the number of behavior events exhibited by a student, or their attendance rate.
Let’s take a look at a tidy data set. Here we see only one observation, Sarah, and she has four variables. She has two independent variables: she is in seventh grade and is female. We also have two dependent variables: her attendance rate is 92.5% and her test score is 287. All of Sarah’s information is stored in her row, with each new piece of information sitting in its own column.
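To see what Sarah's row looks like outside of a spreadsheet, here is a small Python sketch that writes her record as a table with one column per variable. Sarah's values come from the example above. The column names, such as `attendance_rate`, are my own assumptions and not part of any particular gradebook system.

```python
import csv
import io

# One column per variable; these column names are assumed for illustration.
fieldnames = ["student", "grade_level", "gender", "attendance_rate", "test_score"]

# One row per observation: everything we know about Sarah sits in her row.
sarah = {
    "student": "Sarah",
    "grade_level": 7,         # independent variable
    "gender": "Female",       # independent variable
    "attendance_rate": 92.5,  # dependent variable, in percent
    "test_score": 287,        # dependent variable
}

# Write the header and the row to an in-memory CSV table.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerow(sarah)
table = buffer.getvalue()
print(table)
```

Adding a new student means adding a new row. Adding a new piece of information about every student means adding a new column.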
Principle number three: Each cell is a value.
If you have mastered the first two principles, this one should be taken care of. Each cell in your tidy data set should contain only one value. It is also important that each column contains values that are all formatted the same way.
On this tidy data set, you can see that we have a table with four variables and three observations. Each cell contains one piece of information and our values all match. All of our Grade Level values are single digits. All of our gender values are whole words. All of our attendance rates are listed in percentages. All of our test scores are listed in whole numbers. This data set is tidy and almost ready for analysis.
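The formatting rules described here can also be checked automatically. Below is a rough Python sketch that flags any cell that does not match the expected format for its column. The rules in `formats`, the sample rows, and the helper `untidy_cells` are assumptions written for this example, so adjust them to match your own data.

```python
import re

# Assumed format rules, one per column, based on the table described above.
formats = {
    "grade_level": re.compile(r"\d"),                # a single digit
    "gender": re.compile(r"[A-Za-z]+"),              # one whole word
    "attendance_rate": re.compile(r"\d+(\.\d+)?%"),  # a percentage such as 92.5%
    "test_score": re.compile(r"\d+"),                # a whole number
}

# Hypothetical rows; the second and third contain deliberate problems.
rows = [
    {"grade_level": "7", "gender": "Female", "attendance_rate": "92.5%", "test_score": "287"},
    {"grade_level": "7th", "gender": "Male", "attendance_rate": "0.95", "test_score": "301"},
    {"grade_level": "8", "gender": "Female", "attendance_rate": "88%", "test_score": "275, 290"},
]

def untidy_cells(rows, formats):
    """Return (row index, column, value) for every cell that breaks its column's format."""
    problems = []
    for index, row in enumerate(rows):
        for column, pattern in formats.items():
            if not pattern.fullmatch(row[column]):
                problems.append((index, column, row[column]))
    return problems

problems = untidy_cells(rows, formats)
for problem in problems:
    print(problem)
# (1, 'grade_level', '7th')
# (1, 'attendance_rate', '0.95')
# (2, 'test_score', '275, 290')
```

The last flagged cell holds two test scores at once, which is exactly the situation principle three is meant to prevent.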
Principle number four: Each observation has a unique ID.
This is an often-skipped step in schools, but one that is very important. Each student, school, or district in your data set should be linked to a unique identification number. There is no shortage of unique identifiers. Generally they will be student numbers, National Center for Education Statistics numbers, or other generated state student ID numbers. These ID numbers allow you to quickly link data from multiple tables together for analysis. You can also use these numbers to create dashboards to quickly report on students or schools that you may be tracking more closely.
This is a simple tidy data set. You can see that we have three observations, our students, and each one has a unique identifier.
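Here is a short Python sketch of why unique IDs matter: they let you link two tables together. The ID numbers, names, and scores are invented, and `link_by_id` is a hypothetical helper written for this example rather than a function from any library.

```python
# Hypothetical roster and score tables that share a student_id column.
students = [
    {"student_id": "1001", "name": "Sarah", "grade_level": 7},
    {"student_id": "1002", "name": "James", "grade_level": 7},
    {"student_id": "1003", "name": "Maria", "grade_level": 8},
]
scores = [
    {"student_id": "1002", "test_score": 301},
    {"student_id": "1001", "test_score": 287},
]

def link_by_id(left, right, key):
    """Combine rows from two tables that share the same value of `key`."""
    ids = [row[key] for row in right]
    if len(ids) != len(set(ids)):
        # A repeated ID makes the link ambiguous, so stop instead of guessing.
        raise ValueError(f"{key} values are not unique")
    lookup = {row[key]: row for row in right}
    # Keep only the rows from `left` that have a match in `right`.
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

linked = link_by_id(students, scores, "student_id")
for row in linked:
    print(row)
# {'student_id': '1001', 'name': 'Sarah', 'grade_level': 7, 'test_score': 287}
# {'student_id': '1002', 'name': 'James', 'grade_level': 7, 'test_score': 301}
```

In this sketch, a student with no matching score (Maria) is left out of the linked table. Whether to drop or keep unmatched students is a choice you would make based on your own analysis.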
Principle number five: Each column has a unique name.
This is another one that is pretty straightforward. Make sure that you give each column in your table a specific and unique name. Take a look at this tidy data set. The second column is labeled “Grade Level”. This is a more specific heading than if it were simply called “Grade” because the word grade can refer to your year in school, the score you received on an individual assignment, or the summative score you earned in a class. This table has three columns that could all be called “Grade”. This level of specificity will help speed up your analysis process and also help you remember what you did when you look back at this data in a week, a month, or a year.
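A repeated column name is easy to catch with a few lines of Python. This is only a sketch. The header lists below are made up, and the names "Assignment Grade" and "Course Grade" are my assumption about how the three "Grade" columns might be relabeled.

```python
from collections import Counter

def duplicate_headers(header):
    """Return column names that appear more than once, ignoring case and extra spaces."""
    normalized = [name.strip().lower() for name in header]
    counts = Counter(normalized)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical headers: three different things all labeled "Grade".
untidy_header = ["Student ID", "Grade", "Grade", "Grade"]
# The same columns with specific, unique names (assumed labels).
tidy_header = ["Student ID", "Grade Level", "Assignment Grade", "Course Grade"]

print(duplicate_headers(untidy_header))  # ['grade']
print(duplicate_headers(tidy_header))    # []
```

Note that this check only tells you a name is repeated. Deciding what each column should be called instead still takes your knowledge of the data.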
Now that you have an understanding of the principles of tidy data, take some time to complete this simple practice activity.
To get started with Tidy Data, audit your current data collection system:
Make a list of all the places where you store your data.
Pull some data reports to examine.
Identify which principles of Tidy Data are already in place in your data.
Identify which elements you may need to clean up before analyzing your data.
Where possible, update your data collection system to eliminate cleaning in the future.
I hope this video has given you a better understanding of how to prepare your data for analysis. For more information about how you can use data to enhance student learning, subscribe to my channel or visit www.matthewbcourtney.com.