Updated: Jul 18
R is an incredibly powerful and useful tool for analyzing large volumes of student data. Hundreds of add-on packages help you complete complex analysis quickly and easily. However, when you first start using R to explore student data, the best place to start is with the built-in functions, called Base R. In this article I will show you some of the ways that I use Base R to examine student data.
Before you can do anything with your data, you have to get it into R. In Base R, you can easily import text files (.txt) or files with comma-separated values (.csv). So the first thing you need to do is take your data, which is probably in an Excel spreadsheet, and convert it to a .txt or .csv file by doing a “Save As”. You will then use the following code to load your data.
Code for a .txt file:
df <- read.table('file.txt')
Code for a .csv file:
df <- read.csv('file.csv')
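One caveat for the .txt route: unlike read.csv, read.table does not generally assume that the first row holds your column names, and by default it splits on any whitespace. The sketch below assumes a tab-delimited export with a header row (what Excel's "Text (Tab delimited)" option produces). The file contents are a made-up stand-in written to a temporary file so the example runs anywhere.

```r
# Assumption: a small tab-delimited file with a header row, created here
# only so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Student.Number\tGender\tFall.Test.Score",
             "920248\tMale\t78",
             "913377\tMale\t81"), tmp)

# header = TRUE uses the first row as column names.
# sep = "\t" splits on tabs only, so values containing spaces stay intact.
df <- read.table(tmp, header = TRUE, sep = "\t")
names(df)
```

With your own file, replace tmp with the path to your .txt file.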
Now, I have previously discussed the benefits of using RStudio. If you have taken my advice, you want to drop this code into your script box (the top left corner). When you run it, you will see the command appear in the console (the bottom left corner). The data you loaded will also show up in your environment window (the top right corner).
You should also consider giving your data a name. Here is an example of what my code looks like using a random file that is sitting on my desktop.
data1 <- read.csv('C:/Users/username/Desktop/Book1.csv')
As you can see here, I have named my data “data1”. This is helpful in case I need to load a second set of data later. I will use this data1 name for all of the other examples in this article. I use the code “read.csv” because my file is a .csv file. The name of the file goes inside the parentheses, wrapped in single quotes. Click “Run” to load the data.
R can be difficult at first because you aren’t looking at the data the way you would be in Excel. Sometimes it is helpful to examine the data set. To do that we are going to use the built-in functions head and str (short for structure). The head function allows you to view the first rows of your spreadsheet. The str function tells you a little information about what is contained within your spreadsheet. Here is the head of my data1 spreadsheet.
> head(data1)
  Student.Number Gender Fall.Test.Score Winter.Test.Score Spring.Test.Score
1         920248   Male              78                88                94
2         913377   Male              81               101               109
3         974300 Female              75                85                92
4         911323 Female              90                97               105
5         974450 Female              87               102               109
6         983327 Female              86                92                99
The code for the head function is head(data1). The output starts with the column names, in this case Student Number, Gender, Fall Test Score, Winter Test Score, and Spring Test Score. R then shows you the first six rows of the table. If your console is too narrow to fit every column, the last columns wrap down below the others, and this wrapping will happen for any wide table. Let’s take a look at its structure.
> str(data1)
'data.frame': 100 obs. of  5 variables:
 $ Student.Number   : int  920248 913377 974300 911323 974450 983327 935025 977292 968419 947486 ...
 $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 1 2 2 ...
 $ Fall.Test.Score  : int  78 81 75 90 87 86 79 88 88 90 ...
 $ Winter.Test.Score: int  88 101 85 97 102 92 99 102 108 100 ...
 $ Spring.Test.Score: int  94 109 92 105 109 99 106 109 114 106 ...
The code for the structure function is str(data1). Now we see some different things. First, it tells us that data1 is a data.frame. That is R’s way of saying it’s a table. It tells us that we have 100 observations of 5 variables. Each line after that tells us the name of a column and the type of data stored within that column (in this case mostly integers, plus one factor with two levels, which is how R stores a categorical variable). We also see the first several entries for each column.
These two functions are super helpful when you are working with student data because occasionally in R we can forget what our data looks like or what columns are named. This allows you to quickly check on those things. I normally only use these functions in the console box of my RStudio display. I don’t need to see them over and over every time I run the data, so I can run them one time, when I need them, and then make them go away.
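One thing to watch for: the str output above lists Gender as a Factor, which is what read.csv produced by default in versions of R before 4.0. In R 4.0 and later, text columns are read as character (chr) unless you pass stringsAsFactors = TRUE, so your own output may differ. The sketch below uses a stand-in for data1 built from a few of the rows shown above (an assumption made only so the example runs on its own) and shows two small variations on head along with the conversion to a factor.

```r
# Stand-in for data1: six rows copied from the head() output above.
data1 <- data.frame(
  Student.Number  = c(920248, 913377, 974300, 911323, 974450, 983327),
  Gender          = c("Male", "Male", "Female", "Female", "Female", "Female"),
  Fall.Test.Score = c(78, 81, 75, 90, 87, 86),
  stringsAsFactors = FALSE  # mimic the R 4.0+ default explicitly
)

names(data1)        # just the column names
head(data1, n = 3)  # the first three rows instead of the default six

# Convert the text column to a factor so R treats it as categories
data1$Gender <- factor(data1$Gender)
str(data1)
```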
Summarizing the Data
Another helpful feature of Base R is the summary function. Summary allows you to quickly see the specific details of your dataset. Let’s take a look at the summary of data1.
> summary(data1)
 Student.Number      Gender   Fall.Test.Score Winter.Test.Score Spring.Test.Score
 Min.   :911262   Female:46   Min.   :75.00   Min.   : 81.00    Min.   : 87.0
 1st Qu.:934834   Male  :54   1st Qu.:78.00   1st Qu.: 90.00    1st Qu.: 96.0
 Median :953559               Median :82.00   Median : 95.00    Median :101.0
 Mean   :955155               Mean   :82.24   Mean   : 94.71    Mean   :101.2
 3rd Qu.:977808               3rd Qu.:87.00   3rd Qu.:100.00    3rd Qu.:106.0
 Max.   :999574               Max.   :90.00   Max.   :109.00    Max.   :116.0
The code for returning the summary is summary(data1). When you type that and press Enter or click Run, you get the output listed here. You can see that I have all of my columns again and each column is summarized in a way that makes sense. Since most of my columns contain numbers, you can see that it gives me the minimum, the maximum, the quartiles, and the mean for each distribution. For my Gender column, which is a categorical (factor) variable, it has counted the entries to tell me that I have 46 female and 54 male students. As before, columns that do not fit the width of your console wrap down below the others.
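If you only care about one column, summary also works on a single column, and table gives you the counts for a categorical column even when R has read it as plain text. This is a small sketch using a stand-in for data1 built from the six rows shown earlier. The object names gender_counts and fall_summary are my own, for illustration only.

```r
# Stand-in for data1: six rows copied from the head() output above.
data1 <- data.frame(
  Gender          = c("Male", "Male", "Female", "Female", "Female", "Female"),
  Fall.Test.Score = c(78, 81, 75, 90, 87, 86)
)

# Five-number summary plus the mean for one column
fall_summary <- summary(data1$Fall.Test.Score)
fall_summary

# Counts per category; works for character and factor columns alike
gender_counts <- table(data1$Gender)
gender_counts
```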
Basic Math Functions
Base R contains many basic math functions that can be helpful for analyzing student data. I am going to use these math functions to analyze the data for my Fall Test Score column. To do that, I will indicate that I want to look at that column by putting a $ between the name of my data and the name of the column. It will look like this:
data1$Fall.Test.Score
In the snippet below, I will show you the code and output for the following functions: maximum score, minimum score, sum, mean, median, variance, and standard deviation of my Fall Test Score column.
> max(data1$Fall.Test.Score)
[1] 90
> min(data1$Fall.Test.Score)
[1] 75
> sum(data1$Fall.Test.Score)
[1] 8224
> mean(data1$Fall.Test.Score)
[1] 82.24
> median(data1$Fall.Test.Score)
[1] 82
> var(data1$Fall.Test.Score)
[1] 23.19434
> sd(data1$Fall.Test.Score)
[1] 4.816051
Since the purpose of this article is to show you the Base R code and its outputs, I am not going to talk about the meaning of these individual scores here, but you should easily be able to copy and replicate them to suit your needs.
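Two practical extensions, sketched here under the assumption that your data looks like the six rows shown earlier (the stand-in below is not the full 100-student file). First, sapply runs the same function over several columns at once. Second, real student data often has missing scores, and these functions return NA unless you add na.rm = TRUE. The names score_cols, means, and with_missing are hypothetical.

```r
# Stand-in for data1: six rows copied from the head() output above.
data1 <- data.frame(
  Fall.Test.Score   = c(78, 81, 75, 90, 87, 86),
  Winter.Test.Score = c(88, 101, 85, 97, 102, 92),
  Spring.Test.Score = c(94, 109, 92, 105, 109, 99)
)

# Apply one function to every score column in a single call
score_cols <- c("Fall.Test.Score", "Winter.Test.Score", "Spring.Test.Score")
means <- sapply(data1[score_cols], mean)
means
sapply(data1[score_cols], sd)

# A missing score turns the result into NA unless you drop it explicitly
with_missing <- c(78, NA, 81)
mean(with_missing)                # NA
mean(with_missing, na.rm = TRUE)  # mean of the two scores that exist
```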
Base R also includes functions for performing statistical analysis. Most useful to education data analysis is the t-test. A t-test allows you to compare the means of two groups. In education, this is a useful procedure for checking whether test scores change over time. Let’s run a t-test to compare fall and spring test scores.
> t.test(data1$Fall.Test.Score, data1$Spring.Test.Score)
        Welch Two Sample t-test

data:  data1$Fall.Test.Score and data1$Spring.Test.Score
t = -22.315, df = 175.8, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -20.60418 -17.25582
sample estimates:
mean of x mean of y
    82.24    101.17
You can see here that I have used the $ to indicate which two columns I want to perform the t-test on. The output gives us all of the information that we need to interpret the results. We see our t-score, our degrees of freedom, and our p-value. It also tells us about our alternative hypothesis, the confidence interval, and the means of the two columns. Again, I’m not going to talk about what all that means here, but this simple code will show you the results of your t-test with way less work than Excel.
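A note on the test itself: the call above runs a Welch two-sample t-test, which treats the two columns as independent groups. When the same students took both tests, as in this file, a paired t-test is usually the more appropriate choice, and t.test supports it through the paired argument. The sketch below uses only the six rows shown earlier as a stand-in, so its numbers will not match the full 100-student output above. The name paired_result is hypothetical.

```r
# Stand-in for data1: six rows copied from the head() output above.
data1 <- data.frame(
  Fall.Test.Score   = c(78, 81, 75, 90, 87, 86),
  Spring.Test.Score = c(94, 109, 92, 105, 109, 99)
)

# paired = TRUE compares each student's fall score with that same
# student's spring score, row by row
paired_result <- t.test(data1$Fall.Test.Score, data1$Spring.Test.Score,
                        paired = TRUE)
paired_result

# Individual pieces of the result can be pulled out by name
paired_result$p.value
paired_result$estimate  # mean of the fall-minus-spring differences
```

A paired test requires both columns to be the same length and in the same student order.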
Making a Scatter Plot
Base R contains easy code for creating a quick scatter plot. While this isn’t my preferred method (I normally use ggplot2 for this), it is a quick and easy way to crank out a plot. Let’s make a scatter plot showing the fall and spring test scores we analyzed above.
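A minimal sketch of such a call uses the built-in plot function with the fall scores on the x-axis and the spring scores on the y-axis. The data frame is a stand-in built from the six rows shown earlier, and the axis labels and title are illustrative additions, not required arguments.

```r
# Stand-in for data1: six rows copied from the head() output above.
data1 <- data.frame(
  Fall.Test.Score   = c(78, 81, 75, 90, 87, 86),
  Spring.Test.Score = c(94, 109, 92, 105, 109, 99)
)

# First argument goes on the x-axis, second on the y-axis
plot(data1$Fall.Test.Score, data1$Spring.Test.Score,
     xlab = "Fall Test Score",
     ylab = "Spring Test Score",
     main = "Fall vs. Spring Test Scores")
```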
Unlike the other items we have examined in this section, the scatter plot does not appear in the console of RStudio. Instead, the plot will appear in the Plots window, in the bottom right corner. This plot is pretty basic, but it gets the job done. I’ll teach you how to create prettier ones in another post later on.
Making a Histogram
Histograms are my favorite way to report on a distribution of scores. Unlike a scatter plot, which shows the relationship between two variables, a histogram sorts scores into bins and shows how many scores fall into each one. Visually, I find them to be helpful when talking about my data to school and district leaders. You can only make a histogram for one variable at a time, so let’s look at our winter test scores.
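A minimal sketch of the call uses the built-in hist function on the winter column. The data frame is again a stand-in built from the six rows shown earlier, and the labels are illustrative. Saving the result in a variable (here the hypothetical name h) is optional, but it lets you inspect the bin counts that R chose.

```r
# Stand-in for data1: six rows copied from the head() output above.
data1 <- data.frame(
  Winter.Test.Score = c(88, 101, 85, 97, 102, 92)
)

# hist() draws the plot and also returns the bins it used
h <- hist(data1$Winter.Test.Score,
          xlab = "Winter Test Score",
          main = "Distribution of Winter Test Scores")

h$breaks  # the edges of each bin
h$counts  # how many scores fell into each bin
```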
Again, this is pretty no frills, but it gets the job done and tells us a lot about our data.
In this post I have shown you a handful of easy tools in Base R that can get you started quickly working with your own student data. So where do you go from here? My advice: get a set of student data that you know pretty well, drop it into R, and copy and paste the code from this page. You will need to rename some of the variables, but the model code here should be enough to get you started. I think the best way to learn R is to take model code and try it out on your own data. In time, this code will become second nature to you.