The Distribution Comparison Tool - Technical Directions
This tool is a Shiny app that enables users to compare two distributions by uploading a spreadsheet file and selecting two columns for comparison. It calculates summary statistics, performs a t-test to compare the means of the two columns, and creates a boxplot and histogram to visualize the distributions. The app is designed to work with numeric data and uses different t-tests depending on the properties of the input data.
About the Tool
The Distribution Comparison Tool is a Shiny web application developed by Matthew Courtney in 2023. It uses the R statistical programming language to read data from a spreadsheet and summarize the columns you select. The tool is hosted on the shinyapps.io server.
Preparing Your Data
-
Make sure your data is in a spreadsheet file format, such as .xls, .xlsx, or .csv.
-
Ensure that your spreadsheet file does not contain personally identifiable information or any other sensitive or confidential data.
-
Ensure that both columns you want to compare contain numerical data. If a column contains non-numerical data, such as text or dates, you will need to clean it and convert it to numerical data first.
-
Remove any empty cells or rows from your data to avoid errors during analysis.
-
Check for and remove any duplicates in your data to avoid skewing your analysis.
-
If your data contains missing values, decide how to handle them before analysis. One option is to remove any rows or cells with missing values. Another option is to fill in missing values with a reasonable estimate, such as the mean or median of the column.
-
If you have a large dataset, consider reducing the size of your data to improve the speed of analysis. You can do this by selecting a subset of the columns or rows in your dataset that are most relevant for your analysis.
-
Ensure that your data is accurate and reliable by validating the data before analysis. Check for errors and inconsistencies in your data, and correct any issues that you find.
-
Finally, when uploading your data into the tool, double-check that you have selected the correct file and the two columns you intend to compare, and that the data is clean and prepared for analysis.
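The preparation steps above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `raw` and its column names `group_a` and `group_b` are hypothetical stand-ins for your own spreadsheet, and dropping incomplete and duplicate rows is just one of the options described above, not a description of what the tool does internally.

```r
# Hypothetical example data standing in for an uploaded spreadsheet.
raw <- data.frame(
  group_a = c("12.5", "14.1", NA, "13.3", "13.3", "15.0"),
  group_b = c("11.0", "n/a", "12.2", "10.9", "10.9", "13.4"),
  stringsAsFactors = FALSE
)

# Convert every column to numeric; entries that are not numbers become NA.
clean <- as.data.frame(
  lapply(raw, function(col) suppressWarnings(as.numeric(col)))
)

# One option for missing values: drop rows with an NA in either column.
clean <- clean[complete.cases(clean), ]

# Drop exact duplicate rows (only appropriate if duplicates are entry errors).
clean <- unique(clean)

# Confirm the result is ready: all columns numeric, no missing values.
stopifnot(all(sapply(clean, is.numeric)), !anyNA(clean))
print(clean)
```

If you work this way, `write.csv(clean, "cleaned_data.csv", row.names = FALSE)` would save the cleaned data as a .csv file for upload; the file name here is only an example.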
Using the Tool
-
Click the "Choose a spreadsheet file" button and select your spreadsheet file. The file should have an extension of .xls, .xlsx, .xlsm, or .csv.
-
Wait for the app to read in your data and display the two dropdown menus for selecting the columns to compare.
-
Select the first distribution you want to compare by choosing a column from the first dropdown menu. The dropdown menu should display the names of the columns in your data.
-
Select the second distribution you want to compare by choosing a column from the second dropdown menu.
-
Wait for the app to calculate summary statistics for both distributions and perform a t-test to compare their means. The app chooses the test based on the properties of the input data: if both columns are numeric, it uses a two-sample t-test for independent samples; if both columns are binary, it uses a two-sample t-test for binary outcomes; if both columns have more than two unique values and are normally distributed with sample sizes of at least 30, it uses a two-sample t-test for normally distributed data. Otherwise, it uses Welch's t-test, which does not assume equal variances. The results are displayed in the app and show the t-value, degrees of freedom, and p-value.
-
Review the results of the t-test to see whether the means of the two distributions differ significantly. If the p-value is less than 0.05, the means are considered significantly different at the 5% level. If the p-value is greater than or equal to 0.05, the test does not show a significant difference.
-
View the boxplot, displayed in the app, to compare the distributions visually. It shows the median, interquartile range, and outliers for each distribution.
-
View the histogram, displayed in the app, to see the shape of each distribution. It shows the frequency of each value or range of values for each distribution.
-
Repeat the column selection and review steps above as needed to compare additional distributions. You can select different columns from the dropdown menus to compare other distributions in your data.
-
Once you have finished analyzing your data, save any results you want to keep, then close the app.
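For readers who want to reproduce a comparable analysis outside the app, the workflow above can be approximated in base R. This is a rough sketch under stated assumptions, not the app's actual code: the vectors `x` and `y` are simulated stand-ins for two spreadsheet columns, and Welch's t-test (the default in base R's `t.test`) is used for illustration regardless of how the app selects its test.

```r
set.seed(42)  # fixed seed so the simulated data is reproducible
x <- rnorm(40, mean = 50, sd = 5)  # hypothetical first column
y <- rnorm(40, mean = 53, sd = 8)  # hypothetical second column

# Summary statistics for each column.
print(summary(x))
print(summary(y))

# Welch's two-sample t-test (does not assume equal variances).
# Assumption: chosen for illustration; the app's selection logic may differ.
result <- t.test(x, y, var.equal = FALSE)

t_value <- unname(result$statistic)
df      <- unname(result$parameter)
p_value <- result$p.value
cat(sprintf("t = %.3f, df = %.2f, p = %.4f\n", t_value, df, p_value))

# Interpret the p-value at the conventional 5% significance level.
if (p_value < 0.05) {
  cat("The means differ significantly at the 5% level.\n")
} else {
  cat("No significant difference in means at the 5% level.\n")
}

# Visual comparison: side-by-side boxplot and one histogram per column.
boxplot(list(first = x, second = y), main = "Boxplot of both columns")
hist(x, main = "Histogram: first column", xlab = "Value")
hist(y, main = "Histogram: second column", xlab = "Value")
```

Setting `var.equal = TRUE` in `t.test` would instead run the classic pooled-variance two-sample t-test, which is only appropriate when the two columns have similar variances.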
References
-
R: R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
-
shiny: Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie and Jonathan McPherson (2018). shiny: Web Application Framework for R. R package version 1.4.0. https://CRAN.R-project.org/package=shiny
-
readxl: Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
-
dplyr: Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.7. https://CRAN.R-project.org/package=dplyr
-
knitr: Yihui Xie (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, URL http://yihui.name/knitr/.
-
kableExtra: Hao Zhu (2021). kableExtra: Construct Complex Table with knitr::kable() + pipe. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra
-
infer: Andrew Bray and Chester Ismay (2021). infer: Tidy Statistical Inference. R package version 1.1.3. https://CRAN.R-project.org/package=infer.
-
effsize: Marco Torchiano (2021). effsize: Efficient Effect Size Computation. R package version 0.4.0. https://CRAN.R-project.org/package=effsize.
-
shinyjs: Dean Attali (2021). shinyjs: Easily Improve the User Experience of Your Shiny Apps in Seconds. R package version 2.0.0. https://CRAN.R-project.org/package=shinyjs.