Statistics

Exercise 2: Generating descriptive statistics and plots

Introduction

In this exercise, we will generate descriptive statistics and some basic plots that show important information about our data. We will use the student data, which combines the data we collected last week with the data from my master's-level course last semester.

Setup

Header and packages

First, open a new script and enter the same information we worked on last week. You can copy and paste everything we did in the last session, then just make some adjustments so it looks like this:

###########################################################################
# Project: Descriptive statistics and plots
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clearing environment
rm(list=ls())

library(ggplot2)     # for creating plots

setwd("C:/Projects/applied-econometrics/R") # Enter the correct path for your directory

Note that we only need the ggplot2 library today. We will use more packages later in the semester, but there is no reason to load a bunch of packages that we will not use in the current session.

Loading the data

Now we will load the data. Statistical software does not work on the data file directly. Instead, it loads a copy of the data into memory and runs all analyses on that copy. This ensures that we do not accidentally change the original file in a way that cannot be undone.

In R, we will load the data as an object called a data frame. Once loaded, it appears in the Environment pane in the top right of RStudio. We can name it whatever we want, but we will simply call it d for data. The file is a csv file, so we use a command called read.csv(), and we tell R that the file can be found in the dat folder. The full command is:

d <- read.csv("dat/student_data_complete.csv")
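To see why working from memory is safe, here is a minimal sketch. It writes a tiny CSV to a temporary file (a stand-in for the real data file, using two rows of the student data), changes the copy in memory, and then reads the file again. The temporary file and the object names d_demo and d_again are assumptions made for illustration only; they are not part of the exercise script.

```r
# Write a two-row stand-in CSV to a temporary file (not the course data file)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(height = c(161, 169), shoe = c(37, 41)),
          tmp, row.names = FALSE)

d_demo <- read.csv(tmp)     # load a copy of the file into memory
d_demo$height[1] <- 999     # change the copy in memory only

d_again <- read.csv(tmp)    # read the same file a second time
d_again$height[1]           # still 161: the file on disk was not modified
```

The file only changes if we explicitly write to it again, for example with write.csv().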

Inspecting the data

You can now see the object in the environment. We can also run a couple of different lines of code to see different attributes of the data set. For example, the head() command will show the column names and the first few rows of data, similar to what you would see if the file were opened in Excel:

head(d)

##   class height shoe hand sleep sport pet
## 1     1    161   37 16.5     7     1   3
## 2     1    169   41 18.5     6     0   0
## 3     1    165   40 16.5     5     0   0
## 4     1    178   39 19.0     7     0   0
## 5     1    162   38 17.0     7     0   0
## 6     1    175   44 19.0     5     1   2

We can also use the str() command to look at the structure of the data. This provides more information about what each variable contains, as shown here:

str(d)

## 'data.frame':    26 obs. of  7 variables:
##  $ class : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ height: int  161 169 165 178 162 175 172 160 171 178 ...
##  $ shoe  : num  37 41 40 39 38 44 38 39 42 45 ...
##  $ hand  : num  16.5 18.5 16.5 19 17 19 17 18 19.2 19.8 ...
##  $ sleep : num  7 6 5 7 7 5 6 8 8 6.5 ...
##  $ sport : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ pet   : int  3 0 0 0 0 2 0 2 0 0 ...
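A few other base R commands are often handy at this stage. The sketch below rebuilds the six rows printed by head(d) as a small stand-in data frame called d_demo (an assumed name, used so that the example runs without the csv file) and checks its size, variable names, and storage types.

```r
# Stand-in data: the six rows printed by head(d) above
d_demo <- data.frame(
  class  = c(1, 1, 1, 1, 1, 1),
  height = c(161, 169, 165, 178, 162, 175),
  shoe   = c(37, 41, 40, 39, 38, 44),
  hand   = c(16.5, 18.5, 16.5, 19.0, 17.0, 19.0),
  sleep  = c(7, 6, 5, 7, 7, 5),
  sport  = c(1, 0, 0, 0, 0, 1),
  pet    = c(3, 0, 0, 0, 0, 2)
)

dim(d_demo)             # number of rows and number of columns
names(d_demo)           # variable names
sapply(d_demo, class)   # storage type of each variable
summary(d_demo$height)  # minimum, quartiles, mean, and maximum of one variable
```

With the real data, the same commands can be run on d instead of d_demo.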

In addition to providing a preview of the values found in the data set, str() shows the format of each variable. Variables can be stored in many different formats, with perhaps the most common being integers (whole numbers), numeric (general numbers with decimal points allowed), and factors for categorical (i.e. qualitative) variables. We see that some of the variables are numeric, and some are integers. However, sport and pet are stored as integers even though they are qualitative variables (the actual number has no meaning). We should convert these to factor variables, which can be done with the following commands:

d$sport <- as.factor(d$sport) 
d$pet <- as.factor(d$pet)

Now take a look at the structure again, and we will see that those variables changed from integers to factors:

str(d)

## 'data.frame':    26 obs. of  7 variables:
##  $ class : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ height: int  161 169 165 178 162 175 172 160 171 178 ...
##  $ shoe  : num  37 41 40 39 38 44 38 39 42 45 ...
##  $ hand  : num  16.5 18.5 16.5 19 17 19 17 18 19.2 19.8 ...
##  $ sleep : num  7 6 5 7 7 5 6 8 8 6.5 ...
##  $ sport : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
##  $ pet   : Factor w/ 4 levels "0","1","2","3": 4 1 1 1 1 3 1 3 1 1 ...
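Note that str() prints the internal level codes of a factor (1, 2, 3, ...), not the original values, which is why a pet value of 3 appears as 4 above. The following sketch illustrates this with the first six pet and sport values. The explicit levels = 0:3 argument and the labels "No sport" and "Sport" are assumptions added for illustration; the exercise itself uses as.factor() without labels.

```r
# First six values of pet and sport, as printed by head(d)
pet   <- c(3, 0, 0, 0, 0, 2)
sport <- c(1, 0, 0, 0, 0, 1)

# Fix the four levels explicitly so the codes match the str() output above
pet_f <- factor(pet, levels = 0:3)
as.integer(pet_f)                 # internal codes: 4 1 1 1 1 3
as.numeric(as.character(pet_f))   # original values: 3 0 0 0 0 2

# Optional: attach readable labels (hypothetical label text)
sport_f <- factor(sport, levels = c(0, 1), labels = c("No sport", "Sport"))
table(sport_f)                    # counts per category
```

Readable labels are a matter of taste, but they carry through to plot axes and tables.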

Histogram

Base `R` approach

The first plot we will generate is a histogram. This plot is nice because it works easily with data sets of any size, and it shows the distribution of a single variable as a set of bars.

We can start by generating a simple histogram using just base R. The command is hist(), and we have to specify which variable we are interested in. We will look at the distribution of students’ heights, and we have to specify that the variable is in our data object named d:

hist(d$height)

In the plot above, we see the raw counts of each particular range of values (e.g. one person has a height between 150cm and 155cm, three people have a height between 155cm and 160cm, and so on). This is fine for small data sets when specific numbers are good to know, but it is better practice to display the y-axis in terms of relative frequency. This has the important advantage of being useful for any sample size since the scale will always have a range of 0-1 (or 0-100 if expressed as a percentage).
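The relative frequencies can also be computed by hand, by dividing the count in each bin by the total number of observations. The sketch below does this for the first ten heights printed by str(d), which serve here as a stand-in for the full data set. The object names heights, h, and rel_freq are assumptions for illustration.

```r
# Stand-in data: the first ten heights printed by str(d)
heights <- c(161, 169, 165, 178, 162, 175, 172, 160, 171, 178)

# Let hist() choose the bins, but return the numbers instead of drawing a plot
h <- hist(heights, plot = FALSE)

# Relative frequency = count in the bin / total number of observations
rel_freq <- h$counts / sum(h$counts)

# One row per bin: lower edge, upper edge, raw count, relative frequency
data.frame(lower    = head(h$breaks, -1),
           upper    = h$breaks[-1],
           count    = h$counts,
           rel_freq = rel_freq)

sum(rel_freq)   # relative frequencies always add up to 1
```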

Histogram using `ggplot()`

To generate the histogram with relative frequencies, we will move to the ggplot2 library, as it provides a lot more flexibility and options in our plots. An important caveat to this approach is that geom_histogram() puts raw counts on the y-axis by default, so we add after_stat(count / sum(count)) to our y-axis specification, which is the formula for relative frequencies given in the textbook and in the lecture slides: \(RF = \frac{f}{n}\).

ggplot(d, aes(x = height, y = after_stat(count / sum(count)))) + # specify data set and variables
  geom_histogram(bins = 9) + # specify the number of bins for our histogram
  labs(y = "Relative frequency") # add labels. Other options here can be x ="", title ="", subtitle = ""

That looks fine, and we can now see what is basically the same plot as the base R version, but with relative frequencies instead of raw counts.

However, it looks boring! Let’s add some options to make it more visually appealing. We will change the colors of the bars, and add a title to the x-axis to make it easier to follow:

ggplot(d, aes(x = height, y = after_stat(count / sum(count)))) +
  geom_histogram(bins = 9, 
                 fill = "green", 
                 color = "purple") +
  labs(y = "Relative frequency", 
       x = "Height (cm)")
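If percentages are preferred over proportions on the y-axis, one option is to change only the axis labels. The sketch below assumes the scales package is available (it is installed together with ggplot2) and uses the first ten heights from str(d) as stand-in data. The names d_demo, p, and bar_heights are assumptions for illustration.

```r
library(ggplot2)

# Stand-in data: the first ten heights printed by str(d)
d_demo <- data.frame(height = c(161, 169, 165, 178, 162, 175, 172, 160, 171, 178))

p <- ggplot(d_demo, aes(x = height, y = after_stat(count / sum(count)))) +
  geom_histogram(bins = 9, fill = "green", color = "purple") +
  scale_y_continuous(labels = scales::percent) +   # show 0.2 as 20%
  labs(y = "Relative frequency", x = "Height (cm)")

print(p)

# The bar heights are still proportions, so they add up to 1
bar_heights <- layer_data(p, 1)$y
sum(bar_heights)
```

Only the labels change here; the underlying values are the same relative frequencies as before.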

Perhaps the green is a little obnoxious, but we can see everything much more clearly. Now let’s focus on the actual data. We can see that there is a skew in the data, meaning that there might be a difference in our measures of center. This will have important implications if we are trying to summarize student heights.

Mean and median

We can look at the difference in the central tendency values using simple commands. Here is how we can see the mean:

mean(d$height)

## [1] 169

And now let’s look at the median:

median(d$height)

## [1] 168

While the difference is not very large, we can see that a couple of tall students in our sample pull the mean up slightly compared to the median. This is an important distinction: the mean and median coincide when the data are distributed symmetrically, as in a normal distribution, and they drift apart when the data are skewed (we will cover normal distributions in greater detail later in the course).

Overall, there are two important findings here. First, the histogram provides evidence of a skew in the data, as there are a couple of students who are taller than most in the sample. Second, the measures of center support the idea that there is a skew in the data, since the mean is larger than the median.
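To see how strongly a single extreme value can move the mean compared to the median, consider the following sketch. It uses the first ten heights printed by str(d), so the results will not match the values of 169 and 168 from the full data set, and then adds one hypothetical student of 200 cm. That extra value is invented purely for illustration.

```r
# Stand-in data: the first ten heights printed by str(d)
heights <- c(161, 169, 165, 178, 162, 175, 172, 160, 171, 178)

mean(heights)     # 169.1
median(heights)   # 170

# Add one hypothetical, very tall student (200 cm is not in the real data)
with_tall <- c(heights, 200)

mean(with_tall)   # about 171.9
median(with_tall) # 171

# How far did each measure move?
mean(with_tall) - mean(heights)       # shift in the mean
median(with_tall) - median(heights)   # shift in the median
```

The mean moves by almost three centimeters, while the median moves by only one, which is why the median is often preferred for skewed data.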

Boxplot

A boxplot, also called a box-and-whisker plot, displays the distribution of data using five key values: the minimum, Q1, median (or Q2), Q3, and maximum. Please refer to the textbook or lecture slides for an explanation of quartiles.

The box spans from Q1 to Q3, covering the middle 50% of the data, with a line inside marking the median. The distance between Q1 and Q3 is called the interquartile range (IQR). The whiskers are the lines extending from each end of the box out to the minimum and maximum values, giving a sense of how far the extremes stretch relative to the bulk of the data. When outliers are present, the whiskers instead stop at the most extreme data point that lies within \(1.5 \times IQR\) of the box (measured from Q1 for the lower whisker and from Q3 for the upper whisker), and any values beyond that are plotted individually as dots. Together, these elements provide a quick visual summary of the center, spread, and any unusually extreme values in the data.
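The quantities behind a boxplot can be computed directly, as in the following sketch. It uses the first ten heights printed by str(d) plus one hypothetical value of 200 cm, added only so that there is an outlier to find. It relies on the default quartile definition of quantile(); other software and textbooks may use slightly different rules, so small differences are possible.

```r
# First ten heights from str(d), plus one hypothetical 200 cm value
x <- c(161, 169, 165, 178, 162, 175, 172, 160, 171, 178, 200)

# Five key values: minimum, Q1, median, Q3, maximum
quantile(x, c(0, 0.25, 0.5, 0.75, 1))

q     <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr   <- q[2] - q[1]                                # interquartile range
lower <- q[1] - 1.5 * iqr                           # lower limit for the whisker
upper <- q[2] + 1.5 * iqr                           # upper limit for the whisker

# Whiskers stop at the most extreme points inside the limits
whisker_low  <- min(x[x >= lower])
whisker_high <- max(x[x <= upper])

# Anything outside the limits would be drawn as a dot
outliers <- x[x < lower | x > upper]
outliers
```

Here Q1 is 163.5 and Q3 is 176.5, so the IQR is 13 and the upper limit is 196. The invented value of 200 lies beyond it and is flagged as an outlier, while the upper whisker stops at 178.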

We will again use ggplot() to create this plot. Just for fun, we will create two boxes in the plot based on whether or not the students play a sport. This is a dummy variable, meaning that it is a qualitative variable that can only take on values of 0 or 1. If \(sport=0\), the student does not play a sport, and they do play a sport if \(sport=1\). We add this variable to the x-axis, and the students’ height to the y-axis as follows:

ggplot(d, aes(x = sport, y = height)) +
  geom_boxplot(fill = "green", color = "purple") +
  labs(y = "Height (cm)", x = "Sport")

The findings here are interesting. We can now see that two students in the sample do indeed have heights that fall outside the range covered by the whiskers, as shown by the dots. It is also interesting to note that, while the two distributions do not look very different by sports activity, neither of the outliers plays a sport.
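The visual comparison can be backed up with numbers by computing summary statistics separately for each group. The sketch below does this for the six rows printed by head(d), used here as a small stand-in, so the results are illustrative only and will differ from the full sample. The names d_demo and med_by_sport are assumptions.

```r
# Stand-in data: height and sport for the six rows printed by head(d)
d_demo <- data.frame(
  height = c(161, 169, 165, 178, 162, 175),
  sport  = factor(c(1, 0, 0, 0, 0, 1))
)

table(d_demo$sport)   # number of students in each group

# Median height within each group (0 = no sport, 1 = sport)
med_by_sport <- tapply(d_demo$height, d_demo$sport, median)
med_by_sport

# Full summary (minimum, quartiles, mean, maximum) within each group
tapply(d_demo$height, d_demo$sport, summary)
```

With the real data, replacing d_demo with d gives the group medians that correspond to the lines inside the two boxes.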