Statistics

Exercise 3: Introduction to linear regression

Introduction

In this exercise, we will run a regression estimating the relationship between height and hand size using our student data set.

Today we will ask the simple question: Is there a relationship between a person’s height and the size of their hand?

Setup

Header and packages

First, open a new script and enter the same information we worked on before. You can copy and paste everything we did in the last session, then just make some adjustments so it looks like this:

###########################################################################
# Project: Running a regression
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clearing environment
rm(list=ls())

library(ggplot2)     # for creating plots

setwd("C:/Projects/applied-econometrics/R") # Enter the correct path for your directory 

Note that we only need the ggplot2 library today. Regressions are handled using base R functions, but we will want to make a few plots so we can get an idea of what we are working with.

Loading the data

As mentioned, we will use the data from the last assignment, stored in the file student_data_complete.csv:

d <- read.csv("dat/student_data_complete.csv")
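Before plotting, it can help to check what was loaded, especially for missing values, since lm() drops incomplete rows by default. The following is a minimal sketch using a small made-up data frame called toy (a hypothetical stand-in for d, not the course data); the same calls should work on d once it is loaded.

```r
# Toy data frame standing in for `d`; the values are invented for illustration.
toy <- data.frame(
  height = c(160, 165, 170, 172, 175, 180, 183, NA),
  hand   = c(16.5, 17.0, 17.8, 18.0, 18.4, 19.0, NA, 18.2)
)

str(toy)       # column names and types
summary(toy)   # ranges, means, and NA counts per column

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows where both variables are present
n_complete <- sum(complete.cases(toy))
n_complete
```

If any column contains NA values, the regression later on will be estimated on fewer rows than the data frame contains.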

Plotting the data

When running a regression, the most useful visual is the scatter plot. We will set the height on the x-axis and hand size on the y-axis.

Since we are using ggplot as in the previous exercises, we can use very similar code and just change the line specifying the plot type. We previously used geom_histogram() for a histogram and geom_boxplot() for a box plot. We will now use geom_point() for the scatter plot, and within that function call we can specify aspects of the markers such as shape, size, and color. Feel free to play around with these options. More information on the markers can be found in the ggplot2 documentation.

ggplot(d, aes(x = height, y = hand)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "green") +
  labs(x = "Height", y = "Hand size",
       title = "Scatter Plot of Height vs Hand Size") +
  theme_minimal()

Based on the plot, there appears to be a positive relationship between height and hand size: taller students tend to have larger hands. However, a scatter plot only suggests a pattern, so we should run a regression to estimate the relationship more precisely.
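One quick numerical check of the direction seen in a scatter plot is the correlation coefficient. The sketch below uses a made-up data frame called toy (hypothetical values, not the course data); with the real data, d would take its place.

```r
# Toy data frame standing in for `d`; the values are invented for illustration.
toy <- data.frame(
  height = c(160, 165, 170, 172, 175, 180, 183, NA),
  hand   = c(16.5, 17.0, 17.8, 18.0, 18.4, 19.0, NA, 18.2)
)

# use = "complete.obs" skips rows where either variable is missing;
# without it, cor() returns NA as soon as one value is missing.
r <- cor(toy$height, toy$hand, use = "complete.obs")
r

# A positive value indicates that the two variables tend to rise together.
r > 0
```

The correlation only describes the strength and direction of a linear association; it does not give the slope in centimeters, which is what the regression provides.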

Running the regression

Now let’s run the regression. It takes only one command in base R: lm(), short for linear model, which estimates a linear equation. We specify the variables in a formula (the outcome to the left of ~, the explanatory variable to the right) and the data object, and we store the results in a new object called reg1. The simple written form of what we are trying to estimate is:

\[ height = a + (b \times hand) \]

where \(height\) is the height of a student in centimeters, \(a\) is our intercept, \(b\) is the regression coefficient (the slope), and \(hand\) is the size of the student’s hand in centimeters. We code this as:

reg1 <- lm(height ~ hand, data = d)

Creating reg1 does not print anything, so we call summary() on it to view the results:

summary(reg1)
## 
## Call:
## lm(formula = height ~ hand, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1452 -3.5142  0.2237  2.8548 13.7266 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.8640    16.8986   2.832  0.00944 ** 
## hand          6.7378     0.9358   7.200 2.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.501 on 23 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6927, Adjusted R-squared:  0.6793 
## F-statistic: 51.84 on 1 and 23 DF,  p-value: 2.488e-07
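The numbers in this printout can also be pulled out of the model object directly, which avoids copying them by hand. The following is a sketch on a made-up data frame called toy (hypothetical values, not the course data), with toy_reg playing the role of reg1.

```r
# Toy data frame standing in for `d`; the values are invented for illustration.
toy <- data.frame(
  height = c(160, 165, 170, 172, 175, 180, 183, NA),
  hand   = c(16.5, 17.0, 17.8, 18.0, 18.4, 19.0, NA, 18.2)
)

toy_reg <- lm(height ~ hand, data = toy)
toy_sum <- summary(toy_reg)

coef(toy_reg)          # named vector: "(Intercept)" and "hand"
toy_sum$coefficients   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
toy_sum$r.squared      # the "Multiple R-squared" value
nobs(toy_reg)          # rows actually used after dropping incomplete ones
```

Comparing nobs() with nrow() of the data frame shows how many observations were dropped, which corresponds to the "deleted due to missingness" line in the summary output.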

Now we can interpret the results. This is the important part! For this exercise, we are interested in interpreting only the following quantities:

  • Intercept (\(a\)): A hand size of 0 cm corresponds to a predicted height of about 47.86 cm. This is not meaningful in practice, since no hand measures 0 cm, but it is worth remembering how to interpret it.
  • Slope (\(b\)): For every 1 cm increase in hand size, height increases by about 6.74 cm on average.
  • \(R^2\): About 69% of the observed variation in height is explained by hand size.
  • p-value: The p-value for the slope is 2.49e-07 (about 0.00000025), far below the usual 0.05 threshold, meaning there is strong evidence of a linear relationship between hand size and height.
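The interpretations above can be checked with a little arithmetic on the numbers reported in the summary output. In the sketch below, the coefficient values are copied from that output, while hand_cm = 18 is a hypothetical hand size chosen only for illustration.

```r
# Values copied from the summary(reg1) output
a    <- 47.8640   # intercept estimate
b    <- 6.7378    # slope estimate for hand
se_b <- 0.9358    # standard error of the slope
r2   <- 0.6927    # Multiple R-squared

# Predicted height for a hypothetical hand size of 18 cm
hand_cm <- 18
pred <- a + b * hand_cm
pred              # 169.1444

# The t value is the estimate divided by its standard error
t_b <- b / se_b
round(t_b, 2)     # 7.2

# With one explanatory variable, the F-statistic is the squared t value
round(t_b^2, 2)   # 51.84

# With one explanatory variable, R-squared is the squared correlation
r <- sqrt(r2)
round(r, 2)       # 0.83
```

A prediction like this is only sensible for hand sizes within the range observed in the data, which is also why the intercept at 0 cm has no practical meaning.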

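To visualize the relationship, we can add a regression line to the scatter plot we produced before. This can be done by adding geom_smooth(), specifying method = "lm" so that the line comes from a linear model and se = FALSE so that no confidence band is drawn around it (that is a lesson for later). We also set the line to be black so it is easy to see. Note that geom_smooth() regresses the y variable on the x variable, so with height on the x-axis the line drawn here comes from regressing hand on height rather than from reg1. To draw the reg1 line itself, swap x and y in aes() and swap the axis labels to match.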

ggplot(d, aes(x = height, y = hand)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(x = "Height", y = "Hand size",
       title = "Scatter Plot of Height vs Hand Size") +
  theme_minimal()
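Which variable goes on which axis matters for the fitted line, because regressing height on hand and regressing hand on height give different lines. The sketch below illustrates this on a made-up data frame called toy (hypothetical values, not the course data); fit_fwd has the same form as reg1, and fit_rev is the model that geom_smooth(method = "lm") fits when height is on the x-axis.

```r
# Toy data frame standing in for `d`; the values are invented for illustration.
toy <- data.frame(
  height = c(160, 165, 170, 172, 175, 180, 183, 168),
  hand   = c(16.5, 17.0, 17.8, 18.0, 18.4, 19.0, 19.6, 18.2)
)

fit_fwd <- lm(height ~ hand, data = toy)   # same form as reg1
fit_rev <- lm(hand ~ height, data = toy)   # y = hand, x = height

b_fwd <- coef(fit_fwd)[["hand"]]     # cm of height per cm of hand
b_rev <- coef(fit_rev)[["height"]]   # cm of hand per cm of height

# The reverse slope is not simply 1 / b_fwd unless the fit is perfect
c(forward = b_fwd, inverse_of_reverse = 1 / b_rev)

# The product of the two slopes equals R-squared
b_fwd * b_rev
summary(fit_fwd)$r.squared
```

In ggplot2, using aes(x = hand, y = height) makes geom_smooth(method = "lm") draw the line that corresponds to a model of the form height ~ hand.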