Statistics
Exercise 3: Introduction to linear regression
Introduction
In this exercise, we will run a regression estimating the relationship between height and hand size using our student data set.
Today we will ask the simple question: Is there a relationship between a person’s height and the size of their hand?
Setup
Header and packages
First, open a new script and enter the same information we worked on before. You can copy and paste everything we did in the last session, then just make some adjustments so it looks like this:
###########################################################################
# Project: Running a regression
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clearing environment
rm(list=ls())
library(ggplot2) # for creating plots
setwd("C:/Projects/applied-econometrics/R") # Enter the correct path for your directory
Note that we only need the ggplot2 library today. Regressions are handled using base R functions, but we will want to make a few plots so we can get an idea of what we are working with.
Plotting the data
When running a regression, the most useful visual is the scatter plot. We will set the height on the x-axis and hand size on the y-axis.
Since we are using ggplot2, as in the previous exercises,
we can reuse very similar code and change only the line specifying the
plot type. We previously used geom_histogram() for a
histogram and geom_boxplot() for a box plot. We will now
use geom_point() for the scatter plot, and within that
function call we can specify aspects of the markers such as shape, size,
and color. Feel free to play around with these options. More information
on the markers can be found here.
ggplot(d, aes(x = height, y = hand)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "green") +
  labs(x = "Height", y = "Hand size",
       title = "Scatter Plot of Height vs Hand Size") +
  theme_minimal()
Based on the plot, we can see that there does appear to be a positive relationship between height and hand size, meaning that as height increases, hand size also increases. However, we should run a regression to get more precise estimates of this relationship.
Running the regression
Now let’s run the regression. It takes only one simple command in
base R. The command lm() stands for linear
model (for estimating a linear equation). We need to
specify the variables and the data object, and we will store the results
of the regression in a new object called reg1.
The simple written form of what we are trying to estimate is:
\[ height = a + (b \times hand) \]
where \(height\) is the height of a student in centimeters, \(a\) is our intercept, \(b\) is the regression coefficient, and \(hand\) is the size of the student’s hand in centimeters. We code this as:
reg1 <- lm(height ~ hand, data = d)
Now, we have to run summary() on the new object to view the results:
summary(reg1)
##
## Call:
## lm(formula = height ~ hand, data = d)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1452 -3.5142  0.2237  2.8548 13.7266
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  47.8640    16.8986   2.832  0.00944 **
## hand          6.7378     0.9358   7.200 2.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.501 on 23 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6927, Adjusted R-squared: 0.6793
## F-statistic: 51.84 on 1 and 23 DF, p-value: 2.488e-07
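The individual numbers in this printout can also be pulled out of the fitted object, which is handy if you want to reuse them later. The sketch below is only an illustration: the data frame toy and its values are made up for this example (they are not the student data), and one hand measurement is deliberately missing to show where the "observation deleted due to missingness" line comes from.

```r
# Made-up illustrative data (NOT the student data set); one hand value is missing
toy <- data.frame(
  height = c(160, 165, 170, 172, 178, 181, 185, 190),
  hand   = c(16.5, 17.0, 18.0, NA, 19.0, 19.5, 20.5, 21.0)
)

fit <- lm(height ~ hand, data = toy)  # same call pattern as for reg1
s   <- summary(fit)

coef(fit)        # named vector with (Intercept) and hand
s$coefficients   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared      # Multiple R-squared
nobs(fit)        # rows actually used; lm() drops rows with NA by default
```

With the real data, the same accessors applied to reg1 should return the values shown in the summary above.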
Now we can interpret the results. This is the important part! We are interested in interpreting only the following quantities for this exercise:
- Intercept (\(a\)): A hand size of 0 cm corresponds to a predicted height of about 47.86 cm. This is not really practical in this example, since no student has a hand size of 0 cm, but it is worth remembering how to interpret it.
- Slope (\(b\)): For every 1 cm increase in hand size, height increases by about 6.74 cm on average.
- \(R^2\): About 69% of the observed variation in height is explained by hand size.
- p-value: The p-value for the slope is 2.49e-07 (about 0.00000025), meaning there is strong evidence of a linear relationship between hand size and height.
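To see how these numbers fit together, here is a small worked sketch that uses only the estimates printed in the summary. The hand size of 18 cm is a hypothetical value chosen for illustration, and the variable names are our own.

```r
# Estimates copied from the summary() output above
a    <- 47.8640   # intercept
b    <- 6.7378    # slope on hand
se_b <- 0.9358    # standard error of the slope
df   <- 23        # residual degrees of freedom

hand_cm <- 18                     # hypothetical hand size, for illustration only
pred    <- a + b * hand_cm        # predicted height in cm
pred                              # 169.1444

t_b <- b / se_b                   # t value = estimate / standard error (about 7.2)
p_b <- 2 * pt(-abs(t_b), df)      # two-sided p-value from the t distribution

f_stat <- t_b^2                   # with a single regressor, F equals t squared
r2     <- f_stat / (f_stat + df)  # R-squared recovered from F and df (about 0.69)
```

Because the inputs are rounded, the recomputed values will differ slightly from the printed ones in the last digits.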
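To visualize the relationship, we can add a regression line to the
scatter plot we produced before. This can be done by adding the
command geom_smooth() and specifying that the line is
for a linear model with method = "lm" and that no standard-error
band is drawn with se = FALSE (that is a lesson for later). We also set
the line to be black so it is easy to see. Note that geom_smooth() fits
the y variable on the x variable, so with height on the x-axis the line
shows hand size fitted on height. It illustrates the same positive
relationship, but it is not the exact line stored in reg1: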
ggplot(d, aes(x = height, y = hand)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(x = "Height", y = "Hand size",
       title = "Scatter Plot of Height vs Hand Size") +
  theme_minimal()
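If you want the plotted line to be the one estimated by the model height ~ hand, one option is to put hand size on the x-axis and height on the y-axis and draw the line from the fitted coefficients. The following is a minimal sketch under stated assumptions: the data frame toy is made up for illustration (it is not the student data), and fit plays the role of reg1.

```r
library(ggplot2)

# Made-up illustrative data (NOT the student data set)
toy <- data.frame(
  height = c(160, 165, 170, 172, 178, 181, 185, 190),
  hand   = c(16.5, 17.0, 18.0, 18.5, 19.0, 19.5, 20.5, 21.0)
)

fit <- lm(height ~ hand, data = toy)  # plays the role of reg1

# hand on the x-axis and height on the y-axis, matching height ~ hand
p <- ggplot(toy, aes(x = hand, y = height)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "green") +
  # draw the line directly from the estimated intercept and slope
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["hand"]], color = "black") +
  labs(x = "Hand size", y = "Height",
       title = "Scatter Plot of Hand Size vs Height") +
  theme_minimal()
p
```

Using geom_abline() with the stored coefficients keeps the plot tied to the fitted object, whereas geom_smooth(method = "lm") refits a line from whatever is mapped to x and y.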