Advanced Economic Methods
Exercise 2: Multicollinearity
Overview
In this lesson, we will learn how to test for and correct multicollinarity. Simply put, multicollinarity arises when 2 or more of the independent (explanatory) variables in a multivariate regression are highly correlated.
We will use the data set for Example 2 in Chapter 5 of the textbook.
Setup
We need to install two new packages called dplyr and
car. The first helps with a wide range of data organization
tasks, and the second is used to estimate a variance inflation factor
(VIF). Copy the following code into the command line in the bottom left
corner (not in your script) and hit enter:
Now create a new R script and add the header like we did in the first lesson. Then clear the environment and add the packages we will need for this analysis. The beginning of your script should look something like this:
###########################################################################
# Project: Multicollinearity
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(readxl)
library(dplyr)
library(car)Now we can set the working drive and load the data. Following the exercise in the book, we will use UK economic data.This data has two rows of labels at the top of the file, so we will skip the first row and use the labels from the second row.
# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path
# Load data
d <- read_excel("dat/imports_uk.xls", skip = 1, col_names = TRUE)## New names:
## • `` -> `...1`
Let’s have a quick look at the data to see what we are working with.
## # A tibble: 6 × 5
## ...1 Imports GDP CPI RPI
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 80 Q1 27681 72.2 NA 45.6
## 2 Q2 26651 70.8 NA 47.5
## 3 Q3 24394 70.5 NA 48.3
## 4 Q4 23815 69.7 NA 49.2
## 5 81 Q1 22833 69.4 NA 50.6
## 6 Q2 24096 69.5 NA 52.4
We will not need the column titled “…1”. It is a time identifier that will not be used, so let’s just remove it to make the data set easier to work with
Analysis
First, let’s do a simple correlation matrix using all four variables
in the data set. We can see from the head() command that
there are some missing values in the data set, so we will add the option
use = "pairwise.complete.obs" because it will use all
available data for each pair of variables instead of dropping entire
rows with any missing value
## Imports GDP CPI RPI
## Imports 1.0000000 0.9874500 0.8771250 0.9551633
## GDP 0.9874500 1.0000000 0.8795169 0.9667326
## CPI 0.8771250 0.8795169 1.0000000 0.9907019
## RPI 0.9551633 0.9667326 0.9907019 1.0000000
Correlation coefficients have a range of -1 to 1, with 0 meaning no correlation at all, -1 meaning a perfect negative correlation (e.g. a 1-unit increase in variable A corresponds to a 1-unit decrease in variable B), and 1 meaning a perfect positive correlation. We can see from the results that all these variables have very high correlation, which makes it quite clear that multicollinearity will be an issue in any regressions we run on the data.
To demonstrate why this correlation might cause problems, we will now
run a set of regressions to see what happens when we modify the
variables in the regression. The first model will be run using the log
of Imports as the dependent variable and the logs of
GDP and CPI as independent variables.
Note that I am writing the regression to an object called
reg1. The output provided when running this command without
storing as an object does not provide the appropriate summary
statistics, so we will use the summary() command after
running the regression to view the results.
##
## Call:
## lm(formula = log(Imports) ~ log(GDP) + log(CPI), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.057621 -0.014054 0.000724 0.013818 0.051238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50640 0.29530 1.715 0.0934 .
## log(GDP) 2.13614 0.10543 20.261 <2e-16 ***
## log(CPI) 0.10714 0.05012 2.138 0.0381 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02501 on 44 degrees of freedom
## (53 observations deleted due to missingness)
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9768
## F-statistic: 969.9 on 2 and 44 DF, p-value: < 2.2e-16
We can see that we have a high R-squared, GDP has a relatively high coefficient with strong statistical significance (i.e. very low p-value), and CPI is barely statistically significant (p = 0.0381).
Now we will run another regression but add in the PPI variable, as such
##
## Call:
## lm(formula = log(Imports) ~ log(GDP) + log(CPI) + log(RPI), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.051519 -0.014701 0.003888 0.013514 0.046339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3942 0.2813 1.402 0.16819
## log(GDP) 2.2473 0.1081 20.782 < 2e-16 ***
## log(CPI) 0.5538 0.1791 3.093 0.00348 **
## log(RPI) -0.5745 0.2222 -2.586 0.01319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02353 on 43 degrees of freedom
## (53 observations deleted due to missingness)
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9795
## F-statistic: 732.4 on 3 and 43 DF, p-value: < 2.2e-16
Now, we see that statistical significance for CPI improved considerably, with a coefficient that jumped from 0.11 to 0.55 and p-values of 0.038 to 0.003. Additionally, we find that the RPI variable has a negative coefficient that and significant (p = 0.013). This change in coefficients and significance is a classic sign of multicollinearity: the independent variables are highly correlated, making it hard for the model to isolate their individual effects. Let’s run the model again without CPI in the mix
##
## Call:
## lm(formula = log(Imports) ~ log(GDP) + log(RPI), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.063172 -0.024733 -0.001189 0.024888 0.071544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.38698 0.22695 6.111 4.58e-08 ***
## log(GDP) 1.99504 0.10589 18.840 < 2e-16 ***
## log(RPI) 0.06458 0.06146 1.051 0.297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0331 on 72 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.9872, Adjusted R-squared: 0.9869
## F-statistic: 2781 on 2 and 72 DF, p-value: < 2.2e-16
Now we see that the coefficient for RPI is positive, but relatively small and not statistically significant. This pattern confirms that multicollinearity is affecting the regression: the estimated coefficients change dramatically depending on which correlated variables are included. We can also run a VIF test to verify the issue:
## log(GDP) log(RPI)
## 14.10901 14.10901
With a VIF > 14, we know for sure that multicollinearity is a problem in the model. The easiest way to solve this problem is to simply drop one of the highly correlated variables. Including all of them does not provide additional useful information. Since GDP has had strong metrics that were relatively stable across the regressions and the others created problems, I would suggest running a regression with only GDP:
##
## Call:
## lm(formula = log(Imports) ~ log(GDP), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.066117 -0.024121 -0.001085 0.024499 0.076025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.18858 0.12601 9.432 2.9e-14 ***
## log(GDP) 2.10230 0.02821 74.519 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03312 on 73 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.987, Adjusted R-squared: 0.9868
## F-statistic: 5553 on 1 and 73 DF, p-value: < 2.2e-16
These results are now pretty good, so we can stop with this. Other methods are available to include the other variables. You could, as examples, create a new variable that is the ratio of GDP to CPI, or create a composite variable that aggregates some or all of the X variables.