Advanced Economic Methods

Exercise 2: Multicollinearity

Overview

In this lesson, we will learn how to test for and correct multicollinarity. Simply put, multicollinarity arises when 2 or more of the independent (explanatory) variables in a multivariate regression are highly correlated.

We will use the data set for Example 2 in Chapter 5 of the textbook.

Setup

We need to install two new packages called dplyr and car. The first helps with a wide range of data organization tasks, and the second is used to estimate a variance inflation factor (VIF). Copy the following code into the command line in the bottom left corner (not in your script) and hit enter:

install.packages(c("dplyr", "car"))

Now create a new R script and add the header like we did in the first lesson. Then clear the environment and add the packages we will need for this analysis. The beginning of your script should look something like this:

###########################################################################
# Project: Multicollinearity
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clear environment and load libraries
rm(list=ls())
library(readxl)
library(dplyr)
library(car)

Now we can set the working drive and load the data. Following the exercise in the book, we will use UK economic data.This data has two rows of labels at the top of the file, so we will skip the first row and use the labels from the second row.

# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path

# Load data
d <- read_excel("dat/imports_uk.xls", skip = 1, col_names = TRUE)

## New names:
## • `` -> `...1`

Let’s have a quick look at the data to see what we are working with.

head(d)

## # A tibble: 6 × 5
##   ...1  Imports   GDP   CPI   RPI
##   <chr>   <dbl> <dbl> <dbl> <dbl>
## 1 80 Q1   27681  72.2    NA  45.6
## 2 Q2      26651  70.8    NA  47.5
## 3 Q3      24394  70.5    NA  48.3
## 4 Q4      23815  69.7    NA  49.2
## 5 81 Q1   22833  69.4    NA  50.6
## 6 Q2      24096  69.5    NA  52.4

We will not need the column titled “…1”. It is a time identifier that will not be used, so let’s just remove it to make the data set easier to work with

d <- d %>%
  select(-"...1")

Analysis

First, let’s do a simple correlation matrix using all four variables in the data set. We can see from the head() command that there are some missing values in the data set, so we will add the option use = "pairwise.complete.obs" because it will use all available data for each pair of variables instead of dropping entire rows with any missing value

cor(d, use = "pairwise.complete.obs")

##           Imports       GDP       CPI       RPI
## Imports 1.0000000 0.9874500 0.8771250 0.9551633
## GDP     0.9874500 1.0000000 0.8795169 0.9667326
## CPI     0.8771250 0.8795169 1.0000000 0.9907019
## RPI     0.9551633 0.9667326 0.9907019 1.0000000

Correlation coefficients have a range of -1 to 1, with 0 meaning no correlation at all, -1 meaning a perfect negative correlation (e.g. a 1-unit increase in variable A corresponds to a 1-unit decrease in variable B), and 1 meaning a perfect positive correlation. We can see from the results that all these variables have very high correlation, which makes it quite clear that multicollinearity will be an issue in any regressions we run on the data.

To demonstrate why this correlation might cause problems, we will now run a set of regressions to see what happens when we modify the variables in the regression. The first model will be run using the log of Imports as the dependent variable and the logs of GDP and CPI as independent variables.

Note that I am writing the regression to an object called reg1. The output provided when running this command without storing as an object does not provide the appropriate summary statistics, so we will use the summary() command after running the regression to view the results.

reg1 <- lm(log(Imports) ~ log(GDP) + log(CPI), data = d)

summary(reg1)

## 
## Call:
## lm(formula = log(Imports) ~ log(GDP) + log(CPI), data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.057621 -0.014054  0.000724  0.013818  0.051238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.50640    0.29530   1.715   0.0934 .  
## log(GDP)     2.13614    0.10543  20.261   <2e-16 ***
## log(CPI)     0.10714    0.05012   2.138   0.0381 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02501 on 44 degrees of freedom
##   (53 observations deleted due to missingness)
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9768 
## F-statistic: 969.9 on 2 and 44 DF,  p-value: < 2.2e-16

We can see that we have a high R-squared, GDP has a relatively high coefficient with strong statistical significance (i.e. very low p-value), and CPI is barely statistically significant (p = 0.0381).

Now we will run another regression but add in the PPI variable, as such

reg2 <- lm(log(Imports) ~ log(GDP) + log(CPI) + log(RPI), data = d)

summary(reg2)

## 
## Call:
## lm(formula = log(Imports) ~ log(GDP) + log(CPI) + log(RPI), data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.051519 -0.014701  0.003888  0.013514  0.046339 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3942     0.2813   1.402  0.16819    
## log(GDP)      2.2473     0.1081  20.782  < 2e-16 ***
## log(CPI)      0.5538     0.1791   3.093  0.00348 ** 
## log(RPI)     -0.5745     0.2222  -2.586  0.01319 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02353 on 43 degrees of freedom
##   (53 observations deleted due to missingness)
## Multiple R-squared:  0.9808, Adjusted R-squared:  0.9795 
## F-statistic: 732.4 on 3 and 43 DF,  p-value: < 2.2e-16

Now, we see that statistical significance for CPI improved considerably, with a coefficient that jumped from 0.11 to 0.55 and p-values of 0.038 to 0.003. Additionally, we find that the RPI variable has a negative coefficient that and significant (p = 0.013). This change in coefficients and significance is a classic sign of multicollinearity: the independent variables are highly correlated, making it hard for the model to isolate their individual effects. Let’s run the model again without CPI in the mix

reg3 <- lm(log(Imports) ~ log(GDP) + log(RPI), data = d)

summary(reg3)

## 
## Call:
## lm(formula = log(Imports) ~ log(GDP) + log(RPI), data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.063172 -0.024733 -0.001189  0.024888  0.071544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.38698    0.22695   6.111 4.58e-08 ***
## log(GDP)     1.99504    0.10589  18.840  < 2e-16 ***
## log(RPI)     0.06458    0.06146   1.051    0.297    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0331 on 72 degrees of freedom
##   (25 observations deleted due to missingness)
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.9869 
## F-statistic:  2781 on 2 and 72 DF,  p-value: < 2.2e-16

Now we see that the coefficient for RPI is positive, but relatively small and not statistically significant. This pattern confirms that multicollinearity is affecting the regression: the estimated coefficients change dramatically depending on which correlated variables are included. We can also run a VIF test to verify the issue:

vif(reg3)

## log(GDP) log(RPI) 
## 14.10901 14.10901

With a VIF > 14, we know for sure that multicollinearity is a problem in the model. The easiest way to solve this problem is to simply drop one of the highly correlated variables. Including all of them does not provide additional useful information. Since GDP has had strong metrics that were relatively stable across the regressions and the others created problems, I would suggest running a regression with only GDP:

reg4 <- lm(log(Imports) ~ log(GDP), data = d)

summary(reg4)

## 
## Call:
## lm(formula = log(Imports) ~ log(GDP), data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.066117 -0.024121 -0.001085  0.024499  0.076025 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.18858    0.12601   9.432  2.9e-14 ***
## log(GDP)     2.10230    0.02821  74.519  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03312 on 73 degrees of freedom
##   (25 observations deleted due to missingness)
## Multiple R-squared:  0.987,  Adjusted R-squared:  0.9868 
## F-statistic:  5553 on 1 and 73 DF,  p-value: < 2.2e-16

These results are now pretty good, so we can stop with this. Other methods are available to include the other variables. You could, as examples, create a new variable that is the ratio of GDP to CPI, or create a composite variable that aggregates some or all of the X variables.