Advanced Economic Methods

Exercise 4: Autocorrelation

Overview

The following tutorial will guide you through the identification and resolution of autocorrelation in an example data set. If you have not done the previous tutorials, go back to the introductory lesson here to learn where to get the example data sets and set up the appropriate working paths to follow along.

Setup

We do not need to install any new packages for today’s exercise. Start a new script file, then add your header and the appropriate libraries. We will use the haven package to load the data, dplyr for some data organization commands, ggplot2 for generating graphical analyses, and lmtest for some of the postestimation testing for aurocorrelation.

###########################################################################
# Project: Autocorrelation
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clear environment and load libraries
rm(list=ls())
library(haven) 
library(dplyr)
library(ggplot2)
library(lmtest)

# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path

Now we can set the working drive and load the data. For this exercise, we will use the ser_cor data set:

# Load data
d <- read_dta("dat/ser_cor.dta")

Take a look at the data:

str(d)

## tibble [38 × 4] (S3: tbl_df/tbl/data.frame)
##  $ lcons  : num [1:38] 4.63 4.61 4.56 4.57 4.6 ...
##   ..- attr(*, "label")= chr "LCONS"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ ldisp  : num [1:38] 4.43 4.45 4.45 4.46 4.48 ...
##   ..- attr(*, "label")= chr "LDISP"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ lprice : num [1:38] 4.29 4.33 4.38 4.4 4.38 ...
##   ..- attr(*, "label")= chr "LPRICE"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ datevar: num [1:38] 300 301 302 303 304 305 306 307 308 309 ...
##   ..- attr(*, "format.stata")= chr "%tm"

and we can see it contains the following variables:

lcons: The natural log of consumer expenditure on food
ldisp: The natural log of disposable income
lprice: The natural log of the relative price index of food
datevar: Quarterly time variable. It contains only numbers but translates to be the first quarter (Q1) of 1985 to 1994Q2

Regression

Let’s start by running a regression that will evaluate the relationship on consumer food expenditure relative to disposable income and the relative price index of food:

\[lcons_t=\beta_0+\beta_1ldisp_t+\beta_2lprice_t+u_t\]

Take a close look at the results to practice interpreting the findings:

reg1 <- lm(lcons ~ ldisp + lprice, data = d)

summary(reg1)

## 
## Call:
## lm(formula = lcons ~ ldisp + lprice, data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067365 -0.031403 -0.001917  0.030286  0.106230 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.48543    0.78835   3.153  0.00331 **
## ldisp        0.52929    0.29233   1.811  0.07880 . 
## lprice      -0.06403    0.14651  -0.437  0.66476   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04625 on 35 degrees of freedom
## Multiple R-squared:  0.2344, Adjusted R-squared:  0.1907 
## F-statistic: 5.358 on 2 and 35 DF,  p-value: 0.009331

Graphical analysis

The first step of identifying autocorrelation is the extract and plot the residuals. We can extract them by creating a new object that we will call res:

residual <- residuals(reg1)

We can add them to the main data set d. We also need the lagged residuals for the second plot, so we will add them both in the same chain of commands using the mutate command to add these as new variables in the data set. For simplicity, we will call these new variables res1 for the main residuals, and res2 for the lagged residuals.

d <- d %>%
  mutate(res1 = residual,
         res2 = lag(residual, 1)) # The 1 specifies that we are lagging by 1 time period

Residuals over time

Let’s now plot the residuals. We will start by plotting them over time using the datevar variable along the x-axis and the values of res1 on the y-axis.Since the residuals should be randomly hovering around 0 for all time periods, we will add a horizontal line at 0 for a better visualization

ggplot(d, aes(x = datevar, y = res1)) +
  geom_point(shape = 16, color="darkblue", size = 3) + 
  geom_hline(yintercept = 0, color = "darkred", linetype = "dashed") +
  labs(title = "Residuals over time",
       x = "Time", y = "Residual") +
theme_minimal()

Residuals vs lagged residuals

The second plot we will generate is a scatter plot that looks at the main residuals res1 on the y-axis and the lagged residuals res2 on the x-axis.

ggplot(d, aes(x = res2, y = res1)) +
  geom_point(shape = 16, color = "darkblue", size = 3) +
  geom_hline(yintercept = 0, color = "darkred", linetype = "dashed") +
  labs(title = "Residuals vs lagged residuals",
       x = "Lagged residuals", y = "Residuals") + 
theme_minimal()

Both plots show trends consistent with positive autocorrelation, so we will proceed with the statistical tests to formally identify and treat the issue.

Statistical tests

This section will run through the Durbin-Watson and Breusch-Godfrey tests.

Durbin-Watson (DW) test

The following assumptions should be satisfied to run the DW test:

The regression model (reg1) should have a constant
Autocorrelation is assumed to be only first-order, i.e. \(u_t=\rho u_{t-1}+e_t\), where \(\rho\) is a correlation coefficient with range [-1,1]
The equation does not include a lagged variable

To conduct the test, we will use the dwtest() command from the lmtest package:

dwtest(reg1)

## 
##  Durbin-Watson test
## 
## data:  reg1
## DW = 0.37019, p-value = 3.456e-12
## alternative hypothesis: true autocorrelation is greater than 0

The results here are quite conclusive considering the tiny p-value, but we should still look at what the DW value actually means. First, we need to look up the appropriate test statistic from a DW test table, which you can find here. To read the table, we need to know the number of regressors (i.e. independent variables), which is easy (2). We also need to know the number of observations, which is equal in our case to the number of rows:

nrow(d)

## [1] 38

Now let’s look at the table. Our model has an intercept, so we will use the Savin and White table on page 4. Looking at \(k=2\) and \(n=38\), we get a lower bound value (dL) of 1.176 and an upper bound (dU) of 1.388. Since our test statistic is considerably lower than the lower boundary, we can reject the null hypothesis of no autocorrelation because \(DW<dL\).

There are a few drawbacks to the DW test:

It may give inconclusive results
Not applicable when a lagged dependent variable is used
Can’t account for higher-order autocorrelation

Breusch-Godfrey test

To overcome some of the drawbacks of the DW test, we can also apply the Breusch-Godfrey test. This is a Lagrange Multiplier test, meaning that it operates by estimating a restricted model (under the null hypothesis of no autocorrelation) and then testing whether the Lagrange multipliers associated with the constraints (that autocorrelation coefficients are zero) are statistically significant.

The value of this test in the context of our analysis is that it can test different numbers of orders (i.e. lags), whereas the DW test assumed first-order only. This is very important for our data because we have a quarterly time series, so it makes sense to test as many as 4 lags to catch any yearly patterns in the data. This is basically the theoretical maximum we should test, so we will run the test 4 times and increase the order number each time up to 4:

bgtest(reg1, order = 1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  reg1
## LM test = 23.23, df = 1, p-value = 1.437e-06

bgtest(reg1, order = 2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 2
## 
## data:  reg1
## LM test = 23.263, df = 2, p-value = 8.883e-06

bgtest(reg1, order = 3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 3
## 
## data:  reg1
## LM test = 25.801, df = 3, p-value = 1.05e-05

bgtest(reg1, order = 4)

## 
##  Breusch-Godfrey test for serial correlation of order up to 4
## 
## data:  reg1
## LM test = 26.224, df = 4, p-value = 2.851e-05

We can see that we have significant autocorrelation in all four tests based on the low p-value (with a null hypothesis of no autocorrelation). It doesn’t make sense to continue increasing the number of lags beyond the theoretical maximum, so we will conclude that we have to continue by resolving the problem.

Resolving autocorrelation

There are 5 steps to this procedure:

Step 1: Estimate the regression and obtain residuals (already done)
Step 2: Estimate \(\rho\) from regressing the residuals to its lagged terms
Step 3: Transform the original variables using \(\rho\)
Step 4: Run the regression again with the transformed variables and obtain residuals
Step 5 and on: Continue repeating steps 2 to 4 for several rounds until (stopping rule) the estimates of from two successive iterations differ by no more than some preselected small value, such as 0.001.

Since this is a repetitive task, we will create a loop that will run through the cycle until we reach our target differing value. First, we can create some objects to help us. We will create a new data object for our regression to preserve the original data structure and name it reg_current. We will also create an object for \(\rho\) called rho_old and set it to 0, then set our target differing value to 0.001 with an object called tol

reg_current <- reg1
rho_old <- 0
tol <- 0.001

Now we will create the loop and set it for 20 iterations, though it should take far fewer than that

for(iter in 1:20) {
  # The next two lines generate the residuals and rho
  resids <- residuals(reg_current)
  rho <- cor(resids[-1], resids[-length(resids)])
  
  # This will signal if the process is complete and provide the text summary  
  if(abs(rho - rho_old) < tol) {
    cat("Converged after", iter, "iterations. Rho =", rho, "\n")
    break
  }
  
  n <- nrow(d) # Calculates new sample size (previous - 1)
  lcons_trans <- d$lcons[-1] - rho * d$lcons[-n]    # Transform consumption
  ldisp_trans <- d$ldisp[-1] - rho * d$ldisp[-n]    # Transform disposable income
  lprice_trans <- d$lprice[-1] - rho * d$lprice[-n] # Transform price index
  
  # Run new regression with transformed variables
  reg_current <- lm(lcons_trans ~ ldisp_trans + lprice_trans)
  rho_old <- rho # Replace old rho value with current rho value for next iteration
}

## Converged after 8 iterations. Rho = 0.5853132

The loop is complete, and we can see that it ran for 8 iterations and set the precise \(\rho\) value we need for the new regression. Let’s take a look at the new regression results fitted with \(\rho\) and interpret the results:

summary(reg_current)

## 
## Call:
## lm(formula = lcons_trans ~ ldisp_trans + lprice_trans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04111 -0.01466 -0.00094  0.01615  0.06141 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.2957     0.3774   0.784  0.43870    
## ldisp_trans    1.2459     0.2637   4.725 3.89e-05 ***
## lprice_trans  -0.3930     0.1272  -3.090  0.00398 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02573 on 34 degrees of freedom
## Multiple R-squared:  0.3968, Adjusted R-squared:  0.3613 
## F-statistic: 11.18 on 2 and 34 DF,  p-value: 0.0001853