Advanced Economic Methods
Exercise 4: Autocorrelation
Overview
The following tutorial will guide you through the identification and resolution of autocorrelation in an example data set. If you have not done the previous tutorials, go back to the introductory lesson here to learn where to get the example data sets and set up the appropriate working paths to follow along.
Setup
We do not need to install any new packages for today’s exercise.
Start a new script file, then add your header and the appropriate
libraries. We will use the haven package to load the data,
dplyr for some data organization commands,
ggplot2 for generating graphical analyses, and
lmtest for some of the postestimation testing for
aurocorrelation.
###########################################################################
# Project: Autocorrelation
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(haven)
library(dplyr)
library(ggplot2)
library(lmtest)
# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory pathNow we can set the working drive and load the data. For this
exercise, we will use the ser_cor data set:
Take a look at the data:
## tibble [38 × 4] (S3: tbl_df/tbl/data.frame)
## $ lcons : num [1:38] 4.63 4.61 4.56 4.57 4.6 ...
## ..- attr(*, "label")= chr "LCONS"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ ldisp : num [1:38] 4.43 4.45 4.45 4.46 4.48 ...
## ..- attr(*, "label")= chr "LDISP"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ lprice : num [1:38] 4.29 4.33 4.38 4.4 4.38 ...
## ..- attr(*, "label")= chr "LPRICE"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ datevar: num [1:38] 300 301 302 303 304 305 306 307 308 309 ...
## ..- attr(*, "format.stata")= chr "%tm"
and we can see it contains the following variables:
lcons: The natural log of consumer expenditure on foodldisp: The natural log of disposable incomelprice: The natural log of the relative price index of fooddatevar: Quarterly time variable. It contains only numbers but translates to be the first quarter (Q1) of 1985 to 1994Q2
Regression
Let’s start by running a regression that will evaluate the relationship on consumer food expenditure relative to disposable income and the relative price index of food:
\[lcons_t=\beta_0+\beta_1ldisp_t+\beta_2lprice_t+u_t\]
Take a close look at the results to practice interpreting the findings:
##
## Call:
## lm(formula = lcons ~ ldisp + lprice, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.067365 -0.031403 -0.001917 0.030286 0.106230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.48543 0.78835 3.153 0.00331 **
## ldisp 0.52929 0.29233 1.811 0.07880 .
## lprice -0.06403 0.14651 -0.437 0.66476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04625 on 35 degrees of freedom
## Multiple R-squared: 0.2344, Adjusted R-squared: 0.1907
## F-statistic: 5.358 on 2 and 35 DF, p-value: 0.009331
Graphical analysis
The first step of identifying autocorrelation is the extract and plot
the residuals. We can extract them by creating a new object that we will
call res:
We can add them to the main data set d. We also need the
lagged residuals for the second plot, so we will add them both in the
same chain of commands using the mutate command to add
these as new variables in the data set. For simplicity, we will call
these new variables res1 for the main residuals, and
res2 for the lagged residuals.
d <- d %>%
mutate(res1 = residual,
res2 = lag(residual, 1)) # The 1 specifies that we are lagging by 1 time periodResiduals over time
Let’s now plot the residuals. We will start by plotting them over
time using the datevar variable along the x-axis and the
values of res1 on the y-axis.Since the residuals should be
randomly hovering around 0 for all time periods, we will add a
horizontal line at 0 for a better visualization
ggplot(d, aes(x = datevar, y = res1)) +
geom_point(shape = 16, color="darkblue", size = 3) +
geom_hline(yintercept = 0, color = "darkred", linetype = "dashed") +
labs(title = "Residuals over time",
x = "Time", y = "Residual") +
theme_minimal()Residuals vs lagged residuals
The second plot we will generate is a scatter plot that looks at the
main residuals res1 on the y-axis and the lagged residuals
res2 on the x-axis.
ggplot(d, aes(x = res2, y = res1)) +
geom_point(shape = 16, color = "darkblue", size = 3) +
geom_hline(yintercept = 0, color = "darkred", linetype = "dashed") +
labs(title = "Residuals vs lagged residuals",
x = "Lagged residuals", y = "Residuals") +
theme_minimal()Both plots show trends consistent with positive autocorrelation, so we will proceed with the statistical tests to formally identify and treat the issue.
Statistical tests
This section will run through the Durbin-Watson and Breusch-Godfrey tests.
Durbin-Watson (DW) test
The following assumptions should be satisfied to run the DW test:
- The regression model (
reg1) should have a constant - Autocorrelation is assumed to be only first-order, i.e. \(u_t=\rho u_{t-1}+e_t\), where \(\rho\) is a correlation coefficient with range [-1,1]
- The equation does not include a lagged variable
To conduct the test, we will use the dwtest() command
from the lmtest package:
##
## Durbin-Watson test
##
## data: reg1
## DW = 0.37019, p-value = 3.456e-12
## alternative hypothesis: true autocorrelation is greater than 0
The results here are quite conclusive considering the tiny p-value,
but we should still look at what the DW value actually
means. First, we need to look up the appropriate test statistic from a
DW test table, which you can find here.
To read the table, we need to know the number of regressors
(i.e. independent variables), which is easy (2). We also need to know
the number of observations, which is equal in our case to the number of
rows:
## [1] 38
Now let’s look at the table. Our model has an intercept, so we will use the Savin and White table on page 4. Looking at \(k=2\) and \(n=38\), we get a lower bound value (dL) of 1.176 and an upper bound (dU) of 1.388. Since our test statistic is considerably lower than the lower boundary, we can reject the null hypothesis of no autocorrelation because \(DW<dL\).
There are a few drawbacks to the DW test:
- It may give inconclusive results
- Not applicable when a lagged dependent variable is used
- Can’t account for higher-order autocorrelation
Breusch-Godfrey test
To overcome some of the drawbacks of the DW test, we can also apply the Breusch-Godfrey test. This is a Lagrange Multiplier test, meaning that it operates by estimating a restricted model (under the null hypothesis of no autocorrelation) and then testing whether the Lagrange multipliers associated with the constraints (that autocorrelation coefficients are zero) are statistically significant.
The value of this test in the context of our analysis is that it can test different numbers of orders (i.e. lags), whereas the DW test assumed first-order only. This is very important for our data because we have a quarterly time series, so it makes sense to test as many as 4 lags to catch any yearly patterns in the data. This is basically the theoretical maximum we should test, so we will run the test 4 times and increase the order number each time up to 4:
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: reg1
## LM test = 23.23, df = 1, p-value = 1.437e-06
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: reg1
## LM test = 23.263, df = 2, p-value = 8.883e-06
##
## Breusch-Godfrey test for serial correlation of order up to 3
##
## data: reg1
## LM test = 25.801, df = 3, p-value = 1.05e-05
##
## Breusch-Godfrey test for serial correlation of order up to 4
##
## data: reg1
## LM test = 26.224, df = 4, p-value = 2.851e-05
We can see that we have significant autocorrelation in all four tests based on the low p-value (with a null hypothesis of no autocorrelation). It doesn’t make sense to continue increasing the number of lags beyond the theoretical maximum, so we will conclude that we have to continue by resolving the problem.
Resolving autocorrelation
There are 5 steps to this procedure:
- Step 1: Estimate the regression and obtain residuals (already done)
- Step 2: Estimate \(\rho\) from regressing the residuals to its lagged terms
- Step 3: Transform the original variables using \(\rho\)
- Step 4: Run the regression again with the transformed variables and obtain residuals
- Step 5 and on: Continue repeating steps 2 to 4 for several rounds until (stopping rule) the estimates of from two successive iterations differ by no more than some preselected small value, such as 0.001.
Since this is a repetitive task, we will create a loop that will run
through the cycle until we reach our target differing value. First, we
can create some objects to help us. We will create a new data object for
our regression to preserve the original data structure and name it
reg_current. We will also create an object for \(\rho\) called rho_old and set
it to 0, then set our target differing value to 0.001 with an object
called tol
Now we will create the loop and set it for 20 iterations, though it should take far fewer than that
for(iter in 1:20) {
# The next two lines generate the residuals and rho
resids <- residuals(reg_current)
rho <- cor(resids[-1], resids[-length(resids)])
# This will signal if the process is complete and provide the text summary
if(abs(rho - rho_old) < tol) {
cat("Converged after", iter, "iterations. Rho =", rho, "\n")
break
}
n <- nrow(d) # Calculates new sample size (previous - 1)
lcons_trans <- d$lcons[-1] - rho * d$lcons[-n] # Transform consumption
ldisp_trans <- d$ldisp[-1] - rho * d$ldisp[-n] # Transform disposable income
lprice_trans <- d$lprice[-1] - rho * d$lprice[-n] # Transform price index
# Run new regression with transformed variables
reg_current <- lm(lcons_trans ~ ldisp_trans + lprice_trans)
rho_old <- rho # Replace old rho value with current rho value for next iteration
}## Converged after 8 iterations. Rho = 0.5853132
The loop is complete, and we can see that it ran for 8 iterations and set the precise \(\rho\) value we need for the new regression. Let’s take a look at the new regression results fitted with \(\rho\) and interpret the results:
##
## Call:
## lm(formula = lcons_trans ~ ldisp_trans + lprice_trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04111 -0.01466 -0.00094 0.01615 0.06141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2957 0.3774 0.784 0.43870
## ldisp_trans 1.2459 0.2637 4.725 3.89e-05 ***
## lprice_trans -0.3930 0.1272 -3.090 0.00398 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02573 on 34 degrees of freedom
## Multiple R-squared: 0.3968, Adjusted R-squared: 0.3613
## F-statistic: 11.18 on 2 and 34 DF, p-value: 0.0001853