Advanced Economic Methods

Exercise 3: Heteroskedasticity

Overview

Setup

We need to install three new packages:

haven is used to read Stata files. We will use this format instead of the Excel files
sandwich is for modeling robust covariance matrix estimators (useful for correcting heteroskedasticity
lmtest performs a broad range of tasks for linear models

install.packages(c("haven", "sandwich", "lmtest"))

Now create a new R script and set it up with the appropriate header, libraries, and your working drive. The beginning of your script should look something like this:

###########################################################################
# Project: Heteroskedasticity
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clear environment and load libraries
rm(list=ls())
library(haven) 
library(ggplot2)
library(sandwich)
library(lmtest)

# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path

Now we can set the working drive and load the data. Following the exercise starting on page 139 in the book, we will use data on UK housing prices and characteristics:

# Load data
d <- read_dta("dat/houseprice.dta")

Let’s have a quick look at the data to see what we are working with.

head(d)

## # A tibble: 6 × 3
##    price rooms sqfeet
##    <dbl> <dbl>  <dbl>
## 1 477500     7   3529
## 2 310000     6   1386
## 3 471250     5   2617
## 4 375000     5   2293
## 5 713500     5   3331
## 6 725000     5   3662

We can see that there are three variables to work with:

price: housing price in pounds
rooms: number of bedrooms in the house
sqfeet: size of the house in square feet

The model

Model statement for first regression

The logical first step is to generate a regression we are interested in. We will fit a model with price as the dependent variable, and rooms and sqfeet as the independent variables. We can write the model statement as:

\[price=\beta_0+\beta_1 rooms + \beta_2 sqfeet + u\]

We will run this regression and generate the summary output using the following code:

reg1 <- lm(price ~ rooms + sqfeet, data = d)

summary(reg1)

## 
## Call:
## lm(formula = price ~ rooms + sqfeet, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -127627  -42876   -7051   32589  229003 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19315.00   31046.62  -0.622    0.536    
## rooms        15198.19    9483.52   1.603    0.113    
## sqfeet         128.44      13.82   9.291 1.39e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63040 on 85 degrees of freedom
## Multiple R-squared:  0.6319, Adjusted R-squared:  0.6233 
## F-statistic: 72.96 on 2 and 85 DF,  p-value: < 2.2e-16

Interpreting the results

The regression examines the relationship between house price (dependent variable) and two explanatory variables: number of rooms and house size (square feet).

Intercept (-19,315, p = 0.536): Not statistically significant. Interpretation is not meaningful in this context since a house with zero rooms and zero square feet is unrealistic.
Rooms (15,198, p = 0.113): Suggests that each additional room increases price by about £15,200, holding square footage constant. However, the coefficient is not statistically significant at conventional levels (p > 0.05).
Square feet (128.44, p < 0.001): Indicates that each additional square foot increases price by about £128, controlling for number of rooms. This effect is highly statistically significant and shows a strong positive relationship.
\(R^2\) = 0.632 and Adjusted \(R^2\) = 0.623: the model explains about 63% of the variation in house prices.
Residual standard error = £63,040, indicating the typical deviation of observed prices from predicted values.
F-statistic = 72.96 (p < 0.001), confirming the model is statistically significant overall.

Summary: House size is a highly significant predictor of price, while the number of rooms does not show a significant effect once house size is taken into account. The model fits the data reasonably well, explaining around two-thirds of the variation in prices.

Visual inspection

The simplest way to get an idea of whether or not heteroskedasticity is an issue in our model is to generate scatter plots to evaluate the relationship between the dependent and independent variables. This method cannot be used as conclusive evidence by itself. Generally speaking, you can probably stop here if it is very clear in the plots that there is nothing to worry about. If it is not clear, we will have to proceed to statistical tests.

Housing price will be our dependent variable, and the number of bedrooms (first plot) and size of the house (second plot) will be the independent variables. It is good practice to keep the dependent variable on the y-axis (as in, \(y=b+mx\)).

Price and number of bedrooms

ggplot(d, aes(x = rooms, y = price)) +
  geom_point(shape = 16, size = 2) +                         
  geom_smooth(method = "lm", se = FALSE) + # Adds a regression line
  labs(
    title = "House price vs number of rooms",
    x = "Number of rooms",
    y = "Price"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Price and size (in square feet)

ggplot(d, aes(x = sqfeet, y = price)) +
  geom_point(shape = 16, size = 2) +                         
  geom_smooth(method = "lm", se = FALSE) + # Adds a regression line
  labs(
    title = "House price vs house size",
    x = "House size (square feet)",
    y = "Price"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Both of the plots above suggest that heteroskedasticity might be present in the data set. We can clearly see that the spread for each number of rooms is uneven. With respect to the size of the home, the plot suggests that there may be a larger spread in prices as house sizes increase, though this is less clear than the former plot. To formally test this assumption, we will proceed to statistical testing.

Breusch-Pagan test

This next line saves a considerable amount of time, as there is a lot going on under the hood. Here is a step-by-step of what it does:

Extracts the residuals from reg1 and squares them,
Runs an auxiliary regression as \(\hat{u^2_t} = a_0 + a_1 rooms_{t} + a_2 sqfeet_{t} + v\)
Generates an LM statistic and compares it with a critical \(\chi^2\) value

The bptest() function from the lmtest package runs this test directly by reporting the LM statistic and the p-value. If the p-value is small (e.g., below 0.05), we reject the null hypothesis of homoskedasticity and conclude that heteroskedasticity is present.

bptest(reg1)

## 
##  studentized Breusch-Pagan test
## 
## data:  reg1
## BP = 10.576, df = 2, p-value = 0.005051

Solution

White’s corrected standard error estimates

Since we know that heteroskedasticity is present in the data, we can run a correction that will adjust the standard errors and produce more robust results. Following the textbook, we will obtain White’s corrected standard error estimates, which will use our first regression. This will keep the regression coefficients constant and simply adjust the standard errors.

coeftest(reg1, vcov = vcovHC(reg1, type = "HC1"))

## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -19314.996  41520.503 -0.4652   0.64298    
## rooms        15198.191   8943.735  1.6993   0.09292 .  
## sqfeet         128.436     19.591  6.5559 4.076e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that while the regression coefficients are the same in both models, the standard errors differ between the corrected and uncorrected versions.

Confidence intervals

To see how this translates to the performance and accuracy of the model, we can compare the confidence intervals of the two models. We calculate these using the confint() command for the first regression:

confint(reg1)

##                   2.5 %     97.5 %
## (Intercept) -81043.9930 42414.0006
## rooms        -3657.5815 34053.9636
## sqfeet         100.9495   155.9229

For the second regression, we just have to change coeftest() to coefci():

coefci(reg1, vcov. = vcovHC(reg1, type = "HC1"))

##                     2.5 %     97.5 %
## (Intercept) -101868.88058 63238.8881
## rooms         -2584.34944 32980.7316
## sqfeet           89.48427   167.3882

To summarize

The comparison of confidence intervals reveals an important insight about heteroskedasticity correction: the goal is not to uniformly improve precision, but to obtain accurate standard errors. In our analysis, the White correction affected each variable differently. For the rooms coefficient, the correction slightly narrowed the confidence interval from 37,711.55 to 35,565.08 units wide, suggesting the original model was overestimating uncertainty for this variable. Conversely, for the sqfeet coefficient, the correction widened the confidence interval from 54.97 to 77.90 units wide, indicating the original model was underestimating uncertainty. This demonstrates that heteroskedasticity correction provides more reliable inference by adjusting standard errors to reflect the true variability in the data, regardless of whether this results in wider or narrower confidence intervals for individual coefficients.