Advanced Economic Methods
Exercise 3: Heteroskedasticity
Overview
Setup
We need to install three new packages:
havenis used to read Stata files. We will use this format instead of the Excel filessandwichis for modeling robust covariance matrix estimators (useful for correcting heteroskedasticitylmtestperforms a broad range of tasks for linear models
Now create a new R script and set it up with the appropriate header, libraries, and your working drive. The beginning of your script should look something like this:
###########################################################################
# Project: Heteroskedasticity
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(haven)
library(ggplot2)
library(sandwich)
library(lmtest)
# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory pathNow we can set the working drive and load the data. Following the exercise starting on page 139 in the book, we will use data on UK housing prices and characteristics:
Let’s have a quick look at the data to see what we are working with.
## # A tibble: 6 × 3
## price rooms sqfeet
## <dbl> <dbl> <dbl>
## 1 477500 7 3529
## 2 310000 6 1386
## 3 471250 5 2617
## 4 375000 5 2293
## 5 713500 5 3331
## 6 725000 5 3662
We can see that there are three variables to work with:
price: housing price in poundsrooms: number of bedrooms in the housesqfeet: size of the house in square feet
The model
Model statement for first regression
The logical first step is to generate a regression we are interested
in. We will fit a model with price as the dependent
variable, and rooms and sqfeet as the
independent variables. We can write the model statement as:
\[price=\beta_0+\beta_1 rooms + \beta_2 sqfeet + u\]
We will run this regression and generate the summary output using the following code:
##
## Call:
## lm(formula = price ~ rooms + sqfeet, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127627 -42876 -7051 32589 229003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19315.00 31046.62 -0.622 0.536
## rooms 15198.19 9483.52 1.603 0.113
## sqfeet 128.44 13.82 9.291 1.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63040 on 85 degrees of freedom
## Multiple R-squared: 0.6319, Adjusted R-squared: 0.6233
## F-statistic: 72.96 on 2 and 85 DF, p-value: < 2.2e-16
Interpreting the results
The regression examines the relationship between house price (dependent variable) and two explanatory variables: number of rooms and house size (square feet).
- Intercept (-19,315, p = 0.536): Not statistically significant. Interpretation is not meaningful in this context since a house with zero rooms and zero square feet is unrealistic.
- Rooms (15,198, p = 0.113): Suggests that each additional room increases price by about £15,200, holding square footage constant. However, the coefficient is not statistically significant at conventional levels (p > 0.05).
- Square feet (128.44, p < 0.001): Indicates that each additional square foot increases price by about £128, controlling for number of rooms. This effect is highly statistically significant and shows a strong positive relationship.
- \(R^2\) = 0.632 and Adjusted \(R^2\) = 0.623: the model explains about 63% of the variation in house prices.
- Residual standard error = £63,040, indicating the typical deviation of observed prices from predicted values.
- F-statistic = 72.96 (p < 0.001), confirming the model is statistically significant overall.
Summary: House size is a highly significant predictor of price, while the number of rooms does not show a significant effect once house size is taken into account. The model fits the data reasonably well, explaining around two-thirds of the variation in prices.
Visual inspection
The simplest way to get an idea of whether or not heteroskedasticity is an issue in our model is to generate scatter plots to evaluate the relationship between the dependent and independent variables. This method cannot be used as conclusive evidence by itself. Generally speaking, you can probably stop here if it is very clear in the plots that there is nothing to worry about. If it is not clear, we will have to proceed to statistical tests.
Housing price will be our dependent variable, and the number of bedrooms (first plot) and size of the house (second plot) will be the independent variables. It is good practice to keep the dependent variable on the y-axis (as in, \(y=b+mx\)).
Price and number of bedrooms
ggplot(d, aes(x = rooms, y = price)) +
geom_point(shape = 16, size = 2) +
geom_smooth(method = "lm", se = FALSE) + # Adds a regression line
labs(
title = "House price vs number of rooms",
x = "Number of rooms",
y = "Price"
) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
Price and size (in square feet)
ggplot(d, aes(x = sqfeet, y = price)) +
geom_point(shape = 16, size = 2) +
geom_smooth(method = "lm", se = FALSE) + # Adds a regression line
labs(
title = "House price vs house size",
x = "House size (square feet)",
y = "Price"
) +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
Both of the plots above suggest that heteroskedasticity might be present in the data set. We can clearly see that the spread for each number of rooms is uneven. With respect to the size of the home, the plot suggests that there may be a larger spread in prices as house sizes increase, though this is less clear than the former plot. To formally test this assumption, we will proceed to statistical testing.
Breusch-Pagan test
This next line saves a considerable amount of time, as there is a lot going on under the hood. Here is a step-by-step of what it does:
- Extracts the residuals from
reg1and squares them, - Runs an auxiliary regression as \(\hat{u^2_t} = a_0 + a_1 rooms_{t} + a_2 sqfeet_{t} + v\)
- Generates an LM statistic and compares it with a critical \(\chi^2\) value
The bptest() function from the lmtest
package runs this test directly by reporting the LM statistic and the
p-value. If the p-value is small (e.g., below 0.05), we reject the null
hypothesis of homoskedasticity and conclude that heteroskedasticity is
present.
##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 10.576, df = 2, p-value = 0.005051
Solution
White’s corrected standard error estimates
Since we know that heteroskedasticity is present in the data, we can run a correction that will adjust the standard errors and produce more robust results. Following the textbook, we will obtain White’s corrected standard error estimates, which will use our first regression. This will keep the regression coefficients constant and simply adjust the standard errors.
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19314.996 41520.503 -0.4652 0.64298
## rooms 15198.191 8943.735 1.6993 0.09292 .
## sqfeet 128.436 19.591 6.5559 4.076e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that while the regression coefficients are the same in both models, the standard errors differ between the corrected and uncorrected versions.
Confidence intervals
To see how this translates to the performance and accuracy of the
model, we can compare the confidence intervals of the two models. We
calculate these using the confint() command for the first
regression:
## 2.5 % 97.5 %
## (Intercept) -81043.9930 42414.0006
## rooms -3657.5815 34053.9636
## sqfeet 100.9495 155.9229
For the second regression, we just have to change
coeftest() to coefci():
## 2.5 % 97.5 %
## (Intercept) -101868.88058 63238.8881
## rooms -2584.34944 32980.7316
## sqfeet 89.48427 167.3882
To summarize
The comparison of confidence intervals reveals an important insight
about heteroskedasticity correction: the goal is not to uniformly
improve precision, but to obtain accurate standard errors. In our
analysis, the White correction affected each variable differently. For
the rooms coefficient, the correction slightly narrowed the
confidence interval from 37,711.55 to 35,565.08 units wide, suggesting
the original model was overestimating uncertainty for this variable.
Conversely, for the sqfeet coefficient, the correction
widened the confidence interval from 54.97 to 77.90 units wide,
indicating the original model was underestimating uncertainty. This
demonstrates that heteroskedasticity correction provides more reliable
inference by adjusting standard errors to reflect the true variability
in the data, regardless of whether this results in wider or narrower
confidence intervals for individual coefficients.