Advanced Economic Methods
Exercise 6: Working with dummy variables
Overview
The following tutorial will guide you through the use of dummy variables in regression models with an example data set. If you have not done the previous tutorials, go back to the introductory lesson to learn where to get the example data sets and how to set up the appropriate working paths to follow along.
Setup
We do not need to install any new packages for today’s exercise.
Start a new script file, then add your header and the appropriate
libraries. We will use the haven package to load the data,
dplyr for some data organization commands, and
ggplot2 for generating graphical analyses.
###########################################################################
# Project: Dummy variables
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(haven)
library(dplyr)
library(ggplot2)
# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path
Now we can set the working drive and load the data. For this
exercise, we will use the dummies data set:
Before running any regressions, it is important to check the format of each variable we will use. Take a look at the structure of the data:
str(d)
## tibble [935 × 17] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:935] 29 32 30 37 37 30 37 33 36 29 ...
## ..- attr(*, "label")= chr "AGE"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ age1 : num [1:935] 1 0 0 0 0 0 0 0 0 1 ...
## ..- attr(*, "label")= chr "AGE1"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ age2 : num [1:935] 0 1 1 0 0 1 0 1 0 0 ...
## ..- attr(*, "label")= chr "AGE2"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ age3 : num [1:935] 0 0 0 1 1 0 1 0 1 0 ...
## ..- attr(*, "label")= chr "AGE3"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ black : num [1:935] 0 0 0 1 0 0 1 0 1 0 ...
## ..- attr(*, "label")= chr "BLACK"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ educ1 : num [1:935] 0 1 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "EDUC1"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ educ2 : num [1:935] 1 0 0 1 1 1 1 0 1 1 ...
## ..- attr(*, "label")= chr "EDUC2"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ educ3 : num [1:935] 0 0 0 0 0 0 0 1 0 0 ...
## ..- attr(*, "label")= chr "EDUC3"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ educ4 : num [1:935] 0 0 1 0 0 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "EDUC4"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ exper : num [1:935] 9 11 7 21 18 9 16 10 16 11 ...
## ..- attr(*, "label")= chr "EXPER"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ iq : num [1:935] 108 103 96 75 101 94 83 104 67 113 ...
## ..- attr(*, "label")= chr "IQ"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ male : num [1:935] 0 0 0 0 1 0 0 0 0 0 ...
## ..- attr(*, "label")= chr "MALE"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ married: num [1:935] 1 1 1 1 1 1 1 0 1 1 ...
## ..- attr(*, "label")= chr "MARRIED"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ meduc : num [1:935] 10 12 12 NA NA 5 3 14 7 10 ...
## ..- attr(*, "label")= chr "MEDUC"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ south : num [1:935] 0 1 0 1 0 0 1 0 1 1 ...
## ..- attr(*, "label")= chr "SOUTH"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ urban : num [1:935] 1 1 0 0 1 1 1 0 1 0 ...
## ..- attr(*, "label")= chr "URBAN"
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ wage : num [1:935] 115 200 233 260 265 289 300 310 318 325 ...
## ..- attr(*, "label")= chr "WAGE"
## ..- attr(*, "format.stata")= chr "%8.0g"
We can see that all variables say num next to them,
meaning they are stored as numbers. Since a dummy variable labels group
membership rather than measuring a quantity, we recode every variable that
takes only the values 0 and 1 as a factor variable. This can be done individually like this:
and we can see that the first dummy variable in the list, age1, has been changed to a factor:
## Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
Now it is correctly formatted as a factor variable with 2 levels (0 and 1). To save time and space, we will transform all variables that should be dummies in one go:
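As a minimal sketch of both steps, the snippet below uses base R on a small made-up data frame. The names toy and dummy_vars are hypothetical, and the values are not taken from the exercise data. With the real data, the same pattern would apply to d and the full list of 0/1 columns.

```r
# Toy data frame standing in for the exercise data (values are made up)
toy <- data.frame(
  wage    = c(115, 200, 233, 260),
  iq      = c(108, 103, 96, 75),
  male    = c(0, 0, 1, 0),
  married = c(1, 1, 1, 0)
)

# Convert a single 0/1 column to a factor and inspect it
toy$male <- as.factor(toy$male)
str(toy$male)

# Convert several 0/1 columns in one go
dummy_vars <- c("male", "married")  # hypothetical list; extend as needed
toy[dummy_vars] <- lapply(toy[dummy_vars], as.factor)

# Continuous variables (wage, iq) are left as numeric
sapply(toy, class)
```

With dplyr loaded, `mutate(toy, across(all_of(dummy_vars), as.factor))` should give the same result.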
Regressions with a single dummy variable
Simple regression
Let’s start by regressing wage on IQ score. Both variables are continuous, and the equation will look like this:
\[ wage = \beta_0 + \beta_1 iq + u\]
reg1 <- lm(wage ~ iq, data = d)
summary(reg1)
##
## Call:
## lm(formula = wage ~ iq, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -898.7 -256.5 -47.3 201.1 2072.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.9916 85.6415 1.366 0.172
## iq 8.3031 0.8364 9.927 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 384.8 on 933 degrees of freedom
## Multiple R-squared: 0.09554, Adjusted R-squared: 0.09457
## F-statistic: 98.55 on 1 and 933 DF, p-value: < 2.2e-16
Interpretation: The regression looks fairly good. We have a strong, statistically significant relationship between wage and IQ, where a 1-point increase in IQ score corresponds to an increase in wage of about $8.30. However, the \(R^2\) of this model is low, with IQ score explaining less than 10% of the variation in wage. If we plot this regression, we will see that most of the observations are not very close to the trend line:
ggplot(d, aes(x = iq, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "green") +
  ggtitle("Univariate regression, wage x IQ")
## `geom_smooth()` using formula = 'y ~ x'
Dummy variable for gender (intercept only)
Since it is clear that we should try improving the model, let’s
introduce a dummy variable for the gender of the respondent. The
variable is male, where \(male=0\) means that the respondent is
female, and \(male=1\) means that the
respondent is male. The model will now look like this:
\[ wage = \beta_0 + \beta_1 iq + \beta_2 male(0/1) + u\]
I added the \((0/1)\) to
specifically denote that the variable is a dummy and takes on only
values of 0 and 1. We will run this regression two different ways. In
reg2, the dummy variable will only change the intercept
between the two gender groups and keep the slope constant:
reg2 <- lm(wage ~ iq + male, data = d)
summary(reg2)
##
## Call:
## lm(formula = wage ~ iq + male, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -970.63 -186.55 -20.81 137.15 1811.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 224.8438 66.6424 3.374 0.000772 ***
## iq 5.0766 0.6624 7.665 4.5e-14 ***
## male1 498.0493 20.0768 24.807 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 298.8 on 932 degrees of freedom
## Multiple R-squared: 0.4552, Adjusted R-squared: 0.4541
## F-statistic: 389.4 on 2 and 932 DF, p-value: < 2.2e-16
This regression looks much better than reg1. Both of the
independent variables are statistically significant and the \(R^2\) has jumped from less than 10% to more
than 45%. We can also see that the relationship between iq
and wage has changed: a 1-point
increase in IQ score now corresponds to about a $5.08 increase in wage.
However, the inclusion of the dummy variable is the critical part. The
model tells us that on average, a male respondent earns almost $500 more
than a female respondent across all IQ scores (different intercept, same
slope). We can plot this as:
ggplot(d, aes(x = iq, y = wage, color = factor(male))) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = coef(reg2)[1],
              slope = coef(reg2)[2],
              color = "red") +
  geom_abline(intercept = coef(reg2)[1] + coef(reg2)[3],
              slope = coef(reg2)[2],
              color = "blue") +
  scale_color_manual(values = c("red", "blue"),
                     labels = c("Female", "Male"),
                     name = "Gender") +
  ggtitle("Wage vs IQ by gender (same slope)") +
  labs(x = "IQ", y = "Wage") +
  theme_minimal(base_size = 14)
Dummy variable for gender (intercept and slope)
Let’s now assume that the slope is not constant, and that the gender
pay gap may change based on IQ score. We can create a new regression
that introduces an interaction term between iq and
male, which is modeled as:
\[ wage = \beta_0 + \beta_1 iq + \beta_2 male(0/1) + \beta_3 (iq \times male) + u\]
reg3 <- lm(wage ~ iq * male, data = d)
summary(reg3)
##
## Call:
## lm(formula = wage ~ iq * male, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -963.17 -183.58 -14.44 142.16 1806.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 357.8567 84.7894 4.221 2.68e-05 ***
## iq 3.7285 0.8492 4.391 1.26e-05 ***
## male1 149.1039 139.6018 1.068 0.2858
## iq:male1 3.4121 1.3510 2.526 0.0117 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 297.9 on 931 degrees of freedom
## Multiple R-squared: 0.4589, Adjusted R-squared: 0.4572
## F-statistic: 263.2 on 3 and 931 DF, p-value: < 2.2e-16
This regression reveals a more nuanced picture of the gender wage
gap. The interaction term between iq and male
is statistically significant (p = 0.012), indicating that the
relationship between IQ and wages differs by gender. For female
respondents, each 1-point increase in IQ corresponds to a $3.73 increase
in wage, while for male respondents, each 1-point increase in IQ
corresponds to a $7.14 increase in wage (3.73 + 3.41). This means that
the gender wage gap widens as IQ increases. The coefficient on male
is no longer statistically significant (p = 0.286), but in this model it
measures the wage gap only at an IQ of 0, a value far outside the
observed data, so it says little on its own. Because of the interaction
effect, the gender wage gap is substantial at IQ levels that actually
occur in the sample. The \(R^2\) has
improved slightly, from 45.52% to 45.89%, indicating that allowing the slope to vary
by gender provides a marginally better fit to the data than the parallel
slopes model in reg2.
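To make the widening gap concrete, the sketch below plugs the two male-related coefficients from the output above into the implied gap formula, gap = male1 + (iq:male1) × iq. The function name wage_gap and the IQ values are illustrative choices, not part of the original exercise.

```r
# Coefficients copied from the interaction regression output
b_male    <- 149.1039  # male1
b_iq_male <- 3.4121    # iq:male1

# Implied male-female wage gap at a given IQ score
wage_gap <- function(iq) b_male + b_iq_male * iq

# Illustrative IQ values; 0 is included only to show what male1 measures
iq_levels <- c(0, 70, 100, 130)
data.frame(iq = iq_levels, gap = round(wage_gap(iq_levels), 2))
```

At an IQ of 100, for example, the implied gap is about $490 (149.10 + 3.41 × 100), which is close to the constant gap estimated in reg2.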
We can plot these results like this:
ggplot(d, aes(x = iq, y = wage, color = factor(male))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_color_manual(values = c("red", "blue"),
                     labels = c("Female", "Male"),
                     name = "Gender") +
  ggtitle("Wage vs IQ by Gender") +
  labs(x = "IQ", y = "Wage") +
  theme_minimal(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'
Regressions with multiple dummy variables
Regressing wage with education and experience
It is quite common to have multiple dummy variables describing different groupings. In this case, we have dummy variables for different levels of education. We can define them as:
- educ1: Individual does not have a high school diploma
- educ2: Individual has a high school diploma
- educ3: Individual has a bachelor’s degree
- educ4: Individual has an advanced degree (e.g. master’s or PhD)
When running this type of model, we have to be careful about the dummy variable trap. This occurs when a model includes all possible categories of a dummy variable plus the constant term, creating perfect multicollinearity.
If we include all four education dummies alongside the intercept,
they sum to 1 for every observation, making it mathematically impossible
to estimate unique coefficients. To avoid this trap, we must always omit
one category to serve as the reference group. This
omitted category becomes the baseline for comparisons, while the
included dummies measure differences relative to that baseline, ensuring
the model remains statistically identifiable and interpretable. We will
omit educ1 and compare our results to the group of
individuals that do not have a high school degree.
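The trap can be seen directly on simulated data. The sketch below builds four hypothetical education dummies (the data are generated, not taken from the exercise file), fits one model with all four plus the intercept, and one with educ1 omitted. In the first fit, R cannot estimate one of the coefficients and reports it as NA.

```r
set.seed(1)  # reproducible simulated data
n <- 200
level <- sample(1:4, n, replace = TRUE)  # hypothetical education level, 1 to 4

sim <- data.frame(
  educ1 = as.numeric(level == 1),
  educ2 = as.numeric(level == 2),
  educ3 = as.numeric(level == 3),
  educ4 = as.numeric(level == 4)
)
sim$wage <- 500 + 100 * level + rnorm(n, sd = 50)

# The four dummies sum to 1 in every row, duplicating the intercept column
all(rowSums(sim[, c("educ1", "educ2", "educ3", "educ4")]) == 1)

# Trap: intercept plus all four dummies, so one coefficient comes back NA
trap <- lm(wage ~ educ1 + educ2 + educ3 + educ4, data = sim)
coef(trap)

# Fix: omit educ1 so that it becomes the reference group
ok <- lm(wage ~ educ2 + educ3 + educ4, data = sim)
coef(ok)
```

R drops the redundant column silently rather than stopping with an error, so it is worth checking the coefficient table for NA values whenever a full set of dummies is involved.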
We also include a variable measuring work experience
exper (in years) as a continuous variable because it
captures a separate and complementary influence on wages beyond
education. While the education dummies account for differences in formal
schooling levels, experience measures the accumulation of skills and
knowledge over time. Including both allows the model to distinguish how
education and on-the-job experience jointly contribute to wage
differences. Our model is as follows:
\[ wage = \beta_0+\beta_1 exper+\beta_2 educ2(0/1)+\beta_3 educ3(0/1)+ \beta_4 educ4(0/1) + u\]
reg4 <- lm(wage ~ exper + educ2 + educ3 + educ4, data = d)
summary(reg4)
##
## Call:
## lm(formula = wage ~ exper + educ2 + educ3 + educ4, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -872.91 -260.73 -40.68 192.71 2064.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.379 63.032 8.081 1.98e-15 ***
## exper 17.591 3.217 5.468 5.85e-08 ***
## educ21 122.008 45.038 2.709 0.00687 **
## educ31 310.798 50.845 6.113 1.44e-09 ***
## educ41 473.388 50.691 9.339 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 378.3 on 930 degrees of freedom
## Multiple R-squared: 0.1284, Adjusted R-squared: 0.1246
## F-statistic: 34.24 on 4 and 930 DF, p-value: < 2.2e-16
We can see that all the regression coefficients are statistically significant. The findings of this model can be summarized as:
- On average, every year of experience corresponds to a wage increase of $17.59. This is constant across all education groups since the model assumes differing intercepts and a common slope.
- On average, individuals with a high school diploma earn about $122 more than those without a diploma, holding experience constant.
- On average, individuals with a bachelor’s degree earn about $311 more than those without a high school diploma.
- On average, individuals with an advanced degree (master’s or PhD) earn about $473 more than those without a high school diploma.
The adjusted \(R^2\) suggests that our model only accounts for about 12.5% of the observed variation in wages. While this is considerably lower than the models that include gender (reg2 and reg3), the model is still useful for finding meaningful relationships between wages, experience, and education.
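To see how the intercept shifts translate into predicted wages, the sketch below plugs the coefficients from the output above into the model equation. The helper predict_wage and the choice of 10 years of experience are illustrative assumptions, not part of the original exercise.

```r
# Coefficients copied from the regression output above
b <- c(intercept = 509.379, exper = 17.591,
       educ2 = 122.008, educ3 = 310.798, educ4 = 473.388)

# Predicted wage for a given experience level and education group;
# the reference group (educ1) gets no education shift
predict_wage <- function(exper, educ = c("educ1", "educ2", "educ3", "educ4")) {
  educ <- match.arg(educ)
  shift <- if (educ == "educ1") 0 else b[[educ]]
  b[["intercept"]] + b[["exper"]] * exper + shift
}

# Example: 10 years of experience (an arbitrary illustrative value)
sapply(c("educ1", "educ2", "educ3", "educ4"),
       function(e) predict_wage(10, e))
```

The differences between groups equal the dummy coefficients at every experience level, which is what the parallel lines in the plot below show.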
# First create a single education variable
d$educ_cat <- ifelse(d$educ1 == 1, 1,
              ifelse(d$educ2 == 1, 2,
              ifelse(d$educ3 == 1, 3, 4)))
# Then plot
ggplot(d, aes(x = exper, y = wage, color = factor(educ_cat))) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = coef(reg4)[1],
              slope = coef(reg4)[2],
              color = "red") +
  geom_abline(intercept = coef(reg4)[1] + coef(reg4)[3],
              slope = coef(reg4)[2],
              color = "blue") +
  geom_abline(intercept = coef(reg4)[1] + coef(reg4)[4],
              slope = coef(reg4)[2],
              color = "green") +
  geom_abline(intercept = coef(reg4)[1] + coef(reg4)[5],
              slope = coef(reg4)[2],
              color = "purple") +
  scale_color_manual(values = c("red", "blue", "green", "purple"),
                     labels = c("No diploma (educ1)",
                                "HS diploma (educ2)",
                                "Bach degree (educ3)",
                                "Adv degree (educ4)"),
                     name = "Education Level") +
  ggtitle("Wage vs experience by education level (parallel slopes)") +
  labs(x = "Experience (years)", y = "Wage") +
  theme_minimal(base_size = 14)
Tons of dummy variables
The above example provided adequate information on the relationship between wages, work experience, and education. However, the adjusted \(R^2\) was still rather low since only 12.5% of the total variation in wage was explained by education and experience.
Let’s see if we can improve that by adding all the
dummy variables in the data set. This includes location variables
urban and south, and whether the respondent is
married (1 = married, 0 = single), black (1 =
Black, 0 = any other race), or male (1 = male, 0 = female).
The model statement will now look like this:
\[ \begin{aligned} wage = {} & \beta_0+\beta_1 exper+\beta_2 educ2(0/1) + \beta_3 educ3(0/1) \\ & + \beta_4 educ4(0/1) + \beta_5 urban(0/1) + \beta_6 south(0/1) \\ & + \beta_7 married(0/1) + \beta_8 black(0/1) + \beta_9 male(0/1) + u \end{aligned} \]
reg5 <- lm(wage ~ exper + educ2 + educ3 + educ4 + urban + south +
             married + black + male, data = d)
summary(reg5)
##
## Call:
## lm(formula = wage ~ exper + educ2 + educ3 + educ4 + urban + south +
## married + black + male, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1004.66 -175.14 -24.11 138.64 1804.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 382.799 58.175 6.580 7.86e-11 ***
## exper 9.449 2.492 3.792 0.000159 ***
## educ21 35.040 34.778 1.008 0.313945
## educ31 140.245 39.632 3.539 0.000422 ***
## educ41 233.167 40.235 5.795 9.36e-09 ***
## urban1 88.242 21.665 4.073 5.04e-05 ***
## south1 -44.488 20.813 -2.138 0.032814 *
## married1 137.236 30.978 4.430 1.05e-05 ***
## black1 -99.382 30.009 -3.312 0.000963 ***
## male1 456.265 20.222 22.562 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 289.7 on 925 degrees of freedom
## Multiple R-squared: 0.4918, Adjusted R-squared: 0.4868
## F-statistic: 99.45 on 9 and 925 DF, p-value: < 2.2e-16
We can see that the \(R^2\) has improved dramatically with the additional dummy variables. Whereas the previous model explained only about 12.5% of the total variation in wage, the new model explains almost 49%. This indicates that adding location, marital status, race, and gender substantially improves the model’s ability to account for differences in wages.
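Because adding regressors can never lower the plain \(R^2\), the adjusted \(R^2\) is the fairer number for comparing models with different numbers of dummies. As a rough check, the sketch below recomputes it from the reported \(R^2\), the sample size (935), and the number of regressors, using the standard formula. The function name adj_r2 is a hypothetical helper, and small differences in the last digit come from rounding in the printed \(R^2\).

```r
# Adjusted R-squared from R-squared, sample size n, and k regressors
# (k excludes the intercept)
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

n <- 935  # observations in the data set

# Experience plus three education dummies (k = 4); reported value: 0.1246
adj_r2(0.1284, n, 4)

# Full model with nine regressors (k = 9); reported value: 0.4868
adj_r2(0.4918, n, 9)
```

The penalty for the five extra dummies is small here, so the improvement in fit is not simply a by-product of adding more variables.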
Looking at the coefficients, experience has a smaller effect on wages
than in the previous model: wages now increase by about $9.45 per year
of experience on average, down from $17.59. The education coefficients keep
the same ordering as in the previous model but are roughly half the size or
less, and educ2 is no longer
statistically significant. This finding suggests that the average wage
difference between those with a high school diploma and those without one
cannot be distinguished from zero once all other factors are held
constant.
Location and demographics also matter. Living in an urban area is associated with wages about $88 higher, whereas living in the South is associated with wages about $44 lower. Married respondents earn approximately $137 more than single respondents, while being Black is associated with a reduction of about $99. Gender shows the largest difference, with males earning roughly $456 more than females, all else equal.