Advanced Economic Methods

Exercise 6: Working with dummy variables

Overview

The following tutorial will guide you through the use of dummy variables in regression models with an example data set. If you have not done the previous tutorials, go back to the introductory lesson to learn where to get the example data sets and how to set up the appropriate working paths to follow along.

Setup

We do not need to install any new packages for today’s exercise. Start a new script file, then add your header and the appropriate libraries. We will use the haven package to load the data, dplyr for some data organization commands, and ggplot2 for generating graphical analyses.

###########################################################################
# Project: Dummy variables
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clear environment and load libraries
rm(list=ls())
library(haven) 
library(dplyr)
library(ggplot2)

# Set working directory
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path

With the working directory set, we can load the data. For this exercise, we will use the dummies data set:

# Load data
d <- read_dta("dat/dummies.dta")

Before the analysis, we should check the format of each variable we will use in our regressions. Take a look at the structure of the data:

str(d)

## tibble [935 × 17] (S3: tbl_df/tbl/data.frame)
##  $ age    : num [1:935] 29 32 30 37 37 30 37 33 36 29 ...
##   ..- attr(*, "label")= chr "AGE"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ age1   : num [1:935] 1 0 0 0 0 0 0 0 0 1 ...
##   ..- attr(*, "label")= chr "AGE1"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ age2   : num [1:935] 0 1 1 0 0 1 0 1 0 0 ...
##   ..- attr(*, "label")= chr "AGE2"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ age3   : num [1:935] 0 0 0 1 1 0 1 0 1 0 ...
##   ..- attr(*, "label")= chr "AGE3"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ black  : num [1:935] 0 0 0 1 0 0 1 0 1 0 ...
##   ..- attr(*, "label")= chr "BLACK"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ educ1  : num [1:935] 0 1 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "EDUC1"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ educ2  : num [1:935] 1 0 0 1 1 1 1 0 1 1 ...
##   ..- attr(*, "label")= chr "EDUC2"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ educ3  : num [1:935] 0 0 0 0 0 0 0 1 0 0 ...
##   ..- attr(*, "label")= chr "EDUC3"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ educ4  : num [1:935] 0 0 1 0 0 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "EDUC4"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ exper  : num [1:935] 9 11 7 21 18 9 16 10 16 11 ...
##   ..- attr(*, "label")= chr "EXPER"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ iq     : num [1:935] 108 103 96 75 101 94 83 104 67 113 ...
##   ..- attr(*, "label")= chr "IQ"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ male   : num [1:935] 0 0 0 0 1 0 0 0 0 0 ...
##   ..- attr(*, "label")= chr "MALE"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ married: num [1:935] 1 1 1 1 1 1 1 0 1 1 ...
##   ..- attr(*, "label")= chr "MARRIED"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ meduc  : num [1:935] 10 12 12 NA NA 5 3 14 7 10 ...
##   ..- attr(*, "label")= chr "MEDUC"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ south  : num [1:935] 0 1 0 1 0 0 1 0 1 1 ...
##   ..- attr(*, "label")= chr "SOUTH"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ urban  : num [1:935] 1 1 0 0 1 1 1 0 1 0 ...
##   ..- attr(*, "label")= chr "URBAN"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ wage   : num [1:935] 115 200 233 260 265 289 300 310 318 325 ...
##   ..- attr(*, "label")= chr "WAGE"
##   ..- attr(*, "format.stata")= chr "%8.0g"

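We can see that every variable is marked num, meaning it is stored as a number. Dummy variables are category labels rather than quantities, so we will recode every variable that takes only the values 0 and 1 as a factor. lm() would return the same estimates with numeric 0/1 columns, but factors make the categorical nature of these variables explicit. This can be done individually like this: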

d$age1 <- as.factor(d$age1)

and we can see that age1 has been changed to a factor:

str(d$age1)

##  Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...

Now it is correctly formatted as a factor variable with 2 levels (0 and 1). To save time and space, we will transform all variables that should be dummies in one go:

d <- d %>%
  mutate(across(c(age2, age3, black, educ1, educ2, educ3, educ4, 
                  male, married, south, urban), as.factor))
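Before converting columns in bulk, it can help to confirm which ones really contain only 0 and 1. The following is a minimal sketch in base R on a made-up toy data frame (the values and the names toy, is_dummy, and dummy_cols are illustrative assumptions, not part of the exercise data):

```r
# Toy data frame with made-up values, standing in for d
toy <- data.frame(
  wage  = c(115, 200, 233, 260),
  male  = c(0, 0, 1, 0),
  urban = c(1, 1, 0, NA)
)

# TRUE for columns whose values are all 0, 1, or missing
is_dummy <- sapply(toy, function(x) all(x %in% c(0, 1, NA)))
dummy_cols <- names(toy)[is_dummy]
dummy_cols

# Convert only those columns to factors
toy[dummy_cols] <- lapply(toy[dummy_cols], as.factor)
str(toy)
```

The same check could be run on d to make sure no 0/1 column was left out of the across() call.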

Regressions with a single dummy variable

Simple regression

Let’s start by regressing wage on IQ score. Both variables are continuous, and the equation will look like this:

\[ wage = \beta_0 + \beta_1 iq + u\]

reg1 <- lm(wage ~ iq, data = d)
summary(reg1)

## 
## Call:
## lm(formula = wage ~ iq, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -898.7 -256.5  -47.3  201.1 2072.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.9916    85.6415   1.366    0.172    
## iq            8.3031     0.8364   9.927   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 384.8 on 933 degrees of freedom
## Multiple R-squared:  0.09554,    Adjusted R-squared:  0.09457 
## F-statistic: 98.55 on 1 and 933 DF,  p-value: < 2.2e-16

Interpretation: The regression shows a strong, statistically significant relationship between wage and IQ, where an increase in IQ score by 1 point corresponds to an increase in wage of about $8.30. However, the $R^2$ is low: IQ score explains less than 10% of the variation in wage. If we plot this regression, we will see that most of the observations are not very close to the fitted line:

ggplot(d, aes(x = iq, y = wage)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = TRUE, color = "green") + 
  ggtitle("Univariate regression, wage x IQ")

## `geom_smooth()` using formula = 'y ~ x'
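As a quick worked example of reading the reg1 estimates, the predicted wage for an IQ score of 100 (a value chosen here purely for illustration) is:

\[ \widehat{wage} = 116.9916 + 8.3031 \times 100 \approx 947.30 \]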

Dummy variable for gender (intercept only)

Since it is clear that we should try improving the model, let’s introduce a dummy variable for the gender of the respondent. The variable is male, where $male=0$ means that the respondent is female, and $male=1$ means that the respondent is male. The model will now look like this:

\[ wage = \beta_0 + \beta_1 iq + \beta_2 male(0/1) + u\]

I added the $(0/1)$ to specifically denote that the variable is a dummy and takes on only values of 0 and 1. We will run this regression two different ways. In reg2, the dummy variable will only change the intercept between the two gender groups and keep the slope constant:

reg2 <- lm(wage ~ iq + male, data = d)
summary(reg2)

## 
## Call:
## lm(formula = wage ~ iq + male, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -970.63 -186.55  -20.81  137.15 1811.91 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 224.8438    66.6424   3.374 0.000772 ***
## iq            5.0766     0.6624   7.665  4.5e-14 ***
## male1       498.0493    20.0768  24.807  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 298.8 on 932 degrees of freedom
## Multiple R-squared:  0.4552, Adjusted R-squared:  0.4541 
## F-statistic: 389.4 on 2 and 932 DF,  p-value: < 2.2e-16

This regression looks much better than reg1. Both of the independent variables are statistically significant and the $R^2$ has jumped from less than 10% to more than 45%. The coefficient on iq has also shrunk: a 1-point increase in IQ score now corresponds to about a $5.08 increase in wage. However, the inclusion of the dummy variable is the critical part. The model tells us that on average, a male respondent earns almost $500 more than a female respondent across all IQ scores (different intercept, same slope). We can plot this as:

ggplot(d, aes(x = iq, y = wage, color = factor(male))) + 
  geom_point(alpha = 0.6) + 
  geom_abline(intercept = coef(reg2)[1], 
              slope = coef(reg2)[2], 
              color = "red") +
  geom_abline(intercept = coef(reg2)[1] + coef(reg2)[3], 
              slope = coef(reg2)[2], 
              color = "blue") +
  scale_color_manual(values = c("red", "blue"),
                     labels = c("Female", "Male"),
                     name = "Gender") +
  ggtitle("Wage vs IQ by gender (same slope)") +
  labs(x = "IQ", y = "Wage") +
  theme_minimal(base_size = 14)
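The two lines drawn with geom_abline() follow directly from the reg2 estimates. Setting $male=0$ and $male=1$ in turn gives one fitted line per group, written here with rounded coefficients:

\[ \text{Female: } \widehat{wage} = 224.84 + 5.08\, iq \]

\[ \text{Male: } \widehat{wage} = (224.84 + 498.05) + 5.08\, iq = 722.89 + 5.08\, iq \]

The slopes are identical, so the vertical distance between the lines is the male coefficient at every IQ score.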

Dummy variable for gender (intercept and slope)

Let’s now assume that the slope is not constant, and that the gender pay gap may change based on IQ score. We can create a new regression that introduces an interaction term between iq and male, which is modeled as:

\[ wage = \beta_0 + \beta_1 iq + \beta_2 male(0/1) + \beta_3 (iq \times male) + u\]

reg3 <- lm(wage ~ iq * male, data = d)
summary(reg3)

## 
## Call:
## lm(formula = wage ~ iq * male, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -963.17 -183.58  -14.44  142.16 1806.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 357.8567    84.7894   4.221 2.68e-05 ***
## iq            3.7285     0.8492   4.391 1.26e-05 ***
## male1       149.1039   139.6018   1.068   0.2858    
## iq:male1      3.4121     1.3510   2.526   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 297.9 on 931 degrees of freedom
## Multiple R-squared:  0.4589, Adjusted R-squared:  0.4572 
## F-statistic: 263.2 on 3 and 931 DF,  p-value: < 2.2e-16

This regression reveals a more nuanced picture of the gender wage gap. The interaction term between iq and male is statistically significant (p = 0.012), indicating that the relationship between IQ and wages differs by gender. For female respondents, each 1-point increase in IQ corresponds to a $3.73 increase in wage, while for male respondents it corresponds to a $7.14 increase (3.73 + 3.41). This means that the gender wage gap widens as IQ increases. The main effect of male is no longer statistically significant (p = 0.286), but in this model that coefficient measures the gap only at iq = 0, which is far outside the range of plausible IQ scores, so it says little about the gap for actual respondents. At IQ levels closer to the sample mean, the interaction term makes the predicted gender wage gap substantial. The $R^2$ has improved slightly to 45.89%, indicating that allowing the slope to vary by gender provides a marginally better fit to the data than the parallel slopes model in reg2.
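Because reg3 lets the slope differ by gender, the predicted male-minus-female gap depends on the IQ score. A minimal sketch of that calculation, with the coefficients copied by hand from the summary(reg3) output (wage_gap is a hypothetical helper name, and the IQ values are chosen only for illustration):

```r
# Coefficients copied from the summary(reg3) output
b_male    <- 149.1039  # male1
b_iq_male <- 3.4121    # iq:male1

# Predicted male-minus-female wage gap at a given IQ score
wage_gap <- function(iq) b_male + b_iq_male * iq

wage_gap(c(80, 100, 120))
```

With the fitted model in memory, the same numbers could be taken from coef(reg3) instead of being typed in.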

We can plot these results like this:

ggplot(d, aes(x = iq, y = wage, color = factor(male))) + 
  geom_point(alpha = 0.6) + 
  geom_smooth(method = "lm", se = TRUE) + 
  scale_color_manual(values = c("red", "blue"),
                     labels = c("Female", "Male"),
                     name = "Gender") +
  ggtitle("Wage vs IQ by Gender") +
  labs(x = "IQ", y = "Wage") +
  theme_minimal(base_size = 14)

## `geom_smooth()` using formula = 'y ~ x'
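One way to compare the parallel-slopes and interaction specifications formally is a nested-model F test with anova(). The sketch below uses simulated data rather than the exercise data, so every number in it is made up. The names toy, fit_par, and fit_int are illustrative.

```r
set.seed(7)
n <- 400

# Simulated stand-ins for iq and male (not the exercise data)
iq      <- rnorm(n, mean = 100, sd = 15)
is_male <- rbinom(n, 1, 0.5)
toy     <- data.frame(iq = iq, male = factor(is_male))
toy$wage <- 350 + 4 * iq + 150 * is_male + 3 * iq * is_male +
  rnorm(n, sd = 300)

fit_par <- lm(wage ~ iq + male, data = toy)  # same slope
fit_int <- lm(wage ~ iq * male, data = toy)  # slope differs by group

# F test of the restricted model against the interaction model
cmp <- anova(fit_par, fit_int)
cmp

# With one added term, the F statistic equals the squared t statistic
t_int <- summary(fit_int)$coefficients["iq:male1", "t value"]
all.equal(cmp$F[2], t_int^2)
```

Applied to the exercise, anova(reg2, reg3) would test the same restriction as the t test on iq:male1.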

Regressions with multiple dummy variables

Regressing wage with education and experience

It is quite common to have multiple dummy variables describing different groupings. In this case, we have dummy variables for different levels of education. We can define them as:

educ1: Individual does not have a high school diploma
educ2: Individual has a high school diploma
educ3: Individual has a bachelor’s degree
educ4: Individual has an advanced degree (e.g. master’s or PhD)

When running this type of model, we have to be careful about the dummy variable trap. This occurs when a model includes all possible categories of a dummy variable plus the constant term, creating perfect multicollinearity.

If we include all four education dummies alongside the intercept, they sum to 1 for every observation, making it mathematically impossible to estimate unique coefficients. To avoid this trap, we must always omit one category to serve as the reference group. This omitted category becomes the baseline for comparisons, while the included dummies measure differences relative to that baseline, ensuring the model remains statistically identifiable and interpretable. We will omit educ1 and compare our results to the group of individuals who do not have a high school diploma.
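The trap is easy to reproduce. The sketch below uses simulated data with made-up category and wage values (toy, e1 to e4, fit_trap, and fit_ok are illustrative names) and fits the model once with all four dummies and once with the first one omitted:

```r
set.seed(1)
n <- 200
level <- sample(1:4, n, replace = TRUE)  # made-up education category

toy <- data.frame(
  wage = 500 + 100 * level + rnorm(n, sd = 50),
  e1 = as.numeric(level == 1),
  e2 = as.numeric(level == 2),
  e3 = as.numeric(level == 3),
  e4 = as.numeric(level == 4)
)

# The four dummies add up to the constant column of ones
all(toy$e1 + toy$e2 + toy$e3 + toy$e4 == 1)

# All four dummies plus the intercept: perfectly collinear
fit_trap <- lm(wage ~ e1 + e2 + e3 + e4, data = toy)
coef(fit_trap)

# Omitting e1 makes it the reference group
fit_ok <- lm(wage ~ e2 + e3 + e4, data = toy)
coef(fit_ok)
```

lm() does not stop with an error in the first case. It reports NA for the coefficient it cannot estimate, which is easy to overlook in a long output.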

We also include a variable measuring work experience exper (in years) as a continuous variable because it captures a separate and complementary influence on wages beyond education. While the education dummies account for differences in formal schooling levels, experience measures the accumulation of skills and knowledge over time. Including both allows the model to distinguish how education and on-the-job experience jointly contribute to wage differences. Our model is as follows:

\[ wage = \beta_0+\beta_1 exper+\beta_2 educ2(0/1)+\beta_3 educ3(0/1)+ \beta_4 educ4(0/1) + u\]

reg4 <- lm(wage ~ exper + educ2 + educ3 + educ4, data = d)
summary(reg4)

## 
## Call:
## lm(formula = wage ~ exper + educ2 + educ3 + educ4, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -872.91 -260.73  -40.68  192.71 2064.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  509.379     63.032   8.081 1.98e-15 ***
## exper         17.591      3.217   5.468 5.85e-08 ***
## educ21       122.008     45.038   2.709  0.00687 ** 
## educ31       310.798     50.845   6.113 1.44e-09 ***
## educ41       473.388     50.691   9.339  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 378.3 on 930 degrees of freedom
## Multiple R-squared:  0.1284, Adjusted R-squared:  0.1246 
## F-statistic: 34.24 on 4 and 930 DF,  p-value: < 2.2e-16

We can see that all the regression coefficients are statistically significant. The findings of this model can be summarized as:

On average, every year of experience corresponds to a wage increase of $17.59. This is constant across all education groups since the model assumes differing intercepts and a common slope.
On average, individuals with a high school diploma earn about $122 more than those without a diploma.
On average, individuals with a bachelor’s degree earn about $311 more than those without a high school diploma.
On average, individuals with an advanced degree (master’s or PhD) earn about $473 more than those without a high school diploma.
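To see how these estimates combine, the sketch below computes the predicted wage for each education group at 10 years of experience. The coefficients are copied by hand from the summary(reg4) output. The 10-year value and the names premium and pred are chosen here only for illustration.

```r
# Coefficients copied from the summary(reg4) output
b0      <- 509.379  # intercept: no diploma, zero experience
b_exper <- 17.591

# Education premiums relative to the omitted group (educ1)
premium <- c(no_diploma = 0,
             hs_diploma = 122.008,
             bachelors  = 310.798,
             advanced   = 473.388)

# Predicted wage at 10 years of experience for each group
pred <- b0 + b_exper * 10 + premium
pred
```

The differences between the entries of pred equal the differences between the dummy coefficients, whatever experience level is plugged in.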

The adjusted $R^2$ suggests that our model only accounts for about 12.5% of the observed variation in wages. While this is considerably lower than in reg2 and reg3, the model still identifies meaningful relationships between wages, experience, and education. We can plot the four parallel lines like this:

# First create a single education variable
d$educ_cat <- ifelse(d$educ1 == 1, 1,
                     ifelse(d$educ2 == 1, 2,
                            ifelse(d$educ3 == 1, 3, 4)))

# Then plot
ggplot(d, aes(x = exper, y = wage, color = factor(educ_cat))) + 
  geom_point(alpha = 0.6) + 
  geom_abline(intercept = coef(reg4)[1], 
              slope = coef(reg4)[2], 
              color = "red") +
  geom_abline(intercept = coef(reg4)[1] + coef(reg4)[3], 
              slope = coef(reg4)[2], 
              color = "blue") +
  geom_abline(intercept = coef(reg4)[1] + coef(reg4)[4], 
              slope = coef(reg4)[2], 
              color = "green") +
  geom_abline(intercept = coef(reg4)[1] + coef(reg4)[5], 
              slope = coef(reg4)[2], 
              color = "purple") +
  scale_color_manual(values = c("red", "blue", "green", "purple"),
                     labels = c("No diploma (educ1)", 
                                "HS diploma (educ2)", 
                                "Bach degree (educ3)", 
                                "Adv degree (educ4)"),
                     name = "Education Level") +
  ggtitle("Wage vs experience by education level (parallel slopes)") +
  labs(x = "Experience (years)", y = "Wage") +
  theme_minimal(base_size = 14)
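A single categorical variable such as educ_cat can also be passed to lm() directly. Wrapped in factor(), it is expanded into dummies automatically, with the first level omitted as the reference group. The sketch below checks this on simulated data (all values are made up, and toy, fit_manual, and fit_factor are illustrative names):

```r
set.seed(42)
n <- 300

# Simulated stand-ins for exper and educ_cat (not the exercise data)
toy <- data.frame(
  exper    = round(runif(n, 1, 20)),
  educ_cat = sample(1:4, n, replace = TRUE)
)
toy$wage <- 500 + 15 * toy$exper + 100 * (toy$educ_cat - 1) +
  rnorm(n, sd = 80)

# Manual dummies, omitting category 1 as the reference group
toy$educ2 <- as.numeric(toy$educ_cat == 2)
toy$educ3 <- as.numeric(toy$educ_cat == 3)
toy$educ4 <- as.numeric(toy$educ_cat == 4)

fit_manual <- lm(wage ~ exper + educ2 + educ3 + educ4, data = toy)
fit_factor <- lm(wage ~ exper + factor(educ_cat), data = toy)

# Same estimates; only the coefficient names differ
all.equal(unname(coef(fit_manual)), unname(coef(fit_factor)))
```

The factor approach avoids the dummy variable trap by construction, since the reference level is dropped without any manual step.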

Tons of dummy variables

The above example provided useful information on the relationship between wages, work experience, and education. However, the adjusted $R^2$ was still rather low, since only about 12.5% of the total variation in wage was explained by education and experience.

Let’s see if we can improve that by adding more of the dummy variables in the data set. These are the location variables urban and south, and indicators for whether the respondent is married (1 = married, 0 = single), black (1 = Black, 0 = any other race), or male (1 = male, 0 = female). The model statement will now look like this:

\[ \begin{aligned} wage = {} & \beta_0+\beta_1 exper+\beta_2 educ2(0/1) + \beta_3 educ3(0/1) \\ & + \beta_4 educ4(0/1) + \beta_5 urban(0/1) + \beta_6 south(0/1) \\ & + \beta_7 married(0/1) + \beta_8 black(0/1) + \beta_9 male(0/1) + u \end{aligned} \]

reg5 <- lm(wage ~ exper + educ2 + educ3 + educ4 + urban + south + 
             married + black + male, data = d)
summary(reg5)

## 
## Call:
## lm(formula = wage ~ exper + educ2 + educ3 + educ4 + urban + south + 
##     married + black + male, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1004.66  -175.14   -24.11   138.64  1804.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  382.799     58.175   6.580 7.86e-11 ***
## exper          9.449      2.492   3.792 0.000159 ***
## educ21        35.040     34.778   1.008 0.313945    
## educ31       140.245     39.632   3.539 0.000422 ***
## educ41       233.167     40.235   5.795 9.36e-09 ***
## urban1        88.242     21.665   4.073 5.04e-05 ***
## south1       -44.488     20.813  -2.138 0.032814 *  
## married1     137.236     30.978   4.430 1.05e-05 ***
## black1       -99.382     30.009  -3.312 0.000963 ***
## male1        456.265     20.222  22.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 289.7 on 925 degrees of freedom
## Multiple R-squared:  0.4918, Adjusted R-squared:  0.4868 
## F-statistic: 99.45 on 9 and 925 DF,  p-value: < 2.2e-16

We can see that the fit has improved dramatically with the additional dummy variables. Whereas the previous model explained only about 12.5% of the total variation in wage, the new model explains almost 49% (adjusted $R^2$ of 0.4868). This indicates that adding location, marital status, race, and gender substantially improves the model’s ability to account for differences in wages.

Looking at the coefficients, experience has a smaller effect on wages than in the previous model: wages rise by about $9.45 per year of experience on average, compared with $17.59 in reg4. Higher education levels still raise wages, although the premiums for educ3 and educ4 are roughly half as large as before, and educ2 is no longer statistically significant (p = 0.314). This finding suggests that the average wage difference between those with a high school diploma and those without one cannot be distinguished from zero, assuming all other factors are held constant.

Location and demographics also matter. Living in an urban area is associated with wages about $88 higher, whereas living in the South is associated with wages about $44 lower. Married respondents earn approximately $137 more than single respondents, while Black respondents earn about $99 less than respondents of other races. Gender shows the largest effect, with males earning roughly $456 more than females, all else equal.
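Since every dummy shifts the intercept, a predicted wage is the sum of the coefficients that apply to a given person. The sketch below does this for one hypothetical profile (a married, urban, non-Southern, non-Black male with a bachelor’s degree and 10 years of experience), using coefficients copied by hand from the summary(reg5) output. The profile and the names b and x are assumptions made for illustration.

```r
# Coefficients copied from the summary(reg5) output
b <- c(intercept = 382.799, exper = 9.449, educ3 = 140.245,
       urban = 88.242, married = 137.236, male = 456.265)

# Hypothetical profile: dummies that equal 0 (educ2, educ4, south,
# black) contribute nothing and are left out
x <- c(intercept = 1, exper = 10, educ3 = 1,
       urban = 1, married = 1, male = 1)

# Predicted wage: sum of coefficient times value
pred_wage <- sum(b * x[names(b)])
pred_wage
```

With the fitted model in memory, predict(reg5, newdata = ...) would do the same job, provided the dummy columns in newdata are factors with the same levels as in d.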