Advanced Economic Methods
Exercise 9: Working with limited dependent variable models
Overview
The following tutorial will guide you through running regressions with dummy dependent variables. If you have not done the previous tutorials, go back to the introductory lesson here to learn how to set up the appropriate working paths for the code to work.
Setup
Header
Start a new script file and add your header and the appropriate
libraries. We are using a Stata file again (.dta) so we
will need the haven package. We will use the
dplyr package for data organization. For simplicity, I will
show you how to generate plots in the base R options
instead of using ggplot.
###########################################################################
# Project: Limited dependent variable models
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(haven)
library(dplyr)
# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory pathData
Take a look at the data file:
## # A tibble: 6 × 5
## indiv dummy fam_inc age exper
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1.72 30 11
## 2 2 0 1.90 48 1
## 3 3 0 1.91 50 10
## 4 4 0 1.72 46 15
## 5 5 0 1.88 38 8
## 6 6 0 2.58 38 16
The data set contains the following variables:
indiv: A unique identifier for each respondentdummy: A binary variable denoting if the respondent has chosen to enter the workforcefam_inc: Amount of family incomeage: The age of the respondent (years)exper: Work experience of the respondent (years)
I think it will also help to have a look at the income variable to get a better idea of the magnitude of the effects when we start running regressions
This is helpful because we can see that a majority of the values are between 1 and 3, with only a few observations on the more extreme sides.
Regressions
Linear probability model
To start, we can run a linear probability model. This is essentially identical to most of the other models we have been working with, only using a dummy dependent variable.
Model statement
The functional form of this model is simply the relationship between the dummy variable denoting if individual \(i\) is working (\(work_i\)) and family income (\(inc\)):
\[work_i(0/1) = \beta_0 + \beta_1 inc_i + u\]
##
## Call:
## lm(formula = dummy ~ fam_inc, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60168 -0.26320 0.05829 0.25008 0.55643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.57963 0.05001 31.59 <2e-16 ***
## fam_inc -0.58560 0.02295 -25.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3169 on 505 degrees of freedom
## Multiple R-squared: 0.5632, Adjusted R-squared: 0.5623
## F-statistic: 651.1 on 1 and 505 DF, p-value: < 2.2e-16
Interpretations
- Intercept (1.58): The predicted probability of being in the workforce when family income = 0 is 1.58 (or 158%). This is clearly nonsensical, as probabilities can’t exceed 1! This is one of the key problems with linear probability models.
- \(inc = -0.586\): For each one-unit increase in family income, the probability of being in the workforce decreases by 0.586 (or 58.6 percentage points).
This is the main advantage of LPM - the coefficient is directly interpretable as the marginal effect on probability The negative relationship makes intuitive sense: higher family income might mean less economic pressure to work
Problems evident in this output:
- Predictions outside [0,1]: The intercept of 1.58 already shows impossible probabilities. At low family income values, you’ll get predicted probabilities > 1, and at high values, you’ll get probabilities < 0.
- Constant marginal effects: The -0.586 effect is assumed constant across all income levels, which is unrealistic. In reality, the effect of income on workforce participation probably varies (e.g., diminishing effects at very high or low income levels).
- High \(R^2\) (0.56): This looks good, but is somewhat misleading given the model’s theoretical problems.
Extracting parameters and plotting
Let’s have a look at the model we produced. We have an additional
step compared to the other models we have been working with in the
earlier tutorials, because we have to predict the probabilities of
entering the workforce. To do this, we will first create a new data
frame where we extract 100 observations of fam_inc (we
don’t need the entire data set for a plot):
Now we will create a plot. This will be done in the base
R commands just for simplicity
plot(dummy ~ fam_inc, data = d, col = "#29af7f", pch = 16,
xlab = "Family income",
ylab = "Probability of returning to workforce",
main = "Linear probability model")
# Add predicted probability line
lines(newdat$fam_inc, newdat$pred1, col = "#482173", lwd = 2)Logit regression
Now, let’s fit a logit model. The syntax is almost the same as
reg1. They key differences are that we replace the
lm() command with glm() (generalized
linear model instead of linear model) and specify the type of regression
we want to do. In this case, we set this as
family = binomial so R understands that we are
interested in knowing the predicted probabilities of a dummy dependent
variable.
##
## Call:
## glm(formula = dummy ~ fam_inc, family = binomial, data = d)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 19.828 2.386 8.309 <2e-16 ***
## fam_inc -11.153 1.337 -8.340 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 659.61 on 506 degrees of freedom
## Residual deviance: 193.85 on 505 degrees of freedom
## AIC: 197.85
##
## Number of Fisher Scoring iterations: 8
Notice that there is not a clear goodness of fit metric like the
\(R^2\) in the output of a model
produced using lm(). The reason for the missing metric is
because we have the information to produce it, but there are competing
methods and R does not want to pick a favorite. For
simplicity, we will calculate a pseudo-\(R^2\) using the approach called
McFadden’s \(R^2\) that uses the
residual and null deviance metrics:
## [1] 0.7061165
The model explains about 71% of the deviance, which is quite good!
The interpretation in most of the logit model metrics can be quite
difficult. Personally, I think the best way to report the findings is by
describing the sign of the relationship (-/+), the degree of
significance, then producing a plot and describing what you see. We can
see from the findings above that fam_inc is negatively
associated with the probability of entering the workforce, which is an
expected relationship. The p-value suggests that this relationship is
strong (\(p<0.001\)). Now let’s
predict the probabilities of this model, add them to the
newdat object, and generate a plot.
# Predict probabilities
newdat$pred2 <- predict(reg2, newdata = newdat, type = "response")
# Plot actual data points
plot(dummy ~ fam_inc, data = d, col = "#29af7f", pch = 16,
xlab = "Family income",
ylab = "Probability of returning to workforce",
main = "Logit regression")
# Add predicted probability line
lines(newdat$fam_inc, newdat$pred2, col = "#482173", lwd = 2)From the green line on the plot, we can provide a few points to describe the relationship between \(inc\) and \(work\). Examples:
- If \(inc\) is less than about 1.3, the probability of entering the workforce is 100%
- The fastest rate at which individuals change their mind about entering the workforce occurs when \(inc\) is between about (1.7,1.9)
- Beyond about \(inc=2.3\), the probability of entering the workforce for individual \(i\) is zero (\(P_i=0\))
Probit regression
Let’s now specify a probit model to see how it differs from the logit
model. The way it is specified in R is almost identical to
the logit model, but we add the call for probit in the family
command:
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
We can see that the estimation is throwing a warning. The warning
says that for some values of fam_inc, the model predicts
probabilities so close to 0 or 1 that the underlying latent variable is
extremely large (positive or negative). In other words:
- Some values of fam_inc perfectly predict the outcome (all 1s or all 0s),
- So the estimated coefficients blow up (tend to \(\pm \infty\))
We will ignore this warning for now because we can still produce an output that is suitable for this tutorial, but it should be corrected in the future. Corrections could include running the model using Bayesian inference or a penalized likelihood function.
##
## Call:
## glm(formula = dummy ~ fam_inc, family = binomial(link = "probit"),
## data = d)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.7228 1.3268 8.835 <2e-16 ***
## fam_inc -6.5853 0.7425 -8.869 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 659.61 on 506 degrees of freedom
## Residual deviance: 190.70 on 505 degrees of freedom
## AIC: 194.7
##
## Number of Fisher Scoring iterations: 10
As with the other regressions, we can extract the predicted
probabilities and add them to the newdat object, then plot
the relationship between the variables. We will also include the logit
line so we can compare the differences between the logit and probit
predicted probabilities.
newdat$pred3 <- predict(reg3, newdata = newdat, type = "response")
# Plot actual data points
plot(dummy ~ fam_inc, data = d, col = "#29af7f", pch = 16,
xlab = "Family income",
ylab = "Probability of returning to workforce",
main = "Probit and logit regressions")
# Add predicted probability line
lines(newdat$fam_inc, newdat$pred2, col = "#482173", lwd = 2)
lines(newdat$fam_inc, newdat$pred3, col = "#bddf26", lwd = 2)