Advanced Economic Methods

Exercise 9: Working with limited dependent variable models

Overview

The following tutorial will guide you through running regressions with dummy dependent variables. If you have not done the previous tutorials, go back to the introductory lesson here to learn how to set up the appropriate working paths for the code to work.

Setup

Data

d <- read_dta("dat/binary2.dta")

Take a look at the data file:

head(d)
## # A tibble: 6 × 5
##   indiv dummy fam_inc   age exper
##   <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1     1     0    1.72    30    11
## 2     2     0    1.90    48     1
## 3     3     0    1.91    50    10
## 4     4     0    1.72    46    15
## 5     5     0    1.88    38     8
## 6     6     0    2.58    38    16

The data set contains the following variables:

  • indiv: A unique identifier for each respondent
  • dummy: A binary variable denoting if the respondent has chosen to enter the workforce
  • fam_inc: Amount of family income
  • age: The age of the respondent (years)
  • exper: Work experience of the respondent (years)

I think it will also help to have a look at the income variable to get a better idea of the magnitude of the effects when we start running regressions

hist(d$fam_inc)

This is helpful because we can see that a majority of the values are between 1 and 3, with only a few observations on the more extreme sides.

Regressions

Linear probability model

To start, we can run a linear probability model. This is essentially identical to most of the other models we have been working with, only using a dummy dependent variable.

Model statement

The functional form of this model is simply the relationship between the dummy variable denoting if individual \(i\) is working (\(work_i\)) and family income (\(inc\)):

\[work_i(0/1) = \beta_0 + \beta_1 inc_i + u\]

reg1 <- lm(dummy ~ fam_inc, data = d)
summary(reg1)
## 
## Call:
## lm(formula = dummy ~ fam_inc, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60168 -0.26320  0.05829  0.25008  0.55643 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.57963    0.05001   31.59   <2e-16 ***
## fam_inc     -0.58560    0.02295  -25.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3169 on 505 degrees of freedom
## Multiple R-squared:  0.5632, Adjusted R-squared:  0.5623 
## F-statistic: 651.1 on 1 and 505 DF,  p-value: < 2.2e-16

Interpretations

  • Intercept (1.58): The predicted probability of being in the workforce when family income = 0 is 1.58 (or 158%). This is clearly nonsensical, as probabilities can’t exceed 1! This is one of the key problems with linear probability models.
  • \(inc = -0.586\): For each one-unit increase in family income, the probability of being in the workforce decreases by 0.586 (or 58.6 percentage points).

This is the main advantage of LPM - the coefficient is directly interpretable as the marginal effect on probability The negative relationship makes intuitive sense: higher family income might mean less economic pressure to work

Problems evident in this output:

  • Predictions outside [0,1]: The intercept of 1.58 already shows impossible probabilities. At low family income values, you’ll get predicted probabilities > 1, and at high values, you’ll get probabilities < 0.
  • Constant marginal effects: The -0.586 effect is assumed constant across all income levels, which is unrealistic. In reality, the effect of income on workforce participation probably varies (e.g., diminishing effects at very high or low income levels).
  • High \(R^2\) (0.56): This looks good, but is somewhat misleading given the model’s theoretical problems.

Extracting parameters and plotting

Let’s have a look at the model we produced. We have an additional step compared to the other models we have been working with in the earlier tutorials, because we have to predict the probabilities of entering the workforce. To do this, we will first create a new data frame where we extract 100 observations of fam_inc (we don’t need the entire data set for a plot):

newdat <- data.frame(fam_inc = sort(d$fam_inc))

newdat$pred1 <- predict(reg1, newdata = newdat)

Now we will create a plot. This will be done in the base R commands just for simplicity

plot(dummy ~ fam_inc, data = d, col = "#29af7f", pch = 16,
     xlab = "Family income",
     ylab = "Probability of returning to workforce",
     main = "Linear probability model")

# Add predicted probability line
lines(newdat$fam_inc, newdat$pred1, col = "#482173", lwd = 2)

Logit regression

Now, let’s fit a logit model. The syntax is almost the same as reg1. They key differences are that we replace the lm() command with glm() (generalized linear model instead of linear model) and specify the type of regression we want to do. In this case, we set this as family = binomial so R understands that we are interested in knowing the predicted probabilities of a dummy dependent variable.

reg2 <- glm(dummy ~ fam_inc, data = d, family = binomial)

summary(reg2)
## 
## Call:
## glm(formula = dummy ~ fam_inc, family = binomial, data = d)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   19.828      2.386   8.309   <2e-16 ***
## fam_inc      -11.153      1.337  -8.340   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 659.61  on 506  degrees of freedom
## Residual deviance: 193.85  on 505  degrees of freedom
## AIC: 197.85
## 
## Number of Fisher Scoring iterations: 8

Notice that there is not a clear goodness of fit metric like the \(R^2\) in the output of a model produced using lm(). The reason for the missing metric is because we have the information to produce it, but there are competing methods and R does not want to pick a favorite. For simplicity, we will calculate a pseudo-\(R^2\) using the approach called McFadden’s \(R^2\) that uses the residual and null deviance metrics:

1 - (reg2$deviance / reg2$null.deviance)
## [1] 0.7061165

The model explains about 71% of the deviance, which is quite good!

The interpretation in most of the logit model metrics can be quite difficult. Personally, I think the best way to report the findings is by describing the sign of the relationship (-/+), the degree of significance, then producing a plot and describing what you see. We can see from the findings above that fam_inc is negatively associated with the probability of entering the workforce, which is an expected relationship. The p-value suggests that this relationship is strong (\(p<0.001\)). Now let’s predict the probabilities of this model, add them to the newdat object, and generate a plot.

# Predict probabilities
newdat$pred2 <- predict(reg2, newdata = newdat, type = "response")

# Plot actual data points
plot(dummy ~ fam_inc, data = d, col = "#29af7f", pch = 16,
     xlab = "Family income",
     ylab = "Probability of returning to workforce",
     main = "Logit regression")

# Add predicted probability line
lines(newdat$fam_inc, newdat$pred2, col = "#482173", lwd = 2)

From the green line on the plot, we can provide a few points to describe the relationship between \(inc\) and \(work\). Examples:

  • If \(inc\) is less than about 1.3, the probability of entering the workforce is 100%
  • The fastest rate at which individuals change their mind about entering the workforce occurs when \(inc\) is between about (1.7,1.9)
  • Beyond about \(inc=2.3\), the probability of entering the workforce for individual \(i\) is zero (\(P_i=0\))

Probit regression

Let’s now specify a probit model to see how it differs from the logit model. The way it is specified in R is almost identical to the logit model, but we add the call for probit in the family command:

reg3 <- glm(dummy ~ fam_inc, data = d, family = binomial(link = "probit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

We can see that the estimation is throwing a warning. The warning says that for some values of fam_inc, the model predicts probabilities so close to 0 or 1 that the underlying latent variable is extremely large (positive or negative). In other words:

  • Some values of fam_inc perfectly predict the outcome (all 1s or all 0s),
  • So the estimated coefficients blow up (tend to \(\pm \infty\))

We will ignore this warning for now because we can still produce an output that is suitable for this tutorial, but it should be corrected in the future. Corrections could include running the model using Bayesian inference or a penalized likelihood function.

summary(reg3)
## 
## Call:
## glm(formula = dummy ~ fam_inc, family = binomial(link = "probit"), 
##     data = d)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  11.7228     1.3268   8.835   <2e-16 ***
## fam_inc      -6.5853     0.7425  -8.869   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 659.61  on 506  degrees of freedom
## Residual deviance: 190.70  on 505  degrees of freedom
## AIC: 194.7
## 
## Number of Fisher Scoring iterations: 10

As with the other regressions, we can extract the predicted probabilities and add them to the newdat object, then plot the relationship between the variables. We will also include the logit line so we can compare the differences between the logit and probit predicted probabilities.

newdat$pred3 <- predict(reg3, newdata = newdat, type = "response")

# Plot actual data points
plot(dummy ~ fam_inc, data = d, col = "#29af7f", pch = 16,
     xlab = "Family income",
     ylab = "Probability of returning to workforce",
     main = "Probit and logit regressions")

# Add predicted probability line
lines(newdat$fam_inc, newdat$pred2, col = "#482173", lwd = 2)
lines(newdat$fam_inc, newdat$pred3, col = "#bddf26", lwd = 2)