Advanced Economic Methods

Exercise 8: Analyzing life expectancy with cross-sectional and panel data models

Overview

The following tutorial will explore relationships between life expectancy and key socioeconomic variables using a longitudinal data set. If you have not done the previous tutorials, go back to the introductory lesson here to learn how to set up the appropriate working paths for the code to work.

Setup
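
This lesson calls functions from several add-on packages. As a hedged sketch, the package list below is an assumption inferred from the function names used later (`%>%`, `mutate()` and `filter()` from dplyr, `vif()` from car, `bptest()` and `coeftest()` from lmtest, and `plm()`, `phtest()` and the panel `vcovHC()` method from plm); it is not the author's confirmed setup:

```r
# Packages assumed from the functions called in this lesson (an assumption,
# not the author's confirmed setup).
pkgs <- c("dplyr", "car", "lmtest", "plm")

# Find which of them are not installed yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Report what still needs install.packages() before continuing.
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach all packages so their functions can be called without a prefix.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```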

Data

For this analysis, we will download a data set from Kaggle. This can be a useful website for finding practice data sets. We will use a data set on life expectancy determinants, which can be found here. Download the data file, and store it in your dat folder so it is consistent with all other lessons we have done so far.

d <- read.csv("dat/Life Expectancy Data.csv")

Now let’s take a look at the data:

str(d)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

Please refer back to the website of the data source for complete descriptions of the variables included in the data set, as this tutorial will assume knowledge of the definitions.

Note that Status is a dummy variable with characters instead of numbers. Just so there are no issues, we will transform it into a factor variable by creating a new variable called developed and defining the categories to make it very clear that country \(i\) is developed if this variable is equal to 1, and developing if it is equal to 0.

d <- d %>%
  mutate(developed = factor(ifelse(Status == "Developed", 1, 0)))

Cross-section regression

For comparison purposes, we will first fit a cross-section regression to get familiarized with the relationships between life expectancy and the independent variables we select for the model. This tutorial will use only a subset of the available variables for demonstration purposes. However, you are encouraged to explore the data set more and build a more comprehensive model.

Data prep

To run a cross-sectional regression, we have to choose one of the possible 16 years available in the data set. Some earlier explorations found that 2015 had some missing values for some of the key indicators, so we will use the 2014 data since it is the most recent comprehensive year in the data set. To do this, we can create a new data object called d_2014 that contains only the data for 2014:

d_2014 <- d %>%
  filter(Year == 2014)
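
The choice of year rests on how complete each year is. A minimal sketch of such a check, using a small made-up data frame (`toy`, with hypothetical values) in place of `d`, counts the missing values per year for one indicator:

```r
# Made-up stand-in for d: two countries, three years, gaps in 2015 only.
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year    = rep(2013:2015, times = 2),
  Alcohol = c(1.2, 1.3, NA, 4.0, 4.1, NA)
)

# Number of missing Alcohol values in each year.
na_by_year <- tapply(is.na(toy$Alcohol), toy$Year, sum)
na_by_year
```

With the real data, swapping `toy` for `d` and repeating the count for each model variable would show which years are usable.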

Regression

Let’s now run a basic regression using the following functional form:

\[exp_i = \beta_0 + \beta_1 log(GDP)_i + \beta_2 edu_i + \beta_3 alcohol_i + \beta_4 dev(0/1)_i + \beta_5 mort_i + \beta_6 hep_i + u_i\]

where \(exp_i\) is the life expectancy of country \(i\), \(log(GDP)_i\) is the logarithmic value of GDP so we can interpret changes in GDP in percentage terms, \(edu_i\) is the mean number of years of education, \(alcohol_i\) is the average number of liters of pure alcohol consumed per capita in country \(i\), \(dev(0/1)_i\) is a dummy variable equal to 1 if country \(i\) is developed and 0 if it is developing, \(mort_i\) is the adult mortality rate per 1000 people, \(hep_i\) is the immunization rate for Hepatitis B in country \(i\), and \(u_i\) is the error term. We write the model syntax as:

reg1 <- lm(Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed 
           + Adult.Mortality + Hepatitis.B, data = d_2014)

summary(reg1)
## 
## Call:
## lm(formula = Life.expectancy ~ log(GDP) + Schooling + Alcohol + 
##     developed + Adult.Mortality + Hepatitis.B, data = d_2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.474  -2.140   0.376   2.463   7.541 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     57.332665   2.659895  21.554  < 2e-16 ***
## log(GDP)         0.398207   0.201481   1.976   0.0501 .  
## Schooling        1.090051   0.190034   5.736 5.87e-08 ***
## Alcohol          0.206120   0.099262   2.077   0.0397 *  
## developed1       2.167519   1.078742   2.009   0.0465 *  
## Adult.Mortality -0.036726   0.003582 -10.254  < 2e-16 ***
## Hepatitis.B      0.015945   0.014286   1.116   0.2663    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.768 on 138 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.8096, Adjusted R-squared:  0.8013 
## F-statistic: 97.81 on 6 and 138 DF,  p-value: < 2.2e-16

Interpretations

  • The intercept is 57.33, meaning that the predicted life expectancy of country \(i\) is just over 57 years when log(GDP) is 0 (that is, a GDP of 1) and the country has no education, no alcohol consumption, is a developing country, has no adult mortality, and does not have any Hepatitis B immunization
  • An increase in GDP of 1% corresponds to an increase of about 0.004 years of life expectancy (the coefficient of 0.40 divided by 100), on average. This estimate is only marginally significant (p = 0.0501)
  • An increase in the average level of education by 1 year corresponds to an increase in life expectancy of about 1.09 years on average
  • There is a positive relationship between alcohol consumption and life expectancy, where an increase in the consumption of alcohol by 1 liter per person corresponds to an increase in life expectancy of about 0.21 years
  • The difference in life expectancy between developed and developing countries is about 2.17 years, assuming all else is equal
  • An increase in the adult mortality rate by 1 case per 1000 people corresponds to a decrease in the life expectancy by about 0.04 years, on average
  • The estimated relationship between Hepatitis B immunization and life expectancy is positive, but it is not statistically significant
  • The adjusted \(R^2\) indicates that approximately 80% of the variation in life expectancy is explained by the variables we included in the model
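
Because GDP enters the model in logs, its coefficient describes a one-unit change in log(GDP), not a 1% change in GDP. A quick sketch of the conversion, using the log(GDP) estimate printed in the regression output above:

```r
# log(GDP) coefficient copied from the summary(reg1) output.
b_gdp <- 0.398207

# Exact change in predicted life expectancy for a 1% rise in GDP.
effect_1pct <- b_gdp * log(1.01)

# Common approximation: divide the coefficient by 100.
approx_1pct <- b_gdp / 100

c(exact = effect_1pct, approx = approx_1pct)
```

Both values are roughly 0.004 years, so even a 10% rise in GDP corresponds to only about 0.04 years of life expectancy in this model.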

Overall, the model seems to be a good fit. Some of the relationships are expected, e.g. there is a known positive relationship between GDP and life expectancy, and a decrease in adult mortality should correspond to a higher life expectancy. The relationship between alcohol and life expectancy seems a bit strange, and there is certainly not a causal relationship between these. However, it does make some sense since many higher income countries tend to consume more alcohol than developing countries. Examples could include the wine regions in Greece or Italy, and the beer-drinking countries such as Germany.

Testing the robustness of the model

Let’s run a few quick tests to make sure the model is statistically sound before we move on. Since it is a multivariate regression, a good place to start is to test for multicollinearity. We can do this using the vif() command, which estimates a variance inflation factor. All values should be below 10, and below 5 is ideal:

vif(reg1)
##        log(GDP)       Schooling         Alcohol       developed Adult.Mortality 
##        1.523275        2.798845        1.635852        1.471820        1.552662 
##     Hepatitis.B 
##        1.095457
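
For intuition, the variance inflation factor of a regressor is 1 / (1 - R²), where R² comes from regressing that variable on all the other regressors. A sketch on simulated data (all variable names and values here are made up, not taken from the data set):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)   # x2 is correlated with x1 by construction
x3 <- rnorm(n)

# Auxiliary regression: x1 on the other regressors.
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared

# VIF for x1: 1 / (1 - R^2); 1 means no collinearity at all.
vif_x1 <- 1 / (1 - r2_x1)
vif_x1
```

The more of a regressor's variation the other regressors can reproduce, the closer this R² gets to 1 and the larger the VIF becomes.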

The values here are great, so let’s quickly also test for heteroskedasticity using the Breusch–Pagan test. In this test, the null hypothesis is that we have homoskedastic errors.

bptest(reg1)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg1
## BP = 21.671, df = 6, p-value = 0.001389

Oops, we have heteroskedasticity! We reject the null hypothesis since the p-value is very low (0.0014). Heteroskedasticity does not bias the coefficient estimates, but it makes the usual standard errors, and therefore the statistical significance of the relationships identified in reg1, unreliable. We will run the panel models using robust standard errors to be sure that our results are robust.
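
To see what the test measures, the studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the model's regressors and multiply the R² of that auxiliary regression by the sample size. A sketch on simulated data (made-up variables, not the tutorial's data):

```r
set.seed(42)
n <- 300
x <- runif(n, 1, 10)
# The error spread grows with x, so the data are heteroskedastic by construction.
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressor(s).
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2, chi-squared with df = number of regressors.
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(BP = bp_stat, p.value = bp_pval)
```

A small p-value means the regressors help predict the size of the squared residuals, which is exactly what constant error variance rules out.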

Panel data regression

While the cross-sectional regression seems to have a fair amount of explanatory power, we can improve it by adding more data. We can define a new model with the full observation period provided by the data set (2000-2015). A new functional form to incorporate multiple time periods can be defined as:

\[exp_{it} = \beta_0 + \beta_1 log(GDP)_{it} + \beta_2 edu_{it} + \beta_3 alcohol_{it} + \beta_4 dev(0/1)_{it} + \beta_5 mort_{it} + \beta_6 hep_{it} + u_{it}\]

where all variables are the same as the first equation, but we add the subscript \(t\) to denote that all observed values of the dependent and independent variables are time-specific.

We also have to understand that once we start incorporating multiple time periods of the same countries, there will be unobserved heterogeneity. This refers to country-specific characteristics that affect life expectancy but are not included in our model. These are factors that differ across countries but remain relatively constant over time. Examples include:

  • Geographic characteristics (climate, natural resources, landlocked vs. coastal)
  • Cultural factors (dietary habits, social norms, family structures)
  • Historical context (colonial history, past conflicts)
  • Institutional quality (strength of healthcare systems, governance structures)
  • Genetic or demographic composition of the population

The problem with unobserved heterogeneity is that these unmeasured factors may be correlated with our independent variables, leading to omitted variable bias in our estimates. For instance, countries with strong institutions may have both higher GDP and better health outcomes, but the cross-sectional model cannot distinguish whether GDP itself improves health or whether both are driven by good institutions.

Panel data models address this issue by explicitly accounting for these country-specific effects, allowing us to obtain more reliable estimates of how our independent variables truly affect life expectancy.

Fixed effects vs random effects

Panel data models account for unobserved heterogeneity across countries (or entities) in different ways:

  • Fixed effects (FE): This model assumes that each country has its own unique characteristics (e.g., culture, geography, institutions) that are correlated with the independent variables. The FE model controls for these time-invariant differences by essentially giving each country its own intercept. This approach focuses on within-country variation over time, asking “how do changes in GDP, education, etc. affect changes in life expectancy within the same country?”
  • Random effects (RE): This model assumes that country-specific characteristics are random and uncorrelated with the independent variables. The RE model is more efficient when this assumption holds, as it uses both between-country and within-country variation. However, if country characteristics are correlated with the predictors (e.g., wealthier countries have both higher GDP and better healthcare systems), the RE estimates will be biased.

The Hausman test helps determine which approach is appropriate by testing whether the country-specific effects are correlated with the regressors.
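
To make the idea of giving each country its own intercept concrete, the sketch below uses simulated data (hypothetical `id`, `x` and `y`) to show that including one dummy per entity and demeaning each variable within its entity give the same slope, while pooled OLS does not:

```r
set.seed(7)
id    <- rep(1:5, each = 8)               # 5 entities, 8 periods each
alpha <- rep(rnorm(5, sd = 3), each = 8)  # entity-specific intercepts
x     <- alpha + rnorm(40)                # x is correlated with the entity effect
y     <- 1.5 * x + alpha + rnorm(40)

# (1) Least-squares dummy variables: one intercept per entity.
b_lsdv <- coef(lm(y ~ x + factor(id)))["x"]

# (2) Within transformation: subtract entity means, then OLS without intercept.
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
b_within <- coef(lm(y_dm ~ x_dm - 1))["x_dm"]

# Pooled OLS ignores the entity effect, so its slope also picks up alpha.
b_pooled <- coef(lm(y ~ x))["x"]

c(lsdv = unname(b_lsdv), within = unname(b_within), pooled = unname(b_pooled))
```

The first two estimates coincide because both use only the variation within each entity over time.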

Fixed effects model

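# No index is given, so plm() treats the first two columns of d
# (Country, Year) as the individual and time identifiers.
reg2 <- plm(Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed 
           + Adult.Mortality + Hepatitis.B,
            data = d,
            model = "within")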

Random effects model

reg3 <- plm(Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed 
           + Adult.Mortality + Hepatitis.B,
            data = d,
            model = "random")

Compare fixed and random effects using Hausman test

We can test which model is the most statistically appropriate using the Hausman test. The null hypothesis is that the country-specific effects are uncorrelated with the regressors, in which case both estimators are consistent and the more efficient random effects model is preferred.

hausman <- phtest(reg2, reg3)
hausman
## 
##  Hausman Test
## 
## data:  Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed +  ...
## chisq = 467.2, df = 5, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

We reject the null hypothesis and conclude that the fixed effects model is better in this case.

Fixed effects model results

Now we can look at the results of the model; however, the standard summary() command will not show us robust standard errors. It is still useful for looking at the regression coefficients and the \(R^2\), so we use it for the basic interpretations, then check the robust standard errors to make sure the relationships are still statistically significant.

summary(reg2)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = Life.expectancy ~ log(GDP) + Schooling + Alcohol + 
##     developed + Adult.Mortality + Hepatitis.B, data = d, model = "within")
## 
## Unbalanced Panel: n = 148, T = 1-16, N = 1865
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -7.64595 -0.91330 -0.13349  0.52391 12.40247 
## 
## Coefficients:
##                    Estimate  Std. Error t-value  Pr(>|t|)    
## log(GDP)         0.15503140  0.03864081  4.0121 6.277e-05 ***
## Schooling        0.71358589  0.05003509 14.2617 < 2.2e-16 ***
## Alcohol         -0.24408717  0.03566393 -6.8441 1.068e-11 ***
## Adult.Mortality -0.00370095  0.00063581 -5.8208 6.976e-09 ***
## Hepatitis.B      0.01370162  0.00243700  5.6223 2.196e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    8773.3
## Residual Sum of Squares: 7035.8
## R-Squared:      0.19804
## Adj. R-Squared: 0.12684
## F-statistic: 84.556 on 5 and 1712 DF, p-value: < 2.22e-16

Interpretations

  • The dummy variable \(dev\) has been dropped from the regression because it is time invariant
  • An increase in GDP of 1% corresponds to an increase in life expectancy by about 0.0016 years on average (the coefficient of 0.155 divided by 100). This relationship is considerably weaker than what we observed in the cross-sectional model
  • An increase in the average level of education by 1 year corresponds to an increase in the life expectancy by about 0.71 years. This is also a smaller relationship than the cross-sectional model
  • An increase in alcohol consumption by 1 liter per person corresponds to a decrease in life expectancy by about 0.24 years on average. This variable flipped signs compared to the cross-sectional regression
  • An increase in adult mortality by 1 person per 1000 citizens corresponds to a decrease in the life expectancy by about 0.004 years. This is roughly a 10-fold decrease in the effect size compared to the cross-section model
  • An increase in the Hepatitis B vaccination rate by 1 percentage point corresponds to an increase in the life expectancy by about 0.014 years. This is similar to the cross-sectional model; however, the results now show a statistically significant relationship when the cross-sectional model did not
  • We now have an adjusted \(R^2\) of about 13%, which is drastically lower than the 80% reported in the cross-sectional model
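
The dropped dummy follows directly from how the fixed effects estimator works: a variable that never changes within a country equals its own country mean, so subtracting that mean leaves a column of zeros with nothing to estimate. A tiny sketch with made-up values:

```r
# Made-up panel: country A is developed in every year, country B never is.
country   <- rep(c("A", "B"), each = 4)
developed <- rep(c(1, 0), each = 4)

# Subtract each country's mean, as the within transformation does.
developed_within <- developed - ave(developed, country)
developed_within
```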

Model results with robust standard errors

For the final step, let’s take a look at the coefficients with robust standard errors. The coefficient estimations will remain unchanged, but the standard errors and corresponding p-values will be adjusted.

coeftest(reg2, vcov = vcovHC(reg2, type = "HC1"))
## 
## t test of coefficients:
## 
##                   Estimate Std. Error t value Pr(>|t|)   
## log(GDP)         0.1550314  0.0471117  3.2907 0.001020 **
## Schooling        0.7135859  0.2536465  2.8133 0.004959 **
## Alcohol         -0.2440872  0.0790040 -3.0896 0.002037 **
## Adult.Mortality -0.0037010  0.0016981 -2.1795 0.029431 * 
## Hepatitis.B      0.0137016  0.0042509  3.2232 0.001292 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results here look good. All variables remain statistically significant at the 5% level, though the robust standard errors are larger, and the p-values therefore higher, than in the standard output provided by the summary() command.

Conclusion

This analysis demonstrates the importance of accounting for panel structure when analyzing longitudinal data. Comparing the cross-sectional regression (2014 only) with the fixed effects panel model reveals substantial differences in both effect sizes and interpretations.

The cross-sectional model produced an adjusted \(R^2\) of 80% and showed strong positive relationships for most variables. However, this high explanatory power was partly driven by between-country differences—wealthy developed nations tend to have higher GDP, better education, more alcohol consumption, and longer life expectancy simultaneously. The cross-sectional approach cannot distinguish whether these variables causally improve life expectancy or simply correlate with other country characteristics.

The fixed effects model, by controlling for time-invariant country characteristics, provides more credible estimates of how changes in these factors affect life expectancy within countries over time. Key findings include:

  • The GDP-life expectancy relationship, while still positive and significant, is much smaller (a log(GDP) coefficient of 0.16 vs. 0.40 in the cross-section, or roughly 0.0016 vs. 0.004 years per 1% increase), suggesting much of the cross-sectional correlation reflects underlying country differences rather than a direct effect of economic growth.
  • Education remains a strong predictor (0.71 years per additional year of schooling), though smaller than the cross-sectional estimate, confirming its importance for population health.
  • The alcohol consumption coefficient reversed from positive to negative (-0.24 years per liter), suggesting that the cross-sectional positive correlation was likely confounded by wealth. Within countries, increases in alcohol consumption are associated with lower life expectancy.
  • The Hepatitis B vaccination relationship became statistically significant in the panel model, suggesting immunization programs do improve life expectancy when we account for country-specific factors.
  • The lower adjusted \(R^2\) (13%) in the fixed effects model is expected and does not indicate a worse model. It simply reflects that we are now explaining within-country variation over time rather than between-country differences, which are absorbed by the fixed effects.

Because the cross-sectional model showed heteroskedasticity, we used robust standard errors in the panel model, which makes our inference more reliable. The Hausman test's preference for fixed effects over random effects indicates that country-specific characteristics are indeed correlated with our predictors, supporting our modeling choice. Future extensions could explore dynamic panel models, additional health indicators, or interaction effects between development status and other variables.