Advanced Economic Methods
Exercise 8: Analyzing life expectancy with cross-sectional and panel data models
Overview
The following tutorial will explore relationships between life expectancy and key socioeconomic variables using a longitudinal data set. If you have not done the previous tutorials, go back to the introductory lesson here to learn how to set up the appropriate working paths for the code to work.
Setup
Header
Start a new script file and add your header and the appropriate
libraries. The data set is a .csv file so base
R can open the file. We will use the dplyr
package for data organization, the car and
lmtest packages for diagnostic tests on our
regressions, and the plm package to run panel regressions.
You can install the latter package by copying and pasting
install.packages("plm") into your console and hitting
enter.
###########################################################################
# Project: Cross-section and panel models
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(dplyr)
library(car)
library(lmtest)
library(plm)
# Set working drive
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path
Data
For this analysis, we will download a data set from Kaggle. This can
be a useful website for finding practice data sets. We will use a data
set on life expectancy determinants, which can be found here.
Download the data file, and store it in your dat folder so
it is consistent with all other lessons we have done so far.
Now let’s take a look at the data:
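The code that produced the output below is not shown, so here is a minimal sketch of how the file could be loaded and inspected. The path and file name in data_path are assumptions (use whatever name your download has), and the object name d is taken from the model output later in this tutorial.

```r
# Hypothetical path and file name: adjust to match your dat folder and download
data_path <- "dat/life_expectancy.csv"

if (file.exists(data_path)) {
  # Base R reads the .csv file into a data frame
  d <- read.csv(data_path)
  # Show the structure: variable names, types, and the first few values
  str(d)
} else {
  message("File not found, check the path: ", data_path)
}
```

If your column names differ from the output shown here (for example, extra trailing dots), compare names(d) with the listing below and rename as needed.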
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
Please refer back to the website of the data source for complete descriptions of the variables included in the data set, as this tutorial will assume knowledge of the definitions.
Note that Status is a dummy variable with characters
instead of numbers. Just so there are no issues, we will transform it
into a factor variable by creating a new variable called
developed and defining the categories to make it very clear
that country \(i\) is developed if this
variable is equal to 1, and developing if it is equal to 0.
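One way to create this variable is sketched below. It assumes the data frame is named d and that the two categories in Status are spelled "Developed" and "Developing"; check unique(d$Status) if unsure.

```r
# Assumes d is the data frame loaded from the .csv file
# developed = 1 for developed countries, 0 for developing countries
d$developed <- factor(ifelse(d$Status == "Developed", 1, 0), levels = c(0, 1))

# Cross-tabulate the old and new variables to confirm the mapping
table(d$Status, d$developed)
```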
Cross-section regression
For comparison purposes, we will first fit a cross-section regression to get familiarized with the relationships between life expectancy and the independent variables we select for the model. This tutorial will use only a subset of the available variables for demonstration purposes. However, you are encouraged to explore the data set more and build a more comprehensive model.
Data prep
To run a cross-sectional regression, we have to choose one of the
possible 16 years available in the data set. Some earlier explorations
found that 2015 had some missing values for some of the key indicators,
so we will use the 2014 data since it is the most recent comprehensive
year in the data set. To do this, we can create a new data object called
d_2014 that contains only the data for 2014:
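A minimal sketch of this step using dplyr, assuming the full data frame is named d and already contains the developed factor:

```r
library(dplyr)

# Keep only the rows observed in 2014 (assumes d with the developed factor exists)
d_2014 <- d %>% filter(Year == 2014)

# One row per country should remain
nrow(d_2014)
```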
Regression
Let’s now run a basic regression using the following functional form:
\[exp_i = \beta_0 + \beta_1 log(GDP)_i + \beta_2 edu_i + \beta_3 alcohol_i + \beta_4 dev(0/1)_i + \beta_5 mort_i + \beta_6 hep_i + u_i\]
where \(exp_i\) is the life expectancy of country \(i\), \(log(GDP)_i\) is the natural logarithm of GDP so we can interpret changes in GDP in percentage terms, \(edu_i\) is the mean number of years of education, \(alcohol_i\) is the average number of liters of pure alcohol consumed per capita in country \(i\), \(dev(0/1)_i\) is a dummy variable equal to 1 if country \(i\) is developed and 0 if it is developing, \(mort_i\) is the adult mortality rate per 1000 people, \(hep_i\) is the immunization rate for Hepatitis B in country \(i\), and \(u_i\) is the error term. We write the model syntax as:
reg1 <- lm(Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed
+ Adult.Mortality + Hepatitis.B, data = d_2014)
summary(reg1)
##
## Call:
## lm(formula = Life.expectancy ~ log(GDP) + Schooling + Alcohol +
## developed + Adult.Mortality + Hepatitis.B, data = d_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.474 -2.140 0.376 2.463 7.541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.332665 2.659895 21.554 < 2e-16 ***
## log(GDP) 0.398207 0.201481 1.976 0.0501 .
## Schooling 1.090051 0.190034 5.736 5.87e-08 ***
## Alcohol 0.206120 0.099262 2.077 0.0397 *
## developed1 2.167519 1.078742 2.009 0.0465 *
## Adult.Mortality -0.036726 0.003582 -10.254 < 2e-16 ***
## Hepatitis.B 0.015945 0.014286 1.116 0.2663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.768 on 138 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8013
## F-statistic: 97.81 on 6 and 138 DF, p-value: < 2.2e-16
Interpretations
- The intercept is 57.33, meaning that the predicted life expectancy of country \(i\) is just over 57 years, assuming that it has a log(GDP) of zero (a GDP of 1), no education, no alcohol consumption, is a developing country, has no adult mortality, and does not have any Hepatitis B immunization
- An increase in GDP of 1% corresponds to an increase of about 0.004 years of life expectancy on average (the coefficient of 0.40 divided by 100); this relationship is only marginally significant, with a p-value of about 0.05
- An increase in the average level of education by 1 year corresponds to an increase in life expectancy of about 1.09 years on average
- There is a positive relationship between alcohol consumption and life expectancy, where an increase in the consumption of alcohol by 1 liter per person corresponds to an increase in life expectancy of about 0.21 years
- The difference in life expectancy between developed and developing countries is about 2.17 years, assuming all else is equal
- An increase in the adult mortality rate by 1 case per 1000 people corresponds to a decrease in life expectancy of about 0.04 years, on average
- We find evidence suggesting a positive relationship between Hepatitis B immunization and life expectancy, but this is not a statistically significant finding
- The adjusted \(R^2\) indicates that approximately 80% of the variation in life expectancy is explained by the variables we included in the model
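Because GDP enters the model in logs while life expectancy is measured in years, the coefficient on log(GDP) gives the change in years for a one-unit change in log(GDP). A 1% increase in GDP changes log(GDP) by log(1.01), which is roughly 0.01. The arithmetic can be checked with a short sketch (the variable names are illustrative; the coefficient is copied from the reg1 output):

```r
# Coefficient on log(GDP) copied from the reg1 output above
b_gdp <- 0.398207

# Predicted change in life expectancy (years) for a 1% increase in GDP
change_1pct <- b_gdp * log(1.01)

# Predicted change in life expectancy (years) if GDP doubles
change_double <- b_gdp * log(2)

round(c(one_percent = change_1pct, doubling = change_double), 4)
```

A 1% change therefore moves predicted life expectancy by only a few thousandths of a year, while a doubling of GDP corresponds to roughly a quarter of a year.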
Overall, the model seems to be a good fit. Some of the relationships are expected, e.g. there is a known positive relationship between GDP and life expectancy, and a decrease in adult mortality should correspond to a higher life expectancy. The relationship between alcohol and life expectancy seems a bit strange, and there is certainly not a causal relationship between these. However, it does make some sense since many higher income countries tend to consume more alcohol than developing countries. Examples could include the wine regions in Greece or Italy, and the beer-drinking countries such as Germany.
Testing the robustness of the model
Let’s run a few quick tests to make sure the model is statistically
sound before we move on. Since it is a multivariate regression, a good
place to start is to test for multicollinearity. We can do this using
the vif() command, which estimates a variance inflation
factor. All values should be below 10, and below 5 is ideal:
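A minimal sketch of the call, assuming the fitted model object reg1 from above:

```r
library(car)

# Variance inflation factor for each regressor in reg1
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# regressor j on all the other regressors
vif(reg1)
```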
## log(GDP) Schooling Alcohol developed Adult.Mortality
## 1.523275 2.798845 1.635852 1.471820 1.552662
## Hepatitis.B
## 1.095457
The values here are great, so let’s quickly also test for heteroskedasticity using the Breusch–Pagan test. In this test, the null hypothesis is that we have homoskedastic errors.
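The test can be run with bptest() from the lmtest package. This is a sketch assuming the fitted model reg1 from above; bp_result is an illustrative name. By default the function reports the studentized version of the test.

```r
library(lmtest)

# Breusch-Pagan test on the cross-sectional model
# H0: the error variance is constant (homoskedasticity)
bp_result <- bptest(reg1)
bp_result
```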
##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 21.671, df = 6, p-value = 0.001389
Oops, we have heteroskedasticity! We reject the null hypothesis
since the p-value is very low. Heteroskedasticity does not bias the
coefficient estimates, but it makes the standard errors, and therefore
the statistical significance, reported for reg1 unreliable. We will run
the panel models using robust standard errors so that our inference
remains valid.
Panel data regression
While the cross-sectional regression seems to have a fair amount of explanatory power, we can improve on it by using more of the data. We can define a new model with the full observation period provided by the data set (2000-2015). A new functional form to incorporate multiple time periods can be defined as:
\[exp_{it} = \beta_0 + \beta_1 log(GDP)_{it} + \beta_2 edu_{it} + \beta_3 alcohol_{it} + \beta_4 dev(0/1)_{it} + \beta_5 mort_{it} + \beta_6 hep_{it} + u_{it}\]
where the variables are defined as in the cross-sectional model (with \(mort\) denoting the adult mortality rate included in the fitted model), but we add the subscript \(t\) to denote that all observed values of the dependent and independent variables are time-specific.
We also have to understand that once we start incorporating multiple time periods of the same countries, there will be unobserved heterogeneity. This refers to country-specific characteristics that affect life expectancy but are not included in our model. These are factors that differ across countries but remain relatively constant over time. Examples include:
- Geographic characteristics (climate, natural resources, landlocked vs. coastal)
- Cultural factors (dietary habits, social norms, family structures)
- Historical context (colonial history, past conflicts)
- Institutional quality (strength of healthcare systems, governance structures)
- Genetic or demographic composition of the population
The problem with unobserved heterogeneity is that these unmeasured factors may be correlated with our independent variables, leading to omitted variable bias in our estimates. For instance, countries with strong institutions may have both higher GDP and better health outcomes, but the cross-sectional model cannot distinguish whether GDP itself improves health or whether both are driven by good institutions.
Panel data models address this issue by explicitly accounting for these country-specific effects, allowing us to obtain more reliable estimates of how our independent variables truly affect life expectancy.
Fixed effects vs random effects
Panel data models account for unobserved heterogeneity across countries (or entities) in different ways:
- Fixed effects (FE): This model assumes that each country has its own unique characteristics (e.g., culture, geography, institutions) that are correlated with the independent variables. The FE model controls for these time-invariant differences by essentially giving each country its own intercept. This approach focuses on within-country variation over time, asking “how do changes in GDP, education, etc. affect changes in life expectancy within the same country?”
- Random effects (RE): This model assumes that country-specific characteristics are random and uncorrelated with the independent variables. The RE model is more efficient when this assumption holds, as it uses both between-country and within-country variation. However, if country characteristics are correlated with the predictors (e.g., wealthier countries have both higher GDP and better healthcare systems), the RE estimates will be biased.
The Hausman test helps determine which approach is appropriate by testing whether the country-specific effects are correlated with the regressors.
Fixed effects model
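The fixed effects model can be estimated with plm() using model = "within". This is a sketch assuming the full data frame d with the developed factor; the object name fe is illustrative. The model output shown later omits the index argument, in which case plm treats the first two columns (here Country and Year) as the panel identifiers; it is spelled out here for clarity.

```r
library(plm)

# Fixed effects ("within") model on the full panel
# index names the country and time identifiers
fe <- plm(Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed
          + Adult.Mortality + Hepatitis.B,
          data = d, index = c("Country", "Year"), model = "within")
```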
Random effects model
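The random effects model uses the same formula with model = "random". Again a sketch assuming the data frame d; the object name re is illustrative.

```r
library(plm)

# Random effects model on the full panel, same specification as the FE model
re <- plm(Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed
          + Adult.Mortality + Hepatitis.B,
          data = d, index = c("Country", "Year"), model = "random")
```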
Compare fixed and random effects using Hausman test
We can test which model is the most statistically appropriate using the Hausman test. In this test, the null hypothesis is that the country-specific effects are uncorrelated with the regressors, in which case the random effects model is preferred because it is more efficient.
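The plm package provides phtest() for this comparison. A sketch, assuming a fitted fixed effects model and a fitted random effects model stored under the illustrative names fe and re:

```r
library(plm)

# Hausman test comparing the fixed effects and random effects estimates
# A small p-value favors the fixed effects model
hausman <- phtest(fe, re)
hausman
```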
##
## Hausman Test
##
## data: Life.expectancy ~ log(GDP) + Schooling + Alcohol + developed + ...
## chisq = 467.2, df = 5, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
We reject the null hypothesis and conclude that the fixed effects model is better in this case.
Fixed effects model results
Now we can look at the results of the model; however, the standard
summary() command will not show us robust standard errors. It is
still useful for looking at the regression coefficients and the \(R^2\), so we use it for the basic
interpretations, then check the robust standard errors to make sure the
relationships are still statistically significant.
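Assuming the fixed effects model is stored in an object named fe (an illustrative name), the standard output is obtained with:

```r
library(plm)

# Standard (non-robust) summary of the fixed effects model
fe_summary <- summary(fe)
fe_summary
```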
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = Life.expectancy ~ log(GDP) + Schooling + Alcohol +
## developed + Adult.Mortality + Hepatitis.B, data = d, model = "within")
##
## Unbalanced Panel: n = 148, T = 1-16, N = 1865
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -7.64595 -0.91330 -0.13349 0.52391 12.40247
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## log(GDP) 0.15503140 0.03864081 4.0121 6.277e-05 ***
## Schooling 0.71358589 0.05003509 14.2617 < 2.2e-16 ***
## Alcohol -0.24408717 0.03566393 -6.8441 1.068e-11 ***
## Adult.Mortality -0.00370095 0.00063581 -5.8208 6.976e-09 ***
## Hepatitis.B 0.01370162 0.00243700 5.6223 2.196e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 8773.3
## Residual Sum of Squares: 7035.8
## R-Squared: 0.19804
## Adj. R-Squared: 0.12684
## F-statistic: 84.556 on 5 and 1712 DF, p-value: < 2.22e-16
Interpretations
- The dummy variable \(dev\) has been dropped from the regression because it is time invariant
- An increase in GDP of 1% corresponds to an increase in life expectancy of about 0.0016 years on average (the coefficient of 0.155 divided by 100). This relationship is considerably smaller than what we observed in the cross-sectional model
- An increase in the average level of education by 1 year corresponds to an increase in the life expectancy by about 0.71 years. This is also a smaller relationship than the cross-sectional model
- An increase in alcohol consumption by 1 liter per person corresponds to a decrease in life expectancy by about 0.24 years on average. This variable flipped signs compared to the cross-sectional regression
- An increase in adult mortality by 1 person per 1000 citizens corresponds to a decrease in life expectancy of about 0.004 years. This is roughly a 10-fold decrease in the effect size compared to the cross-sectional model
- An increase in the Hepatitis B vaccination rate by 1 percentage point corresponds to an increase in life expectancy of about 0.014 years. This is similar to the cross-sectional model; however, the results now show a statistically significant relationship when the cross-sectional model did not
- We now have an adjusted \(R^2\) of about 13%, which is drastically lower than the 80% reported in the cross-sectional model
Model results with robust standard errors
For the final step, let’s take a look at the coefficients with robust standard errors. The coefficient estimations will remain unchanged, but the standard errors and corresponding p-values will be adjusted.
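One common way to do this is to combine coeftest() from lmtest with the vcovHC() method that plm provides for panel models. The sketch below assumes the fixed effects model is stored as fe (an illustrative name) and uses the Arellano estimator clustered by country. Which settings produced the output shown here is not stated, so the exact standard errors may differ slightly with other choices of type.

```r
library(lmtest)
library(plm)

# Coefficient table with heteroskedasticity-robust standard errors,
# clustered by country (group); the estimates themselves do not change
robust <- coeftest(fe, vcov. = vcovHC(fe, method = "arellano",
                                      type = "HC0", cluster = "group"))
robust
```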
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## log(GDP) 0.1550314 0.0471117 3.2907 0.001020 **
## Schooling 0.7135859 0.2536465 2.8133 0.004959 **
## Alcohol -0.2440872 0.0790040 -3.0896 0.002037 **
## Adult.Mortality -0.0037010 0.0016981 -2.1795 0.029431 *
## Hepatitis.B 0.0137016 0.0042509 3.2232 0.001292 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results here look good. All variables remain statistically
significant, though the p-values are larger than in the standard output
provided by the summary() command because the robust standard errors
are larger.
Conclusion
This analysis demonstrates the importance of accounting for panel structure when analyzing longitudinal data. Comparing the cross-sectional regression (2014 only) with the fixed effects panel model reveals substantial differences in both effect sizes and interpretations.
The cross-sectional model produced an adjusted \(R^2\) of 80% and showed strong positive relationships for most variables. However, this high explanatory power was partly driven by between-country differences—wealthy developed nations tend to have higher GDP, better education, more alcohol consumption, and longer life expectancy simultaneously. The cross-sectional approach cannot distinguish whether these variables causally improve life expectancy or simply correlate with other country characteristics.
The fixed effects model, by controlling for time-invariant country characteristics, provides more credible estimates of how changes in these factors affect life expectancy within countries over time. Key findings include:
- The GDP-life expectancy relationship, while still positive and significant, is much smaller (a coefficient of 0.155 on log(GDP) vs. 0.40 in the cross-section, or roughly 0.0016 vs. 0.004 years per 1% increase), suggesting much of the cross-sectional correlation reflects underlying country differences rather than a direct effect of economic growth.
- Education remains a strong predictor (0.71 years per additional year of schooling), though smaller than the cross-sectional estimate, confirming its importance for population health.
- The alcohol consumption coefficient reversed from positive to negative (-0.24 years per liter), revealing that the cross-sectional positive correlation was likely confounded by wealth—the fixed effects model shows that increases in alcohol consumption within countries actually reduce life expectancy.
- The Hepatitis B vaccination relationship became statistically significant in the panel model, suggesting immunization programs do improve life expectancy when we account for country-specific factors.
- The lower adjusted \(R^2\) (13%) in the fixed effects model is expected and does not indicate a worse model. It simply reflects that we are now explaining within-country variation over time rather than between-country differences, which are absorbed by the fixed effects.
Because the cross-sectional model showed heteroskedasticity, we used robust standard errors in the panel model so that our inference remains reliable. The Hausman test's preference for fixed effects over random effects indicates that country-specific characteristics are indeed correlated with our predictors, validating our modeling choice. Future extensions could explore dynamic panel models, additional health indicators, or interaction effects between development status and other variables.