Advanced Economic Methods

Exercise 5: Student data and functional forms

Overview

This tutorial introduces regressions with different functional forms, using the data we collected in class.

Setup

We do not need to install any new packages for today’s exercise. Start a new script file, then add your header and the appropriate libraries. We only need ggplot2, which we use for plotting.

###########################################################################
# Project: Student data and functional forms
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clear environment and load libraries
rm(list=ls())
library(ggplot2)

# Set working directory
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path

You can retrieve the student data from Lectio and place it in your dat folder with the other data files for the course. It is a .csv file, so we do not need a package to load it; base R handles this with read.csv().

# Load data
d <- read.csv("dat/student_data2.csv")

Let’s first explore the data. We can view the first six rows, with every variable in the data set, using head():

head(d)
##   X height shoe hand sleep sport pet
## 1 1    161   37 16.5     7     1   3
## 2 2    169   41 18.5     6     0   0
## 3 3    165   40 16.5     5     0   0
## 4 4    178   39 19.0     7     0   0
## 5 5    162   38 17.0     7     0   0
## 6 6    175   44 19.0     5     1   2

The variables are defined as:

  • X: Generic ID variable
  • height: Height of each student in centimeters
  • shoe: Shoe size of each student in the EU sizing system
  • hand: The length of each student’s hand in centimeters
  • sleep: The number of hours of sleep each student had last night
  • sport: If the student plays a sport (0 = no, 1 = yes)
  • pet: If the student has a pet (0 = no, 1 = cat, 2 = dog, 3 = other)

We will not use all of the variables in the lesson. For now, let’s focus on the three continuous variables for height, sleep, and hand size. We can use the base commands in R to generate a few generic histograms to see the range and distributions of each variable:

# Histograms
hist(d$height)

hist(d$sleep)

hist(d$hand)
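A numeric summary can complement the histograms. The following is a minimal sketch: it assumes d has been loaded as above, and otherwise falls back to the six rows printed by head(d) purely so the snippet runs on its own. The names vars, var_summary, and var_range are illustrative.

```r
# Fallback for running this snippet on its own: the six rows shown by head(d).
# With the full data set loaded, this block is skipped.
if (!(exists("d") && is.data.frame(d))) {
  d <- data.frame(
    height = c(161, 169, 165, 178, 162, 175),
    hand   = c(16.5, 18.5, 16.5, 19.0, 17.0, 19.0),
    sleep  = c(7, 6, 5, 7, 7, 5)
  )
}

# Minimum, quartiles, mean, and maximum of the three continuous variables
vars <- c("height", "sleep", "hand")
var_summary <- summary(d[, vars])
print(var_summary)

# Range of each variable as a small matrix (row 1 = min, row 2 = max)
var_range <- sapply(d[, vars], range, na.rm = TRUE)
print(var_range)
```

Knowing the observed range of hand is useful later, because nonlinear models can behave oddly outside the range of the data.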

Exploring functional forms

When we start running regressions, we have the option to change the relationships between the variables using transformations such as logarithmic functions, exponents, and reciprocals.

Basic univariate regression

Let’s start with a basic univariate regression looking at the relationship between height and hand size, which can be defined as:

\[ height = \beta_0 + \beta_1 hand + u \]

We write this equation in R as:

reg1 <- lm(height ~ hand, data = d)
summary(reg1)
## 
## Call:
## lm(formula = height ~ hand, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1608 -4.3861 -0.2927  4.6848 10.6936 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.3400     9.7802    7.09 4.34e-09 ***
## hand          5.5456     0.5392   10.28 6.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.647 on 50 degrees of freedom
## Multiple R-squared:  0.679,  Adjusted R-squared:  0.6726 
## F-statistic: 105.8 on 1 and 50 DF,  p-value: 6.21e-14

We find a strong, statistically significant relationship between hand size and height: an increase in hand size of 1 centimeter corresponds to an increase of about 5.5 centimeters in a student’s height on average. We can graph this relationship using the following plot:

ggplot(d, aes(x = hand, y = height)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = TRUE, color = "red") + 
  ggtitle("Univariate regression, height x hand size")
## `geom_smooth()` using formula = 'y ~ x'
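To make the slope interpretation concrete, we can compare predicted heights at two hand sizes that are one centimeter apart. This is a minimal sketch: the hand sizes 18 and 19 and the names new_hands and pred are illustrative choices, and the fallback data frame (the six rows printed by head(d)) exists only so the snippet runs without the course file.

```r
# Fallback for running this snippet on its own: the six rows shown by head(d).
if (!(exists("d") && is.data.frame(d))) {
  d <- data.frame(
    height = c(161, 169, 165, 178, 162, 175),
    hand   = c(16.5, 18.5, 16.5, 19.0, 17.0, 19.0)
  )
}

reg1 <- lm(height ~ hand, data = d)

# Predicted height at two hand sizes one centimeter apart (illustrative values)
new_hands <- data.frame(hand = c(18, 19))
pred <- predict(reg1, newdata = new_hands)
print(pred)

# The gap between the two predictions equals the slope coefficient
print(diff(pred))
print(coef(reg1)["hand"])
```

With the full-sample coefficients reported above, the prediction at 18 cm is 69.34 + 5.5456 × 18, or about 169.2 cm, and the prediction at 19 cm is about 174.7 cm.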

Quadratic term

The plot looks nice, and the findings seem fairly conclusive in describing the relationship between height and hand size. However, if we suspect that a curve would describe the relationship better than a straight line, we can add a quadratic term and fit a polynomial regression:

\[ height = \beta_0 + \beta_1 hand + \beta_2 hand^2 + u\]

We write this equation in R as:

reg2 <- lm(height ~ hand + I(hand^2), data = d)
summary(reg2)
## 
## Call:
## lm(formula = height ~ hand + I(hand^2), data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8372 -3.9596 -0.4853  4.1381 11.8171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 282.6568    84.5926   3.341   0.0016 **
## hand        -17.8544     9.2376  -1.933   0.0591 . 
## I(hand^2)     0.6376     0.2513   2.537   0.0144 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.363 on 49 degrees of freedom
## Multiple R-squared:  0.7163, Adjusted R-squared:  0.7047 
## F-statistic: 61.85 on 2 and 49 DF,  p-value: 3.946e-14

The quadratic term \(hand^2\) is statistically significant, suggesting some curvature in the relationship. The model shows a slightly higher \(R^2\) (0.716 vs 0.679), indicating it explains marginally more variance in height. However, the individual coefficients are harder to interpret directly because the linear term becomes negative (-17.85) while the quadratic term is positive (0.638), creating a U-shaped parabola. Let’s plot it to visualize the differences in the models.

ggplot(d, aes(x = hand, y = height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), 
              se = TRUE, color = "red") +
  ggtitle("Quadratic term")

While this does look interesting, it complicates the interpretation. Unlike the simple model, where we could say “a 1 cm increase in hand size corresponds to a 5.5 cm increase in height,” the effect in the quadratic model depends on where you start. At smaller hand sizes the relationship may be weaker or even slightly negative, and it steepens at larger hand sizes because of the positive quadratic term.
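To see how the effect depends on the starting point, we can differentiate the quadratic model with respect to hand size. This is a worked calculation using the rounded reg2 coefficients reported above, so the figures are approximate:

\[ \frac{\partial height}{\partial hand} = \beta_1 + 2 \beta_2 hand \approx -17.85 + 1.275 \, hand \]

At a hand size of 16 cm the implied slope is about 2.5 cm of height per centimeter of hand, and at 20 cm it is about 7.6 cm. Setting the derivative to zero gives the turning point of the parabola:

\[ hand^* = -\frac{\beta_1}{2 \beta_2} \approx \frac{17.85}{1.275} \approx 14.0 \]

The fitted curve therefore slopes downward only for hand sizes below roughly 14 cm.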

We could probably conclude that reg1 is a better model for this analysis. The improvement in \(R^2\) is rather modest (0.037), and the linear model provides a clearer, more interpretable relationship. The quadratic form may be overfitting, and the negative linear coefficient seems counterintuitive for this relationship. Unless there’s strong theoretical or visual evidence of curvature, the simpler linear model is likely preferable in this case.

Reciprocal term

Let’s assume that we still want to try improving our initial univariate model. We can also try a reciprocal term for hand size. This would be modeled as:

\[ height = \beta_0 + \beta_1 \frac{1}{hand} + u \]

The regression would look like this:

reg3 <- lm(height ~ I(1/hand), data = d)
summary(reg3)
## 
## Call:
## lm(formula = height ~ I(1/hand), data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7930 -4.8532 -0.6111  4.4470 12.7783 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   267.71      10.55  25.376  < 2e-16 ***
## I(1/hand)   -1762.45     188.92  -9.329 1.58e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.021 on 50 degrees of freedom
## Multiple R-squared:  0.6351, Adjusted R-squared:  0.6278 
## F-statistic: 87.03 on 1 and 50 DF,  p-value: 1.578e-12

Here, we find that the reciprocal term \(\frac{1}{hand}\) is highly significant (p = 1.58e-12). This functional form imposes a nonlinear relationship in which the effect of hand size diminishes as hand size increases. The coefficient of -1762.45 means that as hand size increases, the negative reciprocal term shrinks toward zero, so height increases but at a decreasing rate.

Interpretation: This functional form suggests that going from, say, 15 cm to 16 cm hand size has a larger impact on height than going from 20 cm to 21 cm. The relationship follows a curve that rises steeply at smaller hand sizes then flattens out.
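This comparison can be checked with the rounded reg3 coefficient reported above. The following is a worked sketch, so the figures are approximate. The predicted change in height when hand size moves from \(a\) to \(b\) is:

\[ \Delta height = \beta_1 \left( \frac{1}{b} - \frac{1}{a} \right) \]

\[ 15 \rightarrow 16: \; -1762.45 \left( \frac{1}{16} - \frac{1}{15} \right) \approx 7.3 \qquad 20 \rightarrow 21: \; -1762.45 \left( \frac{1}{21} - \frac{1}{20} \right) \approx 4.2 \]

The same one-centimeter step is worth about 7.3 cm of height at the lower end but only about 4.2 cm at the upper end.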

We can graph it by adding this to the formula command in ggplot:

ggplot(d, aes(x = hand, y = height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ I(1/x), 
              se = TRUE, color = "red") +
  ggtitle("Reciprocal")

Comparing univariate, quadratic, and reciprocal

In the table below, we can evaluate the fit statistics of all three models:

Model Comparison

Model        R²      Adj. R²   Interpretation
Linear       0.679   0.673     Simplest, clearest
Quadratic    0.716   0.705     Best fit, but harder to interpret
Reciprocal   0.635   0.628     Worst fit

Conclusion: The linear model reg1 remains the best choice. While the quadratic model has marginally better fit statistics, the improvement is modest and comes at the cost of interpretability. The reciprocal model actually performs worse than the linear model (lower \(R^2\)). For a straightforward relationship like hand size and height, the simple linear model provides the best balance of fit, interpretability, and parsimony. In a simple analysis like this, simple models are usually the best bet.
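The fit statistics in the comparison can also be collected programmatically rather than copied from each summary() output. This is a minimal sketch: the names models and fit_table are illustrative, and the fallback data frame (the six rows printed by head(d)) exists only so the snippet runs without the course file, in which case the numbers will differ from those reported above.

```r
# Fallback for running this snippet on its own: the six rows shown by head(d).
if (!(exists("d") && is.data.frame(d))) {
  d <- data.frame(
    height = c(161, 169, 165, 178, 162, 175),
    hand   = c(16.5, 18.5, 16.5, 19.0, 17.0, 19.0)
  )
}

# Refit the three specifications and keep them in a named list
models <- list(
  Linear     = lm(height ~ hand, data = d),
  Quadratic  = lm(height ~ hand + I(hand^2), data = d),
  Reciprocal = lm(height ~ I(1/hand), data = d)
)

# Pull R-squared and adjusted R-squared out of each model summary
fit_table <- data.frame(
  Model  = names(models),
  R2     = sapply(models, function(m) summary(m)$r.squared),
  Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)
print(fit_table, digits = 3)
```

All three models share the same dependent variable, height, so their \(R^2\) values can be compared directly.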

Log transformations

Another common functional form is the log transformation, which is useful when relationships appear to be multiplicative rather than additive, or when we want to interpret results in terms of percentage changes. We can apply the log transformation to the dependent variable, the independent variable, or both.

Log-linear model (log Y)

If we transform only the dependent variable as:

\[ log(height) = \beta_0 + \beta_1 hand + u\]

The regression would be:

reg4 <- lm(log(height) ~ hand, data = d)
summary(reg4)
## 
## Call:
## lm(formula = log(height) ~ hand, data = d)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.054095 -0.024569 -0.000682  0.025615  0.061141 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.551348   0.057584   79.04  < 2e-16 ***
## hand        0.032107   0.003175   10.11  1.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03325 on 50 degrees of freedom
## Multiple R-squared:  0.6716, Adjusted R-squared:  0.6651 
## F-statistic: 102.3 on 1 and 50 DF,  p-value: 1.102e-13

Interpretation: A one-centimeter increase in hand size is associated with approximately a 3.2% increase in height (0.032 × 100). This model has an \(R^2\) of 0.672, close to that of the simple linear model, although the two values are not strictly comparable because the dependent variables differ. The main difference is that the relationship is now expressed in percentage terms.
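Multiplying the coefficient by 100 is an approximation that works well for small coefficients. As a check, the exact percentage change implied by a log-linear model can be computed from the reported coefficient:

\[ \% \Delta height = 100 \left( e^{\beta_1} - 1 \right) = 100 \left( e^{0.0321} - 1 \right) \approx 3.26 \]

The approximate and exact values are nearly identical here because the coefficient is small.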

Linear-log model

If we transform only the independent variable:

\[ height = \beta_0 + \beta_1 log(hand) + u\]

The regression would be:

reg5 <- lm(height ~ log(hand), data = d)
summary(reg5)
## 
## Call:
## lm(formula = height ~ log(hand), data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4808 -4.6069 -0.4752  4.6036 11.5234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -118.07      29.28  -4.033 0.000188 ***
## log(hand)      99.49      10.12   9.830 2.86e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.82 on 50 degrees of freedom
## Multiple R-squared:  0.659,  Adjusted R-squared:  0.6522 
## F-statistic: 96.62 on 1 and 50 DF,  p-value: 2.856e-13

Interpretation: A 1% increase in hand size is associated with an approximately 0.99 cm increase in height (99.49 ÷ 100). This specification has an \(R^2\) of 0.659 and implies diminishing returns: a given absolute change in hand size has a bigger impact on height at smaller hand sizes.
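Dividing the coefficient by 100 is an approximation that holds for small percentage changes. A worked check with the reported coefficient, for a 1% and a 10% increase in hand size, where log is the natural logarithm as in R:

\[ \Delta height = \beta_1 \left[ \log(1.01 \, hand) - \log(hand) \right] = 99.49 \cdot \log(1.01) \approx 0.99 \]

\[ \Delta height = 99.49 \cdot \log(1.10) \approx 9.48 \]

For the larger change, the exact effect of about 9.5 cm falls slightly below the 9.9 cm that the divide-by-100 rule would suggest.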

Log-log model

If we transform both variables as:

\[ log(height) = \beta_0 + \beta_1 log(hand) + u\]

Then the regression is:

reg6 <- lm(log(height) ~ log(hand), data = d)
summary(reg6)
## 
## Call:
## lm(formula = log(height) ~ log(hand), data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05595 -0.02605 -0.00114  0.02790  0.06674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.46387    0.17174  20.170  < 2e-16 ***
## log(hand)    0.57683    0.05937   9.716  4.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03414 on 50 degrees of freedom
## Multiple R-squared:  0.6537, Adjusted R-squared:  0.6468 
## F-statistic:  94.4 on 1 and 50 DF,  p-value: 4.204e-13

Interpretation: A 1% increase in hand size is associated with an approximately 0.58% increase in height. In a log-log model the slope is an elasticity, so no rescaling by 100 is needed. This specification has an \(R^2\) of 0.654, the lowest of the three log models, although it is directly comparable only with the log-linear model, which shares the same dependent variable.
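In a log-log specification the slope coefficient is read as an elasticity. A worked sketch using the rounded reg6 coefficient reported above:

\[ \% \Delta height \approx \beta_1 \cdot \% \Delta hand = 0.577 \cdot 1 \approx 0.58 \]

For larger changes the exact effect follows from the implied power relationship. For a 10% increase in hand size:

\[ \% \Delta height = 100 \left( 1.10^{\beta_1} - 1 \right) = 100 \left( 1.10^{0.577} - 1 \right) \approx 5.65 \]

This is slightly below the 5.8% that the elasticity approximation would give.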