Advanced Economic Methods
Exercise 5: Student data and functional forms
Overview
The following tutorial will introduce you to running regressions with different functional forms with the data we collected in class.
Setup
We do not need to install any new packages for today’s exercise.
Start a new script file, then add your header and load the libraries
you need. We only need ggplot2, which we use for the
plots.
###########################################################################
# Project: Student data and functional forms
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clear environment and load libraries
rm(list=ls())
library(ggplot2)
# Set working directory
setwd("C:/Projects/applied-econometrics/R") # Replace with your directory path
You can retrieve the student data from Lectio and place it in your
dat folder with the other data files for the course. It is
a .csv file, so we do not need a package to load it:
R can read it with the base function
read.csv().
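A minimal sketch of the loading step, assuming the data frame is called d (the name that appears in the regression output later in this exercise). The course file itself is not reproduced here, so the sketch writes the six rows printed by head() in this exercise to a temporary file and reads that back. In your own script, point read.csv() at the file in your dat folder instead.

```r
# Stand-in for the course file: the six rows printed by head() in this exercise.
# The temporary file is an assumption made so that the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,height,shoe,hand,sleep,sport,pet",
  "1,161,37,16.5,7,1,3",
  "2,169,41,18.5,6,0,0",
  "3,165,40,16.5,5,0,0",
  "4,178,39,19.0,7,0,0",
  "5,162,38,17.0,7,0,0",
  "6,175,44,19.0,5,1,2"
), csv_path)

# read.csv() is part of base R, so no extra package is needed
d <- read.csv(csv_path)

head(d)  # preview the first rows
str(d)   # check the type of each variable
```

Running str() right after loading is a quick way to confirm that the numeric columns were read as numbers rather than text.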
Let’s first explore the data. We can preview the first six rows, with every
variable in the data set, using head():
head(d)
## X height shoe hand sleep sport pet
## 1 1 161 37 16.5 7 1 3
## 2 2 169 41 18.5 6 0 0
## 3 3 165 40 16.5 5 0 0
## 4 4 178 39 19.0 7 0 0
## 5 5 162 38 17.0 7 0 0
## 6 6 175 44 19.0 5 1 2
The variables are defined as:
- X: Generic ID variable
- height: Height of each student in centimeters
- shoe: Shoe size of each student in the EU sizing system
- hand: The length of each student’s hand in centimeters
- sleep: The number of hours of sleep each student had last night
- sport: If the student plays a sport (0 = no, 1 = yes)
- pet: If the student has a pet (0 = no, 1 = cat, 2 = dog, 3 = other)
We will not use all of the variables in this lesson. For now, let’s
focus on the three continuous variables: height, sleep, and hand
size. We can use base R commands to draw a few quick
histograms that show the range and distribution of each
variable.
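The histogram step could look like the sketch below. It assumes a data frame d with the columns height, hand, and sleep, and it rebuilds only the six rows shown in the head() output so that it runs on its own. With the full class data the shapes will of course differ.

```r
# First six rows from the head() output; the full class data set has more rows
d <- data.frame(
  height = c(161, 169, 165, 178, 162, 175),
  hand   = c(16.5, 18.5, 16.5, 19.0, 17.0, 19.0),
  sleep  = c(7, 6, 5, 7, 7, 5)
)

# One base-R histogram per variable
hist(d$height, main = "Height",    xlab = "Height (cm)")
hist(d$hand,   main = "Hand size", xlab = "Hand length (cm)")
hist(d$sleep,  main = "Sleep",     xlab = "Hours of sleep last night")

# range() returns the minimum and maximum of each variable
sapply(d, range)
```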
Exploring functional forms
When we run regressions, we can change the functional form of the relationship between the variables by transforming them, for example with logarithms, powers, and reciprocals.
Basic univariate regression
Let’s start with a basic univariate regression looking at the relationship between height and hand size, which can be defined as:
\[ height = \beta_0 + \beta_1 hand + u \]
We write this equation in R as:
reg1 <- lm(height ~ hand, data = d)
summary(reg1)
##
## Call:
## lm(formula = height ~ hand, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1608 -4.3861 -0.2927 4.6848 10.6936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.3400 9.7802 7.09 4.34e-09 ***
## hand 5.5456 0.5392 10.28 6.21e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.647 on 50 degrees of freedom
## Multiple R-squared: 0.679, Adjusted R-squared: 0.6726
## F-statistic: 105.8 on 1 and 50 DF, p-value: 6.21e-14
We find a strong, statistically significant relationship between hand size and height: a 1 centimeter increase in hand size is associated with an increase of about 5.5 centimeters in a student’s height on average. We can graph this relationship using the following plot:
ggplot(d, aes(x = hand, y = height)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  ggtitle("Univariate regression, height x hand size")
## `geom_smooth()` using formula = 'y ~ x'
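To try the fitting step end to end without the course file, here is a small sketch that uses only the six rows printed by head() earlier. The name reg_small is made up for this illustration. Because it uses six rows rather than the full class data, its coefficients will differ from the output above.

```r
# First six rows from the head() output (not the full data set)
d <- data.frame(
  height = c(161, 169, 165, 178, 162, 175),
  hand   = c(16.5, 18.5, 16.5, 19.0, 17.0, 19.0)
)

# Simple linear regression of height on hand size
reg_small <- lm(height ~ hand, data = d)  # hypothetical object name
coef(reg_small)

# Predicted heights for hand sizes of 17 cm and 18 cm
pred <- predict(reg_small, newdata = data.frame(hand = c(17, 18)))
pred

# The difference between the two predictions is the slope:
# the change in predicted height for a 1 cm increase in hand size
diff(pred)
```

This is the same reading of the slope as in the interpretation above: in a linear model, every 1 cm step in hand size changes the prediction by the same amount.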
Quadratic term
The plot looks nice and the findings seem rather conclusive in describing the relationship between height and hand size. However, if we suspect that this relationship between the variables would be better described using a curved line instead of a straight line, we can introduce a quadratic term to model a curvilinear relationship using polynomial regression:
\[ height = \beta_0 + \beta_1 hand + \beta_2 hand^2 + u\]
We write this equation in R as:
summary(lm(height ~ hand + I(hand^2), data = d))
##
## Call:
## lm(formula = height ~ hand + I(hand^2), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8372 -3.9596 -0.4853 4.1381 11.8171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 282.6568 84.5926 3.341 0.0016 **
## hand -17.8544 9.2376 -1.933 0.0591 .
## I(hand^2) 0.6376 0.2513 2.537 0.0144 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.363 on 49 degrees of freedom
## Multiple R-squared: 0.7163, Adjusted R-squared: 0.7047
## F-statistic: 61.85 on 2 and 49 DF, p-value: 3.946e-14
The quadratic term \(hand^2\) is statistically significant at the 5% level (p = 0.014), suggesting some curvature in the relationship. The model has a slightly higher \(R^2\) (0.716 vs 0.679) and adjusted \(R^2\) (0.705 vs 0.673), so it explains marginally more of the variance in height. However, the individual coefficients are harder to interpret directly: the linear term is negative (-17.85) while the quadratic term is positive (0.638), which describes a U-shaped parabola. Let’s plot it to see how the models differ.
ggplot(d, aes(x = hand, y = height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2),
              se = TRUE, color = "red") +
  ggtitle("Quadratic term")
While this does look interesting, it complicates the interpretation. Unlike the simple model, where we could say that a 1 cm increase in hand size is associated with a 5.5 cm increase in height, the effect in the quadratic model depends on where you start: the slope is \(\beta_1 + 2\beta_2 hand\). With these estimates the slope is negative only for hand sizes below about 14 cm, and it becomes steeper as hand size increases because the quadratic term is positive.
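To make "depends on where you start" concrete, the slope of the quadratic model is the derivative of the fitted curve with respect to hand size, \(\beta_1 + 2\beta_2 hand\). The sketch below plugs in the two coefficients from the regression output. The helper name slope_at is an assumption made for this example.

```r
# Coefficients copied from the quadratic regression output
b1 <- -17.8544  # coefficient on hand
b2 <- 0.6376    # coefficient on I(hand^2)

# Slope of the fitted curve at a given hand size: b1 + 2 * b2 * hand
slope_at <- function(hand) b1 + 2 * b2 * hand  # hypothetical helper

# Slopes at the smallest and largest hand sizes in the head() output
slope_at(c(16.5, 19))

# Hand size at which the parabola reaches its minimum (slope = 0)
turning_point <- -b1 / (2 * b2)
turning_point
```

The slope is roughly 3.2 cm per cm at a hand size of 16.5 cm and roughly 6.4 cm per cm at 19 cm, and the turning point is at about 14 cm. Below that hand size the fitted curve slopes downward, which is one reason to be cautious about extrapolating a quadratic fit.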
We could probably conclude that reg1 is a better model
for this analysis. The improvement in \(R^2\) is rather modest (0.037), and the
linear model provides a clearer, more interpretable relationship. The
quadratic form may be overfitting, and the negative linear coefficient
seems counterintuitive for this relationship. Unless there’s strong
theoretical or visual evidence of curvature, the simpler linear model is
likely preferable in this case.
Reciprocal term
Let’s assume that we still want to try improving our initial univariate model. We can also try a reciprocal term for hand size. This would be modeled as:
\[ height = \beta_0 + \beta_1 \frac{1}{hand} + u \]
The regression would look like this:
summary(lm(height ~ I(1/hand), data = d))
##
## Call:
## lm(formula = height ~ I(1/hand), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7930 -4.8532 -0.6111 4.4470 12.7783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 267.71 10.55 25.376 < 2e-16 ***
## I(1/hand) -1762.45 188.92 -9.329 1.58e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.021 on 50 degrees of freedom
## Multiple R-squared: 0.6351, Adjusted R-squared: 0.6278
## F-statistic: 87.03 on 1 and 50 DF, p-value: 1.578e-12
Here, we find that the reciprocal term \(\frac{1}{hand}\) is highly significant (p = 1.58e-12). This functional form builds in a nonlinear relationship in which the effect of hand size diminishes as hand size increases. Because the coefficient is negative (-1762.45), the term \(-1762.45 \times \frac{1}{hand}\) moves closer to zero as hand size increases, so predicted height rises, but at a decreasing rate, toward the intercept of 267.71.
Interpretation: This functional form suggests that going from, say, 15 cm to 16 cm hand size has a larger impact on height than going from 20 cm to 21 cm. The relationship follows a curve that rises steeply at smaller hand sizes then flattens out.
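The comparison in this interpretation can be checked with the estimated coefficients. The sketch below is only arithmetic on the reported intercept and slope. The function name predict_height is an assumption for illustration.

```r
# Coefficients copied from the reciprocal regression output
b0 <- 267.71    # intercept
b1 <- -1762.45  # coefficient on I(1/hand)

# Predicted height for a given hand size under the reciprocal model
predict_height <- function(hand) b0 + b1 / hand  # hypothetical helper

# Change in predicted height for a 1 cm step at small and large hand sizes
step_small <- predict_height(16) - predict_height(15)
step_large <- predict_height(21) - predict_height(20)
c(from_15_to_16 = step_small, from_20_to_21 = step_large)
```

The step from 15 cm to 16 cm changes predicted height by about 7.3 cm, while the step from 20 cm to 21 cm changes it by about 4.2 cm, which is the flattening described above.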
We can graph it by passing this functional form to the formula argument of
geom_smooth() in ggplot:
ggplot(d, aes(x = hand, y = height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ I(1/x),
              se = TRUE, color = "red") +
  ggtitle("Reciprocal")
Comparing univariate, quadratic, and reciprocal
In the table below, we can evaluate the fit statistics of all three models:
| Model | R² | Adj. R² | Interpretation |
|---|---|---|---|
| Linear | 0.679 | 0.673 | Simplest, clearest |
| Quadratic | 0.716 | 0.705 | Best fit, but harder to interpret |
| Reciprocal | 0.635 | 0.628 | Worst fit |
Conclusion: The linear model reg1
remains the best choice. While the quadratic model has marginally better
fit statistics, the improvement is modest and comes at the cost of
interpretability. The reciprocal model actually performs worse than the
linear model (lower \(R^2\)). For a
straightforward relationship like hand size and height, the simple
linear model provides the best balance of fit, interpretability, and
parsimony. In a simple analysis like this, simple models are usually the
best bet.
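The adjusted \(R^2\) values in the comparison table can be reproduced from the reported \(R^2\) values, which shows how the adjustment penalizes the extra term in the quadratic model. This sketch assumes n = 52 observations, inferred from the 50 residual degrees of freedom plus two estimated coefficients in the linear model. The helper adj_r2 is a hypothetical name.

```r
# Sample size inferred from the output: 50 residual df + 2 coefficients
n <- 52

# Adjusted R-squared for a model with k regressors (excluding the intercept)
adj_r2 <- function(r2, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)  # hypothetical helper

fits <- data.frame(
  model = c("Linear", "Quadratic", "Reciprocal"),
  r2    = c(0.6790, 0.7163, 0.6351),  # Multiple R-squared from each output
  k     = c(1, 2, 1)                  # number of regressors
)
fits$adj_r2 <- round(adj_r2(fits$r2, fits$k), 4)
fits
```

Because the inputs are rounded, the last digit can differ slightly from what summary() prints, but the ranking of the three models is unchanged.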
Log transformations
Another common functional form is the log transformation, which is useful when relationships appear to be multiplicative rather than additive, or when we want to interpret results in terms of percentage changes. We can apply the log transformation to either the dependent variable, independent variable, or both.
Log-linear model (log Y)
If we transform only the dependent variable as:
\[ log(height) = \beta_0 + \beta_1 hand + u\]
The regression would be:
summary(lm(log(height) ~ hand, data = d))
##
## Call:
## lm(formula = log(height) ~ hand, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.054095 -0.024569 -0.000682 0.025615 0.061141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.551348 0.057584 79.04 < 2e-16 ***
## hand 0.032107 0.003175 10.11 1.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03325 on 50 degrees of freedom
## Multiple R-squared: 0.6716, Adjusted R-squared: 0.6651
## F-statistic: 102.3 on 1 and 50 DF, p-value: 1.102e-13
Interpretation: A one-centimeter increase in hand size is associated with approximately a 3.2% increase in height (0.032 × 100). This model has an \(R^2\) of 0.672, similar to the simple linear model’s 0.679, although the two are not strictly comparable because the dependent variables differ (log(height) versus height).
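The "coefficient × 100" reading is an approximation that works well for small coefficients. The exact percentage change implied by a log-linear model is \(100 \times (e^{\beta_1} - 1)\). The sketch below compares the two using the reported coefficient. The variable names are made up for this example.

```r
# Coefficient on hand from the log-linear regression output
b1 <- 0.032107

approx_pct <- 100 * b1              # rule-of-thumb percentage change
exact_pct  <- 100 * (exp(b1) - 1)   # exact percentage change

c(approximate = approx_pct, exact = exact_pct)
```

Here the exact figure is about 3.26%, so the approximation of 3.2% is close. The gap grows as the coefficient gets larger.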
Linear-log model
If we transform only the independent variable:
\[ height = \beta_0 + \beta_1 log(hand) + u\]
The regression would be:
summary(lm(height ~ log(hand), data = d))
##
## Call:
## lm(formula = height ~ log(hand), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4808 -4.6069 -0.4752 4.6036 11.5234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -118.07 29.28 -4.033 0.000188 ***
## log(hand) 99.49 10.12 9.830 2.86e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.82 on 50 degrees of freedom
## Multiple R-squared: 0.659, Adjusted R-squared: 0.6522
## F-statistic: 96.62 on 1 and 50 DF, p-value: 2.856e-13
Interpretation: A 1% increase in hand size is associated with an approximately 0.99 cm increase in height (99.49 ÷ 100). This specification has a slightly lower \(R^2\) (0.659) than the linear model (0.679) and implies diminishing returns: a given absolute change in hand size is associated with a larger change in height at smaller hand sizes.
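In a linear-log model, the exact change in height for a proportional change in hand size is \(\beta_1 \times \log(1 + p)\), where p is the proportional change. Dividing the coefficient by 100 is the small-change shortcut. The sketch below uses the reported coefficient, and the variable names are illustrative only.

```r
# Coefficient on log(hand) from the linear-log regression output
b1 <- 99.49

shortcut_1pct <- b1 / 100        # rule of thumb for a 1% increase
exact_1pct    <- b1 * log(1.01)  # exact change for a 1% increase
exact_10pct   <- b1 * log(1.10)  # exact change for a 10% increase

c(shortcut_1pct = shortcut_1pct,
  exact_1pct    = exact_1pct,
  exact_10pct   = exact_10pct)
```

For a 1% increase both versions give about 0.99 cm. For a 10% increase the exact change is about 9.5 cm, a little less than ten times the 1% figure, which reflects the diminishing returns built into the log of hand size.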
Log-log model
If we transform both variables as:
\[ log(height) = \beta_0 + \beta_1 log(hand) + u\]
Then the regression is:
summary(lm(log(height) ~ log(hand), data = d))
##
## Call:
## lm(formula = log(height) ~ log(hand), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05595 -0.02605 -0.00114 0.02790 0.06674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.46387 0.17174 20.170 < 2e-16 ***
## log(hand) 0.57683 0.05937 9.716 4.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03414 on 50 degrees of freedom
## Multiple R-squared: 0.6537, Adjusted R-squared: 0.6468
## F-statistic: 94.4 on 1 and 50 DF, p-value: 4.204e-13
Interpretation: Because both variables are in logs, the coefficient is an elasticity: a 1% increase in hand size is associated with an approximately 0.58% increase in height. This specification has the lowest \(R^2\) (0.654) of the three log models (0.672 for log-linear and 0.659 for linear-log).
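In a log-log model the slope is read as an elasticity: the percentage change in height associated with a 1% change in hand size. The exact percentage change for any increase is \(100 \times ((1 + p)^{\beta_1} - 1)\). The sketch below applies this to the reported coefficient. The function name pct_change_height is an assumption for illustration.

```r
# Coefficient on log(hand) from the log-log regression output
b1 <- 0.57683

# Exact percentage change in height for a given percentage change in hand size
pct_change_height <- function(pct_hand) {  # hypothetical helper
  100 * ((1 + pct_hand / 100)^b1 - 1)
}

effects <- pct_change_height(c(1, 10))
names(effects) <- c("hand_up_1pct", "hand_up_10pct")
effects
```

A 1% increase in hand size corresponds to roughly a 0.58% increase in height, and a 10% increase to roughly 5.7%, so height rises less than proportionally with hand size.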