Advanced Economic Methods

Exercise 1: Intro to R

Welcome

Welcome to the course! Over the next 10 lessons, you will learn how to run regressions, store the results, and generate plots using R. This document introduces the layout of RStudio, outlines the file structure we will use so that code can be copied and pasted into your own files and run without changes, and walks through the first steps of setting up and running code.

File organization

While there are many different ways to organize your files, please use the structure I provide so that all code runs without changes when you copy and paste it into your own R documents. I use a parent folder called R, with subfolders named cmd (short for command; store your code files here), dat (store your data files here), and results (store your plots and regression results here). The file structure should look like this:

R/
├── cmd/       # code files
├── dat/       # data files
└── results/   # plots and regression outputs
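
If you would rather create these folders from R than by hand, the sketch below is one way to do it. The variable parent_dir is an assumption for illustration (it points at a temporary folder here); replace it with the location where you want the R folder to live.

```r
# Hypothetical parent location: replace tempdir() with your own path,
# for example "C:/Projects/applied-econometrics"
parent_dir <- tempdir()
root <- file.path(parent_dir, "R")

# Create R/ and its three subfolders (no error if they already exist)
for (sub in c("cmd", "dat", "results")) {
  dir.create(file.path(root, sub), recursive = TRUE, showWarnings = FALSE)
}

# List the subfolders to confirm the layout
list.dirs(root, full.names = FALSE, recursive = FALSE)
```

This only needs to be run once, so the console is a reasonable place for it.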

We will use the data files provided by the author of the textbook in Excel format, since most publicly available data sets are distributed as Excel or CSV files. Extract these files from the zip file provided by the author and copy them to the dat folder.

Installing and using R and RStudio

To start, you must download and install R and RStudio and download the data supplied by the textbook Applied Econometrics. All file downloads are found in the following links:

  • First, install the newest version of R: Windows version, Mac version.
  • Then, install RStudio (free version): Windows and Mac.
  • Now, download the data files here and extract them to dat. Be sure that the files are not in any subfolders in dat.

RStudio layout

Please refer to the image below to identify the panels in RStudio:

  • Top left: This is where your script files open. You will write all your code, run the commands, and save your work here.
  • Top right: You will usually use the Environment tab here, which shows all the objects you create (data sets, regression results, values, etc.).
  • Bottom left: You will usually use the Console tab, which shows progress, reports errors, and prints the results of the commands you run.
  • Bottom right: This panel can be used for file navigation, but we will mostly use the Plots tab to preview our visualizations before saving them to file.

Starting your script

We will do (almost) all coding within script files. Open a new script file by going to File → New File → R Script. Save the new script file in the cmd folder. The remainder of the tutorial gives you all the code to enter into your new script.

A few useful tips when coding:

  • Any text after a hash symbol (#) will not be run as code. This is referred to as ‘commenting out’ text.
  • Leave blank lines between commands, and add comments describing what each step does.
  • Write a header at the top of your script with basic details (project name, author, date, etc.).
  • Add a command directly below your header that clears the environment (the objects listed in the top right panel). This ensures that objects left over from other scripts run in the same session do not interfere with your code.

Below is an example script header. Copy/paste it into the top of your script file and change the details to fit your project.

###########################################################################
# Project: Introduction to R
# Author: [Your name]
# Date: [Today's date]
###########################################################################

Now, add the following line to clear the environment:

# Clearing environment
rm(list=ls())
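
To see what this line does, here is a minimal sketch you can try in the console. The object names x and y are arbitrary placeholders, not part of the course code.

```r
# Create two throwaway objects (names are arbitrary)
x <- 1
y <- "hello"
ls()             # lists the objects currently in the environment

# Remove every object that ls() returns
rm(list = ls())
ls()             # character(0): the environment is now empty
```

Note that rm(list = ls()) only removes objects. It does not detach packages that are already loaded, and it leaves objects whose names begin with a dot, because ls() does not list them by default.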

Installing and loading libraries

Because R is open-source software, much of its functionality comes from user-written packages rather than from the base installation. You will have to install any packages you have never used before. Since this is your first time using R, we will install several packages that you will use throughout the course.

Run the following code to install readxl (for reading Excel files), along with other essential packages for importing, organizing, and plotting data. Since you only need to run it once, I would suggest running it in your console rather than keeping it in the script file.

# Install packages (only run this once)
install.packages(c("readxl", "tidyverse", "ggplot2"))
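
If you are unsure whether some of these packages are already on your machine, the following sketch checks first and installs only what is missing. The variable names pkgs and missing_pkgs are my own, and the interactive() guard is an assumption I added so that nothing is installed when the code is run non-interactively.

```r
# Packages used in this exercise
pkgs <- c("readxl", "tidyverse", "ggplot2")

# Names of all packages currently installed in your library
installed <- rownames(installed.packages())

# Which of the required packages are not yet installed?
missing_pkgs <- setdiff(pkgs, installed)
missing_pkgs

# Install only the missing ones, and only in an interactive session
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```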

What each package does:

  • readxl: Imports .xls and .xlsx Excel files into R.
  • tidyverse: A collection of packages for data science, including:
    • dplyr for data manipulation,
    • readr for reading .csv and .tsv files,
    • tidyr for reshaping data,
    • tibble for modern data frames,
    • stringr for string operations,
    • forcats for working with factors.
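  • ggplot2: A powerful system for creating publication-quality graphics. It is also part of the tidyverse, so installing and loading it separately is optional but harmless.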

*Note: if you choose to use the Stata files provided by the textbook instead, you will also have to install and load the haven package.

The packages are now saved to your library and do not need to be installed again. However, you will have to load them every time you start a new session. Place the code below after the line that clears the environment:

# Loading packages
library(readxl)      # for reading Excel files
library(tidyverse)   # includes dplyr, readr, tidyr, and more
library(ggplot2)     # for creating plots

Setting the working directory

We will now set the working directory so that R knows where to find our data and where to save results. We use the setwd() command, and you must enter the path to the R folder you created in the File organization section above. Make sure that the path is in quotation marks and uses forward slashes (/) only. The example below shows a generic file path that you can modify to fit the location of your R folder:

setwd("C:/Projects/applied-econometrics/R")
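
A quick way to confirm that the working directory is set correctly is to check that the three course folders are visible from it. The sketch below is one way to do that; the variable names expected and missing_dirs are my own.

```r
# Print the current working directory; it should end in /R
getwd()

# Check that the three course subfolders are visible from here
expected <- c("cmd", "dat", "results")
missing_dirs <- expected[!dir.exists(expected)]

if (length(missing_dirs) == 0) {
  message("Working directory looks right.")
} else {
  message("Not found from here: ", paste(missing_dirs, collapse = ", "))
}
```

If any folder is reported as not found, recheck the path passed to setwd() before moving on, since the relative paths used later (such as dat/arch.xls) depend on it.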

Loading data and creating a plot

Now, we will load a sample data set and generate a simple plot to demonstrate some basic functions. When we load a new data set, we create an object for it in the environment. For simplicity, we will call this object d, and we will load the arch data set from the dat folder:

d <- read_excel("dat/arch.xls")
## New names:
## • `` -> `...1`

You can now look at the data structure using the str() command to get an overview of what the data set contains:

str(d)
## tibble [2,610 × 7] (S3: tbl_df/tbl/data.frame)
##  $ ...1    : POSIXct[1:2610], format: "1990-01-01" "1990-01-02" ...
##  $ GULF    : num [1:2610] 0 0 0 0 0 0 0 0 0 0 ...
##  $ POSTWAR : num [1:2610] 0 0 0 0 0 0 0 0 0 0 ...
##  $ R_FTSE  : num [1:2610] 0.002442 0.000266 -0.007566 0.0073 -0.010836 ...
##  $ R_STOCK1: num [1:2610] 0 0.00805 0.00398 0 0.00397 ...
##  $ R_STOCK2: num [1:2610] 0 0.01733 0.04317 -0.01897 0.00475 ...
##  $ R_STOCK3: num [1:2610] 0 0.00174 0.01216 -0.00694 0.00694 ...

We can see that when the data were loaded, the date column was given an unclear name (...1). To make the data set easier to understand, we’ll rename this column to Date. We do this using the rename() function, and the %>% pipe operator, which passes the data (d) into the next function. The pipe is useful for chaining together multiple operations, which will become especially helpful as we perform more complex data transformations later in the course.

d <- d %>%
  rename(Date = ...1)
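
To make the pipe less mysterious, the sketch below shows that piping d into rename() gives the same result as passing it as the first argument. It uses a small stand-in data frame called toy (an assumption for illustration) built from the first three dates and R_FTSE values shown in the str() output, so it runs without the Excel file.

```r
library(dplyr)  # provides rename() and the %>% pipe

# Toy stand-in for d: three rows of the date column and R_FTSE
toy <- data.frame(
  date_col = as.POSIXct(c("1990-01-01", "1990-01-02", "1990-01-03"), tz = "UTC"),
  R_FTSE   = c(0.002442, 0.000266, -0.007566)
)
names(toy)[1] <- "...1"  # mimic the unclear name assigned on import

# These two calls do the same thing
piped  <- toy %>% rename(Date = ...1)
nested <- rename(toy, Date = ...1)

identical(piped, nested)  # compares the two results
names(piped)              # column names after renaming
```

With a single step the two forms are equally readable. The pipe pays off once several steps are chained, because each step reads top to bottom instead of inside out.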

Now, we will look at the data again to make sure the change was successful. This time we will use the head() command to view the first few rows of the data set instead of the overall structure.

head(d)
## # A tibble: 6 × 7
##   Date                 GULF POSTWAR    R_FTSE R_STOCK1 R_STOCK2 R_STOCK3
##   <dttm>              <dbl>   <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1 1990-01-01 00:00:00     0       0  0.00244   0        0        0      
## 2 1990-01-02 00:00:00     0       0  0.000266  0.00805  0.0173   0.00174
## 3 1990-01-03 00:00:00     0       0 -0.00757   0.00398  0.0432   0.0122 
## 4 1990-01-04 00:00:00     0       0  0.00730   0       -0.0190  -0.00694
## 5 1990-01-05 00:00:00     0       0 -0.0108    0.00397  0.00475  0.00694
## 6 1990-01-06 00:00:00     0       0 -0.00652  -0.00196  0        0

If you prefer, you can also use the command View(d) (note the capital V) to open the entire data set in a new window. This might be cumbersome if it is a very large data set, but it is nice to keep this window open for a small data set so you can look at the changes you make throughout the script.

View(d)

Our data look good, so we will now make a basic plot to visualize one of the variables. We will plot the daily returns of Stock 1 over time using the ggplot() function from the ggplot2 package, which produces publication-quality graphics and is relatively easy to use. The comments in the code below explain what each part of the call does.

ggplot(d, aes(x = Date, y = R_STOCK1)) +  # Specifying the data object d, and the x and y variables
  geom_line(color = "steelblue") +        # Draws the line and sets its color
  labs(
    title = "Daily Returns of Stock 1",   # This is the title of the plot
    x = "Date",                           # The label for the x-axis
    y = "Return"                          # The label for the y-axis
  ) +
  theme_minimal()                         # This keeps the theme basic (minimal)
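
Once the plot looks right in the Plots tab, you will probably want to save it to the results folder. The sketch below shows one way to do this with ggsave(). The file name stock1_returns.png, the plot dimensions, and the small stand-in data frame toy (built from the six rows shown by head(d)) are assumptions for illustration; with the real data you would use d in place of toy.

```r
library(ggplot2)

# Toy stand-in for d: the first six rows of Date and R_STOCK1 shown by head(d)
toy <- data.frame(
  Date     = as.Date("1990-01-01") + 0:5,
  R_STOCK1 = c(0, 0.00805, 0.00398, 0, 0.00397, -0.00196)
)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(toy, aes(x = Date, y = R_STOCK1)) +
  geom_line(color = "steelblue") +
  labs(title = "Daily Returns of Stock 1", x = "Date", y = "Return") +
  theme_minimal()

# Make sure the results folder exists relative to the working directory
dir.create("results", showWarnings = FALSE)

# Hypothetical file name; width and height are in inches by default
out_file <- file.path("results", "stock1_returns.png")
ggsave(out_file, plot = p, width = 8, height = 4)
```

Assigning the plot to p is optional: if the plot argument is omitted, ggsave() saves the last plot that was displayed.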