Statistics

Exercise 1: File structure, installing R, and basic commands

Welcome

Welcome to the course! This is the first lesson in a series designed to teach you basic programming in R, including generating summary statistics and plots and estimating basic linear regression models. This document is dedicated primarily to helping you install R and is broken up into three sections:

  • First, I will show you the structure of the directory that we should all use for class. This is important, as it will allow you to copy/paste code from these lessons into your own script files, and will allow me to run the code you generate for your assignments without any snags.
  • Second, I will provide you with all the information you need to install R and RStudio. They are separate programs, and both are necessary to complete the assignments.
  • Finally, I will give you a basic introduction to the layout of RStudio so you can familiarize yourself with the program. This section will also introduce you to some of the essential programming that you will have to do every time you start a new project, such as creating a header to label your work, installing and loading libraries, and setting your working directory.

File organization

While there are many ways to organize your files, I recommend using the structure below so that code copied from these tutorials runs in your own R scripts without any headaches. I use a parent folder called R, with subfolders named cmd, short for command (store your code files here), dat (store your data files here), and results (store your plots and regression results here). The file structure should look like this:

R/
├── cmd/       # code files
├── dat/       # data files
└── results/   # plots and regression results

Data files will be provided on the same page of my website as the tutorials. From there, you can download them and place them in the dat folder.
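Once R is installed (see the next section), the folder layout described above can also be created from R itself instead of by hand. The following is a minimal sketch, not part of the course materials: the helper name create_course_folders and the temporary demo location are assumptions made for illustration.

```r
# Hypothetical helper: create the R/cmd, R/dat, and R/results folders
# inside the folder given by `parent`
create_course_folders <- function(parent) {
  subfolders <- c("cmd", "dat", "results")
  for (sub in subfolders) {
    # recursive = TRUE also creates the parent R folder if it is missing;
    # showWarnings = FALSE keeps R quiet if a folder already exists
    dir.create(file.path(parent, "R", sub),
               recursive = TRUE, showWarnings = FALSE)
  }
  # Return the path of the new R folder without printing it
  invisible(file.path(parent, "R"))
}

# Demonstration in a temporary folder so nothing on your computer changes
demo_parent <- file.path(tempdir(), "folder-demo")
course_dir <- create_course_folders(demo_parent)

# List the subfolders that were created
list.dirs(course_dir, full.names = FALSE, recursive = FALSE)
```

To use it for real, replace demo_parent with the location where you want the R folder to live.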

Installing and using R and RStudio

To start, you must download and install R and RStudio. The programs can be downloaded using the following links:

RStudio layout

Please refer to the image below to identify the panels in RStudio:

  • Top left: This is where your script files are. You will write all your code, run the commands, and save your work here.
  • Top right: You will usually use the environment tab here, which shows all the objects you create (data files, regression results, values, etc.).
  • Bottom left: You will usually use the console tab, which shows progress, informs you of errors, shows results of the commands you run, etc.
  • Bottom right: Can be used for navigation, but we will mostly look at the plots tab to see our visualizations before saving them to file.

Starting your script

We will do (almost) all coding within script files. Open a new script file by going to File → New File → R Script. Save the new script file in the cmd folder. The remainder of the tutorial will give you instructions on all the code to enter to your new script.

A few useful tips when coding:

  • Any text after a hashtag (#) on a line will not be run as operational code. This is referred to as ‘commenting-out’ text.
  • Add blank lines between commands, along with commented-out descriptions of what each command does.
  • Write a header at the top of your script with basic details (project name, author, date, etc.).
  • Add a command directly below your header that clears the environment (information in the top right panel). This ensures that objects left over from other scripts you ran while R was open do not interfere with the current one.

Below is an example script header. Copy/paste it into the top of your script file and change the details to fit your project. Since we will start with the actual programming next week, we will preemptively name this project Generating descriptive statistics.

###########################################################################
# Project: Generating descriptive statistics
# Author: [Your name]
# Date: [Today's date]
###########################################################################

Now, add the following line to clear the environment:

# Clearing environment
rm(list=ls())
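To see what this line does, here is a small illustrative sketch. The object names example_value and example_vector are made up for this demonstration. Note that the command only removes objects from the environment; it does not unload packages you have already loaded.

```r
# Create two throwaway objects (hypothetical names, for illustration only)
example_value <- 5
example_vector <- c(1, 2, 3)

# ls() lists the names of the objects currently in the environment
ls()

# Remove every object that ls() reports
rm(list = ls())

# The two objects above are gone now
ls()
```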

Installing and loading libraries

Because R is open-source software, much of its functionality comes from user-written packages. You will have to install any packages you have never used before. Since this is your first time using R, we will install several packages that you will use throughout the course.

Run the following code to install readxl (for reading Excel files), along with other essential packages for importing, organizing, and plotting data.

Important: You only need to run the install.packages command once for any given package, so run it directly in your console (bottom left panel) rather than keeping it in the script file. If you save it to your script file, R will reinstall the packages every time you run the script, which is slow and unnecessary.

# Install packages (only run this once)
install.packages(c("readxl", "tidyverse", "ggplot2"))
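If you are unsure whether a package is already installed, one option is to check first and install only what is missing. The following is a minimal sketch under that assumption: the helper names missing_packages and install_if_missing are hypothetical and are not part of R or the course materials.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  # Row names of installed.packages() are the installed package names
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Hypothetical helper: install only the packages that are missing
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  # Return the names that needed installing, without printing them
  invisible(to_install)
}

# stats and utils ship with R, so neither is reported as missing
missing_packages(c("stats", "utils"))
```

For this course, the equivalent call would be install_if_missing(c("readxl", "tidyverse", "ggplot2")), which does nothing if all three are already in your library.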

What each package does:

  • readxl: Imports .xls and .xlsx Excel files into R.
  • tidyverse: A collection of packages for data science, including:
    • dplyr for data manipulation,
    • readr for reading .csv and .tsv files,
    • tidyr for reshaping data,
    • tibble for modern data frames,
    • stringr for string operations,
    • forcats for working with factors.
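  • ggplot2: A powerful system for creating publication-quality graphics. It is also part of the tidyverse, so installing and loading it separately is optional but harmless.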

The packages are now saved to your library and do not need to be installed again. However, you will have to load them every time you start a new session. Place the code below after the line that clears the environment:

# Loading packages
library(readxl)      # for reading Excel files
library(tidyverse)   # includes dplyr, readr, tidyr, and more
library(ggplot2)     # for creating plots

Setting the working directory

We will now set the working directory. We do this so R knows where to find our data and where to save results. We use the setwd() command, and you must enter your own file path. This will be wherever you created the R folder described in the file organization section above. Make sure that the path is in quotation marks and uses forward slashes (/) only. The example below shows a generic file path that you can modify to fit the location of your R folder:

setwd("C:/Projects/applied-econometrics/R")
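A mistyped path makes setwd() stop with an error, so it can help to check the folder first and confirm the result afterwards. The following is a minimal sketch, not a required part of your script: the variable names project_dir and data_path and the file name example.xlsx are assumptions made for illustration.

```r
# Placeholder path from the example above; replace it with your own R folder
project_dir <- "C:/Projects/applied-econometrics/R"

# Only change the working directory if the folder actually exists
if (dir.exists(project_dir)) {
  setwd(project_dir)
} else {
  message("Folder not found, check the path: ", project_dir)
}

# Show the current working directory to confirm where R is looking
getwd()

# Build a path relative to the working directory
# ("example.xlsx" is a hypothetical file name)
data_path <- file.path("dat", "example.xlsx")
data_path
```

Once the working directory is set to your R folder, relative paths such as the one stored in data_path point into the dat subfolder without repeating the full path.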

Your final script file (so far)

After completing the steps above, you should have a script file that looks like this:

###########################################################################
# Project: Generating descriptive statistics
# Author: [Your name]
# Date: [Today's date]
###########################################################################

# Clearing environment
rm(list=ls())

# Loading packages
library(readxl)      # for reading Excel files
library(tidyverse)   # includes dplyr, readr, tidyr, and more
library(ggplot2)     # for creating plots

# Setting working directory (change the path to your own R folder)
setwd("C:/Projects/applied-econometrics/R")

Finished!

That is all for today! Next week we will pick up right from here by loading a data set and generating some basic descriptive statistics.