Statistics
Exercise 1: File structure, installing R, and basic commands
Welcome
Welcome to the course! This is the first lesson in a series designed
to teach you basic programming in R, such as generating summary
statistics and plots and fitting basic linear regression models. This
document is dedicated primarily to helping you install R, and is
broken up into three sections:
- First, I will show you the structure of the directory that we should all use for class. This is important, as it will allow you to copy/paste code from these lessons into your own script files, and will allow me to run the code you generate for your assignments without any snags.
- Second, I will provide you with all the information you need to install R and RStudio. They are separate programs, and both are necessary to complete the assignments.
- Finally, I will give you a basic introduction to the layout of RStudio so you can familiarize yourself with the program. This section will also introduce you to some of the essential programming that you will have to do every time you start a new project, such as creating a header to label your work, installing and loading libraries, and setting your working directory.
File organization
While there are many different ways to organize your files, I
recommend using the structure I provide. That way, any code you copy
from these tutorials and paste into your own R documents will run
without any headaches. I use a parent
folder called R, with subfolders named cmd for command
(store your code files here), dat (store your data files here),
and results (store your plots and regression results here). The
file structure should look like this:
R/
├── cmd/ # code files
├── dat/ # data files
└── results/ # plots and regression results
Data files will be provided on the same page on my website as the
tutorials. From there, you can download them and place them in the
dat folder.
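If you would rather create these folders from R than by hand, the sketch below shows one way to do it. The variable names (base_dir, root) are hypothetical, and base_dir points at a temporary folder only so the example is safe to run; replace it with the location where you want your R folder to live.

```r
# Hypothetical location for the parent folder; replace tempdir() with
# your own path, e.g. "C:/Projects/applied-econometrics"
base_dir <- tempdir()
root <- file.path(base_dir, "R")

# Create the parent folder and the three subfolders described above
for (sub in c("cmd", "dat", "results")) {
  dir.create(file.path(root, sub), recursive = TRUE, showWarnings = FALSE)
}

# List the subfolders to confirm that they exist
list.dirs(root, full.names = FALSE, recursive = FALSE)
```

Once RStudio is installed, you can run this in the console. Creating the folders in your file explorer works just as well.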
Installing and using R and RStudio
To start, you must download and install R and
RStudio. The programs can be downloaded using the following
links:
- First, install the newest version of R: Windows version, Mac version
- Then, install RStudio (free version): Windows and Mac
RStudio layout
Please refer to the image below to identify the panels in RStudio:
- Top left: This is where your script files are. You will write all your code, run the commands, and save your work here
- Top right: You will usually use the environment tab here, which shows all the objects you will create (data files, regression results, values, etc.)
- Bottom left: You will usually use the console tab, which shows all progress, informs you of errors, shows results of the commands you run, etc.
- Bottom right: Can be used for navigation, but we will mostly look at the plots tab to see our visualizations before saving to file
Starting your script
We will do (almost) all coding within script files. Open a new script
file by going to File → New File → R Script. Save the
new script file in the cmd folder. The remainder of the
tutorial walks you through all the code to enter into your new
script.
A few useful tips when coding:
- Any text after a hashtag (#) will not be run as operational code. This is referred to as ‘commenting-out’ text
- Add spaces between the commands along with commented-out descriptions of what you are doing
- Write a header at the top of your script with basic details (project name, author, date, etc.)
- Add a command directly below your header that clears the environment (information in the top right panel). This ensures that everything will run correctly if you are running multiple scripts while R is open.
Below is an example script header. Copy/paste it into the top of your script file and change the details to fit your project. Since we will start with the actual programming next week, we will name this file in advance: Generating descriptive statistics
###########################################################################
# Project: Generating descriptive statistics
# Author: [Your name]
# Date: [Today's date]
###########################################################################
Now, add the following line to clear the environment:
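A minimal sketch of that line, using the same rm(list=ls()) call that appears in the final script at the end of this lesson. The object example_object is hypothetical and is only there to show that the environment really is emptied:

```r
# A throwaway object, created only for illustration
example_object <- 5

# Clearing environment: remove every object listed by ls()
rm(list = ls())

# The environment is now empty
ls()
```

In your own script you only need the comment and the rm(list = ls()) line.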
Installing and loading libraries
Because R is open-source software, much of its functionality comes
from user-written packages. You will have to install any packages you
have never used before. Since this is your first time using
R, we will install several packages that you will use
throughout the course.
Run the following code to install readxl (for reading
Excel files), along with other essential packages for importing,
organizing, and plotting data.
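A possible version of the install command is sketched below. It assumes the three packages described in this section (readxl, tidyverse, ggplot2); the exact list used in the course may differ. The variable names and the explicit CRAN mirror in repos are my additions, and the setdiff() check simply skips any package that is already installed.

```r
# Packages described in this section (assumed list)
pkgs <- c("readxl", "tidyverse", "ggplot2")

# Keep only the packages that are not installed yet
to_install <- setdiff(pkgs, rownames(installed.packages()))

# Install whatever is missing from the CRAN cloud mirror
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```

In the RStudio console, a plain install.packages(pkgs) also works, since RStudio normally has a CRAN mirror configured already.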
Important: Since you only need to run the
install.packages command once for any given package, run
this command directly in your console (bottom left panel) rather than
keeping it in the script file. If you save it to your script file, then
R will get crabby when it tries to install a package that
is already installed.
What each package does:
- readxl: Imports .xls and .xlsx Excel files into R.
- tidyverse: A collection of packages for data science, including:
  - dplyr for data manipulation
  - readr for reading .csv and .tsv files
  - tidyr for reshaping data
  - tibble for modern data frames
  - stringr for string operations
  - forcats for working with factors
- ggplot2: A powerful system for creating publication-quality graphics.
The packages are now permanently saved to your library. However, you will have to call them every time you start a new session. Place the code below after the line that clears the environment:
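The loading code, as it appears in the final script at the end of this lesson, is sketched here. It assumes the three packages have already been installed:

```r
# Loading packages
library(readxl)     # for reading Excel files
library(tidyverse)  # includes dplyr, readr, tidyr, and more
library(ggplot2)    # for creating plots
```

If one of these lines fails with a "there is no package called" error, that package has not been installed yet.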
Setting the working directory
We will now set the working directory so that R knows where to find
our data and where to save results. We use the
setwd() command with the path to the R folder you created in the
File organization section above. Make sure that the path is in
quotation marks and uses forward slashes / only. The example below
shows a generic file path that you can modify to fit the location of
your R folder:
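A minimal sketch, using the example path from the final script as a placeholder. The variable project_dir and the dir.exists() check are my additions: the check keeps the code from stopping with an error when the folder is not there, which plain setwd() would do.

```r
# Example path only; replace it with the location of your own R folder
project_dir <- "C:/Projects/applied-econometrics/R"

# Setting the working directory (only if the folder actually exists)
if (dir.exists(project_dir)) {
  setwd(project_dir)
} else {
  message("Folder not found, working directory unchanged: ", project_dir)
}

# Show the current working directory
getwd()
```

In your own script, the single line setwd("your/path/to/R") is enough once you know the path is correct.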
Your final script file (so far)
After completing these steps, you should have a script file that looks like this:
###########################################################################
# Project: Generating descriptive statistics
# Author: [Your name]
# Date: [Today's date]
###########################################################################
# Clearing environment
rm(list=ls())
# Loading packages
library(readxl) # for reading Excel files
library(tidyverse) # includes dplyr, readr, tidyr, and more
library(ggplot2) # for creating plots
# Setting working directory
setwd("C:/Projects/applied-econometrics/R")
Finished!
That is all for today! Next week we will pick up right from here by loading a data set and generating some basic descriptive statistics.