R, a statistical programming language and environment, is becoming more popular as organizations, governments and businesses increase their use of data science. To provide a quick bootcamp on the basics of R, I’ve assembled some of the most basic processes to give a new user a short introduction to the language.
This post assumes that you have already installed R and have it running correctly on your computer. I recommend using RStudio to write and execute your code. It will make your life much easier.
Getting Started
First, R is an interactive programming environment, which means you can send commands to its interpreter to tell it what to do.
There are two basic methods to send commands to R. The first is the console, which works like an old-school command line. The second method, more typically used by R coders, is to write a script. An R script isn’t fancy. At its core it’s a text document that contains a collection of R commands. When the script is executed, it is treated like a collection of individual commands being fed one by one into the R interpreter. This differs from compiled languages such as C, where the whole program is translated into machine code before any of it runs.
Basics
Comments are probably the best place to start, especially because my code is chock-full of them. A comment is text in your code that R ignores: it is not executed and has no bearing on the function of your script or command. In R, a comment starts with a # and runs to the end of the line.
#comments start with #-signs
9-3 #basic math (this doesn't save this in a variable)
One of the most basic things you can use R for is as a calculator. For instance, if we run the code 9-3, R will display 6 as the result. All of this is rather straightforward. The only operator you might not be familiar with if you are new to coding is the modulus operator (%%), which yields the remainder when you divide the first number by the second. This gets used often when dealing with data. For example, applying the modulus operator to a variable with the number 2 gives 0 for an even number and 1 for an odd number.
#basic operations
#yields numeric value
1+2 #addition
3-2 #subtraction
3*2 #multiplication
4/5 #division
3 %% 2 #modulus (remainder operator)
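To make the even/odd trick concrete, here is a small sketch. The values vector is made up for illustration and isn’t part of the data set used later in this post.

```r
# Hypothetical sample values, chosen only for illustration
values <- c(4, 7, 10, 13)

# Remainder after dividing by 2: 0 means even, 1 means odd
remainders <- values %% 2
remainders   # 0 1 0 1

# Turn the remainders into TRUE/FALSE flags
is.even <- remainders == 0
is.even      # TRUE FALSE TRUE FALSE
```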
Beyond basic math and numeric operations, R has several fundamental data types. NULL and NA represent empty objects and missing data, respectively. These two aren’t the same: NA fills a position in a vector or data frame, while NULL does not. The details are best left for another entry.
#basic data structure
NULL #empty value
NA #missing value
9100 #numeric value
'abcdef' #string
TRUE #boolean
T #equivalent form of TRUE
FALSE
F #equivalent form of FALSE
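Without going into the full details, a quick sketch shows the practical difference between NA and NULL. The vector names here are my own placeholders.

```r
# NA holds a position in a vector; NULL is dropped entirely
with.na <- c(1, NA, 3)
with.null <- c(1, NULL, 3)

length(with.na)     # 3
length(with.null)   # 2

# Each has its own test function
is.na(with.na)      # FALSE TRUE FALSE
is.null(NULL)       # TRUE
```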
Numeric values can have mathematical operations performed on them. Strings are essentially non-numeric values. You can’t add strings together or find the average of a string. In any type of data analysis, you’ll typically have some string data. It can be used to classify entries into categories such as male/female or Mac/Windows/Linux. R can treat these as factors; since R 4.0.0, read.csv() keeps strings as character data unless you convert them with factor() or pass stringsAsFactors = TRUE.
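Here is a minimal sketch of turning string data into a factor. The os vector is a made-up example built from the Mac/Windows/Linux categories mentioned above.

```r
# Hypothetical categorical data stored as strings
os <- c('Mac', 'Windows', 'Linux', 'Mac')

# Convert the character vector into a factor
os.factor <- factor(os)

levels(os.factor)       # "Linux" "Mac" "Windows"
as.integer(os.factor)   # 2 3 1 2 (position of each entry in the levels)

# Count how many entries fall into each category
os.counts <- table(os.factor)
os.counts
```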
Finally, boolean values (TRUE or FALSE) are binary logical values. They work like the logic operations you might have learned in a math or logic class, with AND (&&) and OR (||) operators. The && and || forms compare single values and are meant for conditional statements; their element-wise counterparts, & and |, work across whole vectors and are the ones to use in data manipulation operations such as subsetting.
#logical operators
T && F #and
F || T #or
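The sketch below shows the two flavors of logical operators side by side. R also has element-wise versions, & and |, which work across whole vectors. The ages vector is a hypothetical example.

```r
# Hypothetical ages, for illustration only
ages <- c(19, 24, 31, 36)

# Element-wise AND (&) compares every position and returns a vector
in.range <- ages >= 20 & ages < 35
in.range         # FALSE TRUE TRUE FALSE

# A logical vector can be used to subset the original vector
ages[in.range]   # 24 31

# Scalar AND (&&) is for single TRUE/FALSE values, e.g. in an if statement
first.is.young <- FALSE
if (length(ages) > 0 && ages[1] < 25) {
  first.is.young <- TRUE
}
first.is.young   # TRUE
```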
Now that we’ve covered the basic operations and data types, let’s look at how to store values in variables. Assigning a value to a variable is easy. You can use an equal sign or the traditional R notation using an arrow.
x <- 1 #basic assignment
x = 1
#####Acceptable Variables
x <- 1
X <- 1
X1 <- 1
X.1 <- 1
X_1 <- 1
########################
####UNACCEPTABLE Variables
1X <- 1
X-1 <- 1 #CODE WILL NOT WORK
X,1 <- 1
#######################
Variable names must begin with a letter (or a period that is not followed by a digit), and they are case-sensitive. Periods are acceptable faux separators in variable names, but that doesn’t translate to other programming languages like Python or JavaScript, so it might factor into how you establish naming conventions.
I’ve mentioned vectors a few times already. They are an important data structure within R. A vector is an ordered collection of data. Vectors are typically thought of as numeric data, but character (string) vectors are often used in R as well. The c() function creates a vector. It’s important that a vector contains a single type of data: boolean, numeric or character. If you mix types, R will coerce the values into a common type. You can assign your vectors to variables. In fact, you can store just about anything in R in a variable.
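x.vector <- c() #c() with no arguments returns NULL (an empty vector)
x.vector <- c(1,2,3,4,5,6) #creates a vector (typically numeric)
x.vector <- c(T, 1) #mixed types: TRUE is coerced to 1
x.list <- list('A',12,'b') #creates a list (not used for numeric operations)
mean(c(1,3,2)) #a vector can be passed directly to a function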
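To see what happens when a vector is given mixed types, here is a short sketch using class() to inspect the result. The variable names are placeholders of my own.

```r
# All numeric: stays numeric
all.numbers <- c(1, 2, 3)
class(all.numbers)       # "numeric"

# Logical mixed with numeric: TRUE becomes 1
logical.mix <- c(TRUE, 1)
class(logical.mix)       # "numeric"
logical.mix              # 1 1

# Anything mixed with a string: everything becomes character
string.mix <- c(1, 'a', TRUE)
class(string.mix)        # "character"
string.mix               # "1" "a" "TRUE"
```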
Lists are created with the list() function. They are used more for storage and organization than for computation. For example, you could store the mean, median and range for a set of data in a list. A vector would house the data used to calculate said summary stats. Lists are useful when you begin to write bigger programs and need to shuffle a lot of things around.
The basic statistical functions are listed below. All of these operate on a vector.
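The summary-stats idea might look something like the sketch below. The scores vector and the element names are hypothetical, picked so the arithmetic is easy to follow.

```r
# Hypothetical data held in a vector
scores <- c(2, 4, 4, 10)

# A named list that bundles several summary statistics together
score.summary <- list(
  mean = mean(scores),
  median = median(scores),
  range = range(scores)
)

# Elements can be pulled out by name
score.summary$mean         # 5
score.summary$median       # 4
score.summary[['range']]   # 2 10
```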
#basic stats
x.vector <- c(10,11,12,12,10,11,20,9) #puts your data into a vector
mean(x.vector) #takes mean of vector
median(x.vector) #median of vector
max(x.vector) #maximum of vector
min(x.vector) #minimum of vector
range(x.vector) #yields a vector with the minimum and maximum
sd(x.vector) #standard deviation
var(x.vector) #variance
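As a quick sanity check on how these functions relate to each other, the sketch below reuses the same numbers as x.vector above, redefined here so the block runs on its own.

```r
x.vector <- c(10, 11, 12, 12, 10, 11, 20, 9)

mean(x.vector)     # 11.875
median(x.vector)   # 11
range(x.vector)    # 9 20

# The standard deviation is the square root of the variance
sd.matches.var <- isTRUE(all.equal(sd(x.vector), sqrt(var(x.vector))))
sd.matches.var     # TRUE

# range() is just the minimum and maximum in one vector
range.matches <- identical(range(x.vector), c(min(x.vector), max(x.vector)))
range.matches      # TRUE
```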
Handling Data
Above we discussed some of the building blocks of basic analysis in R. Beyond introductory statistics classes, R isn’t very useful unless you can import data. There are many ways to do this since data exists in many different formats. A .csv file is one of the most basic and widely compatible ways to store data for use across different analytical tools.
Before loading a data file into R, it’s a good idea to set your working directory to the folder where the data file is stored.
#load in data
setwd('**folder path**') #sets your working directory
#specific to each computer
Next you can use the read.csv() function to ingest a .csv file into R. On its own, this call won’t save the data in a variable; it just brings the file in as a data frame and shows it to you.
read.csv('data_bryant_kobe.csv') #reads the data into R
#does not save it into a variable
data <- read.csv('data_bryant_kobe.csv') #reads the data and saves it
#into a variable called 'data'
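If you don’t have the data file handy, you can still try read.csv() with a throwaway file. The sketch below writes a tiny made-up table to a temporary file and reads it back. The column names mirror ones used later in this post, but the numbers are invented and are not real stats.

```r
# Write a small, made-up .csv to a temporary location
sample.path <- tempfile(fileext = '.csv')
writeLines(c('Age,MP,PTS',
             '20,100,50',
             '21,200,110',
             '22,300,160'),
           sample.path)

# Read the file into a data frame and save it in a variable
sample.data <- read.csv(sample.path)

nrow(sample.data)    # 3
names(sample.data)   # "Age" "MP" "PTS"
str(sample.data)     # shows the type of each column

# Clean up the temporary file
unlink(sample.path)
```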
Data frames are the primary form of data structure you’ll encounter in R. Data frames are like tables in Excel or SQL in that they are rectangular and have a rigid schema. However, at its core a data frame is a collection of equal-length vectors.
If you assign the data frame output of the read.csv() function to a variable, you can pass the data frame around to different data manipulation functions or modeling functions. One of the most basic ways to manipulate the data is to access different values within the data frame. Below are several different examples of how to get to values, rows or columns in a data frame.
The basic concept is that data frames can be accessed by row and column number, written as [row, column]. An entire row or column can be accessed by omitting the dimension you aren’t trying to retrieve. You can retrieve individual fields (variables) by using the $ sign and the variable name. This is the method I use most often. It requires you to know and use the names of the variables, which can make your code easier to read.
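To illustrate the "collection of equal-length vectors" idea, here is a sketch that builds a small data frame by hand with data.frame(). The values are invented for the example.

```r
# Two equal-length vectors (made-up numbers)
age <- c(20, 21, 22)
pts <- c(50, 110, 160)

# Each vector becomes a named column
toy <- data.frame(Age = age, PTS = pts)

dim(toy)       # 3 2 (rows, columns)
length(toy)    # 2 (the number of column vectors)
is.list(toy)   # TRUE: a data frame is a list of columns under the hood
toy$PTS        # 50 110 160
```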
#accessing values
row <- 3
column <- 2
data$Age #returns the Age variable column
data[row,column] #individual value
data[data$Age <= 25,] #returns every row where Age is 25 or less
data[,column] #returns entire column as a vector
data$Age[3] #returns the third value of the Age column
By accessing rows, you can create a subset of data by using a logical condition to filter your data set.
#creating a quick subset
data.U25 <- data[data$Age < 25,] #creates an under-25 set
data.O25 <- data[data$Age >= 25,] #creates a 25 and older set
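The same access and subsetting patterns can be tried on a small stand-in data frame. Everything in toy below is invented for illustration; only the column names are borrowed from this post.

```r
# Made-up stand-in for the real data set
toy <- data.frame(
  Age = c(23, 24, 25, 26),
  MP  = c(100, 200, 300, 400),
  PTS = c(50, 110, 160, 200)
)

toy[2, 3]     # 110: row 2, column 3
toy$Age[3]    # 25: third value of the Age column

# The logical condition goes in the row position; the empty
# column position keeps every column
under25 <- toy[toy$Age < 25, ]
over25  <- toy[toy$Age >= 25, ]

nrow(under25)   # 2
nrow(over25)    # 2

# The two subsets together account for every row
nrow(under25) + nrow(over25) == nrow(toy)   # TRUE
```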
The code above creates two new data frames, which separate Kobe Bryant’s season stats into an under-25 data set and a 25-and-older data set.
Relationships Between Variables
Correlation is often used to summarize the linear relationship between two variables. Getting the correlation in R is simple. Use the cor() function with two equal-length vectors. R pairs up the corresponding elements in each vector and, by default, returns the Pearson correlation coefficient.
#correlation between different variables within the subset
cor(data.U25$MP, data.U25$PTS)
cor(data.O25$MP, data.O25$PTS)
cor(data$MP, data$PTS) #correlation with two related vectors from data set
cor(data$MP, data$FTpct)
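To show what cor() is doing with the paired elements, the sketch below computes the Pearson coefficient by hand and compares it with the built-in result. The minutes and points vectors are made up for the example.

```r
# Hypothetical paired observations
minutes <- c(10, 20, 30, 40, 50)
points  <- c(4, 9, 13, 22, 24)

# Built-in Pearson correlation
r.builtin <- cor(minutes, points)

# The same thing by hand: sum of paired deviations from the mean,
# scaled by the spread of each vector
dx <- minutes - mean(minutes)
dy <- points - mean(points)
r.manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r.builtin, r.manual)   # TRUE
```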
A simple linear model can be made by using the lm() function. The linear model function requires two things: a formula and a data frame. The formula uses a tilde (~) instead of an equal sign, with the response variable on the left and the predictors on the right. The formula represents the variables you would use in a standard ordinary least squares regression. The data parameter is the data frame that contains all the data.
#create a basic linear model
linear.model <- lm(PTS ~ MP, data=data.U25)
linear.model <- lm(PTS ~ MP + Age, data=data.U25) #adds Age as a second predictor (replaces the model above)
The summary() function takes the linear model object and displays information about the coefficients of your linear model.
summary(linear.model)
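If you want to check that lm() behaves the way you expect, one option is to fit it on data where you already know the answer. In the sketch below the toy data frame is invented so that PTS is exactly 3 + 2 * MP, which means the fitted intercept and slope should come back as roughly 3 and 2.

```r
# Made-up data that follows PTS = 3 + 2 * MP exactly
toy <- data.frame(
  MP  = c(1, 2, 3, 4, 5),
  PTS = c(5, 7, 9, 11, 13)
)

# Fit the model with the same formula style used above
toy.model <- lm(PTS ~ MP, data = toy)

# coef() pulls out just the fitted coefficients
toy.coef <- coef(toy.model)
toy.coef   # intercept of about 3, MP slope of about 2

# Use the model to predict PTS for a new MP value
toy.pred <- predict(toy.model, newdata = data.frame(MP = 6))
toy.pred   # about 15
```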
NOTES: The data set used in this tutorial is from basketball-reference.com.
The full code I used in this tutorial can be found on my GitHub.