Tag Archives: featured

R Bootcamp: Making a Subset

Data Manipulation: Subsetting

Making a subset of a data frame in R is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis, a data frame is the most common data storage object in R and subsets are a collection of rows from that data frame based on certain criteria.

Data Frame
V1 V2 V3 V4 V5 V6 V7
Row1
Row2
Row3
Row4
Row5
Row6

Arrow

Subset
V1 V2 V3 V4 V5 V6 V7
Row2
Row5
Row6

The Data

For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s provided in my GitHub. This data set has players names, teams, seasons and stats. We are able to create a subset based on any one or more of these variables.

The Code

I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using filter() function from the dplyr package. All of these functions are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, so let’s start with that. This is especially true when you have to loop an operation or run something that will be run repeatedly.

dplyr

The filter() requires the dplyr package to be loaded in your R environment, and it removes the filter() function from the default stats package. You don’t need to worry about but it does tell you that when you first install and load the package.

#install.packages('dplyr')
library(dplyr) #load the package

#from http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2015&month=0&season1=2010&ind=1&team=&rost=&age=&filter=&players=&page=2_30
setwd('***PATH***') 
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data

Aside from the loading the package, you'll have to load the data in as well.

#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')

#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East

#Both of these find players in the NL East and have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign

#Finds players in the NL East or has more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)

#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)

The filter() function is rather simple to use. The examples above illustrate a few simple examples where you specify the data frame you want to use and create true/false expressions, which filter() uses to find which rows it should keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.

Built-in Functions

#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]

#method 3 -- subset()
data.sub.3 <- subset(data,subset = (Team=='Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than

data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #duel requirements using AND (&)
data.sub.7 <- data.sub3 <- subset(data, subset = (AVG > .300 & PA > 600)) #using subset()

data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #duel requirements using OR (|)

data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector

data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams

If you don't want to use the dplyr package, you are able to accomplish the same thing uses the basic functionality of R. #method 1 uses a boolean vector to select rows for the subset. #method 2 uses the which() function. This function finds the index of a boolean vector of True values. Both of these techniques use the original data frame and uses the row index to create a subset.

The subset() function works much like the filter() function, except the syntax is slightly different and you don't have to download a separate package.

Efficiency

While subset works in a similar fashion, it doesn't perform the same way. While some data manipulation might only happen once or a few times throughout a project, many projects require constant subsetting and possibly from a loop. So while the gains might seem insignificant for one run, multiply that difference and it adds up quickly.

I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.

Time to Subset 500,000 Rows
Subset Method Elapsed Time (sec)
boolean vector 0.87
which() 0.33
subset() 0.81
dplyr filter() 0.21

The dpylr filter() function was by far the quickest, which is why I prefer to use it.

The full code I used to write up this tutorial is available on my GitHub .

References:

Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

emoji

Emoji iOS 9.1 Update — The Taco Emoji Analysis

Before I get too far I don’t actually analysis taco emojis. At least not yet. I, however, give you the tools to start parsing them from tweets, text or anything you can get into Python.

This past month Apple released their iOS 9.1 and their latest OS X 10.11.1 El Capitan update. That updated included a bunch of new emojis. I’ve made a quick primer on how to handle emoji analysis in Python. Then when Apple released an update to their emojis to include the diversity, I updated my small Python class for emoji counting to include to the newest emojis. I also looked at what is actually happening with the unicode when diversity modifier patches are used.

Click for Updated socialmediaparse Library

With this latest update, Apple and the Unicode Consortium didn’t really introduce any new concepts, but I did update the Python class to include the newest emojis. In my GitHub the data folder includes a text file with all the emojis delimitated by ‘\n’. The class uses this file to find any emoji’s in a unicode string which has been passed to the add_emoji_count() method.

Building off of the diversity emoji update, I added a skin_tone_dict property of the EmojiDict class. This property returns a dictionary with the number of unique human emojis per tweet and their skin tones. This property will not catch multiple human emojis written if they in the same execution of the add_emoji_count() method

import socialmediaparse as smp #loads the package
 
counter = smp.EmojiDict() #initializes the EmojiDict class
 
#goes through list of unicode objects calling the add_emoji_count method for each string
#the method keeps track of the emoji count in the attributes of the instance
for unicode_string in collection:
   counter.add_emoji_count(unicode_string)  
 
#output of the instance
print counter.dict_total #dict of the absolute total count of the emojis in corpus
print counter.dict       #dict of the count of strings with the emoji in corpus
print counter.baskets    #list of lists, emoji in each string.  one list for each string.
print counter.skin_tones_dict #dictionary of unique emoji emojis aggregated by the counter.

#print counter.skin_tones_dict output
#{'human_emoji': 4, '\\U0001f3fe': 1, '\\U0001f3fd': 1, '\\U0001f3ff': 0, '\\U0001f3fc': 2, '\\U0001f3fb': 1}
 
counter.create_csv(file='emoji_out.csv')  #method for creating csv

Above is an example of how to use the new attribute. It is a dictionary so you can work that into your analysis however you like. I will eventually create better methods and outputs to make this feature more robust and useful.

The full code / class I used in this post can be found on my GitHub .

R Bootcamp — A Quick Introduction

R, a statistics programming language environment, is becoming more popular as organizations, governments and businesses have increased their use of data science. In an effort to provide a quick bootcamp to learn the basics of R quickly, I’ve assemble some of the most basic processes to give a new user a quick introduction to the R language.

This post assumes that you have already installed R and have it running correctly on your computer. I recommend getting RStudio to use to write and execute your code. It will make your life much easier.

Getting Started

First R is an interactive programming environment, which means you are able to send commands to its interpreter to tell it what to do.

There are two basic methods to send commands to R. The first is by using the console, which is like your old-school command line computing methods. The second method is more typically used by R coders, and that’s to write a script. An R script isn’t fancy. At its core it’s a text document that contain a collection of R commands. Then when the code is executed it is treated like a collection of individual commands being feed one-by-one into the R interpreter. This differs on how other, more fundamental programming languages work.

R - How R Works

Basics

Comments are probably the best place to start, especially because my code is chock-full of them. A comment is code that is fed to R, but it’s not executed and has no bearing on the function of your script or command. In R comments are lines prefaced with a #.

#comments start with #-signs

9-3 #basic math (this doesn't save this in a variable)

One of the most basic thing you could use R for is a calculator. For instance if we run the code 9-3, R will display a 6 as the result of that code. All of this is rather straight forward. The only operator you might not be familiar with if you are new to coding is the modulus operator, which yields the remainder when you divide the first number by the second. This gets used often when dealing with data. For example, you can get a 0 for even number and 1 for odd number if you take you variable use the modulus operator with the number 2.

#basic operations
#yields numeric value
1+2 #addition
3-2 #subtraction
3*2 #multiplication
4/5 #division
3 %% 2 #modulus (remainder operator)

Beyond the basic math and numeric operations you can do, R has several fundamental data types. NULL and NA are representative of empty objects or missing data. These two data types aren’t the same. NA will fill a position in an vector or data frame. The details are best left for another entry.

#basic data structure

NULL     #empty value
NA       #missing value
9100     #numeric value
'abcdef' #string
TRUE     #boolean
T        #equilvant form
FALSE 
F

Numeric values can have mathematical operations performed on them. Strings are essentially non-numeric values. You can’t add strings together or find the average of a string. In any type of data analysis, you’ll typically have some string data. It can be used to classify entries in categorically such as male/female or Mac/Windows/Linux. R will treat these like factors.

Finally, boolean values (True or False) are binary logical values. They work like normal logic operations you might have learned in math or a logic class with AND (&&) and OR (||) operators. These can be used in conditional statements and various other data manipulation operations such as subsetting.

#logical operators
T && F #and
F || T #or

Now that we covered the basic operations and data types, let’s look at how to store that — variables. To assign a value to a variable it’s rather easy. You can use a simple equation or the traditional R notation using an arrow.

x <- 1 #basic assignment
x = 1

#####Acceptable Variables
x <- 1
X <- 1
X1 <- 1
X.1 <- 1
X_1 <- 1
########################

####UNACCEPTABLE Variables
1X <- 1
X-1 <- 1  #CODE WILL NOT WORK
X,1 <- 1
#######################

Variables must begin with a letter and they are case-sensitive. Periods are acceptable faux separators in variable names, but that doesn’t translate to other programming languages like Python or JavaScript, so that might factor in how you establish naming conventions.

I’ve mentioned vectors a few times already. They are an important data structure within R. A vector is an ordered list of data. Typically, thought of as numeric data, but character (string) vectors are often used in R. The c() operator can create a vector. It’s important that vectors contain the same type of data: boolean, numeric or character. If you mix types it will force values into another type. And you can assign your vectors to variables. In fact, you can store just about any thing in R to a variable.

x.vector <- c() #vector operator
x.vector <- c(1,2,3,4,5,6) #creates a vector (typically numeric)
x.vector <- c(T, 1)
x.list <- list('A',12,'b') #creates a list (not used for numeric operations)

mean(c(1,3,2))

Lists are created with the list() command. They are used more for storage and organization than for data structure. For example you could store the mean, median and range for a set of data in a list. A vector would house the data used to calculated said summary stats. Lists are useful when you begin to write bigger programs and need to shuffle a lot of things around.

The basic statistic operators are listed below. All of these require a vector to operate on.

#basic stats
x.vector <- c(10,11,12,12,10,11,20,9) #puts your data into a vector

mean(x.vector) #takes mean of vector
median(x.vector) #median of vector
max(x.vector) #maximum of vector
min(x.vector) #minimum of vector
range(x.vector) #yields a vector with a range

sd(x.vector) #standard deviation
var(x.vector) #variance

Handling Data

Above we discussed some of the building blocks of basic analysis in R. Beyond introductory Statistics classes, R isn’t very useful unless you can import data. There are many ways to do this since data exists in many different formats. A .csv file is one of the most basic, compatible way data is stored to be used between different analytical tools.

Before loading this data file into R, it’s a good idea to set your working directory. This is where the data file is stored.

#load in data
setwd('**folder path**') #sets your working directory 
                                         #specific to each computer

Next you can use the read.csv() function to ingest a .csv file into R. This call won’t save the data in a variable, it just brings it in as a data frame and show it to you.

read.csv('data_bryant_kobe.csv') #reads the data into R
                                 #does not save it into a variable

data <- read.csv('data_bryant_kobe.csv') #reads the data and saves it
                                         #into a variable called 'data'

Data frames are the primary form of data structure you’ll encounter in R. Data frames are like tables in Excel or SQL in that they are rectangular and have a rigid schema. However, at a data frame’s core are a collection of equal-length vectors.

If you assign the data frame output of the read.csv() function to a variable, you can pass around the data frame to different data manipulation functions or modeling functions. One of the most basic ways to manipulate the data is to access different values within the data frame. Below are several different examples on how to get to values, rows or columns in a data frame.

The basic concept is that data frames can be accessed by row and column number. [row, column] And that an entire row or column can be accessed by omitting the dimension you aren’t trying to retrieve. You can retrieve individual fields (variables) by using the $ sign and using the variable name. This is the method I use most often. It requires you knowing and using the name of the variables, which can make your code easier to read.

#accessing values
row <- 3
column <- 2
data$Age #returns the Age variable column
data[row,column]  #individual value
data[data$Age <= 25,]        #returns entire row
data[,column]     #returns entire column as a vector
data$Age[3]       #returns entire column as a vector

By accessing rows, you can create a subset of data by using a logical argument to filter out your data set.

#creating a quick subset
data.U25 <- data[data$Age < 25,]  #creates an under-25 set
data.O25 <- data[data$Age >= 25,] #creates a 25 and older set

The code above creates two new data frames which separate Kobe Bryant’s season stats into an under-25 data set and a 25 and under data set.

Relationships Between Variables

Correlation is often used to summarize the linear relationship between two variables. Getting the correlation in R is simple. Use the cor() function with two equal length vectors. R uses the corresponding elements in each vector to get a Pearson correlation coefficient.

#correlation between different variables within the subset
cor(data.U25$MP, data.U25$PTS)
cor(data.O25$MP, data.O25$PTS)

cor(data$MP, data$PTS) #correlation with two related vectors from data set
cor(data$MP, data$FTpct)

A simple linear model can be made by using the lm() function. The linear model function requires two things: a formula and a data frame. The formula uses a tilde (~) instead an equal sign. The formula represent the variables you would use in your standard ordinary least squares regression. The data parameter is the data frame which contains all the data.

#create a basic linear model
linear.model <- lm(PTS ~ MP, data=data.U25)
linear.model <- lm(PTS ~ MP + Age, data=data.U25)

The summary() function will take the linear model object and displays information about the coefficients of your linear model.

summary(linear.model)

NOTES: The data set used in this tutorial is from basketball-reference.com.

The full code I used in this tutorial can be found on my GitHub .

Covariance — Different Ways to Explain or Visualize It

Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings to a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, Both X & Y will have positive covariance. Negative covariance will result from having two variables move in opposite directions, and zero covariance will result from the variables have no relationship with each other. Variance is also a specific case of covariance where both input variables are the same. All of this is rather abstract, so let’s look at more concrete definitions.

Covariance — Summation Notation

The definition you will find in most introductory stat books is a relatively simple equation using the summation operator (Σ). This shows covariances as the sum of the product of a paired data point relative to its mean. First, you need to find the mean of both variables. Then take all the data points and subtract the mean from its respective variable. Finally, you multiply the differences together

Population Covariance:

$latex
cov(X, Y) = \frac{1}{{N}} \sum\limits_{i}^{N}{(X_i – \mu_x)(Y_i – \mu_y)}
&s=2$

Sample Covariance:

$latex
cov(X, Y) = \frac{1}{{n-1}} \sum\limits_{i}^{n}{(X_i – \bar X)(Y_i – \bar Y)}
&s=2$

N is the number of data points in the population. n is the sample number. μX is the population mean for X; μY for Y. and are the mean as well but this notation designates it as a sample mean rather than a population mean. Calculating the covariance of any significant data set can be tedious if done by hand, but we can set-up the equation in R and see it work. I used modified version of Anscombe’s Quartet data set.

#get a data set
X <- c(anscombe$x1, 6,4,10)
Y = c(anscombe$y1, 10,8,6)
#get the means
X.bar = mean(X)
Y.bar = mean(Y)
#calculate the covariance
sum((X-X.bar)*(Y-Y.bar)) / (length(X)) #manually population
sum((X-X.bar)*(Y-Y.bar)) / (length(X) - 1) #manually sample
cov(X,Y) #built-in function USES SAMPLE COVARIANCE

Obviously, since covariance is used so much within statistics, R has a built-in function cov(), which yields the sample covariance for two vectors or even a matrix.

Covariance — Expected Value Notation

[Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post. Quickly and oversimplified, the expect value is the mean value of a random variable. E[X] = mean(X). The expected value notation below describes the population covariance of two variables (not sample covariance):

$latex
cov(X, Y) = \textnormal{E}[(X-\textnormal{E}[X])(Y-\textnormal{E}[Y])]
&s=2$

The above formula is just the population covariance written differently. For example, E[X] is the same as μx. And the E[] acts the same as taking the average of (X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful formula for covariance:

$latex
cov(X, Y) = \textnormal{E}[XY] – \textnormal{E}[X]\textnormal{E}[Y]
&s=2$

This formula can be interpreted as the product of the means of variables X and Y subtracted from the average of signed areas of variables X and Y. This probably isn’t very useful if you are trying to interpret covariance. But you’ll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.

mean(X*Y) - mean(X)*mean(Y) #expected value notation

Covariance — Signed Area of Rectangles

Covariance can also be thought of as the sum of the signed area of the rectangles that can be drawn from the data points to the variables respective means. It’s called the signed area because we will get two types of rectangles, ones with a positive value and ones with negative values. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing and less of a practical interpretation and application of covariance. If you plot paired data points, in this case we will use the X and Y variables we have already used, you can tell just be looking there is probably some positive covariance because it looks like there is a linear relationship in the data. I’ve chosen to scale the plot so that zero is not included. Since this is a scatter plot including zero isn’t necessary.

XY Scatter Plot

First, we can draw the lines for the means of both variables as straight lines. These lines effectively create a new set of axes and will be used to draw the rectangles. The sides of the rectangles will be the difference between a data point and it’s mean [Xi - X̄]. When that is multiplied by [Yi - Ȳ], you can see that gives you an area of a rectangle. Do that for every point in your data set, add them up and divide by the number of data points, and you get the population covariance.

Scatterplot Covariance Rectangle

The following is a plot has a rectangle for each data point, and it is coded red for negative and blue for positive signs.

All Covariance Rectangles

The overlapping rectangles need to be considered separately so the opacity is reduced so that all the rectangles are visible. For this data set there is much more blue area than there is red area, so there is positive covariance, which jives with what we calculated earlier in R. If you were to take the areas of those rectangles and add/subtract according to the blue/red color then divide by the number of rectangles, you would arrive the population covariance: 3.16. To get the sample covariance you’d subtract one from the number of rectangles when you divide.

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

Covariance. http://math.tutorvista.com/statistics/covariance.html

Covariance As Signed Area Of Rectangles. http://www.davidchudzicki.com/posts/covariance-as-signed-area-of-rectangles/
How would you explain covariance to someone who understands only the mean?
https://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean

Notes:

The signed area of rectangles on Chudzicki’s site and statexchange use a different covariance formulation, but similar concept than my approach.

The full code I used to write up this tutorial is available on my GitHub .

Anscombe correlation R

Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one the most commonly [over]used statistical tool. In short, it measures how strong the relationship between two variables. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other.

Basic Calculation and Definition

Covariance also measures the relationship between two variables, but it is not scaled, so it can’t tell you the strength of that relationship. For example, Let’s look at the following vectors a, b and c. Vector c is simply a 10X-scaled transformation of vector b, and vector a has no transformational relationship with vector b or c.

a b c
1 4 40
2 5 50
3 6 60
4 8 80
5 8 80
5 4 40
6 10 100
7 12 120
10 15 150
4 9 90
8 12 120

Plotted out, a vs. b and a vs. c look identical except the y-axis is scaled differently for each. When the covariance is taken of both a & b and a & c, you get different a large difference in results. The covariance between a & b is much smaller than the covariance between a & c even though the plots are identical except the scale. The y-axis on the c vs. a plot goes to 150 instead of 15.

$latex
cov(X, Y) = \frac{\Sigma_i^N{(X_i – \bar X)(Y_i – \bar Y)}}{N-1}
&s=2$

$latex
cov(a, b) = 8.5
&s=2$

$latex
cov(a, c) = 85
&s=2$

correlation b vs acorrelation c vs a

To account for this, correlation is takes covariance and scales it by the product of the standard deviations of the two variables.

$latex
cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y}
&s=2$

$latex
cor(a, b) = 0.8954
&s=2$

$latex
cor(a, c) = 0.8954
&s=2$

Now, correlation describes how strong the relationship between the two vectors regardless of the scale. Since the standard deviation in vector c is much greater than vector b, this accounts for the larger covariance term and produces identical correlations terms. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a strong relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no relationship.

Here’s the R code that will run through the calculations.

#covariance vs correlation
a <- c(1,2,3,4,5,5,6,7,10,4,8)
b <- c(4,5,6,8,8,4,10,12,15,9,12)
c <- c(4,5,6,8,8,4,10,12,15,9,12) * 10

data <- data.frame(a, b, c)

cov(a, b)  #8.5
cov(a, c)  #85

cor(a,b)  #0.8954
cor(a,c)  #0.8954

Caution | Anscombe's Quartet

Correlation is great. It's a basic tool that is easy to understand, but it has its limitations. The most prominent being the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it's hard to isolate one factor. Politicians and pundits still try.

Another famous caution about using correlation is Anscombe's Quartet. Anscombe's Quartet uses different sets of data to achieve the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it's important to visualize data.

Anscombe Correlation R

The graphs demonstrates how different the data sets can be. If this was real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be consider appropriate for a basic, linear model.

I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don't need to install any packages to use it. Ggplot2 is a fantastic and powerful data visualization package which can be download for free using the install.packages('ggplot2') command. Below is the R code I used to make the graphs individually and combine them into a matrix.

#correlation
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits=4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits=4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits=4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits=4)

#define the OLS regression
line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) + geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
  geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2]) +
  annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) + geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
  geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2]) +
  annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) + geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
  geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2]) +
  annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) + geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
  geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2]) +
  annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

grid.arrange(plot1, plot2, plot3, plot4, top='Anscombe Quadrant -- Correlation Demostration')

The full code I used to write up this tutorial is available on my GitHub .

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

Baseball Twitter Roller Coaster

Because Twitter is fun and so are graphs, I have tweet volume graphs from my Twitter scraper that collects tweets with the team-specific nicknames and Twitter handles. After a trade (or non-trade), the data can be collected and a graphical picture of the reaction can be produced. The graph represents the volume of sampled tweets that contained the specific keywords: Mets, Gomez, Flores and Hamels. “Tears” is a collection of any tweets which mentioned either “tears” or “crying”, since Flores was in tears as he took the field.

Here are the reactions to the Gomez non-trade and Hamels trade last night:

Mets

And here’s the timeline of necessary tweets:
[All times are EDT.]


July 29
9:00 PM


9:45 PM


9:54 PM



10:15 PM



10:55 PM



July 30
12:13 AM

Some of the times were rounded if there wasn’t a clear single tweet that caused the peak on Twitter.

correlation matrix

Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier.

The correlation matrix is is rather popular for exploratory data analysis, because it can quickly show you the correlations between variables in your data set. From a practical application standpoint, this entire post is unnecessary, because I’m going to show how to derive this using matrix algebra in R.

First, the starting point will be the covariance matrix that was computed from the last post.

#create vectors -- these will be our columns
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)

#create matrix from vectors
M <- cbind(a,b,c,d,e)
k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects

#create means for each column
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e)) 

#creates a difference matrix
D <- M - M_mean

#creates the covariance matrix
C <- k^-1 * t(D) %*% D

$latex {\bf C } =
\begin{bmatrix}
V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix}&s=2$

This matrix has all the information that's needed to get the correlations for all the variables and create a correlation matrix [V -- variance, C -- Covariance]. Correlation, we are using the Pearson version of correlation, is calculated using the covariance between two vectors and their standard deviations [s, square root of the variance]:

$latex
cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}}
&s=2$

The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:

$latex {\bf V} = diag({\bf C}) = \begin{bmatrix}
V_a\ & 0\ & 0\ & 0\ & 0 \\
0 & V_b & 0 & 0 & 0 \\
0 & 0 & V_c & 0 & 0 \\
0 & 0 & 0 & V_d & 0 \\
0 & 0 & 0 & 0 & V_e
\end{bmatrix}

&s=2$

Since R doesn't quite work the same way as matrix algebra notation, the diag() function creates a vector from a matrix and a matrix from a vector, so it's used twice to create the diagonal variance matrix. Once to get a vector of the variances, and a second time to turn that vector into the above diagonal matrix. Since the standard deviations are needed, the square root is taken. Also the variances are inverted to facilitate division.

#pulls out the standard deviations from the covariance matrix
S <- diag(diag(C)^(-1/2))

After getting the diagonal matrix, basic matrix multiplication is used to get the all the terms in the covariance to reflect the basic correlation formula from above.

$latex {\bf R } = {\bf S} \times {\bf C} \times {\bf S}&s=2$

#constructs the correlation matrix
S %*% C %*% S

And the correlation matrix is symbolically represented as:

$latex {\bf R } =
\begin{bmatrix}
r_{a,a}\ & r_{a,b}\ & r_{a,c}\ & r_{a,d}\ & r_{a,e} \\
r_{a,b} & r_{b,b} & r_{b,c} & r_{b,d} & r_{b,e} \\
r_{a,c} & r_{b,c} & r_{c,c} & r_{c,d} & r_{c,e} \\
r_{a,d} & r_{b,d} & r_{c,d} & r_{d,d} & r_{d,e} \\
r_{a,e} & r_{b,e} & r_{c,e} & r_{d,e} & r_{e,e}
\end{bmatrix}&s=2$

The diagonal where the variances where in the covariance matrix are now 1, since a variable's correlation with itself is always 1.

Covariance Matrix

Making a Covariance Matrix in R

The full R code for this post is available on my GitHub.

Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is the essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while the columns are the variables assigned to those subjects. While we use the matrix terminology, this would look much like a normal data table you might already have your data in. For the example in R, I’m going to create a 6×5 matrix, which 6 subjects and 5 different variables (a,b,c,d,e). I’m choosing this particular convention because R and databases use it. A row in a data frame represents represents a subject while the columns are different variables. [The underlying structure of the data frame is a collection of vectors.] This is against normal mathematical convention which has the variables as rows and not columns, so this won’t follow the normal formulas found else where online.

The covariance matrix is a matrix that only concerns the relationships between variables, so it will be a k x k square matrix. [In our case, a 5×5 matrix.] Before constructing the covariance matrix, it’s helpful to think of the data matrix as a collection of 5 vectors, which is how I built our data matrix in R.]

#create vectors -- these will be our columns
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)

#create matrix from vectors
M <- cbind(a,b,c,d,e)

The data matrix (M) written out is shown below.

     a b  c  d  e
[1,] 1 2  3 10  7
[2,] 2 3  5 20  8
[3,] 3 5  5 30  9
[4,] 4 6  5 40  4
[5,] 5 1 10 50  6
[6,] 6 9  8 55 10

Each value in the covariance matrix represents the covariance (or variance) between two of the vectors. With five vectors, there are 25 different combinations that can be made and those combinations can be laid out in a 5x5 matrix.

There are a few different ways to formulate covariance matrix. You can use the cov() function on the data matrix instead of two vectors. [This is the easiest way to get a covariance matrix in R.]

cov(M)

But we'll use the following steps to construct it manually:

  1. Create a matrix of means (M_mean).
  2. $latex {\bf M\_mean} = \begin{bmatrix}
    1 \\
    1 \\
    1 \\
    1 \\
    1 \\
    \end{bmatrix}
    \times
    \begin{bmatrix} \bar{x_{a}} & \bar{x_{b}} & \bar{x_{c}} & \bar{x_{d}} & \bar{x_{e}}\end{bmatrix}&s=2$

  3. Create a difference matrix (D) by subtracting the matrix of means (M_mean) from data matrix (M).
  4. $latex {\bf D = M - M\_mean} &s=2$

  5. Create the covariance matrix (C) by multiplying the transposed the difference matrix (D) with a normal difference matrix and inverse of the number of subjects (n) [We will use (n-1), since this is necessary for the unbiased, sample covariance estimator. This is covariance R will return by default.
  6. $latex {\bf C = } (n-1)^{-1} \times {\bf D^T} \times {\bf D} &s=2$

k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects

#create means for each column
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e)) 

#creates a difference matrix
D <- M - M_mean

#creates the covariance matrix
C <- (n-1)^-1 t(D) %*% D

The final covariance matrix made using the R code looks like this:

     a         b    c          d        e
a  3.5  3.000000  4.0  32.500000 0.400000
b  3.0  8.666667  0.4  25.333333 2.466667
c  4.0  0.400000  6.4  38.000000 0.400000
d 32.5 25.333333 38.0 304.166667 1.333333
e  0.4  2.466667  0.4   1.333333 4.666667

It represents the various covariances (C) and variance (V) combinations of the five different variables in our data set. These are all values that you might be familiar with if you've used the var() or cov() functions in R or similar functions in Excel, SPSS, etc.

$latex
\begin{bmatrix}
V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix}&s=2$

This matrix is used in applications like constructing the correlation matrix and generalized least squares regressions.

One-Sample t-Test [With R Code]

The one sample t-test is very similar to the one sample z-test. A sample mean is being compared to a claimed population mean. The t-test is required when the population standard deviation is unknown. The t-test uses the sample’s standard deviation (not the population’s standard deviation) and the Student t-distribution as the sampling distribution to find a p-value.

The t-Distribution

While the z-test uses the normal distribution, which is only dependent on the mean and standard deviation of the population. Any of the various t-tests [one-sample, independent, dependent] use the t-distribution, which has an extra parameter over the normal distribution: degrees of freedom (df). The theoretical basis for degrees of freedom deserves a lot of attention, but for now for the one-sample t-test, df = n – 1.

t-distribution vs normal distribution

The distributions above show how the degrees of freedom affect the shape of t distribution. The gray distribution is the normal distribution. Low df causes the tails of the distribution to be fatter, while a higher df makes the t-distribution become more like the normal distribution.

The practical outcome of this will be that samples with smaller n will need to be further from the population mean to reject the null hypothesis than samples with larger n. And compared to the z-test, the t-test will always need to have the sample mean further from the population mean.

Just like the one-sample z-test, we have to define our null hypothesis and alternate hypothesis. This time I’m going to show a two-tailed test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is significantly different from the population mean. The null and alternate hypotheses are written out as:

  • $latex H_0: \bar{x} = \mu&s=2$
  • $latex H_A: \bar{x} \neq \mu&s=2$

t-distribution critical

The graphic above shows a t-distribution with a df = 5 with the critical regions highlighted. Since the shape of the distribution changes with degrees of freedom, the critical value for the t-test will change as well.

The t-stat for this test is calculated the same way as the z-stat for the z-test, except for the σ term [population standard deviation] in the z-test is replaced with s [sample standard deviation]:

$latex z = \frac{\bar{x} – \mu}{\sigma/\sqrt{n}} \hspace{1cm} t = \frac{\bar{x} – \mu}{s/\sqrt{n}} &s=2$

Like the z-stat, the higher the t-stat is the more certainty there is that the sample mean and the population mean are different. There are three things make the t-stat larger:

  • a bigger difference between sample mean and population mean
  • a small sample standard deviation
  • a larger sample size

Example in R

Since the one-sample t-test follows the same process as the z-test, I’ll simply show a case where you reject the null hypothesis. This will also be a two-tailed test, so we will use the null and alternate hypotheses found earlier on this page.

Once again using the height and weight data set from UCLA’s site, I’ll create a tall-biased sample of 50 people for us to test.

#reads data set
data <- read.csv('data/Height_data.csv')
height <- data$Height

#N - number in population
#n - number in sample
N <- length(height)
n <- 50

#population mean
pop_mean <- mean(height)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)

This sample would represent something like athletes, CEOs, or maybe a meeting of tall people. After creating the sample, we use R's mean() and sd() functions to get the parameters for the t-stat formula from above.

sample_mean <- mean(height_sample_biased)
sample_sd <- sd(height_sample_biased)

Now using the population mean, the sample mean, the sample standard deviation, and the number of samples (n = 50) we can calculate the t-stat.

#t-stat
t <- (sample_mean - pop_mean) / (sample_sd/sqrt(n))

Now you could look up the critical value for the t-test with 49 degrees of freedom [50-1 = 49], but this is R, so we can find the area under the tail of the curve [the blue area from the critical region diagram] and see if it's under 0.025. This will be our p-value, which is the probability that the value was achieved by random chance.

#p-value for t-test
1-pt(t,n-1)

The answer should be 0.006882297, which is well below 0.025, so the null hypothesis is rejected and the difference between the tall-biased sample and the general population is statistically significant.

You can find the full R code including code to create the t-distribution and normal distribution data sets on my GitHub .

Twitter Sentiment — Penguins VS. Rangers Gm 4

Game 4 of the Penguins-Rangers series featured a brief overtime period that overshadowed the rest of the game as far as tweet volume goes. Rangers fans were more negative at the beginning of the game after the Penguins scored their first goal. Twitter volume picked up for both teams during the overtime period and Rangers fans’ tweets spiked when they won the game and continued throughout the night.

Gm4 Sentiment