
Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]


The main drawback to the ASCII CSV parser and the csv library is that they can't handle unicode characters or objects. I want to be able to make a CSV file that is encoded in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post, so the description of the json Python object can be found in the previous tutorial.

io.open

First, to handle the UTF-8 encoding, I used the io.open() function. For the sake of consistency, I used it for both reading the JSON file and writing the CSV file. This doesn't require much change to the structure of the program, but it's an important change. The json.loads() call reads the JSON data and parses it into an object you can access like a Python dictionary.

import json
import csv
import io

data_json = io.open('raw_tweets.json', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)

csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file

Unicode Object Instead of List

This program uses the write() method instead of csv.writerow(), and the write() method requires a string, or in this case a unicode object, instead of a list. Commas have to be manually inserted into the string to properly delimit the fields. For the field names, I just rewrote the line of code to be a unicode string instead of the list used for the ASCII parser. The u'*string*' syntax creates a unicode string, which behaves similarly to a normal string, but the two types are different, and using the wrong type of string can cause compatibility issues. The line of code that writes u'\n' creates a new line in the CSV; since this parser is built from scratch, it needs to insert the newline character itself to start a new line in the CSV file.

fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')

The for loop and Delimiters

This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. For this flavor of CSV, the text field is entirely enclosed by quotation marks (") and commas (,) separate the different fields. To account for the possibility of having quotation marks in the actual text content, any real quotation mark is designated by a double quote (""). This can give rise to triple quotes, which happens if a quotation mark starts or ends a tweet's text field.

for line in data_python:

    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')


csv_out.close()

This parser implements the delimiter requirements for the text field by

  1. Replacing every quotation mark in the text with a doubled quotation mark.
  2. Adding quotation marks to the beginning and end of the unicode string.
'"' + line.get('text').replace('"','""') + '"', #creates double quotes
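
As a quick illustration of how those two steps interact, here is a minimal sketch with a hypothetical tweet whose text begins with a quotation mark; the enclosing quote sitting next to the doubled quote is what produces the triple quotes mentioned above.

text = u'"Great" game tonight' #hypothetical tweet text that starts with a quotation mark
escaped = u'"' + text.replace(u'"', u'""') + u'"' #same transformation the parser uses
print escaped #prints """Great"" game tonight" -- triple quotes where the text began with one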

Joining the row list with a comma separator is a quick way to build the unicode string for each line of the CSV file.

row_joined = u','.join(row)

The full code I used in this tutorial can be found on my GitHub.


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]

Collecting Twitter Data: Converting Twitter JSON to CSV — ASCII

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8


I outlined some of the potential hurdles that you have to overcome when converting Twitter JSON data to a CSV file in the previous section. Here I outline a quick Python script that allows you to parse your Twitter JSON file with the csv library. This has the obvious drawback that it can't handle the UTF-8 encoded characters that can be present in tweets. But this program will produce a CSV file that will work well in Excel or other programs that are limited to ASCII characters.

The JSON File

The first requirement is to have a valid JSON file. This file should contain an array of Twitter JSON objects, or in analogous Python terms a list of Twitter dictionaries. The tutorial for the Python Stream Listener has been updated to produce a correctly formatted file that works in Python.

[{Twitter JSON Object}, {Twitter JSON Object}, {Twitter JSON Object}]

The JSON file is loaded into Python and is automatically parsed into a Python-friendly object by the json library using the json.loads() method. The open() line opens and reads the file in as a string, then json.loads() decodes the string into a json Python object which behaves similarly to a list of Python dictionaries — one dictionary for each tweet.

import json
import csv

data_json = open('raw_tweets.json', mode='r').read() #reads in the JSON file into Python as a string
data_python = json.loads(data_json) #turns the string into a json Python object

The CSV Writer

Before getting too far ahead of things, a CSV writer should create a file and write the first row to label the data columns. The open() line creates a file and allows Python to write to it. This is a generic file, so anything could be written to it. The csv.writer() line creates an object which will write CSV-formatted text to the file we just opened. There are some other parameters you are able to specify, but it defaults to Excel specifications, so those options can be omitted.

csv_out = open('tweets_out_ASCII.csv', mode='w') #opens csv file
writer = csv.writer(csv_out) #create the csv writer object

fields = ['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav'] #field names
writer.writerow(fields) #writes field

The purpose of this parser is to get some really basic information from the tweets, so it will only get the date and time, text, screen name and the number of followers, friends, retweets and favorites [which are called likes now]. If you wanted to retrieve other information, you would create the column names accordingly. The writerow() method writes a list, with each element becoming a value separated by commas in the CSV file.

The json Python object can be used in a for loop to access the individual tweets. From there each line can be accessed to get the different variables we are interested in. I've condensed the code so that it is all in one statement. Breaking it down, line.get('*attribute*') retrieves the relevant information from the tweet, and line represents an individual tweet.

for line in data_python:

    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    writer.writerow([line.get('created_at'),
                     line.get('text').encode('unicode_escape'), #unicode escape to fix emoji issue
                     line.get('user').get('screen_name'),
                     line.get('user').get('followers_count'),
                     line.get('user').get('friends_count'),
                     line.get('retweet_count'),
                     line.get('favorite_count')])

csv_out.close()

You might not notice this line, but it's critical to this program working.

line.get('text').encode('unicode_escape'), #unicode escape to fix emoji issue

If the encode() method isn't included, unicode characters (like emojis) are included in their native encoding. They will be sent to the csv.writer object, which can't handle those characters and will fail. This is necessary for any field that could possibly contain a unicode character. I know the other fields I chose cannot have non-ASCII characters, but if you were to add name or description, you'd have to make sure they do not have incompatible characters.

The unicode escape rewrites the unicode as a string of letters and numbers, much like \U0001f35f. These represent the characters and can actually be decoded later.
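
Here is a minimal sketch of that round trip (Python 2, matching the rest of this tutorial), using a made-up string containing the code point mentioned above:

text = u'I want a \U0001f35f' #unicode string containing an emoji code point
escaped = text.encode('unicode_escape') #byte string of plain ASCII escape sequences
print escaped

restored = escaped.decode('unicode_escape') #decodes back to the original unicode string
print restored == text #True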

The full code I used in this tutorial can be found on my GitHub.


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter Data: Converting Twitter JSON to CSV — Possible Errors

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


ASCII JSON-to-CSV | UTF-8 JSON-to-CSV

Before diving into the problem of how to save tweets in a CSV file, let me say there are a thousand ways to do this and about a hundred complications that arise depending on which way you want to accomplish it. I will devote two posts, covering both ASCII and UTF-8 encoding, because many tweets contain characters beyond the normal Latin alphabet.

Let’s look at some of the issues with writing CSV from tweets.

  • Tweets are JSON and contain a massive amount of metadata. More than you probably want.
  • The JSON isn’t a flat structure; it has levels. [Direct contrast to a CSV file.]
  • The JSON files don’t all have the same elements.
  • There are many foreign languages and emoji used in tweets.
  • Tweets contain many different grammatical marks such as commas and quotation marks.

These issues aren’t incredibly daunting, but those unfamiliar will encounter frustrating errors.

Tweets are JSON and contain a massive amount of metadata. More than you probably want.

I’m always in favor of keeping as much data as possible, but tweets contain a massive amount of different metadata attributes. All of these are designed for the Twitter platform and for the associated client apps. Some items like the profile_background_image_url_https really don’t have much of an impact on any analysis. Choosing which attributes you want to keep will be critical before embarking on a process to parse the data into a CSV. There’s a lot to choose from: timestamp data, user data, retweet data, geocoding data, hashtag data and link data.

The JSON isn’t a flat structure; it has levels.

This issue is an extension of the previous issue, since tweet JSON data isn’t organized into a flat, spreadsheet-like structure. The created_at and text elements are located on the top level and are easy to access, but something as simple as the tweeter’s name and screen_name are located in the user nested object. Like everything else mentioned in this post, this isn’t a huge issue, but the structure of a tweet JSON file has to be considered when coding your program.

The JSON files don’t all have the same elements.

The final problem with the JSON structure itself is that the fields aren't necessarily present in every object. Many geo-related attributes do not appear unless geotagging is enabled. This means if you write your program to look for geotagging data, it can throw a key error if those keys don't exist in that specific tweet. To avoid this you have to account for the exception or use a method that already does that. I use the get() method to avoid these key errors in the CSV parser.
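
To make the difference concrete, here is a small sketch with a hypothetical tweet dictionary that has no geotagging data; get() simply returns None, while bracket indexing raises a KeyError.

tweet = {'text': 'hello world', 'user': {'screen_name': 'example'}} #hypothetical tweet dict with no geo data

print tweet.get('coordinates') #prints None instead of raising an error
#print tweet['coordinates']    #bracket indexing would raise a KeyError and stop the parser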

There are many foreign languages and emoji used in tweets.

I quickly addressed this issue in a few posts, and it's one of the reasons why I like to store tweets in MongoDB. Tweets contain a lot of [read: important] unicode characters. These are typically foreign language characters and the ubiquitous emojis. This is important because the presence of UTF-8 unicode characters can and will cause encoding errors when parsing a file or loading a file into Excel. Excel (at least the version on my computer) can't handle these characters. Other tools like the built-in CSV writer in Python can't handle unicode out of the box. Being able to deal with these characters is critical to compatibility with other software as well as the integrity of your data.

This issue forced me to write two different parsers as examples: a CSV parser that outputs ASCII, which imports well into Excel, and a UTF-8 version, which allows you to natively save the characters and emojis in a human-readable CSV file.

Tweets contain many different grammatical marks such as commas and quotation marks.

This is a problem that I had when I first started working with Twitter data and tried to write my own parser — characters that are part of your text content sometimes get confused with the delimiters. In this case I'm talking about quotation marks (") and commas (,). Commas separate the values for each 'cell', hence the acronym CSV. If you tweet, you've probably tweeted using one of these characters. I've stripped them out of the text previously to solve this problem, but that's not a great solution. The way Excel handles this is to enclose any element that contains commas in quotation marks and then to use double quotation marks to signify an actual quotation mark rather than enclosed text. This will be demonstrated in the UTF-8 parser since I made that from scratch.


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

R Bootcamp: Making a Subset

Data Manipulation: Subsetting

Making a subset of a data frame is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis, a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame chosen based on certain criteria.

[Diagram: a data frame with rows Row1–Row6 and columns V1–V7, with an arrow pointing to a subset containing only Row2, Row5 and Row6.]

The Data

For this example, I'm using data from FanGraphs. You can get the exact data set here, and it's provided in my GitHub. This data set has player names, teams, seasons and stats. We are able to create a subset based on any one or more of these variables.

The Code

I'm going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using the filter() function from the dplyr package. All of these functions are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, so let's start with that. This is especially true when you have to loop an operation or run something repeatedly.

dplyr

The filter() function requires the dplyr package to be loaded in your R environment, and loading dplyr masks the filter() function from the default stats package. You don't need to worry about this, but R does tell you about it when you first install and load the package.

#install.packages('dplyr')
library(dplyr) #load the package

#from http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2015&month=0&season1=2010&ind=1&team=&rost=&age=&filter=&players=&page=2_30
setwd('***PATH***') 
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data

Aside from loading the package, you'll have to load the data in as well.

#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')

#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East

#Both of these find players in the NL East and have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign

#Finds players who are in the NL East or have more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)

#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)

The filter() function is rather simple to use. The code above illustrates a few simple cases where you specify the data frame you want to use and create true/false expressions, which filter() uses to find which rows it should keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.

Built-in Functions

#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]

#method 3 -- subset()
data.sub.3 <- subset(data,subset = (Team=='Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than

data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #dual requirements using AND (&)
data.sub.7 <- subset(data, subset = (AVG > .300 & PA > 600)) #dual requirements using subset()

data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #dual requirements using OR (|)

data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector

data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams

If you don't want to use the dplyr package, you are able to accomplish the same thing using the basic functionality of R. #method 1 uses a boolean vector to select rows for the subset. #method 2 uses the which() function, which finds the indexes of the TRUE values in a boolean vector. Both of these techniques use the original data frame and use the row indexes to create a subset.

The subset() function works much like the filter() function, except the syntax is slightly different and you don't have to download a separate package.

Efficiency

While these methods all produce the same subset, they don't all perform the same way. While some data manipulation might only happen once or a few times throughout a project, many projects require constant subsetting, possibly inside a loop. So while the gains might seem insignificant for one run, multiply that difference and it adds up quickly.

I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.

Time to Subset 500,000 Rows
Subset Method Elapsed Time (sec)
boolean vector 0.87
which() 0.33
subset() 0.81
dplyr filter() 0.21

The dplyr filter() function was by far the quickest, which is why I prefer to use it.
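
If you want to reproduce this kind of comparison yourself, here is a rough sketch using the built-in system.time() function. The 500,000-row data frame big is a hypothetical resampling of the FanGraphs data, and replicate() just repeats each subset enough times to make the differences visible.

#resamples the FanGraphs data up to 500,000 rows (assumed for illustration)
big <- data[sample(nrow(data), 500000, replace=TRUE),]

system.time(replicate(10, big[big$Team %in% NL.East & big$HR > 30,]))        #boolean vector
system.time(replicate(10, big[which(big$Team %in% NL.East & big$HR > 30),])) #which()
system.time(replicate(10, subset(big, Team %in% NL.East & HR > 30)))         #subset()
system.time(replicate(10, filter(big, Team %in% NL.East, HR > 30)))          #dplyr filter()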

The full code I used to write up this tutorial is available on my GitHub.

References:

Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html


Emoji iOS 9.1 Update — The Taco Emoji Analysis

Before I get too far: I don't actually analyze taco emojis. At least not yet. I do, however, give you the tools to start parsing them from tweets, text or anything else you can get into Python.

This past month Apple released their iOS 9.1 and their latest OS X 10.11.1 El Capitan update. That update included a bunch of new emojis. I've made a quick primer on how to handle emoji analysis in Python. Then, when Apple released an update to their emojis to include diversity, I updated my small Python class for emoji counting to include the newest emojis. I also looked at what is actually happening with the unicode when the diversity modifiers are used.

Click for Updated socialmediaparse Library

With this latest update, Apple and the Unicode Consortium didn't really introduce any new concepts, but I did update the Python class to include the newest emojis. In my GitHub repository, the data folder includes a text file with all the emojis delimited by '\n'. The class uses this file to find any emojis in a unicode string passed to the add_emoji_count() method.

Building off of the diversity emoji update, I added a skin_tones_dict property to the EmojiDict class. This property returns a dictionary with the number of unique human emojis per tweet and their skin tones. This property will not catch duplicate human emojis if they appear within the same execution of the add_emoji_count() method.

import socialmediaparse as smp #loads the package
 
counter = smp.EmojiDict() #initializes the EmojiDict class
 
#goes through list of unicode objects calling the add_emoji_count method for each string
#the method keeps track of the emoji count in the attributes of the instance
for unicode_string in collection:
   counter.add_emoji_count(unicode_string)  
 
#output of the instance
print counter.dict_total #dict of the absolute total count of the emojis in corpus
print counter.dict       #dict of the count of strings with the emoji in corpus
print counter.baskets    #list of lists, emoji in each string.  one list for each string.
print counter.skin_tones_dict #dict of unique human emojis and their skin tones aggregated by the counter

#print counter.skin_tones_dict output
#{'human_emoji': 4, '\\U0001f3fe': 1, '\\U0001f3fd': 1, '\\U0001f3ff': 0, '\\U0001f3fc': 2, '\\U0001f3fb': 1}
 
counter.create_csv(file='emoji_out.csv')  #method for creating csv

Above is an example of how to use the new attribute. It is a dictionary so you can work that into your analysis however you like. I will eventually create better methods and outputs to make this feature more robust and useful.

The full code / class I used in this post can be found on my GitHub.

R Bootcamp — A Quick Introduction

R, a statistical programming language and environment, is becoming more popular as organizations, governments and businesses have increased their use of data science. In an effort to provide a quick bootcamp for learning the basics of R, I've assembled some of the most basic processes to give a new user a quick introduction to the R language.

This post assumes that you have already installed R and have it running correctly on your computer. I recommend getting RStudio to write and execute your code. It will make your life much easier.

Getting Started

First, R is an interactive programming environment, which means you are able to send commands to its interpreter to tell it what to do.

There are two basic methods to send commands to R. The first is by using the console, which is like your old-school command line computing methods. The second method is more typically used by R coders, and that's to write a script. An R script isn't fancy; at its core it's a text document that contains a collection of R commands. When the code is executed, it is treated like a collection of individual commands being fed one by one into the R interpreter. This differs from how other, more fundamental programming languages work.

[Diagram: how R works]

Basics

Comments are probably the best place to start, especially because my code is chock-full of them. A comment is code that is fed to R, but it’s not executed and has no bearing on the function of your script or command. In R comments are lines prefaced with a #.

#comments start with #-signs

9-3 #basic math (this doesn't save this in a variable)

One of the most basic things you could use R for is a calculator. For instance, if we run the code 9-3, R will display 6 as the result. All of this is rather straightforward. The only operator you might not be familiar with, if you are new to coding, is the modulus operator, which yields the remainder when you divide the first number by the second. This gets used often when dealing with data. For example, you get 0 for an even number and 1 for an odd number if you take your variable and use the modulus operator with the number 2 (a quick example follows the code block below).

#basic operations
#yields numeric value
1+2 #addition
3-2 #subtraction
3*2 #multiplication
4/5 #division
3 %% 2 #modulus (remainder operator)
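
A quick sketch of the even/odd trick mentioned above:

x <- 7
x %% 2 #returns 1, so x is odd
6 %% 2 #returns 0, so 6 is even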

Beyond the basic math and numeric operations you can do, R has several fundamental data types. NULL and NA represent empty objects or missing data. These two data types aren't the same; NA will fill a position in a vector or data frame. The details are best left for another entry.

#basic data structure

NULL     #empty value
NA       #missing value
9100     #numeric value
'abcdef' #string
TRUE     #boolean true
T        #equivalent form
FALSE    #boolean false
F        #equivalent form

Numeric values can have mathematical operations performed on them. Strings are essentially non-numeric values. You can't add strings together or find the average of a string. In any type of data analysis, you'll typically have some string data. It can be used to classify entries categorically, such as male/female or Mac/Windows/Linux. R will treat these like factors.

Finally, boolean values (True or False) are binary logical values. They work like normal logic operations you might have learned in math or a logic class with AND (&&) and OR (||) operators. These can be used in conditional statements and various other data manipulation operations such as subsetting.

#logical operators
T && F #and
F || T #or

Now that we covered the basic operations and data types, let’s look at how to store that — variables. To assign a value to a variable it’s rather easy. You can use a simple equation or the traditional R notation using an arrow.

x <- 1 #basic assignment
x = 1

#####Acceptable Variables
x <- 1
X <- 1
X1 <- 1
X.1 <- 1
X_1 <- 1
########################

####UNACCEPTABLE Variables
1X <- 1
X-1 <- 1  #CODE WILL NOT WORK
X,1 <- 1
#######################

Variables must begin with a letter, and they are case-sensitive. Periods are acceptable faux separators in variable names, but that doesn't translate to other programming languages like Python or JavaScript, so it might factor into how you establish naming conventions.

I've mentioned vectors a few times already. They are an important data structure within R. A vector is an ordered list of data, typically thought of as numeric data, but character (string) vectors are often used in R. The c() operator creates a vector. It's important that vectors contain the same type of data: boolean, numeric or character. If you mix types, R will coerce the values into a single type. And you can assign your vectors to variables; in fact, you can store just about anything in R in a variable.

x.vector <- c() #vector operator
x.vector <- c(1,2,3,4,5,6) #creates a vector (typically numeric)
x.vector <- c(T, 1) #mixing types -- the boolean is coerced to numeric 1
x.list <- list('A',12,'b') #creates a list (not used for numeric operations)

mean(c(1,3,2)) #functions like mean() take a vector as input

Lists are created with the list() command. They are used more for storage and organization than as a data structure for calculations. For example, you could store the mean, median and range for a set of data in a list, while a vector would house the data used to calculate those summary stats (a short sketch of this appears after the basic stats code below). Lists are useful when you begin to write bigger programs and need to shuffle a lot of things around.

The basic statistic operators are listed below. All of these require a vector to operate on.

#basic stats
x.vector <- c(10,11,12,12,10,11,20,9) #puts your data into a vector

mean(x.vector) #takes mean of vector
median(x.vector) #median of vector
max(x.vector) #maximum of vector
min(x.vector) #minimum of vector
range(x.vector) #yields a vector with a range

sd(x.vector) #standard deviation
var(x.vector) #variance
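
Tying the last two ideas together, here is a small sketch that stores some of those summary statistics from x.vector in a named list, as mentioned in the lists paragraph above:

x.summary <- list(mean=mean(x.vector), median=median(x.vector), range=range(x.vector)) #named list of summary stats
x.summary$mean  #access a list element by name
x.summary$range #this element is itself a vector of length 2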

Handling Data

Above we discussed some of the building blocks of basic analysis in R. Beyond introductory statistics classes, R isn't very useful unless you can import data. There are many ways to do this since data exists in many different formats. A .csv file is one of the most basic, compatible ways data is stored to be used between different analytical tools.

Before loading this data file into R, it’s a good idea to set your working directory. This is where the data file is stored.

#load in data
setwd('**folder path**') #sets your working directory 
                                         #specific to each computer

Next you can use the read.csv() function to ingest a .csv file into R. This call won't save the data in a variable; it just brings it in as a data frame and shows it to you.

read.csv('data_bryant_kobe.csv') #reads the data into R
                                 #does not save it into a variable

data <- read.csv('data_bryant_kobe.csv') #reads the data and saves it
                                         #into a variable called 'data'

Data frames are the primary form of data structure you'll encounter in R. Data frames are like tables in Excel or SQL in that they are rectangular and have a rigid schema. However, at a data frame's core is a collection of equal-length vectors.
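
To make that last point concrete, here is a tiny sketch, with made-up values, of a data frame built directly from equal-length vectors:

player <- c('A','B','C')      #character vector
age    <- c(24, 31, 28)       #numeric vector
pts    <- c(20.1, 18.4, 25.3) #numeric vector

small.df <- data.frame(player, age, pts) #each vector becomes a column
str(small.df) #shows that each column is just a vector with its own type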

If you assign the data frame output of the read.csv() function to a variable, you can pass around the data frame to different data manipulation functions or modeling functions. One of the most basic ways to manipulate the data is to access different values within the data frame. Below are several different examples on how to get to values, rows or columns in a data frame.

The basic concept is that data frames can be accessed by row and column number: [row, column]. An entire row or column can be accessed by omitting the dimension you aren't trying to retrieve. You can retrieve individual fields (variables) by using the $ sign and the variable name. This is the method I use most often. It requires knowing and using the names of the variables, which can make your code easier to read.

#accessing values
row <- 3
column <- 2
data$Age #returns the Age variable column
data[row,column]  #individual value
data[row,]        #returns entire row
data[,column]     #returns entire column as a vector
data$Age[3]       #returns the third value of the Age column

By accessing rows, you can create a subset of data by using a logical argument to filter your data set.

#creating a quick subset
data.U25 <- data[data$Age < 25,]  #creates an under-25 set
data.O25 <- data[data$Age >= 25,] #creates a 25 and older set

The code above creates two new data frames which separate Kobe Bryant's season stats into an under-25 data set and a 25-and-older data set.

Relationships Between Variables

Correlation is often used to summarize the linear relationship between two variables. Getting the correlation in R is simple. Use the cor() function with two equal length vectors. R uses the corresponding elements in each vector to get a Pearson correlation coefficient.

#correlation between different variables within the subset
cor(data.U25$MP, data.U25$PTS)
cor(data.O25$MP, data.O25$PTS)

cor(data$MP, data$PTS) #correlation with two related vectors from data set
cor(data$MP, data$FTpct)

A simple linear model can be made by using the lm() function. The linear model function requires two things: a formula and a data frame. The formula uses a tilde (~) instead of an equals sign and represents the variables you would use in your standard ordinary least squares regression. The data parameter is the data frame which contains all the data.

#create a basic linear model
linear.model <- lm(PTS ~ MP, data=data.U25) #simple regression of points on minutes played
linear.model <- lm(PTS ~ MP + Age, data=data.U25) #adds Age as a second predictor, overwriting the first model object

The summary() function takes the linear model object and displays information about the coefficients of your linear model.

summary(linear.model)

NOTES: The data set used in this tutorial is from basketball-reference.com.

The full code I used in this tutorial can be found on my GitHub.

Covariance — Different Ways to Explain or Visualize It

Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings for a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, X and Y have positive covariance. Negative covariance results from two variables moving in opposite directions, and zero covariance results from the variables having no relationship with each other. Variance is also a specific case of covariance where both input variables are the same. All of this is rather abstract, so let's look at more concrete definitions.
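
Before the formal definitions, here is a quick sketch of those three cases in R, using made-up vectors:

x <- c(1,2,3,4,5)
cov(x, c(2,4,6,8,10)) #positive covariance: y increases with x
cov(x, c(10,8,6,4,2)) #negative covariance: y decreases as x increases
cov(x, c(3,3,3,3,3))  #zero covariance: y doesn't change with x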

Covariance — Summation Notation

The definition you will find in most introductory stats books is a relatively simple equation using the summation operator (Σ). It expresses covariance as the average of the products of each paired data point's deviations from its respective mean. First, you need to find the mean of both variables. Then take each data point and subtract the mean of its respective variable. Next, multiply the paired differences together. Finally, sum those products and divide by the number of data points.

Population Covariance:

$latex
cov(X, Y) = \frac{1}{{N}} \sum\limits_{i}^{N}{(X_i - \mu_x)(Y_i - \mu_y)}
&s=2$

Sample Covariance:

$latex
cov(X, Y) = \frac{1}{{n-1}} \sum\limits_{i}^{n}{(X_i - \bar X)(Y_i - \bar Y)}
&s=2$

N is the number of data points in the population; n is the sample size. μX is the population mean for X; μY for Y. X̄ and Ȳ are means as well, but this notation designates them as sample means rather than population means. Calculating the covariance of any significant data set can be tedious if done by hand, but we can set up the equation in R and see it work. I used a modified version of Anscombe's Quartet data set.

#get a data set
X <- c(anscombe$x1, 6,4,10)
Y = c(anscombe$y1, 10,8,6)
#get the means
X.bar = mean(X)
Y.bar = mean(Y)
#calculate the covariance
sum((X-X.bar)*(Y-Y.bar)) / (length(X)) #manually population
sum((X-X.bar)*(Y-Y.bar)) / (length(X) - 1) #manually sample
cov(X,Y) #built-in function USES SAMPLE COVARIANCE

Obviously, since covariance is used so much within statistics, R has a built-in function cov(), which yields the sample covariance for two vectors or even a matrix.

Covariance — Expected Value Notation

[Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post. Quickly and oversimplified, the expected value is the mean value of a random variable: E[X] = mean(X).] The expected value notation below describes the population covariance of two variables (not the sample covariance):

$latex
cov(X, Y) = \textnormal{E}[(X-\textnormal{E}[X])(Y-\textnormal{E}[Y])]
&s=2$

The above formula is just the population covariance written differently. For example, E[X] is the same as μx. And the E[] acts the same as taking the average of (X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful formula for covariance:

$latex
cov(X, Y) = \textnormal{E}[XY] - \textnormal{E}[X]\textnormal{E}[Y]
&s=2$

This formula can be interpreted as the mean of the products of X and Y minus the product of the means of X and Y. It probably isn't very useful if you are trying to interpret covariance, but you'll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.

mean(X*Y) - mean(X)*mean(Y) #expected value notation

Covariance — Signed Area of Rectangles

Covariance can also be thought of as the average of the signed areas of the rectangles that can be drawn from the data points to the variables' respective means. It's called the signed area because we get two types of rectangles: ones with positive values and ones with negative values. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing, and less of a practical interpretation and application of covariance. If you plot the paired data points (in this case the X and Y variables we have already used), you can tell just by looking that there is probably some positive covariance, because it looks like there is a linear relationship in the data. I've chosen to scale the plot so that zero is not included; since this is a scatter plot, including zero isn't necessary.

XY Scatter Plot

First, we can draw straight lines for the means of both variables. These lines effectively create a new set of axes and will be used to draw the rectangles. One side of each rectangle is the difference between a data point and its mean [Xi - X̄]. When that is multiplied by [Yi - Ȳ], you get the area of a rectangle. Do that for every point in your data set, add them up and divide by the number of data points, and you get the population covariance.

Scatterplot Covariance Rectangle

The following plot has a rectangle for each data point, coded red for negative and blue for positive signs.

All Covariance Rectangles

The overlapping rectangles need to be considered separately, so the opacity is reduced so that all the rectangles are visible. For this data set there is much more blue area than red area, so there is positive covariance, which jibes with what we calculated earlier in R. If you were to take the areas of those rectangles, add or subtract them according to the blue/red color and then divide by the number of rectangles, you would arrive at the population covariance: 3.16. To get the sample covariance you'd subtract one from the number of rectangles when you divide.
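
Here is a small sketch of that rectangle interpretation in R, reusing the X and Y vectors defined earlier in this post:

signed.areas <- (X - mean(X)) * (Y - mean(Y)) #signed area of each rectangle
sum(signed.areas > 0) #number of blue (positive) rectangles
sum(signed.areas < 0) #number of red (negative) rectangles
mean(signed.areas)    #population covariance, roughly 3.16 for this data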

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

Covariance. http://math.tutorvista.com/statistics/covariance.html

Covariance As Signed Area Of Rectangles. http://www.davidchudzicki.com/posts/covariance-as-signed-area-of-rectangles/
How would you explain covariance to someone who understands only the mean? https://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean

Notes:

The signed-area-of-rectangles explanations on Chudzicki's site and Stack Exchange use a different covariance formulation, but a similar concept to my approach.

The full code I used to write up this tutorial is available on my GitHub.


Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one of the most commonly [over]used statistical tools. In short, it measures how strong the relationship between two variables is. It's important to realize that correlation doesn't necessarily imply that one of the variables affects the other.

Basic Calculation and Definition

Covariance also measures the relationship between two variables, but it is not scaled, so it can't tell you the strength of that relationship. For example, let's look at the following vectors a, b and c. Vector c is simply a 10X-scaled transformation of vector b, and vector a has no transformational relationship with vector b or c.

a b c
1 4 40
2 5 50
3 6 60
4 8 80
5 8 80
5 4 40
6 10 100
7 12 120
10 15 150
4 9 90
8 12 120

Plotted out, a vs. b and a vs. c look identical except that the y-axis is scaled differently for each. But when the covariance is taken of both a & b and a & c, you get a large difference in the results. The covariance between a & b is much smaller than the covariance between a & c even though the plots are identical except for the scale; the y-axis on the c vs. a plot goes to 150 instead of 15.

$latex
cov(X, Y) = \frac{\Sigma_i^N{(X_i - \bar X)(Y_i - \bar Y)}}{N-1}
&s=2$

$latex
cov(a, b) = 8.5
&s=2$

$latex
cov(a, c) = 85
&s=2$

[Scatter plots: b vs. a and c vs. a]

To account for this, correlation takes the covariance and scales it by the product of the standard deviations of the two variables.

$latex
cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y}
&s=2$

$latex
cor(a, b) = 0.8954
&s=2$

$latex
cor(a, c) = 0.8954
&s=2$

Now, correlation describes how strong the relationship between the two vectors is, regardless of the scale. Since the standard deviation of vector c is much greater than that of vector b, it accounts for the larger covariance term and produces identical correlation terms. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a strong relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no relationship.

Here’s the R code that will run through the calculations.

#covariance vs correlation
a <- c(1,2,3,4,5,5,6,7,10,4,8)
b <- c(4,5,6,8,8,4,10,12,15,9,12)
c <- c(4,5,6,8,8,4,10,12,15,9,12) * 10

data <- data.frame(a, b, c)

cov(a, b)  #8.5
cov(a, c)  #85

cor(a,b)  #0.8954
cor(a,c)  #0.8954

Caution | Anscombe's Quartet

Correlation is great. It's a basic tool that is easy to understand, but it has its limitations. The most prominent is the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept with some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it's hard to isolate one factor, yet politicians and pundits still try.

Another famous caution about using correlation is Anscombe's Quartet. Anscombe's Quartet uses different sets of data to achieve the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it's important to visualize data.

[Plot: the four Anscombe Quartet data sets with their correlation coefficients]

The graphs demonstrate how different the data sets can be. If this were real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be considered appropriate for a basic, linear model.

I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don't need to install any packages to use it. ggplot2 is a fantastic and powerful data visualization package which can be downloaded for free using the install.packages('ggplot2') command. Below is the R code I used to make the graphs individually and combine them into a matrix.

library(ggplot2)   #for the ggplot() calls below
library(gridExtra) #for grid.arrange()

#correlation
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits=4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits=4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits=4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits=4)

#define the OLS regression
line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) + geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
  geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2]) +
  annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) + geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
  geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2]) +
  annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) + geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
  geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2]) +
  annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) + geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
  geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2]) +
  annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

grid.arrange(plot1, plot2, plot3, plot4, top='Anscombe Quartet -- Correlation Demonstration')

The full code I used to write up this tutorial is available on my GitHub.

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

Baseball Twitter Roller Coaster

Because Twitter is fun and so are graphs, I have tweet volume graphs from my Twitter scraper that collects tweets with the team-specific nicknames and Twitter handles. After a trade (or non-trade), the data can be collected and a graphical picture of the reaction can be produced. The graph represents the volume of sampled tweets that contained the specific keywords: Mets, Gomez, Flores and Hamels. “Tears” is a collection of any tweets which mentioned either “tears” or “crying”, since Flores was in tears as he took the field.

Here are the reactions to the Gomez non-trade and Hamels trade last night:

[Graph: tweet volume for the Mets, Gomez, Flores, Hamels and Tears keywords]

And here's the timeline of the relevant tweets:
[All times are EDT.]


July 29: 9:00 PM, 9:45 PM, 9:54 PM, 10:15 PM, 10:55 PM
July 30: 12:13 AM

Some of the times were rounded if there wasn’t a clear single tweet that caused the peak on Twitter.


Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier.

The correlation matrix is rather popular for exploratory data analysis, because it can quickly show you the correlations between the variables in your data set. From a practical standpoint this entire post is unnecessary, since R's built-in cor() function will do this for you; here I'm going to show how to derive the same matrix using matrix algebra in R.

First, the starting point will be the covariance matrix that was computed from the last post.

#create vectors -- these will be our columns
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)

#create matrix from vectors
M <- cbind(a,b,c,d,e)
k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects

#create means for each column
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e)) 

#creates a difference matrix
D <- M - M_mean

#creates the covariance matrix
C <- (n-1)^-1 * t(D) %*% D #creates the (sample) covariance matrix

$latex {\bf C } =
\begin{bmatrix}
V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix}&s=2$

This matrix has all the information that's needed to get the correlations for all the variables and create a correlation matrix [V -- variance, C -- covariance]. Correlation (we are using the Pearson version) is calculated using the covariance between two vectors and their standard deviations [s, the square root of the variance]:

$latex
cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}}
&s=2$

The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:

$latex {\bf V} = diag({\bf C}) = \begin{bmatrix}
V_a\ & 0\ & 0\ & 0\ & 0 \\
0 & V_b & 0 & 0 & 0 \\
0 & 0 & V_c & 0 & 0 \\
0 & 0 & 0 & V_d & 0 \\
0 & 0 & 0 & 0 & V_e
\end{bmatrix}

&s=2$

Since R doesn't quite work the same way as matrix algebra notation, the diag() function creates a vector from a matrix and a matrix from a vector, so it's used twice to create the diagonal variance matrix: once to get a vector of the variances, and a second time to turn that vector into the above diagonal matrix. Since the standard deviations are needed, the square root is taken, and the variances are also inverted to facilitate division.

#builds a diagonal matrix of inverse standard deviations (1/s) from the covariance matrix
S <- diag(diag(C)^(-1/2))

After getting the diagonal matrix, basic matrix multiplication is used to get all the terms of the covariance matrix to reflect the basic correlation formula from above.

$latex {\bf R } = {\bf S} \times {\bf C} \times {\bf S}&s=2$

#constructs the correlation matrix
S %*% C %*% S

And the correlation matrix is symbolically represented as:

$latex {\bf R } =
\begin{bmatrix}
r_{a,a}\ & r_{a,b}\ & r_{a,c}\ & r_{a,d}\ & r_{a,e} \\
r_{a,b} & r_{b,b} & r_{b,c} & r_{b,d} & r_{b,e} \\
r_{a,c} & r_{b,c} & r_{c,c} & r_{c,d} & r_{c,e} \\
r_{a,d} & r_{b,d} & r_{c,d} & r_{d,d} & r_{d,e} \\
r_{a,e} & r_{b,e} & r_{c,e} & r_{d,e} & r_{e,e}
\end{bmatrix}&s=2$

The diagonal, where the variances were in the covariance matrix, is now all 1s, since a variable's correlation with itself is always 1.
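
As a quick sanity check, the result of the matrix algebra can be compared to R's built-in cor() function, which computes the Pearson correlation matrix directly from the data matrix:

R <- S %*% C %*% S    #correlation matrix built from the matrix algebra above
round(R - cor(M), 10) #should be a matrix of (near) zeros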