Tag Archives: coding

D3 Visualization Basics — Introduction

Data visualization is important, really important. I can’t be more blunt than that. We are able to process much more information, much faster, by seeing a visual representation than we could by looking at a table, querying a database or interacting with a spreadsheet. I will be writing a series of posts that explore some of the foundations D3 is built on, along with how to create engaging data visualizations using it.

D3 is a powerful tool that allows you to create interactive data visualizations for the web. Understanding how D3 works starts with understanding how modern web pages are designed.

If you have found this page, you probably have at least some knowledge of how to make a modern website: HTML, CSS, JavaScript, responsive design, etc. D3 uses basic elements from these components of web design to create its visualizations. This is by no means the only way to create interactive visualizations, but it is an effective way to produce them.

Before jumping into D3 nuts and bolts, let’s look at what each of these components does. [If you already know this stuff, feel free to skip ahead…once I get the other posts built out.]

Basic Web Programming

In the most simplistic terms, HTML provides the structure of the webpage, CSS provides the styling and formatting, and JavaScript provides the functionality of the site. The browser brings these three components together and interprets them into something the end user (you) can understand and use. Sometimes one component can accomplish what the other does, but if you stick to this generalization you’ll be in good shape.

To produce a professional-looking, fully-functional D3 data visualization you will need to understand, write and manipulate all three components.

HTML

The most vivid memories I have of HTML are from the websites of the late 90s: Geocities, Angelfire, etc. HTML provides instructions on how browsers should interpret information; it organizes the information. Everything you see on a webpage has corresponding HTML code.

If you look at the source HTML or inspect one of this site’s pages, you’ll see some of that structure. When HTML renders in the browser, these elements are referred to as DOM elements. DOM stands for Document Object Model, which is the structure of a webpage.

[Screenshot: HTML containers in this site’s DOM tree]

Looking at the DOM tree you can see many of the div containers that provide structure for how the site is laid out. The p tags contain each paragraph in the content of my posts. h1, h2 and h3 are subheadings I’ve made to keep the posts organized. You’ll also notice some attributes, especially class, which have many uses in CSS, JavaScript and D3. Classes in particular are used to identify what function a DOM element plays in JavaScript or how to style it in CSS.

CSS

A house without painted walls or decorations is pretty boring. The same thing happens with bare bones HTML. You can organize the information, but it won’t be in an appealing format.

Most sites have style sheets (CSS) which set margins, colors, display options, etc. Style sheets have a specific syntax which identifies HTML elements by type, class or id. This identification and selection concept is used extensively in D3.

[Screenshot: CSS example from this site]

Above is some CSS from this site. It contains formatting instructions for elements of the class “page-links”: the font size, margins, height, width, and an instruction to make the text all uppercase. The advantage of CSS is that it keeps formatting separate from the structure of the HTML, allowing you to format many elements at once. For example, if you wanted to change the color of every link, you could easily do that by modifying the CSS.

There is an alternative to using CSS style sheets, and that’s using inline style definitions.

Inline styles use the same markup as the CSS in the style sheets. Inline styles

  • control only the element they are in
  • OVERRIDE any CSS styles [unless the CSS rule has an !important tag]

For example, an inline style can override a paragraph’s default left alignment and center the text instead. Using inline styles is generally considered bad practice for web design, but it’s important to understand how they work since D3 manipulates inline styles often.

JavaScript

JavaScript breathes life into your web page. It’s certainly not the only way to make your website interactive or to build programming into it, but it is widely used and supported in the popular browsers. D3 is a JavaScript library, so you will inevitably have to write JavaScript to use it.

For D3 visualizations, JavaScript will be used to

  • Manage and manipulate data for the visualization
  • Create DOM elements
  • Manipulate DOM elements
  • Destroy DOM elements
  • Attach data to DOM elements

JavaScript will be used to insert elements onto the page; it will also be used to change the colors and styles of those elements. You might be able to see how this could be useful. For example, JavaScript could map data points to an element’s position for a scatter plot or to an element’s height or width for a bar chart.

The last function in that list, attaching data to DOM elements, deserves emphasis because it’s so critical to D3. It allows you to attach data beyond simple x, y values, which makes richer visualizations possible.

[Screenshot: the __data__ property attached to a DOM element]

Above is data attached to a D3 visualization I made for FanGraphs. This is a simple example, but I was able to attach data detailing the team’s name, id, league, ERA and FIP. Using the attached data I was able to create the graph and tooltips. More complex designs can take advantage of the robust data structure D3 provides.

[Next]

I’ll look at how to set up a basic project by organizing data, files and code.

R Bootcamp: Making a Subset

Data Manipulation: Subsetting

Making a subset of a data frame is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis, a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame selected based on certain criteria.

[Diagram: a data frame with rows Row1 through Row6 and columns V1 through V7, with an arrow pointing to a subset containing only Row2, Row5 and Row6]

The Data

For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s also provided in my GitHub. This data set has players’ names, teams, seasons and stats. We are able to create a subset based on any one or more of these variables.

The Code

I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function, and using the filter() function from the dplyr package. All of these are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, so let’s start with that. Its speed matters especially when you have to subset inside a loop or run an operation repeatedly.

dplyr

The filter() function requires the dplyr package to be loaded in your R environment, and loading dplyr masks the filter() function from the default stats package. You don’t need to worry about this, but R does tell you about it when you first install and load the package.

#install.packages('dplyr')
library(dplyr) #load the package

#from http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2015&month=0&season1=2010&ind=1&team=&rost=&age=&filter=&players=&page=2_30
setwd('***PATH***') 
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data

Aside from loading the package, you'll have to load the data in as well.

#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')

#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East

#Both of these find players in the NL East and have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign

#Finds players in the NL East or has more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)

#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)

The filter() function is rather simple to use. The examples above illustrate a few simple cases where you specify the data frame you want to use and create true/false expressions, which filter() uses to decide which rows to keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.

Built-in Functions

#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]

#method 3 -- subset()
data.sub.3 <- subset(data,subset = (Team=='Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than

data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #dual requirements using AND (&)
data.sub.7 <- subset(data, subset = (AVG > .300 & PA > 600)) #using subset()

data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #dual requirements using OR (|)

data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector

data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams

If you don't want to use the dplyr package, you can accomplish the same thing using the basic functionality of R. #method 1 uses a boolean vector to select rows for the subset. #method 2 uses the which() function, which returns the indexes of the TRUE values in a boolean vector. Both of these techniques use the original data frame and the row indexes to create a subset.

The subset() function works much like the filter() function, except the syntax is slightly different and you don't have to download a separate package.

Efficiency

While subset() works in a similar fashion to filter(), it doesn't perform the same way. Some data manipulation might only happen once or a few times throughout a project, but many projects require constant subsetting, possibly inside a loop. So while the gains might seem insignificant for one run, multiply that difference and it adds up quickly.

I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.

Time to Subset 500,000 Rows

Subset Method      Elapsed Time (sec)
boolean vector     0.87
which()            0.33
subset()           0.81
dplyr filter()     0.21

The dplyr filter() function was by far the quickest, which is why I prefer to use it.
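
If you want to check the timing on your own machine, here is a minimal sketch using R's built-in system.time() function (it assumes the same data frame and Team column from the examples above; on a small data set the differences will be tiny):

#rough timing sketch for each subsetting method
system.time( data[data$Team == 'Marlins',] )               #boolean vector
system.time( data[which(data$Team == 'Marlins'),] )        #which()
system.time( subset(data, subset = (Team == 'Marlins')) )  #subset()
system.time( filter(data, Team == 'Marlins') )             #dplyr filter()

The elapsed column of system.time()'s output corresponds to the times reported in the table above.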

The full code I used to write up this tutorial is available on my GitHub.

References:

Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

R Bootcamp — A Quick Introduction

R, a statistical programming language and environment, is becoming more popular as organizations, governments and businesses increase their use of data science. In an effort to help new users learn the basics quickly, I’ve assembled some of the most fundamental processes as a quick introduction to the R language.

This post assumes that you have already installed R and have it running correctly on your computer. I recommend getting RStudio to write and execute your code. It will make your life much easier.

Getting Started

First, R is an interactive programming environment, which means you are able to send commands to its interpreter to tell it what to do.

There are two basic methods to send commands to R. The first is by using the console, which works like an old-school command line. The second method, more typically used by R coders, is to write a script. An R script isn’t fancy; at its core it’s a text document that contains a collection of R commands. When the script is executed, it is treated like a collection of individual commands being fed one-by-one into the R interpreter. This differs from how compiled programming languages work.

[Diagram: How R Works]

Basics

Comments are probably the best place to start, especially because my code is chock-full of them. A comment is code that is fed to R, but it’s not executed and has no bearing on the function of your script or command. In R comments are lines prefaced with a #.

#comments start with #-signs

9-3 #basic math (this doesn't save this in a variable)

One of the most basic things you could use R for is a calculator. For instance, if we run the code 9-3, R will display 6 as the result. All of this is rather straightforward. The only operator you might not be familiar with, if you are new to coding, is the modulus operator, which yields the remainder when you divide the first number by the second. This gets used often when dealing with data. For example, you get a 0 for an even number and a 1 for an odd number if you take your variable and use the modulus operator with the number 2.

#basic operations
#yields numeric value
1+2 #addition
3-2 #subtraction
3*2 #multiplication
4/5 #division
3 %% 2 #modulus (remainder operator)
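
As a quick sketch of the even/odd trick described above:

#the modulus operator identifies even and odd numbers
10 %% 2 #returns 0, so 10 is even
7 %% 2  #returns 1, so 7 is odd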

Beyond the basic math and numeric operations, R has several fundamental data types. NULL and NA represent empty objects or missing data, but these two data types aren’t the same: NA will fill a position in a vector or data frame, while NULL will not. The details are best left for another entry.

#basic data structure

NULL     #empty value
NA       #missing value
9100     #numeric value
'abcdef' #string
TRUE     #boolean
T        #equivalent form
FALSE    #boolean
F        #equivalent form
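
One quick way to see the difference between NA and NULL is that NA holds its position in a vector while NULL is simply dropped:

#NA keeps a position in the vector; NULL disappears
length(c(1, NA, 3))   #returns 3
length(c(1, NULL, 3)) #returns 2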

Numeric values can have mathematical operations performed on them. Strings are essentially non-numeric values; you can’t add strings together or find the average of a string. In any type of data analysis, you’ll typically have some string data. It can be used to classify entries categorically, such as male/female or Mac/Windows/Linux. R will treat these as factors.

Finally, boolean values (TRUE or FALSE) are binary logical values. They work like the logic operations you might have learned in a math or logic class, with AND (&&) and OR (||) operators. R also has the vectorized forms & and |, which are what you’ll use in conditional statements and various data manipulation operations such as subsetting.

#logical operators
T && F #and
F || T #or
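
The single-character forms & and | work element-by-element on whole vectors, which is what subsetting relies on. A small sketch:

#vectorized logical operators
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE) #returns TRUE FALSE FALSE
c(TRUE, FALSE, TRUE) | c(TRUE, TRUE, FALSE) #returns TRUE TRUE TRUE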

Now that we’ve covered the basic operations and data types, let’s look at how to store them: variables. Assigning a value to a variable is rather easy. You can use a simple equals sign or the traditional R notation, an arrow.

x <- 1 #basic assignment
x = 1

#####Acceptable Variables
x <- 1
X <- 1
X1 <- 1
X.1 <- 1
X_1 <- 1
########################

####UNACCEPTABLE Variables
1X <- 1
X-1 <- 1  #CODE WILL NOT WORK
X,1 <- 1
#######################

Variables must begin with a letter, and they are case-sensitive. Periods are acceptable faux separators in variable names, but that doesn’t translate to other programming languages like Python or JavaScript, so it might factor into how you establish naming conventions.

I’ve mentioned vectors a few times already. They are an important data structure within R. A vector is an ordered list of data, typically thought of as numeric, though character (string) vectors are often used in R. The c() function creates a vector. It’s important that vectors contain the same type of data: boolean, numeric or character. If you mix types, R will coerce the values into a single type. You can assign your vectors to variables; in fact, you can store just about anything in R in a variable.

x.vector <- c() #c() combines values into a vector; empty here
x.vector <- c(1,2,3,4,5,6) #creates a vector (typically numeric)
x.vector <- c(T, 1) #mixed types are coerced, here to the numeric vector c(1, 1)
x.list <- list('A',12,'b') #creates a list (not used for numeric operations)

mean(c(1,3,2)) #vectors can be passed directly to functions

Lists are created with the list() command. They are used more for storage and organization than as a data structure for calculations. For example, you could store the mean, median and range for a set of data in a list, while a vector would hold the data used to calculate those summary stats. Lists are useful when you begin to write bigger programs and need to shuffle a lot of things around.
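
As a small sketch of that idea (the variable names here are arbitrary), a list can hold named summary statistics computed from a vector:

#store several summary stats of a vector in one list
x.data <- c(10, 11, 12, 12, 10, 11, 20, 9)
x.summary <- list(mean = mean(x.data),
                  median = median(x.data),
                  range = range(x.data))
x.summary$mean #access a list element by name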

The basic statistical functions are listed below. All of these require a vector to operate on.

#basic stats
x.vector <- c(10,11,12,12,10,11,20,9) #puts your data into a vector

mean(x.vector) #takes mean of vector
median(x.vector) #median of vector
max(x.vector) #maximum of vector
min(x.vector) #minimum of vector
range(x.vector) #yields a vector with a range

sd(x.vector) #standard deviation
var(x.vector) #variance

Handling Data

Above we discussed some of the building blocks of basic analysis in R. Beyond introductory statistics classes, R isn’t very useful unless you can import data. There are many ways to do this, since data exists in many different formats. A .csv file is one of the most basic, compatible ways data is stored and shared between different analytical tools.

Before loading this data file into R, it’s a good idea to set your working directory. This is where the data file is stored.

#load in data
setwd('**folder path**') #sets your working directory 
                                         #specific to each computer

Next you can use the read.csv() function to ingest a .csv file into R. On its own, this call won’t save the data in a variable; it just reads the file in as a data frame and shows it to you.

read.csv('data_bryant_kobe.csv') #reads the data into R
                                 #does not save it into a variable

data <- read.csv('data_bryant_kobe.csv') #reads the data and saves it
                                         #into a variable called 'data'

Data frames are the primary form of data structure you’ll encounter in R. Data frames are like tables in Excel or SQL in that they are rectangular and have a rigid schema. At its core, however, a data frame is a collection of equal-length vectors.

If you assign the data frame output of the read.csv() function to a variable, you can pass the data frame to different data manipulation or modeling functions. One of the most basic ways to manipulate the data is to access different values within the data frame. Below are several examples of how to access values, rows or columns in a data frame.

The basic concept is that data frames can be accessed by row and column number, [row, column], and an entire row or column can be retrieved by omitting the index for the dimension you aren’t trying to retrieve. You can also retrieve individual fields (variables) by using the $ sign with the variable name. This is the method I use most often; it requires knowing and using the names of the variables, which can make your code easier to read.

#accessing values
row <- 3
column <- 2
data$Age          #returns the Age column as a vector
data[row,column]  #individual value
data[row,]        #returns an entire row
data[data$Age <= 25,] #returns all rows where Age is 25 or under
data[,column]     #returns an entire column as a vector
data$Age[3]       #returns the third value of the Age column

By accessing rows with a logical argument, you can create a filtered subset of your data set.

#creating a quick subset
data.U25 <- data[data$Age < 25,]  #creates an under-25 set
data.O25 <- data[data$Age >= 25,] #creates a 25 and older set

The code above creates two new data frames which separate Kobe Bryant’s season stats into an under-25 data set and a 25-and-older data set.

Relationships Between Variables

Correlation is often used to summarize the linear relationship between two variables. Getting the correlation in R is simple: use the cor() function with two equal-length vectors. R uses the corresponding elements in each vector to compute a Pearson correlation coefficient.

#correlation between different variables within the subset
cor(data.U25$MP, data.U25$PTS)
cor(data.O25$MP, data.O25$PTS)

cor(data$MP, data$PTS) #correlation with two related vectors from data set
cor(data$MP, data$FTpct)

A simple linear model can be made by using the lm() function. The linear model function requires two things: a formula and a data frame. The formula uses a tilde (~) instead of an equals sign and represents the variables you would use in a standard ordinary least squares regression. The data parameter is the data frame which contains all the data.

#create a basic linear model
linear.model <- lm(PTS ~ MP, data=data.U25)
linear.model <- lm(PTS ~ MP + Age, data=data.U25)

The summary() function takes the linear model object and displays information about the coefficients of your linear model.

summary(linear.model)
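
If you want to pull parts of the model out programmatically instead of just printing the summary, base R provides accessors like these (shown as a sketch using the model fit above):

coef(linear.model)                 #vector of fitted coefficients
summary(linear.model)$coefficients #coefficient table as a matrix
summary(linear.model)$r.squared    #R-squared of the model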

NOTES: The data set used in this tutorial is from basketball-reference.com.

The full code I used in this tutorial can be found on my GitHub.

Collecting Twitter Data: Getting Started

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


The R code used in this post can be found on my GitHub.

After getting R, Python or whatever programming language you prefer, the next step is getting API keys from Twitter. This requires that you have a Twitter account and create an ‘application’ using the following steps.

Getting API Keys

  1. Sign into Twitter
  2. Go to https://apps.twitter.com/app/new and create a new application

    [Screenshot: twitter register app]

  3. Click on “Keys and Access Tokens” on the your application’s page

    [Screenshot: twitter access keys]

  4. Get and copy your Consumer Key, Consumer Secret Key, Access Token, and Secret Token

    [Screenshot: twitter oauth screen]

Those four complex strings of case-sensitive letters and numbers are your API keys. Keep them secret, because they are more powerful than your Twitter password. If you are wondering what the keys are for, they are really two pairs of keys consisting of secret and non-secret, and this is done for security purposes. The consumer key pair authorizes your program to use the Twitter API, and the access token essentially signs you in as your specific Twitter user account. This framework makes more sense in the context of third party Twitter developers like TweetDeck where the application is making API calls but it needs access to each user’s personal data to write tweets, access their timelines, etc.

Getting Started in R

If you don’t have a preference for a certain programming environment, I recommend that people with less programming experience start with R for tweet scraping since it is simpler to collect and parse the data without having to understand much programming. The Streaming API authentication I use in R is slightly more complicated than what I normally do with Python. If you feel comfortable with Python, I recommend using the tweepy package for Python. It’s more robust than R’s streamR but has a steeper learning curve.

First, like most R scripts, the libraries need to be installed and loaded. Hopefully you have already installed them; if not, the install.packages() commands are commented out for reference.

#install.packages("streamR")
#install.packages("ROAuth")
library(ROAuth)
library(streamR)

The first part of the actual code for a Twitter scraper uses the API keys obtained from Twitter’s development website. You insert your personal API keys where the **KEY** placeholders are in the code. This method of authentication in R only uses the CONSUMER KEY and CONSUMER SECRET KEY; it gets your ACCESS TOKEN from a PIN number obtained through a web address you open in your browser.

#create your OAuth credential
credential <- OAuthFactory$new(consumerKey='**CONSUMER KEY**',
                         consumerSecret='**CONSUMER SECRET KEY**',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='https://api.twitter.com/oauth/authorize')

#authentication process
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
credential$handshake(cainfo="cacert.pem")

After this is executed properly, R will give you output in your console that looks like the following:

[Screenshot: twitter handshake]

  1. Copy the https:// URL into a browser
  2. Log into Twitter if you haven't already
  3. Authorize the application
  4. Then you'll get the PIN number to copy into the R console and hit Enter

    [Screenshot: twitter pin]

Now that the authentication handshake is complete, the R program is able to use those credentials to make API calls. A basic call using the Streaming API is the filterStream() function in the streamR package. This will connect you to Twitter's stream for a designated amount of time and/or for a certain number of tweets collected.

#function to actually scrape Twitter
filterStream( file.name="tweets_test.json",
             track="twitter", tweets=1000, oauth=cred, timeout=10, lang='en' )

The track parameter tells Twitter what you want to 'search' for. It's technically not really a search, since you are filtering the Twitter stream rather than searching...technically. Twitter's dev site has a nice explanation of all the Streaming API's parameters. For example, the track parameter is not case sensitive, it treats hashtags and regular words the same, and it will find tweets with any of the words you specify, not just when all the words are present. The track parameter 'apple, twitter' will find tweets with 'apple', tweets with 'twitter', and tweets with both.
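
For example, here is a sketch of the same call tracking two keywords at once (the output file name is just a placeholder):

#collects tweets containing 'apple' OR 'twitter'
filterStream( file.name="tweets_apple_twitter.json",
             track="apple,twitter", tweets=1000, oauth=credential, timeout=10, lang='en' )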

The filterStream() function will stay open as long as you tell it to in the timeout parameter [in seconds], so don't set it too long if you want your data quickly. The data Twitter returns to you is a .json file, which uses JSON (JavaScript Object Notation), a text-based data format.

twitter json

The above is an excerpt from a tweet that's been formatted to be easier to read. Here's a larger annotated version of a tweet JSON file. These files are useful in some contexts of programming, but for basic use in R, Tableau, and Excel it's gibberish.

There are a few different ways to parse the data into something useful. The most basic [and easiest] is to use the parseTweets() function that is also in streamR.

#Parses the tweets
tweet_df <- parseTweets(tweets='tweets_test.json')

This is a pretty simple function that takes the JSON file that filterStream() produced, reads it, and creates a wide data frame. The data frame can be pretty daunting, since there is so much metadata available.

[Screenshot: twitter data frame]

You might notice some of the ?-mark characters. These are text encoding errors. This is one of the limitations of using R to parse the tweets, because the streamR package doesn't handle utf-8 characters well in its functions. This means that R can only read basic A-Z characters and can't translate emoji, foreign languages, and some punctuation. I'd recommend using something like MongoDB to store tweets, or creating your own parser, if you want to be able to use these features of the text.

Quick Analysis

This tutorial focuses on how to collect Twitter data and not the intricacies of analyzing it, but here are a few simple examples of how you can use the tweet data frame.

#using the Twitter data frame
tweet_df$created_at
tweet_df$text


plot(tweet_df$friends_count, tweet_df$followers_count) #plots scatterplot
cor(tweet_df$friends_count, tweet_df$followers_count) #returns the correlation coefficient

The different columns within the data frame can be called separately. Calling the created_at field gives you the tweet's time stamp, and the text field is the content of the tweet. Generally, there will be some correlation between the number of followers a person has [followers_count] and the number of accounts a person follows [friends_count]. When I ran my script I got a correlation of about 0.25. The scatter plot will be heavily impacted by the Justin Biebers of the world, who have millions of followers but follow only a few accounts themselves.

Conclusion

This is a quick-start tutorial for collecting Twitter data. There are plenty of resources to be found on Twitter's developer site and all over the internet. While this tutorial is useful for learning the basics of how the OAuth process works and how Twitter returns data, I recommend using tools like Python and MongoDB, which can give you greater flexibility for analysis. Collecting tweets is the foundation of using Twitter's API, but you can also get user objects, trends, or accomplish anything that you can in a Twitter client with the REST and Search APIs.

 


The R code used in this post can be found on my GitHub.

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV -- Errors | Part VI: Twitter JSON to CSV -- ASCII | Part VII: Twitter JSON to CSV -- UTF-8

Collecting Twitter Data: Introduction

Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


Collecting Twitter data is a great exercise in data science and can provide interesting insights into how people behave on the social media platform. Below is an overview of the steps to build a Twitter analysis from scratch. This tutorial will walk through the steps needed to get to the point where you can analyze Twitter data.

  1. Overview of what the Twitter API does
  2. Get R or Python
  3. Install Twitter packages
  4. Get Developer API Key from Twitter
  5. Write Code to Collect Tweets
  6. Parse the Raw Tweet Data [JSON files]
  7. Analyze the Tweet Data

Introduction

Before diving into the technical aspects of how to use the Twitter API [Application Program Interface] to collect tweets and other data from their site, I want to give a general overview of what the Twitter API is and isn’t capable of doing. First, data collection on Twitter doesn’t necessarily produce a representative sample to make inferences about the general population, and people tend to be rather emotional and negative on Twitter. That said, Twitter is a treasure trove of data and there are plenty of interesting things you can discover. You can pull various data structures from Twitter: tweets, user profiles, user friends and followers, what’s trending, etc. There are three methods to get this data: the REST API, the Search API, and the Streaming API. The Search API is retrospective and allows you to search old tweets [with severe limitations], the REST API allows you to collect user profiles, friends, and followers, and the Streaming API collects tweets in real time as they happen. [This is best for data science.] This means that most Twitter analysis has to be planned beforehand, or at least tweets have to be collected prior to the timeframe you want to analyze. There are some ways around this if Twitter grants you permission, but the run-of-the-mill Twitter account will find the Streaming API much more useful.

The Twitter API requires a few steps:

  1. Authenticate with OAuth
  2. Make API call
  3. Receive JSON file back
  4. Interpret JSON file

The authentication requires that you get an API key from the Twitter developers site. This just requires that you have a Twitter account. The four keys the site gives you are used as parameters in the programs. The OAuth authentication gives your program permission to make API calls.

The API call is an HTTP call with the parameters incorporated into the URL, like
https://stream.twitter.com/1.1/statuses/filter.json?track=twitter
This Streaming API call asks to connect to Twitter and tracks the keyword ‘twitter’. Using prebuilt software packages in R or Python will hide this step from you, the programmer, but these calls are happening behind the scenes.

JSON files are the data structure that Twitter returns. They are rather comprehensive in the amount of data they contain, but hard to use until they are parsed. Some of the software packages have built-in parsers, or you can use a NoSQL database like MongoDB to store and query your tweets.

Get R or Python

While there are many different programming languages that can interface with the API, I prefer to use either Python or R for any Twitter data scraping. R is easier to use out of the box if you are just getting started with coding, and Python offers more flexibility. If you don’t have either of these, I’d recommend installing one and then learning to do some basic things before tackling Twitter data.

Download R: http://cran.rstudio.com/
R Studio: http://www.rstudio.com/ [optional]

Download Python: https://www.python.org/downloads/

Install Twitter Packages

The easiest way to access the API is to install a software package that has prebuilt libraries that make coding projects much simpler. Since this tutorial will primarily be focused on using the Streaming API, I recommend installing the streamR package for R or tweepy for Python. If you have a Mac, Python is already installed and you can run it from the terminal. I recommend getting a program to help you organize your projects, like PyCharm, but that is beyond the scope of this tutorial.

R
[in the R environment]

install.packages('streamR')
install.packages('ROAuth')
library(ROAuth)
library(streamR)

Python
[in the terminal, assuming you have pip installed]

$ pip install tweepy

 


Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8