Tag Archives: twitter

Collecting Twitter Data: Converting Twitter JSON to CSV — Possible Errors

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


Before diving into the problem of how to save tweets in a CSV file, let me say there are a 1,000 ways to do this and about 100 complications that arise depending which way you want to accomplish this. I will devote two posts which covers using both ASCII and UTF-8 encoding because many tweets contain characters beyond the normal Latin alphabet.

Let’s look at some of the issues with writing CSV from tweets.

  • Tweets are JSON and contain a massive amount of metadata. More than you probably want.
  • The JSON isn’t a flat structure; it has levels. [Direct contrast to a CSV file.]
  • The JSON files don’t all have the same elements.
  • There are many foreign languages and emoji used in tweets.
  • Tweets contain many different grammatical marks such as commas and quotation marks.

These issues aren’t incredibly daunting, but those unfamiliar will encounter frustrating errors.

Tweets are JSON and contain a massive amount of metadata. More than you probably want.

I’m always in favor of keeping as much data as possible, but tweets contain a massive amount of different metadata attributes. All of these are designed for the Twitter platform and for the associated client apps. Some items like the profile_background_image_url_https really don’t have much of an impact on any analysis. Choosing which attributes you want to keep will be critical before embarking on a process to parse the data into a CSV. There’s a lot to choose from: timestamp data, user data, retweet data, geocoding data, hashtag data and link data.

The JSON isn’t a flat structure; it has levels.

This issue is an extension of the previous issue, since tweet JSON data isn’t organized into a flat, spreadsheet-like structure. The created_at and text elements are located on the top level and are easy to access, but something as simple as the tweeter’s name and screen_name are located in the user nested object. Like everything else mentioned in this post, this isn’t a huge issue, but the structure of a tweet JSON file has to be considered when coding your program.

The JSON files don’t all have the same elements.

The final problem with JSON files is the fields aren’t necessarily present in every object. Many geo related attributes do not appear unless geotagging is enabled. This means if you write your program to look for geotagging data, it can throw a key error if those keys don’t exist in that specific tweet. To avoid this you have to account for the exception or use a method that already does that. I use the get() method to avoid these key errors in the CSV parser.

There are many foreign languages and emoji used in tweets.

I quickly addressed this issue in a few posts, and it’s one of the reasons why I like to store tweets in MongoDB. Tweets contain a lot of of [read: important] unicode characters. These are typically many foreign language characters and the ubiquitous emojis. This is important because the presence of UTF-8 unicode characters can and will cause encoding errors when parser a file or loading a file into Excel. Excel (at least the version on my computer) can’t handle these characters. Other tools like the built-in CSV writer in Python can’t handle unicode out of box. Being able to deal with these characters is critical to compatibility with other software as long as the integrity of your data.

This issue forces me to write two different parsers for examples. I have a CSV parser that outputs ASCII that imports well into Excel along with a UTF-8 version which allows you to natively save the characters and emojis in a human-readable CSV file.

Tweets contain many different grammatical marks such as commas and quotation marks.

This is a problem that I had when I first started working with Twitter data and tried to write my own parser — characters that are part of your text content sometimes get confused with the delimiters. In this case I’m talking about quotation marks (") and commas (,). Comma sseparate the values for each ‘cell’, hence the acronym CSV. If you tweet you’ve probably tweeted using one of these characters. I’ve stripped them out of the text previously to solve this problem, but that’s not a great solution. The way Excel handles this is to enclose any elements that contain commas with quotation marks then to use double quotation marks to signify an actual quotation mark and not enclosed text. This will be demonstrated in the UTF-8 parser since I made that from scratch.

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Twitter Sentiment — Penguins VS. Rangers Gm 4

Game 4 of the Penguins-Rangers series featured a brief overtime period that overshadowed the rest of the game as far as tweet volume goes. Rangers fans were more negative at the beginning of the game after the Penguins scored their first goal. Twitter volume picked up for both teams during the overtime period and Rangers fans’ tweets spiked when they won the game and continued throughout the night.

Gm4 Sentiment

Penguins Rangers 2015 Game 3 Sentiment

Twitter Sentiment — Penguins vs. Rangers Gm 3

Unfortunately, my Twitter scraper wasn’t looking for the most viral story of the Penguin’s loss to the Rangers. [I’m not linking to it, but it involves a columnist and the Penguins’ GM.] I was able to get general sentiment over the course of the game. There isn’t too much to analyze. There are more Rangers tweets overall, most likely due to increased interest and a larger market, and I’ve annotated where there was scoring during the game. I’ll probably have a few more updates through out the playoffs.

Penguins Rangers 2015 Game 3 Sentiment

Using New, Diverse Emojis for Analysis in Python

I haven’t been updating this site often since I’ve started to perform a similar job over at FanGraphs. All non-baseball stat work that I do will continued to be housed here.

Over the past week, Apple has implemented new emojis with a focus on diversity in their iOS 8.3 and the OS X 10.10.3 update. I’ve written quite a bit about the underpinnings of emojis and how to get Python to run text analytics on them. The new emojis provide another opportunity to gain insights on how people interact, feel, or use them. Like always, I prefer to use Python for any web scraping or data processing, and emoji processing is no exception. I already wrote a basic primer on how to get Python to find emoji in your text. If you combine the tutorials I have for tweet scraping, MongoDB and emoji analysis, you have yourself a really nice suite of data analysis.

Modifier Patch

These new emojis are a product of the Unicode Consortium’s plan for how to incorporate racial diversity into the previously all-white human emoji line up. (And yes, there’s a consortium for emoji planning.) The method used to produce new emojis isn’t quite as simple as just making a new character/emoji. Instead, they decided to include a modifier patch at the end of human emojis to indicate skin color. As a end-user, this won’t affect you if you have all the software updates and your device can render the new emojis. However, if you don’t have the updates, you’ll get something that looks like this:

Emoji Patch Error
That box at the end of the emoji is the modifier patch. Essentially what is happening here is that there is a default emoji (in this case it’s the old man) and the modifier patch (the box). For older systems it doesn’t display, because the old system doesn’t know how to interpret this new data. This method actually allows the emojis to be backwards compatible, since it still conveys at least part of the meaning of the emoji. If you have the new updates, you will see the top row of emoji.

Emoji Plus Modifier Patches

Using a little manipulation (copying and pasting) using my newly updated iPhone we can figure out this is what really is going on for emojis. There are five skin color patches available to be added to each emoji, which is demonstrated on the bottom row of emoji. Now you might notice there are a lot of yellow emoji. Yellow emojis (Simpsons) are now the default. This is so that no single real skin tone is the default. The yellow emojis have no modifier patch attached to them, so if you simply upgrade your phone and computer and then go back and look at old texts, all the emojis with people in them are now yellow.

New Families

The new emoji update also includes new families. These are also a little different, since they are essentially combinations of other emoji. The original family emoji is one single emoji, but the new families with multiple children and various combinations of children and partners contain multiple emojis. The graphic below demonstrates this.

Emoji New Familes

The man, woman, girl and boy emoji are combined to form that specific family emoji. I’ve seen criticisms about the families not being multiracial. I’d have to believe the limitation here is a technical one, since I don’t believe the Unicode consortium has an effective method to apply modifier patches and combine multiple emojis at once. That would result in a unmanageable number of glyphs in the font set to represent the characters. (625 different combinations for just one given family of 4, and there are many different families with different gender iterations.)

New Analysis

So now that we have the background on the how the new emojis work, we can update how we’ve searched and analyzed them. I have updated my emoji .csv file, so that anyone can download that and run a basic search within your text corpus. I have also updated my github to have this file as well for the socialmediaparse library I built.

The modifier patches are searchable, so now you can search for certain swatches (or lack there of). Below I have written out the unicode escape output for the default (yellow) man emoji and its light-skinned variation. The emoji with a human skin color has that extra piece of code at the end.

#unicode escape
\U0001f468 #unmodified man
\U0001f468\U0001f3fb  #light-skinned man

Here are all the modifier patches as unicode escape.

Emoji Modifier Patches

#modifier patch unicode escape
\U0001f3fb  #skin tone 1 (lightest)
\U0001f3fc  #skin tone 2
\U0001f3fd  #skin tone 3
\U0001f3fe  #skin tone 4
\U0001f3ff  #skin tone 5 (darkest)

The easiest way to search for these is to use the following snippet of code:

#searches for any emoji with skin tone 5
unicode_object = u'Some text with emoji in it as a unicode object not str!'

if '\U0001f3ff' in unicode_object.encode('unicode_escape'):
   #do something

You can throw that snippet into a for loop for a Pandas data frame or a MongoDB cursor. I’m planning on updating my socialmediaparse library with patch searching, and I’ll update this post when I do that.


Finally, there’s Spock!

Emoji Spock

The unicode escape for Spock is:


Add your modifier patches as needed.

Collecting Twitter Data: Storing Tweets in MongoDB

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB [current page] | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python and how to store these tweets first as JSON files then having R parse them into a .csv file. The .csv file works well, but tweets don’t always make good flat .csv files, since not every tweet contains the same fields or the same structure. Some of the data is well nested into the JSON object. It is possible to write a parser that has a field for each possible subfield, but this might take a while to write and will create a rather large .csv file or SQL database.


Fortunately, NoSQL databases like MongoDB exist and it greatly simplifies tweet storage, search, and recall eliminating the need of a tweet parser. Installation and setup of MongoDB and the pymongo library is beyond the scope of this tutorial, but I can quickly explain what MongoDB does. It is a document-based database that uses documents instead of tuples in tables to store data. These documents look like just like JSON objects using key-value pairs, but they are called BSON [since it’s stored as binary]. From a programming prospective, they have similar properties as both JS objects and Python dictionaries.

Since JSON and BSON are so similar, storing a tweet in a MongoDB database is as easy as putting the entire content of the tweet’s JSON string into an insert statement. Recalling or searching the tweets is rather simple as well; it does require an OOP mindset over the traditional SQL command structure.

[I’m writing this from the perspective from ad hoc small-scale research. There might be performance issues that make other storage options much more desirable. Knowing the specific metadata from a tweet you want to keep will make any analysis faster or require less store space. MongoDB allows you to store all the information the API returns to you.]

Storing Tweets in MongoDB

I am going to assume that you have MongoDB running on your local computer for all the code examples.

Storing tweets is rather simple if you already have the Python stream listener built from Part III of the tutorial, since there are only a few changes to be made to the code. The first change will be calling the libraries: pymongo and json. The json library is available by default in Python, but you’ll have to install pymongo using pip install pymongo if you have the pip installer. The bulk of the changes will be in the listener child class.

from pymongo import MongoClient
import json

class listener(StreamListener):

def __init__(self, start_time, time_limit=60):

self.time = start_time
self.limit = time_limit

def on_data(self, data):

while (time.time() - self.time) < self.limit:


client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweet = json.loads(data)


return True

except BaseException, e:
print 'failed ondata,', str(e)


def on_error(self, status):
print statuses

The major change in the code includes:

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweet = json.loads(data)


MongoClient creates the MongoClient instance which will be used to interface with the database. The client[‘twitter_db’] call designates the database that is going to be used, and the db[‘twitter_collection’] call selects the collection where the documents will be stored. The json.loads() call converts the string returned from the Twitter API into a json object in Python. Finally, the collection.insert() call inserts the json object into the MongoDB database. From this rather simple change to the Python stream listener all the tweets can be saved into a MongoDB database.

Recalling Tweets from MongoDB

Recalling the tweets from MongoDB database is not too difficult if you understand the basics of Python for loops and dictionaries. The function to retrieve any documents from the database is collection.find(). You are able to specify what you want to find or leave it blank and get all the documents (tweets) returned. For this example, I’ll first just leave it blank to get all the tweets.

After calling the .find() method, Python will return a MongoDB cursor, which can be iterated through in a for loop. The for loop runs the loop for each object in the iterator. If you wanted to print the text from every tweet you would write:

tweets_iterator = collection.find()
for tweet in tweets_iterator:
  print tweet['text']

tweet contains one document [or in this case a tweet JSON object] in the sequence that tweet_iterator produces. The loop will change this document to another one until the for loop runs through every document in the iterator.

Since tweets in JSON format contain many subdocuments, it’s important to know what data you are looking and where to find it. The following code snippet is an example of different fields available to examine.

text = tweet['text']
user_screen_name = tweet['user']['screen_name']
user_name = tweet['user']['name']
retweet_count = tweet['retweeted_status']['retweet_count']
retweeted_name = tweet['retweeted_status']['user']['name']
retweeted_screen_name = tweet['retweeted_status']['user']['screen_name']

The last [‘field’] represents a property and any [‘fields’] before the last represents subdocuments. The text field is on the top level on any tweet document, this is the the text that is written in the tweet. There is a user subdocument with a lot of information in there. The code above pulls the screen_name and the user’s given name and content from retweeted tweets. If I were to retweet Barack Obama, you’d be to pull this data about Obama’s tweet from my retweet. I’ve used this to analyze retweet behavior.

Since MongoDB is a database, you are able to query it; you just can’t use SQL. The collection.find() is the method used for querying. Until now I’ve only used empty parameters in the .find() method to return the entire collection. Querying is done in a style similar to JSON.

To find an exact match to a string:

collection.find({'text' : 'This will return tweets with only this exact string.'}) 

The previous command will find only that exact string in a top level attribute. This isn’t helpful in a practical sense since exact searches aren’t very useful, but it’s the most basic find command. Having MongoDB pull twitter by a given user’s screen_name has some uses, but it is in a a subdocument so it requires some new syntax "document.subdocument":

tweets = collection.find({'user.screen_name' : 'exactScreenName'})

The above code will search the screen_name property in the user subdocument. Since I’ve shown exact searches, you can search for particular words using a regular expressions operator. This will search the text property to see if it can finds ‘word’ anywhere and return the entire tweet.

tweets = collection.find({'text': { '$regex' : 'word'}})

Since in the MongoDB the tweets might not have the same fields or properties, sometimes just searching to see if a property exists is useful. For example, if you wanted to find all the native retweets in your collection the following snippet is will return any tweet with a retweeted_status property. [The retweeted_status is typically a subdocument containing all the information about the retweeted tweet.]

collection.find({"retweeted_status" : { "$exists" : "true"}})


While using MongoDB has a learning curve, it can be rather useful to store data like tweets. It eliminates the need to write a parser since you effectively parse the data when you retrieve it. Knowing the subdocument structure of the documents in your database and thinking like a programming rather than a SQL database user will help you successful execute analyses in Python using MongoDB for Twitter data.

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB [current page] | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter Data: Using a Python Stream Listener

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

I use the term stream listener [2 words] to refer to program build with this code and StreamListener [1 word] to refer to the specific class from the tweepy package. The two are related but not the same. The StreamListener class makes the stream listener program what it is, but the program entails more than the class.

While using R and its streamR package to scrape Twitter data works well, Python allows more customization than R does. It also has a steeper learning curve, because the coding is more invovled. Before using Python to scrape Twitter data, a software package like tweepy must be installed. If you have the pip installer installed on your system, the installation procedure is rather easy and executed in the Terminal.

Call Tweepy Library


$ pip install tweepy

After the software package is installed, you can start writing a stream listener script. First, the libraries have to be imported.

import time
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import os

The three tweepy class imports will be used to construct the stream listener, the time library will be used create a time-out feature for the script, and the os library will be used to set your working directory.

Set Variables Values

Before diving into constructing the stream listener, let’s set some variables. These variables will be used in the stream listener by being feed into the tweepy objects. I code them as variables instead of directly into the functions so that they can be easily changed.

ckey = '**CONSUMER KEY**'
consumer_secret = '**CONSUMER SECRET KEY***'
access_token_key = '**ACCESS TOKEN**'
access_token_secret = '**ACCESS TOKEN SECRET**'

start_time = time.time() #grabs the system time
keyword_list = ['twitter'] #track list

Using and Modifying the Tweepy Classes

I believe that tweet scraping with Python has a steeper learner curve than with R, because Python is dependent on combining instances of different classes. If you don’t understand the basics of object-oriented programming, it might be difficult to comprehend what the code is accomplishing or how to manipulate the code. The code I show in this post does the following:

  • Creates an OAuthHandler instance to handle OAuth credentials
  • Creates a listener instance with a start time and time limit parameters passed to it
  • Creates an StreamListener instance with the OAuthHandler instance and the listener instance

Before these instances are created, we have to “modify” the StreamListener class by creating a child class to output the data into a .csv file.

#Listener Class Override
class listener(StreamListener):

	def __init__(self, start_time, time_limit=60):

		self.time = start_time
		self.limit = time_limit
		self.tweet_data = []

	def on_data(self, data):

		saveFile = io.open('raw_tweets.json', 'a', encoding='utf-8')

		while (time.time() - self.time) < self.limit:



				return True

			except BaseException, e:
				print 'failed ondata,', str(e)

		saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')

	def on_error(self, status):

		print statuses

This is the most complicated section of this code. The code rewrite the actions taken when the StreamListener instance receives data [the tweet JSON].

saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')

This block of code opens an output file, writes the opening square bracket, writes the JSON data as text separated by commas, then inserts a closing square bracket, and closes the document. This is the standard JSON format with each Twitter object acting as an element in a JavaScript array. If you bring this into R or Python built-in parser and the json library can properly handle it.

This section can be modified to or modify the JSON file. For example you can place other properties/fields like a UNIX time stamp or a random variable into the JSON. You can also modified the output file or eliminate the need for a .csv file and insert the tweet directly into a MongoDB database. As it is written, this will produce a file that can be parsed by Python's json class.
After the child class is created we can create the instances and start the stream listener.

auth = OAuthHandler(ckey, consumer_secret) #OAuth object
auth.set_access_token(access_token_key, access_token_secret)

twitterStream = Stream(auth, listener(start_time, time_limit=20)) #initialize Stream object with a time out limit
twitterStream.filter(track=keyword_list, languages=['en'])  #call the filter method to run the Stream Object

Here the OAuthHandler uses your API keys [consumer key & consumer secret key] to create the auth object. The access token, which is unique to an individual user [not an application], is set in the following line. Unlike the filterStream() function in R, this will take all four of your credentials from the Twitter Dev site. The modified StreamListener class simply called listener is used to create an listener instance. This contains the information about what to do with the data once it comes back from the Twitter API call. Both the listener and auth instances are used to create the Stream instance which combines the authentication credentials with the instructions on what to do with the retrieved data. The Stream class also contains a method for filtering the Twitter Stream. This method works just like the R filterStream() function taking similar parameters, because the parameters are passed to the Stream API call.

Python vs R

At this stage in the tutorial, I would recommend parsing this data using the parser in R from the last section of the Twitter tutorial or creating your own. Since it's easier to customize the StreamListener methods in Python, I prefer to use it over other R. Generally, I think Python works better for collecting and processing data, but isn't as easy to use for most statistical analysis. Since tweet scraping would fall into the data collection category, I like Python. It becomes easier to access databases and to manipulate the data when you are already working in Python.

11-10-2015 -- I've updated the StreamListener to output properly formatted JSON. The old script which works well with R's tweetParse is still available on my GitHub.



Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV -- Errors | Part VI: Twitter JSON to CSV -- ASCII | Part VII: Twitter JSON to CSV -- UTF-8

Collecting Twitter Data: Getting Started

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

The R code used in this post can be found on my git-hub.

After getting R, Python or whatever programming language you prefer, the next steps will require API keys from Twitter. This requires you have to have a Twitter account and to create an ‘application’ using the following steps.

Getting API Keys

  1. Sign into Twitter
  2. Go to https://apps.twitter.com/app/new and create a new application

    twitter register app

  3. Click on “Keys and Access Tokens” on the your application’s page

    twitter access keys

  4. Get and copy your Consumer Key, Consumer Secret Key, Access Token, and Secret Token

    twitter oauth screen

Those four complex strings of case-sensitive letters and numbers are your API keys. Keep them secret, because they are more powerful than your Twitter password. If you are wondering what the keys are for, they are really two pairs of keys consisting of secret and non-secret, and this is done for security purposes. The consumer key pair authorizes your program to use the Twitter API, and the access token essentially signs you in as your specific Twitter user account. This framework makes more sense in the context of third party Twitter developers like TweetDeck where the application is making API calls but it needs access to each user’s personal data to write tweets, access their timelines, etc.

Getting Started in R

If you don’t have a preference for a certain programming environment, I recommend that people with less programming experience start with R for tweet scraping since it is simpler to collect and parse the data without having to understand much programming. The Streaming API authentication I use in R is slightly more complicated than what I normally do with Python. If you feel comfortable with Python, I recommend using the tweepy package for Python. It’s more robust than R’s streamR but has a steeper learning curve.

First, like most R scripts, the libraries need to be installed and called. Hopefully you already installed if not the install.packages commands are commented out for reference.


The first part of the actually code for a Twitter scraper will use the API keys obtained from Twitter’s development website. You insert your personal API keys where the **KEY** is in the code. For this method of authentication in R it only uses the CONSUMER KEY and CONSUMER SECRET KEY and it gets your ACCESS TOKEN from a PIN number from using an web address you open in your browser.

#create your OAuth credential
credential <- OAuthFactory$new(consumerKey='**CONSUMER KEY**',
                         consumerSecret='**CONSUMER SECRETY KEY**',

#authentication process
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

After this is executed properly, R will give you output in your console that looks like the following:

twitter handshake

  1. Copy the https:// URL into a browser
  2. Log into Twitter if you haven't already
  3. Authorize the application
  4. Then you'll get the PIN number to copy into the R console and hit Enter

    twitter pin

Now that the authentication handshake was completed, the R program is able to use those credentials to make API calls. A basic call using the Streaming API is the filterStream() function in the streamR package. This will connected you to Twitter's stream for a designated amount of time and/or for a certain number of tweets collected.

#function to actually scrape Twitter
filterStream( file.name="tweets_test.json",
             track="twitter", tweets=1000, oauth=cred, timeout=10, lang='en' )

The track parameter tells Twitter want you want to 'search' for. It's technically not really a search since you are filtering the Twitter stream and not searching...technically. Twitter's dev site has a nice explanation of all the Streaming APIs parameters. For example, the track parameter is not case sensitive, it will treat hashtags and regular words the same, and it will find tweets with any of the words you specify, not just when all the words are present. The track parameter 'apple, twitter' will find tweets with 'apple', tweets with 'twitter', and tweets with both.

The filterStream() function will stay open as long as you tell it to in the timeout parameter [in seconds], so don't set it too long if you want your data quickly. The data Twitter returns to you is a .json file, which is a JavaScript data file.

twitter json

The above is an excerpt from a tweet that's been formatted to be easier to read. Here's a larger annotated version of a tweet JSON file. Thes are useful in some contexts of programming, but for basic use in R, Tableau, and Excel it's gibberish.

There are a few different ways to parse the data into something useful. The most basic [and easiest] is to use the parseTweets() function that is also in streamR.

#Parses the tweets
tweet_df <- parseTweets(tweets='tweets_test.json')

This is a pretty simple function that takes the JSON file that filterStream() produced, reads it, and creates a wide data frame. The data frame can be pretty daunting, since there is so much metadata available.

twitter data frame

You might notice some of the ?-mark characters. These are text encoding errors. This is one of the limitations of using R to parse the tweets, because the streamR package doesn't handle utf-8 characters well in its functions. This means that R can only read basic A-Z characters and can't translate emoji, foreign languages, and some punctuation. I'd recommend using something like MongoDB to store tweets or create your own parser if you want be able to use these features of the text.

Quick Analysis

This tutorial focuses on how to collect Twitter data and not the intricacies of analyzing it, but here are a few simple examples of how you can use the tweet data frame.

#using the Twitter data frame

plot(tweet_df$friends_count, tweet_df$followers_count) #plots scatterplot
cor(tweet_df$friends_count, tweet_df$followers_count) #returns the correlation coefficient

The different columns within the data frame can be called separately. Calling the created_at field gives you the tweet's time stamp, and the text field is the content of the tweet. Generally, there will be some correlation between the number of followers a person has [followers_count] and the number of accounts a person follows [friends_count]. When I ran my script I got a correlation of about 0.25. The scatter plot will be heavily impacted by the Justin Biebers of the world where they have millions of followers but follow only a few themselves.


This is quick-start tutorial to collecting Twitter data. There are plenty of resources to be found on Twitter's developer site and all over the internet. While this tutorial is useful to learn the basics of how the OAuth process works and how Twitter returns data, I recommend using a tool like Python and MongoDB which can give you greater flexibility for analysis. Collecting tweets is the foundation of using Twitter's API, but you can also get user objects, trends, or accomplish anything that you can in a Twitter client with the REST and Search APIs.


The R code used in this post can be found on my git-hub.

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV -- Errors | Part VI: Twitter JSON to CSV -- ASCII | Part VII: Twitter JSON to CSV -- UTF-8

Collecting Twitter Data: Introduction

Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter data is a great exercise in data science and can provide interesting insights in how people behave on the social media platform. Below is an overview of the steps to build a Twitter analysis from scratch. This tutorial will go through several steps to arrive at being able to analyze Twitter data.

  1. Overview of Twitter API does
  2. Get R or Python
  3. Install Twitter packages
  4. Get Developer API Key from Twitter
  5. Write Code to Collect Tweets
  6. Parse the Raw Tweet Data [JSON files]
  7. Analyze the Tweet Data


Before diving into the technical aspects of how to use the Twitter API [Application Program Interface] to collect tweets and other data from their site, I want to give a general overview of what the Twitter API is and isn’t capable of doing. First, data collection on Twitter doesn’t necessarily produce a representative sample to make inferences about the general population. And people tend to be rather emotional and negative on Twitter. That said, Twitter is a treasure trove of data and there are plenty of interesting things you can discover. You can pull various data structures from Twitter: tweets, user profiles, user friends and followers, what’s trending, etc. There are three methods to get this data: the REST API, the Search API, and the Streaming API. The Search API is retrospective and allows you search old tweets [with severe limitations], the REST API allows you to collect user profiles, friends, and followers, and the Streaming API collects tweets in real time as they happen. [This is best for data science.] This means that most Twitter analysis has to be planned beforehand or at least tweets have to be collected prior to the timeframe you want to analyze. There are some ways around this if Twitter grants you permission, but the run-of-the-mill Twitter account will find the Streaming API much more useful.

The Twitter API requires a few steps:

  1. Authenticate with OAuth
  2. Make API call
  3. Receive JSON file back
  4. Interpret JSON file

The authentication requires that you get an API key from the Twitter developers site. This just requires that you have a Twitter account. The four keys the site gives you are used as parameters in the programs. The OAuth authentication gives your program permission to make API calls.

The API call is an http call that has the parameters incorporated into the URL like
This Streaming API call is asking to connect to Twitter and tracks the keyword ‘twitter’. Using prebuilt software packages in R or Python will hide this step from you the programmer, but these calls are happening behind the scenes.

JSON files are the data structure that Twitter returns. These are rather comprehensive with the amount of data, but hard to use without them being parsed first. Some of the software packages have built-in parsers or you can use a NoSQL database like MongoDB to store and query your tweets.

Get R or Python

While there are many different programing languages to interface with the API, I prefer to use either Python or R for any Twitter data scraping. R is easier to use out of the box if you are just getting started with coding, and Python offers more flexibility. If you don’t have either of these, I’d recommend installing then learning to do some basic things before tackling Twitter data.

Download R: http://cran.rstudio.com/
R Studio: http://www.rstudio.com/ [optional]

Download Python: https://www.python.org/downloads/

Install Twitter Packages

The easiest way to access the API is to install a software package that has prebuilt libraries that makes coding projects much simpler. Since this tutorial will primarily be focused on using the Streaming API, I recommend installing the streamR package for R or tweepy for Python. If you have a Mac, Python is already installed and you can run it from the terminal. I recommend getting a program to help you organize your projects like PyCharm, but that is beyond the scope of this tutorial.

[in the R environment]


[in the terminal, assuming you have pip installed]

$ pip install tweepy


Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

SOTU Title

2015 State of the Union Address — Text Analytics

I collected tweets about the 2015 State of the Union address [SOTU] in real time from 10am to 2am using the keywords [obama, state of the union, sotu, sotusocial, ernst]. The tweets were analyzed for sentiment, content, emoji, hashtags, and retweets. The graph below shows Twitter activity over the course of the night. The volume of tweets and the sentiment of reactions were the highest during the latter half of the speech when Obama made the remark “I should know; I won both of them” referring to the 2008 & 2012 elections he won.

2015 State of the Union Tweet Volume

Throughout the day before the speech, there weren’t many tweets and they tended to be neutral. These tweets typically contained links to news articles previewing the SOTU address or reminders about the speech. Both of these types of tweets are factual but bland when compared to the commentary and emotional reaction that occurred during the SOTU address itself. The huge spike in Twitter traffic didn’t happen until the President walked onto the House floor which was just before 9:10 PM. When the speech started, the sentiment/number of positive words per tweet increased to about 0.3 positive words/tweet suggesting that the SOTU address was well received. [at least to the people who bothered to tweet]

Around 7:45-8:00 PM the largest negative sentiment occurred during the day. I’ve looked back through the tweets from that time and couldn’t find anything definitive that happened to cause that. My conjecture would be that is when news coverage started [and strongly opinionated] people started watching the news and began to tweet.

The highest sentiment/number of positive words came during the 15-minute polling window where the President quipped about winning two elections. Unfortunately, that sound bite didn’t make a great hashtag, so it didn’t show up else where in my analysis. However, there are many news articles and discussion about that off-the-cuff remark, and it will probably be the most memorable moment from the SOTU address.


Once again [Emoji Popularity Link], the crying-my-eyes-out emoji proved to be the most used emoji in SOTU tweets, typically being used in tweets which aren’t serious and generally sarcastic. Not surprisingly, the clapping emoji was the second most popular emoji mimicking the copious ovations the SOTU address receives. Other notable popular emoji are the fire, US flag, the zzzz emoji and skull. The US flag reflects the patriotic themes of the entire night. The fire is generally reflecting praise for Obama’s speech. The skull and zzzz are commenting on spectators in the crowd.

2015 State of the Union Twitter Emoji Use

Two topic-specific emoji counts were interesting. For the most part in all of my tweet collections, the crying-my-eyes-out emoji is exponentially more popular than any other emoji. Understandably, the set of tweets that contained language associated with terrorism had more handclaps, flags, and angry emoji reflecting the serious nature of the subject.

2015 State of the Union Subject Emojis

Then tweets corresponding to the GOP response had a preponderance of pig-related emojis due to Joni Ernst’s campaign ad.


The following hashtag globe graphic is rather large. Please enlarge to see the most popular hashtags associated with the SOTU address. I removed the #SOTU hashtag, because it was use extensively and overshadowed the rest. For those wondering what #TCOT means, it stands for Top Conservatives on Twitter. The #P2 hashtag is its progressive counterpart. [Source]

2015 State of the Union Hashtag Globe


The White House staff won the retweeting war by being the two most retweeted [RT] accounts during the speech last night. This graph represents the total summed RTs over all the tweets they made. Since the White House and the Barack Obama account tweeted constantly during the speech, they accumulated the most retweets. Michael Clifford has the most retweeted single tweet stating he is just about met the President. If you are wondering who Michael Clifford is, you aren’t alone, because I had to look him up. He’s the 19-yo guitarist from 5 Seconds of Summer. The tweet is from August, however, people did retweet it during the day. [I was measuring the max retweet count on the tweets.] Rand Paul was the most retweeted non-President politician, and the Huffington Post had the most for a news outlet.

2015 State of the Union Popular Retweets

The Speech

Obama released his speech online before starting the State of the Union address. I used this for a quick word-count analysis, and it doesn’t contain the off-the-cuff remarks just the script, which he did stick to with few exceptions. The first graph uses the count of single words with ‘new’ being by far the most used word.

2015 State of the Union Address Word Frequency

This graph shows the most used two-word combinations [also known as bi-grams].

2015 State of the Union Address Bigram Frequency

Further Notes

I was hoping this would be the perfect opportunity to test out my sentiment analysis process, and the evaluation results were rather moderate achieving about 50% accuracy on three classes [negative, neutral, positive]. In this case 50% is an improvement over a 33% random guess, but not very encouraging overall. For the sentiment portion in the tweet volume graph, I used the bag-of-words approach that I have used many times before.

A more interesting and informative classifier might look try to classify the tweet into sarcastic/trolling, positive, and angry genres. I had problems classifying some tweets as positive and negative, because there were many news links, which are neutral, and sarcastic comments, which look positive but feel negative. For politics, classifying the political position might be more useful, since a liberal could be mocking Boehner one minute, then praising Obama the next. Having two tweets classified as liberal rather than a negative tweet and a positive tweet is much more informative when aggregating.

2015 Steelers-Ravens Playoffs Hashtag Use

2015 Steelers-Ravens Playoff Twitter Infographics

The Steelers-Ravens playoff game gave me a chance to test out a new analytics server and some of the tools I’ve been working on to make Twitter analysis easy using ad hoc Python scripts. So here goes:

There were a lot of Steelers or Ravens colored emojis, black and gold hearts or buttons and the purple devils. Though for some reasons the ‘crying my eyes out’ emoji is by far the most popular in this collection of tweets. The yellow line represents how many unique tweets there were featuring that emoji. For example, 14 of the same of emoji in one tweet would count for 14 in the blue bar, while it would count for just 1 in the context of the yellow line.

2015 Steelers-Ravens Playoffs Emoji Use

Here’s the hashtag use. The #steelers exceeded the #ravens. This looks cool, but it doesn’t tell you much.

2015 Steelers-Ravens Playoffs Hashtag Use

Here’s a bar chart that’s a lot easier to read if you want the information.

2015 Steelers-Ravens Playoffs Hashtag Bar Chart