All posts by Sean Dolinar

Using New, Diverse Emojis for Analysis in Python

I haven’t been updating this site often since I’ve started to perform a similar job over at FanGraphs. All non-baseball stat work that I do will continue to be housed here.

Over the past week, Apple has implemented new emojis with a focus on diversity in their iOS 8.3 and OS X 10.10.3 updates. I’ve written quite a bit about the underpinnings of emojis and how to get Python to run text analytics on them. The new emojis provide another opportunity to gain insights on how people interact, feel, or use them. Like always, I prefer to use Python for any web scraping or data processing, and emoji processing is no exception. I already wrote a basic primer on how to get Python to find emoji in your text. If you combine the tutorials I have for tweet scraping, MongoDB, and emoji analysis, you have yourself a really nice suite of data analysis tools.

Modifier Patch

These new emojis are a product of the Unicode Consortium’s plan for incorporating racial diversity into the previously all-white human emoji lineup. (And yes, there’s a consortium for emoji planning.) The method used to produce the new emojis isn’t quite as simple as just making a new character/emoji. Instead, they decided to include a modifier patch at the end of human emojis to indicate skin color. As an end-user, this won’t affect you if you have all the software updates and your device can render the new emojis. However, if you don’t have the updates, you’ll get something that looks like this:

Emoji Patch Error
That box at the end of the emoji is the modifier patch. Essentially, there is a default emoji (in this case the old man) followed by a modifier patch (the box). On older systems the patch doesn’t display, because the old software doesn’t know how to interpret this new data. This method actually keeps the emojis backwards compatible, since the default emoji still conveys at least part of the meaning. If you have the new updates, you will see the top row of emoji.

Emoji Plus Modifier Patches

Using a little manipulation (copying and pasting) on my newly updated iPhone, we can figure out what is really going on with these emojis. There are five skin color patches available to be added to each human emoji, as demonstrated on the bottom row of emoji above. You might notice there are a lot of yellow emoji. Yellow (Simpsons-style) emojis are now the default, so that no single real skin tone is the default. The yellow emojis have no modifier patch attached to them, so if you simply upgrade your phone and computer and then go back and look at old texts, all the emojis with people in them will now be yellow.

New Families

The new emoji update also includes new families. These are also a little different, since they are essentially combinations of other emoji. The original family emoji is one single emoji, but the new families with multiple children and various combinations of children and partners contain multiple emojis. The graphic below demonstrates this.

Emoji New Families

The man, woman, girl, and boy emoji are combined to form that specific family emoji. I’ve seen criticisms about the families not being multiracial. I’d have to believe the limitation here is a technical one, since I don’t believe the Unicode Consortium has an effective method to apply modifier patches across combined emoji all at once. That would result in an unmanageable number of glyphs in the font set to represent the characters. (625 different combinations for just one given family of four, and there are many different families with different gender iterations.)
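We can poke at this combination with a few lines of Python. As far as I can tell from copying the new emoji around, the multi-person families are the individual person emoji glued together with a zero-width joiner (U+200D); the sketch below builds one that way and shows that the member emoji remain searchable inside the combined family.

man = u'\U0001f468'
woman = u'\U0001f469'
girl = u'\U0001f467'
boy = u'\U0001f466'
zwj = u'\u200d'  #zero-width joiner that glues the members into one glyph

family = zwj.join([man, woman, girl, boy])  #man-woman-girl-boy family

print family.encode('unicode_escape')  #the individual code points are all still there
print man in family                    #True -- member emoji remain searchable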

New Analysis

So now that we have the background on how the new emojis work, we can update how we search for and analyze them. I have updated my emoji .csv file, so anyone can download it and run a basic search within a text corpus. I have also updated my GitHub to include this file in the socialmediaparse library I built.

The modifier patches are searchable, so now you can search for certain swatches (or the lack thereof). Below I have written out the unicode escape output for the default (yellow) man emoji and its light-skinned variation. The emoji with a human skin color has that extra piece of code at the end.

#unicode escape
\U0001f468 #unmodified man
\U0001f468\U0001f3fb  #light-skinned man

Here are all the modifier patches as unicode escape.

Emoji Modifier Patches

#modifier patch unicode escape
\U0001f3fb  #skin tone 1 (lightest)
\U0001f3fc  #skin tone 2
\U0001f3fd  #skin tone 3
\U0001f3fe  #skin tone 4
\U0001f3ff  #skin tone 5 (darkest)
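To see how these patches pair with a base emoji, here is a quick sketch that appends each of the five patches to the default man emoji and prints the escaped result.

base_man = u'\U0001f468'  #unmodified (yellow) man
patches = [u'\U0001f3fb', u'\U0001f3fc', u'\U0001f3fd', u'\U0001f3fe', u'\U0001f3ff']

for patch in patches:
    print (base_man + patch).encode('unicode_escape')  #e.g. \U0001f468\U0001f3fb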

The easiest way to search for these is to use the following snippet of code:

#searches for any emoji with skin tone 5
unicode_object = u'Some text with emoji in it as a unicode object not str!'

if '\U0001f3ff' in unicode_object.encode('unicode_escape'):
    pass  #do something here with the matching text

You can throw that snippet into a for loop for a Pandas data frame or a MongoDB cursor. I’m planning on updating my socialmediaparse library with patch searching, and I’ll update this post when I do that.
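As a rough sketch of that loop, the snippet below tallies patch use over a list of tweet texts; the tweets list is just a stand-in for whatever iterable of unicode strings you have [a Pandas column, the text fields from a MongoDB cursor, etc.].

from collections import Counter

#stand-in sample data; swap in your own iterable of unicode tweet text
tweets = [u'nice \U0001f44f\U0001f3ff \U0001f44f\U0001f3ff', u'hello \U0001f468\U0001f3fb']

patches = {u'\U0001f3fb': 'tone 1', u'\U0001f3fc': 'tone 2', u'\U0001f3fd': 'tone 3',
           u'\U0001f3fe': 'tone 4', u'\U0001f3ff': 'tone 5'}

patch_counts = Counter()
for text in tweets:
    for patch, label in patches.items():
        patch_counts[label] += text.count(patch)

print patch_counts  #tone 5: 2, tone 1: 1, everything else 0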

Spock

Finally, there’s Spock!

Emoji Spock

The unicode escape for Spock is:

\U0001f596

Add your modifier patches as needed.
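For instance, appending the tone 3 patch from the table above gives a medium-skin-toned Spock:

spock_tone3 = u'\U0001f596\U0001f3fd'  #Spock + skin tone 3
print spock_tone3.encode('unicode_escape')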

Collecting Twitter Data: Storing Tweets in MongoDB

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB [current page] | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python and how to store those tweets, first as JSON files and then by having R parse them into a .csv file. The .csv file works well, but tweets don’t always make good flat .csv files, since not every tweet contains the same fields or the same structure. Some of the data is deeply nested in the JSON object. It is possible to write a parser that has a field for each possible subfield, but this might take a while to write and will create a rather large .csv file or SQL database.

MongoDB

Fortunately, NoSQL databases like MongoDB exist, and they greatly simplify tweet storage, search, and recall, eliminating the need for a tweet parser. Installation and setup of MongoDB and the pymongo library are beyond the scope of this tutorial, but I can quickly explain what MongoDB does. It is a document-based database that uses documents instead of tuples in tables to store data. These documents look just like JSON objects, using key-value pairs, but they are called BSON [since they’re stored as binary]. From a programming perspective, they have similar properties to both JS objects and Python dictionaries.

Since JSON and BSON are so similar, storing a tweet in a MongoDB database is as easy as putting the entire content of the tweet’s JSON string into an insert statement. Recalling or searching the tweets is rather simple as well, though it does require an OOP mindset rather than the traditional SQL command structure.

[I’m writing this from the perspective of ad hoc, small-scale research. There might be performance issues that make other storage options much more desirable. Knowing the specific metadata from a tweet you want to keep will make any analysis faster or require less storage space. MongoDB allows you to store all the information the API returns to you.]

Storing Tweets in MongoDB

I am going to assume that you have MongoDB running on your local computer for all the code examples.

Storing tweets is rather simple if you already have the Python stream listener built from Part III of the tutorial, since there are only a few changes to be made to the code. The first change will be calling the libraries: pymongo and json. The json library is available by default in Python, but you’ll have to install pymongo using pip install pymongo if you have the pip installer. The bulk of the changes will be in the listener child class.

from pymongo import MongoClient
import json

#time, Stream, OAuthHandler, and StreamListener are already imported
#in the Part III stream listener script this class modifies


class listener(StreamListener):

    def __init__(self, start_time, time_limit=60):

        self.time = start_time
        self.limit = time_limit

    def on_data(self, data):

        while (time.time() - self.time) < self.limit:

            try:

                #connect to the local MongoDB instance
                client = MongoClient('localhost', 27017)
                db = client['twitter_db']
                collection = db['twitter_collection']

                #convert the raw JSON string into a Python dictionary and store it
                tweet = json.loads(data)
                collection.insert(tweet)

                return True

            except BaseException, e:
                print 'failed ondata,', str(e)
                time.sleep(5)
                pass

        exit()

    def on_error(self, status):
        print status

The major change in the code is this block:

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweet = json.loads(data)

collection.insert(tweet)

MongoClient() creates the client instance which will be used to interface with the database. The client['twitter_db'] call designates the database that is going to be used, and the db['twitter_collection'] call selects the collection where the documents will be stored. The json.loads() call converts the string returned from the Twitter API into a Python dictionary. Finally, the collection.insert() call inserts that object into the MongoDB database. With this rather simple change to the Python stream listener, all the tweets can be saved into a MongoDB database.
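Once the listener has been running for a bit, a quick check [a sketch of my own, not part of the original script] confirms the tweets are landing in the collection. On newer versions of pymongo, count_documents({}) and insert_one() replace the count() and insert() calls shown here.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['twitter_db']['twitter_collection']

print collection.count()     #number of tweet documents stored so far
print collection.find_one()  #peek at one stored tweet document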

Recalling Tweets from MongoDB

Recalling the tweets from the MongoDB database is not too difficult if you understand the basics of Python for loops and dictionaries. The method to retrieve documents from the database is collection.find(). You are able to specify what you want to find, or leave it blank and get all the documents (tweets) returned. For this example, I’ll first leave it blank to get all the tweets.

After calling the .find() method, Python will return a MongoDB cursor, which can be iterated through in a for loop, running once for each document in the cursor. If you wanted to print the text from every tweet you would write:

tweets_iterator = collection.find()
for tweet in tweets_iterator:
  print tweet['text']

tweet contains one document [in this case a tweet JSON object] from the sequence that tweets_iterator produces. The loop moves on to the next document with each pass until it has run through every document in the iterator.

Since tweets in JSON format contain many subdocuments, it’s important to know what data you are looking for and where to find it. The following code snippet is an example of the different fields available to examine.

text = tweet['text']
user_screen_name = tweet['user']['screen_name']
user_name = tweet['user']['name']
retweet_count = tweet['retweeted_status']['retweet_count']
retweeted_name = tweet['retweeted_status']['user']['name']
retweeted_screen_name = tweet['retweeted_status']['user']['screen_name']

The last [‘field’] represents a property, and any [‘fields’] before the last represent subdocuments. The text field is on the top level of any tweet document; this is the text that is written in the tweet. There is a user subdocument with a lot of information in it. The code above pulls the screen_name and the user’s given name, plus content from retweeted tweets. If I were to retweet Barack Obama, you’d be able to pull this data about Obama’s tweet from my retweet. I’ve used this to analyze retweet behavior.
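One practical note of my own: retweeted_status only exists on retweets, so indexing it directly on an original tweet raises a KeyError. A small sketch using dict.get() inside the same loop avoids that.

#inside the for loop over the cursor: guard against tweets that aren't retweets
retweeted = tweet.get('retweeted_status')
if retweeted is not None:
    retweet_count = retweeted['retweet_count']
    retweeted_name = retweeted['user']['name']
    retweeted_screen_name = retweeted['user']['screen_name']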

Since MongoDB is a database, you are able to query it; you just can’t use SQL. collection.find() is the method used for querying. Until now I’ve only used empty parameters in the .find() method to return the entire collection. Querying is done in a style similar to JSON.

To find an exact match to a string:

collection.find({'text' : 'This will return tweets with only this exact string.'}) 

The previous command will find only that exact string in a top-level attribute. This isn’t helpful in a practical sense, since exact searches aren’t very useful, but it’s the most basic find command. Having MongoDB pull tweets by a given user’s screen_name has some uses, but screen_name is in a subdocument, so it requires the "document.subdocument" dot syntax:

tweets = collection.find({'user.screen_name' : 'exactScreenName'})

The above code will search the screen_name property in the user subdocument. Beyond exact searches, you can search for particular words using the regular expression operator. This will search the text property to see if it can find ‘word’ anywhere and return the entire tweet.

tweets = collection.find({'text': { '$regex' : 'word'}})

Since the tweets in MongoDB might not all have the same fields or properties, sometimes just checking whether a property exists is useful. For example, if you wanted to find all the native retweets in your collection, the following snippet will return any tweet with a retweeted_status property. [The retweeted_status is typically a subdocument containing all the information about the retweeted tweet.]

collection.find({"retweeted_status" : { "$exists" : "true"}})

Conclusion

While using MongoDB has a learning curve, it can be rather useful for storing data like tweets. It eliminates the need to write a parser, since you effectively parse the data when you retrieve it. Knowing the subdocument structure of the documents in your database, and thinking like a programmer rather than a SQL database user, will help you successfully execute analyses in Python using MongoDB for Twitter data.


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB [current page] | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter Data: Using a Python Stream Listener

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


I use the term stream listener [2 words] to refer to a program built with this code and StreamListener [1 word] to refer to the specific class from the tweepy package. The two are related but not the same. The StreamListener class makes the stream listener program what it is, but the program entails more than just the class.

While using R and its streamR package to scrape Twitter data works well, Python allows more customization than R does. It also has a steeper learning curve, because the coding is more involved. Before using Python to scrape Twitter data, a software package like tweepy must be installed. If you have the pip installer on your system, the installation procedure is rather easy and is executed in the Terminal.

Call Tweepy Library

Terminal:

$ pip install tweepy

After the software package is installed, you can start writing a stream listener script. First, the libraries have to be imported.

import time
import io
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import os

The three tweepy class imports will be used to construct the stream listener, the time library will be used to create a time-out feature for the script, the io library will be used to write the UTF-8 JSON output file, and the os library will be used to set your working directory.

Set Variables Values

Before diving into constructing the stream listener, let’s set some variables. These variables will be used in the stream listener by being fed into the tweepy objects. I code them as variables instead of hard-coding them into the functions so that they can be easily changed.

ckey = '**CONSUMER KEY**'
consumer_secret = '**CONSUMER SECRET KEY**'
access_token_key = '**ACCESS TOKEN**'
access_token_secret = '**ACCESS TOKEN SECRET**'


start_time = time.time() #grabs the system time
keyword_list = ['twitter'] #track list

Using and Modifying the Tweepy Classes

I believe that tweet scraping with Python has a steeper learner curve than with R, because Python is dependent on combining instances of different classes. If you don’t understand the basics of object-oriented programming, it might be difficult to comprehend what the code is accomplishing or how to manipulate the code. The code I show in this post does the following:

  • Creates an OAuthHandler instance to handle OAuth credentials
  • Creates a listener instance with start time and time limit parameters passed to it
  • Creates a Stream instance from the OAuthHandler instance and the listener instance

Before these instances are created, we have to “modify” the StreamListener class by creating a child class that outputs the data into a .json file.

#Listener Class Override
class listener(StreamListener):

	def __init__(self, start_time, time_limit=60):

		self.time = start_time
		self.limit = time_limit
		self.tweet_data = []

	def on_data(self, data):

		#tweets are buffered in self.tweet_data and written out once the time limit passes

		while (time.time() - self.time) < self.limit:

			try:

				self.tweet_data.append(data)

				return True


			except BaseException, e:
				print 'failed ondata,', str(e)
				time.sleep(5)
				pass

		saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
		saveFile.write(u'[\n')
		saveFile.write(','.join(self.tweet_data))
		saveFile.write(u'\n]')
		saveFile.close()
		exit()

	def on_error(self, status):

		print status

This is the most complicated section of this code. The code rewrites the actions taken when the StreamListener instance receives data [the tweet JSON].

saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(u'[\n')
saveFile.write(','.join(self.tweet_data))
saveFile.write(u'\n]')
saveFile.close()

This block of code opens an output file, writes the opening square bracket, writes the JSON data as text separated by commas, then writes a closing square bracket and closes the document. This is standard JSON format, with each Twitter object acting as an element in a JavaScript array. If you bring this file into R, or into Python, the built-in json library can properly parse it.

This section can be modified to alter the JSON output. For example, you can place other properties/fields like a UNIX time stamp or a random variable into the JSON. You can also change the output file, or eliminate the intermediate .json file entirely and insert the tweets directly into a MongoDB database. As written, this will produce a file that can be parsed by Python's json module.
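As a concrete illustration of adding a field [my own sketch, using a made-up collected_at key and assuming import json is added alongside the imports above], the single self.tweet_data.append(data) line inside on_data could be swapped for something like this, which stamps each tweet with the UNIX time it was collected:

#data is the raw JSON string handed to on_data by tweepy
tweet = json.loads(data)                    #parse it into a dictionary
tweet['collected_at'] = time.time()         #hypothetical extra field: UNIX collection time
self.tweet_data.append(json.dumps(tweet))   #re-serialize and buffer it as before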
After the child class is created, we can create the instances and start the stream listener.

auth = OAuthHandler(ckey, consumer_secret) #OAuth object
auth.set_access_token(access_token_key, access_token_secret)


twitterStream = Stream(auth, listener(start_time, time_limit=20)) #initialize Stream object with a time out limit
twitterStream.filter(track=keyword_list, languages=['en'])  #call the filter method to run the Stream Object

Here the OAuthHandler uses your API keys [consumer key & consumer secret key] to create the auth object. The access token, which is unique to an individual user [not an application], is set in the following line. Unlike the filterStream() function in R, this takes all four of your credentials from the Twitter Dev site. The modified StreamListener child class, simply called listener, is used to create a listener instance, which contains the information about what to do with the data once it comes back from the Twitter API call. Both the listener and auth instances are used to create the Stream instance, which combines the authentication credentials with the instructions on what to do with the retrieved data. The Stream class also contains a method for filtering the Twitter stream; it works just like the R filterStream() function and takes similar parameters, because the parameters are passed on to the Streaming API call.

Python vs R

At this stage in the tutorial, I would recommend parsing this data using the parser in R from the last section of the Twitter tutorial, or creating your own. Since it's easier to customize the StreamListener methods in Python, I prefer it over R. Generally, I think Python works better for collecting and processing data, but it isn't as easy to use for most statistical analysis. Since tweet scraping falls into the data collection category, I like Python for it. It also becomes easier to access databases and to manipulate the data when you are already working in Python.

11-10-2015 -- I've updated the StreamListener to output properly formatted JSON. The old script, which works well with R's parseTweets, is still available on my GitHub.

 


 

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV -- Errors | Part VI: Twitter JSON to CSV -- ASCII | Part VII: Twitter JSON to CSV -- UTF-8

Collecting Twitter Data: Getting Started

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


The R code used in this post can be found on my GitHub.

After getting R, Python, or whatever programming language you prefer, the next steps will require API keys from Twitter. This requires you to have a Twitter account and to create an ‘application’ using the following steps.

Getting API Keys

  1. Sign into Twitter
  2. Go to https://apps.twitter.com/app/new and create a new application

    twitter register app

  3. Click on “Keys and Access Tokens” on your application’s page

    twitter access keys

  4. Get and copy your Consumer Key, Consumer Secret Key, Access Token, and Secret Token

    twitter oauth screen

Those four complex strings of case-sensitive letters and numbers are your API keys. Keep them secret, because they are more powerful than your Twitter password. If you are wondering why there are four, they are really two pairs of keys, each consisting of a secret and a non-secret key, and this is done for security purposes. The consumer key pair authorizes your program to use the Twitter API, and the access token essentially signs you in as your specific Twitter user account. This framework makes more sense in the context of third-party Twitter developers like TweetDeck, where the application is making API calls but needs access to each user’s personal data to write tweets, access their timelines, etc.

Getting Started in R

If you don’t have a preference for a certain programming environment, I recommend that people with less programming experience start with R for tweet scraping since it is simpler to collect and parse the data without having to understand much programming. The Streaming API authentication I use in R is slightly more complicated than what I normally do with Python. If you feel comfortable with Python, I recommend using the tweepy package for Python. It’s more robust than R’s streamR but has a steeper learning curve.

First, like most R scripts, the libraries need to be installed and called. Hopefully you have already installed them; if not, the install.packages commands are commented out for reference.

#install.packages("streamR")
#install.packages("ROAuth")
library(ROAuth)
library(streamR)

The first part of the actual code for a Twitter scraper will use the API keys obtained from Twitter’s development website. You insert your personal API keys where the **KEY** placeholders are in the code. This method of authentication in R only uses the CONSUMER KEY and CONSUMER SECRET KEY; it gets your ACCESS TOKEN from a PIN number obtained through a web address you open in your browser.

#create your OAuth credential
credential <- OAuthFactory$new(consumerKey='**CONSUMER KEY**',
                         consumerSecret='**CONSUMER SECRET KEY**',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='https://api.twitter.com/oauth/authorize')

#authentication process
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
credential$handshake(cainfo="cacert.pem")

After this is executed properly, R will give you output in your console that looks like the following:

twitter handshake

  1. Copy the https:// URL into a browser
  2. Log into Twitter if you haven't already
  3. Authorize the application
  4. Then you'll get the PIN number to copy into the R console and hit Enter

    twitter pin

Now that the authentication handshake has been completed, the R program is able to use those credentials to make API calls. A basic call using the Streaming API is the filterStream() function in the streamR package. This will connect you to Twitter's stream for a designated amount of time and/or for a certain number of tweets collected.

#function to actually scrape Twitter
filterStream( file.name="tweets_test.json",
             track="twitter", tweets=1000, oauth=cred, timeout=10, lang='en' )

The track parameter tells Twitter what you want to 'search' for. It's technically not really a search, since you are filtering the Twitter stream rather than searching it. Twitter's dev site has a nice explanation of all the Streaming API's parameters. For example, the track parameter is not case sensitive, it treats hashtags and regular words the same, and it finds tweets with any of the words you specify, not just when all the words are present. The track parameter 'apple, twitter' will find tweets with 'apple', tweets with 'twitter', and tweets with both.

The filterStream() function will stay open as long as you tell it to in the timeout parameter [in seconds], so don't set it too long if you want your data quickly. The data Twitter returns to you is a .json file, which is a JavaScript data file.

twitter json

The above is an excerpt from a tweet that's been formatted to be easier to read. Here's a larger annotated version of a tweet JSON file. These are useful in some programming contexts, but for basic use in R, Tableau, and Excel it's gibberish.

There are a few different ways to parse the data into something useful. The most basic [and easiest] is to use the parseTweets() function that is also in streamR.

#Parses the tweets
tweet_df <- parseTweets(tweets='tweets_test.json')

This is a pretty simple function that takes the JSON file that filterStream() produced, reads it, and creates a wide data frame. The data frame can be pretty daunting, since there is so much metadata available.

twitter data frame

You might notice some of the ?-mark characters. These are text encoding errors. This is one of the limitations of using R to parse the tweets, because the streamR package doesn't handle UTF-8 characters well in its functions. This means that R can only read basic A-Z characters and can't translate emoji, foreign languages, and some punctuation. I'd recommend using something like MongoDB to store tweets, or creating your own parser, if you want to be able to use these features of the text.

Quick Analysis

This tutorial focuses on how to collect Twitter data and not the intricacies of analyzing it, but here are a few simple examples of how you can use the tweet data frame.

#using the Twitter data frame
tweet_df$created_at
tweet_df$text


plot(tweet_df$friends_count, tweet_df$followers_count) #plots scatterplot
cor(tweet_df$friends_count, tweet_df$followers_count) #returns the correlation coefficient

The different columns within the data frame can be called separately. Calling the created_at field gives you the tweet's time stamp, and the text field is the content of the tweet. Generally, there will be some correlation between the number of followers a person has [followers_count] and the number of accounts a person follows [friends_count]. When I ran my script I got a correlation of about 0.25. The scatter plot will be heavily impacted by the Justin Biebers of the world, who have millions of followers but follow only a few accounts themselves.

Conclusion

This is a quick-start tutorial for collecting Twitter data. There are plenty of resources to be found on Twitter's developer site and all over the internet. While this tutorial is useful for learning the basics of how the OAuth process works and how Twitter returns data, I recommend using a tool like Python and MongoDB, which can give you greater flexibility for analysis. Collecting tweets is the foundation of using Twitter's API, but you can also get user objects, trends, or accomplish anything that you can in a Twitter client with the REST and Search APIs.

 


The R code used in this post can be found on my GitHub.

Part I: Introduction | Part II: Getting Started [current page] | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV -- Errors | Part VI: Twitter JSON to CSV -- ASCII | Part VII: Twitter JSON to CSV -- UTF-8

Collecting Twitter Data: Introduction

Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


Collecting Twitter data is a great exercise in data science and can provide interesting insights into how people behave on the social media platform. Below is an overview of the steps this tutorial will go through to build a Twitter analysis from scratch.

  1. Overview of what the Twitter API does
  2. Get R or Python
  3. Install Twitter packages
  4. Get Developer API Key from Twitter
  5. Write Code to Collect Tweets
  6. Parse the Raw Tweet Data [JSON files]
  7. Analyze the Tweet Data

Introduction

Before diving into the technical aspects of how to use the Twitter API [Application Program Interface] to collect tweets and other data from their site, I want to give a general overview of what the Twitter API is and isn’t capable of doing. First, data collection on Twitter doesn’t necessarily produce a representative sample to make inferences about the general population, and people tend to be rather emotional and negative on Twitter. That said, Twitter is a treasure trove of data and there are plenty of interesting things you can discover. You can pull various data structures from Twitter: tweets, user profiles, user friends and followers, what’s trending, etc. There are three methods to get this data: the REST API, the Search API, and the Streaming API. The Search API is retrospective and allows you to search old tweets [with severe limitations], the REST API allows you to collect user profiles, friends, and followers, and the Streaming API collects tweets in real time as they happen. [This is best for data science.] This means that most Twitter analysis has to be planned beforehand, or at least tweets have to be collected prior to the timeframe you want to analyze. There are some ways around this if Twitter grants you permission, but the run-of-the-mill Twitter account will find the Streaming API much more useful.

The Twitter API requires a few steps:

  1. Authenticate with OAuth
  2. Make API call
  3. Receive JSON file back
  4. Interpret JSON file

The authentication requires that you get an API key from the Twitter developers site. This just requires that you have a Twitter account. The four keys the site gives you are used as parameters in the programs. The OAuth authentication gives your program permission to make API calls.

The API call is an http call that has the parameters incorporated into the URL like
https://stream.twitter.com/1.1/statuses/filter.json?track=twitter
This Streaming API call is asking to connect to Twitter and track the keyword ‘twitter’. Using prebuilt software packages in R or Python will hide this step from you, the programmer, but these calls are happening behind the scenes.

JSON files are the data structure that Twitter returns. These are rather comprehensive with the amount of data, but hard to use without them being parsed first. Some of the software packages have built-in parsers or you can use a NoSQL database like MongoDB to store and query your tweets.

Get R or Python

While there are many different programming languages that can interface with the API, I prefer to use either Python or R for any Twitter data scraping. R is easier to use out of the box if you are just getting started with coding, and Python offers more flexibility. If you don’t have either of these, I’d recommend installing one and then learning to do some basic things before tackling Twitter data.

Download R: http://cran.rstudio.com/
R Studio: http://www.rstudio.com/ [optional]

Download Python: https://www.python.org/downloads/

Install Twitter Packages

The easiest way to access the API is to install a software package that has prebuilt libraries that make coding projects much simpler. Since this tutorial will primarily be focused on using the Streaming API, I recommend installing the streamR package for R or tweepy for Python. If you have a Mac, Python is already installed and you can run it from the terminal. I recommend getting a program to help you organize your projects, like PyCharm, but that is beyond the scope of this tutorial.

R
[in the R environment]

install.packages('streamR')
install.packages('ROAuth')
library(ROAuth)
library(streamR)

Python
[in the terminal, assuming you have pip installed]

$ pip install tweepy

 


Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

SOTU Title

2015 State of the Union Address — Text Analytics

I collected tweets about the 2015 State of the Union address [SOTU] in real time from 10am to 2am using the keywords [obama, state of the union, sotu, sotusocial, ernst]. The tweets were analyzed for sentiment, content, emoji, hashtags, and retweets. The graph below shows Twitter activity over the course of the night. The volume of tweets and the sentiment of reactions were the highest during the latter half of the speech when Obama made the remark “I should know; I won both of them” referring to the 2008 & 2012 elections he won.

2015 State of the Union Tweet Volume

Throughout the day before the speech, there weren’t many tweets and they tended to be neutral. These tweets typically contained links to news articles previewing the SOTU address or reminders about the speech. Both of these types of tweets are factual but bland when compared to the commentary and emotional reaction that occurred during the SOTU address itself. The huge spike in Twitter traffic didn’t happen until the President walked onto the House floor which was just before 9:10 PM. When the speech started, the sentiment/number of positive words per tweet increased to about 0.3 positive words/tweet suggesting that the SOTU address was well received. [at least to the people who bothered to tweet]

The largest negative sentiment of the day occurred around 7:45-8:00 PM. I’ve looked back through the tweets from that time and couldn’t find anything definitive that caused it. My conjecture would be that this is when news coverage started and strongly opinionated people began watching the news and tweeting.

The highest sentiment/number of positive words came during the 15-minute polling window when the President quipped about winning two elections. Unfortunately, that sound bite didn’t make a great hashtag, so it didn’t show up elsewhere in my analysis. However, there are many news articles and much discussion about that off-the-cuff remark, and it will probably be the most memorable moment from the SOTU address.

Emoji

Once again [Emoji Popularity Link], the crying-my-eyes-out emoji proved to be the most used emoji in SOTU tweets, typically appearing in tweets that aren’t serious and are generally sarcastic. Not surprisingly, the clapping emoji was the second most popular emoji, mimicking the copious ovations the SOTU address receives. Other notable popular emoji are the fire, the US flag, the zzzz emoji, and the skull. The US flag reflects the patriotic themes of the entire night. The fire generally reflects praise for Obama’s speech. The skull and zzzz are commenting on spectators in the crowd.

2015 State of the Union Twitter Emoji Use

Two topic-specific emoji counts were interesting. For the most part in all of my tweet collections, the crying-my-eyes-out emoji is exponentially more popular than any other emoji. Understandably, the set of tweets that contained language associated with terrorism had more handclaps, flags, and angry emoji reflecting the serious nature of the subject.

2015 State of the Union Subject Emojis

Then the tweets corresponding to the GOP response had a preponderance of pig-related emojis, due to Joni Ernst’s campaign ad.

#Hashtags

The following hashtag globe graphic is rather large; please enlarge it to see the most popular hashtags associated with the SOTU address. I removed the #SOTU hashtag, because it was used extensively and overshadowed the rest. For those wondering, #TCOT stands for Top Conservatives on Twitter. The #P2 hashtag is its progressive counterpart. [Source]

2015 State of the Union Hashtag Globe

RTs

The White House staff won the retweeting war, with the two most retweeted [RT] accounts during the speech last night. This graph represents the total summed RTs over all the tweets each account made. Since the White House and the Barack Obama account tweeted constantly during the speech, they accumulated the most retweets. Michael Clifford had the most retweeted single tweet, stating he had just about met the President. If you are wondering who Michael Clifford is, you aren’t alone, because I had to look him up. He’s the 19-year-old guitarist from 5 Seconds of Summer. The tweet is from August; however, people did retweet it during the day. [I was measuring the max retweet count on the tweets.] Rand Paul was the most retweeted non-President politician, and the Huffington Post had the most for a news outlet.

2015 State of the Union Popular Retweets

The Speech

Obama released his speech online before starting the State of the Union address. I used this for a quick word-count analysis. It doesn’t contain the off-the-cuff remarks, just the script, which he did stick to with few exceptions. The first graph uses the count of single words, with ‘new’ being by far the most used word.

2015 State of the Union Address Word Frequency

This graph shows the most used two-word combinations [also known as bi-grams].

2015 State of the Union Address Bigram Frequency

Further Notes

I was hoping this would be the perfect opportunity to test out my sentiment analysis process, but the evaluation results were rather moderate, achieving about 50% accuracy on three classes [negative, neutral, positive]. In this case 50% is an improvement over a 33% random guess, but it isn’t very encouraging overall. For the sentiment portion in the tweet volume graph, I used the bag-of-words approach that I have used many times before.

A more interesting and informative classifier might try to classify the tweet into sarcastic/trolling, positive, and angry genres. I had problems classifying some tweets as positive or negative, because there were many news links, which are neutral, and sarcastic comments, which look positive but feel negative. For politics, classifying the political position might be more useful, since a liberal could be mocking Boehner one minute, then praising Obama the next. Having two tweets classified as liberal, rather than as one negative tweet and one positive tweet, is much more informative when aggregating.

MLB — Pace of Play [Working Post]

This post is a work in progress. The data concerning the pace of play is rather messy, and this project is rather large compared to what I normally tackle. For that reason I’m going to start this post and update it as a ‘working post’. Please feel free to contact me if you have any input: @seandolinar on Twitter or
sean.dolinar@gmail.com

Having collected the time between pitches from PITCH/fx, I was able to look at the different factors that affect how long pitchers took between plays. [I’m defining this as the pitch pace.] PITCH/fx has a time stamp associated with each pitch. Using that time stamp, I was able to calculate the time between each pitch. I used the resulting calculation combined with other information available about each at-bat to draw some conclusions about what affects pace of play.
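The core calculation is simple enough to sketch. Assuming the PITCH/fx pitches are loaded into a pandas data frame with hypothetical game_id and timestamp columns [not the actual file or column names from my data set], it looks roughly like this:

import pandas as pd

#hypothetical file and column names, for illustration only
pitches = pd.read_csv('pitchfx_pitches.csv', parse_dates=['timestamp'])
pitches = pitches.sort_values(['game_id', 'timestamp'])

#seconds between each pitch and the one before it, within the same game
pitches['pace'] = pitches.groupby('game_id')['timestamp'].diff().dt.total_seconds()

#negative values come from the operator-entry quirks described below and get dropped
pitches = pitches[pitches['pace'] > 0]

print pitches['pace'].median()  #median, since the distribution is heavily right-skewed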

The most obvious influence on the time between pitches is whether or not there was a baserunner. This was rather simple to explore since PITCH/fx provides information on whether or not there is a runner on 1B, 2B, or 3B. Using this I was able to create the following table of median pitch pace. [I’ll explain why I decided to use the median and not the mean/average later.]

Median Pitch Pace

The data matches what your experience with baseball suggests. Pitchers will slow down the game when there is a runner on base. This will happen for several reasons: run-game tactics, conferences on the mound, and even time for the ball to get back to the pitcher after the play. Given the fact there is a slight drop off for when there isn’t an open base or there are two outs, I would conclude that the run-game prevention tactics play a rather significant role in the pitch pace.

Pitch Pace Distribution

The distribution of pitch pace data shows how often pitchers take 5-10 seconds, 10-15 seconds, 15-20 seconds, etc. between pitches. Both distributions are highly skewed right, so the average pitch pace isn’t representative of the central tendency of the data set; the median works a lot better in this situation to describe the most likely outcome.

The most frequent pitch pace with the bases empty is the 15-20 second range, while the most frequent pace bumps up to 20-25 seconds when runners are on base. MLB is kicking around the idea of having a 20-second pitch clock. From the distribution, it becomes apparent that keeping the pace under 20 seconds would have a real impact on the pace of play.

Pitch Pace Box Plot

I created a box plot to show another perspective of the distributions. The mean of the runners on base pitch pace is significantly higher than the mean of the pitch pace with bases empty.

Data Background

PITCH/fx data isn’t designed to accurately measure the time between pitches, so it has some problems. A human operator is needed to enter data on each pitch, such as ball/strike, information about the hit, or whether runs scored. For this reason, the data is very messy. Subtracting the time of each pitch from the pitch prior sometimes yields negative numbers, because the operator entered the previous pitch after the pitcher threw the next one. For these reasons I have to re-examine cleaning and processing the PITCH/fx data.

Further Work

I need to clean the data further. This will include identifying and excluding first pitches from at-bats and aggregating each at-bat. This should alleviate some of the delay problems associated with the human entry component of PITCH/fx.

I want to look at leverage’s impact on the pitch pace. My initial analysis is that leverage doesn’t matter all too much when you consider if there’s a player on base or not since leverage and having a player on base are collinear. With cleaner data the effect of leverage or post season play might be more apparent.

I’m also going to look at the time between innings. This should change depending on the broadcast; national broadcasts have longer commercial breaks. There should also be artifacts for weather delays.

Pitching changes should also be included. Inning breaks with new pitchers tend to be longer; it would be nice to see how much longer they are in the aggregate.

All of these need to be programmed into a parser that looks at the data sequentially. My plan is to update this page once I have more research available.


2015 Steelers-Ravens Playoff Twitter Infographics

The Steelers-Ravens playoff game gave me a chance to test out a new analytics server and some of the tools I’ve been working on to make Twitter analysis easy using ad hoc Python scripts. So here goes:

There were a lot of Steelers- or Ravens-colored emojis: black and gold hearts or buttons and the purple devils. Though for some reason, the ‘crying my eyes out’ emoji is by far the most popular in this collection of tweets. The yellow line represents how many unique tweets featured that emoji. For example, 14 of the same emoji in one tweet would count for 14 in the blue bar, while it would count for just 1 toward the yellow line.

2015 Steelers-Ravens Playoffs Emoji Use

Here’s the hashtag use. The #steelers exceeded the #ravens. This looks cool, but it doesn’t tell you much.

2015 Steelers-Ravens Playoffs Hashtag Use

Here’s a bar chart that’s a lot easier to read if you want the information.

2015 Steelers-Ravens Playoffs Hashtag Bar Chart


One Mean Z-test [with R code]

I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki

Building on finding z-scores for individual measurements or values within a population, a z-test can determine if there is a statistically significant difference between a sample mean and a population mean with a known population standard deviation. [Those conditions are essential for using this test.] The z-test uses z-scores and a normal distribution to determine the probability that the sample mean was drawn randomly from a known population. If the test fails to reject the null hypothesis, the conclusion is that random sampling likely produced the difference. If the test rejects the null hypothesis, then the sample is likely a result of non-random sampling [i.e. like team captains picking the tallest kids for a basketball game in gym class].

The z-test relies critically on the central limit theorem, which basically states that if you take an n >= 30 sample from a population [with any distribution] many times over, you’ll get a normal distribution of the sample means. [This needs its own post to explain fully, and there are interesting ways you can program R to illustrate this.] The sample mean distribution chart is shown below, compared to the population distribution. The important concepts to notice here are:

  • the area of both distributions is equal to 1
  • the sample mean distribution is a normal distribution
  • the sample mean distribution is tighter and taller than the population distribution

Central Limit Theorem Comparison to Population Distribution

For the rest of this post, the sample mean distribution will be used for the z-test, and it is represented in green as opposed to blue. The data I use in this post is height data from this data set; it represents the heights of 25,000 children from Hong Kong. The data doesn’t reflect US adults, but it’s a great normally distributed data set.

The goal of the z-test is to determine whether a sample and its mean were randomly drawn from the population or whether there’s some significant difference. For example, you could use this test to see if the average height of NBA players is statistically significantly different from the general population. While the NBA example is pretty common sense, not every problem will be that clear. Sample size [like in many hypothesis tests] is a huge factor: small sample sizes require huge differences between the sample mean and the population mean to be significant.

For a one-mean z-test, we will be using a one-tail hypothesis test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is greater. The null and alternate hypotheses are written out as:

  • $latex H_0: \bar{x} = \mu&s=2$
  • $latex H_A: \bar{x} > \mu&s=2$

One-Tailed Z-test

The graph above shows the critical region for a right-tailed z-test. The critical region reflects the area where the z-stat has to fall in order for the test to reject the null hypothesis. The critical region is defined so that it covers a probability equal to the significance level [1 minus the stated confidence level]. For example, the critical region for a 95% confidence level only has an area [probability] of 5%: if the sample mean really came from the population, there’s only a 5% chance of the z-stat landing there by random chance. This concept is the basis for almost every hypothesis test.

The z-test uses the z-stat, which is calculated analogously to the z-score, the difference being that it uses standard error instead of standard deviation. These two concepts are similar: the standard deviation applies to the ‘spread’ of the blue population distribution, while the standard error applies to the ‘spread’ of the green sample mean distribution. The z-stat is calculated as:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

The higher the z-stat is, the more certainty there is that the sample mean and the population mean are different. There are three things that make the z-stat larger:

  • a bigger difference between sample mean and population mean
  • a small population standard deviation
  • a larger sample size

Example

I have two samples from the data set: one is entirely random, and the other I weighted heavily towards taller people. The null hypothesis is that there’s no difference between the sample mean and the population mean; the alternate is that the sample mean is greater than the population mean. The tall-weighted sample is the kind of sample you’d be looking at if you were evaluating the mean height of a basketball team vs the general population. Here are the two n=50 samples and the R code showing how I constructed them using a set random seed of 123.

Unbiased random sample

unbiased_sample

Tall-biased random sample

biased_sample

#unbiased random sample
set.seed(123)
n <- 50
height_sample <- sample(height, size=n)
sample_mean <- mean(height_sample)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)
sample_mean_biased <- mean(height_sample_biased)

The population mean is 67.993, the unbiased sample mean is 68.099, and the tall-biased sample mean is 68.593. Both samples are higher than the population mean, but are both significantly higher than the mean? To figure this out, we need to calculate the z-stats and find out whether those z-stats fall in the critical region using the equation:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

We can substitute and calculate with the population standard deviation [σ] = 1.902:

$latex z_{unbiased} = \frac{68.099 - 67.993}{1.902/\sqrt{50}} = 0.3922 \ \ \ \ z_{tall-biased} = \frac{68.593 - 67.993}{1.902/\sqrt{50}} = 2.229 &s=0$

#random unbiased sample
#z-stat calculation
sample_mean
z <- (sample_mean - pop_mean)/(pop_sd/sqrt(n))

#tall-biased sample
z <- (sample_mean_biased - pop_mean)/(pop_sd/sqrt(n))

Quickly, knowing that the critical value for a one-tail z-test at 95% confidence is 1.645, we can determine the unbiased random sample is not significantly different, but the tall-biased sample is significantly different. This is because the z-stat for the unbiased sample is less than the critical value, while the tall-biased is higher than the critical value.

Failed Z-test Example Comparison

Plotting the z-test for the unbiased sample, the area [probability] to the right of the z-stat is much higher than the accepted 5%. The larger the green area is the more likely the difference between the sample mean and the population mean were obtained by random chance. To get a z-test to be significant, you want to get the z-stat high so that the area [probability] is low. [In practice, this can be done by increasing sample size.]

Successful Z-test Example

The tall-biased sample mean's z-stat creates a plot with much less area to the right of the z-stat, so these results were much less likely to be obtained by chance. The p-values can be obtained by calculating the area to the right of the z-stat. The R code below summarizes how to do that using R's 'pnorm' function.

#calculating the p-value
p_yellow2 <- pnorm(z)                   
p_green2 <- 1 - p_yellow2
p_green2

The p-value for the unbiased sample is .3474, meaning there's a 34.74% chance that the result was obtained due to random chance, while the tall-biased sample has a p-value of only .01291, or a 1.291% chance of being a result of random chance. Since the p-value of the tall-biased sample is less than .05, the null hypothesis is rejected, but since the unbiased sample's p-value is well above .05, the null hypothesis is retained.

What the one-mean z-test accomplished was telling us that a simple random sample from a population wasn't really that different from the population, while a sample that wasn't completely random and was much taller than the overall population was shown to be different. While this test isn't used often, the principles of distributions, calculating test stats, and p-values have many applications within the statistics universe.


Calculating Z-Scores [with R code]

I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki

Normal distributions are convenient because they can be scaled to any mean or standard deviation meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of these measurements should all be completely different. In order to get the distributions standardized, the measurements can be changed into z-scores.

Z-scores are a stand-in for the actual measurement, and they represent the distance of a value from the mean measured in standard deviations. So a z-score of 2.0 means the measurement is 2 standard deviations away from the mean.

To demonstrate how this is calculated and used, I found a height and weight data set on UCLA’s site. They have height measurements from children from Hong Kong. Unfortunately, the site doesn’t give much detail about the data, but it is an excellent example of normal distribution as you can see in the graph below. The red line represents the theoretical normal distribution, while the blue area chart reflects a kernel density estimation of the data set obtained from UCLA. The data set doesn’t deviate much from the theoretical distribution.

Normal Distribution Z-Score Comparison

The z-scores are also listed on this normal distribution to show how the actual measurements of height correspond to the z-scores, since the z-scores are simple arithmetic transformations of the actual measurements. The first step to find the z-score is to find the population mean and standard deviation. It should be noted that the sd function in R uses the sample standard deviation and not the population standard deviation, though with 25,000 samples the difference is rather small.

#DATA LOAD
data <- read.csv('Height_data.csv')
height <- data$Height

hist(height) #histogram

#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)

Using just the population mean [μ = 67.99] and standard deviation [σ = 1.90], you can calculate the z-score for any given value of x. In this example I'll use 72 for x.

$latex z = \frac{x - \mu}{\sigma} &s=2$

z <- (72 - pop_mean) / pop_sd

This gives you a z-score of 2.107. To put this tool to use, let's use the z-score to find the probability of finding someone who is taller than 72 inches [6-foot]. [Remember, this data set doesn't apply to adults in the US, so these results might conflict with everyday experience.] The z-score will be used to determine the area [probability] underneath the distribution curve past the z-score value that we are interested in.
[One note: you have to specify a range (72 to infinity) and not a single value (72). If you wanted to find people who are exactly 6-foot, not taller than 6-foot, you would have to specify the range of 71.5 to 72.5 inches. That's a different problem, and it has everything to do with definite integral intervals if you are familiar with Calc I.]

Probability of Finding Someone Taller than 6 Comparison

The above graph shows the area we intend to calculate. The blue area is our target, since it represents the probability of finding someone taller than 6-foot. The yellow area represents the rest of the population, or everyone who is under 6-feet tall. The z-score and actual height measurements are both given, underscoring the relationship between the two.

Typically in an introductory stats class, you'd look the z-score up in a table and find the probability that way. R has a function 'pnorm' which will give you a more precise answer than a table in a book. ['pnorm' stands for "probability normal distribution".] Both R and typical z-score tables will return the area under the curve from -infinity to the value on the graph; this is represented by the yellow area. In this particular problem, we want to find the blue area. The solution is an easy arithmetic operation: the total area under the curve is 1, so subtracting the yellow area from 1 gives you the area [probability] of the blue area.

Yellow Area:

p_yellow1 <- pnorm(72, pop_mean, pop_sd)    #using x, mu, and sigma
p_yellow2 <- pnorm(z)                       #using z-score of 2.107

Blue Area [TARGET]:

p_blue1 <- 1 - p_yellow1   #using x, mu, and sigma
p_blue2 <- 1 - p_yellow2   #using z-score of 2.107

Both of these techniques in R will yield the same answer of 1.76%. I used both methods to show that R has some versatility that traditional statistics tables don't have. I personally find statistics tables antiquated, since we have better ways to determine the probability, and the table doesn't provide any insight over software solutions.

Z-scores are useful for relating different measurement distributions to each other, acting as a 'common denominator'. Z-scores are used extensively for determining the area underneath the curve when using textbook tables, and they can also be easily used in programs such as R. Some statistical hypothesis tests are based on z-scores and the basic principle of finding the area beyond some value.