Tag Archives: featured

Using New, Diverse Emojis for Analysis in Python

I haven't been updating this site often since I started a similar job over at FanGraphs. All non-baseball stat work that I do will continue to be housed here.

Over the past week, Apple implemented new emojis with a focus on diversity in its iOS 8.3 and OS X 10.10.3 updates. I've written quite a bit about the underpinnings of emojis and how to get Python to run text analytics on them. The new emojis provide another opportunity to gain insights into how people interact, feel, or use them. Like always, I prefer to use Python for any web scraping or data processing, and emoji processing is no exception. I already wrote a basic primer on how to get Python to find emoji in your text. If you combine the tutorials I have for tweet scraping, MongoDB, and emoji analysis, you have yourself a really nice suite of data analysis tools.

Modifier Patch

These new emojis are a product of the Unicode Consortium's plan for incorporating racial diversity into the previously all-white human emoji lineup. (And yes, there's a consortium for emoji planning.) The method used to produce the new emojis isn't quite as simple as making a new character/emoji. Instead, they decided to include a modifier patch at the end of human emojis to indicate skin color. As an end-user, this won't affect you if you have all the software updates and your device can render the new emojis. However, if you don't have the updates, you'll get something that looks like this:

Emoji Patch Error
That box at the end of the emoji is the modifier patch. Essentially, there is a default emoji (in this case the old man) followed by the modifier patch (the box). On older systems the patch doesn't display, because the old software doesn't know how to interpret the new data. This method actually allows the emojis to be backwards compatible, since the default emoji still conveys at least part of the meaning. If you have the new updates, you will see the top row of emoji.

Emoji Plus Modifier Patches

With a little manipulation (copying and pasting) on my newly updated iPhone, we can figure out what is really going on with these emojis. There are five skin color patches that can be added to each human emoji, as demonstrated on the bottom row of emoji. You might notice there are a lot of yellow emoji: yellow (Simpsons-style) emojis are now the default, so that no single real skin tone is the default. The yellow emojis have no modifier patch attached to them, so if you simply upgrade your phone and computer and then go back and look at old texts, all the emojis with people in them will now be yellow.

New Families

The new emoji update also includes new families. These are also a little different, since they are essentially combinations of other emoji. The original family emoji is one single emoji, but the new families with multiple children and various combinations of children and partners contain multiple emojis. The graphic below demonstrates this.

Emoji New Families

The man, woman, girl, and boy emoji are combined to form that specific family emoji. I've seen criticisms that the families aren't multiracial. I'd have to believe the limitation here is a technical one, since I don't believe the Unicode Consortium has an effective method to apply modifier patches to combinations of multiple emojis at once. That would result in an unmanageable number of glyphs in the font set to represent the characters. (5^4 = 625 different combinations for just one family of four, and there are many different families with different gender combinations.)

New Analysis

So now that we have the background on how the new emojis work, we can update how we search for and analyze them. I have updated my emoji .csv file so that anyone can download it and run a basic search within a text corpus. I have also added the updated file to the GitHub repository for the socialmediaparse library I built.

The modifier patches are searchable, so now you can search for certain swatches (or lack thereof). Below I have written out the unicode escape output for the default (yellow) man emoji and its light-skinned variation. The emoji with a human skin color has an extra piece of code at the end.

#unicode escape
\U0001f468 #unmodified man
\U0001f468\U0001f3fb  #light-skinned man

Here are all the modifier patches as unicode escape.

Emoji Modifier Patches

#modifier patch unicode escape
\U0001f3fb  #skin tone 1 (lightest)
\U0001f3fc  #skin tone 2
\U0001f3fd  #skin tone 3
\U0001f3fe  #skin tone 4
\U0001f3ff  #skin tone 5 (darkest)

The easiest way to search for these is to use the following snippet of code:

#searches for any emoji with skin tone 5 [Python 2: the escaped text is a plain str]
unicode_object = u'Some text with emoji in it as a unicode object not str!'

if '\U0001f3ff' in unicode_object.encode('unicode_escape'):
    pass  #do something with the matching text
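
That snippet is written for Python 2, where encode('unicode_escape') returns a plain str that can be compared against the literal text '\U0001f3ff'. A minimal sketch of the same check in Python 3 [assuming nothing beyond the standard library]:

#Python 3: every str is unicode, so the code point can be tested directly
text = 'Some text with emoji in it'

if '\U0001f3ff' in text:
    print('found skin tone 5')

#the escape-based check still works if bytes are compared to bytes
if b'\\U0001f3ff' in text.encode('unicode_escape'):
    print('found skin tone 5 (escape check)')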

You can throw either snippet into a for loop over a Pandas data frame or a MongoDB cursor. I'm planning on updating my socialmediaparse library with patch searching, and I'll update this post when I do.

Spock

Finally, there’s Spock!

Emoji Spock

The unicode escape for Spock is:

\U0001f596

Add your modifier patches as needed.

Collecting Twitter Data: Introduction

Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


Collecting Twitter data is a great exercise in data science and can provide interesting insights into how people behave on the social media platform. Below is an overview of the steps this tutorial series will go through to arrive at being able to analyze Twitter data.

  1. Overview of what the Twitter API does
  2. Get R or Python
  3. Install Twitter packages
  4. Get Developer API Key from Twitter
  5. Write Code to Collect Tweets
  6. Parse the Raw Tweet Data [JSON files]
  7. Analyze the Tweet Data

Introduction

Before diving into the technical aspects of how to use the Twitter API [Application Program Interface] to collect tweets and other data, I want to give a general overview of what the Twitter API is and isn't capable of doing. First, data collection on Twitter doesn't necessarily produce a representative sample for making inferences about the general population, and people tend to be rather emotional and negative on Twitter. That said, Twitter is a treasure trove of data and there are plenty of interesting things you can discover. You can pull various data structures from Twitter: tweets, user profiles, user friends and followers, what's trending, etc. There are three methods to get this data: the REST API, the Search API, and the Streaming API. The Search API is retrospective and allows you to search old tweets [with severe limitations], the REST API allows you to collect user profiles, friends, and followers, and the Streaming API collects tweets in real time as they happen. [This is best for data science.] Because the Streaming API only collects tweets as they happen, most Twitter analysis has to be planned beforehand, or at least tweets have to be collected prior to the timeframe you want to analyze. There are some ways around this if Twitter grants you permission, but the run-of-the-mill Twitter account will find the Streaming API much more useful.

The Twitter API requires a few steps:

  1. Authenticate with OAuth
  2. Make API call
  3. Receive JSON file back
  4. Interpret JSON file

The authentication requires that you get an API key from the Twitter developers site. This just requires that you have a Twitter account. The four keys the site gives you are used as parameters in the programs. The OAuth authentication gives your program permission to make API calls.
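
As a rough sketch of that step with the tweepy package [installed later in this tutorial; the four values below are placeholders for the keys Twitter gives you]:

import tweepy

#the four keys from the Twitter developer site [placeholder values]
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

#the OAuth object gives your program permission to make API calls
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)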

The API call is an HTTP call with the parameters incorporated into the URL, like:
https://stream.twitter.com/1.1/statuses/filter.json?track=twitter
This Streaming API call asks to connect to Twitter and track the keyword 'twitter'. Using prebuilt software packages in R or Python will hide this step from you, the programmer, but these calls are happening behind the scenes.
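
With tweepy, that same call looks roughly like this [a sketch continuing from the auth object above and assuming the tweepy 3.x-era API; a proper custom listener is covered in Part III]:

#tweepy builds the filter.json request and keeps the connection open
stream = tweepy.Stream(auth, tweepy.StreamListener())
stream.filter(track=['twitter'])  #track the keyword 'twitter'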

JSON files are the data structure that Twitter returns. They are rather comprehensive in the amount of data they contain, but they are hard to use without being parsed first. Some of the software packages have built-in parsers, or you can use a NoSQL database like MongoDB to store and query your tweets.
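
As a tiny illustration, a single tweet arrives as one JSON string that can be turned into a nested Python dictionary [the string below is a toy stand-in for what the API actually returns]:

import json

#a heavily trimmed stand-in for a real tweet's JSON
raw_json = '{"text": "an example tweet", "user": {"screen_name": "example_user"}}'

tweet = json.loads(raw_json)          #parse the JSON string into a dict
print(tweet['text'])                  #the tweet's text
print(tweet['user']['screen_name'])   #the account that tweeted it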

Get R or Python

While there are many different programming languages that can interface with the API, I prefer to use either Python or R for Twitter data scraping. R is easier to use out of the box if you are just getting started with coding, and Python offers more flexibility. If you don't have either of these, I'd recommend installing one and learning to do some basic things before tackling Twitter data.

Download R: http://cran.rstudio.com/
R Studio: http://www.rstudio.com/ [optional]

Download Python: https://www.python.org/downloads/

Install Twitter Packages

The easiest way to access the API is to install a software package with prebuilt libraries that make coding projects much simpler. Since this tutorial is primarily focused on using the Streaming API, I recommend installing the streamR package for R or tweepy for Python. If you have a Mac, Python is already installed and you can run it from the terminal. I recommend getting a program like PyCharm to help you organize your projects, but that is beyond the scope of this tutorial.

R
[in the R environment]

install.packages('streamR')
install.packages('ROAuth')
library(ROAuth)
library(streamR)

Python
[in the terminal, assuming you have pip installed]

$ pip install tweepy

 


Part I: Introduction [current page] | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

SOTU Title

2015 State of the Union Address — Text Analytics

I collected tweets about the 2015 State of the Union address [SOTU] in real time from 10am to 2am using the keywords [obama, state of the union, sotu, sotusocial, ernst]. The tweets were analyzed for sentiment, content, emoji, hashtags, and retweets. The graph below shows Twitter activity over the course of the night. The volume of tweets and the sentiment of reactions were the highest during the latter half of the speech when Obama made the remark “I should know; I won both of them” referring to the 2008 & 2012 elections he won.

2015 State of the Union Tweet Volume

Throughout the day before the speech, there weren't many tweets, and they tended to be neutral. These tweets typically contained links to news articles previewing the SOTU address or reminders about the speech. Both of these types of tweets are factual but bland when compared to the commentary and emotional reaction that occurred during the SOTU address itself. The huge spike in Twitter traffic didn't happen until the President walked onto the House floor, which was just before 9:10 PM. When the speech started, the sentiment increased to about 0.3 positive words per tweet, suggesting that the SOTU address was well received [at least by the people who bothered to tweet].

The most negative sentiment of the day occurred around 7:45-8:00 PM. I've looked back through the tweets from that time and couldn't find anything definitive that would have caused it. My conjecture is that this is when news coverage started and strongly opinionated people began watching the news and tweeting.

The highest sentiment/number of positive words came during the 15-minute window in which the President quipped about winning two elections. Unfortunately, that sound bite didn't make a great hashtag, so it didn't show up elsewhere in my analysis. However, there were many news articles and much discussion about that off-the-cuff remark, and it will probably be the most memorable moment of the SOTU address.

Emoji

Once again [Emoji Popularity Link], the crying-my-eyes-out emoji proved to be the most used emoji in SOTU tweets, typically appearing in tweets that aren't serious and are generally sarcastic. Not surprisingly, the clapping emoji was the second most popular, mimicking the copious ovations the SOTU address receives. Other notably popular emoji are the fire, the US flag, the zzzz emoji, and the skull. The US flag reflects the patriotic themes of the night, the fire generally reflects praise for Obama's speech, and the skull and zzzz comment on spectators in the crowd.

2015 State of the Union Twitter Emoji Use

Two topic-specific emoji counts were interesting. In most of my tweet collections, the crying-my-eyes-out emoji is far more popular than any other emoji. Understandably, the set of tweets containing language associated with terrorism had more handclaps, flags, and angry emoji, reflecting the serious nature of the subject.

2015 State of the Union Subject Emojis

Tweets corresponding to the GOP response had a preponderance of pig-related emojis due to Joni Ernst's campaign ad.

#Hashtags

The following hashtag globe graphic is rather large; please enlarge it to see the most popular hashtags associated with the SOTU address. I removed the #SOTU hashtag, because it was used extensively and overshadowed the rest. For those wondering, #TCOT stands for Top Conservatives on Twitter, and the #P2 hashtag is its progressive counterpart. [Source]

2015 State of the Union Hashtag Globe

RTs

The White House staff won the retweeting war, running the two most retweeted [RT] accounts during the speech last night. This graph represents the total summed RTs over all the tweets each account made. Since the White House and Barack Obama accounts tweeted constantly during the speech, they accumulated the most retweets. Michael Clifford had the most retweeted single tweet, stating that he had just about met the President. If you are wondering who Michael Clifford is, you aren't alone, because I had to look him up; he's the 19-year-old guitarist from 5 Seconds of Summer. The tweet is from August; however, people did retweet it during the day. [I was measuring the max retweet count on the tweets.] Rand Paul was the most retweeted politician other than the President, and the Huffington Post had the most retweets for a news outlet.

2015 State of the Union Popular Retweets

The Speech

Obama released his speech online before starting the State of the Union address. I used this for a quick word-count analysis; it doesn't contain the off-the-cuff remarks, just the script, which he did stick to with few exceptions. The first graph uses counts of single words, with 'new' being by far the most used word.

2015 State of the Union Address Word Frequency

This graph shows the most used two-word combinations [also known as bi-grams].

2015 State of the Union Address Bigram Frequency

Further Notes

I was hoping this would be the perfect opportunity to test out my sentiment analysis process, but the evaluation results were rather moderate, achieving about 50% accuracy across three classes [negative, neutral, positive]. In this case 50% is an improvement over a 33% random guess, but it's not very encouraging overall. For the sentiment portion of the tweet volume graph, I used the bag-of-words approach that I have used many times before.
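
For reference, the bag-of-words scoring boils down to counting lexicon words in each tweet; a minimal sketch [the word lists here are tiny stand-ins, not the lexicons I actually used]:

#minimal bag-of-words sentiment: positive word count minus negative word count
positive_words = {'great', 'win', 'love', 'good'}    #stand-in lexicon
negative_words = {'bad', 'boring', 'hate', 'worst'}  #stand-in lexicon

def sentiment_score(tweet_text):
    words = tweet_text.lower().split()
    return sum(w in positive_words for w in words) - sum(w in negative_words for w in words)

print(sentiment_score('great speech, love the energy'))  #2
print(sentiment_score('this is boring'))                 #-1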

A more interesting and informative classifier might try to classify tweets into sarcastic/trolling, positive, and angry genres. I had problems classifying some tweets as positive or negative, because there were many news links, which are neutral, and sarcastic comments, which look positive but feel negative. For politics, classifying the political position might be more useful, since a liberal could be mocking Boehner one minute, then praising Obama the next. Having two tweets classified as liberal, rather than one negative tweet and one positive tweet, is much more informative when aggregating.

MLB — Pace of Play [Working Post]

This post is a work in progress. The data concerning the pace of play is rather messy, and this project is rather large compared to what I normally tackle. For that reason, I'm going to start this post and update it as a 'working post'. Please feel free to contact me if you have any input: @seandolinar on Twitter or
sean.dolinar@gmail.com

Having collected the time between pitches from PITCH/fx, I was able to look at the different factors that affect how long pitchers took between plays. [I’m defining this as the pitch pace.] PITCH/fx has a time stamp associated with each pitch. Using that time stamp, I was able to calculate the time between each pitch. I used the resulting calculation combined with other information available about each at-bat to draw some conclusions about what affects pace of play.
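
A sketch of that calculation with pandas [the column names and timestamps below are placeholders, not the actual PITCH/fx field names]:

import pandas as pd

#one row per pitch with a timestamp and an at-bat id [placeholder data]
pitches = pd.DataFrame({
    'ab_id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2014-06-01 19:05:10', '2014-06-01 19:05:28',
                                 '2014-06-01 19:05:47', '2014-06-01 19:06:40',
                                 '2014-06-01 19:07:02'])
})

#pitch pace = seconds between a pitch and the previous pitch within the same at-bat
pitches['pace'] = pitches.groupby('ab_id')['timestamp'].diff().dt.total_seconds()

#operator-entry problems can produce negative gaps, so drop them before summarizing
clean = pitches[pitches['pace'] > 0]
print(clean['pace'].median())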

The most obvious influence on the time between pitches is whether or not there was a baserunner. This was rather simple to explore since PITCH/fx provides information on whether or not there is a runner on 1B, 2B, or 3B. Using this I was able to create the following table of median pitch pace. [I’ll explain why I decided to use the median and not the mean/average later.]

Median Pitch Pace

The data matches what your experience with baseball suggests: pitchers slow down the game when there is a runner on base. This happens for several reasons: run-game tactics, conferences on the mound, and even the time it takes for the ball to get back to the pitcher after the play. Given the slight drop-off when there isn't an open base or there are two outs, I would conclude that run-game prevention tactics play a rather significant role in the pitch pace.

Pitch Pace Distribution

The distribution of pitch pace data shows how often pitchers take 5-10 seconds, 10-15 seconds, 15-20 seconds, etc. between pitches. Both distributions are highly skewed right, so the average pitch pace isn’t representative of the central tendency of the data set; the median works a lot better in this situation to describe the most likely outcome.

The pitch pace with the highest frequency with the bases empty is the 15-20 second range, while the most frequent pitch pace bumps up to 20-25 seconds when runners are on base. MLB is kicking around the idea of a 20-second pitch clock. From the distribution, it becomes apparent that keeping the pace under 20 seconds would have a real impact on the pace of play.

Pitch Pace Box Plot

I created a box plot to show another perspective of the distributions. The mean of the runners on base pitch pace is significantly higher than the mean of the pitch pace with bases empty.

Data Background

PITCH/fx data isn't designed to accurately measure the time between pitches, so it has some problems. A human operator is needed to enter data on each pitch, such as ball/strike, information about the hit, or whether runs scored. For this reason, the data is very messy. Subtracting the time of each pitch from the prior pitch sometimes yields negative numbers, because the operator entered the previous pitch after the pitcher threw the next one. For these reasons I have to re-examine how I clean and process the PITCH/fx data.

Further Work

I need to clean the data further. This will include identifying and excluding first pitches from at-bats and aggregating each at-bat. This should alleviate some of the delay problems associated with the human entry component of PITCH/fx.

I want to look at leverage's impact on the pitch pace. My initial analysis is that leverage doesn't matter all that much once you consider whether there's a runner on base, since leverage and having a runner on base are collinear. With cleaner data, the effect of leverage or postseason play might be more apparent.

I'm going to look at the time between innings. This should change depending on the broadcast; national broadcasts have longer commercial breaks. There also should be artifacts from weather delays.

Pitching changes should also be included. Inning breaks with new pitchers tend to be longer; it would be nice to see how much longer they are in the aggregate.

All of these need to be programmed into a parser that looks at the data sequentially. My plan is to update this page once I have more research available.

2015 Steelers-Ravens Playoffs Hashtag Use

2015 Steelers-Ravens Playoff Twitter Infographics

The Steelers-Ravens playoff game gave me a chance to test out a new analytics server and some of the tools I’ve been working on to make Twitter analysis easy using ad hoc Python scripts. So here goes:

There were a lot of Steelers- or Ravens-colored emojis: black and gold hearts or buttons and the purple devils. Though for some reason, the 'crying my eyes out' emoji is by far the most popular in this collection of tweets. The blue bars count every occurrence of an emoji, while the yellow line represents how many unique tweets featured that emoji. For example, 14 of the same emoji in one tweet would count for 14 in the blue bar, but for just 1 in the yellow line.
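
A quick sketch of the difference between the two counts [toy tweets, with the emoji written as unicode escapes]:

#the 'crying my eyes out' emoji [U+1F62D]
emoji = u'\U0001f62d'
tweets = [u'\U0001f62d\U0001f62d\U0001f62d here we go', u'\U0001f62d wow', u'no emoji here']

total_occurrences = sum(tweet.count(emoji) for tweet in tweets)   #blue bar: 4
tweets_containing = sum(1 for tweet in tweets if emoji in tweet)  #yellow line: 2

print(total_occurrences)
print(tweets_containing)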

2015 Steelers-Ravens Playoffs Emoji Use

Here’s the hashtag use. The #steelers exceeded the #ravens. This looks cool, but it doesn’t tell you much.

2015 Steelers-Ravens Playoffs Hashtag Use

Here’s a bar chart that’s a lot easier to read if you want the information.

2015 Steelers-Ravens Playoffs Hashtag Bar Chart

One-Tailed Z-test

One Mean Z-test [with R code]

I've included the full R code, and the data set can be found on UCLA's Stats Wiki.

Building on finding z-scores for individual measurements or values within a population, a z-test can determine if there is a statistically significant difference between a sample mean and a population mean with a known population standard deviation. [Those conditions are essential for using this test.] The z-test uses z-scores and a normal distribution to determine the probability that the sample mean was drawn randomly from a known population. If the test fails to reject the null hypothesis, the conclusion is that random sampling likely produced the observed difference. If the test rejects the null hypothesis, then the sample is likely the result of non-random sampling [i.e. like team captains picking the tallest kids for a basketball game in gym class].

The z-test relies critically on the central limit theorem, which basically states that if you take an n >= 30 sample from a population [with any distribution] many times over, you'll get a normal distribution of the sample means. [This needs its own post to explain fully, and there are interesting ways you can program R to illustrate it.] The sample mean distribution chart is shown below compared to the population distribution. The important concepts to notice here are:

  • the area of both distributions is equal to 1
  • the sample mean distribution is a normal distribution
  • the sample mean distribution is tighter and taller than the population distribution
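
In symbols, the central limit theorem says the sample means are approximately normally distributed around the population mean with a spread [the standard error] that shrinks with the square root of the sample size:

$latex \bar{x} \sim N\left(\mu, \ \sigma/\sqrt{n}\right) &s=2$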

Central Limit Theorem Comparison to Population Distribution

For the rest of this post, the sample mean distribution will be used for the z-test, and it is represented in green as opposed to blue. Also, the data I use in this post is height data from this data set. It represents the heights of 25,000 children from Hong Kong. The data doesn't reflect US adults, but it's a great normally distributed data set.

The goal of the z-test is to see whether a sample and its mean were randomly drawn from the population or whether there's some significant difference. For example, you could use this test to see if the average height of NBA players is statistically significantly different from that of the general population. While the NBA example is pretty common sense, not every problem will be that clear. Sample size [like in many hypothesis tests] is a huge factor: small sample sizes require huge differences between the sample mean and the population mean to be significant.

For a one-mean z-test, we will be using a one-tailed hypothesis test. The null hypothesis is that there is NO difference between the sample mean and the population mean. The alternate hypothesis is that the sample mean is greater than the population mean. The null and alternate hypotheses are written out as:

  • $latex H_0: \bar{x} = \mu&s=2$
  • $latex H_A: \bar{x} > \mu&s=2$

One-Tailed Z-test

The graph above shows the critical regions for a right-tailed z-test. The critical regions reflect the areas where the z-stat has to fall in order for the test to reject the null hypothesis. The critical regions are defined so that they represent a probability smaller than the stated significance level; for example, the critical region for a 95% confidence level has an area [probability] of only 5%. If the null hypothesis is true, there's only a 5% chance the z-stat falls in that region by random chance. This concept is the basis for almost every hypothesis test.
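
For the right-tailed test at a 95% confidence level used in this post, the critical value is the z value that leaves 5% of the area in the right tail:

$latex P(z > 1.645) = 0.05 &s=2$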

The z-test uses the z-stat, which is calculated analogously to the z-score, the difference being that it uses the standard error instead of the standard deviation. These two concepts are similar: the standard deviation describes the 'spread' of the blue population distribution, while the standard error describes the 'spread' of the green sample mean distribution. The z-stat is calculated as:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

The higher the z-stat, the more certain we can be that the sample mean and the population mean are different. Three things make the z-stat larger:

  • a bigger difference between sample mean and population mean
  • a smaller population standard deviation
  • a larger sample size

Example

I have two samples from the data set: one is entirely random and the other is weighted heavily towards taller people. The null hypothesis for both is that there's no difference between the sample mean and the population mean; the alternate is that the sample mean is greater than the population mean. The weighted sample mimics what you'd get if you were evaluating the mean height of a basketball team vs. the general population. Here are the two n=50 samples and the R code showing how I constructed them using a set random seed of 123.

Unbiased random sample

unbiased_sample

Tall-biased random sample

biased_sample

#unbiased random sample
set.seed(123)
n <- 50
height_sample <- sample(height, size=n)
sample_mean <- mean(height_sample)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)
sample_mean_biased <- mean(height_sample_biased)

The population mean is 67.993, the unbiased sample mean is 68.099, and the tall-biased sample mean is 68.593. Both samples are higher than the population mean, but are they significantly higher? To figure this out, we need to calculate the z-stats and find out if they fall in the critical region using the equation:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

We can substitute and calculate with the population standard deviation [σ] = 1.902:

$latex z_{unbiased} = \frac{68.099 - 67.993}{1.902/\sqrt{50}} = 0.3922 \ \ \ \ z_{tall-biased} = \frac{68.593 - 67.993}{1.902/\sqrt{50}} = 2.229 &s=0$

#random unbiased sample
#z-stat calculation
sample_mean
z <- (sample_mean - pop_mean)/(pop_sd/sqrt(n))

#tall-biased sample
z <- (sample_mean_biased - pop_mean)/(pop_sd/sqrt(n))

Quickly, knowing that the critical value for a one-tail z-test at 95% confidence is 1.645, we can determine the unbiased random sample is not significantly different, but the tall-biased sample is significantly different. This is because the z-stat for the unbiased sample is less than the critical value, while the tall-biased is higher than the critical value.

Failed Z-test Example Comparison

Plotting the z-test for the unbiased sample, the area [probability] to the right of the z-stat is much higher than the accepted 5%. The larger the green area, the more likely the difference between the sample mean and the population mean was obtained by random chance. To get a significant z-test, you want the z-stat to be high so that the area [probability] is low. [In practice, this can be done by increasing sample size.]

Successful Z-test Example

The tall-biased sample mean's z-stat creates a plot with much less area to the right of the z-stat, so these results were much less likely to be obtained by chance. The p-values can be obtained by calculating the area to the right of the z-stat. The R code below summarizes how to do that using R's 'pnorm' function.

#calculating the p-value
p_yellow2 <- pnorm(z)                   
p_green2 <- 1 - p_yellow2
p_green2

The p-value for the unbiased sample is .3474, meaning there's a 34.74% chance that the result was obtained due to random chance, while the tall-biased sample has a p-value of only .01291, a 1.291% chance of being a result of random chance. Since the p-value of the tall-biased sample is less than .05, the null hypothesis is rejected, but since the unbiased sample's p-value is well above .05, the null hypothesis is retained.

What the one-mean z-test accomplished was telling us that a simple random sample from a population wasn't really that different from the population, while a sample that wasn't completely random and was much taller than the overall population was shown to be different. While this test isn't used often, the principles of distributions, calculating test stats, and p-values have many applications within the statistics universe.

Probability of Finding Someone Taller than 6 Comparison

Calculating Z-Scores [with R code]

I've included the full R code, and the data set can be found on UCLA's Stats Wiki.

Normal distributions are convenient because they can be scaled to any mean or standard deviation, meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of these measurements are all completely different. In order to standardize the distributions, the measurements can be converted into z-scores.

Z-scores are a stand-in for the actual measurement, and they represent the distance of a value from the mean measured in standard deviations. So a z-score of 2.0 means the measurement is 2 standard deviations away from the mean.

To demonstrate how this is calculated and used, I found a height and weight data set on UCLA's site. It has height measurements for children from Hong Kong. Unfortunately, the site doesn't give much detail about the data, but it is an excellent example of a normal distribution, as you can see in the graph below. The red line represents the theoretical normal distribution, while the blue area chart reflects a kernel density estimation of the data set obtained from UCLA. The data set doesn't deviate much from the theoretical distribution.

Normal Distribution Z-Score Comparison

The z-scores are also listed on this normal distribution to show how the actual height measurements correspond to the z-scores, since the z-scores are simple arithmetic transformations of the actual measurements. The first step to find the z-score is to find the population mean and standard deviation. It should be noted that the sd function in R uses the sample standard deviation and not the population standard deviation, though with 25,000 samples the difference is rather small.

#DATA LOAD
data <- read.csv('Height_data.csv')
height <- data$Height

hist(height) #histogram

#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)

Using just the population mean [μ = 67.99] and standard deviation [σ = 1.90], you can calculate the z-score for any given value of x. In this example I'll use 72 for x.

$latex z = \frac{x - \mu}{\sigma} &s=2$

z <- (72 - pop_mean) / pop_sd
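
Substituting the population parameters calculated above [using the more precise values μ = 67.993 and σ = 1.902]:

$latex z = \frac{72 - 67.993}{1.902} = 2.107 &s=2$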

This gives you a z-score of 2.107. To put this tool to use, let's use the z-score to find the probability of finding someone taller than 72 inches [6 feet]. [Remember this data set doesn't apply to adults in the US, so these results might conflict with everyday experience.] The z-score will be used to determine the area [probability] underneath the distribution curve past the z-score value that we are interested in.
[One note is that you have to specify a range (72 to infinity) and not a single value (72). If you wanted to find people who are exactly 6 feet tall, not taller than 6 feet, you would have to specify the range of 71.5 to 72.5 inches. This is another problem, but it has everything to do with definite integral intervals if you are familiar with Calc I.]

Probability of Finding Someone Taller than 6 Comparison

The above graph shows the area we intend to calculate. The blue area is our target, since it represents the probability of finding someone taller than 6 feet. The yellow area represents the rest of the population, or everyone who is under 6 feet tall. The z-score and actual height measurements are both given, underscoring the relationship between the two.

Typically in an introductory stats class, you'd look the z-score up in a table to find the probability. R has a function 'pnorm' which will give you a more precise answer than a table in a book. ['pnorm' stands for "probability normal distribution".] Both R and typical z-score tables return the area under the curve from -infinity to the value on the graph, which is represented by the yellow area. In this particular problem, we want to find the blue area. The solution is easy arithmetic: the total area under the curve is 1, so subtracting the yellow area from 1 gives you the area [probability] of the blue region.

Yellow Area:

p_yellow1 <- pnorm(72, pop_mean, pop_sd)    #using x, mu, and sigma
p_yellow2 <- pnorm(z)                       #using z-score of 2.107

Blue Area [TARGET]:

p_blue1 <- 1 - p_yellow1   #using x, mu, and sigma
p_blue2 <- 1 - p_yellow2   #using z-score of 2.107

Both of these techniques in R will yield the same answer of 1.76%. I used both methods to show that R has some versatility that traditional statistics tables don't have. I personally find statistics tables antiquated, since we have better ways to determine these values, and the tables don't provide any insight over software solutions.

Z-scores are useful when relating different measurement distributions to each other, acting as a 'common denominator'. Z-scores are used extensively for determining the area underneath the curve when using textbook tables, and they can also be easily used in programs such as R. Some statistical hypothesis tests are based on z-scores and the basic principle of finding the area beyond some value.

emoji header

The Most Popular Emoji Characters on Twitter

On Twitter, about 10% of general-topic tweets contain emoji characters, the tiny icons and emoticons that are starting to get more attention when analyzing tweets, Facebook messages, or text messages. An emoji can capture an emotion or completely change the meaning of the written text. Before exploring how different emojis are used and what they mean to people, I wanted to get an idea of how prevalent they are and which ones are the most popular on Twitter.

Emotion:

Changes Meaning:

How I Did This

I collected tweets using a sampled stream from Twitter. In order to get a generally representative sample of tweets, I tracked five popular, basic words: 'the', 'and', 'to', 'you', and 'it'. These words are good search words, since there aren't many sentences or thoughts that don't use them. A Python script was used to find and count all the emojis present in a collection of over 100,000 tweets. To avoid skewing due to a popular celebrity or viral tweet, I removed obvious retweets, but not retweets that function more like mentions.
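
A stripped-down sketch of that counting step [the emoji list and tweets below are placeholders; the real script works off my emoji .csv file and a much larger collection]:

from collections import Counter

#placeholder emoji list; the full list comes from the emoji .csv file
emoji_list = [u'\U0001f602', u'\U0001f62d', u'\u2764\ufe0f']

#placeholder tweets; the real script iterates over the full collection
tweets = [u'RT @someone: \U0001f602', u'so funny \U0001f602\U0001f602', u'love it \u2764\ufe0f']

counts = Counter()
tweets_with_emoji = 0
for text in tweets:
    if text[0:2] == 'RT':   #drop automatic retweets
        continue
    found = [e for e in emoji_list if e in text]
    if found:
        tweets_with_emoji += 1
    counts.update(found)    #each emoji counted once per tweet it appears in

print(counts.most_common())
print(tweets_with_emoji)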

Results

Emoji Use on Twitter

In the general collection of tweets, I found that 10.23% of tweets contained at least one emoji. So there isn't an overwhelming number of tweets containing an emoji, but 10% of Twitter content is a significant portion. The 'Emoji Selection' graph shows the percentage of tweets containing a particular emoji out of the tweets that HAD an emoji in them. The most popular emoji by far was the 'tears of joy' emoji, followed by the 'loudly crying' emoji. Heart-related emoji [the ones I thought would prove most popular] were third and fourth.

Emoji Selection on Twitter

Since I only collected these over the course of a day and not over several weeks or months, I would be hesitant to think these results would hold up over time. An event or seasonality can trigger a cascade of people using a certain emoji. For example, the Christmas tree emoji was popular, being present in 2.16% of tweets that included emojis; this would be expected to get larger as we get closer to Christmas and smaller after Christmas. Another interesting find is that the emoji ranks high. My pure conjecture is that this emoji's high use rate is due to protests in Ferguson and around the country. To confirm this I would need a sample of tweets from before the grand jury announcement, or to track its use as time passes.

Further analysis could utilize emoji groups or clusters. Emojis with similar meanings would not necessarily produce a high count individually if people spread their selections over five similar emoji instead of one. I plan to update and expand on this as time passes and I'm able to collect more data.

Technical

In order to avoid any conflicts with the ASCII conversions that some Python or R packages perform on Twitter data, I stored tweets from the Twitter Streaming API directly into a MongoDB database, which encodes strings in UTF-8. Since tweets come from the API as JSON objects, they can be naturally stored in the document-oriented database, with each metadata field in the tweet accessible without parsing the entire tweet into a data frame or SQL database. Retweets were removed by finding any tweets with 'RT' in the first two characters of the text entry; this is how Twitter represents automatic retweets in JSON format.

Also, since I collected 103,416 tweets, the margin of error for any of the proportions given is well below 1%. Events within the social network would definitely outweigh any margin of error.

James Bond Hamiltonian Path

James Bond — Graph Theory

If you have ever wondered whether you could watch every James Bond movie without watching the same actor play James Bond twice in a row, or how many different ways there are to do it, you've unwittingly ventured into graph theory. Graph theory is basically the study of connected things. These can be bridges, social networks, or, in this post's case, James Bond films.

James Bond Example Graph

This is an example of what a graph is. There are nodes/vertices [the James Bond films] and edges [the connections between films that don't have the same actor playing Bond]. This can be abstracted to many situations, especially traveling-salesman-type problems. The graph above has six Bond films, each one connected to the others that don't have the same actor portraying Bond. GoldenEye and Die Another Day are connected to the four non-Pierce Brosnan films, but not to each other.

When this is extended to all 23 films the graph becomes much busier.

All Bond Films Graph

The best way I found to display this was a circular graph with all the neighboring nodes connected. To reiterate, this graph is drawn so that only Bond films with different actors playing Bond are connected. Right away, you can see that there is a way to watch all 23 films under the prescribed condition, since you can trace a circle through all 23 films on the graph.

The path created by watching every film exactly once has a name: it's called a Hamiltonian path. And there are many, many different ways to achieve this in this graph. Unfortunately, there isn't a succinct algorithm to find all of them short of programming an exhaustive search. Since I didn't know the best way to approach this, I created a stochastic approach [Python code] to finding some of the possible Hamiltonian paths. A stochastic process injects randomness into an algorithm: the first Bond film in the path was selected at random, and so were the subsequent films (nodes) that weren't already visited.
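
A sketch of that stochastic search on a small subset of films [the full script linked above builds the complete 23-film graph; a few film-actor pairs stand in for the whole set here]:

import random

#a small stand-in for the full film/actor data
actors = {
    'Dr. No': 'Connery', 'Goldfinger': 'Connery',
    'Moonraker': 'Moore', 'A View to a Kill': 'Moore',
    'GoldenEye': 'Brosnan', 'Die Another Day': 'Brosnan',
    'Casino Royale': 'Craig', 'Skyfall': 'Craig',
}

#films are connected when a different actor plays Bond
graph = {f: [g for g in actors if g != f and actors[g] != actors[f]] for f in actors}

def random_path_attempt(graph):
    #start at a random film, then randomly walk to unvisited neighbors until stuck
    path = [random.choice(list(graph))]
    while True:
        choices = [f for f in graph[path[-1]] if f not in path]
        if not choices:
            return path
        path.append(random.choice(choices))

path = random_path_attempt(graph)
print(len(path) == len(graph))  #True when every film was visited exactly once
print(path)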

James Bond Hamiltonian Path

This is just one of the possible Hamiltonian paths to fulfill the requirements of this post. The path goes [Tomorrow Never Dies, Live and Let Die, From Russia with Love, The Spy Who Loved Me, Goldfinger, Die Another Day, Moonraker, Dr. No, The Man with the Golden Gun, Casino Royale, GoldenEye, Skyfall, You Only Live Twice, License to Kill, Quantum of Solace, The Living Daylights, On Her Majesty’s Secret Service, For Your Eyes Only, The World Is Not Enough, Octopussy, Diamonds Are Forever, A View to a Kill, Thunderball]

Unfortunately, the only way to find the total number of paths for this problem is an exhaustive search, so I'm going to table that as a problem for later. I looped the stochastic Hamiltonian path program nearly 1 million times and found 757,733 different Hamiltonian path permutations. Practically speaking, if I did figure out how many unique paths there are, it would be another really high number.

Frequency of Bond Hamiltonian Paths in 999,999-N Run

What I do find interesting is that, until the algorithm is run enough times to start repeating paths, it finds a complete path [23 films visited — the total number of Bond films] about 75% of the time. This means you have a pretty good chance of pulling this off in real life if you just pick randomly. I'd actually say you'd have a better than 75% chance, because you can look ahead and use some reason so you don't leave two films from the same era for last. For example, if you see you have three films left [two Sean Connery and one Roger Moore], you wouldn't want to watch the Roger Moore film first; you'd logically choose a Sean Connery film. The best strategy, I think, would be to hold off on the George Lazenby film until the end in case you need to be bailed out. Conversely, you could do worse than random if you were biased in your choice of films. For example, if you prefer more recent films and alternated watching Pierce Brosnan and Daniel Craig films first, you would have fewer choices sooner.

CNN Sentiment Score

Visualization of CNN’s 2014 Midterm Election Coverage

Adding to the basic text analytics I wrote about last week, I ran a bag-of-words sentiment analysis on CNN's midterm election coverage, using transcripts found on their site. Fortunately, all the transcripts have a time stamp denoting what hour of programming they cover, so I was able to attach a time of day to each transcript and produce a visualization of CNN's election coverage.

Category Term Table

To capture what CNN was talking about, I wrote a Python script that found specific key words based on a frequency distribution produced from the entire transcript corpus, the idea being that if CNN didn't talk about a topic, it wasn't worth investigating. To consolidate some of the terms into concepts or topics, I created categories to group similar words together; sometimes these were closely related terms or simply plural forms of the same word.
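
A minimal sketch of that grouping step [the categories and terms below are placeholders, not the actual table above]:

#placeholder category -> related terms mapping
categories = {
    'republicans': ['republican', 'republicans', 'gop'],
    'senate': ['senate', 'senator', 'senators'],
}

def count_categories(words, categories):
    #count how many transcript words fall into each category of related terms
    return {cat: sum(w in terms for w in words) for cat, terms in categories.items()}

transcript = 'the gop took the senate as republicans won key races'.split()
print(count_categories(transcript, categories))  #{'republicans': 2, 'senate': 1}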

CNN Term Count Summary

Republicans and the President were the biggest talking points of the night being mentioned more times than the Senate or Democrats. The number of times that a topic is mentioned doesn’t provide any clue to the context or demeanor of how CNN presented this topic, so a bag-of-words approach was used to score the sentiment of words surrounding these terms within the transcript. This process won’t give an exact interpretation for every instance, but it can get close. With enough term occurrences, the overall sentiment should rise above the error noise.
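
A rough sketch of that windowed scoring [the word lists and window size are stand-ins for what the actual script uses]:

positive_words = {'win', 'strong', 'success'}   #stand-in lexicons
negative_words = {'loss', 'fail', 'gridlock'}

def term_sentiment(words, term, window=5):
    #sum positive minus negative words within a window around each mention of the term
    score = 0
    for i, w in enumerate(words):
        if w == term:
            context = words[max(0, i - window):i + window + 1]
            score += sum(c in positive_words for c in context)
            score -= sum(c in negative_words for c in context)
    return score

transcript = 'a strong night for republicans across the country'.split()
print(term_sentiment(transcript, 'republicans'))  #1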

CNN Sentiment Score

The first thing to notice is that there was no bad news about the Republicans. The sentiment analysis never found an hour of CNN's broadcast that had more negative mentions of Republicans than positive ones. In contrast, the Democrats were not doing anywhere near as well on the newscast from morning to about early evening. Mentions of President Obama were rather volatile, with strong negative and positive swings throughout the day. Mitch McConnell got a rather big bump right when CNN projected his Senate race in his favor [at about 7PM EST]. The topic of Washington, predominantly referring to the federal government, was the only topic with a negative overall score for the entire day.

CNN Sentiment Direct Comparison

The graph above offers a direct comparison of the sentiment scores for the political categories for every hour of broadcast during the actual election returns after 6PM EST. [It also aggregates mentions across the different programs CNN runs on different channels, so there might be a little disagreement with the numbers if you are comparing charts.]

Overall, the sentiment analysis produces an interesting visual picture of how CNN handled the election. If other news networks had transcripts of entire shows readily available, I'd be able to compare the outlets, looking for evidence of bias or slant. If this were applied over a longer time frame, it could provide an interesting look into how a news story evolves, shedding an objective light on how the news cycle works.