All posts by Sean Dolinar


The Most Popular Emoji Characters on Twitter

On Twitter, about 10% of general-topic tweets contain emoji, the tiny icons and emoticons that are starting to get more attention when analyzing tweets, Facebook messages, or text messages. An emoji can capture an emotion or completely change the meaning of the written text. Before exploring how different emoji are used and what they mean to people, I wanted to get an idea of how prevalent they are and which ones are the most popular on Twitter.

[Example tweets: one where the emoji captures an emotion, one where the emoji changes the meaning of the text.]

How I Did This

I collected tweets using a sampled stream from Twitter. In order to get a generally representative sample of tweets, I tracked five popular, basic words: ‘the’, ‘and’, ‘to’, ‘you’, and ‘it’. These are good search words, since there aren’t many sentences or thoughts that don’t use at least one of them. A Python script was used to find and count all the emoji present in a collection of over 100,000 tweets. To avoid skewing from a popular celebrity or viral tweet, I removed obvious retweets, but not quote-style retweets, which function more like mentions.
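The post doesn’t name a client library, so here is a minimal sketch of the collection step assuming tweepy 3.x and its older StreamListener interface; the credentials are placeholders, and this version just appends the raw JSON to a file, whereas the actual pipeline stored tweets in MongoDB (covered in the Technical section below).

import tweepy

class TweetSaver(tweepy.StreamListener):
    """Writes each raw tweet JSON object to a file, one per line."""
    def __init__(self, out_path='tweets.json'):
        super(TweetSaver, self).__init__()
        self.out = open(out_path, 'a')

    def on_data(self, data):
        self.out.write(data)   # raw JSON string from the streaming API
        return True            # keep the stream open

    def on_error(self, status_code):
        return False           # stop on errors such as rate limiting

# placeholder credentials
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

stream = tweepy.Stream(auth, TweetSaver())
stream.filter(track=['the', 'and', 'to', 'you', 'it'])  # common words approximate a general sample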

Results

Emoji Use on Twitter

In the general collection of tweets, I found that 10.23% of tweets contained at least one emoji. So there isn’t an overwhelming number of tweets containing an emoji, but 10% of Twitter content is a significant portion. The ‘Emoji Selection’ graph shows the percentage of tweets containing a particular emoji out of the tweets that had an emoji in them. The most popular emoji by far was the ‘tears of joy’ emoji, followed by the ‘loudly crying’ emoji. Heart-related emoji [the ones I thought would prove most popular] were third and fourth.

Emoji Selection on Twitter

Since I only collected these tweets over the course of a day rather than over several weeks or months, I would be hesitant to assume these results would hold up over time. An event or seasonality can trigger a cascade of people using a certain emoji. For example, the Christmas tree emoji was popular, appearing in 2.16% of tweets that included emoji; that share would be expected to grow as we get closer to Christmas and shrink afterward. Another interesting finding is how high one particular emoji ranks. My pure conjecture is that its high use rate is due to the protests in Ferguson and around the country. To confirm this I would need a sample of tweets from before the grand jury announcement, or to track its use as time passes.

Further analysis could use emoji groups or clusters. Emoji with similar meanings won’t necessarily rank high individually if people spread their selections over five similar emoji instead of one. I plan to update and expand on this as time passes and I’m able to collect more data.

Technical

In order to avoid any conflicts with the ASCII conversions that some Python or R packages perform on Twitter data, I stored tweets from the Twitter Streaming API directly into a MongoDB database, which encodes strings in UTF-8. Since tweets come from the API as JSON objects, they can be stored naturally in the document-oriented database, with each metadata field in the tweet accessible without parsing the entire tweet into a data frame or SQL database. Retweets were removed by finding any tweets with ‘RT’ in the first two characters of the text entry, which is how Twitter represents automatic retweets in the JSON format.
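A rough sketch of that filtering and counting step, assuming pymongo, an illustrative database/collection named twitter/tweets, and that the emoji characters sit in the first column of the emoji_table.txt key file used in the next post:

import pandas as pd
from pymongo import MongoClient

# illustrative names: database 'twitter', collection 'tweets'
tweets = MongoClient()['twitter']['tweets']

# assumes the emoji characters are the index column of the key file
emoji_chars = set(pd.read_csv('emoji_table.txt', encoding='utf-8', index_col=0).index)

total = 0
with_emoji = 0
for tweet in tweets.find({}, {'text': 1}):
    text = tweet.get('text', u'')
    if text[:2] == 'RT':          # automatic retweets start with 'RT'
        continue
    total += 1
    if any(ch in emoji_chars for ch in text):
        with_emoji += 1

print('%.2f%% of tweets contain at least one emoji' % (100.0 * with_emoji / total))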

Since I collected 103,416 tweets, the margin of error for any of the proportions given is well below 1%. Events within the social network would definitely outweigh any margin of error.
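As a rough check, the standard 95% margin of error for a sample proportion works out to

$latex \mathrm{MOE} = 1.96\sqrt{\frac{p(1-p)}{n}} = 1.96\sqrt{\frac{0.1023 \times 0.8977}{103416}} \approx 0.18\%&s=1$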

Emoji, UTF-8, and Python

I have updated [better] code that allows for easy counting of emoji in string objects in Python; it can be found on my GitHub. There are two counting classes in a mini-package loaded there.

Emoji, those ubiquitous emoticons that popped up when iPhone users found them in 2011 with iOS 5, are a different set of characters from the traditional alphanumeric and punctuation characters. They are essentially another alphabet, and this concept will be useful when working with emoji in Python. Emoji are NOT a font like Wingdings from Windows 95; they are unique characters with no corresponding letter or symbol representation. If you have a document or webpage in the Wingdings font, you can simply change the font to a typical Latin font to see the normal characters the Wingdings font represents.

Technical Background

Without getting into the deeper technical encoding problems, emoji are defined in Unicode and encoded in UTF-8, which can represent just about a million characters. A lot of applications and software packages default to ASCII, which encodes only 128 characters. Some Python IDEs, csv-writing packages, and parsing software default to or translate to ASCII, so they don’t necessarily handle the emoji characters properly.

I wrote a Python script [or this Python ‘package’] that takes tweets stored in a MongoDB database (more on that later) and counts the number of different emoji in the tweet corpus. To make sure Python plays nice with the emoji, I first loaded the data with UTF-8 encoding specified; otherwise you’ll get this encoding error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)

I loaded an emoji key I made from all the emoji in Apple’s implementation into a pandas data frame with this code:

emoji_key = pd.read_csv('emoji_table.txt', encoding='utf-8', index_col=0)

If Python loads your data correctly with UTF-8 encoding, each emoji will be treated as a separate, unique character, so string functions and regular expressions can be used to find emoji in other strings such as tweet text. In some IDEs emoji don’t display [Canopy] or don’t display well [PyCharm]. I remedied the invisible/messy emoji by running the script in Mac OS X’s Terminal application, which displays emoji properly. Python can also produce an ASCII-compliant string by using unicode-escape encoding:

unicode_object.encode('unicode_escape')

The escape encoded string will display something like this:

\U0001f604

All IDEs will display the ASCII string. You would need to decode it from the unicode escape to get it back into a unicode object. Ultimately I had a Pandas data frame containing unicode objects. To make sure the correct encoding was used on the output text file, I used the following code:

 with open('emoji_out.csv', 'w') as f: 
    emoji_count.to_csv(f, sep=',', index = False, encoding='utf-8')  
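To illustrate the point above: once text is decoded as UTF-8, an emoji behaves like any other character, so membership tests, counts, and regular expressions work directly. The strings and the two-emoji list below are illustrative, using the code point from the escape example above.

import re

emoji = u'\U0001f604'                      # the code point from the escape example above
tweet = u'so happy right now ' + emoji + emoji

print(emoji in tweet)                      # True  -- plain membership test
print(tweet.count(emoji))                  # 2     -- counts occurrences like any character

# a regex alternation built from a (short, illustrative) list of emoji characters
pattern = re.compile(u'|'.join([u'\U0001f604', u'\U0001f602']))
print(pattern.findall(tweet))              # finds both occurrences of the first emoji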

Emoji Counter Class

I made an emoji counter class in Python to simplify the process of counting and aggregating emoji counts. The code [socialmediaparse] is on my GitHub along with the necessary emoji data file, so it can load the key when the instance is created. Using the package, you can repeatedly call the add_emoji_count() method to update the internal count for each emoji. The results can be retrieved using the .dict, .dict_total, and .baskets attributes of the instance. I wrote this because it organizes and simplifies the analysis for any social media or emoji application. Separate emoji dictionary counter objects can be created for different sets of tweets that someone wants to analyze.

import socialmediaparse as smp #loads the package

counter = smp.EmojiDict() #initializes the EmojiDict class

#goes through list of unicode objects calling the add_emoji_count method for each string
#the method keeps track of the emoji count in the attributes of the instance
for unicode_string in collection:
   counter.add_emoji_count(unicode_string)  

#output of the instance
print counter.dict_total #dict of the absolute total count of the emojis in corpus
print counter.dict       #dict of the count of strings with the emoji in corpus
print counter.baskets    #list of lists, emoji in each string.  one list for each string.

counter.create_csv(file='emoji_out.csv')  #method for creating csv

Project

MongoDB was used for this project because it stores the JSON documents very well, with no need for a parser or a csv writer. It also has the advantage of natively storing strings in UTF-8. If I had used R’s streamR csv parser, there would have been many encoding errors and virtually no emoji present in the data. There might be workarounds, but MongoDB was the easiest way I’ve found to work with Twitter’s JSON, UTF-8 encoded data.


James Bond — Graph Theory

If you have ever wondered whether you could watch every James Bond movie without watching the same actor play James Bond twice in a row, or how many different ways there are to do it, you’ve unsuspectingly ventured into graph theory. Graph theory is basically the study of connected things. These can be bridges, social networks, or, in this post’s case, James Bond films.

James Bond Example Graph

This is an example of what a graph is. There are nodes/vertices [the James Bond films] and edges [the connections between films that don’t have the same actor playing Bond]. This can be abstracted to many situations, especially the traveling salesman problem. The graph above has six Bond films, each one connected to the others that don’t have the same actor portraying Bond. GoldenEye and Die Another Day are connected to the four non-Pierce Brosnan films, but not to each other.

When this is extended to all 23 films the graph becomes much busier.

All Bond Films Graph

The best way I found to display this was a circular graph with all the neighboring nodes connected. To reiterate, this graph is drawn so that only Bond films with different actors playing Bond are connected. Right away, you can clearly see that there is a way to watch all 23 films under the prescribed condition, since you can trace a circle through all 23 films on the graph.

The path created by watching every film without repeating an actor back-to-back has a name; it’s called a Hamiltonian path. And there are many, many different ways to achieve this in this graph. Unfortunately, there isn’t a succinct algorithm to find all of them short of programming an exhaustive search. Since I didn’t know the best way to approach this, I created a stochastic approach [Python code] to finding some of the possible Hamiltonian paths. A stochastic process injects randomness into an algorithm: the first Bond film in the path was selected at random, and so were subsequent films (nodes) that weren’t already visited.
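The post links the actual Python code; the sketch below is a simplified stand-in that shows the same random-search idea on a small, arbitrary subset of films (the film-to-actor pairs are real, the subset is illustrative).

import random

# illustrative subset; the real script uses all 23 films
films = {
    'Dr. No': 'Connery', 'Goldfinger': 'Connery',
    'Live and Let Die': 'Moore', 'The Spy Who Loved Me': 'Moore',
    'GoldenEye': 'Brosnan', 'Die Another Day': 'Brosnan',
    'Casino Royale': 'Craig', 'Skyfall': 'Craig',
}

def random_path(films):
    """Randomly build a viewing order where consecutive films never share an actor.
    Returns the path if every film gets used (a Hamiltonian path), else None."""
    remaining = list(films)
    random.shuffle(remaining)
    path = [remaining.pop()]                 # random starting film
    while remaining:
        choices = [f for f in remaining if films[f] != films[path[-1]]]
        if not choices:                      # dead end: only same-actor films left
            return None
        nxt = random.choice(choices)
        path.append(nxt)
        remaining.remove(nxt)
    return path

# run many trials and collect the distinct complete paths found
found = set()
for _ in range(10000):
    p = random_path(films)
    if p:
        found.add(tuple(p))

print(len(found), 'distinct Hamiltonian paths found by random search')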

James Bond Hamiltonian Path

This is just one of the possible Hamiltonian paths that fulfill the requirements of this post. The path goes [Tomorrow Never Dies, Live and Let Die, From Russia with Love, The Spy Who Loved Me, Goldfinger, Die Another Day, Moonraker, Dr. No, The Man with the Golden Gun, Casino Royale, GoldenEye, Skyfall, You Only Live Twice, Licence to Kill, Quantum of Solace, The Living Daylights, On Her Majesty’s Secret Service, For Your Eyes Only, The World Is Not Enough, Octopussy, Diamonds Are Forever, A View to a Kill, Thunderball].

Unfortunately, the only way to find the total number of paths for this problem is an exhaustive search; I’m going to table that as a problem for later. I looped the stochastic Hamiltonian path program nearly 1 million times and found 757,733 different Hamiltonian path permutations. Practically speaking, whatever the number of unique paths turns out to be, it will be another very large number.

Frequency of Bond Hamiltonian Paths in 999,999-N Run

What I do find interesting is that, until the algorithm is run enough times to start repeating paths, it finds a complete path [one that covers all 23 Bond films] about 75% of the time. That means you have a pretty good chance of watching the movies this way in real life even if you just pick randomly. I’d actually say you’d have a better than 75% chance, because you can look ahead and use some reasoning so you don’t leave two films from the same era for last. For example, if you saw you had three films left [two Sean Connery and one Roger Moore], you wouldn’t want to watch the Roger Moore film first; you’d logically choose a Sean Connery film. The best strategy, I think, would be to hold off on the George Lazenby film until the end in case you need to be bailed out. Conversely, you could do worse than random if you were biased in your choice of films. For example, if you prefer more recent films and alternated between Pierce Brosnan and Daniel Craig films first, you’d have fewer choices sooner.


Visualization of CNN’s 2014 Midterm Election Coverage

Adding to the basic text analytics I wrote about last week, I ran a bag-of-words sentiment analysis on CNN’s midterm election coverage using transcripts found on their site. Fortunately, all the transcripts have a time stamp denoting which hour of programming the transcript covers, so I was able to attach a time of day to each transcript and produce a visualization of CNN’s election coverage.

Category Term Table

To capture what CNN was talking about, I wrote a Python script that found specific key words based on a frequency distribution produced from the entire transcript corpus. The idea was that if CNN didn’t talk about a topic, it wasn’t worth investigating. To consolidate some of the terms into concepts or topics, I created categories that group similar words together; sometimes these were synonyms or simply plural forms of the same word.
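A minimal sketch of that consolidation step; the category-to-term mapping below is illustrative, not the actual table built from the transcript corpus.

from collections import Counter

# hypothetical category groupings for similar words
categories = {
    'Republicans': ['republican', 'republicans', 'gop'],
    'Democrats':   ['democrat', 'democrats', 'democratic'],
    'President':   ['obama', 'president'],
    'Senate':      ['senate', 'senator', 'senators'],
}

def category_counts(text):
    """Count how many times each category's terms appear in a transcript."""
    words = Counter(text.lower().split())
    return {cat: sum(words[t] for t in terms) for cat, terms in categories.items()}

print(category_counts('The Republicans took the Senate while President Obama watched'))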

CNN Term Count Summary

Republicans and the President were the biggest talking points of the night, being mentioned more times than the Senate or the Democrats. The number of times a topic is mentioned doesn’t provide any clue to the context or tone of how CNN presented it, so a bag-of-words approach was used to score the sentiment of the words surrounding these terms within the transcript. This process won’t give an exact interpretation for every instance, but it can get close; with enough term occurrences, the overall sentiment should rise above the error noise.
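Here is one way such a windowed bag-of-words score could work; the word lists and the five-word window are assumptions for illustration, not the exact lexicon or window used for the charts.

positive = {'win', 'wins', 'strong', 'good', 'victory', 'success'}
negative = {'loss', 'lose', 'bad', 'weak', 'failure', 'scandal'}

def sentiment_score(words, term, window=5):
    """Sum +1/-1 for positive/negative words within `window` words of each
    occurrence of `term` in a tokenized transcript."""
    score = 0
    for i, w in enumerate(words):
        if w == term:
            context = words[max(0, i - window): i + window + 1]
            score += sum(1 for c in context if c in positive)
            score -= sum(1 for c in context if c in negative)
    return score

tokens = 'the republicans had a strong night while democrats suffered a bad loss'.split()
print(sentiment_score(tokens, 'republicans'))   #  1
print(sentiment_score(tokens, 'democrats'))     # -1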

CNN Sentiment Score

The first thing to notice is that there was no bad news about the Republicans: the sentiment analysis never found an hour of CNN’s broadcast that had more negative mentions of Republicans than positive. In contrast, Democrats did not fare anywhere near as well on the newscast from morning until about early evening. Mentions of President Obama were rather volatile, with strong negative and positive swings throughout the day. Mitch McConnell got a rather big bump right when CNN projected his Senate race in his favor [at about 7PM EST]. The topic of Washington, predominantly referring to the federal government, was the only topic with a negative overall score for the entire day.

CNN Sentiment Direct Comparison

The graph above offers a direct comparison of the sentiment scores for the political categories for every hour of broadcast during the actual election returns after 6PM EST. [It also aggregates mentions across the different programs CNN runs on different channels, so there might be a little disagreement with the numbers if you are comparing charts.]

Overall, the sentiment analysis produces an interesting visual picture of how CNN handled the election. If other news networks had transcripts of entire shows readily available, I’d be able to compare outlets looking for evidence of bias or slant. Applied over a longer time frame, it could offer an interesting look into how a news story evolves, shedding an objective light on how the news cycle works.


Basic Text Analytics for News Bias

Bias is a problem every news media outlet has in some form, beyond the well-debated political slants that Fox News and MSNBC are renowned for. I’ve been attempting to quantify biases using text analytics. By looking at the frequency and topics of articles, word choices, and associated words, I believe you can find analytical evidence to better understand how different news outlets communicate their news.

My first attempt at this takes a simple approach: measure and compare the frequency of specific key terms. I used the current topics of Ebola and the midterm election, which should demonstrate some polarization. For context, the data was collected toward the tail end of the quarantine-issue news cycle, when there were political debates on how to handle health-care workers returning to the United States. Oversimplifying, conservatives favor hardline precautions like quarantine, while liberals generally favor the present policy of self-monitoring. The election articles reflect news coverage from the weekend before a midterm election in which Republicans were favored in the polls to take control of the Senate.

All the articles were gathered by scraping Google search results for ‘ebola+[news outlet]’ or ‘election+[news outlet]’ with a Python script, so the data reflects recent news articles relative to November 1, 2014. The text was analyzed by counting specific terms in each article along with the article’s total word count. For Python-oriented readers, I used the TextBlob package for the n-gram/count methods.
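A sketch of the per-article measurement using TextBlob; the file name is a placeholder for one scraped article, and the term is just one of those counted.

import io
from textblob import TextBlob

# placeholder: one scraped article saved as UTF-8 text
article_text = io.open('article.txt', encoding='utf-8').read()
blob = TextBlob(article_text)

word_count = len(blob.words)                    # total words in the article
term_count = blob.word_counts['quarantine']     # lowercased term frequency

print(float(term_count) / word_count)           # term proportion for this article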

To get an idea of what the collection of news articles looks like: there are about 100 articles per news outlet and topic, which is what Google returns on the first page of results. All duplicate articles and non-outlet domains [both restrictions applied to URLs] are removed, so the number might be fewer than 100. I’m also scraping Google’s news search site meant for normal web use, so related article links are attached to some of the results, possibly pushing the total over 100.

Article Count By Topic

Word Count Per Article

Generally, longer articles can provide more detailed information or more complex arguments, and article length is also taken into consideration when calculating the term counts for each news outlet. The New York Times has by far the longest articles, while NBC News has the shortest.

Ebola Term Count

I assembled a count of certain terms associated with Ebola and averaged those across all the articles. Not surprisingly, out of the terms I chose, ‘quarantine’ appeared the most, with the most frequent mentions coming from Fox News. An associated term, ‘Hickox’, the name of the nurse who was quarantined in NJ and ME, was also used often, but mostly by NBC News. Even though Fox News mentioned quarantining more often, it did not mention the name of the nurse nearly as often; conversely, NBC News mentioned ‘Hickox’ more often than they did quarantine. Since this is just basic text analysis, I’m hesitant to draw too many conclusions about what the coverage bias means for the news outlets’ slants.

Election Term Count

Similar to the Ebola term count, I gathered the same information for articles about the midterm elections. There wasn’t as much disparity in how frequently the articles used the terms as there was for Ebola. The most notable pattern was that NPR had strikingly few explicit mentions of political parties or philosophies, possibly indicating a strategy of avoiding politicized articles. Fox News and NBC News differed the most in their use of the word ‘liberal’, which is slightly pejorative in conservative circles. This could act as confirming evidence of the outlets’ well-known slants, but I would insist on further investigation and better evidence.

For those curious about the calculation of the term metrics: it’s the term count divided by the article word count, averaged over all the articles for the outlet and subject, so the measurements on the graphs are essentially average term proportions per article.
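Written out, with $latex c_i$ and $latex w_i$ the term count and word count of article i, and N the number of articles for an outlet and topic:

$latex \mathrm{term\ score} = \frac{1}{N}\sum_{i=1}^{N}\frac{c_i}{w_i}&s=1$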

This is just a basic, analytical look at news articles for coverage bias, which is associated with what a news outlet decides to cover or include in articles. More articles, TV transcripts, and social media headlines and comments could provide a richer data set for analysis. And hopefully, I can find emotionally charged words and evaluate opinions. All work for the future.

Using a Genetic Algorithm to Minimize an OLS Regression in R

A genetic algorithm allows you to optimize parameters by mimicking biological evolution. It runs through several generations of candidate values trying to find the ones that minimize [or maximize, depending on the problem] a fitness or evaluation function, which is just any function that returns a value computed from the parameters the algorithm is optimizing.

There is a lot of literature on how genetic algorithms work, and I would recommend reading it if you want the technical details. Genetic algorithms are typically demonstrated with the knapsack problem [Numb3rs Scene YouTube], where you maximize survival points by seeking the right combination of survival items, weighing under a specified amount, to fit in a knapsack. This R-bloggers site has a good demonstration of that example and code. However, I find it more interesting to use a genetic algorithm on something more familiar to analytics and statistics, and that’s ordinary least squares regression (OLS).

OLS minimizes the sum of squared error (SSE) to find the best-fit line, or regression line, for a data set. The solution is derived using matrix calculus, and it’s computationally efficient, easy to understand, and ubiquitous.

Since OLS essentially is an algorithm that uses calculus to minimize SSE, we can use a genetic algorithm to accomplish the same task. R’s GA (genetic algorithm) package allows you to use either binary or real numbers as parameters for the fitness function. Traditionally, genetic algorithms use binary parameters [see the knapsack algorithm], but for this problem, real numbers will be much more useful since the regression coefficients will be real numbers.

The GA will create a vector of real numbers between -100 and 100, then use that vector to evaluate the regression equation in the fitness function, which returns the SSE. Since the GA seeks to maximize the fitness function, the function has a negative sign in front of it, so the smallest SSE corresponds to the maximum fitness. The GA has a population of 500 vectors which are evaluated with the fitness function; the best solutions are generally kept, children vectors are created from them, and the process is repeated for 500 generations. The result is an SSE that is very close to the OLS solution, with parameter estimates that match up as well.

I’ve included two different linear models. The first has only the two variables that play significant roles in the OLS regression; the second has every variable, not all of which are significant. You can run it a few times and see how the GA solutions differ. The first model’s GA estimates will be a lot closer to the OLS estimates than the second model’s.

All of this is rather academic for well-behaved linear regression problems, since GAs are computationally expensive, taking forever relative to the standard OLS procedure.

The full annotated R code follows:



#install.packages('GA')
library(GA)

#loads an airquality dataframe
data(airquality)

#removes missing data
airquality <- na.omit(airquality) 


#### create a function to evaluate a linear regression
#### takes intercept and the two best variables to compute the predicted y_hat
#### then computes and returns the SSE for each chromosome
#### we will try to minimize the SSE like OLS does

OLS <- function(data, b0, b1, b2){
  
  attach(data, warn.conflicts=F)
  
  Y_hat <- b0  + b1*Wind + b2*Temp
  
  SSE = t(Ozone-Y_hat) %*% (Ozone-Y_hat) #matrix formulation for SSE
  
  detach(data)
  
  return(SSE)
  
}


#### this sets up a real-value GA using 3 parameters all from -100 to 100
#### the parameters use real numbers (so floating decimals) and passes those to
#### the linear regression equation/function
#### the real-value GA requires a min and max
#### this takes a while to run

ga.OLS <- ga(type='real-valued', min=c(-100,-100, -100), 
             max=c(100, 100, 100), popSize=500, maxiter=500, names=c('intercept', 'Wind', 'Temp'),
             keepBest=T, fitness = function(b) -OLS(airquality, b[1],b[2], b[3]))

#### summary of the ga with solution
ga.model <- summary(ga.OLS)
ga.model


#### check the results against the typical OLS procedure
lm.model <- lm(formula= Ozone ~ Wind + Temp, data=airquality)
summary(lm.model)
lm.model$res %*% lm.model$res ### SSE.lm
-ga.model$fitness ### SSE.ga

lm.model$res %*% lm.model$res + ga.model$fitness  ### difference between OLS and GA's SSE



#### FULL MODEL ####

OLS.FULL <- function(data, b0, b1, b2, b3, b4, b5){
  
  attach(data, warn.conflicts=F)
  
  Y_hat <- b0 + b1*Solar.R + b2*Wind + b3*Temp + b4*Month + b5*Day  # linear regression equation
  
  SSE = t(Ozone-Y_hat) %*% (Ozone-Y_hat) #matrix formulation for SSE
  
  detach(data)
  
  return(SSE)
  
}


#### this sets up a real-value GA using 6 parameters all from -100 to 100
#### the parameters use real numbers (so floating decimals) and passes those to
#### the linear regression equation/function
#### the real-value GA requires a min and max
#### this takes a while to run
#### this will produce some values that vary a lot from OLS estimates since not all values are significant
#### some estimates should have high standard error
ga.OLS <- ga(type='real-valued', min=c(-100,-100, -100, -100, -100, -100), 
             max=c(100,100, 100, 100, 100, 100), popSize=500, maxiter=500, 
             keepBest=T, fitness = function(b) -OLS.FULL(airquality, b[1],b[2], b[3], b[4], b[5], b[6]))

#### summary of the ga with solution
summary(ga.OLS)

#### check the results against the full-model OLS procedure
summary(lm(formula= Ozone ~ Solar.R + Wind + Temp + Month + Day, data=airquality))








OLS Derivation

Ordinary Least Squares (OLS) is a great, low-computing-power way to obtain estimates for the coefficients in a linear regression model. I wanted to detail the derivation of the solution, since it can be confusing for anyone not familiar with matrix calculus.

First, the initial matrix equation is set up below, with X being a matrix of the data’s p covariates plus the regression constant [the constant is represented as a column of ones in the X matrix], Y the column matrix of the target variable, β the column matrix of unknown coefficients, and e the column matrix of the residuals.

$latex \mathbf{Y = X} \boldsymbol{\beta} + \boldsymbol{e} &s=1$

Before manipulating the equation, it is important to note that you are not solving for X or Y, but for β, and you will do this by minimizing the sum of squared residuals (SSE). The equation can be rewritten by moving the error term to the left side.

$latex \boldsymbol{e} = \mathbf{Y - X} \boldsymbol{\beta}&s=1$

The SSE can be written as the product of the transposed residual column vector and its original column vector. [This is actually how you would obtain the sum of squares for any vector.]

$latex \mathrm{SSE} = \boldsymbol{e}'\boldsymbol{e} &s=1$

Since you transpose and multiply one side of the equation, you have to follow suit on the other side, yielding

$latex \boldsymbol{e'e} = (\mathbf{Y - X} \boldsymbol{\beta})'(\mathbf{Y - X} \boldsymbol{\beta})&s=1$

The transpose operator can be distributed throughout the quantity on the right side, so the right side can be multiplied out.

$latex \boldsymbol{e'e} = (\mathbf{Y'} - \boldsymbol{\beta}'\mathbf{X'})(\mathbf{Y - X} \boldsymbol{\beta})&s=1$

You can multiply out the right side and simplify it; since Y'Xβ is a 1×1 scalar, it equals its own transpose β'X'Y, so the two middle terms combine.

$latex \boldsymbol{e'e} = \mathbf{Y'Y} - \mathbf{Y'X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X'Y} + \boldsymbol{\beta}'\mathbf{X'X}\boldsymbol{\beta}&s=1$
$latex \boldsymbol{e'e} = \mathbf{Y'Y} - \boldsymbol{\beta}'\mathbf{X'Y} - \boldsymbol{\beta}'\mathbf{X'Y} + \boldsymbol{\beta}'\mathbf{X'X}\boldsymbol{\beta}&s=1$
$latex \boldsymbol{e'e} = \mathbf{Y'Y} - 2\boldsymbol{\beta}'\mathbf{X'Y} + \boldsymbol{\beta}'\mathbf{X'X}\boldsymbol{\beta}&s=1$

To minimize the SSE, you take the partial derivative with respect to β. Any terms without a β in them go to zero. Using the transpose rule from before, you can see how the middle term yields -2X'Y using differentiation rules from Calc 1. The last term is a bit trickier, but it differentiates to +2X'Xβ.

$latex \frac{\delta\boldsymbol{e'e}}{\delta\boldsymbol{\beta}} = \frac{\delta\mathbf{Y'Y}}{\delta\boldsymbol{\beta}} - \frac{\delta\, 2\boldsymbol{\beta}'\mathbf{X'Y}}{\delta\boldsymbol{\beta}} + \frac{\delta\boldsymbol{\beta}'\mathbf{X'X}\boldsymbol{\beta}}{\delta\boldsymbol{\beta}}&s=1$

$latex \frac{\delta\boldsymbol{e'e}}{\delta\boldsymbol{\beta}} = -2\mathbf{X'Y} + 2\mathbf{X'X}\boldsymbol{\beta}&s=1$

To find the minimum (it will never be a maximum if you have all the requirements for OLS fulfilled), the derivative of the SSE is set to zero.

$latex 0 = -2\mathbf{X'Y} + 2\mathbf{X'X}\boldsymbol{\beta}&s=1$

$latex 0 = -\mathbf{X'Y} + \mathbf{X'X}\boldsymbol{\beta}&s=1$

Using some basic linear algebra and multiplying both sides by the inverse of (X’X)…

$latex (\mathbf{X'X})^{-1}\mathbf{X'X}\boldsymbol{\beta} = (\mathbf{X'X})^{-1}\mathbf{X'Y}&s=1$

…yields the solution for β

$latex \boldsymbol{\beta} = (\mathbf{X'X})^{-1}\mathbf{X'Y}&s=1$

References:

The Mathematical Derivation of Least Squares. Psychology 8815. Retrieved from: http://isites.harvard.edu/fs/docs/icb.topic515975.files/OLSDerivation.pdf

Chatterjee, S & Hadi, A. (2012). Regression analysis by example. Hoboken, NJ: John Wiley & Sons, Inc.


2014 ALWCG Twitter Graphs

The Royals and A’s had quite the entertaining 12-inning game Tuesday night. These are a few graphs I made from Twitter data. Yellow is Oakland; blue is Kansas City. The proportions of tweets between teams might be off, but I would venture to guess the Royals had much more social media activity than the A’s. The map shows geotagged tweets from 5PM to 1AM EDT from yesterday. The middle of the country was solid blue, California was pretty yellow, and the East Coast was rather mixed.

AL Wild Card Game Twitter Map

The volume of tweets per minute is a pretty cool view of what happened during the game. It looks like the Royals really outpaced the A’s for volume, but I’d have to use some controls to determine that for sure. These are just for fun.

AL Wild Card Game Twitter Time Series

I used Twitter’s streaming API to collect tweets with keywords like “Royals”, “A’s”, “TakeTheCrown”, “GreenCollar”, etc. I could have missed a crucial element of the discussion, and none of this takes sentiment into account, just the frequency of mentions in tweets.


Getting Lucky in a Playoff Series

Sports have constant uncertainty and randomness in every aspect of the game, including determining champions. This is one area you wouldn’t expect to have a lot of variability, since you would want the team with the best roster that played the hardest to win the championship. This concept usually comes up in arguments against the one-game Wild Card round that MLB introduced in 2012: there’s too much that can happen in one game to determine the fate of a season. [The counter-argument is that division winners now have a reward for winning the division, besides having cool sweatshirts.]

The Sports Side

The basis for a championship series in MLB, the NBA, and the NHL is an odd-numbered series of games, with the champion being the team that wins the majority of those games. Most sports use a 7-game series; for example, the Boston Red Sox had to win 4 games to win the World Series last year. Using an example of randomness I got from Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives, I can illustrate how a team that’s clearly an underdog can win a playoff series against a superior opponent. Mlodinow has a recorded lecture where he explains what he wrote in his book. [It’s a good book; you should read it.]

Let’s use two teams: one is the Favorite, and it is assumed they will beat the Underdog 55% of the time [given enough games]. This also means the Underdog will win 45% of the time. These win probabilities are more uneven than you are likely to find in a playoff game, since playoff teams are typically much more evenly matched [at least in baseball]. The last assumption of this example is that the teams’ win probabilities don’t change with a different starting pitcher or home field/court/ice advantage. These are terrible assumptions if you wanted to project real playoff series, but the underlying principle of random sequencing will still hold.

In order to win the playoff series, a team has to win a certain number of games before the opponent wins that number. To model this distribution based on pure randomness, you can use the negative binomial distribution to determine the probability that the Favorite will win a 7-game playoff series in 4, 5, 6, or 7 games. If you wanted to design a playoff series to minimize the chances that the underdog will win, you’d want to choose a number of games which would have the smallest probability of the underdog winning the series.

Probability of Outcomes in 7 Game Series For Teams

This chart shows the probability of all 8 possible outcomes of a 7-game playoff series based on a 55/45 winning percentage split and pure randomness with no home-field advantage. As you can see, there’s a substantial chance [39% probability] that the Underdog wins a 7-game series, which is rather large for the longest series format in use. Baseball also employs a 5-game series for its division series (LDS) and a one-game playoff for the Wild Card round (WC); the chance of an upset grows as the number of games decreases. I’ve also added another pair of teams [a 60/40 split, a greater disparity] for comparison’s sake.

Comparison of Outcomes for Sports Playoff Series

It should be obvious that the 1-game series has the greatest chance of an upset, hence the objections to its use in baseball. My contention, though, would be that a 3-game series does not offer much more certainty that the best team will win.

The Math Side

I first calculated these probabilities by writing out all the possible combinations and adding up their probabilities. I have since realized there is a much easier way to determine them: the negative binomial distribution (NBD). If you want to familiarize yourself with what the distribution represents, please read the count data primer. In short, the NBD gives the probability that a team will lose a certain number of games [0-3] before the other team wins 4 games. The NBD is defined by the following function:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

where X is the random variable whose probability we are calculating, k is the number of Team A losses [this will vary], r is the number of Team A wins [for the 7-game series, it will be 4 games], and p is the probability of Team A winning. In this example we are determining the probability that Team A wins a 7-game series when Team A has a 55%/45% advantage over Team B.

$latex P(X=2) = {{4+2-1}\choose{2}} 0.55^{4} (1-0.55)^{2}&s=2$

$latex P(X=2) = 10 * 0.0915 * 0.2025 = 18.53\% &s=2$

This is the probability for just one possible outcome: Team A wins the series in 6 games. To determine the probability that Team A wins the series at all, you add the probabilities of Team A winning in 4, 5, 6, or 7 games, so the calculation is repeated for every loss possibility:

$latex P(WinningSeries) = P(X=0) + P(X=1) + P(X=2) + P(X=3)&s=0$

$latex P(WinningSeries) = 9.15\% + 16.47\% + 18.53\% + 16.68\% = 60.83\%&s=0$

From these calculations, there is a 60.83% chance that Team A wins the series just by randomness. Conversely, there is a 39.17% [100% - 60.83%] chance that Team B, the inferior team, wins because of random sequencing.
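The same arithmetic in a few lines of Python, as a quick check of the numbers above rather than the code used for the charts:

from math import factorial

def comb(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def series_win_prob(p, wins_needed=4):
    """P(favorite wins a best-of-(2*wins_needed - 1) series), via the negative binomial:
    sum over k = losses (0 .. wins_needed-1) suffered before the favorite's final win."""
    return sum(comb(wins_needed + k - 1, k) * p**wins_needed * (1 - p)**k
               for k in range(wins_needed))

print(round(series_win_prob(0.55), 4))      # ~0.6083 for a 7-game series
print(round(series_win_prob(0.55, 3), 4))   # 5-game series
print(round(series_win_prob(0.55, 1), 4))   # 1-game 'series' is just 0.55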

Conclusion

The MLB Wild Card game rightfully gets criticized for being too susceptible to a team having a bad day or getting a bad bounce. I wanted to illustrate that any playoff series has a lot of randomness in it. Beyond the numbers, people remember the bad bounces far more than they remember the positive or neutral events that occur [negativity bias]. A bad bounce or a pitcher having a bad day could just as easily benefit the team you are rooting for. The only real way to root out the randomness would be to play hundreds of games, and somehow I don’t think that is feasible.


Do MLB Playoff Odds Work?

One of the more fan-accessible advanced stats is playoff odds [technically postseason probabilities]. Playoff odds range from 0% to 100%, telling the fan the probability that a certain team will reach the MLB postseason. They are determined with a Monte Carlo simulation, which runs the baseball season thousands of times [FanGraphs runs theirs 10,000 times]. If a team reaches the postseason in 5,000 of those simulations, the team is predicted to have a 50% probability of making the postseason. FanGraphs and Baseball Prospectus run these every day, so playoff odds can be collected daily and, when graphed, show the story of a team’s season.
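FanGraphs’ actual model projects each game from team talent estimates; the toy sketch below only shows the Monte Carlo mechanic itself: simulate the rest of the season many times and report the share of simulations in which the team gets in. The per-game win probability and the fixed win threshold are stand-ins, not the real playoff structure.

import random

def playoff_odds(current_wins, remaining_games, game_win_prob,
                 wins_needed=90, n_sims=10000):
    """Toy Monte Carlo: fraction of simulated seasons in which the team
    reaches a fixed win total (a stand-in for the real playoff rules)."""
    made_it = 0
    for _ in range(n_sims):
        wins = current_wins
        for _ in range(remaining_games):
            if random.random() < game_win_prob:
                wins += 1
        if wins >= wins_needed:
            made_it += 1
    return made_it / float(n_sims)

# e.g. a team with 70 wins, 40 games left, and a 55% per-game win probability
print(playoff_odds(70, 40, 0.55))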

2014 Playoff Probability Season

Above is a composite graph of three different types of teams. The Dodgers were identified as a good team early in the season, and their playoff odds stayed high because of consistently good play. The Brewers started their season off strong but had two steep drop-offs, in early July and early September. Even though the Brewers had more wins than the Dodgers at points, the FanGraphs playoff odds never valued the Brewers more than the Dodgers. The Royals started slow and had a strong finish to secure their first postseason berth since 1985. All these seasons are different, and their stories are captured by the graph. Generally, this is how fans will remember their team’s season: by the storyline.

Since the playoff odds change every day and become either 100% or 0% by the end of the season, the projections need to be compared to the actual results at the end of the season. The interpretation of an 85% playoff probability is that, 85% of the time, teams with the given parameters will make the postseason.

I gathered the entire 2014 season of playoff odds from FanGraphs and put the predictions into buckets of 10% increments of playoff probability. For the 20% bucket, for example, the expectation is that 20% of the predictions in that bucket belong to teams that go on to the postseason; the same applies to every bucket: 0%, 10%, 20%, and so on.
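The bucketing itself is straightforward with pandas; the file name and column names here are assumptions about how the scraped FanGraphs data might be laid out, not the actual data set.

import pandas as pd

# one row per team per day: the day's playoff probability (0-1) and whether
# that team ultimately made the postseason (0/1); file is hypothetical
odds = pd.read_csv('fangraphs_playoff_odds_2014.csv')

bins = [i / 10.0 for i in range(11)]                    # 0%, 10%, ..., 100%
odds['bucket'] = pd.cut(odds['playoff_prob'], bins=bins, include_lowest=True)

# share of predictions in each bucket that actually made the postseason
calibration = odds.groupby('bucket')['made_postseason'].mean()
print(calibration)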

Fangraphs Playoff Evaluation

Above is a chart comparing the buckets to the actual results. Since this uses only one year of data and only 10 teams made the playoffs, the results don’t quite match up to the buckets. The desired pattern is encouraging, but I would insist on looking at multiple years before drawing any real conclusions. The results for any given year are subject to the ‘stories’ of the 30 teams that play that season. For example, the 2014 season did have a team like the 2011 Red Sox, who failed to make the postseason after having a > 95% playoff probability. This is colloquially considered an epic ‘collapse’, but a 95% probability not only implies there’s a chance the team might fail, it PREDICTS that 5% of such teams will fail. So there would be nothing wrong with the playoff odds model if ‘collapses’ like the Red Sox’s only happened once in a while.

The playoff probability model relies on an expected winning percentage. Unlike a binary variable like making the postseason, winning percentage is more continuous, which makes evaluating the model easier. For the most part, teams stay around their initial predicted winning percentage, coming really close to the prediction by the end of the season. Not every prediction is correct, but if there are enough good predictions the model is useful. Teams also aren’t static: bad teams can become worse by trading away players at the trade deadline, or improve by acquiring the good players who were traded. There are also factors like injuries or player improvement that the prediction system can’t account for, because they are unpredictable by definition. The following line graph lets you pick a team and check how they did relative to the predicted winning percentage. Some teams are spot on, but a few, like the Orioles or Red Sox, are really far off.

Pirates Expected Win Percentage

The residual distribution [the actual values minus the predicted values] should be a normal distribution centered around 0 wins. The following graph shows the residual distribution in numbers of wins; the teams in the middle had actual results close to the predicted values, while the values on the edges of the distribution are more extreme deviations. You would expect improved teams to balance out the teams that got worse. However, the graph is skewed toward the teams that became much worse, implying that some mechanism makes bad teams lose more often than expected. This is where attitude, trades, and changes in strategy come into play. I’d go so far as to say this is evidence that the soft skills of a team, like chemistry, break down.

Difference Between Wins and Predicted Wins

Since I don’t have access to more years of FanGraphs projections or other projection systems, I can’t do a full evaluation of the team projections. More years of playoff odds should yield probability buckets that reflect the expectation much better than a single year. This would allow for more than 10 different paths to the postseason to be present in the data. In the absence of this, I would say the playoff odds and predicted win expectancy are on the right track and a good predictor of how a team will perform.