# Statistics — Probability vs. Odds

Probability and odds are two basic statistical terms used to describe the likelihood that an event will occur. They are often used interchangeably in casual conversation or even in published material. However, they are not mathematically equivalent because they look at likelihood in different contexts. In everyday conversation, when numbers or values aren’t given, the two terms are synonymous. If an event has a high probability, then it has high odds of happening. The incorrect usage arises when a person ascribes a mathematical value to either the odds or the probability they are discussing. Hopefully, if you aren’t quite sure what the exact mathematical difference is, this will clear it up for you.

Probability is defined as the fraction of desired outcomes out of every possible outcome, with a value between 0 and 1, where 0 represents an impossible event and 1 represents an inevitable event. Probabilities are usually given as percentages. [i.e. 50% probability that a coin will land on HEADS.] Odds can take any value from zero to infinity, and they represent a ratio of desired outcomes versus the field. Odds are a ratio, and can be given in two different ways: ‘odds in favor’ and ‘odds against’. ‘Odds in favor’ describe the chances that an event will occur, while ‘odds against’ describe the chances that an event will not occur. If you are familiar with gambling, ‘odds against’ are what Vegas gives as odds. More on that later. For the coin flip, the odds in favor of a HEADS outcome are 1:1, not 50%.

Visual Math

Simple probability of event A occurring is mathematically defined as:

$latex P(A) = \frac{Number \ of \ Event \ A}{Total \ Number \ of \ Events}&s=2$

The best way to illustrate this is with the classic marbles-in-a-bag example. The graphic below depicts all the marbles in an opaque bag that one marble will be pulled out of. There are 6 blue, 3 red, 2 yellow, and 1 green for a total of 12 marbles in the bag. The probability of pulling a red marble is calculated by taking the total number of red marbles and dividing it by the total number of marbles:

$latex P(RED) = \frac{3 \ RED \ marbles}{12 \ TOTAL \ marbles} = 25\%&s=2$.

Notice that the probability calculation includes the red marbles in the denominator, because probability considers the context of the entire event space. Odds, on the other hand, are the ratio of favorable outcomes to unfavorable outcomes. The denominator contains ONLY the marbles that aren’t the favorable outcome. Odds use the context of good outcomes versus bad outcomes. Written as fractions, these two values are completely different. Probability is 1/4 while odds in favor are 1/3. You can see how mistakenly interchanging the terms could give the wrong information. The ‘odds in favor’ of RED would be calculated as:

$latex Odds\_Favor(RED) = \frac{3 \ RED \ marbles}{9 \ NOT \ RED \ marbles} = 1:3&s=2$.

To find ‘odds against’ you simply flip the odds in favor upside down; this describes the odds of the event not occurring:

$latex Odds\_Against(RED) = \frac{9 \ NOT \ RED \ marbles}{3 \ RED \ marbles} = 3:1&s=2$.
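The marble arithmetic above can be sketched in a few lines of Python; the function names here are my own, not from the post, and `fractions.Fraction` keeps the ratios exact:

```python
from fractions import Fraction

def probability(favorable, total):
    """P(A) = favorable outcomes / all possible outcomes."""
    return Fraction(favorable, total)

def odds_in_favor(favorable, total):
    """Odds in favor = favorable outcomes : unfavorable outcomes."""
    return Fraction(favorable, total - favorable)

marbles = {"blue": 6, "red": 3, "yellow": 2, "green": 1}
total = sum(marbles.values())                      # 12 marbles in the bag

p_red = probability(marbles["red"], total)         # 3/12 = 1/4
odds_red = odds_in_favor(marbles["red"], total)    # 3/9  = 1/3
odds_against_red = 1 / odds_red                    # flip upside down: 3/1
```

Printing the three values makes the distinction concrete: the probability is 1/4, but the odds in favor are 1/3.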

Gambling

‘Odds against’ are commonly used in the context of gambling. When you hear that the Seattle Seahawks Vegas odds to win the Super Bowl are 5:1 [Retrieved 9/19/2014], the 5:1 is referring to the ‘odds against’ Seattle winning the Super Bowl. Using some quick math, we can determine the probability of Seattle winning the Super Bowl: 1/6 or 16.7%.

Vegas odds are technically payoff odds, because they describe the payout if you were to win the bet. The payout on the Seahawks would win you $5 for every $1 bet on Seattle winning the Super Bowl. They aren’t true odds, since no one is really sure what the true odds are, because you can’t simply count and weigh the possibilities like with the bag of marbles. The payoff will increase as the event becomes less likely. If you could create a reliable predictive model that told you the Seahawks actually had a 20% probability to win the Super Bowl, you could bet on the Seahawks, knowing that their actual probability of winning is better than what Vegas is giving them. And if you made enough bets like this, you could beat Vegas.

Mathematical Relationship

I stated earlier that probability and odds are colloquially interchangeable when values aren’t given. This is true, because the two are mathematically related: odds can be computed from probability and probability from odds.

$latex P(A) = \frac{Odds\_Favor(A)}{1 + Odds\_Favor(A)}&s=2$

$latex Odds\_Favor(A) = \frac{P(A)}{1 - P(A)}&s=2$

Using the RED marble example [P(RED) = 1/4 and Odds_Favor(RED) = 1/3] we can demonstrate how these are equivalent:

$latex P(RED) = \frac{1/3}{1 + 1/3} = \frac{1/3}{4/3} = \frac{1}{4}&s=2$

$latex Odds\_Favor(RED) = \frac{1/4}{1 - 1/4} = \frac{1/4}{3/4} = \frac{1}{3}&s=2$

# MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in this post for reference.

The Baseball Side

A team in the American League averages .4830 runs per inning, but does this mean it will score a run every two innings? This seems intuitive if you apply math from Algebra I [1 run / 2 innings ~ .4830 runs/inning]. However, if you attend a baseball game, the vast majority of innings you’ll watch will be scoreless. This large number of scoreless innings can be described by discrete probability distributions that account for teams scoring zero, one, or multiple runs in one inning.

Runs in baseball are considered rare events and count data, so they will follow a discrete probability distribution if they are random. The overall goal of this post is to describe the random process that arises with scoring runs in baseball. Previously, I’ve used the Poisson distribution (PD) to describe the probability of getting a certain number of runs within an inning. The Poisson distribution describes count data like car crashes or earthquakes over a given period of time and defined space. This worked reasonably well to get the general shape of the distribution, but it didn’t capture all the variance that the real data set contained. It predicted fewer scoreless innings and many more 1-run innings than what really occurred. The PD makes an assumption that the mean and variance are equal. In both runs per inning and runs per game, the variance is about twice the mean, so the real data will ‘spread out’ more than a PD predicts.

The graph above shows an example of the application of count data distributions. The actual data is in gray and the Poisson distribution in yellow. It’s not a terrible way to approximate the data or to conceptually understand the randomness behind baseball scoring, but the negative binomial distribution (NBD) works much better. The NBD is also a discrete probability distribution, but it finds the probability of a certain number of failures occurring before a certain number of successes. It would answer the question: what’s the probability that I get 3 TAILS before I get 5 HEADS if I keep flipping a coin? This doesn’t at first seem like it relates to a baseball game or an inning, but that will be explained later.

From a conceptual standpoint, the two distributions are closely related. So if you are trying to describe why 73% of all MLB innings are scoreless to a friend over a beer, either will work. I’ve plotted both distributions for comparison throughout the post. The second section of the post will discuss the specific equations and their application to baseball.

Runs per Inning

Because of the difference in rules regarding the designated hitter between the two leagues, there will be a different expected value [average] and variance of runs/inning for each league. I separated the two leagues to get a better fit for the data. Using data from 2011-2013, the American League had an expected value of 0.4830 runs/inning with a 1.0136 variance, while the National League had an expected value of 0.4468 runs/inning with a 0.9037 variance. [So NL games are shorter and more boring to watch.] Using only the expected value and the variance, the negative binomial distribution [the red line in the graph] approximates the distribution of runs per inning more accurately than the Poisson distribution. It’s clear that there are a lot of scoreless innings and very few innings with multiple runs scored. This distribution allows someone to calculate the likelihood of an MLB team scoring more than 7 runs in an inning, or the probability that the home team, down by a run in the bottom of the 9th, forces extra innings. Using a pitcher’s expected runs/inning, the NBD could be used to approximate the pitcher’s chances of throwing a shutout, assuming he pitches all 9 innings.

Runs Per Game

The NBD and PD can be used to describe the runs a team scores in a game as well. Once again, I separated the AL and NL, because the AL had an expected value of 4.4995 runs/game with a 9.9989 variance, and the NL had an expected value of 4.2577 runs/game with a 9.1394 variance. This data is taken from 2008-2013; I used a larger span of years to increase the total number of games. Even though MLB teams average more than 4 runs in a game, the single most likely run total for one team in a game is actually 3 runs. The negative binomial distribution once again modeled the distribution well, but the Poisson distribution had a terrible fit compared to the previous graph. Both models, however, underestimate the shut-out rate. A remedy for this is to adjust for zero-inflation, which would increase the likelihood of getting a shut out in the model and adjust the rest of the probabilities accordingly. One inference from needing zero-inflation is that baseball scoring isn’t completely random: a manager is more likely to use his best pitchers to preserve a shut out than to randomly assign pitchers from the bullpen.
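The zero-inflation adjustment mentioned above can be sketched as a simple mixture: a point mass at zero blended with an ordinary count distribution. The mixing weight `pi` below is an illustrative guess, not a value fitted in the post:

```python
import math

def poisson_pmf(k, lam):
    """Ordinary Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zero_inflated_pmf(k, lam, pi):
    """Mix a point mass at zero (weight pi) with a Poisson (weight 1 - pi)."""
    base = poisson_pmf(k, lam)
    if k == 0:
        return pi + (1 - pi) * base
    return (1 - pi) * base

lam = 4.4995      # AL expected runs/game from the post
pi = 0.02         # illustrative extra shut-out mass, NOT a fitted parameter

p0_plain = poisson_pmf(0, lam)
p0_zi = zero_inflated_pmf(0, lam, pi)   # shut-out probability is bumped up
```

The same mixture works with the negative binomial in place of the Poisson; the point is only that the zero bucket gets extra weight while everything else is scaled down so the probabilities still sum to 1.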

Hits Per Inning

It turns out the NBD/PD are useful for many other baseball statistics, like hits per inning. The distribution of hits per inning is similar to runs per inning, except the expected value is higher and the variance is lower. [AL: .9769 hits/inning, 1.2847 variance | NL: .9677 hits/inning, 1.2579 variance (2011-2013)] Since the variance is much closer to the expected value, the hits-per-inning distribution has more values in the middle and fewer at the extremes than the runs-per-inning distribution.

I could spend all day finding more applications of the NBD and PD, because there are a lot of examples within baseball. Understanding these discrete distributions will help you understand how the game works, and they can be used to model outcomes within baseball.

The Math Side

Hopefully, you skipped down to this section right away if you are curious about the math behind this. I’ve compiled the numbers used in the graphs for the American League above for those curious enough to look at examples of the actual values.

The Poisson distribution is given by the equation:

$latex P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}&s=2$

There are two parameters for this equation: expected value [$latex \lambda&s=1$] and the number of runs you are looking to calculate the probability for [$latex x&s=1$]. To determine the probability of a team scoring exactly three runs in a game, you would set $latex x = 3&s=1$ and using the AL expected runs per game you’d calculate:

$latex P(X = x) = \frac{e^{-4.4995}4.4995^3}{3!} = 16.87\% &s=2$

This is repeated for the entire set of $latex x&s=1$ = {0, 1, 2, 3, 4, 5, 6, … } to get the Poisson distribution used throughout the post.
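That calculation is easy to check with a few lines of Python using only the standard library (the function name is my own):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 4.4995                        # AL expected runs per game
p3 = poisson_pmf(3, lam)            # ~0.1687, matching the 16.87% above

# Evaluating over a range of x values gives the whole distribution
distribution = [poisson_pmf(x, lam) for x in range(20)]
```

Summing the pmf over a long enough range of `x` comes out to 1, which is a quick sanity check that the formula is implemented correctly.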

One of the assumptions the PD makes is that the mean and the variance are equal. For these examples, this assumption doesn’t hold true, so the empirical data from actual baseball results doesn’t quite fit the PD and is overdispersed. The NBD accounts for the extra variance by including it in its parameters.

The negative binomial distribution is usually symbolized by the following equation:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

where $latex r&s=1$ is the number of successes, $latex k&s=1$ is the number of failures, and $latex p&s=1$ is the probability of success. A key restriction is that a success has to be the last event in the series of successes and failures.

Unfortunately, we don’t have a clear value for $latex p&s=1$ or a clear concept of what will be measured, because the NBD measures the probability of binary, Bernoulli trials. It helps to view this problem from the vantage point of the fielding team or pitcher: a SUCCESS is defined as getting out of the inning or game, and a FAILURE is allowing 1 run to score. This conforms to the restriction by having a success [getting out of the inning/game] be the ultimate event of the series.

In order to make this work, the NBD needs to be parameterized differently, in terms of the mean, the variance, and the number of runs allowed [failures]. The following equations are derived from the mean and variance equations of a negative binomial. $latex \alpha&s=1$ represents the ‘odds in favor’ of getting out of the inning, and $latex r&s=1$ is the expected value multiplied by the ‘odds in favor’, which yields a real, non-integer value for the number of successes. The NBD can then be written as

$latex P(X=k) = \frac{\Gamma(k+r)}{\Gamma(k+1)\Gamma(r)} (\frac{\alpha}{1+\alpha})^{r} (\frac{1}{1+\alpha})^{k}&s=2$

where

$latex r = Expected Value * \alpha; \alpha = \frac{Expected Value}{Variance - Expected Value}&s=2$

So using the same example as the PD distribution, this would yield:

$latex r = 4.4995 * 0.8182 = 3.6815 ; \alpha = \frac{4.4995}{9.9989 - 4.4995} = 0.8182&s=2$

$latex P(X=3) = \frac{\Gamma(3+3.6815)}{\Gamma(3+1)\Gamma(3.6815)} (\frac{0.8182}{1+0.8182})^{3.6815} (\frac{1}{1+0.8182})^{3}&s=2$

$latex = 14.18\% &s=2$
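The whole chain of calculations can be reproduced in Python, with `math.gamma` playing the role of the $latex \Gamma &s=1$ function; the function name and the mean/variance parameterization follow the equations above:

```python
import math

def nbd_pmf(k, mean, variance):
    """P(X = k) failures (runs allowed), parameterized by mean and variance."""
    alpha = mean / (variance - mean)     # 'odds in favor' of escaping the game
    r = mean * alpha                     # non-integer number of successes
    coef = math.gamma(k + r) / (math.gamma(k + 1) * math.gamma(r))
    return coef * (alpha / (1 + alpha))**r * (1 / (1 + alpha))**k

# AL runs-per-game parameters from above: mean 4.4995, variance 9.9989
p3 = nbd_pmf(3, 4.4995, 9.9989)          # probability of exactly 3 runs
```

The pmf sums to 1 over the support, and with these parameters it puts noticeably more weight on 0 runs than the Poisson version does, which is the overdispersion the post is describing.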

The above equations are adapted from this blog about negative binomials and this one about applying the distribution to baseball. The $latex \Gamma &s=1$ function is used in the equation instead of a combination operator because the combination operator, specifically the factorial, can’t handle the non-whole numbers we are using to describe the number of successes; the gamma function is a continuous function defined from 0 to infinity.

Conclusion

The negative binomial distribution is really useful for modeling the distribution of discrete count data from baseball for a given inning or game. The most interesting aspect of the NBD is that a success is considered getting out of the inning/game, while a failure is letting a run score. This is a little counterintuitive if you approach modeling the distribution from the perspective of the batting team. While the NBD has a better fit, the PD has a simpler concept to explain: the count of discrete events over a given period of time, which might make it better to discuss over beers with your friends.

The fit of the NBD suggests that run scoring is a negative binomial process, but inconsistencies, especially with shutouts, indicate elements of the game aren’t completely random. I attribute the underestimated number of shutouts to the increased use of the best relievers in shutout games over other games, which increases the total number of shutouts and subsequently decreases the frequency of other run totals.

All MLB data is from retrosheet.org. It’s available free of charge from there. So please check it out, because it’s a great data set. If there are any errors or if you have questions, comments, or want to grab a beer to talk about the Poisson distribution, please feel free to tweet me @seandolinar.

# Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole-number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently than continuous data, and the distribution [frequency of different values] is different between the two. Random continuous data typically follows the normal distribution, which is the bell curve everyone remembers from high school grading systems. [Which is a really bad way to grade, but I digress.] Count data generally follows the binomial, negative binomial, or Poisson distribution depending on the context in which you are viewing the data; all three distributions are mathematically related.

Binomial Distribution:

The binomial distribution (BD) is the collection of probabilities of getting a certain number of successes in a given number of trials, specifically measuring Bernoulli trials [a yes/no event similar to a coin flip, but not necessarily 50/50]. My favorite example for understanding the binomial distribution is using it to determine the probability that you’d get exactly 5 HEADS if you flipped a coin 10 times [it’s NOT 50%!]. It’s actually 24.61%. The probability of getting heads in any given coin flip is 50%, but over 10 flips, you’ll only get exactly 5 HEADS and 5 TAILS about 25% of the time. The equation below gives the two popular notations for the binomial probability mass function. $latex n&s=1$ is the total number of trials [the graph above used n=10]. $latex r&s=1$ is the number of successes you want to know the probability for. You calculate this function for each number of HEADS [0-10] for $latex r&s=1$ to get the distribution above. $latex p&s=1$ is the simple probability of each event. [$latex p&s=1$ = .5 for the coin flip.]

$latex P(X=r) = {{n}\choose{r}} p^{r} (1-p)^{n-r} = \frac{n!}{r!(n-r)!} p^{r} (1-p)^{n-r}&s=2$

The equation has three parts. The first part is the combination $latex {{n}\choose{r}} &s=1$, which is the number of combinations when you have $latex n&s=1$ total items taken $latex r&s=1$ at a time. Combinations disregard order, so the set {1, 4, 9} is the same as {4, 9, 1}. This part of the equation tells you how many possible ways there are to get to a certain outcome, since there are many ways to get 5 HEADS in 10 tosses. Since $latex {{10}\choose{5}}&s=1$ is larger than any other combination, 5 HEADS will have the largest probability.

There are two more terms in the equation. $latex p^r&s=1$ is the joint probability of getting r successes in a particular order, and $latex (1-p)^{n-r}&s=1$ is the corresponding probability of getting the failures in a particular order. I find it helpful to conceptualize the equation as having three parts accounting for different things: the total combinations of successes and failures, the probability of the successes, and the probability of the failures.
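The 10-flip example works out like this; `math.comb` (Python 3.8+) supplies the combination term:

```python
import math

def binomial_pmf(r, n, p):
    """P(X = r) successes in n Bernoulli trials with success probability p."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# Probability of exactly 5 HEADS in 10 fair coin flips
p5 = binomial_pmf(5, 10, 0.5)                      # 252/1024 ~ 24.61%

# The full distribution over 0-10 HEADS, which sums to 1
distribution = [binomial_pmf(r, 10, 0.5) for r in range(11)]
```

Note that `p5` is the largest entry in `distribution`, matching the point above about $latex {{10}\choose{5}}&s=1$ being the biggest combination.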

Negative Binomial Distribution:

While there is a good reason for it, the name of the negative binomial distribution (NBD) is confusing. Nothing I will present involves making anything negative, so let’s just get that out of the way and ignore it. The binomial distribution uses the probability of successes in the total number of ATTEMPTS. In contrast, the negative binomial distribution uses the probability that a certain number of FAILURES occur before the $latex r&s=1$th SUCCESS. This has many applications, specifically when a sequence terminates after the $latex r&s=1$th success, such as modeling the probability that you will sell out of the 25 cups of lemonade you have stocked for a given number of cars that pass by. The idea is that you would pack up your lemonade stand after you sell out, so cars that pass by after the final success won’t matter. Another good example is modeling the win probability of a 7-game sports playoff series. The team that wins the series must win 4 games, and specifically the last game played in the series, since the playoff series terminates after one team reaches 4 wins.

One of the more important restrictions on the NBD is that the last event must be a success. Going back to the sports playoff series example, the team that wins the series will NEVER lose the last game. With the 10-coin-flip example, the BD was looking for the probability of getting a certain number of HEADS within a set number of coin flips. Using the NBD, we look for the probability of getting 5 HEADS before a certain number of TAILS. The total number of flips will not ALWAYS equal 10 and can actually exceed 10, as seen below. The probability mass function that describes the NBD graph above is given below:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

The equation for the NBD has the same parts as the BD: the combinations, the successes, and the failures. In the NBD’s case there are fewer combinations than in the BD [for the same total number of coin flips], because the last outcome is held fixed as a success. The success and failure parts of the equation are conceptually the same as in the BD. The failure portion is written differently because the number of failures is a parameter $latex k&s=1$ instead of a derived quantity like [$latex n-r&s=1$].
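For the coin-flip version, the probability of seeing exactly k TAILS before the 5th HEADS is a direct transcription of the pmf above (another standard-library sketch):

```python
import math

def neg_binomial_pmf(k, r, p):
    """P of exactly k failures before the r-th success, success probability p."""
    return math.comb(r + k - 1, k) * p**r * (1 - p)**k

# e.g. exactly 3 TAILS before the 5th HEADS with a fair coin:
# C(7,3) * 0.5^5 * 0.5^3 = 35/256 ~ 13.67%
p_3_tails = neg_binomial_pmf(3, 5, 0.5)
```

Summing over all possible failure counts k = 0, 1, 2, … converges to 1, even though the number of total flips is unbounded, which is the "total flips can exceed 10" point made above.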

Poisson Distribution:

The Poisson distribution (PD) is directly related to both the BD and the NBD, because it is the limiting case of both of them: as the number of trials goes to infinity, the Poisson distribution emerges. The graph for the PD will look similar to the NBD or the BD, but there is no coin-flip comparison here, since the PD describes processes over continuous time or space, like traffic flow or earthquakes. The major difference is not what is represented, but how it is viewed and calculated. The Poisson distribution is described by the equation:

$latex P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}&s=2$

$latex \lambda&s=1$ is the expected value [or the mean] for an event and $latex x&s=1$ is the count value. If you knew that an average of 0.2 car crashes happen at an intersection on a given day, then you could solve the equation for $latex x&s=1$ = {0, 1, 2, 3, 4, 5, … } and get the PD for the problem.

One of the restrictions and major issues with the use of the PD is that the model assumes the mean and the variance are equal. In most real data instances the variance is greater than the mean, so the PD tends to concentrate more values around the expected value than real data reflects.

If you are interested in the derivations and math behind these, I recommend this site: http://statisticalmodeling.wordpress.com/. I feel like they explain the derivation of the negative binomial better than most places I’ve found. It addresses why it’s called the NEGATIVE binomial distribution as well. The site also contains derivations of the PD being the limiting case of the BD and NBD.

This uses the same data set I obtained from my NU Data Mining final project [summary].

Recently, @MLBcathedrals tweeted a photo I submitted to them:

I got a bunch of retweet/favorite notifications at first, then fewer as the day went on. Now, a month later, I’ll get a favorite or retweet [RT] notification every so often. The process of getting a retweet follows a Poisson process, where there is a discrete and somewhat small outcome that can be thought of as count data — you can count retweets per minute.

I used the tweets I just had lying around from my project and pulled out several collections of native RTs that had their first RT in the data set and a high volume of retweets. This was to ensure I had the first part of each tweet’s life and hadn’t just captured it in the middle. Time is measured in seconds from the first retweet event. This simplifies things by giving each collection of RTed tweets a time base relative to when RTing started.

The common time base enabled me to make this comparison chart of different RTed tweets: not every RT pattern is the same. Some have many more RTs, some take a little while to gain momentum, but generally they start off strong, then slowly die out, taking the shape of a logarithmic function. The total number of RTs over time is interesting, but this problem works better if we look at the rate of RTing. The reason the RT pattern flattens out is that there is a steadily decreasing RTing rate over time. This makes intuitive sense if you have ever used Twitter: people react to things as they happen, then rarely go back to them.

It turns out you can mathematically model the RT rate with a Poisson generalized linear model reasonably well. The following three graphs show the actual RT rate data points as red dots, the expected value regression as the black line, and a probability range as blue bands. The model for this particular RTed tweet is described by the equation:

$latex ln(E[Y|t]) = 2.980154 - 0.0017236*t&s=1$

$latex Y&s=1$ is the number of RTs per minute. $latex t&s=1$ is time. And $latex E[Y|t]&s=1$ is read as the expected value of $latex Y&s=1$ given $latex t&s=1$, or the most likely number of RTs per minute at a given time. The constant [2.980154] represents the log of the rate at t=0, and the negative regression coefficient [-0.0017236] indicates that the rate will decline over time. This regression line represents the expected value, which is essentially an average of possible outcomes. Using the Poisson distribution and the expected value, I constructed a probability distribution showing a band where 50% of all data points should be located, and another band that should encompass 90% of them.

The bands, in my opinion, are more important than the regression line, because we are dealing with count data. So having an expected value of say 2.5342 doesn’t mean much if you don’t know the probability for getting a value of 0, 1, 2, 3, 4, etc. For this reason the last graph in the series of three has only the actual data points in the probable area bands. For each minute, the data point has a 50% chance to be in the dark blue region, a 40% chance to be in the light blue region [90%-50%], and a 10% chance to be in the white.
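The regression line and the probability bands can be reconstructed from the two fitted coefficients quoted above; the band-finding helper below is my own sketch (it walks the Poisson CDF rather than using a stats library):

```python
import math

B0, B1 = 2.980154, -0.0017236       # intercept and slope from the fitted model

def expected_rate(t):
    """E[Y|t]: expected RTs per minute, t seconds after the first RT."""
    return math.exp(B0 + B1 * t)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def interval(lam, coverage):
    """Central band [lo, hi] of counts holding at least `coverage` probability."""
    tail = (1 - coverage) / 2
    cdf, lo, hi = 0.0, None, None
    for k in range(200):
        cdf += poisson_pmf(k, lam)
        if lo is None and cdf > tail:
            lo = k
        if cdf >= 1 - tail:
            hi = k
            break
    return lo, hi

rate_at_start = expected_rate(0)            # ~19.7 RTs per minute at t = 0
band50 = interval(expected_rate(600), 0.50)  # 50% band ten minutes in
```

Evaluating `interval` at every minute with the 0.50 and 0.90 coverage levels reproduces the dark blue and light blue bands described in the paragraph above.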

This is all predicated on RTing being a random process with a fixed audience. This described most of the RTs I looked at fairly well, but there will always be other factors such as viral growth and time of day. Viral growth means it starts off slow then grows large. If this were to happen, it would not follow this pattern; it would look more like an S. For better or worse, most RTs come from accounts with a large number of followers, so they aren’t actually viral, they are propagations of already popular tweets.

This specific regression by itself won’t predict how many RTs a tweet might get before it’s tweeted, but it describes what happens after people have begun retweeting.

This is a summary of a final project I did for my Introduction to Data Mining class at NU. The goal of the project was to find a business need and execute a data mining process. The general process I used is outlined here, and the sentiment lexicon is found here. The lexicon is from a paper: Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

My experiences using social media, the business-centric focus of my grad classes, and my love for burritos inspired me to look into Twitter sentiment analysis. [I also needed to research something that wasn’t baseball.] Imagine every time you’ve misinterpreted a text message from your friend. Or every time an irate Twitter follower takes a sarcastic tweet seriously. That’s how hard it is for normal people to correctly interpret sentiment of written communication. So now picture trying to get a computer to do the same thing. Not easy. But at the very least we can find a way to categorize more tweets correctly than we misidentify.

On August 25th & 26th, I scraped tweets containing the Twitter handles of some companies. I chose these based on my personal preferences or companies that I thought might have strong sentiment. The tweets were scraped using R and the package streamR. [These are pretty easy to use; if anyone wants to start doing any Twitter research, I’d start here.] The tweets are saved as JSON files, which are a mess, but human-readable. R parses the tweets into a data frame, which can then go into a SQL database if so desired.

The way I determined a tweet’s sentiment was by matching words in the tweet to a predetermined sentiment lexicon. This process is outlined in this presentation, if you are curious about how to execute it. The algorithm is simple enough to write with one FOR loop. The toughest problem I had was dealing with the tweet’s data object. The scoring system was simple: each word in a tweet that matches the lexicon list counts as +1 [positive] or -1 [negative]. The words are added together to give a sentiment score. The sentiment score can be interpreted like algebra [0 is neutral, greater than 0 is positive, and less than 0 is negative].
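The scoring loop amounts to a few lines. The project itself used R, but the same idea reads cleanly in Python; the tiny lexicon here is a stand-in for the Hu & Liu lexicon the post links to, and the word lists are invented for illustration:

```python
# Stand-in lexicon; the real Hu & Liu lexicon contains thousands of words.
POSITIVE = {"good", "great", "love", "delicious"}
NEGATIVE = {"bad", "terrible", "hate", "slow"}

def sentiment_score(tweet):
    """+1 per positive word, -1 per negative word, summed over the tweet."""
    score = 0
    for word in tweet.lower().split():
        word = word.strip(".,!?@#")        # crude punctuation/handle cleanup
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

pos = sentiment_score("I love @Chipotle, their burritos are delicious!")   # 2
neg = sentiment_score("Terrible service, slow internet @Comcast")          # -2
```

Scores above 0 are classified positive, below 0 negative, and exactly 0 neutral, matching the algebra-style interpretation above.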

I grouped the results by company: the bar graph sums the sentiment scores of every tweet I captured. So, for example, two negative words in one tweet will drop the company’s total score 2 points. No surprises here that Comcast is in last place. I’ve never heard anyone say anything nice about Comcast, and apparently Twitter is not much kinder. News about the Burger King-Tim Horton’s merger broke while I was capturing tweets, so that accounts for the great discontent about Burger King. I’m not quite sure what BK’s baseline should be, since this is the only data I have on hand. Verizon has very positive scores. I am a little skeptical of this, because Verizon (and Apple) had a lot of tweets that were ad-based. While it’s great to know who is advertising your product, it’s not the goal of this particular project. Ads and news stories get retweeted and regurgitated a lot, but the tweets from real customers don’t.

One way to account for large volume of tweets is to look at average sentiment per tweet: This graph takes the total sentiment score and divides it by the total number of tweets mentioning that company. This will give more weight to a company’s Twitter-customer base that’s strongly opinionated one way or another. To no one’s surprise, Comcast is dead last again. But to my delight, Chipotle ranks first! [I love burritos…and apparently a lot of Twitter users do too.] Chipotle is a good example of a company that does not have the volume that Comcast or Verizon has, but their users feel strongly about their product and tweet positively about it.

Luckily, there is a bunch of metadata available with the tweets, including my favorite variable, a timestamp. First, here’s a time series baseline for August 25, 2014: the volume of tweets increased throughout the day and peaked around lunch in EDT. Let’s look at Chipotle’s time series graph broken up by sentiment classification: neutral and positive tweets peaked during the 1PM EDT hour, with no corresponding spike in negative tweets. This is very encouraging for Chipotle, since you would expect the negative tweets to follow the same pattern, and they don’t. They actually rise later in the day. Further research would have to be done to determine if this trend is real and what its source is. It could be a time-zone-delayed problem, general staffing/production issues in non-lunch hours, selection bias of people who go to a later lunch, or a random fluctuation that happened that day.

Comcast also has some interesting patterns: there are two spikes in neutral tweet volume early in the day. I think these are the results of mentions in a news-related tweet that was retweeted a lot. The large spike at 2PM EDT is probably caused by retweets as well. However, during the early afternoon, there are distinctive negative customer tweets accounting for the surge in negative sentiment after lunch. My conjecture for this surge would be an increase in people dealing with Comcast’s customer service. It would be interesting to see if call center data matched up.

This is just the most basic implementation of sentiment analysis. There are more advanced machine learning techniques that can weight words differently and look at consecutive word groups (n-grams) in addition to individual words. The advantage of this is easy to see: the phrase ‘not good’ is negative, but scoring only its constituent words would rate it neutral. There are a lot of other processes I could try to get more accurate results, but unfortunately, not in this post.

# Pirates 2014 — Bullpen

All the graphs are pulled from this Fangraphs leaderboard.

The Pirates bullpen has been a source of problems and criticisms for the Pirates this year. At the beginning of 2014, the bullpen had almost the same personnel as the 2013 season. Bullpens can vary wildly from year to year, and the Pirates relievers pitched out of their minds for most of 2013, so you’d expect there to be some fall off. Currently [August 26, 2014], the Pirates lead MLB with 22 blown saves. Personally, I abhor saves and blown saves, but I needed to get this out of the way, since it’s the stat that will get thrown around the most. And for reference, Tony Watson [the All-Star] leads the team with 6 blown saves. So there’s that. I wanted to look at some of the peripheral stats of the Pirates bullpen to understand the entire story. First, the Pirates starters have been terrible this year. They rank last in starter WAR, middle of the pack in FIP, and near the bottom in WPA. Analyzing that situation is for another day, but suffice it to say they give up a lot of runs before the bullpen gets into the game. The smaller the average lead the bullpen has to hold on to, the more often they will give up the lead [accrue a blown save]. Shutdowns and meltdowns are Fangraphs stats which are better for evaluating individual relievers than saves. They provide a broader evaluation of how a pitcher or bullpen has performed rather than just looking at save situations. A pitcher earns a shutdown by meaningfully adding to his team’s win probability, while a meltdown means he meaningfully subtracted from it. For instance, last night Jared Hughes had a meltdown, allowing three runs and inverting the win probability. The Pirates are in the middle of the pack for both of those stats. There really isn’t anything interesting here.

Finally, the Pirates’ reliever xFIP is not very good. It’s towards the lower end of MLB. xFIP is one of the better park-independent, context-independent predictors of pitching skill. It just uses BB, K, and flyballs [for HR/FB]. This will also ‘adjust’ for some of Grilli and Frieri’s HRs that they gave up when they were struggling earlier this year. Those struggles won’t affect the bullpen moving forward since they are no longer on the team. After this quick analysis, to answer my initial question about the Pirates bullpen: they aren’t good. They aren’t terrible, but they aren’t good. They do have two really good pitchers with Melancon and Watson. Two decent pitchers in Wilson and Hughes. Then the rest aren’t great. Taking this analysis, what could the Pirates do to improve? Frieri was a gamble that didn’t pay off. But honestly, I think from a management standpoint, you had to get rid of Grilli to get him out of the closer role. John Axford might help. He’s been good in his 5 appearances for the Pirates so far, and his career xFIP is 3.26, which is pretty good. As far as a trade, ‘proven’ relievers are overvalued in the free agent market, and the trade market was really expensive this year. Overall, one reliever isn’t going to affect your win total dramatically.

# MLB — Bases Loaded. No Outs. No Runs.

Bases loaded, no outs is one of the most tenuous points of a close baseball game. If you are rooting for the team at the plate, you feel confident your team will score here. Anything else would be a huge disappointment. If you are rooting for the fielding team and your pitcher gets out of the jam, you are elated and praising the pitching staff for being able to handle pressure. Even though bases loaded, no outs (BLNO) seems like a sure thing, there is about a 15% chance the team DOESN’T score at all.

I’ve created this table of the probability of scoring AT LEAST ONE RUN in the various base-out state situations using data from 2011-2013. The base-out states represent the 8 possible combinations of runners on base with the 3 out states that can exist [24 total]. 1- - means there’s only a runner on first, 1-3 means first and third, and 123 is bases loaded. Looking at the chart, there is only an 85.18% chance that the team with BLNO scores a run. It’s one of the highest run probability situations, but there’s still a significant chance they won’t score a run. This table considers every play that started with this base-out configuration and looks at the remainder of the inning to see if the team scored. [It uses every play in baseball from 2011-2013 including playoff games.] In general these numbers fluctuate slightly over time and between teams. This table is also context neutral, specifically batter neutral, so having Mike Trout at bat would significantly change the probability versus a player like Clint Barmes.

Looking at the table, it’s apparent that the lead runner is the most important factor in scoring AT LEAST one run, since the base-out states have similar probabilities whenever the lead runner is at third or at second. So having a lead-off triple is about as valuable [in the context of scoring ONLY one run] as having the bases loaded with no outs.

There are different run and out possibilities that exist with each base-out state. For the lead-off triple, there is no force play on the bases, while a bases-loaded situation has a force play at every bag including home. Having the bases loaded would turn a ground ball into a potential run-robbing force play, while a single runner on third would require a tag. Conversely, BLNO allows for walks and hit by pitches to drive in a run. This table also uses the entire rest of the inning, not just the play that occurs with BLNO. So if a team loads the bases with no outs, records two outs, then scores a run, it still counts as a success. A double play, which is easier to turn with the bases loaded than with just a runner on third, will dramatically reduce the run probability for the rest of the inning, and that counts against the original base-out state. In summary, there are trade-offs that affect the overall, context-neutral probability of the base-out state.

Example — Pirates Game

Failing to score a run in the context of this post means after loading the bases, the team does not score any runs before the end of the inning. All the probabilities are determined empirically.

Something kind of cool happened during the Pirates game last night (8/8/2014). There were two instances where the bases were loaded with no outs, and the teams weren’t able to score any runs. Failing to score any runs with the bases loaded/no outs isn’t that uncommon. A run-probability table can tell you that ~14% of the time a team will fail to score any runs for the rest of the inning after achieving that base-out state.

A base-out state is one of the 24 possible combinations of baserunners and number of outs. So there are 8 base states, bases empty, runner on first, etc. to bases loaded, and three different out states, 0, 1, or 2 outs. 8 x 3 = 24.
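As a sanity check on the arithmetic, here’s a quick sketch that enumerates all 24 base-out states, using notation that mirrors the ‘1- -’, ‘1-3’, ‘123’ style above:

```python
from itertools import product

# Each of first, second, third base is either occupied (True) or empty (False),
# giving the 8 base states.
base_states = ["".join(str(b + 1) if occ else "-" for b, occ in enumerate(bases))
               for bases in product([False, True], repeat=3)]
out_states = [0, 1, 2]

# Cross the 8 base states with the 3 out states: 8 x 3 = 24 base-out states.
states = [(bases, outs) for bases, outs in product(base_states, out_states)]
print(len(base_states), len(out_states), len(states))  # 8 3 24
```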

In the control room at the Pirates game last night, we were debating how often you see two occasions in the same game where no runs are scored after the bases are loaded with no outs. It turns out it’s relatively rare, but it had happened twice at PNC Park before 2014: May 12, 2002 and August 28, 2003.

Between 2003 and 2013, the bases were loaded with no outs and no runs scored 1,092 times. There were 25 games where this happened multiple times, which is 0.0923% of all games played during that time [27,094 games]. This is on par with the probability of seeing a no-hitter (0.111%) and less probable than seeing a walk-off walk to end the game (0.266%).

The probability of seeing a game with two or more non-scoring bases loaded/no outs situations is 0.0923%

Using the table below, bases empty/no outs will occur in every game (this happens at the start of every inning), and all the other base-out states have varying frequencies, with runner-on-third, low-out states being the rarest. Bases loaded/no outs is the rarest base-out state, occurring in only 21.92% of all games and occurring twice in the same game in only 6.05% of all games. Just for reference, here is a chart of how often the base-out state events occur relative to all events. This would represent the probability that any random event (plate appearance, at-bat, stolen base, etc.) would have that base-out state. All data is from retrosheet.org.

# Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference.

A time series is data that has been collected at a regular interval over time. This is rather intuitive when given the definition, but time series are different from cross-sectional data, which is the type of data set most people are familiar with. The closing price of a stock is a time series, because it’s a measurement at 4PM every M-F. Cross-sectional data would be looking at which type of stocks gained the most over a quarter in your portfolio. That is one measurement (quarterly change) made for many different stocks. Not every data set fits neatly into a category, and the analysis goal is different for each instrument.

The goal of univariate time series analysis (TSA) is to forecast a variable only using past observations of that variable. In the case of the stock market example, TSA seeks to project what the closing price for the next day will be using data from the specified time frame. However, finance is boring and I wanted a data set that I can extract some insight from, so we’ll be looking at MLB strikeouts (K) per year and home runs (HR) per year as the data sets.

What does a time series look like? If you scroll down or look up a stock market graph, you’ll see one. It’s messy. I created this data set, so I can describe this process accurately. It’s a first-order moving average process with a lag-1 coefficient of 0.9 and a series mean of 0. I’ve also included the normal linear regression (OLS) trend for the time series, which shows it to have a slightly positive trend. This is a typical analytical technique to show that a time series is moving. In this case the trend is non-significant over these 50 data points: in truth there is no trend, and the mean is zero. The model that corresponds to the graph above has the general form as follows:

$latex y_t = \mu + a_t + \theta_1 a_{t-1}&s=2$

where $latex y&s=1$ is the time-dependent target variable, $latex \mu&s=1$ is the average of the entire series of data, $latex \theta&s=1$ is the regression coefficient, and $latex a&s=1$ is a time dependent shock to the system. The $latex t&s=1$ terms describe which time period the variable is from starting with the most current one, $latex t=50&s=1$.

Before describing the model above, it is important to fully understand what the $latex a_t&s=1$ represents. This is a shock term that can encompass a lot of different things. If you consider something like quarterly earnings, factors influencing the shock term are unemployment, economic growth, marketing campaigns, etc. We are looking at the data in absence of this knowledge, and since we are in the dark, the causes of the shocks appear random. The $latex a_t&s=1$ terms should be normally distributed and not autocorrelated. The expected value should be zero, $latex E[a_t] = 0&s=1$. The expected value is another way to describe the average of all the $latex a_t&s=1$ terms.
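To make the shock terms concrete, here’s a minimal simulation of the MA(1) process from the first graph, with a lag-1 coefficient of 0.9 and a mean of 0. The random seed and the resulting series are hypothetical, not the data behind the post’s figure.

```python
import random

random.seed(42)
n, theta, mu = 50, 0.9, 0.0

# Draw the shocks a_t ~ N(0, 1), then build the MA(1) series:
#   y_t = mu + a_t + theta * a_{t-1}
a = [random.gauss(0, 1) for _ in range(n + 1)]
y = [mu + a[t] + theta * a[t - 1] for t in range(1, n + 1)]

# The sample mean should sit near mu even though individual points wander.
print(round(sum(y) / n, 3))
```

Each point carries a fraction of the previous period’s shock, which is the ‘memory’ the model describes.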

Here’s a great way to think about the MA process. Think about a simplified personal monthly budget where you have a constant salary and a modest savings account. Shocks that would be included in the $latex a_t&s=1$ term would be unexpected expenses. An unexpected expense could influence the next time period if you had to dip into savings. So a high unexpected expense in January would impact the spending in February, because you’d have to pay off your credit card or put money back into savings.

There are many more details to understanding time series such as autocorrelation. Hopefully I’ll write a separate post on that in the future.

Let’s look at some real data. Luckily, I have every play from MLB in a database thanks to retrosheet.org, so we’ll look at some time series from there, specifically HR and Ks per year. Conceptually, for this rudimentary modeling, an MA process makes sense. A shock from the previous year like expansion, steroids, or selection bias would carry over year to year. Looking at the time series graph below, it doesn’t behave like the previous time series that was centered around zero. This time series is considered non-stationary, which means there’s a trend and that trend changes over time. The number of HR per season increased over time up until around 2001, when it leveled off and started to decline. There’s a trend up until 2001 and a trend after it, and they aren’t the same. To get around this, instead of modeling the actual values, the differences between consecutive years of HRs will be modeled. A difference ($latex \nabla&s=1$) is simply $latex y_t - y_{t-1}&s=1$. For example, the difference in HRs between 2013 and 2012 would be -279 HRs. The green line is the actual HRs each year. The ‘cantaloupe’ colored lines are the 50% confidence interval (CI) of the forecast. The red line is the forecasted values. I used 50% CIs to show likely deviations, not statistically significant deviations.
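Differencing itself is trivial to compute. The yearly totals below are illustrative placeholders rather than exact MLB figures, except that the 2012-to-2013 drop matches the -279 quoted above:

```python
# Illustrative yearly HR totals (not verified MLB figures); the 2013 - 2012
# difference of -279 matches the post.
hr = {2010: 4613, 2011: 4552, 2012: 4934, 2013: 4655}

years = sorted(hr)
# First difference: nabla y_t = y_t - y_{t-1}
diffs = {y: hr[y] - hr[y - 1] for y in years[1:]}
print(diffs[2013])  # -279
```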

The differenced moving average model [ARIMA(0,1,1)] takes the form:

$latex \nabla y_t = \mu + a_t - \theta a_{t-1}&s=2$

Substituting the estimated coefficient for $latex \theta&s=1$ and $latex \mu&s=1$ a forecast can be made with the following equation:

$latex y_{t+1} = \mu + y_t + a_{t+1} - \theta a_{t}&s=2$

$latex y_{t+1} = 50.11163 + y_t + a_{t+1} - 0.45073 a_{t}&s=2$

The last equation is used to generate the forecast line and ultimately the 50% CI lines. The interpretation of this equation is that half of the shock from the previous time period still has an effect on the change to the current period. The forecast predicts that home runs will actually increase over the next few years and not continue the decline. Looking backwards, the model can be used to identify some years of interest, and I’ve marked those on the graph. Expansion probably has the greatest impact on the number of HRs, because it dilutes the talent pool and increases the total number of games per season. If you wanted to measure the impact training or steroids had on HRs, you’d want to use a HR/game time series [see below] instead of total HRs. [This is total HRs between both teams.] The HR/Gm is the time series that a baseball analyst would want to use, because it controls for the extra games from expansion, so the trends are also less pronounced. This is still a non-stationary time series, so it needs to be differenced like the previous model and can be described by the following equation:

$latex y_{t+1} = 0.0045989 + y_t + a_{t+1} - 0.49927 a_{t}&s=2$

Still the greatest shocks are the expansion years, which tend to have a bit of a lingering effect before regressing. 1987 now stands as a really enigmatic outlier. There was no expansion that year. The best explanation is that there was a strike zone change, but I can only find that in one article. The home run outburst of the late 90s and early 2000s happens with the ‘steroid era’ and two close periods of expansion. This post isn’t interested in analyzing steroids’ effect on MLB, only that their ‘shock’ is mixed in with the expansion team ‘shock’. Also it should be noted HRs/Gm haven’t returned to pre-1993 expansion levels.
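Either forecast equation can be sketched as a small function. Here are the HR/Gm parameters, with the unknown future shock set to its expected value of zero; the current level and last shock passed in below are hypothetical, not the fitted SAS residuals:

```python
def forecast_next(y_t, last_shock, mu=0.0045989, theta=0.49927):
    """One-step-ahead point forecast for the differenced MA(1) [ARIMA(0,1,1)]
    HR/Gm model: y_{t+1} = mu + y_t + a_{t+1} - theta * a_t,
    with the unknown future shock a_{t+1} set to E[a_{t+1}] = 0."""
    return mu + y_t - theta * last_shock

# Hypothetical inputs: last season's HR/Gm level and that season's residual.
print(round(forecast_next(y_t=1.90, last_shock=-0.05), 4))
```

A negative last shock pushes the forecast up, since the model expects about half of any shock to reverse the following year.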

Looking at the opposite of a home run, strikeouts per year have a trend that is much steadier, and it’s increasing. The graph displayed above is also a differenced first-order moving average process, ARIMA(0,1,1). Its equation looks very similar to the last two, so I won’t write it out. The parameters can be found in the SAS output appendix I have for this page. The forecast has a definite increase in total strikeouts over the next few years. Just like the HR per year time series, the time series of Ks is best analyzed by looking at K/Gm. The K/Gm time series turns out to be a different model than the first three models, because it is just a random walk around a linear trend [a random walk with drift]. This process has random shocks around a positive trend with no ‘memory’ of the past shocks like the other three models had. This model for K/Gm, ARIMA(0,1,0), looks a little different than the ARIMA(0,1,1) models seen earlier since there is no lagged $latex a_{t-1}&s=1$ term. The ARIMA(0,1,0) model is given by the following equation:

$latex \nabla y_t = \mu + a_t&s=2$

and the forecast equation with parameters in it would be:

$latex y_{t+1} = 0.11637 + y_t + a_{t+1}&s=2$

This indicates that the K/Gm will increase by 0.11637 every year on average. Obviously, since there are only 54 outs in a baseball game, this trend can’t go on forever. As of the beginning of August 2014, the current K/Gm is 15.4, and it was forecasted to be 15.2497, which is within the 50% CI of the forecast.
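The drift model’s forecast is just repeated addition of the drift term. A minimal sketch, starting from a hypothetical K/Gm level:

```python
def kgm_forecast(current, years_ahead, drift=0.11637):
    """h-step point forecast for the random-walk-with-drift [ARIMA(0,1,0)]
    model: each year adds mu = 0.11637 K/Gm on average."""
    return current + drift * years_ahead

# Hypothetical starting level of 15.25 K/Gm:
for h in range(1, 4):
    print(h, round(kgm_forecast(15.25, h), 2))
```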

While these models can make predictions about baseball, I wouldn’t consider these the best [or even good] models for forecasting, since we could incorporate other variables or improve the granularity of the forecast to individual players. There also isn’t much value in saying there’ll be more strikeouts in 2014 than 2013. However, this example is a good academic exercise in understanding how univariate time series work. And hopefully it provides some insight into both time series and a little bit about trends in baseball.

# Pirates Do Not Need Help Against LHP

Stats in this post are current up to right before the July 31, 2014 PIT-ARZ game.

The MLB non-waiver trade deadline just passed. I’m not interested in debating what teams should or should not have done, except to say the price for quality players was very high this year. The whole supply & demand, free market thing really worked in the favor of teams that were already out of the post season race. It was suggested that the Pirates needed a right-handed batter (RHB), since they don’t do well against left-handed pitching (LHP). I had my doubts this was really true, and that adding a good RHB would improve the team beyond what general improvement you could expect from that batter. MLB teams generally do better against LHP, since most batters are RHB and the RHB/LHP split favors the batter.

Before getting into this, LHP account for only 21% of the Pirates’ season-to-date plate appearances. Out of all the problems the Pirates could have, making a roster move to address this one isn’t necessary unless you are looking to platoon. More on that later.

Looking at the team batting splits, the Pirates have an overall .722 OPS and a .670 OPS against LHP. On the surface, it appears they are performing worse against LHP, and I will concede that the Pirates HAVE performed worse against LHP so far in 2014, but this shouldn’t continue going forward.

The Pirates have 4,152 plate appearances racked up thru July 30th, but only 867 of them have occurred against LHP (~21%). To put this in perspective, that is equivalent to less than one month of games. How accurate are batting statistics at the end of April? They aren’t. Put simply, the Pirates’ ‘struggles’ against LHP can mostly be attributed to a small sample size.

I went and laid out all the outcomes (1B, BB, 2B, etc.) in a vector of plate appearances, had the computer randomly draw 900-plate-appearance samples from the entire Pirates season, and computed the OPS 1,000 different times. Then I plotted them below.

Due to the central limit theorem, the mean should hover around .720 (the overall OPS) and the data should be normally distributed. Because of this, I constructed the normal distribution curve and then used it to calculate the probability that a 900-plate-appearance sample drawn from the Pirates’ total plate appearances produces a given OPS. It turns out that 9% of the time the program will select plate appearances that total a < .670 OPS. 9% isn’t that likely, but it is not outrageous, so it’s reasonable to conclude the Pirates’ low vs-LHP OPS is due to small sample size. This is not just applicable to LHP vs overall splits, but any low-count split, including RISP. I wrote about this previously and came to a similar conclusion.
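The resampling procedure can be sketched as follows. The outcome counts are rough stand-ins for the Pirates’ totals (not the real 2014 splits), sampling is done without replacement, and HBP/SF are ignored in the OPS calculation for simplicity:

```python
import random

# Hypothetical outcome vector: one entry per plate appearance (~4,152 total).
outcomes = (["1B"] * 700 + ["2B"] * 200 + ["3B"] * 20 + ["HR"] * 100 +
            ["BB"] * 350 + ["OUT"] * 2782)

TB = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}  # total bases per hit type

def ops(sample):
    """OBP + SLG for a list of PA outcomes (ignoring HBP/SF for simplicity)."""
    hits = sum(1 for o in sample if o in TB)
    walks = sample.count("BB")
    ab = len(sample) - walks                 # walks aren't at-bats
    total_bases = sum(TB.get(o, 0) for o in sample)
    return (hits + walks) / len(sample) + total_bases / ab

random.seed(1)
draws = [ops(random.sample(outcomes, 900)) for _ in range(1000)]
low = sum(1 for d in draws if d < 0.670) / len(draws)
print(f"mean OPS: {sum(draws) / len(draws):.3f}, P(OPS < .670): {low:.1%}")
```

Even though every sample comes from the same season, the 900-PA draws scatter widely around the overall OPS, which is the whole small-sample-size point.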

The composite distribution curves below illustrate what happens as sample size increases and why small sample sizes are problematic. The vertical line is the .670 OPS mark. On the 900-sample distribution (vs LHP) there is a 9% probability of drawing a .670 OPS from the Pirates’ total plate appearances. This is the area underneath the curve to the left of the line. Using the 3000-sample distribution curve, it’s 0.0016%. There is barely any area under the 3000-PA curve at that point, and this is a huge difference. (3,000 samples is approximately how many plate appearances the team has against RHP.) One more graph! This is a histogram of the differences between the LHP OPS and the overall OPS. The Pirates are on the low end of it. Not great, but there’s a lot of variation there. Switching from statistics to baseball, the Pirates have the second-fewest plate appearances against LHP in MLB. They are 11-9 in games started by a LHP. That alone should discount the poor-performance-against-LHP argument, but obviously the team batting stats suggest that they are struggling and it has been woven into a narrative.

Looking closely at the Pirates’ roster, there are many solid RHBs: McCutchen [their best hitter], Martin, Marte, Sanchez, and Mercer/Harrison are all pretty good against lefties. Now, some of these players are underperforming against LHP this year, but this is where the small sample size comes in again. You wouldn’t determine any of these batters lost their platoon advantage after only 80 plate appearances. Going forward, almost all of these bats should regress to their normal platoon splits.

Then there are Pedro Alvarez, Gregory Polanco, and Ike Davis. Their platoon splits are pretty atrocious both for 2014 and career-wise. For example, Alvarez has a .787 OPS vs RHP and a .517 OPS against LHP this year. I don’t want to get into analyzing what’s wrong with the Pirates’ left-handed bats, except to say they are terrible against LHP. The argument should change from ‘the Pirates don’t do well against LHP’ to ‘the Pirates’ left-handed batters are terrible against LHP’.

What can be done about this? The simple answer is to get better left-handed batters. Since that’s not really possible, the next best option would be platooning the left-handed batters. Ike Davis is already platooned with Gaby Sanchez, and Pedro Alvarez is barely starting any games. Polanco has regressed from his debut, but I think the best idea is for him to play every day and deal with LOOGY relievers. I also don’t know how many fans actually want to see him platooned or are suggesting that he should be. With all this in mind, I’m not quite sure what acquiring a right-handed bat would accomplish. The Pirates are already trying to find a place for RHB Josh Harrison to play. He’s been having a good season, no matter what you think about Harrison. Furthermore, the Pirates have a guy who’s been killing LHP this year and has decent splits against them for his career. And that’s Jose Tabata.

Bottom line, adding a RHB wouldn’t help much because the team splits against LHP are still a small sample size. Beyond the statistics, the two big left-handed bats have terrible splits against LHP, and those problems have already been addressed by platooning and benching.

# MLB — Poisson Distribution To Model Runs Scored Per Inning

I have recently written a much more mathematically involved post using the negative binomial and wrote up a discrete probability distribution primer. These are a more complete treatment of the topic. However, this post is a good overview of the basics.

My friend sparked my recent interest in Poisson distributions by mentioning how rare it is to meet a romantic interest/significant other that you’ll have a long-term relationship with, as opposed to going out for just a few dates or not dating at all. I immediately thought about earthquakes. It’s strange, but makes some sense, since large-impact earthquakes are both very unpredictable and rare, much like dating. I’d love to show this actually happens, but since I can’t download relationship data, I’ve found something almost as good: baseball data!

A Poisson distribution [pronunciation] is used for count data and rare events over a specified time/area. This is in contrast to the more familiar bell-curve normal distribution, which uses continuous data. [For math/science people, it’s a decaying exponential.] A few good potential models using a Poisson distribution are the number of sick days a person uses throughout a year or traffic accidents per month on a certain stretch of road. Earthquake frequency modeling is probably one of the more famous uses of a Poisson distribution.

Getting back to baseball, runs are not common events, though I wouldn’t go so far as to call them rare events. However, in the context of individual innings, runs are rare. Going back to a previous post about the Pirates’ run probability, any given team in MLB only has a 26% chance of scoring in any given inning. This means that roughly 74% of the time you are watching baseball, you are watching the teams not score. I am interested in how often a team will score 0, 1, 2, 3 or more runs in an inning. To determine the probability that a certain number of runs are scored in any inning, a Poisson distribution can be used, and it follows the general form:

$latex P(X) = \frac{e^{-\lambda}\lambda^X}{X!}&s=2$

Substituting the Run Expectancy for the beginning of an inning, which is .4615 runs per inning in 2013, for the $latex \lambda&s=1$ term, you will get the red distribution line below. [Run Expectancy/Expected Runs is a fancy way to say the average runs for a given situation.] The blue area represents the actual run frequencies, and the gold line is the distribution which I obtained from regression. The Poisson distribution describes how often runs are scored during innings pretty well, but it’s not perfect. The trend line underestimates the shutout and big-run innings, while overestimating the one-run innings. The model shown above is suffering from overdispersion, which means the variance [how spread out the data is] is larger than what the model assumes. The short reason for the lack of fit is that baseball isn’t completely random. You’ll have better teams who score multiple runs in an inning against poor teams who will in turn fail to score any runs in an inning. The disparity in teams will cause a wider variance in run scoring.
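The Poisson probabilities are easy to compute directly from the formula. Plugging in λ = .4615 gives about a 63% chance of a scoreless inning under the model, which is below the roughly 74% rate seen empirically, the underestimated-shutouts issue described above:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = e^(-lam) * lam^k / k! for a Poisson distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 0.4615  # 2013 run expectancy at the start of an inning
for k in range(4):
    print(k, round(poisson_pmf(k, lam), 4))
```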

The red line in the graph above is a distribution I obtained when I regressed the count data against the number of runs and obtained a ‘new’ mean. This distribution is a little bit closer to the empirical data, though it still suffers from overdispersion. I’ve put all the counts and frequencies/probabilities into a table so that it is easier to reference. If you wanted to calculate the probability that you would see an entire game (full 9 innings) with 7 or more runs in an inning (like last night’s Pirates game), you would use the following formula:

$latex P(X) = 1 - P(\text{no 7+ run inning})^{18}&s=2$

$latex = 1 - (1 - [.0009 + .0003 + .0001])^{18} = .02314&s=2$

So there’s a roughly 2% chance that any baseball game you attend will have an inning with 7 or more runs scored in it.
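That back-of-the-envelope calculation can be checked in a couple of lines, using the per-inning probabilities quoted above:

```python
# Probability that at least one of a game's 18 half-innings is a 7+ run
# inning, given the per-inning probabilities from the table.
p_seven_plus = 0.0009 + 0.0003 + 0.0001   # P(7), P(8), P(9+) in one inning
p_game = 1 - (1 - p_seven_plus) ** 18
print(round(p_game, 5))  # 0.02314
```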