Tag Archives: Poisson

Negative Binomial Fit

MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in the post for reference.

The Baseball Side

A team in the American League will average .4830 runs per inning, but does this mean they will score a run every two innings? This seems intuitive if you apply math from Algebra I [1 run / 2 innings ~ .4830 runs/inning]. However, if you attend a baseball game, the vast majority of innings you’ll watch will be scoreless. This large number of scoreless innings can be described by discrete probability distributions that account for teams scoring none, one, or multiple runs in one inning.

Runs in baseball are considered rare events and count data, so they will follow a discrete probability distribution if they are random. The overall goal of this post is to describe the random process that arises with scoring runs in baseball. Previously, I’ve used the Poisson distribution (PD) to describe the probability of getting a certain number of runs within an inning. The Poisson distribution describes count data like car crashes or earthquakes over a given period of time and defined space. This worked reasonably well to get the general shape of the distribution, but it didn’t capture all the variance that the real data set contained. It predicted fewer scoreless innings and many more 1-run innings than what really occurred. The PD makes an assumption that the mean and variance are equal. In both runs per inning and runs per game, the variance is about twice as much as the mean, so the real data will ‘spread out’ more than a PD predicts.

Negative Binomial Fit

The graph above shows an example of the application of count data distributions. The actual data is in gray and the Poisson distribution in yellow. It’s not a terrible way to approximate the data or to conceptually understand the randomness behind baseball scoring, but the negative binomial distribution (NBD) works much better. The NBD is also a discrete probability distribution, but it finds the probability of a certain number of failures occurring before a certain number of successes. It would answer the question, what’s the probability that I get 3 TAILS before I get 5 HEADS when I continue to flip a coin. This doesn’t at first intuitively seem like it relates to a baseball game or an inning, but that will be explained later.

From a conceptual standpoint, the two distributions are closely related. So if you are trying to describe why 73% of all MLB innings are scoreless to a friend over a beer, either will work. I’ve plotted both distributions for comparison throughout the post. The second section of the post will discuss the specific equations and their application to baseball.

Runs per Inning

Because of the difference in rules regarding the designated hitter between the two leagues, there will be a different expected value [average] and variance of runs/inning for each league. I separated the two leagues to get a better fit for the data. Using data from 2011-2013, the American League had an expected value of 0.4830 runs/inning with a 1.0136 variance, while the National League had 0.4468 runs/inning as the expected value with a .9037 variance. [So NL games are shorter and more boring to watch.] Using only the expected value and the variance, the negative binomial distribution [the red line in the graph] approximates the distribution of runs per inning more accurately than the Poisson distribution.

Runs Per Inning -- 2011-2013

It’s clear that there are a lot of scoreless innings and very few innings with multiple runs scored. This distribution allows someone to calculate the probability of an MLB team scoring more than 7 runs in an inning, or the probability that the home team forces extra innings when down by a run in the bottom of the 9th. Using a pitcher’s expected runs/inning, the NBD could be used to approximate the pitcher’s chances of throwing a shutout, assuming he pitches all 9 innings.
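As a rough sketch of that kind of calculation, the snippet below uses the AL runs/inning figures quoted above and the mean/variance parameterization from the math section of this post. At zero runs the negative binomial reduces to p^r, with p = mean/variance and r = mean²/(variance − mean). Two assumptions are mine, not from the data: it uses league-wide numbers rather than any particular pitcher’s, and it treats innings as independent.

```python
# AL runs/inning, 2011-2013 (figures from the post)
AL_MEAN, AL_VAR = 0.4830, 1.0136

# Negative binomial parameters from the mean and variance
p = AL_MEAN / AL_VAR                    # probability of "success" (getting out of the inning)
r = AL_MEAN ** 2 / (AL_VAR - AL_MEAN)   # non-integer number of successes

# P(X = 0) for a negative binomial is p ** r
p_scoreless_inning = p ** r

# Assumption: innings are independent, so nine scoreless innings multiply
p_nine_scoreless = p_scoreless_inning ** 9

print(f"scoreless inning:        {p_scoreless_inning:.1%}")
print(f"nine scoreless in a row: {p_nine_scoreless:.1%}")
```

Swapping in a specific pitcher’s mean and variance of runs allowed per inning would give a pitcher-level estimate, provided the variance is larger than the mean.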

Runs Per Game

The NBD and PD can be used to describe the runs scored in a game by a team as well. Once again, I separated the AL and NL, because the AL had an expected run value of 4.4995 runs/game and a 9.9989 variance, and the NL had 4.2577 runs/game expected value and 9.1394 variance. This data is taken from 2008-2013. I used a larger span of years to increase the total number of games.

Runs Per Game 2008-2013

Even though MLB teams average more than 4 runs in a game, the single most likely run total for one team in a game is actually 3 runs. The negative binomial distribution once again modeled the distribution well, but the Poisson distribution fit much worse here than it did in the previous graph. Both models, however, underestimate the shut-out rate. A remedy for this is to adjust for zero-inflation. This would increase the likelihood of getting a shut out in the model and adjust the rest of the probabilities accordingly. The need for zero-inflation suggests that baseball scoring isn’t completely random. A manager is more likely to use his best pitchers to continue a shut out rather than randomly assign pitchers from the bullpen.
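For reference, one common way to write a zero-inflated model is as a mixture of an “always zero” outcome and the ordinary negative binomial. This is a generic sketch rather than something fitted in this post, and the mixing weight $latex \pi&s=1$ is a hypothetical extra parameter that would have to be estimated from the data:

$latex P_{ZI}(X=0) = \pi + (1-\pi) P_{NB}(X=0)&s=2$

$latex P_{ZI}(X=k) = (1-\pi) P_{NB}(X=k), \quad k \geq 1&s=2$

With $latex \pi > 0&s=1$ the shut-out probability goes up and every other run total is scaled down by the same factor, so the probabilities still sum to 1.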

Hits Per Inning

It turns out the NBD/PD are useful in many other baseball statistics like hits per inning.

Hits Per Inning 2011-2013

The distribution for hits per inning is somewhat similar to runs per inning, except the expected value is higher and the variance is lower relative to it. [AL: .9769 hits/inning, 1.2847 variance | NL: .9677 hits/inning, 1.2579 variance (2011-2013)] Since the variance is much closer to the expected value, the hits per inning distribution has more values in the middle and fewer at the extremes than the runs per inning distribution.
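A quick way to see how far each statistic is from the Poisson assumption is the variance-to-mean ratio, which would be 1 for a true Poisson process. The sketch below only uses the means and variances quoted in this post; the dictionary and variable names are my own.

```python
# (mean, variance) pairs quoted in the post
league_stats = {
    "AL runs/inning": (0.4830, 1.0136),
    "NL runs/inning": (0.4468, 0.9037),
    "AL runs/game":   (4.4995, 9.9989),
    "NL runs/game":   (4.2577, 9.1394),
    "AL hits/inning": (0.9769, 1.2847),
    "NL hits/inning": (0.9677, 1.2579),
}

# Variance-to-mean ratio: 1.0 means "Poisson-like", above 1.0 means overdispersed
dispersion = {label: var / mean for label, (mean, var) in league_stats.items()}

for label, ratio in dispersion.items():
    print(f"{label:15s} variance/mean = {ratio:.2f}")
```

The runs ratios come out around 2, while the hits ratios sit much closer to 1, which is why the Poisson curve does a more reasonable job on hits than on runs.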

I could spend all day finding more applications of the NBD and PD, because there are a lot of examples within baseball. Understanding how these discrete distributions work will help you understand how the game works, and they could be used to model outcomes within baseball.

The Math Side

Hopefully, you skipped down to this section right away if you are curious about the math behind this. I’ve compiled the numbers used in the graphs for the American League above for those curious enough to look at examples of the actual values.

The Poisson distribution is given by the equation:

$latex P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}&s=2$

There are two parameters for this equation: expected value [$latex \lambda&s=1$] and the number of runs you are looking to calculate the probability for [$latex x&s=1$]. To determine the probability of a team scoring exactly three runs in a game, you would set $latex x = 3&s=1$ and using the AL expected runs per game you’d calculate:

$latex P(X = 3) = \frac{e^{-4.4995}4.4995^3}{3!} = 16.87\% &s=2$

This is repeated for the entire set of $latex x&s=1$ = {0, 1, 2, 3, 4, 5, 6, … } to get the Poisson distribution used throughout the post.
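The repeated calculation can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the code used to make the graphs; the function name and the cutoff at 15 runs are my own choices.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events when the expected count is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

AL_RUNS_PER_GAME = 4.4995  # AL expected runs/game, 2008-2013 (from the post)

# Probability of each run total from 0 through 15 (arbitrary cutoff)
poisson_dist = {x: poisson_pmf(x, AL_RUNS_PER_GAME) for x in range(16)}

for x, prob in poisson_dist.items():
    print(f"{x:2d} runs: {prob:6.2%}")
```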

One of the assumptions the PD makes is that the mean and the variance are equal. For these examples, this assumption doesn’t hold true, so the empirical data from actual baseball results is overdispersed and doesn’t quite fit the PD. The NBD accounts for the variance by including it in the parameters.

The negative binomial distribution is usually symbolized by the following equation:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

where $latex r&s=1$ is the number of successes, $latex k&s=1$ is the number of failures, and $latex p&s=1$ is the probability of success. A key restriction is that a success has to be the last event in the series of successes and failures.

Unfortunately, we don’t have a clear value for $latex p&s=1$ or a clear concept of what will be measured, because the NBD measures the probability of binary Bernoulli trials. It helps to view this problem from the vantage point of the fielding team or pitcher: a SUCCESS is defined as getting out of the inning or game, and a FAILURE is allowing 1 run to score. This conforms to the restriction by having a success [getting out of the inning/game] be the final event of the series.

In order to make this work, the NBD needs to be parameterized differently, in terms of the mean, the variance, and the number of runs allowed [failures]. The following equations are derived from the mean and variance equations of a negative binomial. $latex \alpha&s=1$ represents the ‘odds in favor’ of getting out of the inning, and $latex r&s=1$ is the expected value multiplied by the ‘odds in favor’, which yields a real, generally non-integer value for the number of successes. The NBD can then be written as

$latex P(X=k) = \frac{\Gamma(k+r)}{\Gamma(k+1)\Gamma(r)} (\frac{\alpha}{1+\alpha})^{r} (\frac{1}{1+\alpha})^{k}&s=2$

where

$latex r = \text{Expected Value} \times \alpha; \quad \alpha = \frac{\text{Expected Value}}{\text{Variance} - \text{Expected Value}}&s=2$

So using the same example as the PD distribution, this would yield:

$latex r = 4.4995 \times 0.8182 = 3.6815 ; \quad \alpha = \frac{4.4995}{9.9989 - 4.4995} = 0.8182&s=2$

$latex P(X=3) = \frac{\Gamma(3+3.6815)}{\Gamma(3+1)\Gamma(3.6815)} (\frac{0.8182}{1+0.8182})^{3.6815} (\frac{1}{1+0.8182})^{3}&s=2$

$latex \approx 14.36\% &s=2$

The above equations are adapted from this blog about negative binomials and this one about applying the distribution to baseball. The $latex \Gamma &s=1$ function is used in the equation instead of a combination operator because the combination operator, specifically the factorial, can’t handle the non-whole numbers we are using to describe the number of successes, while the gamma function extends the factorial to all positive real numbers.
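Here is a minimal Python sketch of the mean/variance form of the NBD described above. It works with log-gamma values (`math.lgamma`) so the gamma terms don’t overflow for larger run totals. The function names are my own, and this is an illustration of the equations rather than the exact code behind the graphs.

```python
import math

def nb_params(mean: float, variance: float) -> tuple[float, float]:
    """Return (alpha, r) from a mean and variance, per the equations above."""
    if variance <= mean:
        raise ValueError("this parameterization needs variance > mean")
    alpha = mean / (variance - mean)   # 'odds in favor' of getting out
    r = mean * alpha                   # non-integer number of successes
    return alpha, r

def nb_pmf(k: int, mean: float, variance: float) -> float:
    """Probability of exactly k runs (failures) under the fitted NBD."""
    alpha, r = nb_params(mean, variance)
    p = alpha / (1 + alpha)            # probability of success
    # log of Gamma(k + r) / (Gamma(k + 1) * Gamma(r))
    log_coeff = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coeff + r * math.log(p) + k * math.log(1 - p))

AL_MEAN, AL_VAR = 4.4995, 9.9989  # AL runs/game, 2008-2013 (from the post)

alpha, r = nb_params(AL_MEAN, AL_VAR)
print(f"alpha = {alpha:.4f}, r = {r:.4f}")
for k in range(11):
    print(f"{k:2d} runs: {nb_pmf(k, AL_MEAN, AL_VAR):6.2%}")
```

The same two functions work for the per-inning numbers by passing the runs/inning mean and variance instead.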

Conclusion

The negative binomial distribution is really useful in modeling the distribution of discrete count data from baseball for a given inning or game. The most interesting aspect of the NBD is that a success is considered getting out of the inning/game, while a failure would be letting a run score. This is a little counterintuitive if you approach modeling the distribution from the perspective of the batting team. While the NBD has a better fit, the PD has a simpler concept to explain: the count of discrete events over a given period of time, which might make it better to discuss over beers with your friends.

The fit of the NBD suggests that run scoring is a negative binomial process, but inconsistencies, especially with shut outs, indicate that elements of the game aren’t completely random. My explanation for the underestimated number of shut outs is that managers use their best relievers more often in shut out games than in other games, which increases the total number of shut outs and decreases the frequency of other run totals.

All MLB data is from retrosheet.org. It’s available free of charge from there. So please check it out, because it’s a great data set. If there are any errors or if you have questions, comments, or want to grab a beer to talk about the Poisson distribution please feel free to tweet me @seandolinar.

Twitter Retweet Analysis

Twitter Retweet Decay

This uses the same data set I obtained from my NU Data Mining final project [summary].

Recently, @MLBcathedrals tweeted a photo I submitted to them:

I got a bunch of retweet/favorite notifications at first then I got fewer as the day went on. Now a month later, I’ll get a favorite or retweet [RT] notification every so often. The process of getting a retweet follows a Poisson process, where there is a discrete and somewhat small outcome that can be thought of as count data — you can count retweets per minute.

I used the tweets I had lying around from my project and pulled out several collections of native RTs that had their first RT in the data set and a high volume of retweets. This was to ensure I had the first part of the tweet’s life and hadn’t just captured it in the middle. Time is measured in seconds from the first retweet event. This simplifies things by giving each collection of RTed tweets a time base relative to when RTing started.

The common time base enabled me to make this comparison chart of different RTed tweets:

Twitter Retweet Analysis

Not every RT pattern is the same. Some have many more RTs, and some take a little while to get momentum, but generally they start off strong, then slowly die out, so the cumulative count flattens out roughly like a logarithmic curve. The total number of RTs over time is interesting, but this problem works better if we look at the rate of RTing. The RT pattern flattens out because the RTing rate steadily decreases over time. This makes intuitive sense if you have ever used Twitter: people react to things as they happen, then rarely go back to them.

It turns out you can mathematically model the RT rate with a Poisson generalized linear model reasonably well. The following three graphs show the actual RT rate data points as red dots, the expected value regression as the black line, and a probability range as blue bands.

Twitter Retweets Per Minute

The model for this particular RTed tweet is described by the equation:

$latex \ln(E[Y|t]) = 2.980154 - 0.0017236*t&s=1$

$latex Y&s=1$ is the number of RTs per minute. $latex t&s=1$ is time. And $latex E[Y|t]&s=1$ is read as the expected value of $latex Y&s=1$ given $latex t&s=1$, or the average number of RTs per minute you would expect at a given time. The constant [2.980154] is the natural log of the expected rate at t=0 [about 19.7 RTs per minute], and the negative regression coefficient [-0.0017236] indicates that the rate will decline. This regression line represents the expected value, which is essentially an average of possible outcomes. Using the Poisson distribution and the expected value, I constructed a probability distribution showing a band where 50% of all data points should be located, and another band that should encompass 90% of them.

The bands, in my opinion, are more important than the regression line, because we are dealing with count data. So having an expected value of say 2.5342 doesn’t mean much if you don’t know the probability for getting a value of 0, 1, 2, 3, 4, etc. For this reason the last graph in the series of three has only the actual data points in the probable area bands. For each minute, the data point has a 50% chance to be in the dark blue region, a 40% chance to be in the light blue region [90%-50%], and a 10% chance to be in the white.
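To make the bands concrete, here is a small sketch that turns the fitted equation into an expected rate and then into Poisson probability bands. It uses the two coefficients from the model above, but the band construction is my assumption: I take equal-tailed Poisson quantiles (for example the 25th and 75th percentiles for the 50% band), which may not be exactly how the graphs were built. `t` is in whatever time unit the model was fitted with.

```python
import math

INTERCEPT = 2.980154   # from the fitted model above
SLOPE = -0.0017236     # from the fitted model above

def expected_rate(t: float) -> float:
    """Expected RTs per minute at time t: exp(intercept + slope * t)."""
    return math.exp(INTERCEPT + SLOPE * t)

def poisson_quantile(q: float, lam: float) -> int:
    """Smallest count x such that P(X <= x) >= q for a Poisson(lam)."""
    x = 0
    pmf = math.exp(-lam)   # P(X = 0)
    cdf = pmf
    while cdf < q:
        x += 1
        pmf *= lam / x     # P(X = x) from P(X = x - 1)
        cdf += pmf
    return x

def band(t: float, coverage: float) -> tuple[int, int]:
    """Equal-tailed band (assumption) expected to hold about `coverage` of points."""
    lam = expected_rate(t)
    tail = (1 - coverage) / 2
    return poisson_quantile(tail, lam), poisson_quantile(1 - tail, lam)

for t in (0, 300, 600, 1200):
    print(f"t={t:5d}  E[Y|t]={expected_rate(t):6.2f}  "
          f"50% band={band(t, 0.50)}  90% band={band(t, 0.90)}")
```

Because counts are whole numbers, the bands move in steps and usually cover slightly more than their nominal 50% or 90%.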

This is all predicated on RTing being a random process with a fixed audience. This describes most of the RTs I looked at fairly well, but there will always be other factors such as viral growth and time of day. Viral growth means it starts off slow then grows large. If this were to happen, it would not follow this pattern; it would look more like an S. For better or worse, most RTs come from accounts with a large number of followers, so they aren’t actually viral; they are propagations of already popular tweets.

This specific regression by itself won’t predict how many RTs a tweet might get before it’s tweeted, but it describes what happens after people have begun retweeting.

Poisson MLB Graph

MLB — Poisson Distribution To Model Runs Scored Per Inning

I have recently written a much more mathematically involved post using the negative binomial and wrote up a discrete probability distribution primer. These are a more complete treatment of the topic. However, this post is a good overview of the basics.

My friend sparked my recent interest in Poisson distributions by mentioning how rare it is to meet a romantic interest/significant other that you’ll have a long-term relationship with, as opposed to going out for just a few dates or not even dating at all. I immediately thought about earthquakes. It’s strange, but makes some sense, since the large-impact earthquakes are both very unpredictable and rare, much like dating. I’d love to show this actually happens, but since I can’t download relationship data, I’ve found something almost as good: baseball data!

A Poisson distribution [pronunciation] is used for count data and rare events over a specified time/area. This is in contrast to the more familiar bell-curve normal distribution which uses continuous data. [For math/science people, it’s a decaying exponential] A few good examples of potential models using a Poisson distribution are the number of sick days a person uses throughout a year or traffic accidents per month on a certain stretch of road. Earthquake frequency modeling is probably one of the more famous uses of a Poisson distribution.

Getting back to baseball, runs are not common events, but I wouldn’t go so far as to call them rare events. However, in the context of individual innings, runs are rare. Going back to a previous post about the Pirates’ run probability, any given team in MLB only has a 26% chance of scoring in any given inning. This means that 73% of the time you are watching baseball you are watching the teams not score. I am interested in how often a team will score 0, 1, 2, 3 or more runs in an inning. To determine the probability that a certain number of runs are scored in any inning a Poisson distribution can be used, and it follows the general form:

[latex]P(X) = \frac{e^{-\lambda}\lambda^X}{X!} [/latex]

Substituting the Run Expectancy for the beginning of an inning, which was .4615 runs an inning in 2013, for the [latex] \lambda [/latex] term, you will get the red distribution line below. [Run Expectancy/Expected Runs is a fancy way to say the average runs for a given situation.] The blue area represents the actual run frequencies, and the gold line is the distribution which I obtained from regression.

Poisson MLB Graph

The Poisson distribution describes how often runs are scored during innings pretty well, but it’s not perfect. The trend line underestimates the shutout and big-run innings, while overestimating the one-run innings. The model shown above is suffering from overdispersion, which means the variance [how spread out the data is] is larger than what the model assumes. The short reason to account for the lack of fit is that baseball isn’t completely random. You’ll have better teams who score multiple runs in an inning against poor teams who will in turn fail to score any runs in an inning. The disparity in teams will cause a wider variance in run scoring.
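The size of that mismatch is easy to check for the first two outcomes. The sketch below plugs the 2013 run expectancy into the Poisson formula and compares the scoreless probability with the roughly 73% scoreless rate quoted earlier in this post; the variable names are mine.

```python
import math

LAMBDA_2013 = 0.4615        # run expectancy at the start of an inning, 2013 (from the post)
OBSERVED_SCORELESS = 0.73   # approximate share of scoreless innings quoted in the post

# Poisson probabilities for 0 and 1 runs: e^-lambda and lambda * e^-lambda
poisson_scoreless = math.exp(-LAMBDA_2013)
poisson_one_run = LAMBDA_2013 * math.exp(-LAMBDA_2013)

print(f"Poisson P(0 runs): {poisson_scoreless:.1%}")
print(f"Poisson P(1 run):  {poisson_one_run:.1%}")
print(f"Observed scoreless rate minus Poisson: {OBSERVED_SCORELESS - poisson_scoreless:+.1%}")
```

The Poisson model puts noticeably fewer innings at zero runs than actually occur, which is the overdispersion described above.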

The red line in the graph above is a distribution I obtained when I regressed the count data against the number of runs and obtained a ‘new’ mean. This distribution is a little bit closer to the empirical data, though it still suffers overdispersion.

 

MLB Poisson Table

I’ve put all the counts and frequencies/probabilities into a table so that it is easier to reference. If you wanted to calculate the probability that a full 9-inning game you attend [18 half-innings] would include at least one inning with 7 or more runs (like last night’s Pirates game), you would use the following formula:

[latex] P(\text{at least one 7+ run inning}) = 1 - P(\text{no 7+ run inning})^{18} [/latex]

[latex]= 1 - (1-[.0009+.0003+.0001])^{18} = .02314 [/latex]

So there’s a roughly 2% chance that any baseball game you attend will have an inning with 7 or more runs scored in it.
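The same arithmetic as a short Python sketch, using the three frequencies from the calculation above. It assumes, as the formula does, that all 18 half-innings are played and are independent of each other; the function name is my own.

```python
def p_at_least_one(p_event: float, trials: int) -> float:
    """Chance an event with per-trial probability p_event happens at least once."""
    return 1 - (1 - p_event) ** trials

# Frequencies of the big-inning outcomes used in the formula above
p_big_inning = 0.0009 + 0.0003 + 0.0001

HALF_INNINGS = 18  # 9 innings, two teams batting (assumes the full game is played)

p_game = p_at_least_one(p_big_inning, HALF_INNINGS)
print(f"P(at least one 7+ run inning in a game) = {p_game:.4f}")
```

Changing `HALF_INNINGS` to 9 gives the chance that one particular team has such an inning in a game.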