

One Mean Z-test [with R code]

I’ve included the full R code below, and the data set can be found on UCLA’s Stats Wiki.

Building on finding z-scores for individual measurements or values within a population, a z-test can determine if there is a statistically significant difference between a sample mean and a population mean with a known population standard deviation. [Those conditions are essential for using this test.] The z-test uses z-scores and a normal distribution to determine the probability that the sample mean was drawn randomly from a known population. If the test fails to reject the null hypothesis, the conclusion is that random sampling is likely to have produced the observed difference. If the test rejects the null hypothesis, then the sample is likely the result of non-random sampling [e.g. team captains picking the tallest kids for a basketball game in gym class].

The z-test relies critically on the central limit theorem, which basically states that if you take a sample of n >= 30 from a population [with any distribution] many times over, you’ll get a normal distribution of the sample means. [This needs its own post to explain fully, and there are interesting ways you can program R to illustrate it; a quick sketch follows the chart below.] The sample mean distribution chart is shown below compared to the population distribution. The important concepts to notice here are:

  • the area of both distributions is equal to 1
  • the sample mean distribution is a normal distribution
  • the sample mean distribution is tighter and taller than the population distribution

Central Limit Theorem Comparison to Population Distribution
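A minimal R sketch of the central limit theorem [not the code used for the charts in this post] draws repeated n = 30 samples from a deliberately non-normal population and plots the distribution of the sample means:

#illustrating the CLT with a skewed (exponential) population
set.seed(123)
population <- rexp(25000, rate=1)
sample_means <- replicate(10000, mean(sample(population, size=30)))
hist(sample_means, breaks=50, col='green',
     main='Distribution of sample means (n = 30)', xlab='sample mean')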

For the rest of this post, the sample mean distribution will be used for the z-test, and it is represented in green as opposed to blue. The data I use in this post is height data from this data set; it represents the heights of 25,000 children from Hong Kong. The data doesn’t reflect US adults, but it’s a great normally distributed data set.

The goal of the z-test is to determine whether a sample and its mean are randomly drawn from the population or whether there’s some significant difference. For example, you could use this test to see if the average height of NBA players is statistically significantly different from the general population. While the NBA example is pretty common sense, not every problem will be that clear. Sample size [like in many hypothesis tests] is a huge factor: small sample sizes require huge differences between the sample mean and the population mean to be significant.

For a one-mean z-test, we will be using a one-tailed hypothesis test. The null hypothesis is that there is NO difference between the sample mean and the population mean. The alternate hypothesis tests whether the sample mean is greater. The null and alternate hypotheses are written out as:

  • $latex H_0: \bar{x} = \mu&s=2$
  • $latex H_A: \bar{x} > \mu&s=2$

One-Tailed Z-test

The graph above shows the critical region for a right-tailed z-test. The critical region reflects the area where the z-stat has to fall in order for the test to reject the null hypothesis. The critical region represents a probability equal to the significance level [one minus the stated confidence level]. For example, the critical region for a 95% confidence level has an area [probability] of only 5%. If the sample really does come from the population, there’s only a 5% chance the z-stat lands in that region by random chance. This concept is the basis for almost every hypothesis test.

The z-test uses the z-stat, which is calculated analogously to the z-score, the difference being that it uses the standard error instead of the standard deviation. The two concepts are similar: the standard deviation describes the ‘spread’ of the blue population distribution, while the standard error describes the ‘spread’ of the green sample mean distribution. The z-stat is calculated as:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

The higher the z-stat is, the more certainty there is that the sample mean and the population mean are different. Three things make the z-stat larger:

  • a bigger difference between the sample mean and the population mean
  • a smaller population standard deviation
  • a larger sample size

Example

I have two samples from the data set: one is entirely random, and the other I weighted heavily towards taller people. The null hypothesis is that there’s no difference between the sample mean and the population mean. The alternate hypothesis is that the sample mean is greater than the population mean. The weighted sample is what you might get if you were evaluating the mean height of a basketball team vs the general population. Here are the two n=50 samples and the R code showing how I constructed them using a set random seed of 123.

Unbiased random sample

Tall-biased random sample

#height: vector of 25,000 heights loaded from the UCLA Stats Wiki data set
pop_mean <- mean(height)   #population mean (67.993)
pop_sd <- sd(height)       #population standard deviation (1.902)

#unbiased random sample
set.seed(123)
n <- 50
height_sample <- sample(height, size=n)
sample_mean <- mean(height_sample)

#tall-biased sample: weight the sorted heights so taller people are more likely to be drawn
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)
sample_mean_biased <- mean(height_sample_biased)

The population mean is 67.993, the unbiased sample mean is 68.099, and the tall-biased sample mean is 68.593. Both samples are higher than the population mean, but are they both significantly higher than the mean? To figure this out, we need to calculate the z-stats and find out if those z-stats fall in the critical region using the equation:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

We can substitute and calculate with the population standard deviation [σ] = 1.902:

$latex z_{unbiased} = \frac{68.099 - 67.993}{1.902/\sqrt{50}} = 0.3922 \ \ \ \ z_{tall-biased} = \frac{68.593 - 67.993}{1.902/\sqrt{50}} = 2.229 &s=0$

#random unbiased sample
sample_mean                                       #68.099
z <- (sample_mean - pop_mean)/(pop_sd/sqrt(n))    #z-stat calculation
z

#tall-biased sample
sample_mean_biased                                #68.593
z <- (sample_mean_biased - pop_mean)/(pop_sd/sqrt(n))
z

Knowing that the critical value for a one-tailed z-test at 95% confidence is 1.645, we can quickly determine that the unbiased random sample is not significantly different, but the tall-biased sample is. This is because the z-stat for the unbiased sample is less than the critical value, while the z-stat for the tall-biased sample is higher than the critical value.
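For reference, that 1.645 critical value can be pulled straight from R rather than a z-table; a quick check:

#critical z value for a right-tailed test at the 95% confidence level
qnorm(0.95)    #1.6449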

Failed Z-test Example Comparison

Plotting the z-test for the unbiased sample, the area [probability] to the right of the z-stat is much higher than the accepted 5%. The larger the green area is, the more likely the difference between the sample mean and the population mean was obtained by random chance. To get a z-test to be significant, you want the z-stat to be high so that the area [probability] is low. [In practice, this can be done by increasing sample size.]

Successful Z-test Example

The tall-biased sample mean's z-stat creates a plot with much less area to the right of the z-stat, so these results were much less likely to be obtained by chance. The p-value can be obtained by calculating the area to the right of the z-stat. The R code below summarizes how to do that using R's 'pnorm' function.

#calculating the p-value
p_yellow2 <- pnorm(z)         #area to the left of the z-stat
p_green2 <- 1 - p_yellow2     #area to the right of the z-stat [the p-value]
p_green2

The p-value for the unbiased sample is .3474, meaning there's a 34.74% chance the result was obtained due to random chance, while the tall-biased sample has a p-value of only .01291, or a 1.291% chance of being a result of random chance. Since the p-value of the tall-biased sample is less than .05, the null hypothesis is rejected, but since the unbiased sample's p-value is well above .05, the null hypothesis is retained.

What the one-mean z-test accomplished was telling us that a simple random sample from a population wasn't really that different from the population, while a sample that wasn't completely random and was much taller than the overall population was shown to be different. While this test isn't used often, the principles of distributions, calculating test stats, and p-values have many applications within the statistics universe.

Statistics — Probability vs. Odds

Probability and odds are two basic statistical terms that describe the likeliness that an event will occur. They are often used interchangeably in casual conversation or even in published material. However, they are not mathematically equivalent, because they look at likeliness in different contexts. In everyday conversation when numbers or values aren’t given, the two terms are synonymous: if an event has a high probability, then it has high odds of happening. The incorrect usage arises when a person ascribes a mathematical value to either the odds or the probability they are discussing. Hopefully, if you aren’t quite sure what the exact mathematical difference is, this will clear it up for you.

Probability is defined as the fraction of desired outcomes in the context of every possible outcome, with a value between 0 and 1, where 0 would be an impossible event and 1 would represent an inevitable event. Probabilities are usually given as percentages. [ie. 50% probability that a coin will land on HEADS.] Odds can take any value from zero to infinity, and they represent a ratio of desired outcomes versus the field. Odds are a ratio and can be given in two different ways: ‘odds in favor’ and ‘odds against’. ‘Odds in favor’ describe the chance that an event will occur, while ‘odds against’ describe the chance that an event will not occur. If you are familiar with gambling, ‘odds against’ are what Vegas gives as odds. More on that later. For the coin flip, the odds in favor of a HEADS outcome are 1:1, not 50%.

Visual Math

Simple probability of event A occurring is mathematically defined as:

$latex P(A) = \frac{Number \ of \ Event \ A}{Total \ Number \ of \ Events}&s=2$

The best way to illustrate this is with the classic marbles-in-a-bag example. The graphic below depicts all the marbles in an opaque bag that one marble will be pulled out of. There are 6 blue, 3 red, 2 yellow, and 1 green for a total of 12 marbles in the bag.

Bag of Marbles

The probability of pulling a red marble would be calculated by taking the total number of red marbles and dividing it by the total number of marbles.

Probability Red

OR

$latex P(RED) = \frac{3 \ RED \ marbles}{12 \ TOTAL \ marbles} = 25\%&s=2$.

Notice that the probability calculation includes the red marbles in the denominator of the calculation, because probability considers the context of the entire event space. Odds, on the other hand, are the ratio of favorable outcomes to unfavorable outcomes, so the denominator contains ONLY the marbles that aren’t the favorable outcome. Odds use the context of good outcomes versus bad outcomes. Written as fractions, these two values are completely different: the probability is 1/4, while the odds in favor are 1/3. You can see how mistakenly interchanging the terms could give the wrong information. The ‘odds in favor’ of RED would be mathematically calculated by

Odds For Red

OR

$latex Odds\_Favor(RED) = \frac{3 \ RED \ marbles}{9 \ NOT \ RED \ marbles} = 1:3&s=2$.

To find ‘odds against’ you simply flip the ‘odds in favor’ ratio upside down; this describes the odds of the event not occurring.

Odds Against Red

OR

$latex Odds\_Against(RED) = \frac{9 \ NOT \ RED \ marbles}{3 \ RED \ marbles} = 3:1&s=2$.

Gambling

‘Odds against’ are commonly used in the context of gambling. When you hear that the Seattle Seahawks’ Vegas odds to win the Super Bowl are 5:1 [retrieved 9/19/2014], the 5:1 refers to the ‘odds against’ Seattle winning the Super Bowl. Using some quick math, we can determine that the implied probability of Seattle winning the Super Bowl is 1/6 or 16.7%.

Vegas odds are technically payoff odds, because they describe the payout if you were to win the bet. The payout on the Seahawks would win you $5 for every $1 bet on Seattle winning the Super Bowl. They aren’t true odds, since no one really knows what the true odds are, because you can’t simply count and weigh the possibilities like with the bag of marbles. The payoff increases as the event becomes less likely. If you could create a reliable predictive model that told you the Seahawks actually had a 20% probability to win the Super Bowl, you could bet on the Seahawks knowing that their actual probability to win is better than what Vegas is giving them. And if you made enough bets like this, you could beat Vegas.

Mathematical Relationship

I stated earlier that probability and odds are colloquially interchangeable when values aren’t given. This is true because the two are mathematically related: odds can be computed from probability and probability from odds.

$latex P(A) = \frac{Odds\_Favor(A)}{1 + Odds\_Favor(A)}&s=2$

$latex Odds\_Favor(A) = \frac{P(A)}{1 - P(A)}&s=2$

Using the RED marble example [P(RED) = 1/4 and Odds_Favor(RED) = 1/3] we can demonstrate how these are equivalent:

$latex P(RED) = \frac{1/3}{1 + 1/3} = \frac{1/3}{4/3} = \frac{1}{4}&s=2$

$latex Odds\_Favor(RED) = \frac{1/4}{1 - 1/4} = \frac{1/4}{3/4} = \frac{1}{3}&s=2$
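These conversions are simple enough to wrap in a couple of one-line R helpers. This is just a sketch; the function names are my own and not part of base R:

#convert 'odds in favor' (expressed as a single number, e.g. 1/3) to probability
odds_to_prob <- function(odds) odds / (1 + odds)

#convert probability to 'odds in favor'
prob_to_odds <- function(p) p / (1 - p)

odds_to_prob(1/3)    #0.25
prob_to_odds(1/4)    #0.3333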


MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in this post for reference.

The Baseball Side

A team in the American League will average .4830 runs per inning, but does this mean they will score a run every two innings? This seems intuitive if you apply math from Algebra I [1 run / 2 innings ~ .4830 runs/inning]. However, if you attend a baseball game, the vast majority of innings you’ll watch will be scoreless. This large number of scoreless innings can be described by discrete probability distributions that account for teams scoring none, one, or multiple runs in one inning.

Runs in baseball are considered rare events and count data, so they will follow a discrete probability distribution if they are random. The overall goal of this post is to describe the random process that arises with scoring runs in baseball. Previously, I’ve used the Poisson distribution (PD) to describe the probability of getting a certain number of runs within an inning. The Poisson distribution describes count data like car crashes or earthquakes over a given period of time and defined space. This worked reasonably well to get the general shape of the distribution, but it didn’t capture all the variance that the real data set contained. It predicted fewer scoreless innings and many more 1-run innings than what really occurred. The PD makes an assumption that the mean and variance are equal. In both runs per inning and runs per game, the variance is about twice as much as the mean, so the real data will ‘spread out’ more than a PD predicts.

Negative Binomial Fit

The graph above shows an example of the application of count data distributions. The actual data is in gray and the Poisson distribution is in yellow. It’s not a terrible way to approximate the data or to conceptually understand the randomness behind baseball scoring, but the negative binomial distribution (NBD) works much better. The NBD is also a discrete probability distribution, but it finds the probability of a certain number of failures occurring before a certain number of successes. It would answer the question: what’s the probability that I get 3 TAILS before I get 5 HEADS if I continue to flip a coin? This doesn’t at first intuitively seem like it relates to a baseball game or an inning, but that will be explained later.

From a conceptual standpoint, the two distributions are closely related. So if you are trying to describe why 73% of all MLB innings are scoreless to a friend over a beer, either will work. I’ve plotted both distributions for comparison throughout the post. The second section of the post will discuss the specific equations and their application to baseball.

Runs per Inning

Because of the difference in designated hitter rules between the two leagues, there will be a different expected value [average] and variance of runs/inning for each league, so I separated the two leagues to get a better fit for the data. Using data from 2011-2013, the American League had an expected value of 0.4830 runs/inning with a 1.0136 variance, while the National League had an expected value of 0.4468 runs/inning with a 0.9037 variance. [So NL games are shorter and more boring to watch.] Using only the expected value and the variance, the negative binomial distribution [the red line in the graph] approximates the distribution of runs per inning more accurately than the Poisson distribution.

Runs Per Inning -- 2011-2013

It’s clear that there are a lot of scoreless innings and very few innings with multiple runs scored. This distribution allows someone to calculate the probability of an MLB team scoring more than 7 runs in an inning, or the probability that the home team forces extra innings when down by a run in the bottom of the 9th. Using a pitcher’s expected runs/inning, the NBD could be used to approximate the pitcher’s chances of throwing a shutout, assuming he pitches all 9 innings.

Runs Per Game

The NBD and PD can be used to describe the runs scored in a game by a team as well. Once again, I separated the AL and NL, because the AL had an expected value of 4.4995 runs/game with a 9.9989 variance, and the NL had an expected value of 4.2577 runs/game with a 9.1394 variance. This data is taken from 2008-2013; I used a larger span of years to increase the total number of games.

Runs Per Game 2008-2013

Even though MLB teams average more than 4 runs in a game, the single most likely run total for one team in a game is actually 3 runs. The negative binomial distribution once again modeled the distribution well, but the Poisson distribution had a terrible fit when compared to the previous graph. Both models, however, underestimate the shut-out rate. A remedy for this is to adjust for zero-inflation. This would increase the likelihood of getting a shut out in the model and adjust the rest of the probabilities accordingly. An inference of needing zero-inflation is that baseball scoring isn’t completely random. A manager is more likely to use his best pitchers to continue a shut out rather than randomly assign pitchers from the bullpen.

Hits Per Inning

It turns out the NBD/PD are useful in many other baseball statistics like hits per inning.

Hits Per Inning 2011-2013

The distribution of hits per inning is broadly similar to runs per inning, except the expected value is higher and the variance is lower. [AL: .9769 hits/inning, 1.2847 variance | NL: .9677 hits/inning, 1.2579 variance (2011-2013)] Since the variance is much closer to the expected value, the hits per inning distribution has more values in the middle and fewer at the extremes than the runs per inning distribution.

I could spend all day finding more applications of the NBD and PD, because there are really a lot of examples within baseball. Understanding these discrete distributions will help you understand how the game works, and they can be used to model outcomes within baseball.

The Math Side

Hopefully, you skipped down to this section right away if you are curious about the math behind this. I’ve compiled the numbers used in the graphs for the American League above for those curious enough to look at examples of the actual values.

The Poisson distribution is given by the equation:

$latex P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}&s=2$

There are two parameters for this equation: expected value [$latex \lambda&s=1$] and the number of runs you are looking to calculate the probability for [$latex x&s=1$]. To determine the probability of a team scoring exactly three runs in a game, you would set $latex x = 3&s=1$ and using the AL expected runs per game you’d calculate:

$latex P(X = x) = \frac{e^{-4.4995}4.4995^3}{3!} = 16.87\% &s=2$

This is repeated for the entire set of $latex x&s=1$ = {0, 1, 2, 3, 4, 5, 6, … } to get the Poisson distribution used throughout the post.
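If you’d rather not plug into the formula by hand, R’s built-in dpois function performs the same calculation:

#P(exactly 3 runs in a game) using the AL expected value as lambda
dpois(3, lambda=4.4995)    #~0.1687

#the full set of probabilities used for a distribution plot
dpois(0:15, lambda=4.4995)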

One of the assumptions the PD makes is that the mean and the variance are equal. For these examples, the assumption doesn’t hold true, so the empirical data from actual baseball results doesn’t quite fit the PD and is overdispersed. The NBD accounts for the extra variance by including it in its parameters.

The negative binomial distribution is usually symbolized by the following equation:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

where $latex r&s=1$ is the number of successes, $latex k&s=1$ is the number of failures, and $latex p&s=1$ is the probability of success. A key restriction is that a success has to be the last event in the series of successes and failures.

Unfortunately, we don’t have a clear value for $latex p&s=1$ or a clear concept of what will be measured, because the NBD measures the probability of binary, Bernoulli trials. It helps to view this problem from the vantage point of the fielding team or pitcher, because a SUCCESS can be defined as getting out of the inning or game, and a FAILURE as allowing 1 run to score. This conforms to the restriction by having a success [getting out of the inning/game] be the ultimate event of the series.

In order to make this work, the NBD needs to be parameterized differently, in terms of the mean, the variance, and the number of runs allowed [failures]. The following equations are derived from the mean and variance equations of a negative binomial. $latex \alpha&s=1$ represents the ‘odds in favor’ of getting out of the inning, and $latex r&s=1$ is the expected value multiplied by the ‘odds in favor’, which yields a real, non-integer value for the number of successes. The NBD can then be written as

$latex P(X=k) = \frac{\Gamma(k+r)}{\Gamma(k+1)\Gamma(r)} (\frac{\alpha}{1+\alpha})^{r} (\frac{1}{1+\alpha})^{k}&s=2$

where

$latex r = Expected \ Value * \alpha; \ \ \ \alpha = \frac{Expected \ Value}{Variance - Expected \ Value}&s=2$

So using the same example as the PD distribution, this would yield:

$latex r = 4.4995 * 0.8182 = 3.6815 ; \ \ \alpha = \frac{4.4995}{9.9989 - 4.4995} = 0.8182&s=2$

$latex P(X=3) = \frac{\Gamma(3+3.6815)}{\Gamma(3+1)\Gamma(3.6815)} (\frac{0.8182}{1+0.8182})^{3.6815} (\frac{1}{1+0.8182})^{3}&s=2$

$latex = 14.18\% &s=2$

The above equations are adapted from this blog about negative binomials and this one about applying the distribution to baseball. The $latex \Gamma &s=1$ function is used in the equation instead of a combination operator because the combination operator, specifically the factorial, can’t handle the non-whole numbers we are using to describe the number of successes, and the gamma function is a continuous function from 0 to infinity.
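R’s dnbinom function uses the same gamma parameterization [it accepts a non-integer size], so the worked example above can be checked with a few lines like these; the variable names are mine:

#AL runs per game, 2008-2013
m <- 4.4995    #expected value
v <- 9.9989    #variance

alpha <- m / (v - m)          #'odds in favor' of getting out of the game
r     <- m * alpha            #non-integer number of successes
p     <- alpha / (1 + alpha)  #probability of success

#P(3 runs allowed before the game ends); compare with the ~14% computed above
dnbinom(3, size=r, prob=p)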

Conclusion

The negative binomial distribution is really useful in modeling the distribution of discrete count data from baseball for a given inning or game. The most interesting aspect of the NBD is that a success is considered getting out of the inning/game, while a failure is letting a run score. This is a little counterintuitive if you approach modeling the distribution from the perspective of the batting team. While the NBD has a better fit, the PD has a simpler concept to explain, the count of discrete events over a given period of time, which might make it better to discuss over beers with your friends.

The fit of the NBD suggests that run scoring is a negative binomial process, but inconsistencies, especially with shutouts, indicate elements of the game aren’t completely random. My explanation for the underestimated number of shutouts is the increased use of the best relievers in shutout games over other games, which increases the total number of shutouts and subsequently decreases the frequency of other run totals.

All MLB data is from retrosheet.org. It’s available free of charge from there. So please check it out, because it’s a great data set. If there are any errors or if you have questions, comments, or want to grab a beer to talk about the Poisson distribution please feel free to tweet me @seandolinar.


Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference.

A time series is data that has been collected at a regular interval over time. This is rather intuitive given the definition, but time series are different from cross-sectional data, which is the type of data set most people are familiar with. The closing price of a stock is a time series, because it’s a measurement at 4PM every M-F. Cross-sectional data would be looking at which types of stocks gained the most over a quarter in your portfolio; this is one measurement (quarterly change) made for many different stocks. Not every data set fits neatly into a category, and the analysis goal is different for each instrument.

The goal of univariate time series analysis (TSA) is to forecast a variable only using past observations of that variable. In the case of the stock market example, TSA seeks to project what the closing price for the next day will be using data from the specified time frame. However, finance is boring and I wanted a data set that I can extract some insight from, so we’ll be looking at MLB strikeouts (K) per year and home runs (HR) per year as the data sets.

What does a time series look like? If you scroll down or look up a stock market graph, you’ll see one. It’s messy. The series below is one I created, so I can describe its process accurately: it’s a first-order moving average process with a lag-1 coefficient of 0.9 and a series mean of 0. I’ve also included the ordinary linear regression (OLS) trend for the time series, which shows it to have a slightly positive slope. This is a typical analytical technique to show that a time series is moving; in this case the trend is non-significant over these 50 data points. There is no real trend, and the mean is zero.

Moving Average Process Sample
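A series like the one plotted above can be simulated with R’s arima.sim function. This is only a sketch of the idea, not necessarily how the original series was generated:

#simulate 50 points of an MA(1) process with a lag-1 coefficient of 0.9 and mean 0
set.seed(123)
ma_series <- arima.sim(model=list(ma=0.9), n=50)

plot.ts(ma_series, ylab='value', main='Simulated MA(1) process')
abline(lm(ma_series ~ seq_along(ma_series)), lty=2)    #OLS trend line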

The model that corresponds to the graph above has the general form as follows:

$latex y_t = \mu + a_t + \theta_1 a_{t-1}&s=2$

where $latex y&s=1$ is the time-dependent target variable, $latex \mu&s=1$ is the average of the entire series of data, $latex \theta&s=1$ is the regression coefficient, and $latex a&s=1$ is a time dependent shock to the system. The $latex t&s=1$ terms describe which time period the variable is from starting with the most current one, $latex t=50&s=1$.

Before describing the model above, it is important to fully understand what $latex a_t&s=1$ represents. This is a shock term that can encompass a lot of different things. If you are considering something like quarterly earnings, factors influencing the shock term are unemployment, economic growth, marketing campaigns, etc. We are looking at the data in absence of this knowledge, and since we are in the dark, the causes of the shocks appear random. The $latex a_t&s=1$ terms should be normally distributed and not autocorrelated. The expected value should be zero, $latex E[a_t] = 0&s=1$. The expected value is another way to describe the average of all the $latex a_t&s=1$ terms.

Here’s a great way to think about the MA process. Think about a simplified personal monthly budget where you have a constant salary and a modest savings account. Shocks that would be included in the $latex a_t&s=1$ term are unexpected expenses. An unexpected expense can influence the next time period if you have to dip into savings: a high unexpected expense in January would impact the spending in February, because you’d have to pay off your credit card or put money back into savings.

There are many more details to understanding time series such as autocorrelation. Hopefully I’ll write a separate post on that in the future.

Let’s look at some real data. Luckily, I have every play from MLB in a database thanks to retrosheet.org, so we’ll look at some time series from there, specifically HR and Ks per year. Conceptually, for this rudimentary modeling, an MA process makes sense: a shock from the previous year like expansion, steroids, or selection bias would carry over year to year. Looking at the time series graph below, it doesn’t behave like the previous time series that was centered around zero. This time series is considered non-stationary, which means there’s a trend and that trend changes over time. The number of HR per season increased over time up until around 2001, when it leveled off and started to decline. There’s a trend up until 2001 and a trend after it, and they aren’t the same. To get around this, instead of modeling the actual values, the differences between two years of HRs will be modeled. A difference ($latex \nabla&s=1$) is simply $latex y_t - y_{t-1}&s=1$, or the difference in HRs between 2013 and 2012, which would be -279 HRs.

HR Total 1950-2013
The green line is the actual HRs each year. The ‘cantaloupe’ colored lines are the 50% confidence interval (CI) of the forecast. The red line is the forecasted values. I used 50% CIs to show likely deviations, not statistically significant deviations.

The differenced moving average model [ARIMA(0,1,1)] takes the form:

$latex \nabla y_t = \mu + a_t - \theta a_{t-1}&s=2$

Substituting the estimated coefficient for $latex \theta&s=1$ and $latex \mu&s=1$ a forecast can be made with the following equation:

$latex y_{t+1} = \mu + y_t + a_{t+1} - \theta * a_{t}&s=2$

$latex y_{t+1} = 50.11163 + y_t + a_{t+1} - .45073 * a_{t}&s=2$

The last equation is used to generate the forecast line and ultimately the 50% CI lines. The interpretation of this equation is that half of the shock from the previous time period still has an effect on the change to the current period. The forecast predicts that home runs will actually increase over the next few years rather than continue to decline. Looking backwards, the model can be used to identify some years of interest, and I’ve marked those on the graph. Expansion probably has the greatest impact on the number of HRs, because it dilutes the talent pool and increases the total number of games per season. If you wanted to measure the impact training or steroids had on HRs, you’d want to use an HR/game time series [see below] instead of total HRs. [This is total HRs between both teams.]
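The models in this post were fit in SAS, but a comparable ARIMA(0,1,1) fit can be sketched in R. Here hr_by_year is assumed to be a numeric vector of total HRs per season (1950-2013) built from the retrosheet data; note that stats::arima drops the constant term when it does the differencing, so to keep the drift term [mu] the differenced series is fit directly:

#hr_by_year: assumed numeric vector of total HRs per season, 1950-2013
fit <- arima(diff(hr_by_year), order=c(0,0,1), include.mean=TRUE)
fit    #'intercept' plays the role of mu; 'ma1' corresponds to -theta (R uses a plus-sign convention)

#one-step-ahead forecast of the change in HRs, with an approximate 50% interval
fc <- predict(fit, n.ahead=1)
fc$pred + c(-1, 1) * qnorm(0.75) * fc$se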

HR/Game 1950-2013

The HR/Gm series is the time series that a baseball analyst would want to use, because it controls for the extra games from expansion, so the trends are also less pronounced. This is still a non-stationary time series, so it needs to be differenced like the previous model, and it can be described by the following equation:

$latex y_{t+1} = 0.0045989 + y_t + a_{t+1} - .49927 * a_{t}&s=2$

Still, the greatest shocks are the expansion years, which tend to have a bit of a lingering effect before regressing. 1987 now stands out as a really enigmatic outlier; there was no expansion that year. The best explanation is that there was a strike zone change, but I can only find that in one article. The home run outburst of the late 90s and early 2000s coincides with the ‘steroid era’ and two close periods of expansion. This post isn’t interested in analyzing steroids’ effect on MLB, only that its ‘shock’ is mixed in with the expansion team ‘shock’. Also it should be noted that HRs/Gm haven’t returned to pre-1993-expansion levels.

Looking at the opposite of a home run, strikeouts per year have a trend that is much more steady, and it’s increasing.

K Total 1950-2013

The graph displayed above is also a differenced first-order moving average process, ARIMA(0,1,1). Its equation looks very similar to the last two, so I won’t write it out; the parameters can be found in the SAS output appendix I have for this page. The forecast shows a definite increase in total strikeouts over the next few years. Just like the HR per year time series, the time series of Ks is best analyzed by looking at K/Gm. The K/Gm time series turns out to be a different model than the first three, because it is just a random walk around a linear trend.

K/Game 1950-2013

This process has random shocks around a positive trend with no ‘memory’ of the past shocks like the other three models had. This model for K/Gm, ARIMA(0,1,0), looks a little different than the ARIMA(0,1,1) models seen earlier since there is no lagged $latex a_{t-1}&s=1$ term. The ARIMA(0,1,0) model is given by the following equation:

$latex \nabla y_t = \mu + a_t&s=2$

and the forecast equation with parameters in it would be:

$latex y_{t+1} = 0.11637 + y_t + a_{t+1}&s=2$

This indicates that the K/Gm will increase by 0.11637 every year on average. Obviously, since there are only 54 outs in a baseball game, this trend can’t go on forever. As of the beginning of August 2014, the current K/Gm is 15.4, and it was forecasted to be 15.2497, which is within the 50% CI of the forecast.

While these models can make predictions about baseball, I wouldn’t consider them the best [or even good] models for forecasting, since we could incorporate other variables or improve the granularity of the forecast to individual players. There also isn’t much value in saying there’ll be more strikeouts in 2014 than in 2013. However, this example is a good academic exercise in understanding how univariate time series work, and hopefully it provides some insight into both time series and a little bit about trends in baseball.