Tag Archives: normal distribution

One-Sample t-Test [With R Code]

The one sample t-test is very similar to the one sample z-test. A sample mean is being compared to a claimed population mean. The t-test is required when the population standard deviation is unknown. The t-test uses the sample’s standard deviation (not the population’s standard deviation) and the Student t-distribution as the sampling distribution to find a p-value.

The t-Distribution

While the z-test uses the normal distribution, which is only dependent on the mean and standard deviation of the population. Any of the various t-tests [one-sample, independent, dependent] use the t-distribution, which has an extra parameter over the normal distribution: degrees of freedom (df). The theoretical basis for degrees of freedom deserves a lot of attention, but for now for the one-sample t-test, df = n – 1.

t-distribution vs normal distribution

The distributions above show how the degrees of freedom affect the shape of t distribution. The gray distribution is the normal distribution. Low df causes the tails of the distribution to be fatter, while a higher df makes the t-distribution become more like the normal distribution.

The practical outcome of this will be that samples with smaller n will need to be further from the population mean to reject the null hypothesis than samples with larger n. And compared to the z-test, the t-test will always need to have the sample mean further from the population mean.

Just like the one-sample z-test, we have to define our null hypothesis and alternate hypothesis. This time I’m going to show a two-tailed test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is significantly different from the population mean. The null and alternate hypotheses are written out as:

  • $latex H_0: \bar{x} = \mu&s=2$
  • $latex H_A: \bar{x} \neq \mu&s=2$

t-distribution critical

The graphic above shows a t-distribution with a df = 5 with the critical regions highlighted. Since the shape of the distribution changes with degrees of freedom, the critical value for the t-test will change as well.

The t-stat for this test is calculated the same way as the z-stat for the z-test, except for the σ term [population standard deviation] in the z-test is replaced with s [sample standard deviation]:

$latex z = \frac{\bar{x} – \mu}{\sigma/\sqrt{n}} \hspace{1cm} t = \frac{\bar{x} – \mu}{s/\sqrt{n}} &s=2$

Like the z-stat, the higher the t-stat is the more certainty there is that the sample mean and the population mean are different. There are three things make the t-stat larger:

  • a bigger difference between sample mean and population mean
  • a small sample standard deviation
  • a larger sample size

Example in R

Since the one-sample t-test follows the same process as the z-test, I’ll simply show a case where you reject the null hypothesis. This will also be a two-tailed test, so we will use the null and alternate hypotheses found earlier on this page.

Once again using the height and weight data set from UCLA’s site, I’ll create a tall-biased sample of 50 people for us to test.

#reads data set
data <- read.csv('data/Height_data.csv')
height <- data$Height

#N - number in population
#n - number in sample
N <- length(height)
n <- 50

#population mean
pop_mean <- mean(height)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)

This sample would represent something like athletes, CEOs, or maybe a meeting of tall people. After creating the sample, we use R's mean() and sd() functions to get the parameters for the t-stat formula from above.

sample_mean <- mean(height_sample_biased)
sample_sd <- sd(height_sample_biased)

Now using the population mean, the sample mean, the sample standard deviation, and the number of samples (n = 50) we can calculate the t-stat.

#t-stat
t <- (sample_mean - pop_mean) / (sample_sd/sqrt(n))

Now you could look up the critical value for the t-test with 49 degrees of freedom [50-1 = 49], but this is R, so we can find the area under the tail of the curve [the blue area from the critical region diagram] and see if it's under 0.025. This will be our p-value, which is the probability that the value was achieved by random chance.

#p-value for t-test
1-pt(t,n-1)

The answer should be 0.006882297, which is well below 0.025, so the null hypothesis is rejected and the difference between the tall-biased sample and the general population is statistically significant.

You can find the full R code including code to create the t-distribution and normal distribution data sets on my GitHub .

One-Tailed Z-test

One Mean Z-test [with R code]

I’ve included the full R code and the data set can be found on UCLA’s Stats Wiki

Building on finding z-scores for individual measurement or values within a population, a z-test can determine if there is a statistically significance different between a sample mean and a population mean with a known population standard deviation. [Those conditions are essential for using this test.] The z-test uses z-scores and a normal distribution to determine the probability the sample mean is drawn randomly from a known population. If the test fails, the conclusion is that random sampling is likely to have produced this. If the test rejects the null hypothesis, then the sample is likely to be a result of non-random sampling [ie. like team captains picking the tallest kids for a basketball game in gym class].

The z-test relies critically on the central limit theorem, which basically states that if you take a n >= 30 sample a population [with any distribution] many times over, you’ll get a normal distribution of the sample means. [This needs it’s own post to explain fully, and there are interesting ways you can program R to illustrate this.] The sample mean distribution chart is shown below compared to the population distribution. The important concepts to notice here are:

  • the area of both distributions is equal to 1
  • the sample mean distribution is a normal distribution
  • the sample mean distribution is tighter and taller than the population distribution

Central Limit Theorem Comparison to Population Distribution

For the rest of this post, the sample mean distribution will be used for the z-test and it is also represent in green opposed to blue. Also the data I use in this post is height data from this data set. It represents the heights of 25,000 children from Hong Kong. The data doesn’t reflect US adults, but it’s a great normally distributed data set.

The goal of the z-test will be to test to see if a sample and its mean are randomly sampled from the population or if there’s some significant difference. For example, you could use this test to see if the average height of NBA players is statistically significantly different than the general population. While the NBA example is pretty common sense, not every problem will be that clear. Sample size [like in many hypothesis tests] is a huge factor. Small sample sizes require huge differences between the sample mean and the population mean to be significant.

For a one-mean z-test, we will be using a one-tail hypothesis test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is greater. The null and alternate hypotheses are written out as:

  • $latex H_0: \bar{x} = \mu&s=2$
  • $latex H_A: \bar{x} > \mu&s=2$

One-Tailed Z-test

The graph above shows the critical regions for a right-tailed z-test. The critical regions reflect areas where the z-stat has to fall in order for the test to reject the null hypothesis. The critical regions are defined because they represent a probability less the the stated confidence level. For example the critical region for 95% confidence level only has an area [probability] of 5%. If the sample mean is the same as the population mean, there’s a 5% chance it was drawn by random chance. This concept is the basis for almost every hypothesis test.

The z-test uses the z-stat, which is calculated analogously to the z-score the difference being it uses standard error instead of standard deviation. These two concepts are similar; The standard deviation applies to the ‘spread’ of the blue population distribution, while the standard error applies to the ‘spread’ of the green sample mean distribution. The z-stat is calculated as:

$latex z = \frac{\bar{x} – \mu}{\sigma/\sqrt{n}} &s=2$

The higher the z-stat is the more certainty there is that the sample mean and the population are different. There are three things make the z-stat larger:

  • a bigger difference between sample mean and population mean
  • a small population standard deviation
  • a larger sample size

Example

I have two sets of sample from the data set: one is entirely random and the other I weighted heavily towards taller people. The null hypothesis would be that both there’s no difference between the sample mean and the population mean. The alternate would be that the sample mean is greater than the population mean. The weighted sample would be the sample if you were evaluating the mean height of a basketball team vs the general population. Here are the two sets of an n=50 sample and R code on how I constructed them using a set random seed of 123.

Unbiased random sample

unbiased_sample

Tall-biased random sample

biased_sample

#unbiased random sample
set.seed(123)
n <- 50
height_sample <- sample(height, size=n)
sample_mean <- mean(height_sample)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)
sample_mean_biased <- mean(height_sample_biased)

The population mean is 67.993, the first unbiased sample is 68.099, and the tall-biased group is 68.593. Both samples are higher than the than the population mean, but are both significantly higher than the mean? To figure this out, we need to calculate the z-stats and find out if those z-stats fall in the critical region using the equation:

$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$

We can substitute and calculate with the population standard deviation [σ] = 1.902:

$latex z_{unbiased} = \frac{68.593 - 67.993}{1.902/\sqrt{50}} = 0.3922 \ \ \ \ z_{tall-biased} = \frac{68.099 - 67.993}{1.902/\sqrt{50}} = 2.229 &s=0$

#random unbiased sample
#z-stat calculation
sample_mean
z <- (sample_mean - pop_mean)/(pop_sd/sqrt(n))

#tall-biased sample
z <- (sample_mean_biased - pop_mean)/(pop_sd/sqrt(n))

Quickly, knowing that the critical value for a one-tail z-test at 95% confidence is 1.645, we can determine the unbiased random sample is not significantly different, but the tall-biased sample is significantly different. This is because the z-stat for the unbiased sample is less than the critical value, while the tall-biased is higher than the critical value.

Failed Z-test Example Comparison

Plotting the z-test for the unbiased sample, the area [probability] to the right of the z-stat is much higher than the accepted 5%. The larger the green area is the more likely the difference between the sample mean and the population mean were obtained by random chance. To get a z-test to be significant, you want to get the z-stat high so that the area [probability] is low. [In practice, this can be done by increasing sample size.]

Successful Z-test Example

The tall-baised sample mean's z-stat creates a plot with much less area to the right of the z-stat, so these results were much less likely to be obtained by chance. The p-values can be obtained by calculating the area to right of the z-stat. The R code below summarizes how to do that using R's 'pnorm' function.

#calculating the p-value
p_yellow2 <- pnorm(z)                   
p_green2 <- 1 - p_yellow2
p_green2

The p-value for the unbiased sample is .3474 or there's a 34.74% chance that the result was obtained due to random chance, while the tall-biased sample only have a p-value of .01291 or a 1.291% chance being a result of random chance. Since the p-value tall-biased sample is less than the .05, the null hypothesis is rejected, but the since the unbiased sample's p-value is well above .05, the null hypothesis is retained.

What the one-mean z-test accomplished was telling us that a simple random sample from a population wasn't really that different from population, while a sample that wasn't completely random but was much taller than the overall population was shown to be different. While this test isn't used often, the principles of distributions, calculating test stats, and p-values have many applications with in the statistics universe.

Probabiliy of Finding Someone Taller than 6 Comparison

Calculating Z-Scores [with R code]

I’ve included the full R code and the data set can be found on UCLA’s Stats Wiki

Normal distributions are convenient because they can be scaled to any mean or standard deviation meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of these measurements should all be completely different. In order to get the distributions standardized, the measurements can be changed into z-scores.

Z-scores are a stand-in for the actual measurement, and they represent the distance of a value from the mean measured in standard deviations. So a z-score of 2.0 means the measurement is 2 standard deviations away from the mean.

To demonstrate how this is calculated and used, I found a height and weight data set on UCLA’s site. They have height measurements from children from Hong Kong. Unfortunately, the site doesn’t give much detail about the data, but it is an excellent example of normal distribution as you can see in the graph below. The red line represents the theoretical normal distribution, while the blue area chart reflects a kernel density estimation of the data set obtained from UCLA. The data set doesn’t deviate much from the theoretical distribution.

Normal Distribution Z-Score Comparison

The z-scores are also listed on this normal distribution to show how the actual measurements of height correspond to the z-scores, since the z-scores are simple arithmetic transformations of the actual measurements. The first step to find the z-score is to find the population mean and standard deviation. It should be noted that the sd function in R uses the sample standard deviation and not the population standard deviation, though with 25,000 samples the different is rather small.

#DATA LOAD
data <- read.csv('Height_data.csv')
height <- data$Height

hist(height) #histogram

#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)

Using just the population mean [μ = 67.99] and standard deviation [σ = 1.90], you can calculate the z-score for any given value of x. In this example I'll use 72 for x.

$latex z = \frac{x - \mu}{\sigma} &s=2$

z <- (72 - pop_mean) / pop_sd

This gives you a z-score of 2.107. To put this tool to use, let's use the z-score to find the probability of finding someone who is 72 inches [6-foot] tall. [Remember this data set doesn't apply to adults in the US, so these results might conflict with everyday experience.] The z-score will be used to determine the area [probability] underneath the distribution curve past the z-score value that we are interested in.
[One note is that you have to specify a range (72 to infinity) and not a single value (72). If you wanted to find people who are exactly 6-foot, not taller than 6-foot, you would have to specify the range of 71.5 to 72.5 inches. This is another problem, but this has everything to do with definite integrals intervals if you are familiar with Calc I.]

Probabiliy of Finding Someone Taller than 6 Comparison

The above graph shows the area we intend to calculate. The blue area is our target, since it represents the probability of finding someone taller than 6-foot. The yellow area represents the rest of the population or everyone is is under 6-feet tall. The z-score and actual height measurements are both given underscoring the relationship between the two.

Typically in an introductory stats class, you'd use the z-score and look it up in a table and find the probability that way. R has a function 'pnorm' which will give you a more precise answer than a table in a book. ['pnorm' stands for "probability normal distribution".] Both R and typical z-score tables will return the area under the curve from -infinity to value on the graph this is represented by the yellow area. In this particular problem, we want to find the blue area. The solution to this is an easy arithmetic function. The area under the curve is 1, so by subtracting the yellow area from 1 will give you the area [probability] for the blue area.

Yellow Area:

p_yellow1 <- pnorm(72, pop_mean, pop_sd)    #using x, mu, and sigma
p_yellow2 <- pnorm(z)                       #using z-score of 2.107

Blue Area [TARGET]:

p_blue1 <- 1 - p_yellow1   #using x, mu, and sigma
p_blue2 <- 1 - p_yellow2   #using z-score of 2.107

Both of these techniques in R will yield the same answer of 1.76%. I used both methods, to show that R has some versatility that traditional statistics tables don't have. I personally find statistics tables antiquated, since we have better ways to determine it, and the table doesn't help provide any insight over software solutions.

Z-scores are useful when relating different measurement distributions to each acting as a 'common denominator'. The z-scores are used extensively for determining area underneath the curve when using text book tables, and also can be easily used in programs such as R. Some statistical hypothesis tests are based on z-scores and the basic principles of finding the area beyond some value.

Box Plot PIrates' Pitch Count

Pirates — Pitch Count

At the Pirates, we like to try to guess what the pitch count will be for the Pirates’ starting pitcher. In honor of opening day today, I present a cheat sheet!

Box Plot Pirates' Pitch Count

Total Pitches Pirates Starters

The graphs might be a little bit overkill, but it’s cool all the different ways you can visualize the this simple data. The number of pitches is distributed normally with a skew left. This skew occurs because there are instances when the pitcher has a bad day and gets pulled really early. To account for this, I excluded any outing that didn’t have more than 50 pitches. We will consider these as rare events, which we shouldn’t try to use in our prediction. The idea of the game is to hit the exact pitch count, and this would preclude a rare event from being factored in. I also used the median number of pitches instead of the average number of pitches for the same reason. We want to consistently pick numbers which are the most likely to get hit, not to try to predict every game.

The idea of using the median over the mean is important when there is a skew to the normal distribution of the data. This is important for something like income. There is a huge skew for incomes across the entire US population since there are so few people that make outrageous amounts of money. The mean of incomes will be much higher than the median of incomes. The median will be much more representative of the central tendency of the data.

So applying this to the pitch count, the short outings are rare and the mean doesn’t represent the most probable outcome for the day.

Happy Opening Day!