There are three doors. And hidden behind them are two goats and a car. Your objective is to win the car. Here’s what you do:
- Pick a door.
- The host opens one of the doors you didn’t pick that has a goat behind it.
- Now there are just two doors to choose from.
- Do you stay with your original choice or switch to the other door?
- What’s the probability you get the car if you stay?
- What’s the probability you get the car if you switch?
It’s not a 50/50 choice. I won’t digress into the math behind it, but instead let you play with the simulator below. The game will tally up how many times you win and lose based on your choice.
What’s going on here? Marilyn vos Savant wrote the solution to this game in 1990. You can read vos Savant’s explanations and some of the ignorant responses. But in short, because the door that’s opened is not opened randomly, the host gives you additional information about the set of doors you didn’t choose. Effectively, if you switch, you are selecting all the other doors; if you choose to stay, you are selecting just one door.
In her answer, she suggests:
Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?
To illustrate that in the simulation, you can increase the number of doors in the simulator. It becomes pretty clear that switching is the correct choice.
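If you’d rather see the logic in code than in the embedded simulator, here’s a minimal R sketch of the game [my own illustration, not the simulator from the post; the function name and structure are assumptions]:

#simulate one game; switch_door = TRUE means the player switches after the host opens a goat door
play_monty <- function(switch_door) {
  doors <- 1:3
  car <- sample(doors, 1)
  pick <- sample(doors, 1)
  #the host opens a door that is neither the player's pick nor the car
  goats <- setdiff(doors, c(pick, car))
  opened <- if (length(goats) == 1) goats else sample(goats, 1)
  if (switch_door) pick <- setdiff(doors, c(pick, opened))
  pick == car
}

set.seed(123)
mean(replicate(10000, play_monty(switch_door = TRUE)))  #close to 2/3
mean(replicate(10000, play_monty(switch_door = FALSE))) #close to 1/3

[The guard around sample() is there because sample(x, 1) behaves differently when x happens to be a single number.]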
Finally, here’s some Kevin Spacey:
Covariance — Different Ways to Explain or Visualize It
Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings for a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, X and Y will have positive covariance. Negative covariance results from two variables moving in opposite directions, and zero covariance results from the variables having no relationship with each other. Variance is also a specific case of covariance where both input variables are the same. All of this is rather abstract, so let’s look at some more concrete definitions.
Covariance — Summation Notation
The definition you will find in most introductory stat books is a relatively simple equation using the summation operator (Σ). This shows covariance as the sum of the products of paired deviations from the means. First, you need to find the mean of both variables. Then take each data point and subtract its respective variable’s mean. Finally, multiply the paired differences together, sum them, and divide by the number of data points [or n - 1 for a sample].
Population Covariance:
$latex
cov(X, Y) = \frac{1}{N} \sum\limits_{i}^{N}{(X_i - \mu_x)(Y_i - \mu_y)}
&s=2$
Sample Covariance:
$latex
cov(X, Y) = \frac{1}{n-1} \sum\limits_{i}^{n}{(X_i - \bar X)(Y_i - \bar Y)}
&s=2$
N is the number of data points in the population, and n is the number of data points in the sample. μX is the population mean for X and μY for Y, while X̄ and Ȳ are the corresponding sample means. Calculating the covariance of any significant data set can be tedious if done by hand, but we can set up the equation in R and see it work. I used a modified version of Anscombe’s Quartet data set.
#get a data set
X <- c(anscombe$x1, 6,4,10)
Y <- c(anscombe$y1, 10,8,6)

#get the means
X.bar <- mean(X)
Y.bar <- mean(Y)

#calculate the covariance
sum((X-X.bar)*(Y-Y.bar)) / (length(X))     #manually, population
sum((X-X.bar)*(Y-Y.bar)) / (length(X) - 1) #manually, sample
cov(X,Y) #built-in function USES SAMPLE COVARIANCE
Obviously, since covariance is used so much within statistics, R has a built-in function cov(), which yields the sample covariance for two vectors or even a matrix.
Covariance — Expected Value Notation
[Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post.] Quickly and oversimplified, the expected value is the mean value of a random variable: E[X] = mean(X). The expected value notation below describes the population covariance of two variables (not the sample covariance):
$latex
cov(X, Y) = \textnormal{E}[(X-\textnormal{E}[X])(Y-\textnormal{E}[Y])]
&s=2$
The above formula is just the population covariance written differently. For example, E[X] is the same as μx, and the outer E[ ] acts the same as taking the average of (X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful, formula for covariance:
$latex
cov(X, Y) = \textnormal{E}[XY] - \textnormal{E}[X]\textnormal{E}[Y]
&s=2$
This formula can be interpreted as the product of the means of variables X and Y subtracted from the average of signed areas of variables X and Y. This probably isn’t very useful if you are trying to interpret covariance. But you’ll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.
mean(X*Y) - mean(X)*mean(Y) #expected value notation
Covariance — Signed Area of Rectangles
Covariance can also be thought of as the sum of the signed areas of the rectangles that can be drawn from the data points to the variables’ respective means. It’s called the signed area because we will get two types of rectangles: ones with a positive value and ones with a negative value. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing and less of a practical interpretation and application of covariance. If you plot the paired data points [in this case the X and Y variables we have already used], you can tell just by looking that there is probably some positive covariance, because it looks like there is a linear relationship in the data. I’ve chosen to scale the plot so that zero is not included. Since this is a scatter plot, including zero isn’t necessary.
First, we can draw the lines for the means of both variables as straight lines. These lines effectively create a new set of axes and will be used to draw the rectangles. The sides of the rectangles will be the difference between a data point and its mean [Xi - X̄]. When that is multiplied by [Yi - Ȳ], you can see that it gives you the area of a rectangle. Do that for every point in your data set, add them up, divide by the number of data points, and you get the population covariance.
The following plot has a rectangle for each data point, coded red for negative and blue for positive signed areas.
The overlapping rectangles need to be considered separately, so the opacity is reduced so that all the rectangles are visible. For this data set there is much more blue area than red area, so there is positive covariance, which jibes with what we calculated earlier in R. If you were to take the areas of those rectangles, add or subtract them according to the blue/red color, then divide by the number of rectangles, you would arrive at the population covariance: 3.16. To get the sample covariance you’d subtract one from the number of rectangles when you divide.
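If you want to check the signed-area arithmetic yourself, here is a minimal sketch reusing the X and Y vectors from the R code above [the 3.16 figure is from the post; run it to confirm]:

#signed area of each rectangle: (Xi - X.bar) * (Yi - Y.bar)
areas <- (X - mean(X)) * (Y - mean(Y))
sum(areas[areas > 0]) #total blue (positive) area
sum(areas[areas < 0]) #total red (negative) area
sum(areas) / length(X)       #population covariance
sum(areas) / (length(X) - 1) #sample covariance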
References:
Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.
Covariance. http://math.tutorvista.com/statistics/covariance.html
Covariance As Signed Area Of Rectangles. http://www.davidchudzicki.com/posts/covariance-as-signed-area-of-rectangles/
How would you explain covariance to someone who understands only the mean? https://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean
Notes:
The signed-area-of-rectangles posts on Chudzicki’s site and Stack Exchange use a different covariance formulation, but a similar concept to my approach.
The full code I used to write up this tutorial is available on my GitHub.
Introduction to Correlation with R | Anscombe’s Quartet
Correlation is one of the most commonly [over]used statistical tools. In short, it measures how strong the relationship between two variables is. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other.
Basic Calculation and Definition
Covariance also measures the relationship between two variables, but it is not scaled, so it can’t tell you the strength of that relationship. For example, let’s look at the following vectors a, b and c. Vector c is simply a 10X-scaled transformation of vector b, and vector a has no transformational relationship with vector b or c.
a | b | c |
1 | 4 | 40 |
2 | 5 | 50 |
3 | 6 | 60 |
4 | 8 | 80 |
5 | 8 | 80 |
5 | 4 | 40 |
6 | 10 | 100 |
7 | 12 | 120 |
10 | 15 | 150 |
4 | 9 | 90 |
8 | 12 | 120 |
Plotted out, a vs. b and a vs. c look identical except that the y-axis is scaled differently for each. When the covariance is taken of both a & b and a & c, you get a large difference in results. The covariance between a & b is much smaller than the covariance between a & c even though the plots are identical except for the scale: the y-axis on the c vs. a plot goes to 150 instead of 15.
$latex
cov(X, Y) = \frac{\sum_i^{n}{(X_i - \bar X)(Y_i - \bar Y)}}{n-1}
&s=2$
$latex
cov(a, b) = 8.5
&s=2$
$latex
cov(a, c) = 85
&s=2$
To account for this, correlation takes the covariance and scales it by the product of the standard deviations of the two variables.
$latex
cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y}
&s=2$
$latex
cor(a, b) = 0.8954
&s=2$
$latex
cor(a, c) = 0.8954
&s=2$
Now, correlation describes how strong the relationship between the two vectors is, regardless of the scale. Since the standard deviation of vector c is much greater than that of vector b, this accounts for the larger covariance term and produces identical correlation terms. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a strong relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no relationship.
Here’s the R code that will run through the calculations.
#covariance vs correlation
a <- c(1,2,3,4,5,5,6,7,10,4,8)
b <- c(4,5,6,8,8,4,10,12,15,9,12)
c <- c(4,5,6,8,8,4,10,12,15,9,12) * 10
data <- data.frame(a, b, c)

cov(a, b) #8.5
cov(a, c) #85

cor(a,b) #0.8954
cor(a,c) #0.8954
Caution | Anscombe's Quartet
Correlation is great. It's a basic tool that is easy to understand, but it has its limitations. The most prominent is the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept with some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it's hard to isolate one factor. Politicians and pundits still try.
Another famous caution about using correlation is Anscombe's Quartet. Anscombe's Quartet uses different sets of data to achieve the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it's important to visualize data.
The graphs demonstrate how different the data sets can be. If this were real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be considered appropriate for a basic, linear model.
I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don't need to install any packages to use it. Ggplot2 is a fantastic and powerful data visualization package which can be downloaded for free using the install.packages('ggplot2') command. Below is the R code I used to make the graphs individually and combine them into a matrix.
library(ggplot2)
library(gridExtra) #provides grid.arrange()

#correlation
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits=4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits=4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits=4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits=4)

#define the OLS regression lines
line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) +
  geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
  geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2]) +
  annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) +
  geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
  geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2]) +
  annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) +
  geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
  geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2]) +
  annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) +
  geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
  geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2]) +
  annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

#combine the four plots into one matrix
grid.arrange(plot1, plot2, plot3, plot4, top='Anscombe Quadrant -- Correlation Demonstration')
The full code I used to write up this tutorial is available on my GitHub.
Making a Correlation Matrix in R
This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier.
The correlation matrix is rather popular for exploratory data analysis, because it can quickly show you the correlations between variables in your data set. From a practical application standpoint, this entire post is unnecessary, because R has built-in functions that produce it directly, but I’m going to show how to derive it using matrix algebra in R.
First, the starting point will be the covariance matrix that was computed from the last post.
#create vectors -- these will be our columns
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)

#create matrix from vectors
M <- cbind(a,b,c,d,e)

k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects

#create means for each column
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e))

#creates a difference matrix
D <- M - M_mean

#creates the covariance matrix [note the (n-1) denominator, matching the sample covariance from the last post]
C <- (n-1)^-1 * t(D) %*% D
$latex {\bf C } =
\begin{bmatrix}
V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix}&s=2$
This matrix has all the information that's needed to get the correlations for all the variables and create a correlation matrix [V -- variance, C -- covariance]. Correlation [we are using the Pearson version of correlation] is calculated using the covariance between two vectors and their standard deviations [s, the square root of the variance]:
$latex
cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}}
&s=2$
The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:
$latex {\bf V} = diag({\bf C}) = \begin{bmatrix}
V_a\ & 0\ & 0\ & 0\ & 0 \\
0 & V_b & 0 & 0 & 0 \\
0 & 0 & V_c & 0 & 0 \\
0 & 0 & 0 & V_d & 0 \\
0 & 0 & 0 & 0 & V_e
\end{bmatrix}
&s=2$
R doesn't quite work the same way as matrix algebra notation. The diag() function creates a vector from a matrix and a matrix from a vector, so it's used twice to create the diagonal variance matrix: once to get a vector of the variances, and a second time to turn that vector into the above diagonal matrix. Since the standard deviations are needed, the square root is taken. The variances are also inverted to facilitate division.
#inverse standard deviations, pulled from the diagonal of the covariance matrix
S <- diag(diag(C)^(-1/2))
After getting the diagonal matrix, basic matrix multiplication is used to scale all the terms in the covariance matrix according to the basic correlation formula from above.
$latex {\bf R } = {\bf S} \times {\bf C} \times {\bf S}&s=2$
#constructs the correlation matrix S %*% C %*% S
And the correlation matrix is symbolically represented as:
$latex {\bf R } =
\begin{bmatrix}
r_{a,a}\ & r_{a,b}\ & r_{a,c}\ & r_{a,d}\ & r_{a,e} \\
r_{a,b} & r_{b,b} & r_{b,c} & r_{b,d} & r_{b,e} \\
r_{a,c} & r_{b,c} & r_{c,c} & r_{c,d} & r_{c,e} \\
r_{a,d} & r_{b,d} & r_{c,d} & r_{d,d} & r_{d,e} \\
r_{a,e} & r_{b,e} & r_{c,e} & r_{d,e} & r_{e,e}
\end{bmatrix}&s=2$
The diagonal, where the variances were in the covariance matrix, is now all 1s, since a variable's correlation with itself is always 1.
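As a quick sanity check [a minimal sketch, reusing the M, C, and S objects defined above], the matrix-algebra result should agree with R's built-in cor() applied to the data matrix:

#compare the matrix-algebra correlation matrix with R's built-in function
R <- S %*% C %*% S
all.equal(R, cor(M), check.attributes=FALSE) #should return TRUE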
Making a Covariance Matrix in R
The full R code for this post is available on my GitHub.
Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while the columns are the variables assigned to those subjects. While we use the matrix terminology, this would look much like a normal data table you might already have your data in. For the example in R, I’m going to create a 6×5 matrix, with 6 subjects and 5 different variables (a,b,c,d,e). I’m choosing this particular convention because R and databases use it: a row in a data frame represents a subject while the columns are different variables. [The underlying structure of the data frame is a collection of vectors.] This is against the normal mathematical convention, which has the variables as rows and not columns, so this won’t follow the normal formulas found elsewhere online.
The covariance matrix is a matrix that only concerns the relationships between variables, so it will be a k x k square matrix. [In our case, a 5×5 matrix.] Before constructing the covariance matrix, it’s helpful to think of the data matrix as a collection of 5 vectors, which is how I built our data matrix in R.
#create vectors -- these will be our columns
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)

#create matrix from vectors
M <- cbind(a,b,c,d,e)
The data matrix (M) written out is shown below.
     a b  c  d  e
[1,] 1 2  3 10  7
[2,] 2 3  5 20  8
[3,] 3 5  5 30  9
[4,] 4 6  5 40  4
[5,] 5 1 10 50  6
[6,] 6 9  8 55 10
Each value in the covariance matrix represents the covariance (or variance) between two of the vectors. With five vectors, there are 25 different combinations that can be made and those combinations can be laid out in a 5x5 matrix.
There are a few different ways to formulate a covariance matrix. You can use the cov() function on the data matrix instead of two vectors. [This is the easiest way to get a covariance matrix in R.]
cov(M)
But we'll use the following steps to construct it manually:
- Create a matrix of means (M_mean).
- Create a difference matrix (D) by subtracting the matrix of means (M_mean) from data matrix (M).
- Create the covariance matrix (C) by multiplying the transposed difference matrix (D) with the difference matrix itself and the inverse of the number of subjects (n). [We will use (n-1), since this is necessary for the unbiased, sample covariance estimator. This is the covariance R will return by default.]
$latex {\bf M\_mean} = \begin{bmatrix}
1 \\
1 \\
1 \\
1 \\
1 \\
\end{bmatrix}
\times
\begin{bmatrix} \bar{x_{a}} & \bar{x_{b}} & \bar{x_{c}} & \bar{x_{d}} & \bar{x_{e}}\end{bmatrix}&s=2$
$latex {\bf D = M - M\_mean} &s=2$
$latex {\bf C = } (n-1)^{-1} \times {\bf D^T} \times {\bf D} &s=2$
k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects

#create means for each column
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e))

#creates a difference matrix
D <- M - M_mean

#creates the covariance matrix
C <- (n-1)^-1 * t(D) %*% D
The final covariance matrix made using the R code looks like this:
      a         b    c          d        e
a  3.5  3.000000  4.0  32.500000 0.400000
b  3.0  8.666667  0.4  25.333333 2.466667
c  4.0  0.400000  6.4  38.000000 0.400000
d 32.5 25.333333 38.0 304.166667 1.333333
e  0.4  2.466667  0.4   1.333333 4.666667
It represents the various covariances (C) and variance (V) combinations of the five different variables in our data set. These are all values that you might be familiar with if you've used the var() or cov() functions in R or similar functions in Excel, SPSS, etc.
$latex
\begin{bmatrix}
V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix}&s=2$
This matrix is used in applications like constructing the correlation matrix and generalized least squares regressions.
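As a closing sanity check [a minimal sketch, not part of the original post], the manually constructed matrix should agree with R's built-in cov() on the same data matrix:

#the manual covariance matrix should match the built-in function
all.equal(C, cov(M), check.attributes=FALSE) #should return TRUE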
One-Sample t-Test [With R Code]
The one sample t-test is very similar to the one sample z-test. A sample mean is being compared to a claimed population mean. The t-test is required when the population standard deviation is unknown. The t-test uses the sample’s standard deviation (not the population’s standard deviation) and the Student t-distribution as the sampling distribution to find a p-value.
The t-Distribution
While the z-test uses the normal distribution, which depends only on the mean and standard deviation of the population, the various t-tests [one-sample, independent, dependent] use the t-distribution, which has an extra parameter over the normal distribution: degrees of freedom (df). The theoretical basis for degrees of freedom deserves a lot of attention, but for now, for the one-sample t-test, df = n - 1.
The distributions above show how the degrees of freedom affect the shape of t distribution. The gray distribution is the normal distribution. Low df causes the tails of the distribution to be fatter, while a higher df makes the t-distribution become more like the normal distribution.
The practical outcome of this is that samples with smaller n will need a sample mean further from the population mean to reject the null hypothesis than samples with larger n. And compared to the z-test, the t-test will always need the sample mean to be further from the population mean to reach significance.
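To see this numerically [a quick sketch, not from the original post], compare the two-tailed critical values from qt() with the normal critical value from qnorm():

#two-tailed 95% critical values for the t-distribution at different degrees of freedom
qt(0.975, df = c(5, 10, 30, 100)) #shrinks toward the z critical value as df grows
qnorm(0.975)                      #normal (z) critical value, about 1.96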
Just like the one-sample z-test, we have to define our null hypothesis and alternate hypothesis. This time I’m going to show a two-tailed test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is significantly different from the population mean. The null and alternate hypotheses are written out as:
- $latex H_0: \bar{x} = \mu&s=2$
- $latex H_A: \bar{x} \neq \mu&s=2$
The graphic above shows a t-distribution with a df = 5 with the critical regions highlighted. Since the shape of the distribution changes with degrees of freedom, the critical value for the t-test will change as well.
The t-stat for this test is calculated the same way as the z-stat for the z-test, except that the σ term [population standard deviation] in the z-test is replaced with s [sample standard deviation]:
$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \hspace{1cm} t = \frac{\bar{x} - \mu}{s/\sqrt{n}} &s=2$
Like the z-stat, the higher the t-stat is, the more certainty there is that the sample mean and the population mean are different. There are three things that make the t-stat larger:
- a bigger difference between sample mean and population mean
- a small sample standard deviation
- a larger sample size
Example in R
Since the one-sample t-test follows the same process as the z-test, I’ll simply show a case where you reject the null hypothesis. This will also be a two-tailed test, so we will use the null and alternate hypotheses found earlier on this page.
Once again using the height and weight data set from UCLA’s site, I’ll create a tall-biased sample of 50 people for us to test.
#reads data set
data <- read.csv('data/Height_data.csv')
height <- data$Height

#N - number in population
#n - number in sample
N <- length(height)
n <- 50

#population mean
pop_mean <- mean(height)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)

This sample would represent something like athletes, CEOs, or maybe a meeting of tall people. After creating the sample, we use R's mean() and sd() functions to get the parameters for the t-stat formula from above.

sample_mean <- mean(height_sample_biased)
sample_sd <- sd(height_sample_biased)

Now using the population mean, the sample mean, the sample standard deviation, and the sample size (n = 50), we can calculate the t-stat.
#t-stat
t <- (sample_mean - pop_mean) / (sample_sd/sqrt(n))

Now you could look up the critical value for the t-test with 49 degrees of freedom [50-1 = 49], but this is R, so we can find the area under the tail of the curve [the blue area from the critical region diagram] and see if it's under 0.025. This will be our p-value, which is the probability that the value was achieved by random chance.
#p-value for t-test
1-pt(t,n-1)

The answer should be 0.006882297, which is well below 0.025, so the null hypothesis is rejected and the difference between the tall-biased sample and the general population is statistically significant.
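As a cross-check [a sketch, not in the original post], R's built-in t.test() function runs the same one-sample test; since it reports a two-sided p-value, the result should be roughly twice the one-tail value computed above and lead to the same conclusion.

#built-in one-sample, two-sided t-test against the population mean
t.test(height_sample_biased, mu = pop_mean, alternative = "two.sided")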
You can find the full R code, including code to create the t-distribution and normal distribution data sets, on my GitHub.
One Mean Z-test [with R code]
I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki.
Building on finding z-scores for individual measurements or values within a population, a z-test can determine if there is a statistically significant difference between a sample mean and a population mean with a known population standard deviation. [Those conditions are essential for using this test.] The z-test uses z-scores and a normal distribution to determine the probability the sample mean was drawn randomly from a known population. If the test fails to reject the null hypothesis, the conclusion is that random sampling could plausibly have produced the difference. If the test rejects the null hypothesis, then the sample is likely to be a result of non-random sampling [i.e. like team captains picking the tallest kids for a basketball game in gym class].
The z-test relies critically on the central limit theorem, which basically states that if you take an n >= 30 sample from a population [with any distribution] many times over, you’ll get an approximately normal distribution of the sample means. [This needs its own post to explain fully, and there are interesting ways you can program R to illustrate this; a short sketch follows the list below.] The sample mean distribution chart is shown below compared to the population distribution. The important concepts to notice here are:
- the area of both distributions is equal to 1
- the sample mean distribution is a normal distribution
- the sample mean distribution is tighter and taller than the population distribution
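Here’s a short simulation sketch of that idea [my own illustration; it assumes the height vector has been loaded from the UCLA data, as in the z-score post]:

#draw many samples of n = 30 and look at the distribution of their means
set.seed(123)
sample_means <- replicate(10000, mean(sample(height, size = 30)))
hist(sample_means) #roughly normal and centered on the population mean
sd(sample_means)   #close to sd(height)/sqrt(30), the standard error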
For the rest of this post, the sample mean distribution will be used for the z-test, and it is represented in green as opposed to blue. The data I use in this post is height data from this data set. It represents the heights of 25,000 children from Hong Kong. The data doesn’t reflect US adults, but it’s a great normally distributed data set.
The goal of the z-test will be to test to see if a sample and its mean are randomly sampled from the population or if there’s some significant difference. For example, you could use this test to see if the average height of NBA players is statistically significantly different than the general population. While the NBA example is pretty common sense, not every problem will be that clear. Sample size [like in many hypothesis tests] is a huge factor. Small sample sizes require huge differences between the sample mean and the population mean to be significant.
For a one-mean z-test, we will be using a one-tail hypothesis test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is greater. The null and alternate hypotheses are written out as:
- $latex H_0: \bar{x} = \mu&s=2$
- $latex H_A: \bar{x} > \mu&s=2$
The graph above shows the critical regions for a right-tailed z-test. The critical regions reflect areas where the z-stat has to fall in order for the test to reject the null hypothesis. The critical regions are defined so that they represent a probability less than the stated significance level. For example, the critical region for a 95% confidence level only has an area [probability] of 5%. If the sample mean is really no different from the population mean, there’s only a 5% chance of the z-stat landing in that region by random chance. This concept is the basis for almost every hypothesis test.
The z-test uses the z-stat, which is calculated analogously to the z-score, the difference being that it uses the standard error instead of the standard deviation. These two concepts are similar: the standard deviation applies to the ‘spread’ of the blue population distribution, while the standard error applies to the ‘spread’ of the green sample mean distribution. The z-stat is calculated as:
$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$
The higher the z-stat is, the more certainty there is that the sample mean and the population mean are different. There are three things that make the z-stat larger:
- a bigger difference between sample mean and population mean
- a small population standard deviation
- a larger sample size
Example
I have two samples from the data set: one is entirely random and the other I weighted heavily towards taller people. The null hypothesis would be that there’s no difference between the sample mean and the population mean. The alternate would be that the sample mean is greater than the population mean. The weighted sample would be the sample you’d get if you were evaluating the mean height of a basketball team vs the general population. Here are the two n=50 samples and the R code showing how I constructed them using a set random seed of 123.
Unbiased random sample
Tall-biased random sample
#unbiased random sample
set.seed(123)
n <- 50
height_sample <- sample(height, size=n)
sample_mean <- mean(height_sample)

#tall-biased sample
cut <- 1:25000
weights <- cut^.6
sorted_height <- sort(height)
set.seed(123)
height_sample_biased <- sample(sorted_height, size=n, prob=weights)
sample_mean_biased <- mean(height_sample_biased)
The population mean is 67.993, the unbiased sample mean is 68.099, and the tall-biased sample mean is 68.593. Both samples are higher than the population mean, but are both significantly higher than the mean? To figure this out, we need to calculate the z-stats and find out if those z-stats fall in the critical region using the equation:
$latex z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} &s=2$
We can substitute and calculate with the population standard deviation [σ] = 1.902:
$latex z_{unbiased} = \frac{68.099 - 67.993}{1.902/\sqrt{50}} = 0.3922 \ \ \ \ z_{tall-biased} = \frac{68.593 - 67.993}{1.902/\sqrt{50}} = 2.229 &s=0$
#random unbiased sample #z-stat calculation sample_mean z <- (sample_mean - pop_mean)/(pop_sd/sqrt(n)) #tall-biased sample z <- (sample_mean_biased - pop_mean)/(pop_sd/sqrt(n))
Quickly, knowing that the critical value for a one-tail z-test at 95% confidence is 1.645, we can determine the unbiased random sample is not significantly different, but the tall-biased sample is significantly different. This is because the z-stat for the unbiased sample is less than the critical value, while the tall-biased is higher than the critical value.
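[A small aside, not in the original post: the 1.645 critical value can be recovered in R rather than looked up in a table.]

qnorm(0.95) #one-tail critical value at 95% confidence, about 1.645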
Plotting the z-test for the unbiased sample, the area [probability] to the right of the z-stat is much higher than the accepted 5%. The larger the green area is, the more likely it is that the difference between the sample mean and the population mean was obtained by random chance. To get a z-test to be significant, you want the z-stat to be high so that the area [probability] is low. [In practice, this can be done by increasing sample size.]
The tall-biased sample mean's z-stat creates a plot with much less area to the right of the z-stat, so these results were much less likely to be obtained by chance. The p-values can be obtained by calculating the area to the right of the z-stat. The R code below summarizes how to do that using R's pnorm() function.
#calculating the p-value
p_yellow2 <- pnorm(z)
p_green2 <- 1 - p_yellow2
p_green2
The p-value for the unbiased sample is .3474, meaning there's a 34.74% chance that the result was obtained due to random chance, while the tall-biased sample has a p-value of only .01291, or a 1.291% chance of being a result of random chance. Since the p-value of the tall-biased sample is less than .05, the null hypothesis is rejected, but since the unbiased sample's p-value is well above .05, the null hypothesis is retained.
What the one-mean z-test accomplished was telling us that a simple random sample from a population wasn't really that different from the population, while a sample that wasn't completely random, but was much taller than the overall population, was shown to be different. While this test isn't used often, the principles of distributions, calculating test stats, and p-values have many applications within the statistics universe.
Calculating Z-Scores [with R code]
I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki.
Normal distributions are convenient because they can be scaled to any mean or standard deviation meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of these measurements should all be completely different. In order to get the distributions standardized, the measurements can be changed into z-scores.
Z-scores are a stand-in for the actual measurement, and they represent the distance of a value from the mean measured in standard deviations. So a z-score of 2.0 means the measurement is 2 standard deviations away from the mean.
To demonstrate how this is calculated and used, I found a height and weight data set on UCLA’s site. They have height measurements from children from Hong Kong. Unfortunately, the site doesn’t give much detail about the data, but it is an excellent example of a normal distribution, as you can see in the graph below. The red line represents the theoretical normal distribution, while the blue area chart reflects a kernel density estimation of the data set obtained from UCLA. The data set doesn’t deviate much from the theoretical distribution.
The z-scores are also listed on this normal distribution to show how the actual measurements of height correspond to the z-scores, since the z-scores are simple arithmetic transformations of the actual measurements. The first step to find the z-score is to find the population mean and standard deviation. It should be noted that the sd() function in R uses the sample standard deviation and not the population standard deviation, though with 25,000 samples the difference is rather small.
#DATA LOAD
data <- read.csv('Height_data.csv')
height <- data$Height

hist(height) #histogram

#POPULATION PARAMETER CALCULATIONS
pop_sd <- sd(height)*sqrt((length(height)-1)/(length(height)))
pop_mean <- mean(height)
Using just the population mean [μ = 67.99] and standard deviation [σ = 1.90], you can calculate the z-score for any given value of x. In this example I'll use 72 for x.
$latex z = \frac{x - \mu}{\sigma} &s=2$
z <- (72 - pop_mean) / pop_sd
This gives you a z-score of 2.107. To put this tool to use, let's use the z-score to find the probability of finding someone who is 72 inches [6-foot] tall. [Remember this data set doesn't apply to adults in the US, so these results might conflict with everyday experience.] The z-score will be used to determine the area [probability] underneath the distribution curve past the z-score value that we are interested in.
[One note is that you have to specify a range (72 to infinity) and not a single value (72). If you wanted to find people who are exactly 6-foot, not taller than 6-foot, you would have to specify the range of 71.5 to 72.5 inches. This is another problem, but it has everything to do with definite integral intervals if you are familiar with Calc I.]
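If you did want the 'exactly 6-foot' version of the question, a quick sketch [not in the original post] is to take the difference of two pnorm() calls over that half-inch-either-side range:

#probability of a height between 71.5 and 72.5 inches
pnorm(72.5, pop_mean, pop_sd) - pnorm(71.5, pop_mean, pop_sd)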
The above graph shows the area we intend to calculate. The blue area is our target, since it represents the probability of finding someone taller than 6-foot. The yellow area represents the rest of the population, or everyone who is under 6 feet tall. The z-score and actual height measurements are both given, underscoring the relationship between the two.
Typically in an introductory stats class, you'd use the z-score and look it up in a table to find the probability that way. R has a function 'pnorm' which will give you a more precise answer than a table in a book. ['pnorm' stands for "probability normal distribution".] Both R and typical z-score tables will return the area under the curve from -infinity to the value on the graph; this is represented by the yellow area. In this particular problem, we want to find the blue area. The solution to this is an easy arithmetic function. The area under the curve is 1, so subtracting the yellow area from 1 gives you the area [probability] for the blue area.
Yellow Area:
p_yellow1 <- pnorm(72, pop_mean, pop_sd) #using x, mu, and sigma
p_yellow2 <- pnorm(z)                    #using z-score of 2.107
Blue Area [TARGET]:
p_blue1 <- 1 - p_yellow1 #using x, mu, and sigma
p_blue2 <- 1 - p_yellow2 #using z-score of 2.107
Both of these techniques in R will yield the same answer of 1.76%. I used both methods to show that R has some versatility that traditional statistics tables don't have. I personally find statistics tables antiquated, since we have better ways to determine these probabilities, and the tables don't provide any insight over software solutions.
Z-scores are useful when relating different measurement distributions to each other, acting as a 'common denominator'. Z-scores are used extensively for determining the area underneath the curve when using textbook tables, and they can also be easily used in programs such as R. Some statistical hypothesis tests are based on z-scores and the basic principle of finding the area beyond some value.
Using a Genetic Algorithm to Minimize an OLS Regression in R
A genetic algorithm allows you to optimize parameters by using an algorithm that mimics biological evolution. It will run through several generations of values trying to find the values that minimize [or maximize, depending on the problem] its fitness or evaluation function, which is just any function that returns a value computed from the parameters the algorithm is optimizing.
There is a lot of literature on how genetic algorithms work, and I would recommend reading it if you want the technical details. Genetic algorithms are typically demonstrated with the knapsack problem [Numb3rs Scene Youtube], where you try to maximize the survival points by seeking the right combination of survival items, weighing under a specified amount, that fits in a knapsack. This R-bloggers site has a good demonstration of that example and code. However, I find it more interesting to use a genetic algorithm on something more familiar to analytics and statistics, and that’s the ordinary least squares regression (OLS).
OLS minimizes the sum of squared errors (SSE) to find the best-fit line or regression line for the data set. The solution is derived using matrix calculus, and it’s computationally efficient, easy to understand, and ubiquitous.
Since OLS essentially is an algorithm that uses calculus to minimize SSE, we can use a genetic algorithm to accomplish the same task. R’s GA (genetic algorithm) package allows you to use either binary or real numbers as parameters for the fitness function. Traditionally, genetic algorithms use binary parameters [see the knapsack algorithm], but for this problem, real numbers will be much more useful since the regression coefficients will be real numbers.
The GA algorithm will create a vector of real numbers between -100 and 100, then use that vector to evaluate a regression equation in the fitness function. The fitness function returns the SSE. Since the GA algorithm seeks to maximize the fitness function, the function has a negative sign in front of it, so the lowest SSE corresponds to the maximum of the negated function. The GA has a population of 500 vectors which are evaluated with the fitness function; the best solutions are generally kept, children vectors are created, and the process is repeated 500 times. The result is an SSE that is very close to the OLS solution, and parameter estimates that match up as well.
I’ve included two different linear models. The first has only the two variables which play significant roles in the OLS regression; the second has every variable, not all of which are significant. You can run it a few times and see how the GA solutions differ. The first model’s GA estimates will be a lot closer to the OLS estimates than the second model’s.
All of this is rather academic for well-behaved linear regression problems, since GAs are computationally expensive, taking far longer than the standard OLS procedure.
The full annotated R code follows:
#install.packages('GA')
library(GA)
#loads an airquality dataframe
data(airquality)
#removes missing data
airquality <- na.omit(airquality)
#### create a function to evaluate a linear regression
#### takes intercept and the two best variables to compute the predicted y_hat
#### then computes and returns the SSE for each chromosome
#### we will try to minimize the SSE like OLS does
OLS <- function(data, b0, b1, b2){
attach(data, warn.conflicts=F)
Y_hat <- b0 + b1*Wind + b2*Temp
SSE = t(Ozone-Y_hat) %*% (Ozone-Y_hat) #matrix formulation for SSE
detach(data)
return(SSE)
}
#### this sets up a real-value GA using 3 parameters all from -100 to 100
#### the parameters use real numbers (so floating decimals) and passes those to
#### the linear regression equation/function
#### the real-value GA requires a min and max
#### this takes a while to run
ga.OLS <- ga(type='real-valued', min=c(-100,-100, -100),
max=c(100, 100, 100), popSize=500, maxiter=500, names=c('intercept', 'Wind', 'Temp'),
keepBest=T, fitness = function(b) -OLS(airquality, b[1],b[2], b[3]))
#### summary of the ga with solution
ga.model <- summary(ga.OLS)
ga.model
#### check the results against the typical OLS procedure
lm.model <- lm(formula= Ozone ~ Wind + Temp, data=airquality)
summary(lm.model)
lm.model$res %*% lm.model$res ### SSE.lm
-ga.model$fitness ### SSE.ga
lm.model$res %*% lm.model$res + ga.model$fitness ### difference between OLS and GA's SSE
#### FULL MODEL ####
OLS.FULL <- function(data, b0, b1, b2, b3, b4, b5){
attach(data, warn.conflicts=F)
Y_hat <- b0 + b1*Solar.R + b2*Wind + b3*Temp + b4*Month + b5*Day # linear regression equation
SSE = t(Ozone-Y_hat) %*% (Ozone-Y_hat) #matrix formulation for SSE
detach(data)
return(SSE)
}
#### this sets up a real-value GA using 6 parameters all from -100 to 100
#### the parameters use real numbers (so floating decimals) and passes those to
#### the linear regression equation/function
#### the real-value GA requires a min and max
#### this takes a while to run
#### this will produce some values that vary a lot from OLS estimates since not all values are significant
#### some estimates should have high standard error
ga.OLS <- ga(type='real-valued', min=c(-100,-100, -100, -100, -100, -100),
max=c(100,100, 100, 100, 100, 100), popSize=500, maxiter=500,
keepBest=T, fitness = function(b) -OLS.FULL(airquality, b[1],b[2], b[3], b[4], b[5], b[6]))
#### summary of the ga with solution
summary(ga.OLS)
#### check the results against the typical OLS procedure for the full model
summary(lm(formula= Ozone ~ Solar.R + Wind + Temp + Month + Day, data=airquality))
OLS Derivation
Ordinary Least Squares (OLS) is a great low computing power way to obtain estimates for coefficients in a linear regression model. I wanted to detail the derivation of the solution since it can be confusing for anyone not familiar with matrix calculus.
First, the initial matrix equation is set up below, with X being a matrix of the data’s p covariates plus the regression constant. [The constant will be represented as a column of ones if you were to look at the data in the X matrix.] Y is the column matrix of the target variable and β is the column matrix of unknown coefficients. e is a column matrix of the residuals.
$latex \mathbf{Y = X} \boldsymbol{\beta} + \boldsymbol{e} &s=1$
Before manipulating the equation, it is important to note you are not solving for X or Y, but instead for β, and you will do this by minimizing the sum of squares for the residuals (SSE). So the equation can be rewritten by moving the error term to the left side of the equation.
$latex \boldsymbol{e} = \mathbf{Y - X} \boldsymbol{\beta}&s=1$
The SSE can be written as the product of the transposed residual column vector and its original column vector. [This is actually how you would obtain the sum of squares for any vector.]
$latex \mathrm{SSE} = \boldsymbol{e}'\boldsymbol{e} &s=1$
Since you transpose and multiply one side of the equation, you have to follow suit on the other side, yielding
$latex \boldsymbol{e'e} = (\mathbf{Y - X} \boldsymbol{\beta})'(\mathbf{Y - X} \boldsymbol{\beta})&s=1$
The transpose operator can be distributed throughout the quantity on the right side, so the right side can be multiplied out.
$latex \boldsymbol{e'e} = (\mathbf{Y' - \boldsymbol{\beta}'X'})(\mathbf{Y - X} \boldsymbol{\beta})&s=1$
Using the fact that a scalar is equal to its own transpose [so Y'Xβ = β'X'Y], you can multiply out the right side and simplify it.
$latex \boldsymbol{e'e} = (\mathbf{Y'Y - Y'X\boldsymbol{\beta} - \boldsymbol{\beta}'X'Y} + \boldsymbol{\beta'\mathbf{X'X}\beta})&s=1$
$latex \boldsymbol{e'e} = (\mathbf{Y'Y - \boldsymbol{\beta}'X'Y - \boldsymbol{\beta}'X'Y} + \boldsymbol{\beta'\mathbf{X'X}\beta})&s=1$
$latex \boldsymbol{e'e} = (\mathbf{Y'Y - 2\boldsymbol{\beta}'X'Y} + \boldsymbol{\beta'\mathbf{X'X}\beta})&s=1$
To minimize the SSE, you have to take the partial derivative relative to β. Any terms without a β term in them will go to zero. Using the transpose rule from before you can see how the middle term yields -2X’Y using differentiation rules from Calc1. The last term is a bit tricky, but it derives to +2X’Xβ.
$latex \frac{\delta\boldsymbol{e'e}}{\delta\boldsymbol{\beta}} = \frac{\delta\mathbf{Y'Y}}{\delta\boldsymbol{\beta}} - \frac{\delta\,2\boldsymbol{\beta}'\mathbf{X'Y}}{\delta\boldsymbol{\beta}} + \frac{\delta\boldsymbol{\beta'\mathbf{X'X}\beta}}{\delta\boldsymbol{\beta}}&s=1$
$latex \frac{\delta\boldsymbol{e'e}}{\delta\boldsymbol{\beta}} = -2\mathbf{X'Y} + 2\mathbf{X'X}\boldsymbol{\beta}&s=1$
To find the minimum (it will never be a maximum if you have all the requirements for OLS fulfilled), the derivative of the SSE is set to zero.
$latex 0 = -2\mathbf{X'Y} + 2\mathbf{X'X}\boldsymbol{\beta}&s=1$
$latex 0 = \mathbf{-X'Y} + \mathbf{X'X}\boldsymbol{\beta}&s=1$
Using some basic linear algebra and multiplying both sides by the inverse of (X’X)…
$latex (\mathbf{X'X})^{-1}\mathbf{X'X}\boldsymbol{\beta} = (\mathbf{X'X})^{-1}\mathbf{X'Y}&s=1$
…yields the solution for β
$latex \boldsymbol{\beta} = (\mathbf{X'X})^{-1}\mathbf{X'Y}&s=1$
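The closed-form solution is easy to verify numerically. Here's a minimal R sketch [not part of the original derivation; the object names are my own] using the same airquality variables as the genetic algorithm post above:

#build the design matrix (model.matrix adds the column of ones for the intercept)
data(airquality)
airquality <- na.omit(airquality)
X <- model.matrix(~ Wind + Temp, data = airquality)
Y <- airquality$Ozone

#beta = (X'X)^-1 X'Y
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
coef(lm(Ozone ~ Wind + Temp, data = airquality)) #should match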