# Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one of the most commonly [over]used statistical tools. In short, it measures how strong the relationship between two variables is. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other.

# Basic Calculation and Definition

Covariance also measures the relationship between two variables, but it is not scaled, so it can’t tell you the strength of that relationship. For example, let’s look at the following vectors a, b and c. Vector c is simply vector b multiplied by 10, and vector a is not a scaled copy of either b or c.

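| a | b | c |
|---|---|---|
| 1 | 4 | 40 |
| 2 | 5 | 50 |
| 3 | 6 | 60 |
| 4 | 8 | 80 |
| 5 | 8 | 80 |
| 5 | 4 | 40 |
| 6 | 10 | 100 |
| 7 | 12 | 120 |
| 10 | 15 | 150 |
| 4 | 9 | 90 |
| 8 | 12 | 120 |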

Plotted out, a vs. b and a vs. c look identical except that the y-axis is scaled differently for each: the y-axis on the c vs. a plot goes to 150 instead of 15. When the covariance is taken of both a & b and a & c, you get a large difference in results. The covariance between a & b is ten times smaller than the covariance between a & c, even though the plots are identical apart from the scale.

$latex cov(X, Y) = \frac{\sum_{i=1}^{N}{(X_i - \bar X)(Y_i - \bar Y)}}{N-1} &s=2$

$latex cov(a, b) = 8.5 &s=2$

$latex cov(a, c) = 85 &s=2$

To account for this, correlation takes the covariance and scales it by the product of the standard deviations of the two variables.

$latex cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y} &s=2$

$latex cor(a, b) = 0.8954 &s=2$

$latex cor(a, c) = 0.8954 &s=2$

Now, correlation describes how strong the relationship between the two vectors is, regardless of the scale. Because the standard deviation of vector c is ten times that of vector b, dividing by it cancels the larger covariance term and produces identical correlation coefficients. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a perfect linear relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no linear relationship.

Here’s the R code that will run through the calculations.

#covariance vs correlation
a <- c(1,2,3,4,5,5,6,7,10,4,8)
b <- c(4,5,6,8,8,4,10,12,15,9,12)
c <- c(4,5,6,8,8,4,10,12,15,9,12) * 10

data <- data.frame(a, b, c)

cov(a, b)  #8.5
cov(a, c)  #85

cor(a,b)  #0.8954
cor(a,c)  #0.8954
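
The built-in functions can also be checked against the two formulas above. The sketch below is only an illustration: `manual_cov` and `manual_cor` are hypothetical helper names, not base R functions, and they simply transcribe the sums and the N-1 denominator from the formulas.

```r
# Same vectors as above; c is b multiplied by 10
a <- c(1, 2, 3, 4, 5, 5, 6, 7, 10, 4, 8)
b <- c(4, 5, 6, 8, 8, 4, 10, 12, 15, 9, 12)
c <- b * 10

# Hypothetical helper: sample covariance written out from the formula
manual_cov <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
}

# Hypothetical helper: covariance scaled by the product of the standard deviations
manual_cor <- function(x, y) {
  manual_cov(x, y) / (sd(x) * sd(y))
}

manual_cov(a, b)  # matches cov(a, b)
manual_cov(a, c)  # matches cov(a, c), ten times larger
manual_cor(a, b)  # matches cor(a, b)
manual_cor(a, c)  # same value as manual_cor(a, b)
```

Multiplying b by 10 multiplies both the covariance and the standard deviation by 10, so the factor cancels in the ratio.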


# Caution | Anscombe's Quartet

Correlation is great. It's a basic tool that is easy to understand, but it has its limitations. The most prominent is the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept with some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it's hard to isolate one factor. Politicians and pundits still try.

Another famous caution about using correlation is Anscombe's Quartet. Anscombe's Quartet consists of four different sets of data that all produce the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it's important to visualize data.

The graphs demonstrate how different the data sets can be. If this were real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be considered appropriate for a basic, linear model.
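
To make the point about non-linear terms concrete, here is a minimal sketch that compares a straight-line fit with a quadratic fit on the second data set (`x2`, `y2`, the one drawn in blue in the code further down). Choosing a quadratic term is my assumption for illustration; the variable names `linear_fit` and `quadratic_fit` are made up for this example.

```r
# anscombe ships with base R (datasets package), so no install is needed

# Straight-line model, the same form used for all four plots
linear_fit <- lm(y2 ~ x2, data = anscombe)

# Same model with an added squared term; I() protects the ^ inside a formula
quadratic_fit <- lm(y2 ~ x2 + I(x2^2), data = anscombe)

r2_linear    <- summary(linear_fit)$r.squared
r2_quadratic <- summary(quadratic_fit)$r.squared

# Compare the share of variance explained by each model
c(linear = r2_linear, quadratic = r2_quadratic)
```

The quadratic model explains almost all of the variance in this set, while the straight line leaves about a third unexplained. The correlation coefficient alone gives no hint of this.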

I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don't need to install any packages to use it. ggplot2 is a fantastic and powerful data visualization package which can be downloaded for free using the install.packages('ggplot2') command. The grid.arrange() function used to combine the plots comes from the gridExtra package, which is installed the same way. Below is the R code I used to make the graphs individually and combine them into a matrix.

#load the plotting packages (gridExtra provides grid.arrange)
library(ggplot2)
library(gridExtra)

#correlation
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits=4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits=4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits=4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits=4)

#define the OLS regression
line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) + geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2]) +
annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) + geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2]) +
annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) + geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2]) +
annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) + geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2]) +
annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

grid.arrange(plot1, plot2, plot3, plot4, top='Anscombe Quartet -- Correlation Demonstration')
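
The fitted lines drawn in the four panels are also nearly the same line. As a quick check that needs only base R, the sketch below collects the intercept and slope of each regression; `fits` and `coefs` are names introduced just for this example.

```r
# Fit y_i ~ x_i for each of the four Anscombe sets
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})

# One column per data set: intercept in row 1, slope in row 2
coefs <- sapply(fits, function(fit) unname(coef(fit)))
rownames(coefs) <- c("intercept", "slope")
colnames(coefs) <- paste0("set", 1:4)

# Intercepts come out near 3 and slopes near 0.5 for every set
round(coefs, 2)
```

So neither the correlation nor the regression coefficients can tell the four data sets apart; only the plots do.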


The full code I used to write up this tutorial is available on my GitHub.

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

# Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier.

The correlation matrix is rather popular for exploratory data analysis, because it can quickly show you the correlations between variables in your data set. From a practical application standpoint, this entire post is unnecessary, because the built-in cor() function does all of this in one call; I’m going to show how to derive it using matrix algebra in R anyway.

First, the starting point will be the covariance matrix that was computed from the last post.

#create vectors -- these will be our columns
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)

#create matrix from vectors
M <- cbind(a,b,c,d,e)
k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects

#create means for each column
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e))

#creates a difference matrix
D <- M - M_mean

#creates the covariance matrix (sample covariance divides by n-1, not by the number of variables)
C <- (n-1)^-1 * t(D) %*% D
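
As a sanity check, the hand-built covariance matrix can be compared with R's built-in cov(). The sketch below rebuilds the matrix in a slightly more compact way, using sweep() and crossprod() instead of the explicit mean matrix; that substitution is my choice for brevity, not the method used above.

```r
# Same columns as above
a <- c(1, 2, 3, 4, 5, 6)
b <- c(2, 3, 5, 6, 1, 9)
c <- c(3, 5, 5, 5, 10, 8)
d <- c(10, 20, 30, 40, 50, 55)
e <- c(7, 8, 9, 4, 6, 10)

M <- cbind(a, b, c, d, e)
n <- nrow(M)  # number of subjects

# Difference matrix: subtract each column's mean from that column
D <- sweep(M, 2, colMeans(M))

# crossprod(D) is t(D) %*% D; dividing by n - 1 gives the sample covariance
C <- crossprod(D) / (n - 1)

# TRUE when the matrix algebra agrees with the built-in function
all.equal(C, cov(M))
```

With six subjects and five variables, n - 1 happens to equal the number of columns, so it is worth checking the result against cov() on data where the two differ.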


$latex {\bf C } = \begin{bmatrix} V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\ C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\ C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\ C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\ C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e \end{bmatrix}&s=2$

This matrix has all the information that's needed to get the correlations for all the variables and create a correlation matrix [V -- variance, C -- covariance]. Correlation (the Pearson version is used here) is calculated from the covariance between two vectors and their standard deviations [s, the square root of the variance]:

$latex cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}} &s=2$

The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:

$latex {\bf V} = diag({\bf C}) = \begin{bmatrix} V_a\ & 0\ & 0\ & 0\ & 0 \\ 0 & V_b & 0 & 0 & 0 \\ 0 & 0 & V_c & 0 & 0 \\ 0 & 0 & 0 & V_d & 0 \\ 0 & 0 & 0 & 0 & V_e \end{bmatrix} &s=2$

R's diag() function doesn't work quite like the matrix algebra notation: given a matrix it returns the diagonal as a vector, and given a vector it builds a diagonal matrix. It is therefore used twice, once to get a vector of the variances and a second time to turn that vector into a diagonal matrix like the one above. Since the standard deviations are needed, the square root is taken. The result is also inverted, so that multiplying by the matrix divides by the standard deviation. Both steps are done at once by raising the variances to the power -1/2.

#builds a diagonal matrix of inverse standard deviations (1/s) from the covariance matrix
S <- diag(diag(C)^(-1/2))
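
The two behaviours of diag() are easier to see on a tiny example. The 2x2 matrix below is made up for illustration and has nothing to do with the data above.

```r
# A made-up 2x2 covariance matrix: variances 4 and 9, covariance 2
C_small <- matrix(c(4, 2,
                    2, 9), nrow = 2, byrow = TRUE)

# Matrix in, vector out: the variances on the diagonal
v <- diag(C_small)

# Vector in, matrix out: the variances back on a diagonal, zeros elsewhere
V <- diag(v)

# Raising to -1/2 takes the square root and inverts in one step,
# so the diagonal holds 1/2 and 1/3 (one over each standard deviation)
S_small <- diag(v^(-1/2))
S_small
```

One caveat: when diag() is given a single number instead of a vector, it returns an identity matrix of that size, so this pattern needs care with a one-variable covariance matrix.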


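After getting the diagonal matrix, basic matrix multiplication is used to make every term in the covariance matrix reflect the basic correlation formula from above. Multiplying by S on the left divides each row by its standard deviation, and multiplying by S on the right does the same for each column.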

$latex {\bf R } = {\bf S} \times {\bf C} \times {\bf S}&s=2$

#constructs the correlation matrix
S %*% C %*% S


And the correlation matrix is symbolically represented as:

$latex {\bf R } = \begin{bmatrix} r_{a,a}\ & r_{a,b}\ & r_{a,c}\ & r_{a,d}\ & r_{a,e} \\ r_{a,b} & r_{b,b} & r_{b,c} & r_{b,d} & r_{b,e} \\ r_{a,c} & r_{b,c} & r_{c,c} & r_{c,d} & r_{c,e} \\ r_{a,d} & r_{b,d} & r_{c,d} & r_{d,d} & r_{d,e} \\ r_{a,e} & r_{b,e} & r_{c,e} & r_{d,e} & r_{e,e} \end{bmatrix}&s=2$

The diagonal, where the variances were in the covariance matrix, is now all 1s, since a variable's correlation with itself is always 1.
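
To close the loop, the result of the matrix algebra can be compared with the built-in shortcuts. This is a minimal self-contained sketch; it starts from cov() rather than the hand-built covariance matrix purely to keep the example short, and the result is named R to match the notation above.

```r
# Same data as above, built directly as a named-column matrix
M <- cbind(a = c(1, 2, 3, 4, 5, 6),
           b = c(2, 3, 5, 6, 1, 9),
           c = c(3, 5, 5, 5, 10, 8),
           d = c(10, 20, 30, 40, 50, 55),
           e = c(7, 8, 9, 4, 6, 10))

C <- cov(M)                   # covariance matrix
S <- diag(diag(C)^(-1/2))     # diagonal matrix of 1/s
R <- S %*% C %*% S            # correlation matrix

# S carries no row or column names, so copy them over from C
dimnames(R) <- dimnames(C)

all.equal(R, cor(M))          # TRUE: matches the built-in correlation
all.equal(R, cov2cor(C))      # TRUE: cov2cor() does the same rescaling
diag(R)                       # every diagonal entry is 1
```

In day-to-day work cor(M) or cov2cor(C) is the call to reach for; the matrix version is here to show what those functions are doing.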