*The full R code for this post is available on my GitHub.*

Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while the columns are the variables assigned to those subjects. While we use the matrix terminology, this would look much like a normal data table you might already have your data in. For the example in R, I’m going to create a 6×5 matrix, with 6 subjects and 5 different variables (a,b,c,d,e). I’m choosing this particular convention because R and databases use it: a row in a data frame represents a subject, while the columns are different variables. (The underlying structure of the data frame is a collection of vectors.) This goes against the convention in some mathematical texts, which put the variables in rows rather than columns, so the formulas here won’t match some of those found elsewhere online.
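To make the data frame remark concrete, here is a small sketch (my own illustration, using the same example values that appear later in this post). It shows that a data frame is stored as a list of equal-length column vectors and that it converts to a matrix with subjects as rows. The names `dat` and `M_from_dat` are placeholders I chose for this example.

```r
# Illustrative data frame: one column vector per variable, one row per subject
dat <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 3, 5, 6, 1, 9),
  c = c(3, 5, 5, 5, 10, 8),
  d = c(10, 20, 30, 40, 50, 55),
  e = c(7, 8, 9, 4, 6, 10)
)

is.list(dat)   # a data frame is a list whose elements are the columns
dat$a          # pull one variable out as a plain vector

# Convert to a numeric matrix: rows are subjects, columns are variables
M_from_dat <- as.matrix(dat)
dim(M_from_dat)
```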

The covariance matrix is a matrix that only concerns the relationships between variables, so it will be a k×k square matrix (in our case, a 5×5 matrix). Before constructing the covariance matrix, it’s helpful to think of the data matrix as a collection of 5 vectors, which is how I built our data matrix in R.

    # create vectors -- these will be our columns
    a <- c(1,2,3,4,5,6)
    b <- c(2,3,5,6,1,9)
    c <- c(3,5,5,5,10,8)
    d <- c(10,20,30,40,50,55)
    e <- c(7,8,9,4,6,10)
    # create matrix from vectors
    M <- cbind(a,b,c,d,e)

The data matrix (M) written out is shown below.

         a b  c  d  e
    [1,] 1 2  3 10  7
    [2,] 2 3  5 20  8
    [3,] 3 5  5 30  9
    [4,] 4 6  5 40  4
    [5,] 5 1 10 50  6
    [6,] 6 9  8 55 10

Each value in the covariance matrix represents the covariance (or variance) between two of the vectors. With five vectors, there are 25 different combinations that can be made and those combinations can be laid out in a 5x5 matrix.
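As a rough illustration of a single cell, the sketch below computes one covariance and one variance by hand for two of the example vectors and compares them with R's built-in cov() and var(). The names `cov_ab` and `var_a` are mine, and the calculation assumes the sample (n-1) form of the estimator.

```r
# Two of the example vectors from this post
a <- c(1, 2, 3, 4, 5, 6)
b <- c(2, 3, 5, 6, 1, 9)
n <- length(a)

# Sample covariance: sum of the products of deviations, divided by n - 1
cov_ab <- sum((a - mean(a)) * (b - mean(b))) / (n - 1)

# The variance of a is the covariance of a with itself
var_a <- sum((a - mean(a))^2) / (n - 1)

# Both should agree with the built-in functions
all.equal(cov_ab, cov(a, b))
all.equal(var_a, var(a))
```

Repeating this for every pairing of the five vectors would fill in all 25 cells.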

There are a few different ways to compute a covariance matrix. You can call the cov() function on the whole data matrix instead of on two vectors. (This is the easiest way to get a covariance matrix in R.)

    cov(M)

But we'll use the following steps to construct it manually:

- Create a matrix of means (M_mean).
- Create a difference matrix (D) by subtracting the matrix of means (M_mean) from data matrix (M).
- Create the covariance matrix (C) by multiplying the transposed difference matrix (D) by the difference matrix itself, then by the inverse of the number of subjects (n). (We will use (n-1) instead of n, since this gives the unbiased sample covariance estimator. This is the covariance R returns by default.)

$latex {\bf M\_mean} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \times \begin{bmatrix} \bar{x}_{a} & \bar{x}_{b} & \bar{x}_{c} & \bar{x}_{d} & \bar{x}_{e} \end{bmatrix} &s=2$

$latex {\bf D = M - M\_mean} &s=2$

$latex {\bf C = } (n-1)^{-1} \times {\bf D^T} \times {\bf D} &s=2$

    k <- ncol(M) # number of variables
    n <- nrow(M) # number of subjects
    # create means for each column
    M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e))
    # create a difference matrix
    D <- M - M_mean
    # create the covariance matrix
    C <- (n-1)^-1 * t(D) %*% D
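As a sanity check on the manual construction, here is a more compact sketch of the same calculation, compared against cov(). It swaps in two base R helpers that the steps above do not use: scale() to subtract the column means and crossprod() to form t(D) %*% D. The names `D_alt` and `C_alt` are placeholders of mine.

```r
# Rebuild the example data matrix so this block runs on its own
a <- c(1, 2, 3, 4, 5, 6)
b <- c(2, 3, 5, 6, 1, 9)
c <- c(3, 5, 5, 5, 10, 8)
d <- c(10, 20, 30, 40, 50, 55)
e <- c(7, 8, 9, 4, 6, 10)
M <- cbind(a, b, c, d, e)
n <- nrow(M)

# Center each column on its mean (the difference matrix) without rescaling
D_alt <- scale(M, center = TRUE, scale = FALSE)

# crossprod(D_alt) is t(D_alt) %*% D_alt; divide by n - 1 for the sample estimator
C_alt <- crossprod(D_alt) / (n - 1)

# Should agree with R's built-in result up to floating-point tolerance
all.equal(C_alt, cov(M), check.attributes = FALSE)
```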

The final covariance matrix made using the R code looks like this:

         a         b    c          d        e
    a  3.5  3.000000  4.0  32.500000 0.400000
    b  3.0  8.666667  0.4  25.333333 2.466667
    c  4.0  0.400000  6.4  38.000000 0.400000
    d 32.5 25.333333 38.0 304.166667 1.333333
    e  0.4  2.466667  0.4   1.333333 4.666667

It contains the variances (V) and covariances (C) for every pairing of the five variables in our data set: the variances sit on the diagonal and the covariances fill the off-diagonal cells. These are all values that you might be familiar with if you've used the var() or cov() functions in R or similar functions in Excel, SPSS, etc.

$latex \begin{bmatrix}
V_a & C_{a,b} & C_{a,c} & C_{a,d} & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix} &s=2$
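The layout can be checked in R with a short sketch: the diagonal should hold the per-variable variances, and the matrix should be symmetric because the covariance of a with b equals the covariance of b with a. The name `S` is a placeholder of mine for the result of cov().

```r
# Rebuild the example data matrix so this block runs on its own
a <- c(1, 2, 3, 4, 5, 6)
b <- c(2, 3, 5, 6, 1, 9)
c <- c(3, 5, 5, 5, 10, 8)
d <- c(10, 20, 30, 40, 50, 55)
e <- c(7, 8, 9, 4, 6, 10)
M <- cbind(a, b, c, d, e)

S <- cov(M)

# Diagonal cells are the variances of the individual columns
all.equal(unname(diag(S)), unname(apply(M, 2, var)))

# Off-diagonal cells mirror each other across the diagonal
isSymmetric(S)

# A single off-diagonal cell is the covariance of that pair of columns
all.equal(S["a", "b"], cov(a, b))
```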

This matrix is used in applications like constructing the correlation matrix and generalized least squares regressions.
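As one example of the first use, a correlation matrix can be sketched from the covariance matrix by dividing each covariance by the product of the two variables' standard deviations. The names `s` and `R_manual` are ones I introduce for illustration; cov2cor() and cor() are the base R functions used for comparison.

```r
# Rebuild the example data matrix so this block runs on its own
a <- c(1, 2, 3, 4, 5, 6)
b <- c(2, 3, 5, 6, 1, 9)
c <- c(3, 5, 5, 5, 10, 8)
d <- c(10, 20, 30, 40, 50, 55)
e <- c(7, 8, 9, 4, 6, 10)
M <- cbind(a, b, c, d, e)

C <- cov(M)

# Standard deviations are the square roots of the diagonal (the variances)
s <- sqrt(diag(C))

# Divide each cell C[i, j] by s[i] * s[j]
R_manual <- C / outer(s, s)

# Both comparisons should agree up to floating-point tolerance
all.equal(R_manual, cor(M))
all.equal(cov2cor(C), cor(M))
```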