Covariance and Correlation

anArkitek

Category: Mathematics

Article link: https://blog.csdn.net/anArkitek/article/details/97331972

Demystifying the terms

Covariance indicates the direction of the linear relationship between variables.

Correlation, on the other hand, measures both the strength and the direction of the linear relationship between two variables.

Correlation is a function of the covariance. What sets them apart is that correlation values are standardized, whereas covariance values are not.

Defining the terms mathematically

Covariance

$\begin{aligned} cov(x,y) &= E[(x - \mu_x) (y - \mu_y)]\\ &= E[xy] - E[x] E[y] \end{aligned}$
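The second equality follows from expanding the product and using linearity of expectation, with $\mu_x = E[x]$ and $\mu_y = E[y]$:

$\begin{aligned} E[(x - \mu_x)(y - \mu_y)] &= E[xy - \mu_y x - \mu_x y + \mu_x \mu_y] \\ &= E[xy] - \mu_y E[x] - \mu_x E[y] + \mu_x \mu_y \\ &= E[xy] - \mu_x \mu_y - \mu_x \mu_y + \mu_x \mu_y \\ &= E[xy] - E[x] E[y] \end{aligned}$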

If we have only a single variable $x$, then

$\begin{aligned} cov(x, x) &= E[(x - \mu_x) (x - \mu_x)]\\ &= E[(x - \mu_x)^2] \\ &= var(x) = \sigma^2(x) = \sigma^2_x \\ \text{Let }var(x) & := s^2 \hspace{1cm} \text{sample variance} \end{aligned}$

For a sample of $n$ observations, these quantities are estimated by

$\begin{aligned} s^2 = cov(x, x) &= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \\ cov(x,y) &= \frac{\sum_{i=1}^{n}(x_i - \bar{x}) (y_i - \bar{y})}{n-1} \end{aligned}$

The numerator of the first equation is called the sum of squared deviations, and the numerator of the second is called the sum of cross products.
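As a minimal sketch of the two sample formulas, the following plain-Python snippet computes the sum of cross products and divides by $n-1$. The function names and the data are illustrative assumptions, not part of the original article.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sum of cross products divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar, y_bar = mean(xs), mean(ys)
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / (n - 1)


# Illustrative data (hypothetical, chosen so the arithmetic is easy to check).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

s2_x = sample_covariance(x, x)    # sample variance of x: 10 / 4 = 2.5
cov_xy = sample_covariance(x, y)  # sum of cross products 20, so 20 / 4 = 5.0
print(s2_x, cov_xy)
```

Calling `sample_covariance(x, x)` reproduces the sample variance, which mirrors the identity $cov(x, x) = s^2$.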

Correlation

$\begin{aligned} corr(x,y) = \frac{cov(x,y)}{s_x s_y} &= \frac{E[(x - \mu_x) (y - \mu_y)]}{s_x s_y} \\ &= \frac{E[(x - \mu_x) (y - \mu_y)]}{\sigma_x \sigma_y} \end{aligned}$

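So the values of the correlation coefficient range over $[-1, 1]$. The sign signifies the direction of the relationship: a positive value means that when one variable increases, the other tends to increase as well, while a negative value means the other tends to decrease. The magnitude indicates how strong the linear relationship is.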
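The following sketch computes the sample correlation as the covariance divided by the product of the two sample standard deviations. The helper names and the three small data sets are assumptions made for illustration.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """corr(x, y) = cov(x, y) / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical data: y_up rises with x, y_down falls, y_mixed is noisier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]
y_mixed = [2.0, 1.0, 4.0, 3.0, 5.0]

r_up = correlation(x, y_up)        # close to +1: perfect increasing line
r_down = correlation(x, y_down)    # close to -1: perfect decreasing line
r_mixed = correlation(x, y_mixed)  # close to 0.8: positive but imperfect
print(r_up, r_down, r_mixed)
```

The values are described as "close to" rather than exact because floating-point square roots can introduce tiny rounding differences.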

Data-matrix representation of covariance and correlation

$\begin{bmatrix} x_{11} & ... & x_{1n} \\ ... & ... & ... \\ x_{m1} & ... & x_{mn} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 & ... & \mathbf{x}_n \end{bmatrix}$

The data matrix has order $m\times n$.

We call each row an item (or subject) and each column a variable.

Now we can calculate the sample mean of the $j$th variable:

$\bar{x}_j = \frac{1}{m}\sum_{i=1}^m x_{ij}$

Similarly, the mean of the $i$th row is

$\bar{x}_i = \frac{1}{n}\sum_{j=1}^nx_{ij}$

We then can define the covariance matrix:

$\begin{aligned} S = \frac{1}{m}\begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}}_1)^T \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}}_n)^T \\ \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - \bar{\mathbf{x}}_1 & \cdots & \mathbf{x}_n - \bar{\mathbf{x}}_n \end{bmatrix} &= \begin{bmatrix} s_{1}^2 & \cdots & s_{1n} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{n}^2 \\ \end{bmatrix}\\ \text{where } s_j^2 &= \frac{1}{m}\sum_{i=1}^{m}(x_{ij} - \bar{x}_j)^2 \hspace{1cm} \text{variance of jth variable} \\ s_{jk} &= \frac{1}{m} \sum_{i=1}^{m}(x_{ij} - \bar{x}_j) (x_{ik} - \bar{x}_k) \hspace{1cm} \text{covariance between jth and kth variable}\\ \bar{x}_j &= \frac{1}{m}\sum_{i=1}^{m}x_{ij} \hspace{1cm} \text{mean of jth variable, with } \bar{\mathbf{x}}_j = \bar{x}_j \mathbf{1}_m \end{aligned}$

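We can see that the covariance matrix is an $n\times n$ symmetric matrix. Note that $S$ is normalized by $\frac{1}{m}$ here; dividing by $m-1$ instead gives the unbiased sample version that matches the $n-1$ denominator used in the scalar formulas above.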
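To make the matrix definition concrete, here is a small plain-Python sketch that centers each column of an $m\times n$ data matrix and forms the entries $s_{jk}$ with the $\frac{1}{m}$ normalization used in this section. The function name and the data are hypothetical.

```python
def covariance_matrix(X):
    """Covariance matrix of the columns of X (rows are items), 1/m normalization."""
    m, n = len(X), len(X[0])
    # Column means: one mean per variable.
    means = [sum(row[j] for row in X) / m for j in range(n)]
    # Center every column by subtracting its mean.
    centered = [[row[j] - means[j] for j in range(n)] for row in X]
    # Entry (j, k) is the average cross product of centered columns j and k.
    return [
        [sum(c[j] * c[k] for c in centered) / m for k in range(n)]
        for j in range(n)
    ]


# Hypothetical data: 4 items (rows), 3 variables (columns).
X = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 1.0],
]

S = covariance_matrix(X)
for row in S:
    print(row)
```

The diagonal of `S` holds the variances $s_j^2$ and the off-diagonal entries hold the covariances $s_{jk}$, so `S[j][k]` equals `S[k][j]`. With NumPy available, `numpy.cov(X, rowvar=False, bias=True)` should give the same normalization.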

Then we can define the correlation matrix:

$\begin{aligned} R &= \frac{1}{m} \begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}}_1)^T / s_1 \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}}_n)^T / s_n \\ \end{bmatrix} \begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}}_1) / s_1 & \cdots & (\mathbf{x}_n - \bar{\mathbf{x}}_n) / s_n \\ \end{bmatrix}\\ &= \begin{bmatrix} 1 & r_{12} & \cdots & r_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & \cdots & \cdots & 1 \end{bmatrix} \end{aligned}$
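The correlation matrix can be sketched the same way: standardize each column (subtract its mean, divide by its standard deviation $s_j$) and average the cross products of the standardized columns. The function name and the data below are illustrative assumptions, and the $\frac{1}{m}$ normalization is used for both $s_j$ and the final average.

```python
import math


def correlation_matrix(X):
    """Correlation matrix of the columns of X via standardized columns."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    # Standard deviation of each column, with the 1/m normalization.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / m)
        for j in range(n)
    ]
    if any(s == 0 for s in stds):
        raise ValueError("a constant column has no defined correlation")
    # Standardized data: each column has mean 0 and standard deviation 1.
    Z = [[(row[j] - means[j]) / stds[j] for j in range(n)] for row in X]
    # R = (1/m) * Z^T Z, written out entry by entry.
    return [
        [sum(z[j] * z[k] for z in Z) / m for k in range(n)]
        for j in range(n)
    ]


# Hypothetical data: column 2 is twice column 1; column 3 mostly decreases.
X = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 1.0],
    [4.0, 8.0, 2.0],
]

R = correlation_matrix(X)
for row in R:
    print([round(r, 6) for r in row])
```

Up to floating-point rounding, the diagonal entries are 1, the entry for columns 1 and 2 is 1 (one is an exact multiple of the other), and the entries involving column 3 are negative.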

Covariance versus Correlation

Covariance carries units: the product of the units of the two variables
Correlation is dimensionless
Covariance can take any value in $(-\infty, +\infty)$
Correlation always lies in $[-1, 1]$
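The difference in units can be checked numerically. In the sketch below, the same heights are expressed in meters and in centimeters: the covariance with weight is multiplied by 100, while the correlation stays the same. All names and numbers are made up for illustration.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical measurements of five people.
heights_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weights_kg = [55.0, 65.0, 68.0, 77.0, 85.0]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

cov_m = sample_covariance(heights_m, weights_kg)    # units: m * kg
cov_cm = sample_covariance(heights_cm, weights_kg)  # units: cm * kg
r_m = correlation(heights_m, weights_kg)            # dimensionless
r_cm = correlation(heights_cm, weights_kg)          # dimensionless

print("covariance ratio:", cov_cm / cov_m)  # approximately 100
print("correlations:", r_m, r_cm)           # approximately equal
```

This is why covariance values are hard to compare across data sets measured in different units, whereas correlation values can be compared directly.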