Covariance and Correlation

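Definition 1: The $\textbf{covariance}$ between two sample random variables $x$ and $y$ is a measure of the linear association between the two variables, defined by the formula
$cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\tag{1}$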
Note: Covariance is similar to variance, except that covariance is defined for two variables ($x$ and $y$ above), whereas variance is defined for only one. In fact, $cov(x, x) = var(x)$.
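
As a minimal sketch of equation (1), the following Python snippet computes the sample covariance by hand and checks that $cov(x, x)$ agrees with the sample variance. The function name `sample_covariance` and the data values are illustrative assumptions, not part of the original text.

```python
from statistics import variance


def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator, as in equation (1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum the products of paired deviations from the two means.
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Made-up illustrative data.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

cov_xy = sample_covariance(x, y)  # 14 / 3
cov_xx = sample_covariance(x, x)  # 20 / 3, the same as variance(x)
print(cov_xy, cov_xx, variance(x))
```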

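Covariance can be thought of as a sum of matches and mismatches across the pairs of data elements of $x$ and $y$, each weighted by the product of the pair's deviations from the means: a match occurs when both elements of a pair are on the same side of their respective means; a mismatch occurs when one element of the pair is above its mean and the other is below its mean.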

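When matches outweigh mismatches, the covariance is positive; when mismatches outweigh matches, it is negative. For data on a given scale, the absolute value of the covariance reflects the strength of the linear relationship between $x$ and $y$: the stronger the linear relationship, the larger the covariance. The size of the covariance is also affected by the scale of the data elements, however; to remove this scale factor, the correlation coefficient is used as a scale-free measure of the linear relationship.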

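Definition 2: The correlation coefficient between two sample variables $x$ and $y$ is a scale-free measure of the linear association between the two variables, given by the formula below, where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$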
$r_{xy} = \frac{cov(x,y)}{s_xs_y}\tag{2}$
We also use the term $\textbf{coefficient of determination}$ for $r^2$ .
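
To illustrate why the correlation coefficient is described as scale-free, the sketch below rescales $x$ by a factor of 100 (as if converting meters to centimeters) and compares the covariance and the correlation before and after. The helper names and data values are assumptions made for illustration.

```python
from math import sqrt


def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Sample correlation r = cov(x, y) / (s_x * s_y), as in Definition 2."""
    s_x = sqrt(sample_covariance(x, x))
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Made-up illustrative data.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]
x_rescaled = [100 * xi for xi in x]  # same data in different units

# The covariance grows by the scale factor; the correlation does not change.
cov_ratio = sample_covariance(x_rescaled, y) / sample_covariance(x, y)
r = correlation(x, y)
r_rescaled = correlation(x_rescaled, y)
print(cov_ratio, r, r_rescaled)
```

Squaring `r` gives the coefficient of determination for the same data.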

Note: Just as we saw for the variance in $\textbf{Measures of Variability}$, the covariance can be calculated as
$\begin{aligned} \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) &= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-\bar{x}\sum_{i=1}^{n}y_i-\bar{y}\sum_{i=1}^{n}x_i+n\bar{x}\bar{y}\right)\\ &= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}\right) \tag{3} \end{aligned}$
Therefore, we can also calculate the correlation coefficient as
$r_{xy} = \frac{\sum_{i = 1}^{n}x_iy_i-n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2-n\bar{x}^2}\sqrt{\sum_{i=1}^{n}y_i^2-n\bar{y}^2}}\tag{4}$
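
The following sketch compares the definition of the correlation coefficient with the computational form in equation (4) on a small made-up data set. Both function names are hypothetical. The shortcut form subtracts two potentially large sums, so in floating-point arithmetic it may lose precision when the means are large relative to the spread of the data; the deviation-based form is usually the safer choice in practice.

```python
from math import sqrt


def correlation_from_deviations(x, y):
    """r computed from deviations about the means (Definition 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    # The 1 / (n - 1) factors cancel between numerator and denominator.
    return s_xy / sqrt(s_xx * s_yy)


def correlation_shortcut(x, y):
    """r computed from raw sums, following equation (4)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
    s_yy = sum(yi * yi for yi in y) - n * y_bar ** 2
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


# Made-up illustrative data.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

r_def = correlation_from_deviations(x, y)
r_short = correlation_shortcut(x, y)
print(r_def, r_short)
```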
Property 1: $-1\leq r\leq 1$.

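Note: If $r$ is close to 1, then $x$ and $y$ are positively correlated. A positive linear correlation means that higher values of $x$ are associated with higher values of $y$, and lower values of $x$ are associated with lower values of $y$.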

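If $r$ is close to -1, then $x$ and $y$ are negatively correlated. A negative linear correlation means that higher values of $x$ are associated with lower values of $y$, and lower values of $x$ are associated with higher values of $y$.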

When $r$ is close to 0, there is little or no linear relationship between $x$ and $y$.

Note: We have defined covariance and the correlation coefficient for data samples. We can also define covariance and the correlation coefficient for populations, based on their probability density function (pdf).

Definition 3: The $\textbf{covariance}$ between two random variables $x$ and $y$ for a population with discrete or continuous pdf is defined by
$cov(x,y) = E[(x-\mu_{x})(y-\mu_{y})]\tag{5}$
where $E[\,]$ is the expectation function and $\mu_x$, $\mu_y$ are the population means of $x$ and $y$.

Definition 4: The $\textbf{(Pearson’s product moment) correlation coefficient}$ for two variables $x$ and $y$ for a population with discrete or continuous pdf is
$\rho = \frac{cov(x,y)}{\sigma_x\sigma_y}\tag{6}$
Property 2: $-1\leq\rho\leq1$.

Property 3: $cov(x, y) = E[xy]-\mu_x\mu_y$
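
The identity $cov(x,y) = E[xy]-\mu_x\mu_y$ can be checked by expanding the product in (5) and using the linearity of expectation together with $E[x]=\mu_x$ and $E[y]=\mu_y$:
$\begin{aligned} E[(x-\mu_x)(y-\mu_y)] &= E[xy]-\mu_xE[y]-\mu_yE[x]+\mu_x\mu_y\\ &= E[xy]-\mu_x\mu_y-\mu_x\mu_y+\mu_x\mu_y\\ &= E[xy]-\mu_x\mu_y \end{aligned}$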

Property 4: If $x$ and $y$ are independent, then $cov(x, y) = 0$.
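
Zero covariance does not, in general, imply independence. As a small sketch, consider a hypothetical discrete population (chosen here purely for illustration) in which $x$ is uniform on $\{-1, 0, 1\}$ and $y = x^2$. The covariance is exactly zero, yet $y$ is completely determined by $x$. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical population: x takes -1, 0, 1 with probability 1/3 each, y = x**2.
support = [-1, 0, 1]
p = Fraction(1, 3)

mu_x = sum(p * x for x in support)            # E[x]
mu_y = sum(p * x ** 2 for x in support)       # E[y] = E[x^2]
e_xy = sum(p * x * x ** 2 for x in support)   # E[xy] = E[x^3]
cov_xy = e_xy - mu_x * mu_y                   # E[xy] - mu_x * mu_y

# Independence would require P(x=0, y=0) = P(x=0) * P(y=0).
p_joint = p                                   # y = 0 happens only when x = 0
p_x0 = p
p_y0 = sum(p for x in support if x ** 2 == 0)

print(cov_xy, p_joint, p_x0 * p_y0)
```

Here the joint probability is 1/3 while the product of the marginals is 1/9, so the two variables are dependent even though their covariance is 0.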

Property 5:
$\begin{aligned} var(x+y) &= var(x)+var(y)+2cov(x,y)\\ var(x-y) &= var(x)+var(y)-2cov(x,y) \tag{7} \end{aligned}$
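
The identities in (7) also hold for the sample variance and sample covariance when both use the $n-1$ denominator. The sketch below checks this numerically on made-up data; `sample_covariance` is an illustrative helper, not a library function.

```python
from statistics import variance


def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Made-up illustrative data.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

sums = [xi + yi for xi, yi in zip(x, y)]
diffs = [xi - yi for xi, yi in zip(x, y)]
cov_xy = sample_covariance(x, y)

# Left and right sides of var(x + y) = var(x) + var(y) + 2 cov(x, y).
lhs_sum = variance(sums)
rhs_sum = variance(x) + variance(y) + 2 * cov_xy

# Left and right sides of var(x - y) = var(x) + var(y) - 2 cov(x, y).
lhs_diff = variance(diffs)
rhs_diff = variance(x) + variance(y) - 2 * cov_xy

print(lhs_sum, rhs_sum, lhs_diff, rhs_diff)
```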
Note: It turns out that $r$ is not an unbiased estimate of $\rho$. A relatively unbiased estimate of $\rho^2$ is given by the $\textbf{adjusted coefficient of determination}$ $r_{adj}^2$:
$r_{adj}^2 = 1-\frac{(1-r^2)(n-1)}{n-2}\tag{8}$
While $r_{adj}^2$ is a better estimate of the population coefficient of determination than $r^2$, especially for small values of $n$, it is easy to see that $r_{adj}^2\approx r^2$ for large values of $n$. Note too that $r_{adj}^2\leq r^2$, and while $r_{adj}^2$ can be negative, this is relatively rare.
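
A minimal sketch of equation (8) follows. The function name `adjusted_r_squared` and the example values of $r$ and $n$ are assumptions chosen to illustrate the three observations above: the adjustment never exceeds $r^2$, it shrinks toward $r^2$ as $n$ grows, and it can go negative for a weak correlation in a small sample.

```python
def adjusted_r_squared(r, n):
    """Adjusted coefficient of determination from equation (8)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return 1 - (1 - r ** 2) * (n - 1) / (n - 2)


small_sample = adjusted_r_squared(0.5, 10)    # 0.15625, versus r^2 = 0.25
large_sample = adjusted_r_squared(0.5, 1000)  # close to r^2 = 0.25
weak_small = adjusted_r_squared(0.1, 5)       # negative

print(small_sample, large_sample, weak_small)
```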