Basic Concepts of Correlation

Definition 1: The $\textbf{covariance}$ between two sample random variables $x$ and $y$ is a measure of the linear association between the two variables, defined by the formula
$$cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\tag{1}$$
Note: Covariance is similar to variance, except that covariance is defined for two variables ($x$ and $y$ above), whereas variance is defined for a single variable. In fact, $cov(x, x) = var(x)$.
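As a minimal sketch of equation (1), the snippet below computes the sample covariance directly from the definition and checks that $cov(x, x) = var(x)$. The helper name `sample_cov` and the data values are made up for illustration; they do not come from the original post.

```python
from statistics import mean, variance

def sample_cov(x, y):
    """Sample covariance with the n - 1 denominator, as in equation (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

# Illustrative data (an assumption, not from the source)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

cov_xy = sample_cov(x, y)  # covariance of x with y
cov_xx = sample_cov(x, x)  # should agree with the sample variance of x
var_x = variance(x)        # statistics.variance also divides by n - 1
```

Here `cov_xx` and `var_x` agree up to floating-point rounding, because both use the same $n-1$ denominator.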

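Covariance can be thought of as a sum of matches and mismatches among the data pairs of $x$ and $y$: a match occurs when both elements of a pair lie on the same side of their respective means; a mismatch occurs when one element is above its mean and the other is below its mean.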

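When the matches outweigh the mismatches, the covariance is positive; when the mismatches outweigh the matches, the covariance is negative. The absolute value of the covariance reflects the strength of the linear relationship between $x$ and $y$: the stronger the linear relationship, the larger the covariance. The size of the covariance also depends on the scale of the data elements, however, so to remove this scale factor the correlation coefficient is used as a scale-free measure of linear relationship.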
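The match/mismatch reading can be made concrete with a small sketch: each pair contributes the product of its two deviations, which is positive for a match and negative for a mismatch. The data values are made up for illustration.

```python
from statistics import mean

# Illustrative data (an assumption, not from the source)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

x_bar, y_bar = mean(x), mean(y)

# One signed product per pair: > 0 is a match, < 0 is a mismatch
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]

matches = sum(1 for p in products if p > 0)
mismatches = sum(1 for p in products if p < 0)

# The covariance is the sum of the signed products divided by n - 1
cov_xy = sum(products) / (len(x) - 1)
```

Note that the covariance weighs each pair by the size of its product, so it is not simply a count of matches minus mismatches; a pair sitting exactly at one of the means contributes zero.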

Definition 2: The correlation coefficient between two sample variables $x$ and $y$ is a scale-free measure of the linear association between the two variables, given by the formula
$$r_{xy} = \frac{cov(x,y)}{s_x s_y}\tag{2}$$
where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$. We also use the term $\textbf{coefficient of determination}$ for $r^2$.
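To illustrate the scale-free claim, the following sketch computes $r$ from equation (2) and then recomputes it after rescaling $x$ (for example, a change of units). The helper name `sample_corr` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def sample_corr(x, y):
    """Sample correlation coefficient, equation (2).

    The 1/(n - 1) factors in cov(x, y), s_x and s_y cancel, so only the
    sums of deviation products are needed.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Illustrative data (an assumption, not from the source)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

r = sample_corr(x, y)
r_squared = r ** 2  # coefficient of determination

# Rescaling x multiplies the covariance by 100 but leaves r unchanged
x_rescaled = [100.0 * xi for xi in x]
r_rescaled = sample_corr(x_rescaled, y)
```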

Note: Just as we saw for the variance in $\textbf{Measures of Variability}$, the covariance can be calculated as
$$
\begin{aligned}
\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) &= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-\bar{x}\sum_{i=1}^{n}y_i-\bar{y}\sum_{i=1}^{n}x_i+n\bar{x}\bar{y}\right)\\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}\right)
\end{aligned}
\tag{3}
$$
As a result, we can also calculate the correlation coefficient as
$$r_{xy} = \frac{\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2-n\bar{x}^2}\sqrt{\sum_{i=1}^{n}y_i^2-n\bar{y}^2}}\tag{4}$$
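A quick numerical check of the shortcut formula (4), written as a minimal sketch with made-up data. The function name `corr_shortcut` is hypothetical.

```python
from math import sqrt
from statistics import mean

def corr_shortcut(x, y):
    """Correlation coefficient via the raw-sum formula, equation (4)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    num = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    den_x = sum(xi * xi for xi in x) - n * x_bar ** 2
    den_y = sum(yi * yi for yi in y) - n * y_bar ** 2
    return num / (sqrt(den_x) * sqrt(den_y))

# Illustrative data (an assumption, not from the source)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

r_shortcut = corr_shortcut(x, y)
```

For this data the numerator is $74 - 60 = 14$ and the two terms under the square roots are $20$ and $14$, the same sums of deviations the definitional formula produces. One caveat on the design choice: with floating-point data whose mean is large relative to its spread, subtracting two nearly equal raw sums can lose precision, so the deviation-based form is usually the safer one to implement.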
Property 1: $-1\leq r\leq 1$.

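Note: If $r$ is close to 1, then $x$ and $y$ are positively correlated. A positive linear correlation means that higher values of $x$ are associated with higher values of $y$, and lower values of $x$ are associated with lower values of $y$.

If $r$ is close to -1, then $x$ and $y$ are negatively correlated. A negative linear correlation means that higher values of $x$ are associated with lower values of $y$, and lower values of $x$ are associated with higher values of $y$.

When $r$ is close to 0, there is little linear relationship between $x$ and $y$.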

Note: We have defined covariance and the correlation coefficient for data samples. We can also define covariance and the correlation coefficient for populations, based on their probability density function (pdf).

Definition 3: The $\textbf{covariance}$ between two random variables $x$ and $y$ for a population with discrete or continuous pdf is defined by
$$cov(x,y) = E[(x-\mu_{x})(y-\mu_{y})]\tag{5}$$
where $E[\,]$ is the expectation function.

Definition 4: The $\textbf{(Pearson's product moment)}$ correlation coefficient for two variables $x$ and $y$ for a population with discrete or continuous pdf is
$$\rho = \frac{cov(x,y)}{\sigma_x\sigma_y}\tag{6}$$
Property 2: $-1\leq\rho\leq 1$.

Property 3: $cov(x,y) = E[xy]-\mu_x\mu_y$
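Property 3 follows from expanding the product inside the expectation in equation (5) and using linearity of expectation, with $E[x] = \mu_x$ and $E[y] = \mu_y$:

$$
\begin{aligned}
cov(x,y) &= E[(x-\mu_x)(y-\mu_y)]\\
&= E[xy] - \mu_x E[y] - \mu_y E[x] + \mu_x\mu_y\\
&= E[xy] - \mu_x\mu_y - \mu_x\mu_y + \mu_x\mu_y\\
&= E[xy] - \mu_x\mu_y
\end{aligned}
$$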

Property 4: If $x$ and $y$ are independent, then $cov(x,y) = 0$.
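Independence implies zero covariance, but zero covariance on its own does not guarantee independence. The sketch below uses a hypothetical population (an assumption, not from the original post) in which $x$ is uniform on $\{-1, 0, 1\}$ and $y = x^2$: $y$ is completely determined by $x$, yet the covariance is exactly zero.

```python
from fractions import Fraction

# Hypothetical population: x is uniform on {-1, 0, 1} and y = x**2
support = [-1, 0, 1]
p = Fraction(1, 3)  # probability of each value of x

mu_x = sum(p * x for x in support)            # E[x]
mu_y = sum(p * x ** 2 for x in support)       # E[y] = E[x^2]
e_xy = sum(p * x * x ** 2 for x in support)   # E[xy] = E[x^3]

cov_xy = e_xy - mu_x * mu_y  # cov(x, y) = E[xy] - mu_x * mu_y

# Independence would require P(x=0, y=0) == P(x=0) * P(y=0)
p_x0 = p      # P(x = 0)
p_y0 = p      # P(y = 0): y is 0 only when x is 0
p_joint = p   # P(x = 0 and y = 0)
independent_at_zero = (p_joint == p_x0 * p_y0)
```

Here `cov_xy` is zero while `independent_at_zero` is `False`, since $1/3 \neq 1/9$. Covariance only captures linear association, and the relationship $y = x^2$ is purely nonlinear on this symmetric support.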

Property 5:
$$
\begin{aligned}
var(x+y) &= var(x)+var(y)+2\,cov(x,y)\\
var(x-y) &= var(x)+var(y)-2\,cov(x,y)
\end{aligned}
\tag{7}
$$
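The same identities hold for sample variances and covariances when all of them use the $n-1$ denominator. A minimal numerical check, with a hypothetical helper `sample_cov` and made-up data:

```python
from statistics import mean, variance

def sample_cov(x, y):
    """Sample covariance with the n - 1 denominator."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

# Illustrative data (an assumption, not from the source)
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

sums = [xi + yi for xi, yi in zip(x, y)]
diffs = [xi - yi for xi, yi in zip(x, y)]

# Left-hand sides: variances of the element-wise sum and difference
lhs_sum = variance(sums)
lhs_diff = variance(diffs)

# Right-hand sides: the two identities in Property 5
rhs_sum = variance(x) + variance(y) + 2 * sample_cov(x, y)
rhs_diff = variance(x) + variance(y) - 2 * sample_cov(x, y)
```

Each left-hand side matches its right-hand side up to floating-point rounding.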
Note: It turns out that $r$ is not an unbiased estimate of $\rho$. A relatively unbiased estimate of $\rho^2$ is given by the $\textbf{adjusted coefficient of determination}$ $r_{adj}^2$:
$$r_{adj}^2 = 1-\frac{(1-r^2)(n-1)}{n-2}\tag{8}$$
While $r_{adj}^2$ is a better estimate of the population coefficient of determination, especially for small values of $n$, for large values of $n$ it is easy to see that $r_{adj}^2\approx r^2$. Note too that $r_{adj}^2\leq r^2$, and while $r_{adj}^2$ can be negative, this is relatively rare.
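Equation (8) is simple to evaluate directly. The sketch below uses a hypothetical helper `adjusted_r2` and made-up inputs to show the three behaviours described above: a noticeable correction for small $n$, near-equality with $r^2$ for large $n$, and a negative value when $r^2$ is small and $n$ is small.

```python
def adjusted_r2(r2, n):
    """Adjusted coefficient of determination, equation (8)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# Illustrative inputs (assumptions, not from the source)
small_n = adjusted_r2(0.7, 4)      # small sample: sizeable downward correction
large_n = adjusted_r2(0.7, 1000)   # large sample: very close to r^2 = 0.7
negative = adjusted_r2(0.05, 5)    # weak correlation, small sample: below zero
```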
