Definition 1: The **covariance** between two sample random variables $x$ and $y$ is a measure of the linear association between the two variables, defined by the formula

$$cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\tag{1}$$
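As a minimal sketch of formula (1), the sample covariance can be computed directly from its definition. The function name `sample_covariance` and the small data set are illustrative assumptions, not part of the original text.

```python
from statistics import mean, variance


def sample_covariance(x, y):
    """Sample covariance per formula (1), dividing by n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two data pairs are required")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of products of paired deviations from the respective means
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (n - 1)


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
cov_xx = sample_covariance(x, x)  # should agree with the sample variance of x
print(cov_xy, cov_xx, variance(x))
```

Calling the function with the same variable twice gives the sample variance, which is a quick sanity check on any implementation.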
Note: Covariance is similar to variance, except that covariance is defined for two variables ($x$ and $y$ above), whereas variance is defined for only one. In fact, $cov(x,x) = var(x)$.
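Covariance can be thought of as a sum of matches and mismatches among the pairs of data elements of $x$ and $y$: a match occurs when both elements of a pair lie on the same side of their respective means; a mismatch occurs when one element of the pair is above its mean and the other is below its mean.
When the matches outweigh the mismatches, the covariance is positive; when the mismatches outweigh the matches, the covariance is negative. The magnitude of the covariance indicates the strength of the linear relationship between $x$ and $y$: the stronger the linear relationship, the larger the magnitude. The size of the covariance is also affected by the scale of the data elements, so to remove the scale factor the correlation coefficient is used as a scale-free measure of linear relationship.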
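Definition 2: The **correlation coefficient** between two sample variables $x$ and $y$ is a scale-free measure of the linear association between the two variables, given by the formula

$$r_{xy} = \frac{cov(x,y)}{s_x s_y}\tag{2}$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.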
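The following sketch computes the sample correlation coefficient from formula (2) and illustrates that rescaling one variable by a positive factor leaves the result unchanged. The function name `sample_correlation` and the data are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def sample_correlation(x, y):
    """Sample correlation per formula (2): cov(x, y) / (s_x * s_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return cov_xy / (s_x * s_y)


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
# Multiplying x by 10 changes the covariance but not the correlation
r_scaled = sample_correlation([10 * a for a in x], y)
print(r, r_scaled)
```

The function divides by $s_x s_y$, so it fails with a division by zero if either variable is constant; a production version would need to handle that case explicitly.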
We also use the term **coefficient of determination** for $r^2$.
Note: Just as we saw for the variance in **Measures of Variability**, the covariance can be calculated as

$$\begin{aligned} \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) &= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-\bar{x}\sum_{i=1}^{n}y_i-\bar{y}\sum_{i=1}^{n}x_i+n\bar{x}\bar{y}\right)\\ &= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}\right) \end{aligned}\tag{3}$$

The second step uses $\sum_{i=1}^{n}x_i = n\bar{x}$ and $\sum_{i=1}^{n}y_i = n\bar{y}$.
Therefore, the correlation coefficient can also be calculated as

$$r_{xy} = \frac{\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2-n\bar{x}^2}\sqrt{\sum_{i=1}^{n}y_i^2-n\bar{y}^2}}\tag{4}$$
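The computational form of the correlation coefficient can be sketched as follows, working only from the sums of products and squares. The name `correlation_shortcut` and the data are hypothetical. This form can lose precision through cancellation when the means are large relative to the spread of the data, so it is shown here for illustration only.

```python
from math import sqrt


def correlation_shortcut(x, y):
    """Correlation from raw sums, following the computational formula."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of x_i * y_i minus n * x_bar * y_bar
    s_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    # Denominator terms: sums of squares minus n times the squared mean
    s_xx = sum(a * a for a in x) - n * x_bar ** 2
    s_yy = sum(b * b for b in y) - n * y_bar ** 2
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation_shortcut(x, y)
print(r, r ** 2)
```

The factors $\frac{1}{n-1}$ in the covariance and in the two standard deviations cancel, which is why they do not appear in the function.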
Property 1: $-1\leq r\leq 1$.
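Note: If $r$ is close to 1, then $x$ and $y$ are positively correlated. A positive linear correlation means that high values of $x$ are associated with high values of $y$, and low values of $x$ are associated with low values of $y$.
If $r$ is close to -1, then $x$ and $y$ are negatively correlated. A negative linear correlation means that high values of $x$ are associated with low values of $y$, and low values of $x$ are associated with high values of $y$.
When $r$ is close to 0, there is little or no linear relationship between $x$ and $y$.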
Note: We have defined covariance and the correlation coefficient for data samples. We can also define covariance and the correlation coefficient for populations, based on their probability density function (pdf).
Definition 3: The **covariance** between two random variables $x$ and $y$ for a population with discrete or continuous pdf is defined by

$$cov(x,y) = E[(x-\mu_{x})(y-\mu_{y})]\tag{5}$$

where $E[\cdot]$ is the expectation function, and $\mu_x$ and $\mu_y$ are the population means of $x$ and $y$.
Definition 4: The **(Pearson's product moment) correlation coefficient** for two variables $x$ and $y$ for a population with discrete or continuous pdf is

$$\rho = \frac{cov(x,y)}{\sigma_x\sigma_y}\tag{6}$$

where $\sigma_x$ and $\sigma_y$ are the population standard deviations of $x$ and $y$.
Property 2: $-1\leq\rho\leq1$.
Property 3: $cov(x,y) = E[xy]-\mu_x\mu_y$
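The identity $cov(x,y) = E[xy]-\mu_x\mu_y$ follows from the population definition of covariance by expanding the product and using the linearity of expectation, with $E[x]=\mu_x$ and $E[y]=\mu_y$. A short worked derivation:

$$\begin{aligned} E[(x-\mu_x)(y-\mu_y)] &= E[xy-\mu_x y-\mu_y x+\mu_x\mu_y]\\ &= E[xy]-\mu_x E[y]-\mu_y E[x]+\mu_x\mu_y\\ &= E[xy]-\mu_x\mu_y-\mu_x\mu_y+\mu_x\mu_y\\ &= E[xy]-\mu_x\mu_y \end{aligned}$$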
Property 4: If $x$ and $y$ are independent, then $cov(x,y) = 0$. The converse does not hold in general: zero covariance does not imply independence.
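Zero covariance does not by itself imply independence. As an illustration, the sketch below uses a made-up discrete distribution (an assumption for this example): $x$ takes the values $-1, 0, 1$ with probability $\frac{1}{3}$ each and $y = x^2$. The covariance is exactly zero, yet $y$ is completely determined by $x$.

```python
from fractions import Fraction

# Hypothetical distribution: x is uniform on {-1, 0, 1}, and y = x**2
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E[x]
e_y = sum(p * x ** 2 for x in support)       # E[y] = E[x^2]
e_xy = sum(p * x * x ** 2 for x in support)  # E[xy] = E[x^3]

cov_xy = e_xy - e_x * e_y  # cov(x, y) = E[xy] - E[x]E[y]

# Independence would require P(x=0, y=0) == P(x=0) * P(y=0)
p_joint = p  # y = 0 exactly when x = 0
p_x0 = p     # P(x = 0)
p_y0 = p     # P(y = 0), since only x = 0 gives y = 0

print(cov_xy, p_joint, p_x0 * p_y0)
```

Exact fractions are used so that the zero covariance is not an artifact of floating-point rounding.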
Property 5:

$$\begin{aligned} var(x+y) &= var(x)+var(y)+2cov(x,y)\\ var(x-y) &= var(x)+var(y)-2cov(x,y) \end{aligned}\tag{7}$$
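The same identities hold algebraically for the sample variance and sample covariance (both dividing by $n-1$), so they can be checked numerically on a small data set. The helper `sample_covariance` and the data below are illustrative assumptions.

```python
from statistics import mean, variance


def sample_covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sums = [a + b for a, b in zip(x, y)]
diffs = [a - b for a, b in zip(x, y)]
cov_xy = sample_covariance(x, y)

# Left-hand and right-hand sides of the two identities
lhs_sum, rhs_sum = variance(sums), variance(x) + variance(y) + 2 * cov_xy
lhs_diff, rhs_diff = variance(diffs), variance(x) + variance(y) - 2 * cov_xy
print(lhs_sum, rhs_sum, lhs_diff, rhs_diff)
```

When the covariance is zero, both identities reduce to the familiar rule that the variances simply add.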
Note: It turns out that $r$ is not an unbiased estimate of $\rho$. A relatively unbiased estimate of $\rho^2$ is given by the **adjusted coefficient of determination** $r_{adj}^2$:

$$r_{adj}^2 = 1-\frac{(1-r^2)(n-1)}{n-2}\tag{8}$$
While $r_{adj}^2$ is a better estimate of the population coefficient of determination, especially for small values of $n$, for large values of $n$ it is easy to see that $r_{adj}^2\approx r^2$. Note too that $r_{adj}^2\leq r^2$, and while $r_{adj}^2$ can be negative, this is relatively rare.
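A minimal sketch of the adjusted coefficient of determination shows the three behaviours described above: the downward adjustment for small $n$, the near-equality with $r^2$ for large $n$, and the possibility of a negative value. The function name `adjusted_r_squared` and the input values are assumptions chosen for illustration.

```python
def adjusted_r_squared(r, n):
    """Adjusted coefficient of determination: 1 - (1 - r^2)(n - 1)/(n - 2)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return 1 - (1 - r ** 2) * (n - 1) / (n - 2)


# Small sample: the adjustment is noticeable
small = adjusted_r_squared(0.5, 5)

# Large sample: the adjusted value is very close to r^2 = 0.25
large = adjusted_r_squared(0.5, 10_000)

# Weak correlation with a small sample: the adjusted value is negative
negative = adjusted_r_squared(0.1, 5)

print(small, large, negative)
```

Because $\frac{n-1}{n-2} > 1$ for every $n > 2$, the function never returns a value above $r^2$.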