Law of Iterated Expectations (LIE)
Before discussing the variance decomposition, we first need to understand the law of iterated expectations. Given a random variable X, we can partition it into several groups according to the values of another variable Y.
Under such a partition, the overall mean of X is equal to the average, over the groups, of the within-group means:
$$\operatorname{E}[X] = \operatorname{E}\big[\operatorname{E}[X \mid Y]\big]$$
For example, suppose X is split into three equally likely groups whose means are 70, 60, and 80. Then
$$\begin{aligned}
\operatorname{E}[X] &= \operatorname{E}\big[\operatorname{E}[X\mid Y]\big] \\
&= \tfrac{1}{3}\operatorname{E}[X\mid Y=y_1] + \tfrac{1}{3}\operatorname{E}[X\mid Y=y_2] + \tfrac{1}{3}\operatorname{E}[X\mid Y=y_3] \\
&= \frac{70+60+80}{3} = 70
\end{aligned}$$
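When the groups are not equally likely, each group mean must be weighted by its probability $P(Y=y_i)$. A minimal sketch (NumPy; the group sizes below are made up for illustration) of the weighted version:

```python
import numpy as np

# Hypothetical groups of different sizes; the weights are P(Y = y_i) = |group| / n.
groups = {
    "y1": np.full(20, 70.0),  # 20 observations with mean 70
    "y2": np.full(30, 60.0),  # 30 observations with mean 60
    "y3": np.full(50, 80.0),  # 50 observations with mean 80
}

x = np.concatenate(list(groups.values()))
n = len(x)

# E[E[X|Y]]: average the group means, weighted by group probability.
lie = sum(len(g) / n * g.mean() for g in groups.values())

print(x.mean(), lie)  # both 72.0: E[X] == E[E[X|Y]]
```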
More formally,
$$\begin{aligned}
\operatorname{E}\big[\operatorname{E}[X\mid Y]\big] &= \int p(y)\int x\,p(x\mid y)\,dx\,dy \\
&= \iint x\,p(x,y)\,dx\,dy \\
&= \int x\,p(x)\,dx \\
&= \operatorname{E}[X]
\end{aligned}$$
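A quick Monte Carlo check (a sketch; the joint distribution below is an arbitrary choice) that averaging the conditional means recovers the unconditional mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint distribution: Y ~ N(0, 1), X | Y ~ N(2Y + 1, 1).
y = rng.normal(size=1_000_000)
x = rng.normal(loc=2 * y + 1, scale=1.0)

# Here E[X|Y] = 2Y + 1 is known in closed form, so E[E[X|Y]] is
# estimated by the sample mean of 2y + 1.
print(x.mean())            # ~1.0: the unconditional mean E[X]
print((2 * y + 1).mean())  # ~1.0: the Monte Carlo estimate of E[E[X|Y]]
```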
Mathematical Derivation of the Law of Total Variance
Another important rule is the law of total variance:
$$\operatorname{Var}(X) = \operatorname{E}\big[\operatorname{Var}(X\mid Y)\big] + \operatorname{Var}\big(\operatorname{E}[X\mid Y]\big)$$
It splits the variance into two components:
$$\begin{aligned}
\operatorname{E}\big[\operatorname{Var}(X\mid Y)\big] &= \operatorname{E}\big[\operatorname{E}[X^2\mid Y] - (\operatorname{E}[X\mid Y])^2\big] && \text{def. of variance} \\
&= \operatorname{E}\big[\operatorname{E}[X^2\mid Y]\big] - \operatorname{E}\big[(\operatorname{E}[X\mid Y])^2\big] && \text{linearity of expectation} \\
&= \operatorname{E}[X^2] - \operatorname{E}\big[(\operatorname{E}[X\mid Y])^2\big] && \text{law of iterated expectations}
\end{aligned}$$

$$\begin{aligned}
\operatorname{Var}\big(\operatorname{E}[X\mid Y]\big) &= \operatorname{E}\big[(\operatorname{E}[X\mid Y])^2\big] - \big(\operatorname{E}\big[\operatorname{E}[X\mid Y]\big]\big)^2 && \text{def. of variance} \\
&= \operatorname{E}\big[(\operatorname{E}[X\mid Y])^2\big] - (\operatorname{E}[X])^2 && \text{law of iterated expectations}
\end{aligned}$$

$$\therefore\ \operatorname{E}\big[\operatorname{Var}(X\mid Y)\big] + \operatorname{Var}\big(\operatorname{E}[X\mid Y]\big) = \operatorname{E}[X^2] - (\operatorname{E}[X])^2 = \operatorname{Var}(X)$$
How should we interpret this?
- What is $\operatorname{E}[\operatorname{Var}(X\mid Y)]$? Intuitively, it is the average of the within-group variances, so it measures the average variation *within* groups.
- What is $\operatorname{Var}(\operatorname{E}[X\mid Y])$? It measures how much the group means differ from one another, so it captures the variation *between* groups.
The total variance is therefore the sum of within-group and between-group variation. This is the law of total variance.
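A small numerical sketch (NumPy; the three groups below are arbitrary) verifying the decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three groups with different means, spreads, and sizes.
groups = [rng.normal(70, 5, 200), rng.normal(60, 8, 300), rng.normal(80, 3, 500)]
x = np.concatenate(groups)
weights = np.array([len(g) for g in groups]) / len(x)  # P(Y = y_i)

within = np.sum(weights * np.array([g.var() for g in groups]))  # E[Var(X|Y)]
means = np.array([g.mean() for g in groups])
between = np.sum(weights * (means - x.mean()) ** 2)             # Var(E[X|Y])

print(x.var(), within + between)  # the two numbers agree exactly
```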
Connection to k-means Clustering
Readers familiar with clustering may recognize that k-means admits two equivalent formulations of its objective. One is minimizing the within-cluster sum of squares (WCSS):
$$\underset{\mathbf{S}}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf{x}\in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2 = \underset{\mathbf{S}}{\operatorname{arg\,min}} \sum_{i=1}^{k} |S_i| \operatorname{Var} S_i$$
The other is maximizing the between-cluster sum of squares (BCSS):
$$\underset{\mathbf{S}}{\operatorname{arg\,max}} \sum_{i=1}^{k} |S_i|\, \lVert \overline{\mathbf{x}} - \boldsymbol{\mu}_i \rVert^2$$
Clearly, these correspond (up to a factor of $n$) to $\operatorname{E}[\operatorname{Var}(X\mid Y)]$ and $\operatorname{Var}(\operatorname{E}[X\mid Y])$, respectively. Because their sum is a constant (the total variance), the law of total variance implies that minimizing the former is equivalent to maximizing the latter.
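A minimal sketch (NumPy; random data with an arbitrary 3-way labeling) showing that WCSS + BCSS equals the total sum of squares for any partition, which is why the two objectives are equivalent:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))          # data points
labels = rng.integers(0, 3, size=500)  # an arbitrary 3-way partition

overall_mean = X.mean(axis=0)
total_ss = np.sum((X - overall_mean) ** 2)

wcss = bcss = 0.0
for i in range(3):
    cluster = X[labels == i]
    mu = cluster.mean(axis=0)
    wcss += np.sum((cluster - mu) ** 2)                      # within-cluster SS
    bcss += len(cluster) * np.sum((mu - overall_mean) ** 2)  # between-cluster SS

print(total_ss, wcss + bcss)  # equal for any labeling
```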
Connection to Least Squares
Least squares amounts to searching for the optimal function $f$ that minimizes the expected squared error:
$$\begin{aligned}
\operatorname{E}\big[(Y-f(X))^2\big] &= \operatorname{E}\big[\big(Y-\operatorname{E}(Y\mid X) + \operatorname{E}(Y\mid X)-f(X)\big)^2\big] \\
&= \operatorname{E}\Big[\operatorname{E}\big\{\big(Y-\operatorname{E}(Y\mid X) + \operatorname{E}(Y\mid X)-f(X)\big)^2 \,\big|\, X\big\}\Big] \\
&= \operatorname{E}\Big[\operatorname{E}\big\{\big(Y-\operatorname{E}(Y\mid X)\big)^2 + \big(\operatorname{E}(Y\mid X)-f(X)\big)^2 + 2\big(Y-\operatorname{E}(Y\mid X)\big)\big(\operatorname{E}(Y\mid X)-f(X)\big) \,\big|\, X\big\}\Big] \\
&= \operatorname{E}\big[\operatorname{Var}(Y\mid X)\big] + \operatorname{E}\big[\big(\operatorname{E}(Y\mid X)-f(X)\big)^2\big] + 2\,\operatorname{E}\Big[\big(\operatorname{E}(Y\mid X)-f(X)\big)\underbrace{\operatorname{E}\big\{Y-\operatorname{E}(Y\mid X)\,\big|\,X\big\}}_{=\,0}\Big] \\
&= \operatorname{E}\big[\operatorname{Var}(Y\mid X)\big] + \operatorname{E}\big[\big(\operatorname{E}(Y\mid X)-f(X)\big)^2\big],
\end{aligned}$$

in which the cross term vanishes because, conditional on $X$, $\operatorname{E}(Y\mid X)-f(X)$ is a constant,
and where
$$\begin{aligned}
\operatorname{Var}(Y\mid X) &= \operatorname{E}\Big(\big(Y-\operatorname{E}(Y\mid X)\big)^2 \,\Big|\, X\Big) \\
&= \operatorname{E}\Big(Y^2 - 2Y\operatorname{E}(Y\mid X) + \operatorname{E}(Y\mid X)^2 \,\Big|\, X\Big) \\
&= \operatorname{E}\big(Y^2\mid X\big) - 2\operatorname{E}(Y\mid X)\operatorname{E}(Y\mid X) + \operatorname{E}(Y\mid X)^2 \\
&= \operatorname{E}\big[Y^2\mid X\big] - \big(\operatorname{E}[Y\mid X]\big)^2
\end{aligned}$$
Notice that when $f = \operatorname{E}(Y\mid X)$, the second term on the right-hand side vanishes, so the conditional expectation is the optimal $f$:
$$\operatorname{E}\big(\big(Y - \operatorname{E}[Y\mid X]\big)^2\big) = \operatorname{E}\big[\operatorname{Var}[Y\mid X]\big]$$
Intuitively, then, regression can also be understood as a method that minimizes the within-group variation.
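A short sketch (NumPy; a synthetic model in which the true conditional mean $\operatorname{E}(Y\mid X)=\sin X$ is known) showing that the conditional mean attains the minimal MSE, namely $\operatorname{E}[\operatorname{Var}(Y\mid X)]$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic model: Y = sin(X) + noise, so E[Y|X] = sin(X) and Var(Y|X) = 0.25.
x = rng.uniform(0, 2 * np.pi, 100_000)
y = np.sin(x) + rng.normal(0, 0.5, size=x.shape)

def mse(f):
    """Expected squared error E[(Y - f(X))^2], estimated by the sample mean."""
    return np.mean((y - f(x)) ** 2)

print(mse(np.sin))                      # ~0.25 = E[Var(Y|X)], the minimum
print(mse(lambda t: np.zeros_like(t)))  # constant predictor: ~0.75
print(mse(lambda t: 0.5 * np.sin(t)))   # shrunken estimate: ~0.375
```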