An Intuitive Understanding of the Law of Total Variance (Variance Decomposition)

Law of Iterated Expectations (LIE)

Before discussing variance decomposition, we first need to understand the law of iterated expectations. For a random variable X, we can partition its outcomes into several groups according to the value of another variable Y.

Under such a partition, the overall mean of X equals the (probability-weighted) mean of the group means:

$$\operatorname{E}[X]=\operatorname{E}\bigl[\operatorname{E}[X\mid Y]\bigr]$$

For example, suppose the outcomes are split into three equally likely groups with means 70, 60, and 80. Then

$$\begin{aligned}
E[X] &= E[E[X\mid Y]]\\
&= \sum_{i=1}^{3} P(Y=y_i)\,E[X\mid Y=y_i]\\
&= \frac{70+60+80}{3}\\
&= 70
\end{aligned}$$

More formally,
$$\begin{aligned}
E[E[X\mid Y]] &= \int p(y)\int x\,p(x\mid y)\,dx\,dy\\
&= \iint p(x,y)\,x\,dx\,dy\\
&= \int p(x)\,x\,dx\\
&= E[X]
\end{aligned}$$
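The identity above can be checked numerically. The sketch below uses a made-up discrete example (three equally likely groups with means 70, 60, and 80, matching the example earlier): the group-frequency-weighted average of the conditional means recovers the overall mean.

```python
# Numerical check of the Law of Iterated Expectations on a
# hypothetical example: Y picks one of three equally likely groups,
# and X is drawn with a group-specific mean.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
group_means = np.array([70.0, 60.0, 80.0])

y = rng.integers(0, 3, size=n)                  # Y: group label, P(Y=i) = 1/3
x = group_means[y] + rng.normal(0, 5, size=n)   # X | Y=i ~ N(mean_i, 5^2)

# Inner expectation: the per-group sample means E[X | Y=i]
cond_means = np.array([x[y == i].mean() for i in range(3)])
# Outer expectation: average the conditional means by group frequency
lie = np.array([(y == i).mean() for i in range(3)]) @ cond_means

print(x.mean(), lie)  # both close to 70, and equal to each other
```

Note that the two quantities agree exactly (up to floating point), not just approximately: averaging within-group sample means by group frequency is algebraically the same as the overall sample mean.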

Mathematical Derivation of the Law of Total Variance

Another important rule is the law of total variance:
$$Var(X)=\operatorname{E}[Var(X\mid Y)]+Var(\operatorname{E}[X\mid Y])$$

It decomposes the variance into two components:
$$\begin{aligned}
\operatorname{E}[Var(X\mid Y)] &= \operatorname{E}\bigl[\operatorname{E}[X^{2}\mid Y]-(\operatorname{E}[X\mid Y])^{2}\bigr] && \text{def. of variance}\\
&= \operatorname{E}\bigl[\operatorname{E}[X^{2}\mid Y]\bigr]-\operatorname{E}\bigl[(\operatorname{E}[X\mid Y])^{2}\bigr] && \text{linearity of expectation}\\
&= \operatorname{E}[X^{2}]-\operatorname{E}\bigl[(\operatorname{E}[X\mid Y])^{2}\bigr] && \text{law of iterated expectations}
\end{aligned}$$

$$\begin{aligned}
Var(\operatorname{E}[X\mid Y]) &= \operatorname{E}\bigl[(\operatorname{E}[X\mid Y])^{2}\bigr]-\bigl(\operatorname{E}[\operatorname{E}[X\mid Y]]\bigr)^{2} && \text{def. of variance}\\
&= \operatorname{E}\bigl[(\operatorname{E}[X\mid Y])^{2}\bigr]-(\operatorname{E}[X])^{2} && \text{law of iterated expectations}
\end{aligned}$$

$$\therefore\ \operatorname{E}[Var(X\mid Y)]+Var(\operatorname{E}[X\mid Y])=\operatorname{E}[X^{2}]-(\operatorname{E}[X])^{2}=Var(X)$$

How should we interpret this?

  1. What is $\operatorname{E}[Var(X\mid Y)]$? Intuitively, it is the average of the variances within each group, so it measures the average within-group variation.
  2. What is $Var(\operatorname{E}[X\mid Y])$? It measures how much the group means differ from one another, so it captures the between-group variation.

Therefore, the variance is the sum of the within-group and between-group variation. This is the Law of Total Variance.
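The decomposition holds exactly on grouped data, not just in expectation. The following sketch, with made-up group sizes and distributions, computes both sides of the identity:

```python
# Verify the Law of Total Variance on grouped sample data:
# within-group and between-group terms sum to the overall variance.
# Group sizes, means, and spreads here are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(70, 5, 1000), rng.normal(60, 3, 500), rng.normal(80, 8, 1500)]
x = np.concatenate(groups)
n = len(x)

weights = np.array([len(g) / n for g in groups])   # P(Y = i)
cond_means = np.array([g.mean() for g in groups])  # E[X | Y]
cond_vars = np.array([g.var() for g in groups])    # Var(X | Y), population form

within = weights @ cond_vars                       # E[Var(X | Y)]
between = weights @ (cond_means - x.mean()) ** 2   # Var(E[X | Y])

print(x.var(), within + between)  # the two values coincide
```

Using the population variance (`ddof=0`) on both sides is what makes the identity exact; with the sample variance (`ddof=1`) the two sides differ by a small correction factor.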

Connection to k-means clustering

Readers familiar with clustering may notice that k-means has two equivalent formulations. One minimizes the within-cluster sum of squares (WCSS):
$$\underset{\mathbf{S}}{\operatorname{arg\,min}}\sum_{i=1}^{k}\sum_{\mathbf{x}\in S_i}\Vert\mathbf{x}-\boldsymbol{\mu}_i\Vert^{2}=\underset{\mathbf{S}}{\operatorname{arg\,min}}\sum_{i=1}^{k}|S_i|\operatorname{Var}S_i$$
The other maximizes the between-cluster sum of squares (BCSS):
$$\underset{\mathbf{S}}{\operatorname{arg\,max}}\sum_{i=1}^{k}|S_i|\,\Vert\overline{\mathbf{x}}-\boldsymbol{\mu}_i\Vert^{2}$$
These correspond to $\operatorname{E}[Var(X\mid Y)]$ and $Var(\operatorname{E}[X\mid Y])$, respectively. Because their sum is a constant (the total variance of the data), the law of total variance implies that minimizing the former is equivalent to maximizing the latter.
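The constant-sum claim can be demonstrated directly: for any partition of the data, even a random one produced without running k-means, WCSS + BCSS equals the total sum of squares. A sketch with made-up data:

```python
# For any partition of a dataset, WCSS + BCSS equals the total sum of
# squares (TSS), so minimizing WCSS and maximizing BCSS are equivalent.
# Random data and a random partition, for illustration only.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
labels = rng.integers(0, 3, size=300)  # an arbitrary 3-way partition

overall = X.mean(axis=0)
tss = ((X - overall) ** 2).sum()

wcss = bcss = 0.0
for i in range(3):
    cluster = X[labels == i]
    mu = cluster.mean(axis=0)
    wcss += ((cluster - mu) ** 2).sum()                 # within-cluster term
    bcss += len(cluster) * ((mu - overall) ** 2).sum()  # between-cluster term

print(tss, wcss + bcss)  # equal, regardless of the partition
```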

Connection to least squares

Least squares searches for the function $f$ that minimizes $\operatorname{E}[(Y-f(X))^{2}]$:
$$\begin{aligned}
\operatorname{E}[(Y-f(X))^{2}] &= \operatorname{E}\bigl[(Y-\operatorname{E}(Y\mid X)+\operatorname{E}(Y\mid X)-f(X))^{2}\bigr]\\
&= \operatorname{E}\Bigl[\operatorname{E}\bigl\{(Y-\operatorname{E}(Y\mid X)+\operatorname{E}(Y\mid X)-f(X))^{2}\mid X\bigr\}\Bigr]\\
&= \operatorname{E}\Bigl[\operatorname{E}\bigl\{(Y-\operatorname{E}(Y\mid X))^{2}+(\operatorname{E}(Y\mid X)-f(X))^{2}+2(Y-\operatorname{E}(Y\mid X))(\operatorname{E}(Y\mid X)-f(X))\mid X\bigr\}\Bigr]\\
&= \operatorname{E}[\operatorname{Var}(Y\mid X)]+\operatorname{E}\bigl[(\operatorname{E}(Y\mid X)-f(X))^{2}\bigr]+2\,\operatorname{E}\bigl[(\operatorname{E}[Y\mid X]-\operatorname{E}(Y\mid X))(\operatorname{E}(Y\mid X)-f(X))\bigr]\\
&= \operatorname{E}[\operatorname{Var}(Y\mid X)]+\operatorname{E}\bigl[(\operatorname{E}(Y\mid X)-f(X))^{2}\bigr]\,.
\end{aligned}$$
(The cross term vanishes because, conditional on $X$, the factor $\operatorname{E}(Y\mid X)-f(X)$ is constant and $\operatorname{E}[Y-\operatorname{E}(Y\mid X)\mid X]=0$.) Here
$$\begin{aligned}
\operatorname{Var}(Y\mid X) &= \operatorname{E}\bigl((Y-\operatorname{E}(Y\mid X))^{2}\mid X\bigr)\\
&= \operatorname{E}\bigl(Y^{2}-2Y\operatorname{E}(Y\mid X)+\operatorname{E}(Y\mid X)^{2}\mid X\bigr)\\
&= \operatorname{E}(Y^{2}\mid X)-2\operatorname{E}(Y\mid X)\operatorname{E}(Y\mid X)+\operatorname{E}(Y\mid X)^{2}\\
&= \operatorname{E}[Y^{2}\mid X]-(\operatorname{E}[Y\mid X])^{2}
\end{aligned}$$

Notice that when $f=\operatorname{E}(Y\mid X)$, the second term vanishes, so the conditional expectation is the optimal $f$:
$$E\bigl((Y-E[Y\mid X])^{2}\bigr)=E[Var(Y\mid X)]$$
In this sense, regression can also be intuitively understood as a method that minimizes within-group variation.
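This optimality is easy to see empirically: on simulated data where the true conditional mean is known, the conditional mean achieves a lower mean squared error than any other candidate predictor, and that minimum is close to $E[Var(Y\mid X)]$. The data-generating process below is a made-up example for illustration:

```python
# Among candidate predictors f, the conditional mean E[Y|X] attains the
# smallest mean squared error, and the minimum equals E[Var(Y|X)].
# Hypothetical setup: Y = X^2 + noise, so E[Y|X=x] = x^2, Var(Y|X) = 0.09.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.uniform(-1, 1, n)
y = x ** 2 + rng.normal(0, 0.3, n)

def mse(f):
    """Empirical mean squared error of predictor f."""
    return np.mean((y - f(x)) ** 2)

best = mse(lambda t: t ** 2)  # the true conditional mean
others = [
    mse(lambda t: t),                 # a linear guess
    mse(lambda t: 0.5 * t ** 2),      # a shrunken guess
    mse(lambda t: np.zeros_like(t)),  # the zero predictor
]

print(best)  # close to E[Var(Y|X)] = 0.3^2 = 0.09
```

Any predictor other than the conditional mean pays an extra penalty of $\operatorname{E}[(\operatorname{E}(Y\mid X)-f(X))^{2}]$ on top of the irreducible term $\operatorname{E}[\operatorname{Var}(Y\mid X)]$.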

References

A mathematical derivation of the Law of Total Variance
