Machine Learning Whiteboard Derivations (shuhuai008)

Reference: shuhuai008's machine learning whiteboard derivation video series.

Preface

The data:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}_{N\times p}$$

where $p$ is the number of features and $N$ is the number of samples.

In this field there are two major schools of thought: the frequentist school and the Bayesian school.

Frequentists regard $\theta$ as an unknown constant and the data $X$ as a random variable (r.v.), $x \sim p(x\mid\theta)$.
Example: maximum likelihood estimation (MLE):

$$\theta_{MLE} = \arg\max_{\theta}\log P(X\mid\theta)$$

Bayesians regard the parameter $\theta$ itself as a random variable with a prior distribution, $\theta\sim p(\theta)$.
Example: maximum a posteriori (MAP) estimation:

$$P(\theta\mid X) = \frac{P(X\mid\theta)P(\theta)}{P(X)}$$

where $P(\theta\mid X)$ is the posterior, $P(X\mid\theta)$ the likelihood, $p(\theta)$ the prior, and $P(X) = \int_{\theta} P(X\mid\theta)P(\theta)\,\mathrm{d}\theta$.

$$\theta_{MAP} = \arg\max_{\theta} P(X\mid\theta)P(\theta)$$

Bayesian estimation:

$$P(\theta\mid X) = \frac{P(X\mid\theta)P(\theta)}{\int_{\theta} P(X\mid\theta)P(\theta)\,\mathrm{d}\theta}$$

Bayesian prediction, where $\tilde{x}$ is the new data point and $\theta$ acts as an intermediate variable:

$$P(\tilde{x}\mid X) = \int_{\theta} P(\tilde{x},\theta\mid X)\,\mathrm{d}\theta = \int_{\theta} P(\tilde{x}\mid\theta)\,P(\theta\mid X)\,\mathrm{d}\theta$$
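To make the MLE/MAP contrast concrete, here is a minimal numeric sketch (not from the original lecture) for a one-dimensional Gaussian with known variance $\sigma^2=1$, assuming a toy dataset and a zero-mean Gaussian prior $\mu\sim\mathcal N(0,\tau^2)$; the data and the hyperparameter $\tau^2$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)  # toy data; sigma = 1 assumed known
n = len(x)

# MLE: arg max_mu log P(X | mu)  ->  the sample mean
mu_mle = x.mean()

# MAP with prior mu ~ N(0, tau^2): arg max_mu [log P(X | mu) + log p(mu)].
# For Gaussian likelihood + Gaussian prior the posterior mode has the
# closed form sum(x) / (n + sigma^2 / tau^2); here sigma^2 = 1.
tau2 = 1.0
mu_map = x.sum() / (n + 1.0 / tau2)

print(mu_mle, mu_map)  # MAP is shrunk from the MLE toward the prior mean 0
```

As $N$ grows the prior's influence vanishes and the two estimates coincide.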

Frequentist view → statistical machine learning → an optimization problem built from:

  1. model
  2. loss function
  3. algorithm

Bayesian view → probabilistic graphical models → inference as integration → MCMC

I. Probability Fundamentals

1. The One-Dimensional Gaussian Distribution

$$X = (x_1, x_2, \dots, x_N)^{\top} = \begin{pmatrix} x_1^{\top} \\ x_2^{\top} \\ \vdots \\ x_N^{\top} \end{pmatrix}, \qquad x_i\in\mathbb{R}^{p}$$

$$x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu,\Sigma)$$

where i.i.d. stands for independent and identically distributed.

For $p=1$ the parameters are $\theta = (\mu,\sigma^2)$, and the one-dimensional Gaussian density is

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$

Maximum likelihood estimation:

$$\theta_{MLE} = \arg\max_{\theta} P(X\mid\theta)$$

The log-likelihood:

$$\begin{aligned}\log P(X\mid\theta) &= \log\prod_{i=1}^{N} P(x_i\mid\theta) = \sum_{i=1}^{N}\log P(x_i\mid\theta)\\ &= \sum_{i=1}^{N}\log\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\\ &= \sum_{i=1}^{N}\left[\log\frac{1}{\sqrt{2\pi}} + \log\frac{1}{\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2}\right]\end{aligned}$$
Maximizing the log-likelihood over $\mu$:

$$\begin{aligned}\mu_{MLE} &= \arg\max_{\mu}\log P(X\mid\theta)\\ &= \arg\max_{\mu}\sum_{i=1}^{N} -\frac{(x_i-\mu)^2}{2\sigma^2}\\ &= \arg\min_{\mu}\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}\end{aligned}$$

Setting the derivative with respect to $\mu$ to zero:

$$\frac{\partial}{\partial\mu}\sum_{i=1}^{N}(x_i-\mu)^2 = \sum_{i=1}^{N} 2(x_i-\mu)\cdot(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{N}(x_i-\mu) = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N}x_i$$
Taking the expectation of $\mu_{MLE}$:

$$E[\mu_{MLE}] = \frac{1}{N}\sum_{i=1}^{N}E[x_i] = \frac{1}{N}\sum_{i=1}^{N}\mu = \mu$$

so $\mu_{MLE}$ is unbiased.
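As a sanity check (a sketch, not part of the derivation), one can confirm numerically that the sample mean is where the log-likelihood peaks, e.g. by minimizing the negative log-likelihood with `scipy.optimize.minimize_scalar`; the data here are synthetic:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=100)  # toy data, sigma fixed at 2

# negative log-likelihood as a function of mu (additive constants dropped)
def nll(mu, sigma=2.0):
    return np.sum((x - mu) ** 2) / (2 * sigma**2)

res = minimize_scalar(nll)
print(res.x, x.mean())  # the numerical maximizer matches the sample mean
```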

Solving for $\sigma_{MLE}$ (with $\mu$ fixed at $\mu_{MLE}$):

$$\sigma_{MLE} = \arg\max_{\sigma}\log P(X\mid\theta) = \arg\max_{\sigma}\sum_{i=1}^{N}\left(-\log\sigma - \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Let $L(\sigma) = \sum_{i=1}^{N}\left(-\log\sigma - \frac{(x_i-\mu)^2}{2\sigma^2}\right)$ and set its derivative to zero:

$$\frac{\partial L}{\partial\sigma} = \sum_{i=1}^{N}\left[-\frac{1}{\sigma} + (x_i-\mu)^2\,\sigma^{-3}\right] = 0$$

Multiplying through by $\sigma^3$:

$$\sum_{i=1}^{N}\left[-\sigma^2 + (x_i-\mu)^2\right] = 0 \;\Rightarrow\; N\sigma^2 = \sum_{i=1}^{N}(x_i-\mu)^2$$

$$\sigma_{MLE}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{MLE})^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu_{MLE}^2$$
Taking the expectation of $\sigma_{MLE}^2$ (using $E[\mu_{MLE}] = \mu$, so that $E[\mu_{MLE}^2] - \mu^2 = E[\mu_{MLE}^2] - E^2[\mu_{MLE}] = Var(\mu_{MLE})$):

$$\begin{aligned} E[\sigma_{MLE}^2] &= E\left[\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu_{MLE}^2\right]\\ &= E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i^2-\mu^2) - (\mu_{MLE}^2-\mu^2)\right]\\ &= \frac{1}{N}\sum_{i=1}^{N}\left(E[x_i^2]-\mu^2\right) - \left(E[\mu_{MLE}^2]-\mu^2\right)\\ &= \frac{1}{N}\sum_{i=1}^{N}Var(x_i) - \left(E[\mu_{MLE}^2]-E^2[\mu_{MLE}]\right)\\ &= \frac{1}{N}\sum_{i=1}^{N}Var(x_i) - Var(\mu_{MLE})\\ &= \sigma^2 - \frac{1}{N}\sigma^2 \end{aligned}$$

where

$$Var(\mu_{MLE}) = Var\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right) = \frac{1}{N^2}\sum_{i=1}^{N}Var(x_i) = \frac{\sigma^2}{N}$$

Therefore $E[\sigma_{MLE}^2] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$, so $\sigma_{MLE}^2$ is a biased estimator (the unbiased version divides by $N-1$ instead of $N$).
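The bias factor $\frac{N-1}{N}$ is easy to see empirically; the following sketch (illustrative values) averages $\sigma_{MLE}^2$ over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2, trials = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_mle = samples.mean(axis=1, keepdims=True)
var_mle = ((samples - mu_mle) ** 2).mean(axis=1)  # biased MLE of sigma^2

print(var_mle.mean())        # ~= 3.6
print((N - 1) / N * sigma2)  # (N-1)/N * sigma^2 = 3.6
```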

In summary, checking each maximum likelihood estimator against its expectation: $E[\mu_{MLE}] = \mu$, so $\mu_{MLE}$ is unbiased, while $E[\sigma_{MLE}^2] = \frac{N-1}{N}\sigma^2$, so $\sigma_{MLE}^2$ is biased.

2. The Multivariate Gaussian Distribution

$$p(x) = \frac{1}{(2\pi)^{p/2}\left|\Sigma\right|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$

Here $x\in\mathbb{R}^{p}$ is a random variable, with

$$x = \begin{pmatrix}x_1\\x_2\\\vdots\\x_p\end{pmatrix},\quad \mu = \begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_p\end{pmatrix},\quad \Sigma = \begin{pmatrix}\sigma_{11}&\sigma_{12}&\cdots&\sigma_{1p}\\\sigma_{21}&\sigma_{22}&\cdots&\sigma_{2p}\\\vdots&\vdots&\ddots&\vdots\\\sigma_{p1}&\sigma_{p2}&\cdots&\sigma_{pp}\end{pmatrix}_{p\times p}$$

$\Sigma$ is assumed positive definite here (in general it is positive semi-definite).

The quadratic form $(x-\mu)^{\top}\Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance between $x$ and $\mu$. When $\Sigma = I$ it reduces to the squared Euclidean distance.

Eigendecomposition of $\Sigma$: let

$$U = (u_1,u_2,\dots,u_p)_{p\times p},\qquad UU^{\top} = U^{\top}U = I,\qquad \Lambda = \mathrm{diag}(\lambda_i),\ i=1,2,\dots,p$$

Then

$$\begin{aligned}\Sigma &= U\Lambda U^{\top} = \begin{pmatrix}u_1&u_2&\cdots&u_p\end{pmatrix}\begin{pmatrix}\lambda_1&&&\\&\lambda_2&&\\&&\ddots&\\&&&\lambda_p\end{pmatrix}\begin{pmatrix}u_1^{\top}\\u_2^{\top}\\\vdots\\u_p^{\top}\end{pmatrix}\\ &= \begin{pmatrix}\lambda_1 u_1&\lambda_2 u_2&\cdots&\lambda_p u_p\end{pmatrix}\begin{pmatrix}u_1^{\top}\\u_2^{\top}\\\vdots\\u_p^{\top}\end{pmatrix} = \sum_{i=1}^{p}u_i\lambda_i u_i^{\top}\end{aligned}$$

$$\Sigma^{-1} = (U\Lambda U^{\top})^{-1} = (U^{\top})^{-1}\Lambda^{-1}U^{-1} = U\Lambda^{-1}U^{\top} = \sum_{i=1}^{p}u_i\frac{1}{\lambda_i}u_i^{\top}$$

Defining the scalar $y_i = u_i^{\top}(x-\mu)$ (the projection of $x-\mu$ onto the $i$-th eigenvector):

$$(x-\mu)^{\top}\Sigma^{-1}(x-\mu) = \sum_{i=1}^{p}(x-\mu)^{\top}u_i\frac{1}{\lambda_i}u_i^{\top}(x-\mu) = \sum_{i=1}^{p}\frac{y_i^2}{\lambda_i}$$
For $p=2$, the level set $\frac{y_1^2}{\lambda_1}+\frac{y_2^2}{\lambda_2} = \text{const}$ is an ellipse: the data $x$ concentrate in elliptical contours, and the change of coordinates from $x$ to $y$ aligns the axes with the principal directions of the data. When $\lambda_1 = \lambda_2$ the contours become circles and the distribution is isotropic (no preferred direction, so no new coordinate system is needed).

$$p(x) \to \Sigma \to \Sigma^{-1} \to (x-\mu)^{\top}\Sigma^{-1}(x-\mu) = \sum_{i=1}^{p}\frac{y_i^2}{\lambda_i}$$
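The identities above are easy to verify numerically; this sketch (with a randomly built positive-definite $\Sigma$, chosen purely for illustration) checks $\Sigma^{-1} = U\Lambda^{-1}U^{\top}$ and the $\sum_i y_i^2/\lambda_i$ form of the Mahalanobis term:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)  # a positive-definite covariance
mu = rng.normal(size=p)
x = rng.normal(size=p)

# eigendecomposition Sigma = U Lambda U^T (eigh handles symmetric matrices)
lam, U = np.linalg.eigh(Sigma)

# Sigma^{-1} = U Lambda^{-1} U^T
Sigma_inv = U @ np.diag(1.0 / lam) @ U.T
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))  # True

# Mahalanobis quadratic form two ways: directly, and via y_i = u_i^T (x - mu)
d = x - mu
y = U.T @ d
print(d @ Sigma_inv @ d, np.sum(y**2 / lam))  # equal up to float error
```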

3. Limitations

The full Gaussian model has several limitations:

- Too many parameters: $\Sigma$ alone has $\frac{p(p+1)}{2}$ free parameters, quadratic in $p$ (hence simplifications such as a diagonal $\Sigma$).
- The isotropic simplification ($\lambda_1=\cdots=\lambda_p$) discards all directional structure.
- A single Gaussian is unimodal and therefore cannot represent data with multiple clusters.

4. Marginal and Conditional Distributions

$$p(x) = \frac{1}{(2\pi)^{p/2}\left|\Sigma\right|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$

with random variable $x\in\mathbb{R}^{p}$ and

$$x = \begin{pmatrix}x_1\\x_2\\\vdots\\x_p\end{pmatrix},\quad \mu = \begin{pmatrix}\mu_1\\\mu_2\\\vdots\\\mu_p\end{pmatrix},\quad \Sigma = \begin{pmatrix}\sigma_{11}&\sigma_{12}&\cdots&\sigma_{1p}\\\sigma_{21}&\sigma_{22}&\cdots&\sigma_{2p}\\\vdots&\vdots&\ddots&\vdots\\\sigma_{p1}&\sigma_{p2}&\cdots&\sigma_{pp}\end{pmatrix}_{p\times p}$$

Suppose the variable, mean, and covariance are partitioned as

$$x = \begin{pmatrix}x_a\\x_b\end{pmatrix},\quad \mu = \begin{pmatrix}\mu_a\\\mu_b\end{pmatrix},\quad \Sigma = \begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}$$

where $x_a$ is $m$-dimensional, $x_b$ is $n$-dimensional, and $m+n=p$.

Find:

$$p(x_a),\quad p(x_b\mid x_a)$$

(by symmetry the same derivation yields $p(x_b)$ and $p(x_a\mid x_b)$).


Theorem: given $x\sim\mathcal{N}(\mu,\Sigma)$ and $y = Ax+B$,

$$y\sim\mathcal{N}(A\mu+B,\ A\Sigma A^{\top})$$

since

$$E[y] = E[Ax+B] = AE[x]+B = A\mu+B$$

$$Var[y] = Var[Ax+B] = Var[Ax] = A\,Var[x]\,A^{\top} = A\Sigma A^{\top}$$

($B$ is a constant, so it contributes nothing to the variance).
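A quick simulation confirms the theorem; the particular $\mu$, $\Sigma$, $A$, $B$ below are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
B = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + B  # y = A x + B applied row-wise

print(y.mean(axis=0), A @ mu + B)  # sample mean ~= A mu + B
print(np.cov(y.T))                 # sample covariance ~= A Sigma A^T
print(A @ Sigma @ A.T)
```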


  • Find $p(x_a)$:

$$x_a = \begin{pmatrix}I_m & 0\end{pmatrix}\begin{pmatrix}x_a\\x_b\end{pmatrix}$$

$$E[x_a] = \begin{pmatrix}I_m & 0\end{pmatrix}\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix} = \mu_a$$

$$Var[x_a] = \begin{pmatrix}I_m & 0\end{pmatrix}\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\begin{pmatrix}I_m\\0\end{pmatrix} = \Sigma_{aa}$$

$$\therefore x_a\sim\mathcal{N}(\mu_a,\Sigma_{aa})$$


  • Find $p(x_{b\cdot a})$, where the auxiliary variable $x_{b\cdot a}$ is defined by

$$\begin{cases} x_{b\cdot a} = x_b - \Sigma_{ba}\Sigma_{aa}^{-1}x_a\\ \mu_{b\cdot a} = \mu_b - \Sigma_{ba}\Sigma_{aa}^{-1}\mu_a\\ \Sigma_{bb\cdot a} = \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \end{cases}$$

($\Sigma_{bb\cdot a}$ is the Schur complement of $\Sigma_{aa}$ in $\Sigma$.) In matrix form,

$$x_{b\cdot a} = \begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1} & I_n\end{pmatrix}\begin{pmatrix}x_a\\x_b\end{pmatrix} = Ax$$

$$E[x_{b\cdot a}] = \begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1} & I_n\end{pmatrix}\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix} = \mu_b - \Sigma_{ba}\Sigma_{aa}^{-1}\mu_a = \mu_{b\cdot a}$$

$$\begin{aligned}Var[x_{b\cdot a}] &= \begin{pmatrix}-\Sigma_{ba}\Sigma_{aa}^{-1} & I_n\end{pmatrix}\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\begin{pmatrix}-\Sigma_{aa}^{-1}\Sigma_{ab}\\I_n\end{pmatrix}\\ &= \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} = \Sigma_{bb\cdot a}\end{aligned}$$

$$\therefore x_{b\cdot a}\sim\mathcal{N}(\mu_{b\cdot a},\Sigma_{bb\cdot a})$$

  • Find $p(x_b\mid x_a)$. Rearranging the definition of $x_{b\cdot a}$,

$$x_b = x_{b\cdot a} + \Sigma_{ba}\Sigma_{aa}^{-1}x_a$$

A direct computation gives $Cov(x_{b\cdot a},x_a) = \Sigma_{ba} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{aa} = 0$, and for jointly Gaussian variables zero covariance implies independence; so conditioning on $x_a$ merely shifts $x_{b\cdot a}$ by the constant $\Sigma_{ba}\Sigma_{aa}^{-1}x_a$:

$$E[x_b\mid x_a] = \mu_{b\cdot a} + \Sigma_{ba}\Sigma_{aa}^{-1}x_a$$

$$Var[x_b\mid x_a] = Var[x_{b\cdot a}] = \Sigma_{bb\cdot a}$$

$$\therefore x_b\mid x_a \sim \mathcal{N}\left(\mu_{b\cdot a} + \Sigma_{ba}\Sigma_{aa}^{-1}x_a,\ \Sigma_{bb\cdot a}\right)$$

The pattern is the same each time: write the target as an affine function of a Gaussian, then read off its mean and variance.

$$x \to y = Ax+B \;\Rightarrow\; \begin{cases}E[y]\\Var[y]\end{cases} \;\Rightarrow\; y\sim\mathcal{N}(E[y],Var[y])$$

$$\begin{aligned} \binom{x_a}{x_b}\to x_a &\;\Rightarrow\; \begin{cases}E[x_a]\\Var[x_a]\end{cases} \;\Rightarrow\; x_a\sim\mathcal{N}(E[x_a],Var[x_a])\\ x_{b\cdot a} &\;\Rightarrow\; \begin{cases}E[x_{b\cdot a}]\\Var[x_{b\cdot a}]\end{cases} \;\Rightarrow\; x_{b\cdot a}\sim\mathcal{N}(E[x_{b\cdot a}],Var[x_{b\cdot a}])\\ x_b\mid x_a &\;\Rightarrow\; \begin{cases}E[x_b\mid x_a]\\Var[x_b\mid x_a]\end{cases} \;\Rightarrow\; x_b\mid x_a\sim\mathcal{N}(E[x_b\mid x_a],Var[x_b\mid x_a]) \end{aligned}$$
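The conditional formulas can be spot-checked by simulation; this sketch (toy numbers throughout) keeps samples whose $x_a$ falls in a narrow window around a chosen value and compares the empirical moments of $x_b$ with $\mu_{b\cdot a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a$ and $\Sigma_{bb\cdot a}$:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
a, b = [0], [1, 2]  # x_a = x[0] (m = 1), x_b = (x[1], x[2]) (n = 2)

Saa = Sigma[np.ix_(a, a)]; Sab = Sigma[np.ix_(a, b)]
Sba = Sigma[np.ix_(b, a)]; Sbb = Sigma[np.ix_(b, b)]

xa0 = np.array([0.8])  # condition on x_a ~= 0.8
mu_cond = mu[b] + Sba @ np.linalg.inv(Saa) @ (xa0 - mu[a])
Sigma_cond = Sbb - Sba @ np.linalg.inv(Saa) @ Sab  # Schur complement

x = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = x[np.abs(x[:, 0] - xa0[0]) < 0.05]  # samples with x_a near 0.8

print(near[:, b].mean(axis=0), mu_cond)            # ~= equal
print(np.cov(near[:, b].T), Sigma_cond, sep="\n")  # ~= equal
```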

5. Joint Distributions (the Linear Gaussian Model)

Given:

$$p(x) = \mathcal{N}(x\mid\mu,\Lambda^{-1}),\qquad p(y\mid x) = \mathcal{N}(y\mid Ax+b,\ L^{-1})$$

where $\Lambda$ and $L$ are precision matrices (inverse covariances).

Find: $p(y)$ and $p(x\mid y)$.


The model can be rewritten as

$$y = Ax + b + \varepsilon,\qquad \varepsilon\sim\mathcal{N}(0,L^{-1})$$

where $x$, $y$, $\varepsilon$ are random variables and $\varepsilon$ is independent of $x$.

$$E[y] = E[Ax+b+\varepsilon] = E[Ax+b] + E[\varepsilon] = A\mu + b$$

$$Var[y] = Var[Ax+b+\varepsilon] = Var[Ax+b] + Var[\varepsilon] = A\Lambda^{-1}A^{\top} + L^{-1}$$

(the variances add because $x\perp\varepsilon$)

$$\therefore y\sim\mathcal{N}(A\mu+b,\ A\Lambda^{-1}A^{\top}+L^{-1})$$


Stack $x$ and $y$ into $z$:

$$z = \begin{pmatrix}x\\y\end{pmatrix}\sim\mathcal{N}\left(\begin{bmatrix}\mu\\A\mu+b\end{bmatrix},\ \begin{pmatrix}\Lambda^{-1}&\Delta\\\Delta^{\top}&A\Lambda^{-1}A^{\top}+L^{-1}\end{pmatrix}\right)$$

where the top-left block is $Var[x]=\Lambda^{-1}$ and $\Delta = Cov(x,y)$ remains to be computed.

$$\begin{aligned}\Delta &= Cov(x,y) = E[(x-E[x])(y-E[y])^{\top}]\\ &= E[(x-\mu)(y-A\mu-b)^{\top}]\\ &= E[(x-\mu)(Ax+b+\varepsilon-A\mu-b)^{\top}]\\ &= E[(x-\mu)(Ax-A\mu)^{\top}] + E[(x-\mu)\varepsilon^{\top}]\\ &= E[(x-\mu)(x-\mu)^{\top}]A^{\top} + 0 \qquad (x\perp\varepsilon,\ E[\varepsilon]=0)\\ &= Var[x]\,A^{\top} = \Lambda^{-1}A^{\top}\end{aligned}$$

Substituting $\Delta$ back into the joint:

$$z = \begin{pmatrix}x\\y\end{pmatrix}\sim\mathcal{N}\left(\begin{bmatrix}\mu\\A\mu+b\end{bmatrix},\ \begin{pmatrix}\Lambda^{-1}&\Lambda^{-1}A^{\top}\\A\Lambda^{-1}&A\Lambda^{-1}A^{\top}+L^{-1}\end{pmatrix}\right)$$


$$p(x\mid y) = \frac{p(x,y)}{p(y)}$$

Applying the conditional formula from section 4 to this joint (with the roles $x_a = y$, $x_b = x$):

$$\therefore x\mid y\sim\mathcal{N}\left(\mu + \Lambda^{-1}A^{\top}\left(A\Lambda^{-1}A^{\top}+L^{-1}\right)^{-1}(y-A\mu-b),\ \Lambda^{-1} - \Lambda^{-1}A^{\top}\left(A\Lambda^{-1}A^{\top}+L^{-1}\right)^{-1}A\Lambda^{-1}\right)$$

$$x \to y = Ax+b \;\Rightarrow\; \begin{cases}E[y]\\Var[y]\end{cases} \;\Rightarrow\; y\sim\mathcal{N}(E[y],Var[y])$$

$$\begin{pmatrix}x\\y\end{pmatrix} \;\Rightarrow\; \begin{cases}\begin{pmatrix}E[x]\\E[y]\end{pmatrix}\\[6pt]\begin{pmatrix}Var[x]&\Delta\\\Delta^{\top}&Var[y]\end{pmatrix}\end{cases} \;\Rightarrow\; \begin{pmatrix}x\\y\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}E[x]\\E[y]\end{pmatrix},\begin{pmatrix}Var[x]&\Delta\\\Delta^{\top}&Var[y]\end{pmatrix}\right)$$
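Finally, the linear Gaussian model can be simulated end to end; this sketch (arbitrary toy parameters) checks the marginal $p(y)$ and the cross-covariance $\Delta = \Lambda^{-1}A^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, 2.0])
Lam_inv = np.array([[1.0, 0.3],
                    [0.3, 2.0]])  # Var[x] = Lambda^{-1}
L_inv = 0.5 * np.eye(2)           # Var[eps] = L^{-1}
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])

n = 500_000
x = rng.multivariate_normal(mu, Lam_inv, size=n)
eps = rng.multivariate_normal(np.zeros(2), L_inv, size=n)
y = x @ A.T + b + eps  # y = A x + b + eps

# marginal p(y) = N(A mu + b, A Lam^{-1} A^T + L^{-1})
print(y.mean(axis=0), A @ mu + b)
print(np.cov(y.T), A @ Lam_inv @ A.T + L_inv, sep="\n")

# cross-covariance Delta = Cov(x, y) = Lam^{-1} A^T
xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
print(xc.T @ yc / (n - 1), Lam_inv @ A.T, sep="\n")
```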
