Preface
Data:
$$X = \begin{pmatrix} x_{11} &x_{12} & \cdots &x_{1p} \\ x_{21} & x_{22} &\cdots & x_{2p} \\ \vdots & \vdots & \ddots &\vdots \\ x_{N1} &x_{N2} & \cdots &x_{Np} \end{pmatrix}_{N\times p}$$
where $p$ is the number of features and $N$ is the number of samples.
In this landscape there are two major schools of thought: the frequentist school and the Bayesian school.
Frequentists regard $\theta$ as an unknown constant and the data $X$ as a random variable (r.v.), with $x\sim p(x \mid \theta)$.
For example, maximum likelihood estimation (MLE):
$$\theta_{MLE}=\arg \max_{\theta } \log P(X\mid \theta)$$
Bayesians instead regard the parameter $\theta$ as a r.v. with a prior distribution $\theta \sim p(\theta)$.
For example, maximum a posteriori estimation (MAP):
$$P(\theta \mid X)=\frac{P(X\mid \theta)P(\theta)}{P(X)}$$
Here $P(\theta \mid X)$ is the posterior, $P(X\mid \theta)$ the likelihood, $p(\theta)$ the prior, and $P(X)=\int_{\theta} P(X\mid \theta)P(\theta)\,\mathrm{d}\theta$.
$$\theta_{MAP}=\arg \max_{\theta} P(X\mid \theta)P(\theta)$$
Bayesian estimation:
$$P(\theta \mid X)=\frac{P(X\mid \theta)P(\theta)}{\int_{\theta} P(X\mid \theta)P(\theta)\,\mathrm{d}\theta}$$
Bayesian prediction: with $\tilde{x}$ the point to predict and $\theta$ an intermediate variable,
$$\begin{aligned}P(\tilde{x} \mid X)&=\int_{\theta} P(\tilde{x}, \theta \mid X)\,\mathrm{d}\theta\\ &=\int_{\theta} P(\tilde{x}\mid \theta )P(\theta \mid X)\,\mathrm{d}\theta \end{aligned}$$
Frequentist view > statistical machine learning > optimization:
- model
- loss function
- algorithm

Bayesian view > probabilistic graphical models > computing integrals > MCMC
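The contrast between MLE and MAP can be made concrete on Bernoulli (coin-flip) data with a Beta prior. Everything below is an illustrative assumption (the data, the Beta(a, b) prior, and its hyperparameters), and the well-known closed-form maximizers stand in for the arg max:

```python
import numpy as np

# Illustrative sketch: MLE vs. MAP for Bernoulli data with an assumed
# Beta(a, b) prior. Both estimators have closed forms here, so no
# numerical optimization is needed.
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)   # 20 coin flips, assumed true theta = 0.7
a, b = 2.0, 2.0                     # hypothetical prior hyperparameters

# theta_MLE = argmax_theta log P(X | theta)  ->  sample mean
theta_mle = X.mean()
# theta_MAP = argmax_theta P(X | theta) P(theta)  ->  posterior mode
theta_map = (X.sum() + a - 1) / (len(X) + a + b - 2)

print(theta_mle, theta_map)
```

With $a=b=2$ the prior mode is 0.5, so the MAP estimate is pulled toward 0.5 relative to the MLE; as $N$ grows the two estimates coincide.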
I. Probability Basics
1. One-dimensional Gaussian distribution
$$X = (x_1,x_2,\dots ,x_N)^{\top}= \begin{pmatrix} x_1^{\top } \\ x_2^{\top } \\ \vdots \\ x_N^{\top } \end{pmatrix},\qquad x_i \in \mathbb{R}^{p}$$
$$x_i \stackrel{\text{i.i.d.}}{\sim } \mathcal N(\mu,\Sigma)$$
i.i.d.: independent and identically distributed.
Let $p=1$ and $\theta=(\mu,\sigma^2)$.
The one-dimensional Gaussian density:
$$p(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp \left \{ -\frac{(x-\mu)^2}{2\sigma^2} \right \}$$
Maximum likelihood estimation:
$$\theta_{MLE}=\arg \max_{\theta }P(X\mid \theta)$$
Log-likelihood:
$$\begin{aligned} \log P(X\mid \theta) &=\log \prod_{i=1}^{N} P(x_i\mid \theta)=\sum_{i=1}^{N}\log P(x_i\mid \theta)\\ &=\sum_{i=1}^{N}\log \frac{1}{\sqrt{2\pi}\sigma}\exp \left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\ &=\sum_{i=1}^{N}\left [ \log \frac{1}{\sqrt{2\pi}}+\log \frac{1}{\sigma} -\frac{(x_i-\mu)^2}{2\sigma^2} \right ] \end{aligned}$$
Solving for $\mu_{MLE}$:
$$\begin{aligned}\mu_{MLE}&=\arg \max_{\mu }\log P(X\mid \theta)=\arg \max_{\mu } \sum_{i=1}^{N}-\frac{(x_i-\mu)^2}{2\sigma^2}\\ &=\arg \min_{\mu } \sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2} \end{aligned}$$
Setting the derivative with respect to $\mu$ to zero:
$$\begin{aligned} \frac{\partial }{\partial \mu}\sum_{i=1}^{N} (x_i-\mu)^2&=\sum_{i=1}^{N}2\cdot (x_i-\mu)\cdot(-1)=0\\ &\Rightarrow \sum_{i=1}^{N}(x_i-\mu)=0\\ &\Rightarrow \mu_{MLE}=\frac{1}{N}\sum_{i=1}^{N}x_i \end{aligned}$$
Taking the expectation of $\mu_{MLE}$:
$$E[\mu_{MLE}]=\frac{1}{N}\sum_{i=1}^{N}E[x_i]=\frac{1}{N}\sum_{i=1}^{N}\mu=\mu$$
Hence $\mu_{MLE}$ is an unbiased estimator.
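The unbiasedness of $\mu_{MLE}$ can also be seen numerically. A Monte Carlo illustration (not a proof; all constants are arbitrary choices):

```python
import numpy as np

# Average the sample mean (the MLE of mu) over many repeated datasets:
# the average approaches the true mu, illustrating E[mu_MLE] = mu.
rng = np.random.default_rng(0)
mu, sigma, N, trials = 3.0, 2.0, 10, 20000
samples = rng.normal(mu, sigma, size=(trials, N))
mu_mle = samples.mean(axis=1)       # one MLE per simulated dataset
print(mu_mle.mean())                # close to mu = 3.0
```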
Solving for $\sigma_{MLE}$:
$$\begin{aligned} \sigma_{MLE}&=\arg \max_{\sigma }\log P(X\mid \theta)\\ &=\arg \max_{\sigma }\sum_{i=1}^{N}\left[-\log \sigma -\frac{(x_i-\mu)^2}{2\sigma^2}\right] \end{aligned}$$
Setting the derivative with respect to $\sigma$ to zero. Let $L(\sigma)=\sum_{i=1}^{N}\left[-\log \sigma -\frac{(x_i-\mu)^2}{2\sigma^2}\right]$:
$$\begin{aligned} \frac{\partial L}{\partial \sigma}&=\sum_{i=1}^{N}\left[-\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2}\cdot (-2)\cdot \sigma^{-3}\right]=\sum_{i=1}^{N}\left[-\frac{1}{\sigma}+(x_i-\mu)^2\sigma^{-3}\right]=0\\ &\Rightarrow \sum_{i=1}^{N}\left[-\sigma^2+(x_i-\mu)^2\right]=0 \quad (\text{multiplying by } \sigma^3)\\ &\Rightarrow \sum_{i=1}^{N}\sigma^2=\sum_{i=1}^{N}(x_i-\mu)^2 \end{aligned}$$
Plugging in $\mu_{MLE}$:
$$\sigma_{MLE}^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{MLE})^2=\frac{1}{N}\sum_{i=1}^{N}x_i^2-\mu_{MLE}^2$$
Taking the expectation of $\sigma_{MLE}^2$:
$$\begin{aligned} E[\sigma_{MLE}^2]&=E\left[\frac{1}{N}\sum_{i=1}^{N}x_i^2-\mu_{MLE}^2\right]\\ &=E\left[\frac{1}{N}\sum_{i=1}^{N}(x_i^2-\mu^2)-(\mu_{MLE}^2-\mu^2)\right]\\ &=\frac{1}{N}\sum_{i=1}^{N}(E[x_i^2]-\mu^2)-(E[\mu_{MLE}^2]-\mu^2)\\ &=\frac{1}{N}\sum_{i=1}^{N}Var(x_i)-(E[\mu_{MLE}^2]-E^2[\mu_{MLE}])\\ &=\frac{1}{N}\sum_{i=1}^{N}Var(x_i)-Var(\mu_{MLE})\\ &=\sigma^2-\frac{1}{N}\sigma^2 \end{aligned}$$
where the step $\mu^2=E^2[\mu_{MLE}]$ uses the unbiasedness of $\mu_{MLE}$, and
$$Var(\mu_{MLE})=Var\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)=\frac{1}{N^2}\sum_{i=1}^{N}Var(x_i)=\frac{1}{N^2}\sum_{i=1}^{N}\sigma^2=\frac{1}{N}\sigma^2$$
Therefore $E[\sigma_{MLE}^2]=\frac{N-1}{N}\sigma^2$: the MLE of the variance is biased (it systematically underestimates $\sigma^2$).
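The bias factor $\frac{N-1}{N}$ shows up clearly in simulation. A sketch with arbitrary constants:

```python
import numpy as np

# Average the MLE variance (ddof=0) over many datasets: it converges to
# (N-1)/N * sigma^2, not sigma^2. Dividing by N-1 (ddof=1) removes the bias.
rng = np.random.default_rng(0)
sigma2, N, trials = 4.0, 5, 50000
X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
var_mle = X.var(axis=1, ddof=0).mean()       # ~ (N-1)/N * sigma2 = 3.2
var_unbiased = X.var(axis=1, ddof=1).mean()  # ~ sigma2 = 4.0
print(var_mle, var_unbiased)
```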
Summary of maximum likelihood for the Gaussian:
$$\mu_{MLE}\to E[\mu_{MLE}]=\mu \quad(\text{unbiased}),\qquad \sigma_{MLE}^2 \to E[\sigma_{MLE}^2]=\frac{N-1}{N}\sigma^2 \quad(\text{biased})$$
2. Multivariate Gaussian distribution
$$p(x)=\frac{1}{(2\pi)^{p/2}\left| \Sigma\right|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$
where $x\in \mathbb{R}^{p}$ is a r.v. and
$$x=\begin{pmatrix}x_1 \\x_2 \\\vdots \\x_p\end{pmatrix},\quad \mu=\begin{pmatrix}\mu_1 \\\mu_2 \\\vdots \\\mu_p\end{pmatrix},\quad \Sigma= \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots &\sigma_{1p} \\ \sigma_{21} & \sigma_{22} &\cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots &\vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots &\sigma_{pp} \end{pmatrix}_{p\times p}$$
$\Sigma$ must be positive definite for the density to exist; in practice it is often only guaranteed to be positive semi-definite.
$(x-\mu)^{\top}\Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance between $x$ and $\mu$; when $\Sigma=I$ it reduces to the squared Euclidean distance.
Eigendecomposition:
$$U=(u_1,u_2,\dots,u_p)_{p\times p},\qquad UU^{\top}=U^{\top}U=I,\qquad \Lambda =\mathrm{diag}(\lambda _i),\ i=1,2,\dots,p$$
$$\begin{aligned} \Sigma &=U\Lambda U^{\top}= \begin{pmatrix}u_1 &u_2 &\cdots &u_p\end{pmatrix}\begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & &\lambda_p \end{pmatrix}\begin{pmatrix}u_1^{\top} \\u_2^{\top} \\\vdots \\u_p^{\top}\end{pmatrix}\\ &=\begin{pmatrix}\lambda_1 u_1 &\lambda_2 u_2 &\cdots &\lambda_p u_p\end{pmatrix}\begin{pmatrix}u_1^{\top} \\u_2^{\top} \\\vdots \\u_p^{\top}\end{pmatrix} =\sum_{i=1}^{p}u_i\lambda_iu_i^{\top} \end{aligned}$$
$$\Sigma^{-1}=(U\Lambda U^{\top})^{-1}=(U^{\top})^{-1}\Lambda^{-1} U^{-1}=U\Lambda^{-1} U^{\top}=\sum_{i=1}^{p}u_i\frac{1}{\lambda_i}u_i^{\top}$$
Letting $y_i=u_i^{\top}(x-\mu)$ (a scalar):
$$\begin{aligned}(x-\mu)^{\top}\Sigma^{-1}(x-\mu) &=(x-\mu)^{\top}\sum_{i=1}^{p}u_i\frac{1}{\lambda_i}u_i^{\top}(x-\mu)\\ &=\sum_{i=1}^{p}(x-\mu)^{\top}u_i\frac{1}{\lambda_i}u_i^{\top}(x-\mu)\\ &=\sum_{i=1}^{p}y_i\frac{1}{\lambda_i}y_i=\sum_{i=1}^{p}\frac{y_i^2}{\lambda_i}\end{aligned}$$
When $p=2$, a level set $\frac{y_1^2}{\lambda_1}+\frac{y_2^2}{\lambda_2}=c$ is an ellipse: the data cluster in an ellipse whose principal axes are the $y$ coordinates obtained from $x$ by the change of basis $y_i=u_i^{\top}(x-\mu)$.
When $\lambda_1=\lambda_2$, the contours are circles and the distribution is isotropic (no preferred direction).
$$p(x) \to \Sigma \to \Sigma^{-1} \to (x-\mu)^{\top}\Sigma^{-1}(x-\mu)=\sum_{i=1}^{p}\frac{y_i^2}{\lambda_i}$$
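The identity above is easy to verify numerically. A sketch with an arbitrary 2×2 covariance matrix:

```python
import numpy as np

# Verify (x-mu)^T Sigma^{-1} (x-mu) = sum_i y_i^2 / lambda_i,
# where Sigma = U diag(lambda) U^T and y = U^T (x - mu).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])     # arbitrary positive-definite covariance
x = np.array([1.5, -0.5])

lam, U = np.linalg.eigh(Sigma)     # eigendecomposition of a symmetric matrix
d_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
y = U.T @ (x - mu)                 # coordinates in the eigenbasis
d_eigen = np.sum(y**2 / lam)
print(d_direct, d_eigen)           # the two values agree
```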
3. Limitations
- Too many parameters: $\Sigma$ has $\frac{p(p+1)}{2}$ free parameters, i.e. $O(p^2)$.
- Simplifying $\Sigma$ (e.g. to a diagonal or isotropic matrix) restricts the shapes the density can take.
- A single Gaussian is unimodal and cannot represent data drawn from multiple classes.
4. Marginal & conditional distributions
$$p(x)=\frac{1}{(2\pi)^{p/2}\left| \Sigma\right|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$
with $x\in \mathbb{R}^{p}$ a r.v. and
$$x=\begin{pmatrix}x_1 \\x_2 \\\vdots \\x_p\end{pmatrix},\quad \mu=\begin{pmatrix}\mu_1 \\\mu_2 \\\vdots \\\mu_p\end{pmatrix},\quad \Sigma= \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots &\sigma_{1p} \\ \sigma_{21} & \sigma_{22} &\cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots &\vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots &\sigma_{pp} \end{pmatrix}_{p\times p}$$
Partition
$$x=\begin{pmatrix}x_a\\x_b\end{pmatrix},\quad \mu=\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix},\quad \Sigma=\begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} &\Sigma_{bb}\end{pmatrix}$$
where $x_a$ is $m$-dimensional, $x_b$ is $n$-dimensional, and $m+n=p$.
Find $p(x_a)$ and $p(x_b\mid x_a)$; by symmetry, $p(x_b)$ and $p(x_a\mid x_b)$ follow.
Theorem: if $x\sim\mathcal N(\mu,\Sigma)$ and $y=Ax+B$, then
$$y\sim\mathcal N(A\mu+B,\ A\Sigma A^{\top})$$
since
$$E[y]=E[Ax+B]=AE[x]+B=A\mu+B$$
$$Var[y]=Var[Ax+B]=Var[Ax]=A\,Var[x]\,A^{\top}=A\Sigma A^{\top}$$
(the constant shift $B$ does not affect the variance).
- Find $p(x_a)$:
$$x_a= \begin{pmatrix}I_m & 0\end{pmatrix} \begin{pmatrix}x_a\\x_b\end{pmatrix}$$
$$E[x_a]=\begin{pmatrix}I_m & 0\end{pmatrix} \begin{pmatrix}\mu_a\\\mu_b\end{pmatrix}=\mu_a$$
$$Var[x_a]=\begin{pmatrix}I_m & 0\end{pmatrix} \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} &\Sigma_{bb}\end{pmatrix} \begin{pmatrix}I_m \\ 0\end{pmatrix}=\Sigma_{aa}$$
$$\therefore x_a \sim \mathcal N(\mu_a,\Sigma_{aa})$$
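The marginal result $x_a \sim \mathcal N(\mu_a,\Sigma_{aa})$ can be checked empirically by sampling from the joint and keeping only the first $m$ coordinates (all constants below are arbitrary):

```python
import numpy as np

# Sample from a 3-D joint Gaussian and compare the empirical mean and
# covariance of x_a (first m coordinates) with mu_a and Sigma_aa.
rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m = 2                                       # x_a = first m coordinates
Z = rng.multivariate_normal(mu, Sigma, size=200000)
Xa = Z[:, :m]
print(Xa.mean(axis=0))                      # ~ mu_a
print(np.cov(Xa.T))                         # ~ Sigma_aa
```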
- Find $p(x_{b\cdot a})$:
Define
$$\begin{cases} x_{b\cdot a } &=x_b-\Sigma_{ba}\Sigma_{aa}^{-1}x_a\\ \mu _{b\cdot a } &=\mu _b-\Sigma_{ba}\Sigma_{aa}^{-1}\mu _a \\ \Sigma_{bb\cdot a } &=\Sigma_{bb}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \end{cases}$$
Then
$$x_{b\cdot a}= \begin{pmatrix} -\Sigma_{ba}\Sigma_{aa}^{-1} & I_n \end{pmatrix} \begin{pmatrix} x_a\\ x_b \end{pmatrix}=Ax$$
$$E[x_{b\cdot a}]= \begin{pmatrix} -\Sigma_{ba}\Sigma_{aa} ^{-1}& I_n \end{pmatrix} \begin{pmatrix}\mu_a\\\mu_b\end{pmatrix} =\mu_b-\Sigma_{ba}\Sigma_{aa}^{-1}\mu_a =\mu_{b\cdot a}$$
$$\begin{aligned} Var[x_{b\cdot a}] &=\begin{pmatrix} -\Sigma_{ba}\Sigma_{aa} ^{-1}& I_n \end{pmatrix} \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} &\Sigma_{bb}\end{pmatrix} \begin{pmatrix} -\Sigma_{aa} ^{-1}\Sigma_{ab}\\ I_n \end{pmatrix}\\ &=\Sigma_{bb}-\Sigma_{ba}\Sigma_{aa} ^{-1}\Sigma_{ab} =\Sigma_{bb\cdot a} \end{aligned}$$
$$\therefore x_{b\cdot a} \sim \mathcal N(\mu_{b\cdot a},\Sigma_{bb\cdot a})$$
Moreover, $Cov(x_{b\cdot a},x_a)=\Sigma_{ba}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{aa}=0$, and since they are jointly Gaussian, $x_{b\cdot a}$ is independent of $x_a$.
- Find $p(x_b\mid x_a)$:
$$x_b =x_{b\cdot a}+\Sigma_{ba}\Sigma_{aa} ^{-1}x_a$$
Because $x_{b\cdot a}$ is independent of $x_a$, conditioning on $x_a$ only adds the constant $\Sigma_{ba}\Sigma_{aa}^{-1}x_a$ to $x_{b\cdot a}$:
$$E[x_b\mid x_a]=\mu_{b\cdot a}+\Sigma_{ba}\Sigma_{aa} ^{-1}x_a$$
$$Var[x_b\mid x_a] =Var[x_{b\cdot a}]=\Sigma_{bb\cdot a}$$
$$\therefore x_b\mid x_a \sim \mathcal N(\mu_{b\cdot a}+\Sigma_{ba}\Sigma_{aa} ^{-1}x_a,\ \Sigma_{bb\cdot a})$$
$$x \to y=Ax+B \to \begin{cases} E[y] \\ Var[y] \end{cases} \to y \sim \mathcal N(E[y],Var[y])$$
$$\begin{aligned} \begin{pmatrix}x_a\\x_b\end{pmatrix} \to x_a &\to \begin{cases} E[x_a] \\ Var[x_a] \end{cases} \to x_a \sim \mathcal N(E[x_a],Var[x_a])\\ \to x_{b \cdot a} &\to \begin{cases} E[x_{b \cdot a}] \\ Var[x_{b \cdot a}] \end{cases} \to x_{b \cdot a} \sim \mathcal N(E[x_{b \cdot a}],Var[x_{b \cdot a}])\\ \to x_b \mid x_a &\to \begin{cases} E[x_b \mid x_a] \\ Var[x_b \mid x_a] \end{cases} \to x_b \mid x_a \sim \mathcal N(E[x_b \mid x_a],Var[x_b \mid x_a]) \end{aligned}$$
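The conditional formulas can likewise be checked empirically by conditioning samples on $x_a$ falling in a thin slab around a chosen value. This slab approximation and all constants below are arbitrary illustration choices:

```python
import numpy as np

# Check E[x_b|x_a] = mu_b + Sigma_ba Sigma_aa^{-1} (x_a - mu_a) and
# Var[x_b|x_a] = Sigma_bb - Sigma_ba Sigma_aa^{-1} Sigma_ab in 2-D,
# approximating the conditioning by a thin slab |x_a - xa0| < 0.02.
rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
Z = rng.multivariate_normal(mu, Sigma, size=2_000_000)
xa0 = 1.0
xb = Z[np.abs(Z[:, 0] - xa0) < 0.02, 1]     # samples with x_a near xa0

mean_theory = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (xa0 - mu[0])  # 0.6
var_theory = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]        # 1.64
print(xb.mean(), mean_theory)
print(xb.var(), var_theory)
```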
5. Joint distribution (linear Gaussian model)
Given
$$p(x) = \mathcal N(x\mid \mu,\Lambda^{-1}),\qquad p(y\mid x) = \mathcal N(y\mid Ax+b,L^{-1})$$
find $p(y)$ and $p(x\mid y)$.
Write $y=Ax+b+\varepsilon$, where $x$, $y$, $\varepsilon$ are r.v.s and $\varepsilon \sim \mathcal N(0,L^{-1})$ is independent of $x$.
$$E[y]=E[Ax+b+\varepsilon]=E[Ax+b]+E[\varepsilon]=A\mu +b$$
$$Var[y]=Var[Ax+b+\varepsilon]=Var[Ax+b]+Var[\varepsilon]=A\Lambda^{-1}A^{\top}+L^{-1}$$
$$\therefore y \sim \mathcal N(A\mu +b,\ A\Lambda^{-1}A^{\top}+L^{-1})$$
$$z=\begin{pmatrix} x\\y \end{pmatrix} \sim \mathcal N\left(\begin{bmatrix} \mu \\ A\mu +b \end{bmatrix},\begin{pmatrix} \Lambda^{-1} &\Delta \\ \Delta^{\top} &A\Lambda^{-1}A^{\top}+L^{-1} \end{pmatrix}\right)$$
where, using the independence of $x$ and $\varepsilon$ (so the cross term vanishes),
$$\begin{aligned} \Delta&=Cov(x,y)=E[(x-E[x])(y-E[y])^{\top}]\\ &=E[(x-\mu)(y-A\mu -b)^{\top}]\\ &=E[(x-\mu)(Ax+b+\varepsilon-A\mu -b)^{\top}]\\ &=E[(x-\mu)(Ax-A\mu )^{\top}]+E[(x-\mu)\varepsilon^{\top}]\\ &=E[(x-\mu)(x-\mu )^{\top}]A^{\top}+0\\ &=Var[x]A^{\top}=\Lambda^{-1}A^{\top} \end{aligned}$$
$$z=\begin{pmatrix} x\\ y \end{pmatrix} \sim \mathcal N\left(\begin{bmatrix} \mu \\ A\mu +b \end{bmatrix},\begin{pmatrix} \Lambda^{-1} &\Lambda^{-1}A^{\top} \\ A\Lambda^{-1} &A\Lambda^{-1}A^{\top}+L^{-1} \end{pmatrix}\right)$$
$$p(x\mid y)=\frac{p(x,y)}{p(y)}$$
Applying the conditional formulas of section 4 to this joint distribution (with $x$ in the role of $x_b$ and $y$ in the role of $x_a$):
$$\therefore x\mid y \sim \mathcal N\left(\mu+\Lambda^{-1}A^{\top}(A\Lambda^{-1}A^{\top}+L^{-1})^{-1}(y-A\mu-b),\ \Lambda^{-1}-\Lambda^{-1}A^{\top}(A\Lambda^{-1}A^{\top}+L^{-1})^{-1}A\Lambda^{-1}\right)$$
$$x \to y=Ax+B \to \begin{cases} E[y] \\ Var[y] \end{cases} \to y \sim \mathcal N(E[y],Var[y])$$
$$z=\begin{pmatrix}x\\y\end{pmatrix} \sim \mathcal N\left(\begin{pmatrix}E[x]\\E[y]\end{pmatrix},\begin{pmatrix}Var[x] & \Delta\\ \Delta^{\top} & Var[y]\end{pmatrix}\right)$$
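The results of this section can be sanity-checked by simulating the generative process $y = Ax + b + \varepsilon$ directly; all matrices and constants below are arbitrary assumptions:

```python
import numpy as np

# Simulate x ~ N(mu, Lambda^{-1}), eps ~ N(0, L^{-1}), y = A x + b + eps,
# and compare the empirical moments of y with the derived marginal
# y ~ N(A mu + b, A Lambda^{-1} A^T + L^{-1}).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
b = np.array([1.0, -1.0])
mu = np.array([0.5, 0.5])
Lambda_inv = np.array([[1.0, 0.3],
                       [0.3, 1.0]])   # Var[x]
L_inv = 0.25 * np.eye(2)              # Var[eps]

n = 200000
x = rng.multivariate_normal(mu, Lambda_inv, size=n)
eps = rng.multivariate_normal(np.zeros(2), L_inv, size=n)
y = x @ A.T + b + eps

mean_theory = A @ mu + b
cov_theory = A @ Lambda_inv @ A.T + L_inv
print(y.mean(axis=0), mean_theory)
print(np.cov(y.T), cov_theory)
```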