Notes on 《Computer Vision: Models, Learning, and Inference》
Chapter 4
Model fitting means determining the model's parameter set $\bm{\theta}$.
Maximum Likelihood (ML)
For mathematical simplicity, assume the data points are drawn independently, i.e. $P(x_i|x_j)=P(x_i)$ for $i\neq j$. The ML procedure is as follows:
- Compute the probability $P(x_i)$ of each data point $x_i$ under the model (with the parameters plugged in).
- Form the product $\prod_i P(x_i)$.
- Find the parameter value that maximizes this product, e.g. by differentiating; for a normal distribution it is easier to take the logarithm first.
The parameters obtained this way are written:
$$\hat{\theta}=\underset{\theta}{\argmax}\left(\prod_i P(x_i|\theta)\right)$$
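The three steps above can be sketched numerically. This is a minimal illustration, not from the book: synthetic data from a unit-variance normal with unknown mean, and a brute-force grid search for the $\theta$ that maximizes the summed log-likelihood.

```python
import numpy as np

# Illustrative setup (not from the book): data from a unit-variance normal
# whose mean we pretend not to know.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)

thetas = np.linspace(-5, 5, 1001)  # candidate parameter values

# log P(x_i | theta) for a unit-variance normal, summed over all data points
# (constants dropped, since they do not move the argmax)
log_lik = np.array([np.sum(-0.5 * (x - t) ** 2) for t in thetas])

theta_hat = thetas[np.argmax(log_lik)]  # argmax_theta prod_i P(x_i | theta)
print(theta_hat)                        # close to the sample mean of x
```

Working in log space avoids underflow: a product of 500 small probabilities would vanish in floating point, while the sum of logs is well behaved.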
Maximum a Posteriori (MAP)
Sometimes prior experience tells us how likely each parameter value is; write this distribution as $P(\theta)$.
$$\hat{\theta}=\underset{\theta}{\argmax}\left[P(\theta|x_{1\ldots I})\right]=\underset{\theta}{\argmax}\left[\frac{P(x_{1\ldots I}|\theta)\,P(\theta)}{P(x_{1\ldots I})}\right]=\underset{\theta}{\argmax}\left[\frac{\prod_i P(x_i|\theta)\,P(\theta)}{P(x_{1\ldots I})}\right]$$
Since the denominator does not depend on $\theta$, it does not affect where the maximum occurs and can be dropped:
$$\hat{\theta}=\underset{\theta}{\argmax}\left(\prod_i P(x_i|\theta)\,P(\theta)\right)$$
This is the MAP formula for the parameters.
ML can be viewed as a special case of MAP in which the prior $P(\theta)$ is uniform.
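The relationship between the two estimators can be seen directly on a grid: MAP simply adds $\log P(\theta)$ to the log-likelihood before taking the argmax. The data, prior, and grid below are illustrative assumptions, not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)  # few points, so the prior matters

thetas = np.linspace(-5, 5, 1001)
log_lik = np.array([np.sum(-0.5 * (x - t) ** 2) for t in thetas])

# Assumed prior belief: theta is probably near 0 (unit-variance normal prior)
log_prior = -0.5 * thetas ** 2

theta_ml  = thetas[np.argmax(log_lik)]              # ML: likelihood only
theta_map = thetas[np.argmax(log_lik + log_prior)]  # MAP: likelihood * prior

# The prior pulls the MAP estimate toward 0; a flat prior (log_prior = 0)
# would make MAP coincide with ML.
print(theta_ml, theta_map)
```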
The Bayesian Approach
The Bayesian approach no longer seeks a single best $\hat{\theta}$; instead it computes the full distribution $P(\theta|x_{1\ldots I})$, acknowledging that every possible value of $\theta$ contributes to the result.
By Bayes' rule:
$$P(\theta|x_{1\ldots I})=\frac{\prod_i P(x_i|\theta)\,P(\theta)}{P(x_{1\ldots I})}$$
Once $P(\theta|x_{1\ldots I})$ is known, predicting the probability of a new data point $x^*$ amounts to taking a weighted average, with the weights given by $P(\theta|x_{1\ldots I})$. By the definition of a probability density, the weights integrate to 1.
$$P(x^*|x_{1\ldots I})=\int P(x^*|\theta)\,P(\theta|x_{1\ldots I})\,d\theta$$
ML and MAP can both be viewed as special cases of the Bayesian approach, if we treat $P(\theta|x_{1\ldots I})$ as a delta function centered at $\hat{\theta}$ (a function that integrates to 1 and is zero everywhere except at $\hat{\theta}$).
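The predictive integral can be approximated on a grid. A sketch under assumptions not in the text (unknown mean, known unit variance, flat prior): compute posterior weights over a grid of $\theta$, then average the likelihood of $x^*$ with those weights.

```python
import numpy as np

# Assumed toy setup: normal data with unknown mean, known variance 1
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=10)

thetas = np.linspace(-5, 9, 2801)
dtheta = thetas[1] - thetas[0]

# Posterior weights P(theta | data) on the grid, flat prior assumed:
# proportional to prod_i P(x_i | theta); normalize so they integrate to 1
log_post = np.array([np.sum(-0.5 * (x - t) ** 2) for t in thetas])
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta

def predictive(x_star):
    # integral of P(x* | theta) P(theta | data) dtheta, as a Riemann sum
    lik = np.exp(-0.5 * (x_star - thetas) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(lik * post) * dtheta

# The resulting predictive density should itself integrate to 1
grid = np.linspace(-10, 14, 2401)
total = sum(predictive(g) for g in grid) * (grid[1] - grid[0])
print(total)
```

Replacing `post` with a spike at $\hat\theta$ (all weight on one grid point) would recover the ML/MAP prediction, illustrating the delta-function remark above.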
Example 1: Univariate Normal Distribution
Problem: given data $\{x_i\}_{i=1}^{I}$ generated by a normal distribution, fit $\mu$ and $\sigma$.
Maximum Likelihood (ML)
$$\hat{\theta}=(\hat{\mu},\hat{\sigma}^2)=\underset{\mu,\sigma^2}{\argmax}\left(\prod_{i=1}^{I}P(x_i|\mu,\sigma^2)\right)\\ =\underset{\mu,\sigma^2}{\argmax}\left(\prod_{i=1}^{I}\mathrm{Norm}_{x_i}[\mu,\sigma^2]\right)\\ =\underset{\mu,\sigma^2}{\argmax}\left(\log\prod_{i=1}^{I}\mathrm{Norm}_{x_i}[\mu,\sigma^2]\right)\\ =\underset{\mu,\sigma^2}{\argmax}\left(-0.5I(\log 2\pi+\log\sigma^2)-0.5\sum_{i=1}^{I}\frac{(x_i-\mu)^2}{\sigma^2}\right)$$
Let
$$L=-0.5I(\log 2\pi+\log\sigma^2)-0.5\sum_{i=1}^{I}\frac{(x_i-\mu)^2}{\sigma^2}$$
Taking partial derivatives and setting them to zero locates the maximum.
$$\frac{\partial L}{\partial\mu}=\sum_{i=1}^{I}\frac{x_i-\mu}{\sigma^2}=0$$
The maximizing $\mu$:
$$\hat{\mu}=\frac{\sum_i x_i}{I}$$
Similarly, maximizing over $\sigma^2$:
$$\hat{\sigma}^2=\frac{\sum_{i=1}^{I}(x_i-\hat{\mu})^2}{I}$$
So the ML estimates are simply the sample mean and sample variance of the data.
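The closed-form result above translates directly to code; the data below are an illustrative assumption.

```python
import numpy as np

# Synthetic data (assumed for illustration): true mu = 3, sigma = 2
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_hat = np.sum(x) / len(x)                       # mu_hat = sum(x_i) / I
sigma2_hat = np.sum((x - mu_hat) ** 2) / len(x)   # sum((x_i - mu_hat)^2) / I

# Equivalent numpy one-liners: np.mean(x) and np.var(x) (np.var uses the
# biased 1/I normalizer by default, matching the ML formula)
print(mu_hat, sigma2_hat)
```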
Maximum a Posteriori (MAP)
$$\hat{\theta}=(\hat{\mu},\hat{\sigma}^2)=\underset{\mu,\sigma^2}{\argmax}\left(\prod_i P(x_i|\mu,\sigma^2)\,P(\mu,\sigma^2)\right)\\ =\underset{\mu,\sigma^2}{\argmax}\left(\prod_{i=1}^{I}\mathrm{Norm}_{x_i}[\mu,\sigma^2]\,\mathrm{NormInvGam}_{\mu,\sigma^2}[\alpha,\beta,\gamma,\delta]\right)\\ =\underset{\mu,\sigma^2}{\argmax}\left(\log\left(\prod_{i=1}^{I}\mathrm{Norm}_{x_i}[\mu,\sigma^2]\,\mathrm{NormInvGam}_{\mu,\sigma^2}[\alpha,\beta,\gamma,\delta]\right)\right)$$
Maximizing as before (the algebra is tedious and omitted here) gives:
$$\hat{\mu}=\frac{\sum_i x_i+\gamma\delta}{I+\gamma},\qquad \hat{\sigma}^2=\frac{\sum_i(x_i-\hat{\mu})^2+2\beta+\gamma(\delta-\hat{\mu})^2}{I+3+2\alpha}$$
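These closed-form MAP estimates are easy to check numerically. The hyperparameter values and data below are illustrative assumptions; with many data points the prior's influence fades and MAP approaches ML.

```python
import numpy as np

def map_normal(x, alpha=1.0, beta=1.0, gamma=1.0, delta=0.0):
    """MAP estimates of (mu, sigma^2) under a NormInvGam prior
    (hyperparameter defaults are illustrative assumptions)."""
    I = len(x)
    mu = (np.sum(x) + gamma * delta) / (I + gamma)
    sigma2 = (np.sum((x - mu) ** 2) + 2 * beta + gamma * (delta - mu) ** 2) \
             / (I + 3 + 2 * alpha)
    return mu, sigma2

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_map, s2_map = map_normal(x)
# With I = 10000, the prior terms are negligible and MAP ~ ML
print(mu_map, s2_map)
```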
The Bayesian Approach
$$P(\mu,\sigma^2|x_{1\ldots I})=\frac{P(x_{1\ldots I}|\mu,\sigma^2)\,P(\mu,\sigma^2)}{P(x_{1\ldots I})}\\ =\frac{\prod_i\mathrm{Norm}_{x_i}[\mu,\sigma^2]\,\mathrm{NormInvGam}_{\mu,\sigma^2}[\alpha,\beta,\gamma,\delta]}{P(x_{1\ldots I})}\\ =\frac{\kappa\,\mathrm{NormInvGam}_{\mu,\sigma^2}[\tilde\alpha,\tilde\beta,\tilde\gamma,\tilde\delta]}{P(x_{1\ldots I})}$$
The third line uses conjugacy: the posterior takes the same form as the prior.
$$\tilde\alpha=\alpha+I/2,\quad \tilde\gamma=\gamma+I,\quad \tilde\delta=\frac{\gamma\delta+\sum_i x_i}{\gamma+I},\quad \tilde\beta=\frac{\sum_i x_i^2}{2}+\beta+\frac{\gamma\delta^2}{2}-\frac{(\gamma\delta+\sum_i x_i)^2}{2(\gamma+I)}$$
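The conjugate update is pure arithmetic, so the whole posterior is available in closed form. A sketch with assumed hyperparameters and synthetic data:

```python
import numpy as np

def nig_posterior(x, alpha, beta, gamma, delta):
    """Posterior NormInvGam hyperparameters after observing normal data x."""
    I = len(x)
    alpha_t = alpha + I / 2
    gamma_t = gamma + I
    delta_t = (gamma * delta + np.sum(x)) / (gamma + I)
    beta_t = (np.sum(x ** 2) / 2 + beta + gamma * delta ** 2 / 2
              - (gamma * delta + np.sum(x)) ** 2 / (2 * (gamma + I)))
    return alpha_t, beta_t, gamma_t, delta_t

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1_000)

# Prior hyperparameters are illustrative assumptions
a, b, g, d = nig_posterior(x, alpha=1.0, beta=1.0, gamma=1.0, delta=0.0)
# delta~ blends the prior mean (delta) with the sample mean, weighted by gamma vs I
print(a, b, g, d)
```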
Since a probability density must integrate to 1, $\kappa$ and the denominator must cancel exactly, i.e.:
$$P(\mu,\sigma^2|x_{1\ldots I})=\mathrm{NormInvGam}_{\mu,\sigma^2}[\tilde\alpha,\tilde\beta,\tilde\gamma,\tilde\delta]$$