Maximum Likelihood, MAP, and Regularized Least-Squares Estimation of Linear Model Parameters
This article describes the connections among maximum likelihood estimation, maximum a posteriori (MAP) estimation, and regularized least-squares estimation when solving for the parameter $\boldsymbol w$ of the linear regression model $y(\boldsymbol x,\boldsymbol w)=\boldsymbol w^T\boldsymbol\phi(\boldsymbol x)$.
1. The probabilistic model of linear regression
The generalized linear regression function is defined as
$$y(\boldsymbol x,\boldsymbol w)=\sum_{j=1}^M w_j\phi_j(\boldsymbol x)=\boldsymbol w^T\boldsymbol\phi(\boldsymbol x)$$
where the weight vector is $\boldsymbol w=[w_1,\cdots,w_M]^T$, the observed data point is $\boldsymbol x=[x_{1},\cdots,x_{D}]^T$, and the basis functions are $\boldsymbol\phi(\boldsymbol x)=[\phi_{1}(\boldsymbol x),\cdots,\phi_{M}(\boldsymbol x)]^T$.
Assume the target variable $t$ described by the regression function $y(\boldsymbol x,\boldsymbol w)$ satisfies
$$t=y(\boldsymbol x,\boldsymbol w)+\varepsilon$$
where the observation noise is $\varepsilon\sim\mathcal N(0,\sigma^2)$, i.e.
$$p(\varepsilon)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\dfrac{\varepsilon^2}{2\sigma^2}\right)$$

- Consider an observed data set $\bold X=[\boldsymbol x_1,\cdots,\boldsymbol x_N]$ with corresponding target vector $\bold t=[t_1,\cdots,t_N]^T$. The $i$-th target variable $t_i$ then satisfies
$$t_i=\sum_{j=1}^M w_j\phi_j(\boldsymbol x_i)+\varepsilon_i \qquad\text{or}\qquad t_i=\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)+\varepsilon_i$$
where $\boldsymbol w=[w_1,\cdots,w_M]^T$ is the weight vector and $\boldsymbol x_i=[x_{i1},\cdots,x_{iD}]^T$ is the $i$-th observed data point.
For the basic linear model $t=\sum_{j=1}^M w_j x_{j}+\varepsilon$ (so $M=D$), the basis functions satisfy $\phi_j(\boldsymbol x)=x_{j}$, hence $\boldsymbol\phi(\boldsymbol x_i)=[x_{i1},\cdots,x_{ij},\cdots,x_{iM}]^T=\boldsymbol x_i$, and therefore
$$t_i=\sum_{j=1}^M w_j x_{ij}+\varepsilon_i \qquad\text{or}\qquad t_i=\boldsymbol w^T\boldsymbol x_i+\varepsilon_i$$
- Since the observation noise satisfies $\varepsilon_i\sim\mathcal N(0,\sigma^2)$, the likelihood of the $i$-th target variable $t_i$ is
$$p(t_i|\boldsymbol w,\boldsymbol x_i)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\dfrac{\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2}{2\sigma^2}\right\}$$
- Assuming the noise terms $\varepsilon_i$ are independent and identically distributed, the likelihood of the full target vector $\bold t=[t_1,\cdots,t_N]^T$ over the data set $\bold X$, with the weight vector $\boldsymbol w$ as parameter, is
$$\begin{aligned}p(\bold t|\boldsymbol w,\bold X)&=p(t_1,\cdots,t_N|\boldsymbol w,\bold X)=\prod_{i=1}^N p(t_i|\boldsymbol w,\boldsymbol x_i)\\ &=\dfrac{1}{(\sqrt{2\pi}\,\sigma)^N}\exp\left\{-\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\right\}\end{aligned}$$
Here "likelihood" always means $p(\bold t|\boldsymbol w,\bold X)$ viewed as a function of the parameter $\boldsymbol w$; the data set $\bold X$ merely acts as a conditioning variable.
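As a quick numerical illustration of this likelihood, the sketch below builds a design matrix for a hypothetical polynomial basis, generates targets from the Gaussian noise model, and evaluates the log of the expression above. The basis choice, sample sizes, and all variable names are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N inputs, polynomial basis [1, x, x^2] (M = 3).
N, M, sigma = 50, 3, 0.2
x = rng.uniform(-1.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)        # row i is phi(x_i)^T = [1, x_i, x_i^2]
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true + rng.normal(0.0, sigma, N)  # t_i = w^T phi(x_i) + eps_i

def log_likelihood(w, Phi, t, sigma):
    """ln p(t | w, X) for i.i.d. Gaussian noise N(0, sigma^2)."""
    n = len(t)
    sse = np.sum((t - Phi @ w) ** 2)
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - sse / (2 * sigma ** 2)

# The log-likelihood is larger at the generating weights than at perturbed ones.
print(log_likelihood(w_true, Phi, t, sigma))
print(log_likelihood(w_true + 0.5, Phi, t, sigma))
```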
2. Maximum likelihood estimation
Having obtained the joint density $p(\bold t|\boldsymbol w,\bold X)$ of all observed targets, the most direct way to solve for the weight vector $\boldsymbol w$ is maximum likelihood estimation. Taking the logarithm of the joint density gives the log-likelihood function:
$$\begin{aligned}\ln p(\bold t|\boldsymbol w,\bold X)&=\ln\dfrac{1}{(\sqrt{2\pi}\,\sigma)^N}-\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\\ &=-\dfrac{N}{2}\ln(2\pi)-N\ln\sigma-\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\end{aligned}$$
The data set $\bold X$ appears only as a conditioning variable and can be ignored when maximizing.
The maximum likelihood estimate of the parameter $\boldsymbol w$ is found by setting the gradient to zero:
$$\nabla_{\boldsymbol w}\ln p(\bold t|\boldsymbol w,\bold X)=\dfrac{1}{\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]\boldsymbol\phi(\boldsymbol x_i)=0$$
This shows that, under the assumption of Gaussian noise $\varepsilon$, maximizing the log-likelihood $\ln p(\bold t|\boldsymbol w,\bold X)$ is just minimizing the sum-of-squares error $\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2$. The zero-gradient condition gives
$$\sum_{i=1}^N t_i\,\boldsymbol\phi(\boldsymbol x_i)=\sum_{i=1}^N\left[\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]\boldsymbol\phi(\boldsymbol x_i)$$
$$\Longrightarrow\qquad\underbrace{\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)&\cdots&\boldsymbol\phi(\boldsymbol x_N)\end{matrix}\right]\left[\begin{matrix}t_1\\\vdots\\t_N\end{matrix}\right]}_{\Phi^T\bold t}=\underbrace{\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)&\cdots&\boldsymbol\phi(\boldsymbol x_N)\end{matrix}\right]\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)^T\boldsymbol w\\\vdots\\\boldsymbol\phi(\boldsymbol x_N)^T\boldsymbol w\end{matrix}\right]}_{\Phi^T\Phi\boldsymbol w}$$
$$\Longrightarrow\qquad\Phi^T\bold t=\Phi^T\Phi\boldsymbol w$$
Solving gives the maximum likelihood solution:
$$\boldsymbol w_{ML}=(\Phi^T\Phi)^{-1}\Phi^T\bold t$$
Equivalently, write the error function in matrix form as $(\bold t-\Phi\boldsymbol w)^T(\bold t-\Phi\boldsymbol w)$ and obtain the same result by matrix differentiation. Here $\boldsymbol\phi(\boldsymbol x)=[\phi_{1}(\boldsymbol x),\cdots,\phi_{M}(\boldsymbol x)]^T$, $\bold t=[t_1,\cdots,t_N]^T$, and the design matrix is
$$\Phi=\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)^T\\\boldsymbol\phi(\boldsymbol x_2)^T\\\vdots\\\boldsymbol\phi(\boldsymbol x_N)^T\end{matrix}\right]=\left[\begin{matrix}\phi_{1}(\boldsymbol x_1)&\phi_{2}(\boldsymbol x_1)&\cdots&\phi_{M}(\boldsymbol x_1)\\\phi_{1}(\boldsymbol x_2)&\phi_{2}(\boldsymbol x_2)&\cdots&\phi_{M}(\boldsymbol x_2)\\\vdots&\vdots&&\vdots\\\phi_{1}(\boldsymbol x_N)&\phi_{2}(\boldsymbol x_N)&\cdots&\phi_{M}(\boldsymbol x_N)\end{matrix}\right]$$
$$\Phi^T\Phi=\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)&\boldsymbol\phi(\boldsymbol x_2)&\cdots&\boldsymbol\phi(\boldsymbol x_N)\end{matrix}\right]\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)^T\\\boldsymbol\phi(\boldsymbol x_2)^T\\\vdots\\\boldsymbol\phi(\boldsymbol x_N)^T\end{matrix}\right]=\sum_{i=1}^N\boldsymbol\phi(\boldsymbol x_i)\boldsymbol\phi(\boldsymbol x_i)^T$$
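The closed-form solution $\boldsymbol w_{ML}=(\Phi^T\Phi)^{-1}\Phi^T\bold t$ can be checked numerically. This is a minimal sketch on hypothetical synthetic data (the polynomial basis and all names are assumptions); it uses `np.linalg.solve` on the normal equations rather than an explicit inverse, which is the numerically preferable form of the same formula.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data from the model t_i = w^T phi(x_i) + eps_i.
N, M, sigma = 200, 3, 0.1
x = rng.uniform(-1.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)        # design matrix, row i = phi(x_i)^T
w_true = np.array([1.0, -2.0, 0.5])
t = Phi @ w_true + rng.normal(0.0, sigma, N)

# Solve Phi^T Phi w = Phi^T t  (i.e. w_ML = (Phi^T Phi)^{-1} Phi^T t).
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)   # close to w_true for this low-noise, well-conditioned case
```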
3. Regularized least-squares estimation
In the previous section, the log-likelihood used to obtain the maximum likelihood solution was
$$\ln p(\bold t|\boldsymbol w,\bold X)=-\dfrac{N}{2}\ln(2\pi)-N\ln\sigma-\dfrac{1}{\sigma^2}\left\{\dfrac{1}{2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\right\}$$
Its last term is the sum-of-squares error function:
$$\begin{aligned}E_D(\boldsymbol w)&=\dfrac{1}{2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\\ &=\dfrac{1}{2}(\bold t-\Phi\boldsymbol w)^T(\bold t-\Phi\boldsymbol w)=\dfrac{1}{2}\Vert\bold t-\Phi\boldsymbol w\Vert^2\end{aligned}$$
Maximizing the log-likelihood $\ln p(\bold t|\boldsymbol w,\bold X)$ is therefore equivalent to minimizing $E_D(\boldsymbol w)$. To prevent overfitting when the sum-of-squares error alone is used as the cost function, a regularization term can be added, giving the modified cost
$$E_D(\boldsymbol w)+\lambda E_W(\boldsymbol w)$$
where $\lambda$ is the regularization coefficient, balancing the error function against the regularizer. A common choice of regularizer is
$$E_W(\boldsymbol w)=\dfrac{1}{2}\Vert\boldsymbol w\Vert^2=\dfrac{1}{2}\boldsymbol w^T\boldsymbol w$$
This yields the regularized cost function $F(\boldsymbol w)$:
$$F(\boldsymbol w)=\dfrac{1}{2}\Vert\bold t-\Phi\boldsymbol w\Vert^2+\dfrac{\lambda}{2}\Vert\boldsymbol w\Vert^2$$
(1) $\lambda=0$ expresses complete confidence in the observation model described by the training set $\bold X$;
(2) $\lambda=\infty$ expresses no confidence in that observation model.
To minimize the cost function $F(\boldsymbol w)$, set
$$\nabla_{\boldsymbol w}F(\boldsymbol w)=-\Phi^T\bold t+\left(\Phi^T\Phi+\lambda\bold I\right)\boldsymbol w=0$$
which yields the regularized least-squares solution:
$$\boldsymbol w=\left[\Phi^T\Phi+\lambda\bold I\right]^{-1}\Phi^T\bold t$$
For the basic linear model $t=\sum_{j=1}^M w_j x_{j}+\varepsilon$, where $\boldsymbol\phi(\boldsymbol x_i)=\boldsymbol x_i$ and
$$t_i=\sum_{j=1}^M w_j x_{ij}+\varepsilon_i \qquad\text{or}\qquad t_i=\boldsymbol w^T\boldsymbol x_i+\varepsilon_i$$
the sum-of-squares error is
$$E_D(\boldsymbol w)=\sum_{i=1}^N\varepsilon_i^2=\sum_{i=1}^N(t_i-\boldsymbol w^T\boldsymbol x_i)^2=(\bold t-\Phi\boldsymbol w)^T(\bold t-\Phi\boldsymbol w)$$
with
$$\Phi=\left[\begin{matrix}\boldsymbol\phi(\boldsymbol x_1)^T\\\boldsymbol\phi(\boldsymbol x_2)^T\\\vdots\\\boldsymbol\phi(\boldsymbol x_N)^T\end{matrix}\right]=\left[\begin{matrix}\boldsymbol x_1^T\\\boldsymbol x_2^T\\\vdots\\\boldsymbol x_N^T\end{matrix}\right]=\left[\begin{matrix}x_{11}&x_{12}&\cdots&x_{1M}\\x_{21}&x_{22}&\cdots&x_{2M}\\\vdots&\vdots&&\vdots\\x_{N1}&x_{N2}&\cdots&x_{NM}\end{matrix}\right]$$
Using the same regularization to prevent overfitting, the objective becomes
$$F(\boldsymbol w)=(\bold t-\Phi\boldsymbol w)^T(\bold t-\Phi\boldsymbol w)+\lambda\boldsymbol w^T\boldsymbol w$$
and setting $\nabla_{\boldsymbol w}F(\boldsymbol w)=0$ again gives $\boldsymbol w=[\Phi^T\Phi+\lambda\bold I]^{-1}\Phi^T\bold t$.
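A small sketch of the regularizer's effect, on assumed synthetic data: with more basis functions than samples ($M>N$) the matrix $\Phi^T\Phi$ is singular and the unregularized normal equations cannot be solved, while $\Phi^T\Phi+\lambda\bold I$ stays invertible for any $\lambda>0$; increasing $\lambda$ shrinks $\Vert\boldsymbol w\Vert$ toward zero. All sizes and names here are illustrative.

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """Regularized least-squares solution w = (Phi^T Phi + lam I)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

rng = np.random.default_rng(2)
# Hypothetical ill-posed case: more basis functions than samples (M > N),
# so Phi^T Phi alone is rank-deficient.
N, M = 8, 12
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

w_small = ridge_solution(Phi, t, 1e-3)
w_large = ridge_solution(Phi, t, 1e3)
# A larger lambda shrinks the weight vector toward zero.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```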
4. Maximum a posteriori (MAP) estimation
From the Bayesian point of view, a prior probability must be considered for the parameter, the weight vector $\boldsymbol w=[w_1,\cdots,w_M]^T$.

- Assume the elements $w_i$ of the weight vector are independent and identically distributed with $w_i\sim\mathcal N(0,\sigma_w^2)$. The prior is then:
$$\begin{aligned}p(\boldsymbol w)&=p(w_1,\cdots,w_M)=\prod_{i=1}^M p(w_i)\\ &=\dfrac{1}{(\sqrt{2\pi}\,\sigma_w)^M}\prod_{i=1}^M\exp\left(-\dfrac{w_i^2}{2\sigma_w^2}\right)\\ &=\dfrac{1}{(\sqrt{2\pi}\,\sigma_w)^M}\exp\left(-\dfrac{1}{2\sigma_w^2}\sum_{i=1}^M w_i^2\right)=\dfrac{1}{(\sqrt{2\pi}\,\sigma_w)^M}\exp\left(-\dfrac{\Vert\boldsymbol w\Vert^2}{2\sigma_w^2}\right)\end{aligned}$$
- By Bayes' theorem, the posterior satisfies
$$p(\boldsymbol w|\bold t,\bold X)=\dfrac{p(\bold t|\boldsymbol w,\bold X)\,p(\boldsymbol w)}{p(\bold t)}\propto p(\bold t|\boldsymbol w,\bold X)\,p(\boldsymbol w)$$
Using the probabilistic model of Section 1, the likelihood of the target vector $\bold t$ is
$$p(\bold t|\boldsymbol w,\bold X)=\dfrac{1}{(\sqrt{2\pi}\,\sigma)^N}\exp\left\{-\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\right\}$$
so that
$$\begin{aligned}p(\boldsymbol w|\bold t,\bold X)&\propto\dfrac{1}{(\sqrt{2\pi}\,\sigma)^N}\exp\left\{-\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2\right\}\dfrac{1}{(\sqrt{2\pi}\,\sigma_w)^M}\exp\left(-\dfrac{\Vert\boldsymbol w\Vert^2}{2\sigma_w^2}\right)\\ &\propto\exp\left\{-\dfrac{1}{2\sigma^2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2-\dfrac{\Vert\boldsymbol w\Vert^2}{2\sigma_w^2}\right\}\end{aligned}$$
Taking logarithms, up to an additive constant,
$$\ln p(\boldsymbol w|\bold t,\bold X)=-\dfrac{1}{\sigma^2}\left\{\dfrac{1}{2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2+\dfrac{\lambda}{2}\Vert\boldsymbol w\Vert^2\right\}+\text{const}$$
where we define $\lambda=\dfrac{\sigma^2}{\sigma_w^2}$.
- The MAP estimate of the weight vector $\boldsymbol w$ is
$$\begin{aligned}\boldsymbol w_{MAP}&=\arg\max_{\boldsymbol w}\,p(\boldsymbol w|\bold t,\bold X)\\ &=\arg\max_{\boldsymbol w}\left\{-\dfrac{1}{2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2-\dfrac{\lambda}{2}\Vert\boldsymbol w\Vert^2\right\}\end{aligned}$$
This amounts to defining the cost function
$$\begin{aligned}F(\boldsymbol w)&=\dfrac{1}{2}\sum_{i=1}^N\left[t_i-\boldsymbol w^T\boldsymbol\phi(\boldsymbol x_i)\right]^2+\dfrac{\lambda}{2}\Vert\boldsymbol w\Vert^2\\ &=\dfrac{1}{2}\Vert\bold t-\Phi\boldsymbol w\Vert^2+\dfrac{\lambda}{2}\Vert\boldsymbol w\Vert^2\end{aligned}$$
so that the MAP estimate is
$$\boldsymbol w_{MAP}=\arg\min_{\boldsymbol w}F(\boldsymbol w)$$
Clearly, under the assumption that the weight elements are i.i.d. with $w_i\sim\mathcal N(0,\sigma_w^2)$, the MAP solution for the weight vector $\boldsymbol w$ is exactly the regularized least-squares solution of Section 3:
$$\boldsymbol w=\left[\Phi^T\Phi+\lambda\bold I\right]^{-1}\Phi^T\bold t$$
In particular, when $\lambda=0$ (i.e. $\sigma_w^2=\infty$), each weight element $w_i$ is approximately uniformly distributed; the prior information on $\boldsymbol w$ is ignored entirely, and the MAP solution becomes equivalent to the maximum likelihood solution:
$$\boldsymbol w_{MAP}=\boldsymbol w_{ML}=\left(\Phi^T\Phi\right)^{-1}\Phi^T\bold t$$
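This equivalence is easy to verify numerically. The sketch below (hypothetical data and names) computes the MAP/RLS solution with a very small $\lambda$, corresponding to a nearly flat prior, and confirms it coincides with the ML solution.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical well-conditioned case (N >> M) so the ML solution exists.
N, M = 100, 4
Phi = rng.normal(size=(N, M))
t = Phi @ np.array([1.0, 0.5, -0.5, 2.0]) + 0.1 * rng.normal(size=N)

def map_solution(Phi, t, lam):
    # w_MAP = (Phi^T Phi + lam I)^{-1} Phi^T t  -- identical to the RLS solution
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # lam = 0: maximum likelihood
w_map = map_solution(Phi, t, 1e-8)               # near-flat prior (sigma_w^2 -> inf)
print(np.max(np.abs(w_map - w_ml)))              # essentially zero
```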
Example: two-moons data classification
```python
import numpy as np
import matplotlib.pyplot as plt

def gen_lineardata(weight, interval):
    # Decision boundary w0*x + w1*y + w2 = 0, solved for y.
    y = -(weight[0]*interval + weight[2])/weight[1]
    return y

def halfmoon(rad, width, dist, n_samp):
    """Generate the two half-moon classes with labels +1 / -1."""
    if n_samp % 2 != 0:
        n_samp += 1
    data = np.zeros((3, n_samp))
    # upper moon (Class 1, label +1)
    rd = np.random.random((2, n_samp//2))
    radius = (rad - width/2) + width*rd[0, :]
    theta = np.pi*rd[1, :]
    x1 = radius*np.cos(theta)
    y1 = radius*np.sin(theta) + dist/2
    label1 = np.ones(len(x1))
    # lower moon (Class 2, label -1)
    rd = np.random.random((2, n_samp//2))
    radius = (rad - width/2) + width*rd[0, :]
    theta = np.pi*rd[1, :]
    x2 = radius*np.cos(-theta) + rad
    y2 = radius*np.sin(-theta) - dist/2
    label2 = -np.ones(len(x2))
    data[0, :] = np.concatenate([x1, x2])
    data[1, :] = np.concatenate([y1, y2])
    data[2, :] = np.concatenate([label1, label2])
    shuffle_seq = np.random.permutation(np.arange(n_samp))
    data_shuffle = data[:, shuffle_seq]
    return data, data_shuffle

def RLS(xhat, target, lambda0):
    # Regularized least-squares solution w = (Phi^T Phi + lambda*I)^{-1} Phi^T t
    Phi = np.asarray(xhat)
    t = np.asarray(target)
    return np.linalg.solve(Phi.T @ Phi + lambda0*np.eye(Phi.shape[1]), Phi.T @ t)

if __name__ == "__main__":
    dNum = 800
    data, data_shuffle = halfmoon(10, 6, 1, dNum)
    #data, data_shuffle = halfmoon(10, 6, -4, dNum)
    pos_data = data[:, 0:dNum//2]
    neg_data = data[:, dNum//2:dNum]
    training_data = data_shuffle.T
    # augmented inputs [x, y, 1] so the bias is absorbed into the weights
    xhat = np.concatenate((training_data[:, 0:2], np.ones((dNum, 1))), axis=1)
    target = training_data[:, 2:]
    weight = RLS(xhat, target, 0)
    print('RLS:', weight.flatten())
    interval = np.linspace(-12, 20, 100)
    y = gen_lineardata(weight, interval)
    plt.figure()
    plt.plot(interval, y, 'k')
    plt.plot(pos_data[0, :], pos_data[1, :], 'b+')
    plt.plot(neg_data[0, :], neg_data[1, :], 'r+')
    plt.title('Regularized least squares')
    plt.show()
```
[Figure: decision boundary on the two-moons data, distance = 1, radius = 10, width = 6]
[Figure: decision boundary on the two-moons data, distance = -4, radius = 10, width = 6]