Statistical Learning Methods: Logistic Regression and Maximum Entropy Models
Optimization Algorithms for Model Learning
Since learning of both the logistic regression model and the maximum entropy model reduces to an optimization problem whose objective is the likelihood function, the two can be discussed together.
Improved Iterative Scaling (IIS)
Improved iterative scaling is an optimization algorithm for learning maximum entropy models.
The maximum entropy model is known to be
$$P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\left( \sum_{i=1}^{n} w_i f_i(x, y) \right)$$
where
$$Z_w(x) = \sum_y \exp\left( \sum_{i=1}^{n} w_i f_i(x, y) \right)$$
The log-likelihood function is
$$L(w) = \sum_{x,y} \tilde P(x, y) \sum_{i=1}^{n} w_i f_i(x, y) - \sum_x \tilde P(x) \log Z_w(x)$$
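To make the definitions above concrete, here is a minimal Python sketch (not from the book) that evaluates $P_w(y \mid x)$, $Z_w(x)$, and the log-likelihood $L(w)$ on a hypothetical toy problem: binary $x$ and $y$, two made-up feature functions, and a made-up empirical distribution.

```python
import numpy as np

# Hypothetical toy problem: x and y both take values in {0, 1}.
# f[i, x, y] stores the value of feature function f_i(x, y).
f = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # f_1(x, y) = 1 iff x == y
    [[0.0, 1.0], [1.0, 0.0]],   # f_2(x, y) = 1 iff x != y
])
w = np.array([0.5, -0.3])                    # current parameter vector
p_xy = np.array([[0.4, 0.1], [0.1, 0.4]])    # made-up empirical joint P~(x, y)

def z_w(x):
    """Normalizer Z_w(x) = sum_y exp(sum_i w_i f_i(x, y))."""
    return np.exp(f[:, x, :].T @ w).sum()

def p_w(x):
    """Conditional model P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w(x)."""
    return np.exp(f[:, x, :].T @ w) / z_w(x)

def log_likelihood():
    """L(w) = sum_{x,y} P~(x,y) sum_i w_i f_i(x,y) - sum_x P~(x) log Z_w(x)."""
    first = sum(p_xy[x, y] * (f[:, x, y] @ w) for x in (0, 1) for y in (0, 1))
    second = sum(p_xy[x].sum() * np.log(z_w(x)) for x in (0, 1))
    return first - second

print(p_w(0))             # a distribution over y; sums to 1
print(log_likelihood())   # ≈ -0.531 for these made-up numbers
```

IIS, and the quasi-Newton method discussed later, both try to increase this $L(w)$ by adjusting $w$.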
- The idea of IIS
  Suppose the current parameter vector of the maximum entropy model is $w = (w_1, w_2, \cdots, w_n)^T$. We want to find a new parameter vector $w + \delta = (w_1 + \delta_1, w_2 + \delta_2, \cdots, w_n + \delta_n)^T$ that increases the value of the model's log-likelihood function.
- The IIS algorithm
  - Input: feature functions $f_1, f_2, \ldots, f_n$; empirical distribution $\tilde P(X, Y)$; model $P_w(y \mid x)$
  - Output: optimal parameters $w_i^*$; optimal model $P_{w^*}$
  - Procedure
    - For all $i \in \{1, 2, \cdots, n\}$, take the initial value $w_i = 0$
    - For each $i \in \{1, 2, \cdots, n\}$:
      - Let $\delta_i$ be the solution of the equation $\sum\limits_{x,y} \tilde P(x) P(y \mid x) f_i(x, y) \exp\left( \delta_i f^\#(x, y) \right) = E_{\tilde P}(f_i)$, where $f^\#(x, y) = \sum\limits_{i=1}^{n} f_i(x, y)$
      - Update $w_i$: $w_i \leftarrow w_i + \delta_i$
    - If not all of the $w_i$ have converged, repeat the previous step.
- Computing $\delta_i$
  - Use Newton's method to solve for $\delta_i^*$ iteratively. Writing $g(\delta_i)$ for the left-hand side of the equation above minus its right-hand side, the iteration formula is:
$$\delta_i^{(k+1)} = \delta_i^{(k)} - \frac{g\left( \delta_i^{(k)} \right)}{g'\left( \delta_i^{(k)} \right)}$$
With a suitably chosen initial value $\delta_i^{(0)}$, Newton's method converges quickly.
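The Newton update above is ordinary one-dimensional root finding. A generic sketch (the specific $g$ below is an illustrative stand-in with the same exponential shape, not the IIS equation itself):

```python
import math

def newton_scalar(g, g_prime, delta0=0.0, tol=1e-10, max_iter=100):
    """Solve g(delta) = 0 by delta_{k+1} = delta_k - g(delta_k) / g'(delta_k)."""
    delta = delta0
    for _ in range(max_iter):
        step = g(delta) / g_prime(delta)
        delta -= step
        if abs(step) < tol:
            break
    return delta

# Illustrative equation: solve exp(2 * delta) - 3 = 0, exact root ln(3) / 2.
root = newton_scalar(lambda d: math.exp(2 * d) - 3,
                     lambda d: 2 * math.exp(2 * d))
print(root)   # ≈ 0.5493 = ln(3) / 2
```

Note that when $f^\#(x, y)$ is a constant $M$ for all $x, y$, the IIS equation can be solved in closed form and no Newton iteration is needed.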
Quasi-Newton Method
For the maximum entropy model:
$$P_w(y \mid x) = \frac{\exp\left( \sum\limits_{i=1}^{n} w_i f_i(x, y) \right)}{\sum\limits_y \exp\left( \sum\limits_{i=1}^{n} w_i f_i(x, y) \right)}$$
the objective function is:
$$\min_{w \in \mathbb{R}^n} f(w) = \sum_x \tilde P(x) \log \sum_y \exp\left( \sum_{i=1}^{n} w_i f_i(x, y) \right) - \sum_{x,y} \tilde P(x, y) \sum_{i=1}^{n} w_i f_i(x, y)$$
and the gradient is:
$$g(w) = \left( \frac{\partial f(w)}{\partial w_1}, \frac{\partial f(w)}{\partial w_2}, \cdots, \frac{\partial f(w)}{\partial w_n} \right)^T$$
where
$$\frac{\partial f(w)}{\partial w_i} = \sum_{x,y} \tilde P(x) P_w(y \mid x) f_i(x, y) - E_{\tilde P}(f_i), \quad i = 1, 2, \cdots, n$$
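As a sanity check on this gradient formula, here is a small Python sketch that evaluates $\partial f(w) / \partial w_i$ directly on a hypothetical toy problem (made-up binary features $f_i(x, y)$ stored as a table `f[i, x, y]` and a made-up empirical distribution):

```python
import numpy as np

# Hypothetical toy setup: binary x and y, two feature functions f[i, x, y].
f = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # f_1(x, y) = 1 iff x == y
    [[0.0, 1.0], [1.0, 0.0]],   # f_2(x, y) = 1 iff x != y
])
p_xy = np.array([[0.4, 0.1], [0.1, 0.4]])   # made-up empirical joint P~(x, y)
p_x = p_xy.sum(axis=1)                      # empirical marginal P~(x)

def gradient(w):
    """g(w)_i = sum_{x,y} P~(x) P_w(y|x) f_i(x,y) - E_{P~}(f_i)."""
    scores = np.exp(np.einsum("i,ixy->xy", w, f))      # unnormalized model scores
    cond = scores / scores.sum(axis=1, keepdims=True)  # P_w(y|x)
    model_expectation = np.einsum("x,xy,ixy->i", p_x, cond, f)
    empirical_expectation = np.einsum("xy,ixy->i", p_xy, f)
    return model_expectation - empirical_expectation

print(gradient(np.zeros(2)))   # ≈ [-0.3, 0.3]: at w = 0 the model is uniform
```

At the optimum the gradient vanishes, i.e. the model's feature expectations match the empirical ones, which is exactly the constraint the maximum entropy model is built from.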
- The BFGS algorithm for maximum entropy model learning
  - Input: feature functions $f_1, f_2, \cdots, f_n$; empirical distribution $\tilde P(x, y)$; objective function $f(w)$; gradient $g(w) = \nabla f(w)$; precision requirement $\varepsilon$
  - Output: optimal parameter $w^*$; optimal model $P_{w^*}(y \mid x)$
  - Procedure
    - Choose an initial point $w^{(0)}$, take $B_0$ to be a symmetric positive definite matrix, and set $k = 0$
    - Compute $g_k = g(w^{(k)})$. If $\|g_k\| < \varepsilon$, stop and output $w^* = w^{(k)}$; otherwise continue
    - Solve $B_k p_k = -g_k$ for $p_k$
    - One-dimensional search: find $\lambda_k$ such that $f(w^{(k)} + \lambda_k p_k) = \min\limits_{\lambda \ge 0} f(w^{(k)} + \lambda p_k)$
    - Set $w^{(k+1)} = w^{(k)} + \lambda_k p_k$
    - Compute $g_{k+1} = g(w^{(k+1)})$. If $\|g_{k+1}\| < \varepsilon$, stop and output $w^* = w^{(k+1)}$; otherwise compute $B_{k+1}$ by
$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T \delta_k} - \frac{B_k \delta_k \delta_k^T B_k}{\delta_k^T B_k \delta_k}$$
where $y_k = g_{k+1} - g_k$ and $\delta_k = w^{(k+1)} - w^{(k)}$
    - Set $k = k + 1$ and return to step 3.
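The steps above can be sketched in Python. This is a minimal illustration on a made-up quadratic objective, with a simple backtracking line search standing in for the exact one-dimensional search of step 4:

```python
import numpy as np

def bfgs(f, grad, w0, eps=1e-8, max_iter=100):
    """Minimal BFGS sketch following the steps above."""
    w = np.asarray(w0, dtype=float)
    B = np.eye(len(w))                        # B_0: identity (symmetric positive definite)
    g = grad(w)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:           # stopping criterion ||g_k|| < eps
            break
        p = np.linalg.solve(B, -g)            # solve B_k p_k = -g_k
        lam = 1.0                             # backtracking search for lambda_k
        while f(w + lam * p) > f(w) + 1e-4 * lam * (g @ p):
            lam *= 0.5
        w_new = w + lam * p                   # w_{k+1} = w_k + lambda_k p_k
        g_new = grad(w_new)
        y, d = g_new - g, w_new - w           # y_k and delta_k
        B = B + np.outer(y, y) / (y @ d) \
              - (B @ np.outer(d, d) @ B) / (d @ B @ d)   # BFGS update of B_k
        w, g = w_new, g_new
    return w

# Illustrative objective: f(w) = (w_0 - 1)^2 + 2 (w_1 + 2)^2, minimum at (1, -2).
f_obj = lambda w: (w[0] - 1) ** 2 + 2 * (w[1] + 2) ** 2
g_obj = lambda w: np.array([2 * (w[0] - 1), 4 * (w[1] + 2)])
print(bfgs(f_obj, g_obj, np.zeros(2)))   # ≈ [1, -2]
```

For the maximum entropy model, `f_obj` and `g_obj` would be replaced by the objective $f(w)$ and gradient $g(w)$ defined earlier in this section.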
References
《统计学习方法》 (Statistical Learning Methods)