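Improved Iterative Scaling (IIS) is a widely used optimization algorithm for estimating the parameters of log-linear models. Both the Maximum Entropy model (MaxEnt) and the Conditional Random Field (CRF) can be trained with IIS, where it makes parameter estimation more efficient.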
The model is given by:
$$P_{\lambda}(y|x) = \frac{1}{Z_{\lambda}(x)} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x,y)\right)$$
where $f_i(x,y)$ is a binary feature function, $\lambda = (\lambda_1, \dots, \lambda_n)$ is the parameter vector, and $Z_{\lambda}(x)$ is the normalization factor:
$$Z_{\lambda}(x) = \sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x,y)\right)$$
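As a minimal sketch of the model definition, the snippet below evaluates $P_{\lambda}(y|x)$ on a toy problem. The array layout `feats[x, y, i]`, the function name `conditional_probs`, and the toy numbers are all illustrative assumptions, not part of the original text.

```python
import numpy as np

def conditional_probs(lam, feats):
    """Return P[x, y] = P_lambda(y | x), where feats[x, y, i] = f_i(x, y)."""
    scores = feats @ lam                                  # sum_i lam_i * f_i(x, y)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilise exp; cancels after normalising
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum(axis=1, keepdims=True)     # divide by Z_lambda(x)

# Hypothetical toy data: 2 inputs x, 3 labels y, 2 binary features.
feats = np.array([
    [[1, 0], [0, 1], [1, 1]],
    [[1, 1], [1, 0], [0, 1]],
], dtype=float)
lam = np.array([0.5, -0.2])

P = conditional_probs(lam, feats)
print(P)
```

Each row of `P` is a distribution over $y$ for one $x$; with $\lambda = 0$ every label is equally likely.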
Given $P_{\lambda}(y|x)$, the log-likelihood with respect to the empirical distribution $\tilde{P}(x,y)$ is:
$$L(\lambda) = \sum_{x,y} \tilde{P}(x,y) \log P_{\lambda}(y|x)$$
where $\tilde{P}(x,y)$ is the relative frequency of the sample $(x,y)$ in the training data. When the parameters change from $\lambda$ to $\lambda+\delta$, the log-likelihood changes by:
$$\begin{aligned} L(\lambda+\delta) - L(\lambda) &= \sum_{x,y} \tilde{P}(x,y) \log P_{\lambda+\delta}(y|x) - \sum_{x,y} \tilde{P}(x,y) \log P_{\lambda}(y|x) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_i \delta_i f_i(x,y) - \sum_x \tilde{P}(x) \log \frac{Z_{\lambda+\delta}(x)}{Z_{\lambda}(x)} \end{aligned}$$
Here $\tilde{P}(x) = \sum_y \tilde{P}(x,y)$ is the empirical marginal of $x$.
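The identity for the change in log-likelihood can be checked numerically. The following is a small sketch under assumed toy data; the helper names `log_z` and `log_likelihood` and all numbers are hypothetical.

```python
import numpy as np

def log_z(lam, feats):
    """log Z_lambda(x) for every x, using a stabilised log-sum-exp."""
    scores = feats @ lam
    m = scores.max(axis=1)
    return m + np.log(np.exp(scores - m[:, None]).sum(axis=1))

def log_likelihood(lam, feats, p_xy):
    """L(lambda) = sum_{x,y} P~(x,y) * log P_lambda(y|x)."""
    log_p = feats @ lam - log_z(lam, feats)[:, None]
    return float((p_xy * log_p).sum())

# Hypothetical toy data: feats[x, y, i] = f_i(x, y), p_xy[x, y] = P~(x, y).
feats = np.array([
    [[1, 0], [0, 1], [1, 1]],
    [[1, 1], [1, 0], [0, 1]],
], dtype=float)
p_xy = np.array([[0.3, 0.1, 0.1],
                 [0.1, 0.3, 0.1]])
lam = np.array([0.5, -0.2])
delta = np.array([0.3, 0.1])

# Left-hand side: direct difference of log-likelihoods.
lhs = log_likelihood(lam + delta, feats, p_xy) - log_likelihood(lam, feats, p_xy)

# Right-hand side: linear term minus the expected log-ratio of normalisers.
p_x = p_xy.sum(axis=1)
rhs = (p_xy * (feats @ delta)).sum() \
    - (p_x * (log_z(lam + delta, feats) - log_z(lam, feats))).sum()

print(lhs, rhs)
```

The two printed values should agree up to floating-point error.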
Applying the inequality $-\log\alpha \ge 1-\alpha$, which holds for all $\alpha > 0$ (differentiation shows that $\alpha - 1 - \log\alpha$ attains its minimum value $0$ at $\alpha = 1$), together with $\sum_x \tilde{P}(x) = 1$, gives a lower bound on the change in log-likelihood:
$$\begin{aligned} L(\lambda+\delta) - L(\lambda) &\ge \sum_{x,y} \tilde{P}(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde{P}(x) \frac{Z_{\lambda+\delta}(x)}{Z_{\lambda}(x)} \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde{P}(x) \sum_y P_{\lambda}(y|x) \exp\left(\sum_i \delta_i f_i(x,y)\right) \end{aligned}$$
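As a sanity check, the bound can be evaluated for random steps $\delta$ and compared with the true gain. This is only a sketch on assumed toy data; `bound_a` and the other names are hypothetical.

```python
import numpy as np

def conditional_probs(lam, feats):
    """P[x, y] = P_lambda(y | x), with feats[x, y, i] = f_i(x, y)."""
    scores = feats @ lam
    scores = scores - scores.max(axis=1, keepdims=True)
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def log_likelihood(lam, feats, p_xy):
    return float((p_xy * np.log(conditional_probs(lam, feats))).sum())

def bound_a(delta, lam, feats, p_xy):
    """Lower bound obtained from -log(a) >= 1 - a."""
    p_x = p_xy.sum(axis=1)
    shift = feats @ delta                       # sum_i delta_i * f_i(x, y)
    P = conditional_probs(lam, feats)
    return float((p_xy * shift).sum() + 1.0
                 - (p_x * (P * np.exp(shift)).sum(axis=1)).sum())

# Hypothetical toy data.
feats = np.array([
    [[1, 0], [0, 1], [1, 1]],
    [[1, 1], [1, 0], [0, 1]],
], dtype=float)
p_xy = np.array([[0.3, 0.1, 0.1],
                 [0.1, 0.3, 0.1]])
lam = np.array([0.5, -0.2])

rng = np.random.default_rng(0)
gaps = []
for _ in range(100):
    delta = rng.normal(size=2)
    gain = log_likelihood(lam + delta, feats, p_xy) - log_likelihood(lam, feats, p_xy)
    gaps.append(gain - bound_a(delta, lam, feats, p_xy))

print(min(gaps))  # should never be negative beyond rounding error
```

At $\delta = 0$ both the gain and the bound are zero, so the bound is tight there.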
Introduce $f^{\#}(x,y)$, the number of features that are active on $(x,y)$:
$$f^{\#}(x,y) = \sum_i f_i(x,y)$$
Denote the right-hand side of the bound above by $A(\delta|\lambda)$, so that $L(\lambda+\delta) - L(\lambda) \ge A(\delta|\lambda)$. Multiplying and dividing the exponent by $f^{\#}(x,y)$ gives:
$$A(\delta|\lambda) = \sum_{x,y} \tilde{P}(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde{P}(x) \sum_y P_{\lambda}(y|x) \exp\left(f^{\#}(x,y) \sum_i \frac{\delta_i f_i(x,y)}{f^{\#}(x,y)}\right)$$
The weights $f_i(x,y)/f^{\#}(x,y)$ are nonnegative and sum to 1 (assuming $f^{\#}(x,y) > 0$), so Jensen's inequality for the convex exponential function, $\exp\left(\sum_x p(x) q(x)\right) \le \sum_x p(x) \exp q(x)$, yields:
$$A(\delta|\lambda) \ge \sum_{x,y} \tilde{P}(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde{P}(x) \sum_y P_{\lambda}(y|x) \sum_i \frac{f_i(x,y)}{f^{\#}(x,y)} \exp\left(\delta_i f^{\#}(x,y)\right)$$
Denote the right-hand side of this inequality by:
$$B(\delta|\lambda) = \sum_{x,y} \tilde{P}(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde{P}(x) \sum_y P_{\lambda}(y|x) \sum_i \frac{f_i(x,y)}{f^{\#}(x,y)} \exp\left(\delta_i f^{\#}(x,y)\right)$$
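The relation $B(\delta|\lambda) \le A(\delta|\lambda)$ can also be checked numerically. The sketch below assumes toy data in which every pair $(x,y)$ activates at least one feature, so that $f^{\#}(x,y) > 0$ and the division is well defined; `bound_a`, `bound_b`, and all numbers are hypothetical.

```python
import numpy as np

def conditional_probs(lam, feats):
    """P[x, y] = P_lambda(y | x), with feats[x, y, i] = f_i(x, y)."""
    scores = feats @ lam
    scores = scores - scores.max(axis=1, keepdims=True)
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def bound_a(delta, lam, feats, p_xy):
    p_x = p_xy.sum(axis=1)
    shift = feats @ delta
    P = conditional_probs(lam, feats)
    return float((p_xy * shift).sum() + 1.0
                 - (p_x * (P * np.exp(shift)).sum(axis=1)).sum())

def bound_b(delta, lam, feats, p_xy):
    p_x = p_xy.sum(axis=1)
    f_sharp = feats.sum(axis=2)                      # f#(x, y), assumed > 0 everywhere
    P = conditional_probs(lam, feats)
    weights = feats / f_sharp[:, :, None]            # f_i / f#, sums to 1 over i
    inner = (weights * np.exp(delta * f_sharp[:, :, None])).sum(axis=2)
    return float((p_xy * (feats @ delta)).sum() + 1.0
                 - (p_x * (P * inner).sum(axis=1)).sum())

# Hypothetical toy data with at least one active feature per (x, y).
feats = np.array([
    [[1, 0], [0, 1], [1, 1]],
    [[1, 1], [1, 0], [0, 1]],
], dtype=float)
p_xy = np.array([[0.3, 0.1, 0.1],
                 [0.1, 0.3, 0.1]])
lam = np.array([0.5, -0.2])

rng = np.random.default_rng(1)
slack = [bound_a(d, lam, feats, p_xy) - bound_b(d, lam, feats, p_xy)
         for d in rng.normal(size=(100, 2))]
print(min(slack))  # A - B should never be negative beyond rounding error
```

The point of replacing $A$ by $B$ is that each $\delta_i$ appears in its own exponential term in $B$, so the maximization decouples across coordinates.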
Differentiating with respect to $\delta_i$ gives:
$$\frac{\partial B(\delta|\lambda)}{\partial \delta_i} = \sum_{x,y} \tilde{P}(x,y) f_i(x,y) - \sum_x \tilde{P}(x) \sum_y P_{\lambda}(y|x) f_i(x,y) \exp\left(\delta_i f^{\#}(x,y)\right)$$
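Setting $\frac{\partial B(\delta|\lambda)}{\partial \delta_i} = 0$ gives an equation that involves $\delta_i$ alone, so each $\delta_i$ can be solved for independently. Update $\lambda_i \leftarrow \lambda_i + \delta_i$ and repeat until $\lambda$ converges.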
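The whole procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the one-dimensional equation for each $\delta_i$ is solved with Newton's method (one possible choice of root finder), every feature is assumed to occur in the data, $f^{\#}(x,y) > 0$ is assumed for all pairs, and the names `iis`, `solve_delta`, and the toy data are hypothetical.

```python
import numpy as np

def conditional_probs(lam, feats):
    """P[x, y] = P_lambda(y | x), with feats[x, y, i] = f_i(x, y)."""
    scores = feats @ lam
    scores = scores - scores.max(axis=1, keepdims=True)
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def solve_delta(emp_i, weights_i, f_sharp, iters=50, tol=1e-12):
    """Newton's method for emp_i = sum(weights_i * exp(delta * f_sharp))."""
    delta = 0.0
    for _ in range(iters):
        e = weights_i * np.exp(delta * f_sharp)
        g = emp_i - e.sum()                  # dB/d(delta_i) at the current delta
        if abs(g) < tol:
            break
        delta += g / (e * f_sharp).sum()     # Newton step: delta - g / g'
    return delta

def iis(feats, p_xy, n_iter=500):
    p_x = p_xy.sum(axis=1, keepdims=True)                # empirical marginal P~(x)
    f_sharp = feats.sum(axis=2)                          # f#(x, y)
    emp = (p_xy[:, :, None] * feats).sum(axis=(0, 1))    # empirical E[f_i]
    lam = np.zeros(feats.shape[2])
    history = []
    for _ in range(n_iter):
        P = conditional_probs(lam, feats)
        history.append(float((p_xy * np.log(P)).sum()))  # log-likelihood before the update
        joint = p_x * P                                  # P~(x) * P_lambda(y|x)
        delta = np.array([solve_delta(emp[i], joint * feats[:, :, i], f_sharp)
                          for i in range(len(lam))])
        lam = lam + delta
    return lam, history

# Hypothetical toy data.
feats = np.array([
    [[1, 0], [0, 1], [1, 1]],
    [[1, 1], [1, 0], [0, 1]],
], dtype=float)
p_xy = np.array([[0.3, 0.1, 0.1],
                 [0.1, 0.3, 0.1]])

lam_hat, history = iis(feats, p_xy)
print(lam_hat, history[-1])
```

At convergence the model expectation of each feature matches its empirical expectation. When $f^{\#}(x,y)$ equals a constant $M$ for all $(x,y)$, the equation has the closed-form solution $\delta_i = \frac{1}{M} \log \frac{E_{\tilde{P}}[f_i]}{E_{P_\lambda}[f_i]}$ and no root finder is needed.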
Reference: The Improved Iterative Scaling Algorithm: A Gentle Introduction