[Statistical Learning Methods] Chapter 6: Logistic Regression

Logistic regression is a classic classification method in statistical learning. The maximum entropy principle is a criterion for learning probabilistic models; extending it to classification yields the maximum entropy model. Both the logistic regression model and the maximum entropy model are log-linear models.

1. The Logistic Regression Model

The Logistic Distribution

Let $X$ be a continuous random variable. $X$ follows the logistic distribution if it has the following distribution function and density function:

$$F(x)=P(X \leqslant x)=\frac{1}{1+e^{-(x-\mu)/\gamma}}, \qquad f(x)=F'(x)=\frac{e^{-(x-\mu)/\gamma}}{\gamma\left(1+e^{-(x-\mu)/\gamma}\right)^{2}}$$

where $\mu$ is the location parameter and $\gamma>0$ the scale parameter.
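Both functions are straightforward to evaluate; a minimal sketch (the helper names `logistic_cdf` and `logistic_pdf` are my own, not from the text):

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma))."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density f(x) = F'(x)."""
    z = math.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

# F is symmetric about the location parameter mu: F(mu) = 1/2,
# and the density attains its maximum 1/(4*gamma) there.
print(logistic_cdf(2.0, mu=2.0, gamma=0.5))   # 0.5
print(logistic_pdf(2.0, mu=2.0, gamma=0.5))   # 0.5
```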

The Binomial Logistic Regression Model

The binomial logistic regression model is the following conditional probability distribution:

$$P(Y=1 \mid x)=\frac{1}{1+\exp\left(-\left(w \cdot x+b\right)\right)}=\frac{\exp\left(w \cdot x+b\right)}{1+\exp\left(w \cdot x+b\right)}$$

$$P(Y=0 \mid x)=1-P(Y=1 \mid x)=1-\frac{\exp\left(w \cdot x+b\right)}{1+\exp\left(w \cdot x+b\right)}=\frac{1}{1+\exp\left(w \cdot x+b\right)}$$

where $x \in \mathbf{R}^{n}$ is the input, $Y \in \{0,1\}$ is the output, and $w \in \mathbf{R}^{n}$ and $b \in \mathbf{R}$ are the parameters: $w$ is called the weight vector, $b$ the bias, and $w \cdot x$ is the inner product of $w$ and $x$.

The weight vector and the input vector can be augmented as $w=\left(w^{(1)},w^{(2)},\cdots,w^{(n)},b\right)^{T}$ and $x=\left(x^{(1)},x^{(2)},\cdots,x^{(n)},1\right)^{T}$, so that the logistic regression model becomes

$$P(Y=1 \mid x)=\frac{\exp(w \cdot x)}{1+\exp(w \cdot x)}, \qquad P(Y=0 \mid x)=\frac{1}{1+\exp(w \cdot x)}$$
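With augmented vectors, $P(Y=1 \mid x)$ reduces to a sigmoid of a dot product. A small sketch (the example weights and input are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """P(Y=1|x) for augmented vectors w = (w1..wn, b), x = (x1..xn, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

w = [2.0, -1.0, 0.5]      # last component is the bias b
x = [1.0, 1.0, 1.0]       # last component is the constant 1
p1 = predict_proba(w, x)  # here w.x = 1.5
print(p1, 1.0 - p1)       # P(Y=1|x) and P(Y=0|x)
```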

The odds of an event are the ratio of the probability $p$ that the event occurs to the probability $1-p$ that it does not:

$$\frac{p}{1-p}$$

The log odds of the event (the logit function) are

$$\operatorname{logit}(p)=\log\frac{p}{1-p}$$

For the logistic regression model,

$$\log\frac{P(Y=1 \mid x)}{1-P(Y=1 \mid x)}=w \cdot x$$

That is, the log odds of the output $Y=1$ are a linear function of the input $x$.
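This log-odds identity is easy to check numerically: applying the logit to the sigmoid recovers the linear score $w \cdot x$ exactly (the values of $w \cdot x$ below are arbitrary examples):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# For any linear score z = w.x, logit(sigmoid(z)) gives back z.
for z in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    p = sigmoid(z)         # P(Y=1|x) when w.x = z
    print(z, logit(p))     # the two columns agree
```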

Parameter Estimation

Given a training data set

$$T=\left\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\right\}$$

where $x_i \in \mathbf{R}^{n+1}$, $y_i \in \{0,1\}$, $i=1,2,\cdots,N$.

Let

$$P(Y=1 \mid x)=\pi(x), \qquad P(Y=0 \mid x)=1-\pi(x)$$
The likelihood function is

$$l(w)=\prod_{i=1}^{N}P\left(y_i \mid x_i\right)=\prod_{i=1}^{N}\left[\pi(x_i)\right]^{y_i}\left[1-\pi(x_i)\right]^{1-y_i}$$

The log-likelihood function is

$$\begin{aligned} L(w)&=\log l(w)=\sum_{i=1}^{N}\left[y_i\log\pi(x_i)+(1-y_i)\log\left(1-\pi(x_i)\right)\right]\\ &=\sum_{i=1}^{N}\left[y_i\log\frac{\pi(x_i)}{1-\pi(x_i)}+\log\left(1-\pi(x_i)\right)\right]\\ &=\sum_{i=1}^{N}\left[y_i\left(w \cdot x_i\right)-\log\left(1+\exp\left(w \cdot x_i\right)\right)\right]\end{aligned}$$

Let $\hat{w}$ denote the maximum likelihood estimate of $w$; the learned logistic regression model is then

$$P(Y=1 \mid x)=\frac{\exp(\hat{w} \cdot x)}{1+\exp(\hat{w} \cdot x)}, \qquad P(Y=0 \mid x)=\frac{1}{1+\exp(\hat{w} \cdot x)}$$
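Maximizing $L(w)$ has no closed form; batch gradient ascent on the log-likelihood, using the gradient $\nabla L(w)=\sum_i\left(y_i-\pi(x_i)\right)x_i$, is one simple option. A minimal sketch on a toy data set (the learning rate, iteration count, and data are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Maximize L(w) by batch gradient ascent.

    X holds augmented inputs (last component 1); the gradient of the
    log-likelihood is  dL/dw = sum_i (y_i - pi(x_i)) * x_i.
    """
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(n):
                grad[j] += err * xi[j]
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

# Tiny linearly separable toy data, augmented with a constant 1.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [0, 0, 1, 1]
w_hat = fit_logistic(X, y)
p_low = sigmoid(sum(a * b for a, b in zip(w_hat, X[0])))
p_high = sigmoid(sum(a * b for a, b in zip(w_hat, X[3])))
print(p_low, p_high)   # the fitted model separates the two groups
```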

Multinomial Logistic Regression

Suppose the discrete random variable $Y$ takes values in $\{1,2,\cdots,K\}$. The multinomial logistic regression model is

$$P(Y=k \mid x)=\frac{\exp\left(w_k \cdot x\right)}{1+\sum_{k=1}^{K-1}\exp\left(w_k \cdot x\right)}, \quad k=1,2,\cdots,K-1$$

$$P(Y=K \mid x)=1-\sum_{k=1}^{K-1}P(Y=k \mid x)=\frac{1}{1+\sum_{k=1}^{K-1}\exp\left(w_k \cdot x\right)}$$
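A sketch of these probabilities, treating class $K$ as the reference class with score $1$ (the weights and input are toy values of my own):

```python
import math

def multinomial_lr_proba(W, x):
    """P(Y=k|x) for k = 1..K, given K-1 weight vectors W.

    Uses the parameterization above: exp(w_k.x) / (1 + sum_j exp(w_j.x))
    for k < K, and P(Y=K|x) = 1 / (1 + sum_j exp(w_j.x)).
    """
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

W = [[1.0, 0.0], [0.0, 1.0]]   # K-1 = 2 weight vectors, so K = 3 classes
probs = multinomial_lr_proba(W, [0.2, -0.4])
print(probs, sum(probs))       # the probabilities sum to 1
```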

2. Maximum Entropy

The maximum entropy model is derived from the maximum entropy principle. The maximum entropy principle is a criterion for learning probabilistic models: among all possible probability models (distributions), the model with the largest entropy is the best one. The candidate set of models is usually determined by constraints, so the principle can also be stated as: select the model with maximum entropy from the set of models that satisfy the constraints.

Definition of the Maximum Entropy Model

Given a training data set

$$T=\left\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\right\}$$

assume the classification model is a conditional probability distribution $P(Y \mid X)$, where $X \in \mathcal{X} \subseteq \mathbf{R}^{n}$ denotes the input and $Y \in \mathcal{Y}$ the output. Given an input $X$, the model outputs $Y$ with conditional probability $P(Y \mid X)$.

A feature function $f(x,y)$ describes a fact between the input $x$ and the output $y$:

$$f(x,y)=\begin{cases}1, & x \text{ and } y \text{ satisfy the fact}\\ 0, & \text{otherwise}\end{cases}$$

The expectation of the feature function $f(x,y)$ with respect to the empirical joint distribution $\tilde{P}(X,Y)$ is

$$E_{\tilde{P}}(f)=\sum_{x,y}\tilde{P}(x,y)f(x,y)$$

The expectation of $f(x,y)$ with respect to the model $P(Y \mid X)$ and the empirical marginal distribution $\tilde{P}(X)$ is

$$E_{P}(f)=\sum_{x,y}\tilde{P}(x)P(y \mid x)f(x,y)$$
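Both expectations can be computed from a sample of $(x,y)$ pairs; a small sketch (the sample, the feature, and the uniform placeholder model are toy assumptions):

```python
from collections import Counter

def empirical_expectations(samples, f, p_model):
    """E_Ptilde(f) and E_P(f) from a list of (x, y) samples.

    `f(x, y)` is a 0/1 feature function; `p_model(y, x)` is the model
    P(y|x). The label set is taken to be the labels seen in the data.
    """
    n = len(samples)
    joint = Counter(samples)                  # counts of (x, y), i.e. ~P(x,y)*n
    marg = Counter(x for x, _ in samples)     # counts of x, i.e. ~P(x)*n
    labels = sorted({y for _, y in samples})
    e_tilde = sum(c / n * f(x, y) for (x, y), c in joint.items())
    e_model = sum(c / n * p_model(y, x) * f(x, y)
                  for x, c in marg.items() for y in labels)
    return e_tilde, e_model

samples = [("a", 1), ("a", 1), ("a", 0), ("b", 0)]
f = lambda x, y: 1.0 if (x == "a" and y == 1) else 0.0
uniform = lambda y, x: 0.5                    # placeholder model P(y|x)
e_tilde, e_model = empirical_expectations(samples, f, uniform)
print(e_tilde, e_model)
```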

The maximum entropy model: let the set of models satisfying all the constraints be

$$\mathcal{C}\equiv\left\{P \in \mathcal{P} \mid E_{P}(f_i)=E_{\tilde{P}}(f_i),\ i=1,2,\cdots,n\right\}$$

The conditional entropy defined on the conditional probability distribution $P(Y \mid X)$ is

$$H(P)=-\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)$$

The model in $\mathcal{C}$ with the largest conditional entropy $H(P)$ is called the maximum entropy model.

Learning the Maximum Entropy Model

Given the training data set $T=\left\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\right\}$ and feature functions $f_i(x,y)$, $i=1,2,\cdots,n$, learning the maximum entropy model is equivalent to the constrained optimization problem

$$\begin{aligned}\max_{P \in \mathcal{C}} \quad & H(P)=-\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)\\ \text{s.t.} \quad & E_{P}(f_i)=E_{\tilde{P}}(f_i), \quad i=1,2,\cdots,n\\ & \sum_{y}P(y \mid x)=1\end{aligned}$$

or, equivalently,

$$\begin{aligned}\min_{P \in \mathcal{C}} \quad & -H(P)=\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)\\ \text{s.t.} \quad & E_{P}(f_i)-E_{\tilde{P}}(f_i)=0, \quad i=1,2,\cdots,n\\ & \sum_{y}P(y \mid x)=1\end{aligned}$$

Solving the Optimization Problem

  1. Introduce Lagrange multipliers $w_i$, $i=0,1,\cdots,n$, and define the Lagrangian $L(P,w)$:

$$\begin{aligned}L(P,w)&=-H(P)+w_0\left(1-\sum_{y}P(y \mid x)\right)+\sum_{i=1}^{n}w_i\left(E_{P}(f_i)-E_{\tilde{P}}(f_i)\right)\\ &=\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)+w_0\left(1-\sum_{y}P(y \mid x)\right)\\ &\quad+\sum_{i=1}^{n}w_i\left(\sum_{x,y}\tilde{P}(x)P(y \mid x)f_i(x,y)-\sum_{x,y}\tilde{P}(x,y)f_i(x,y)\right)\end{aligned}$$

  2. Solve $\min_{P \in \mathcal{C}} L(P,w)$:

    Write the dual function as $\Psi(w)=\min_{P \in \mathcal{C}}L(P,w)=L(P_w,w)$ and its solution as $P_w=\arg\min_{P \in \mathcal{C}}L(P,w)=P_w(y \mid x)$. Differentiating with respect to $P(y \mid x)$:

$$\begin{aligned}\frac{\partial L(P,w)}{\partial P(y \mid x)}&=\sum_{x,y}\tilde{P}(x)\left(\log P(y \mid x)+1\right)-\sum_{y}w_0-\sum_{x,y}\left(\tilde{P}(x)\sum_{i=1}^{n}w_i f_i(x,y)\right)\\ &=\sum_{x,y}\tilde{P}(x)\left(\log P(y \mid x)+1\right)-\sum_{x,y}\tilde{P}(x)w_0-\sum_{x,y}\left(\tilde{P}(x)\sum_{i=1}^{n}w_i f_i(x,y)\right)\\ &=\sum_{x,y}\tilde{P}(x)\left(\log P(y \mid x)+1-w_0-\sum_{i=1}^{n}w_i f_i(x,y)\right)=0\end{aligned}$$

    Since $\tilde{P}(x)>0$, it follows that

$$\log P(y \mid x)+1-w_0-\sum_{i=1}^{n}w_i f_i(x,y)=0$$

$$P(y \mid x)=\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)+w_0-1\right)=\frac{\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)}{\exp\left(1-w_0\right)}$$

    Since $\sum_{y}P(y \mid x)=1$,

$$\sum_{y}P(y \mid x)=\sum_{y}\frac{\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)}{\exp\left(1-w_0\right)}=1 \;\Longrightarrow\; \sum_{y}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)=\exp\left(1-w_0\right)$$

    Substituting back gives

$$P(y \mid x)=\frac{1}{Z_w(x)}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)$$

    where

$$Z_w(x)=\sum_{y}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)$$

    $Z_w(x)$ is called the normalization factor, the $f_i(x,y)$ are the feature functions, and the $w_i$ are the feature weights.
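A minimal sketch of evaluating $P_w(y \mid x)$ with this normalization, using toy indicator features and weights of my own:

```python
import math

def maxent_conditional(w, features, x, labels):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x).

    `features` is a list of 0/1 feature functions f_i(x, y);
    `labels` enumerates the possible values of y.
    """
    score = lambda y: math.exp(sum(wi * f(x, y) for wi, f in zip(w, features)))
    z = sum(score(y) for y in labels)          # normalization factor Z_w(x)
    return {y: score(y) / z for y in labels}

features = [lambda x, y: 1.0 if (x == "sunny" and y == "out") else 0.0,
            lambda x, y: 1.0 if (x == "rainy" and y == "in") else 0.0]
p = maxent_conditional([1.5, 2.0], features, "sunny", ["out", "in"])
print(p)   # a distribution over {"out", "in"} that sums to 1
```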

  3. Solve $\max_{w}\Psi(w)$ and write its solution as $w^{*}$, i.e.

$$w^{*}=\arg\max_{w}\Psi(w)$$

In other words, any optimization algorithm can be applied to maximize the dual function $\Psi(w)$; the resulting $w^{*}$ determines $P^{*} \in \mathcal{C}$, where $P^{*}=P_{w^{*}}=P_{w^{*}}(y \mid x)$ is the learned optimal model (the maximum entropy model). Learning the maximum entropy model thus reduces to maximizing the dual function $\Psi(w)$.

Maximum Likelihood Estimation

Given the empirical joint distribution $\tilde{P}(X,Y)$ of the training data, the log-likelihood function of the conditional probability distribution $P(Y \mid X)$ is

$$\begin{aligned}L_{\tilde{P}}(P_w)&=\log\prod_{x,y}P(y \mid x)^{\tilde{P}(x,y)}=\sum_{x,y}\tilde{P}(x,y)\log P(y \mid x)\\ &=\sum_{x,y}\tilde{P}(x,y)\log\frac{\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)}{Z_w(x)}\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x,y}\tilde{P}(x,y)\log Z_w(x)\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x}\tilde{P}(x)\log Z_w(x)\end{aligned}$$

The dual function is

$$\begin{aligned}\Psi(w)&=\min_{P \in \mathcal{C}}L(P,w)=L(P_w,w)\\ &=-H(P_w)+w_0\left(1-\sum_{y}P_w(y \mid x)\right)+\sum_{i=1}^{n}w_i\left(E_{\tilde{P}}(f_i)-E_{P_w}(f_i)\right)\\ &=\sum_{x,y}\tilde{P}(x)P_w(y \mid x)\log P_w(y \mid x)+\sum_{i=1}^{n}w_i\left(\sum_{x,y}\tilde{P}(x,y)f_i(x,y)-\sum_{x,y}\tilde{P}(x)P_w(y \mid x)f_i(x,y)\right)\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)+\sum_{x,y}\tilde{P}(x)P_w(y \mid x)\left(\log P_w(y \mid x)-\sum_{i=1}^{n}w_i f_i(x,y)\right)\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x,y}\tilde{P}(x)P_w(y \mid x)\log Z_w(x)\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x}\tilde{P}(x)\log Z_w(x)\end{aligned}$$

Here the $w_0$ term drops out because $\sum_{y}P_w(y \mid x)=1$, and the last two steps use $\log P_w(y \mid x)-\sum_{i=1}^{n}w_i f_i(x,y)=-\log Z_w(x)$ together with $\sum_{y}P_w(y \mid x)=1$.

Therefore

$$L_{\tilde{P}}(P_w)=\Psi(w)$$

That is, for the maximum entropy model, maximum likelihood estimation is equivalent to maximizing the dual function.

The Improved Iterative Scaling Method

The maximum entropy model is

$$P_w(y \mid x)=\frac{1}{Z_w(x)}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)$$

where

$$Z_w(x)=\sum_{y}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)$$

$Z_w(x)$ is the normalization factor, the $f_i(x,y)$ are the feature functions, and the $w_i$ are the feature weights.

Its log-likelihood function is

$$L(w)=\sum_{x,y}\tilde{P}(x,y)\log P_w(y \mid x)=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x}\tilde{P}(x)\log Z_w(x)$$

The idea of IIS (improved iterative scaling) is as follows: suppose the current parameter vector of the maximum entropy model is $w=(w_1,w_2,\cdots,w_n)^{T}$. We seek a new parameter vector $w+\delta=(w_1+\delta_1,w_2+\delta_2,\cdots,w_n+\delta_n)^{T}$ that increases the log-likelihood of the model. If such an update rule $\tau: w \rightarrow w+\delta$ exists, it can be applied repeatedly until the maximum of the log-likelihood is reached.

For a given empirical distribution $\tilde{P}$, when the model parameters change from $w$ to $w+\delta$, the change in the log-likelihood is

$$\begin{aligned}L(w+\delta)-L(w)&=\sum_{x,y}\tilde{P}(x,y)\log P_{w+\delta}(y \mid x)-\sum_{x,y}\tilde{P}(x,y)\log P_w(y \mid x)\\ &=\left(\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}(w_i+\delta_i)f_i(x,y)-\sum_{x}\tilde{P}(x)\log Z_{w+\delta}(x)\right)\\ &\quad-\left(\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x}\tilde{P}(x)\log Z_w(x)\right)\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)-\sum_{x}\tilde{P}(x)\log\frac{Z_{w+\delta}(x)}{Z_w(x)}\end{aligned}$$

Using the inequality

$$-\log\alpha \geq 1-\alpha, \quad \alpha>0$$

$$\begin{aligned}L(w+\delta)-L(w)&\geq\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\frac{Z_{w+\delta}(x)}{Z_w(x)}\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\frac{\sum_{y}\exp\left(\sum_{i=1}^{n}(w_i+\delta_i)f_i(x,y)\right)}{\sum_{y}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)}\\ &=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\sum_{y}P_w(y \mid x)\exp\left(\sum_{i=1}^{n}\delta_i f_i(x,y)\right)\end{aligned}$$

Let

$$A(\delta \mid w)=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\sum_{y}P_w(y \mid x)\exp\left(\sum_{i=1}^{n}\delta_i f_i(x,y)\right)$$


Then

$$L(w+\delta)-L(w) \geq A(\delta \mid w)$$

$A(\delta \mid w)$ is thus a lower bound on the change in the log-likelihood.

If a suitable $\delta$ can be found that raises the lower bound $A(\delta \mid w)$, the log-likelihood also rises. However, $\delta$ in $A(\delta \mid w)$ is a vector with several variables, which are hard to optimize simultaneously. IIS therefore tries to optimize a single variable $\delta_i$ at a time, holding the other variables $\delta_j$, $j \neq i$, fixed.

Introduce

$$f^{\#}(x,y)=\sum_{i}f_i(x,y)$$

which is the number of features active at $(x,y)$. Then

$$A(\delta \mid w)=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\sum_{y}P_w(y \mid x)\exp\left(f^{\#}(x,y)\sum_{i=1}^{n}\frac{\delta_i f_i(x,y)}{f^{\#}(x,y)}\right)$$

For every $i$, $\dfrac{f_i(x,y)}{f^{\#}(x,y)} \geq 0$ and $\sum_{i=1}^{n}\dfrac{f_i(x,y)}{f^{\#}(x,y)}=1$, so applying Jensen's inequality to the convex exponential function gives

$$\exp\left(\sum_{i=1}^{n}\frac{f_i(x,y)}{f^{\#}(x,y)}\,\delta_i f^{\#}(x,y)\right) \leq \sum_{i=1}^{n}\frac{f_i(x,y)}{f^{\#}(x,y)}\exp\left(\delta_i f^{\#}(x,y)\right)$$
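A quick numerical check of this step: for nonnegative weights summing to one, the exponential of a weighted average is at most the weighted average of the exponentials (the weights and points below are arbitrary examples):

```python
import math

# Jensen's inequality for the convex function exp:
# exp(sum_i p_i * t_i) <= sum_i p_i * exp(t_i), with p_i >= 0, sum p_i = 1.
p = [0.2, 0.5, 0.3]      # plays the role of f_i(x,y) / f#(x,y)
t = [1.0, -2.0, 0.5]     # plays the role of delta_i * f#(x,y)
lhs = math.exp(sum(pi * ti for pi, ti in zip(p, t)))
rhs = sum(pi * math.exp(ti) for pi, ti in zip(p, t))
print(lhs, rhs, lhs <= rhs)
```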

Hence

$$A(\delta \mid w) \geq \sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\sum_{y}P_w(y \mid x)\sum_{i=1}^{n}\frac{f_i(x,y)}{f^{\#}(x,y)}\exp\left(\delta_i f^{\#}(x,y)\right)$$

Let

$$B(\delta \mid w)=\sum_{x,y}\tilde{P}(x,y)\sum_{i=1}^{n}\delta_i f_i(x,y)+1-\sum_{x}\tilde{P}(x)\sum_{y}P_w(y \mid x)\sum_{i=1}^{n}\frac{f_i(x,y)}{f^{\#}(x,y)}\exp\left(\delta_i f^{\#}(x,y)\right)$$


Then

$$L(w+\delta)-L(w) \geq A(\delta \mid w) \geq B(\delta \mid w)$$

$B(\delta \mid w)$ is a new (relatively looser) lower bound on the change in the log-likelihood.

Taking the partial derivative of $B(\delta \mid w)$ with respect to $\delta_i$:

$$\frac{\partial B(\delta \mid w)}{\partial \delta_i}=\sum_{x,y}\tilde{P}(x,y)f_i(x,y)-\sum_{x}\tilde{P}(x)\sum_{y}P_w(y \mid x)f_i(x,y)\exp\left(\delta_i f^{\#}(x,y)\right)$$
∂ B ( δ ∣ w ) ∂ δ i = 0 \dfrac {\partial B \left( \delta | w \right) }{\partial \delta_{i}} = 0 δiB(δw)=0,得
∑ x , y P ~ ( x , y ) f i ( x , y ) = ∑ x , y P ~ ( x ) P w ( y ∣ x ) f i ( x , y ) exp ⁡ ( δ i f # ( x , y ) ) \begin{aligned} & \sum_{x,y} \tilde{P} \left( x, y \right) f_{i} \left( x, y \right) = \sum_{x, y} \tilde{P} \left( x \right) P_{w} \left( y | x \right) f_{i} \left( x, y \right) \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right)\end{aligned} x,yP~(x,y)fi(x,y)=x,yP~(x)Pw(yx)fi(x,y)exp(δif#(x,y))

δ i \delta_{i} δi求解可解得 δ \delta δ

The Improved Iterative Scaling Algorithm (IIS)

  • Input: feature functions $f_i$, $i=1,2,\cdots,n$; empirical distribution $\tilde{P}(x,y)$; model $P_w(y \mid x)$
  • Output: optimal parameter values $w_i^{*}$; optimal model $P_{w^{*}}$
  1. For all $i \in \{1,2,\cdots,n\}$, set $w_i=0$.

  2. For each $i \in \{1,2,\cdots,n\}$:

    2.1. Let $\delta_i$ be the solution of the equation

$$\sum_{x,y}\tilde{P}(x,y)f_i(x,y)=\sum_{x,y}\tilde{P}(x)P_w(y \mid x)f_i(x,y)\exp\left(\delta_i f^{\#}(x,y)\right)$$

    2.2. Update $w_i \leftarrow w_i+\delta_i$.

  3. If not all of the $w_i$ have converged, repeat step 2.
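In the special case where every pair $(x,y)$ activates the same number $M$ of features, the equation in step 2.1 has the closed-form solution $\delta_i=\frac{1}{M}\log\frac{E_{\tilde{P}}(f_i)}{E_{P_w}(f_i)}$. A minimal sketch of the algorithm under that assumption, with a toy data set and indicator features of my own:

```python
import math
from collections import Counter

def iis_fit(samples, features, labels, n_iter=200):
    """IIS for the special case f#(x, y) = M for every (x, y); step 2.1
    then reduces to  delta_i = (1/M) * log(E_ptilde(f_i) / E_pw(f_i))."""
    n = len(samples)
    joint = Counter(samples)                   # empirical ~P(x, y) * n
    marg = Counter(x for x, _ in samples)      # empirical ~P(x) * n
    M = sum(f(*samples[0]) for f in features)  # assumed constant f#
    w = [0.0] * len(features)                  # step 1: all w_i = 0

    def p_w(y, x):
        score = lambda yy: math.exp(sum(wi * f(x, yy)
                                        for wi, f in zip(w, features)))
        return score(y) / sum(score(yy) for yy in labels)

    for _ in range(n_iter):                    # step 3: repeat step 2
        for i, f in enumerate(features):       # step 2: for each i
            e_tilde = sum(c / n * f(x, y) for (x, y), c in joint.items())
            e_model = sum(c / n * p_w(y, x) * f(x, y)
                          for x, c in marg.items() for y in labels)
            w[i] += math.log(e_tilde / e_model) / M   # steps 2.1 + 2.2
    return w, p_w

# Toy data: each (x, y) fires exactly one of the two indicator features.
samples = [("s", 1), ("s", 1), ("s", 1), ("r", 0)]
features = [lambda x, y: 1.0 if y == 1 else 0.0,
            lambda x, y: 1.0 if y == 0 else 0.0]
w, p_w = iis_fit(samples, features, labels=[0, 1])
print(p_w(1, "s"))   # converges to the empirical P(y=1) = 0.75
```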

3. Summary

Logistic regression (LR) is a classic classification method.

1. The logistic regression model is the classification model given by the following conditional probability distribution. It can be used for binary or multi-class classification.

$$P(Y=k \mid x)=\frac{\exp\left(w_k \cdot x\right)}{1+\sum_{k=1}^{K-1}\exp\left(w_k \cdot x\right)}, \quad k=1,2,\cdots,K-1$$

$$P(Y=K \mid x)=\frac{1}{1+\sum_{k=1}^{K-1}\exp\left(w_k \cdot x\right)}$$

Here $x$ is the input feature and the $w_k$ are the feature weights.

The logistic regression model derives from the logistic distribution, whose distribution function $F(x)$ is an S-shaped curve. The model expresses the log odds of the output as a linear function of the input.

2. The maximum entropy model is the classification model given by the following conditional probability distribution. It can also be used for binary or multi-class classification.

$$P_w(y \mid x)=\frac{1}{Z_w(x)}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)$$

$$Z_w(x)=\sum_{y}\exp\left(\sum_{i=1}^{n}w_i f_i(x,y)\right)$$

where $Z_w(x)$ is the normalization factor, the $f_i$ are feature functions, and the $w_i$ are feature weights.

3. The maximum entropy model can be derived from the maximum entropy principle, a criterion for learning or estimating probabilistic models. The principle holds that among all possible probability models (distributions), the model with the largest entropy is the best one.

Applying the maximum entropy principle to learning classification models gives the following constrained optimization problem:

$$\begin{aligned}\min_{P \in \mathcal{C}} \quad & -H(P)=\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)\\ \text{s.t.} \quad & E_{P}(f_i)-E_{\tilde{P}}(f_i)=0, \quad i=1,2,\cdots,n\\ & \sum_{y}P(y \mid x)=1\end{aligned}$$

The maximum entropy model is obtained by solving the dual of this optimization problem.

4. Both the logistic regression model and the maximum entropy model are log-linear models.

5. Both models are usually learned by maximum likelihood estimation or regularized maximum likelihood estimation. Learning can be formalized as an unconstrained optimization problem, which can be solved by improved iterative scaling, gradient descent, or quasi-Newton methods.

Regression model:

$$f(x)=\frac{1}{1+e^{-w \cdot x}}$$

where $w \cdot x$ is the linear function $w \cdot x=w_0 x_0+w_1 x_1+w_2 x_2+\cdots+w_n x_n$, with $x_0=1$.
