Logistic regression is a classic classification method in statistical learning. Maximum entropy is a criterion for learning probabilistic models; generalizing it to classification problems yields the maximum entropy model. Both the logistic regression model and the maximum entropy model are log-linear models.
1. Logistic Regression Model
Logistic Distribution
Let $X$ be a continuous random variable. $X$ follows the logistic distribution if $X$ has the following distribution function and density function:
$$F(x)=P(X \leqslant x)=\frac{1}{1+\mathrm{e}^{-(x-\mu)/\gamma}}$$
$$f(x)=F'(x)=\frac{\mathrm{e}^{-(x-\mu)/\gamma}}{\gamma\left(1+\mathrm{e}^{-(x-\mu)/\gamma}\right)^{2}}$$
where $\mu$ is the location parameter and $\gamma>0$ is the shape parameter.
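Both functions are direct to evaluate in code; a minimal NumPy sketch (function and parameter names are illustrative):

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma))."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density f(x) = F'(x) = e^{-(x-mu)/gamma} / (gamma * (1 + e^{-(x-mu)/gamma})^2)."""
    z = np.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

# F is S-shaped and symmetric about mu, so F(mu) = 0.5
print(logistic_cdf(0.0))   # 0.5
print(logistic_pdf(0.0))   # 0.25, the peak of the density when gamma = 1
```

The density peaks at $x=\mu$ with value $1/(4\gamma)$, which is why a smaller $\gamma$ gives a sharper curve.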
Binomial Logistic Regression Model
The binomial logistic regression model is the following conditional probability distribution:
$$\begin{aligned} P(Y=1|x) &= \dfrac{1}{1+\exp\left(-\left(w \cdot x + b\right)\right)} \\ &= \dfrac{\exp\left(w \cdot x + b\right)}{\left(1+\exp\left(-\left(w \cdot x + b\right)\right)\right) \cdot \exp\left(w \cdot x + b\right)} \\ &= \dfrac{\exp\left(w \cdot x + b\right)}{1+\exp\left(w \cdot x + b\right)} \\ P(Y=0|x) &= 1 - P(Y=1|x) \\ &= 1 - \dfrac{\exp\left(w \cdot x + b\right)}{1+\exp\left(w \cdot x + b\right)} \\ &= \dfrac{1}{1+\exp\left(w \cdot x + b\right)} \end{aligned}$$
Here $x \in R^{n}$ is the input, $Y \in \{0, 1\}$ is the output, and $w \in R^{n}$ and $b \in R$ are parameters: $w$ is called the weight vector, $b$ the bias, and $w \cdot x$ is the inner product of $w$ and $x$.
The weight vector and the input vector can be augmented as $w = \left( w^{(1)}, w^{(2)}, \cdots, w^{(n)}, b \right)^{T}$ and $x = \left( x^{(1)}, x^{(2)}, \cdots, x^{(n)}, 1 \right)^{T}$, giving the logistic regression model
$$\begin{aligned} P(Y=1|x) &= \dfrac{\exp\left(w \cdot x\right)}{1+\exp\left(w \cdot x\right)} \\ P(Y=0|x) &= \dfrac{1}{1+\exp\left(w \cdot x\right)} \end{aligned}$$
The odds of an event are the ratio of the probability $p$ that the event occurs to the probability $1-p$ that it does not:
$$\dfrac{p}{1-p}$$
The log odds, or logit, of the event is
$$\mathrm{logit}(p) = \log \dfrac{p}{1-p}$$
For the logistic regression model,
$$\log \dfrac{P(Y=1|x)}{1-P(Y=1|x)} = w \cdot x$$
that is, the log odds of the output $Y=1$ is a linear function of the input $x$.
Model Parameter Estimation
Given a training data set
$$T = \left\{ (x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{N}, y_{N}) \right\}$$
where $x_{i} \in R^{n+1}$, $y_{i} \in \{0, 1\}$, $i = 1, 2, \cdots, N$.
Let
$$P(Y=1|x) = \pi(x), \quad P(Y=0|x) = 1 - \pi(x)$$
The likelihood function is
$$l(w) = \prod_{i=1}^{N} P(y_{i} | x_{i}) = \prod_{i=1}^{N} \left[ \pi(x_{i}) \right]^{y_{i}} \left[ 1 - \pi(x_{i}) \right]^{1 - y_{i}}$$
since each factor equals $\pi(x_{i})$ when $y_{i}=1$ and $1-\pi(x_{i})$ when $y_{i}=0$.
The log-likelihood function is
$$\begin{aligned} L(w) &= \log l(w) \\ &= \sum_{i=1}^{N} \left[ y_{i} \log \pi(x_{i}) + (1 - y_{i}) \log \left( 1 - \pi(x_{i}) \right) \right] \\ &= \sum_{i=1}^{N} \left[ y_{i} \log \dfrac{\pi(x_{i})}{1 - \pi(x_{i})} + \log \left( 1 - \pi(x_{i}) \right) \right] \\ &= \sum_{i=1}^{N} \left[ y_{i} \left( w \cdot x_{i} \right) - \log \left( 1 + \exp \left( w \cdot x_{i} \right) \right) \right] \end{aligned}$$
Let $\hat{w}$ be the maximum likelihood estimate of $w$; the learned logistic regression model is then
$$\begin{aligned} P(Y=1|x) &= \dfrac{\exp\left(\hat{w} \cdot x\right)}{1+\exp\left(\hat{w} \cdot x\right)} \\ P(Y=0|x) &= \dfrac{1}{1+\exp\left(\hat{w} \cdot x\right)} \end{aligned}$$
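The log-likelihood above is concave in $w$ with gradient $\sum_{i} \left( y_{i} - \pi(x_{i}) \right) x_{i}$, so $\hat{w}$ can be approximated by plain gradient ascent. A small sketch on toy data (the learning rate and iteration count are illustrative choices, not prescribed by the text):

```python
import numpy as np

def train_logreg(X, y, lr=0.1, n_iter=1000):
    """Maximize the log-likelihood L(w) by gradient ascent.

    A constant-1 column is appended to X so the bias b is folded into the
    augmented weight vector w, as in the text.
    """
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-Xa @ w))   # pi(x_i) = P(Y=1|x_i)
        w += lr * Xa.T @ (y - pi)            # dL/dw = sum_i (y_i - pi_i) x_i
    return w

# toy separable data: the class is determined by the sign of the feature
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w_hat = train_logreg(X, y)
p = 1.0 / (1.0 + np.exp(-(np.hstack([X, np.ones((4, 1))]) @ w_hat)))
print(np.round(p, 2))   # probabilities close to [0, 0, 1, 1]
```

On separable data like this the unregularized MLE pushes the weights toward infinity, which is one practical motivation for the regularized estimation mentioned later.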
Multinomial Logistic Regression
Suppose the discrete random variable $Y$ takes values in $\{1, 2, \cdots, K\}$. The multinomial logistic regression model is
$$\begin{aligned} P(Y=k|x) &= \dfrac{\exp\left(w_{k} \cdot x\right)}{1+ \sum_{k=1}^{K-1}\exp\left(w_{k} \cdot x\right)}, \quad k=1,2,\cdots,K-1 \\ P(Y=K|x) &= 1 - \sum_{k=1}^{K-1} P(Y=k|x) = \dfrac{1}{1+ \sum_{k=1}^{K-1}\exp\left(w_{k} \cdot x\right)} \end{aligned}$$
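Numerically this is a softmax over $K$ scores in which the $K$-th (reference) class is pinned at score $0$, which reproduces the $1 + \sum_{k=1}^{K-1} \exp(w_k \cdot x)$ normalizer. A sketch (the weight values are made up):

```python
import numpy as np

def multinomial_logreg_probs(W, x):
    """P(Y=k|x) for k = 1..K given weight vectors w_1..w_{K-1}.

    W: (K-1, n) weight matrix; the K-th class has an implicit score of 0.
    """
    scores = np.concatenate([W @ x, [0.0]])   # append the reference class
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, -1.0], [0.5, 0.5]])       # K = 3 classes, n = 2 features
p = multinomial_logreg_probs(W, np.array([1.0, 1.0]))
print(p, p.sum())                              # three probabilities summing to 1
```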
2. Maximum Entropy
The maximum entropy model is derived from the maximum entropy principle. The maximum entropy principle is a criterion for learning probabilistic models: among all possible probability models (distributions), the model with the largest entropy is the best model. Constraints are usually used to determine the set of candidate models, so the principle can also be stated as: from the set of models satisfying the constraints, choose the one with the largest entropy.
Definition of the Maximum Entropy Model
Given a training data set
$$T = \left\{ (x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{N}, y_{N}) \right\}$$
assume the classification model is a conditional probability distribution $P(Y|X)$, where $X \in \mathcal{X} \subseteq R^{n}$ is the input and $Y \in \mathcal{Y}$ is the output. Given an input $X$, the model outputs $Y$ with conditional probability $P(Y|X)$.
A feature function $f(x, y)$ describes some fact between the input $x$ and the output $y$:
$$f(x, y) = \begin{cases} 1, & x \text{ and } y \text{ satisfy the fact} \\ 0, & \text{otherwise} \end{cases}$$
The expectation of $f(x, y)$ with respect to the empirical distribution $\tilde{P}(X, Y)$ is
$$E_{\tilde{P}}(f) = \sum_{x, y} \tilde{P}(x, y) f(x, y)$$
The expectation of $f(x, y)$ with respect to the model $P(Y|X)$ and the empirical distribution $\tilde{P}(X)$ is
$$E_{P}(f) = \sum_{x, y} \tilde{P}(x) P(y|x) f(x, y)$$
Maximum entropy model: let the set of models satisfying all constraints be
$$\mathcal{C} \equiv \left\{ P \in \mathcal{P} \mid E_{P}(f_{i}) = E_{\tilde{P}}(f_{i}), \ i = 1, 2, \cdots, n \right\}$$
The conditional entropy of the conditional probability distribution $P(Y|X)$ is defined as
$$H(P) = - \sum_{x,y} \tilde{P}(x) P(y|x) \log P(y|x)$$
The model in $\mathcal{C}$ with the largest conditional entropy $H(P)$ is called the maximum entropy model.
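The quantity $H(P)$ is easy to check numerically; as the principle suggests, with no constraints beyond $\sum_{y} P(y|x) = 1$ the uniform conditional distribution attains the largest value. A small sketch (array layout chosen for illustration):

```python
import numpy as np

def conditional_entropy(p_tilde_x, p_y_given_x):
    """H(P) = -sum_{x,y} P~(x) P(y|x) log P(y|x).

    p_tilde_x: (|X|,) empirical marginal; p_y_given_x: (|X|, |Y|) model table.
    """
    P = p_y_given_x
    terms = np.where(P > 0, P * np.log(P), 0.0)   # convention: 0 log 0 = 0
    return -np.sum(p_tilde_x[:, None] * terms)

p_x = np.array([0.5, 0.5])
uniform = np.full((2, 2), 0.5)
skewed = np.array([[0.9, 0.1], [0.9, 0.1]])
print(conditional_entropy(p_x, uniform))   # log 2, about 0.693
print(conditional_entropy(p_x, skewed))    # strictly smaller
```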
Learning the Maximum Entropy Model
Given a training data set
$$T = \left\{ (x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{N}, y_{N}) \right\}$$
and feature functions $f_{i}(x, y)$, $i = 1, 2, \cdots, n$, learning the maximum entropy model is equivalent to the constrained optimization problem:
$$\begin{aligned} \max_{P \in \mathcal{C}} \quad & H(P) = - \sum_{x,y} \tilde{P}(x) P(y|x) \log P(y|x) \\ \text{s.t.} \quad & E_{P}(f_{i}) = E_{\tilde{P}}(f_{i}), \quad i = 1, 2, \cdots, n \\ & \sum_{y} P(y|x) = 1 \end{aligned}$$
or, equivalently,
$$\begin{aligned} \min_{P \in \mathcal{C}} \quad & -H(P) = \sum_{x,y} \tilde{P}(x) P(y|x) \log P(y|x) \\ \text{s.t.} \quad & E_{P}(f_{i}) - E_{\tilde{P}}(f_{i}) = 0, \quad i = 1, 2, \cdots, n \\ & \sum_{y} P(y|x) = 1 \end{aligned}$$
Solving the optimization problem:
- Introduce Lagrange multipliers $w_{i}$, $i = 0, 1, \cdots, n$, and define the Lagrangian $L(P, w)$:
$$\begin{aligned} L(P, w) &= -H(P) + w_{0} \left( 1 - \sum_{y} P(y|x) \right) + \sum_{i=1}^{n} w_{i} \left( E_{\tilde{P}}(f_{i}) - E_{P}(f_{i}) \right) \\ &= \sum_{x,y} \tilde{P}(x) P(y|x) \log P(y|x) + w_{0} \left( 1 - \sum_{y} P(y|x) \right) \\ & \quad + \sum_{i=1}^{n} w_{i} \left( \sum_{x,y} \tilde{P}(x,y) f_{i}(x,y) - \sum_{x,y} \tilde{P}(x) P(y|x) f_{i}(x,y) \right) \end{aligned}$$
- Solve $\min_{P \in \mathcal{C}} L(P, w)$:
Write the dual function as $\Psi(w) = \min_{P \in \mathcal{C}} L(P, w) = L(P_{w}, w)$ and its solution as $P_{w} = \arg \min_{P \in \mathcal{C}} L(P, w) = P_{w}(y|x)$. Taking the derivative (and using $\sum_{x} \tilde{P}(x) = 1$ to rewrite $\sum_{y} w_{0}$),
$$\begin{aligned} \dfrac{\partial L(P, w)}{\partial P(y|x)} &= \sum_{x,y} \tilde{P}(x) \left( \log P(y|x) + 1 \right) - \sum_{y} w_{0} - \sum_{x,y} \left( \tilde{P}(x) \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right) \\ &= \sum_{x,y} \tilde{P}(x) \left( \log P(y|x) + 1 \right) - \sum_{x,y} \tilde{P}(x) w_{0} - \sum_{x,y} \left( \tilde{P}(x) \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right) \\ &= \sum_{x,y} \tilde{P}(x) \left( \log P(y|x) + 1 - w_{0} - \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right) = 0 \end{aligned}$$
Since $\tilde{P}(x) > 0$, it follows that
$$\begin{aligned} & \log P(y|x) + 1 - w_{0} - \sum_{i=1}^{n} w_{i} f_{i}(x,y) = 0 \\ & P(y|x) = \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) + w_{0} - 1 \right) = \dfrac{\exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)}{\exp \left( 1 - w_{0} \right)} \end{aligned}$$
Since $\sum_{y} P(y|x) = 1$,
$$\begin{aligned} & \sum_{y} P(y|x) = \sum_{y} \dfrac{\exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)}{\exp \left( 1 - w_{0} \right)} = 1 \\ & \sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right) = \exp \left( 1 - w_{0} \right) \end{aligned}$$
Substituting back gives
$$P(y|x) = \dfrac{1}{Z_{w}(x)} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)$$
where
$$Z_{w}(x) = \sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)$$
$Z_{w}(x)$ is called the normalization factor, $f_{i}(x,y)$ are the feature functions, and $w_{i}$ are the feature weights.
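The closed form $P_{w}(y|x) = \exp\left(\sum_{i} w_{i} f_{i}(x,y)\right) / Z_{w}(x)$ maps directly to code; a sketch with two made-up binary feature functions:

```python
import numpy as np

def maxent_prob(w, features, x, ys):
    """P_w(y|x) = exp(sum_i w_i f_i(x, y)) / Z_w(x).

    features: list of functions f_i(x, y) returning 0/1; ys: label set.
    """
    scores = np.array([sum(wi * f(x, y) for wi, f in zip(w, features))
                       for y in ys])
    e = np.exp(scores)
    return e / e.sum()   # dividing by Z_w(x) normalizes over y

# hypothetical features for a two-label toy problem
features = [lambda x, y: 1.0 if (x > 0 and y == 1) else 0.0,
            lambda x, y: 1.0 if (x <= 0 and y == 0) else 0.0]
w = np.array([1.0, 1.0])
print(maxent_prob(w, features, x=1.0, ys=[0, 1]))   # favors y = 1 when x > 0
```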
- Solve $\max_{w} \Psi(w)$:
Write the solution as $w^{*}$, i.e.
$$w^{*} = \arg \max_{w} \Psi(w)$$
That is, an optimization algorithm can be applied to maximize the dual function $\Psi(w)$, yielding $w^{*}$, which determines $P^{*} \in \mathcal{C}$. Here $P^{*} = P_{w^{*}} = P_{w^{*}}(y|x)$ is the learned optimal model (the maximum entropy model). In other words, learning the maximum entropy model reduces to maximizing the dual function $\Psi(w)$.
Maximum Likelihood Estimation
Given the empirical probability distribution $\tilde{P}(X, Y)$ of the training data, the log-likelihood of the conditional probability distribution $P(Y|X)$ is
$$\begin{aligned} L_{\tilde{P}}(P_{w}) &= \log \prod_{x,y} P(y|x)^{\tilde{P}(x,y)} = \sum_{x,y} \tilde{P}(x,y) \log P(y|x) \\ &= \sum_{x,y} \tilde{P}(x,y) \log \dfrac{\exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)}{Z_{w}(x)} \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) - \sum_{x,y} \tilde{P}(x,y) \log Z_{w}(x) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) - \sum_{x} \tilde{P}(x) \log Z_{w}(x) \end{aligned}$$
The dual function is
$$\begin{aligned} \Psi(w) &= \min_{P \in \mathcal{C}} L(P, w) = L(P_{w}, w) \\ &= -H(P_{w}) + w_{0} \left( 1 - \sum_{y} P_{w}(y|x) \right) + \sum_{i=1}^{n} w_{i} \left( E_{\tilde{P}}(f_{i}) - E_{P_{w}}(f_{i}) \right) \\ &= \sum_{x,y} \tilde{P}(x) P_{w}(y|x) \log P_{w}(y|x) + w_{0} \left( 1 - \sum_{y} \dfrac{1}{Z_{w}(x)} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right) \right) \\ & \quad + \sum_{i=1}^{n} w_{i} \left( \sum_{x,y} \tilde{P}(x,y) f_{i}(x,y) - \sum_{x,y} \tilde{P}(x) P_{w}(y|x) f_{i}(x,y) \right) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) + \sum_{x,y} \tilde{P}(x) P_{w}(y|x) \left( \log P_{w}(y|x) - \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) - \sum_{x,y} \tilde{P}(x) P_{w}(y|x) \log Z_{w}(x) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) - \sum_{x} \tilde{P}(x) \log Z_{w}(x) \end{aligned}$$
Hence
$$L_{\tilde{P}}(P_{w}) = \Psi(w)$$
that is, for the maximum entropy model, maximum likelihood estimation is equivalent to maximizing the dual function.
Improved Iterative Scaling
The maximum entropy model is
$$P_{w}(y|x) = \dfrac{1}{Z_{w}(x)} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)$$
where the normalization factor is
$$Z_{w}(x) = \sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)$$
$f_{i}(x,y)$ are the feature functions and $w_{i}$ are the feature weights.
Its log-likelihood function is
$$\begin{aligned} L(w) &= \sum_{x,y} \tilde{P}(x,y) \log P_{w}(y|x) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) - \sum_{x} \tilde{P}(x) \log Z_{w}(x) \end{aligned}$$
The idea of IIS is this: suppose the current parameter vector of the maximum entropy model is $w = (w_{1}, w_{2}, \cdots, w_{n})^{T}$; we want a new parameter vector $w + \delta = (w_{1}+\delta_{1}, w_{2}+\delta_{2}, \cdots, w_{n}+\delta_{n})^{T}$ that increases the log-likelihood. If such an update $\tau: w \rightarrow w + \delta$ exists, it can be applied repeatedly until the maximum of the log-likelihood is reached.
For the given empirical distribution $\tilde{P}$, as the parameters move from $w$ to $w + \delta$, the change in the log-likelihood is
$$\begin{aligned} L(w+\delta) - L(w) &= \sum_{x,y} \tilde{P}(x,y) \log P_{w+\delta}(y|x) - \sum_{x,y} \tilde{P}(x,y) \log P_{w}(y|x) \\ &= \left( \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \left( w_{i}+\delta_{i} \right) f_{i}(x,y) - \sum_{x} \tilde{P}(x) \log Z_{w+\delta}(x) \right) \\ & \quad - \left( \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} w_{i} f_{i}(x,y) - \sum_{x} \tilde{P}(x) \log Z_{w}(x) \right) \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) - \sum_{x} \tilde{P}(x) \log \dfrac{Z_{w+\delta}(x)}{Z_{w}(x)} \end{aligned}$$
Using the inequality
$$-\log \alpha \geq 1 - \alpha, \quad \alpha > 0$$
we get
$$\begin{aligned} L(w+\delta) - L(w) &\geq \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \dfrac{Z_{w+\delta}(x)}{Z_{w}(x)} \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} \dfrac{\exp \left( \sum_{i=1}^{n} \left( w_{i} + \delta_{i} \right) f_{i}(x,y) \right)}{\sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i}(x,y) \right)} \\ &= \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y|x) \exp \left( \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) \right) \end{aligned}$$
Write
$$A(\delta|w) = \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y|x) \exp \left( \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) \right)$$
Then
$$L(w+\delta) - L(w) \geq A(\delta|w)$$
i.e. $A(\delta|w)$ is a lower bound on the change in the log-likelihood.
If a suitable $\delta$ can be found that raises the lower bound $A(\delta|w)$, the log-likelihood rises as well. However, $\delta$ in $A(\delta|w)$ is a vector with several variables that are hard to optimize simultaneously. IIS therefore tries to optimize one variable $\delta_{i}$ at a time while holding the other variables $\delta_{j}$, $j \neq i$, fixed.
Introduce
$$f^{\#}(x, y) = \sum_{i} f_{i}(x, y)$$
the total number of times all features fire at $(x, y)$. Then
$$A(\delta|w) = \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y|x) \exp \left( f^{\#}(x,y) \sum_{i=1}^{n} \dfrac{\delta_{i} f_{i}(x,y)}{f^{\#}(x,y)} \right)$$
For any $i$, $\dfrac{f_{i}(x,y)}{f^{\#}(x,y)} \geq 0$ and $\sum_{i=1}^{n} \dfrac{f_{i}(x,y)}{f^{\#}(x,y)} = 1$, so these ratios form a probability distribution over $i$, and by Jensen's inequality applied to the convex exponential function,
$$\exp \left( \sum_{i=1}^{n} \dfrac{f_{i}(x,y)}{f^{\#}(x,y)} \, \delta_{i} f^{\#}(x,y) \right) \leq \sum_{i=1}^{n} \dfrac{f_{i}(x,y)}{f^{\#}(x,y)} \exp \left( \delta_{i} f^{\#}(x,y) \right)$$
Therefore
$$A(\delta|w) \geq \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y|x) \sum_{i=1}^{n} \dfrac{f_{i}(x,y)}{f^{\#}(x,y)} \exp \left( \delta_{i} f^{\#}(x,y) \right)$$
Write the right-hand side as
$$B(\delta|w) = \sum_{x,y} \tilde{P}(x,y) \sum_{i=1}^{n} \delta_{i} f_{i}(x,y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y|x) \sum_{i=1}^{n} \dfrac{f_{i}(x,y)}{f^{\#}(x,y)} \exp \left( \delta_{i} f^{\#}(x,y) \right)$$
Then
$$L(w+\delta) - L(w) \geq A(\delta|w) \geq B(\delta|w)$$
i.e. $B(\delta|w)$ is a new (looser) lower bound on the change in the log-likelihood.
Taking the partial derivative,
$$\dfrac{\partial B(\delta|w)}{\partial \delta_{i}} = \sum_{x,y} \tilde{P}(x,y) f_{i}(x,y) - \sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y|x) f_{i}(x,y) \exp \left( \delta_{i} f^{\#}(x,y) \right)$$
Setting $\dfrac{\partial B(\delta|w)}{\partial \delta_{i}} = 0$ gives
$$\sum_{x,y} \tilde{P}(x,y) f_{i}(x,y) = \sum_{x,y} \tilde{P}(x) P_{w}(y|x) f_{i}(x,y) \exp \left( \delta_{i} f^{\#}(x,y) \right)$$
Solving this equation for each $\delta_{i}$ yields $\delta$.
The Improved Iterative Scaling (IIS) algorithm:
- Input: feature functions $f_{i}$, $i = 1, 2, \cdots, n$; empirical distribution $\tilde{P}(x, y)$; model $P_{w}(y|x)$
- Output: optimal parameter values $w_{i}^{*}$; optimal model $P_{w^{*}}$
1. For all $i \in \{1, 2, \cdots, n\}$, set $w_{i} = 0$.
2. For each $i \in \{1, 2, \cdots, n\}$:
2.1. Let $\delta_{i}$ be the solution of
$$\sum_{x,y} \tilde{P}(x,y) f_{i}(x,y) = \sum_{x,y} \tilde{P}(x) P_{w}(y|x) f_{i}(x,y) \exp \left( \delta_{i} f^{\#}(x,y) \right)$$
2.2. Update $w_{i} \leftarrow w_{i} + \delta_{i}$.
3. If not all $w_{i}$ have converged, repeat step 2.
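Step 2.1 generally requires a numerical root-finder for $\delta_{i}$, but when $f^{\#}(x,y)$ equals a constant $M$ for every $(x,y)$ the equation has the closed-form solution $\delta_{i} = \frac{1}{M} \log \frac{E_{\tilde{P}}(f_{i})}{E_{P}(f_{i})}$ (the same update as the related GIS algorithm). A sketch of the loop under that assumption, on a toy problem where the two features partition the outcomes so $f^{\#} \equiv 1$:

```python
import numpy as np
from collections import Counter

def iis_train(samples, labels, features, n_iter=50):
    """IIS sketch for the special case f#(x,y) = sum_i f_i(x,y) = M constant,
    where step 2.1 solves to delta_i = (1/M) log(E_tilde[f_i] / E_model[f_i])."""
    n, N = len(features), len(samples)
    M = sum(f(*samples[0]) for f in features)        # assumed constant f#
    p_x = Counter(x for x, _ in samples)             # counts, i.e. N * P~(x)
    # empirical expectations E_tilde[f_i] = sum_{x,y} P~(x,y) f_i(x,y)
    e_tilde = np.array([sum(f(x, y) for x, y in samples) / N for f in features])
    w = np.zeros(n)
    for _ in range(n_iter):
        e_model = np.zeros(n)
        for x, cnt in p_x.items():
            scores = np.array([sum(w[i] * features[i](x, y) for i in range(n))
                               for y in labels])
            p_y = np.exp(scores) / np.exp(scores).sum()   # P_w(y|x)
            for i in range(n):
                e_model[i] += cnt / N * sum(p_y[k] * features[i](x, labels[k])
                                            for k in range(len(labels)))
        w += np.log(e_tilde / e_model) / M           # delta_i, then step 2.2
    return w

# toy data: y equals x in 6 of 8 samples; f1 fires when y == x, f2 otherwise
labels = [0, 1]
features = [lambda x, y: 1.0 if y == x else 0.0,
            lambda x, y: 1.0 if y != x else 0.0]
samples = [(0, 0)] * 3 + [(1, 1)] * 3 + [(0, 1), (1, 0)]
w = iis_train(samples, labels, features)
```

After training, the model reproduces the empirical constraint: $P_{w}(y=x \mid x) = E_{\tilde{P}}(f_{1}) = 0.75$, illustrating how the converged weights satisfy the equation in step 2.1.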
3. Summary
Logistic regression (LR) is a classic classification method.
1. The logistic regression model is the classification model given by the following conditional probability distribution; it can be used for binary or multi-class classification.
$$P(Y=k|x)=\frac{\exp \left(w_{k} \cdot x\right)}{1+\sum_{k=1}^{K-1} \exp \left(w_{k} \cdot x\right)}, \quad k=1,2,\cdots,K-1$$
$$P(Y=K|x)=\frac{1}{1+\sum_{k=1}^{K-1} \exp \left(w_{k} \cdot x\right)}$$
Here $x$ is the input feature vector and $w$ are the feature weights.
The logistic regression model derives from the logistic distribution, whose distribution function $F(x)$ is S-shaped. The model expresses the log odds of the output as a linear function of the input.
2. The maximum entropy model is the classification model given by the following conditional probability distribution; it can also be used for binary or multi-class classification.
$$P_{w}(y|x)=\frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)$$
$$Z_{w}(x)=\sum_{y} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)$$
where $Z_{w}(x)$ is the normalization factor, $f_{i}$ are the feature functions, and $w_{i}$ are the feature weights.
3. The maximum entropy model can be derived from the maximum entropy principle, a criterion for learning or estimating probabilistic models. The maximum entropy principle holds that among all possible probability models (distributions), the model with the largest entropy is the best model.
Applying the maximum entropy principle to learning classification models gives the constrained optimization problem:
$$\begin{aligned} \min_{P \in \mathcal{C}} \quad & -H(P) = \sum_{x,y} \tilde{P}(x) P(y|x) \log P(y|x) \\ \text{s.t.} \quad & E_{P}(f_{i}) - E_{\tilde{P}}(f_{i}) = 0, \quad i = 1, 2, \cdots, n \\ & \sum_{y} P(y|x) = 1 \end{aligned}$$
Solving the dual of this optimization problem yields the maximum entropy model.
4. Both the logistic regression model and the maximum entropy model are log-linear models.
5. Both models are generally learned by maximum likelihood estimation, or regularized maximum likelihood estimation, which can be formalized as unconstrained optimization. Algorithms for solving it include improved iterative scaling, gradient descent, and quasi-Newton methods.
Regression model: $f(x) = \dfrac{1}{1+e^{-w \cdot x}}$
where $w \cdot x$ is the linear function $w \cdot x = w_{0} \cdot x_{0} + w_{1} \cdot x_{1} + w_{2} \cdot x_{2} + \cdots + w_{n} \cdot x_{n}$, with $x_{0} = 1$.
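Read as code, with $x_{0} = 1$ folded in so that $w_{0}$ acts as the bias (the weight values below are hypothetical, not learned):

```python
import numpy as np

def predict(w, x):
    """f(x) = 1 / (1 + e^{-w.x}), with x_0 = 1 prepended so w_0 is the bias."""
    x = np.concatenate([[1.0], x])
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([0.5, 2.0, -1.0])               # hypothetical (w0, w1, w2)
print(predict(w, np.array([1.0, 1.0])))      # sigma(0.5 + 2 - 1) = sigma(1.5)
```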