All in preparation for data mining
6. Logistic Regression and the Maximum Entropy Model
- Both are probabilistic models: as classifiers they compute $P(Y|X)$.
- Both belong to the family of log-linear models.
- In logistic regression, $\log P(Y|X) = w\cdot x$ (up to normalization). For multi-class $Y$, each value of $Y$ has its own parameter vector $w$; it is a discriminative model.
- In the maximum entropy model, $\log P(Y|X) = w\cdot f(x,y)$ (up to normalization), a linear function of the features; its training requires, in addition to $P(Y|X)$, the empirical distribution $\tilde{P}(X)$ of $X$ in the sample.
6.1 Binomial Logistic Regression Model
The logistic regression model is a classification model defined by the conditional probability distribution $P(Y|X)$. To classify, we compare the conditional probabilities of the different values of $y$ and predict the class with the larger probability.
6.1.1 Mathematical Form of the Binomial Logistic Regression Model
$$\begin{aligned} P(Y=1|X)&=\frac{\exp(w\cdot x+b)}{1+\exp(w\cdot x+b)} \\ P(Y=0|X)&=\frac{1}{1+\exp(w\cdot x+b)} \end{aligned}$$
where $w \in R^n$ and $b \in R$ are the parameters: $w$ is the weight vector, $b$ the bias, and $w\cdot x$ the inner product of $w$ and $x$.
- Equivalently, with the bias absorbed into the weights:
$$\begin{aligned} P(Y=1|X)&=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)} \\ P(Y=0|X)&=\frac{1}{1+\exp(w\cdot x)} \end{aligned}$$
where $w=(w^{(1)},w^{(2)},\cdots,w^{(n)},b)$ and $x=(x^{(1)},x^{(2)},\cdots,x^{(n)},1)$.
6.1.2 An Interpretation of the Logistic Regression Form
- Let $P(Y=1|X)=\pi(x)$, so $\pi(x)$ takes values in $[0,1]$. We would like to express this probability through the linear function $w\cdot x$.
- Apply the logit transform to $\pi(x)$:
$$logit(\pi(x))=\log\frac{\pi(x)}{1-\pi(x)}=w\cdot x$$
This maps $\pi(x)$, with range $[0,1]$, onto the linear function $w\cdot x$, with range $(-\infty,+\infty)$.
- Solving for $\pi(x)$ gives
$$P(Y=1|X)=\pi(x)=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)}$$
- The closer the linear function's value is to $+\infty$, the closer the probability is to 1; the closer it is to $-\infty$, the closer the probability is to 0.
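As a quick check (a sketch added here, not part of the original notes), the logit and sigmoid transforms are inverses of each other, and extreme linear scores push the probability toward 0 or 1:

```python
import math

def sigmoid(z):
    """Map a real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) to a real number (log-odds)."""
    return math.log(p / (1.0 - p))

# logit(sigmoid(z)) recovers z: the two transforms are inverses.
for z in (-5.0, -0.3, 0.0, 2.0):
    assert abs(logit(sigmoid(z)) - z) < 1e-9

# A large positive score gives a probability near 1, a large negative one near 0.
assert sigmoid(10.0) > 0.99
assert sigmoid(-10.0) < 0.01
```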
6.1.3 Parameter Estimation
Maximum likelihood estimation. Given training data with
$$x_i\in R^n,\quad y_i\in\{0,1\},\quad P(Y=1|x)=\pi(x),\quad P(Y=0|x)=1-\pi(x)$$
the likelihood and the log-likelihood are
$$\begin{aligned} l(w) & = \prod_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i} \\ L(w)& =\sum_{i=1}^N[y_i\log\pi(x_i)+(1-y_i)\log(1-\pi(x_i))] \\ &=\sum_{i=1}^N\Big[y_i\log \frac{\pi(x_i)}{1-\pi(x_i)} + \log(1-\pi(x_i))\Big] \\ &=\sum_{i=1}^N[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))] \end{aligned}$$
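The algebraic simplification above can be verified numerically: both forms of the log-likelihood $L(w)$ agree on any data. A small sketch with made-up data:

```python
import math

def log_likelihood_direct(w, X, Y):
    """L(w) = sum[y*log(pi) + (1-y)*log(1-pi)] with pi = sigmoid(w.x)."""
    total = 0.0
    for x, y in zip(X, Y):
        z = sum(wi * xi for wi, xi in zip(w, x))
        pi = 1.0 / (1.0 + math.exp(-z))
        total += y * math.log(pi) + (1 - y) * math.log(1 - pi)
    return total

def log_likelihood_simplified(w, X, Y):
    """L(w) = sum[y*(w.x) - log(1 + exp(w.x))]."""
    total = 0.0
    for x, y in zip(X, Y):
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
Y = [1, 0, 1]
w = [0.3, -0.2]
assert abs(log_likelihood_direct(w, X, Y) - log_likelihood_simplified(w, X, Y)) < 1e-12
```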
6.1.4 Optimizing the Objective by Gradient Ascent
To find a maximum, move in the direction of the gradient (the opposite of the gradient-descent direction):
$$w^\prime \leftarrow w + \eta \nabla L(w)$$
$$\begin{aligned} L(w) & =\sum_{i=1}^N[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))]\\ &=\sum_{i=1}^N[y_i(w^T x_i)-\log(1+\exp(w^Tx_i))] \\ \frac{\partial L(w)}{\partial w} &= \sum_{i=1}^N\Big[y_ix_i-\frac{\exp(w^Tx_i)}{1+\exp(w^Tx_i)}x_i\Big] \\ &=\sum_{i=1}^N\big(y_i - \mathrm{sigmoid}(w^Tx_i)\big)x_i \end{aligned}$$
where $N$ is the total number of samples, $x_i$ is the input vector of one sample, and $w$ is a vector of the same shape as $x$.
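The analytic gradient $\sum_i (y_i - \mathrm{sigmoid}(w^T x_i))x_i$ can be checked against a finite-difference approximation of $L(w)$ (an illustrative sketch with made-up data, not from the original notes):

```python
import math

def L(w, X, Y):
    """Log-likelihood L(w) = sum[y*(w.x) - log(1 + exp(w.x))]."""
    total = 0.0
    for x, y in zip(X, Y):
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

def grad_L(w, X, Y):
    """Analytic gradient: sum over samples of (y - sigmoid(w.x)) * x."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        z = sum(wi * xi for wi, xi in zip(w, x))
        err = y - 1.0 / (1.0 + math.exp(-z))
        for j in range(len(w)):
            g[j] += err * x[j]
    return g

X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
Y = [1, 0, 1]
w = [0.3, -0.2]
eps = 1e-6
analytic = grad_L(w, X, Y)
for j in range(len(w)):
    wp = list(w); wp[j] += eps
    wm = list(w); wm[j] -= eps
    numeric = (L(wp, X, Y) - L(wm, X, Y)) / (2 * eps)
    assert abs(numeric - analytic[j]) < 1e-4
```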
6.2 Multinomial Logistic Regression
6.2.1 Mathematical Form
Suppose the random variable $Y$ takes values in the set $\{1,2,\cdots,K\}$. Then
$$\begin{aligned} P(Y=i|X)&=\frac{\exp(w_i\cdot x)}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)},\quad i=1,2,\cdots,K-1 \\ P(Y=K|X)&=\frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)} \end{aligned}$$
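The two formulas above can be sketched as a small function (the weight values below are made up for illustration) to confirm that the $K$ probabilities sum to 1, with class $K$ acting as the reference class whose score is implicitly 0:

```python
import math

def multinomial_lr_probs(W, x):
    """P(Y=i|x) for i = 1..K using the K-1 weight vectors W[0..K-2].

    Class K is the reference class with an implicit score of 0.
    """
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    Z = 1.0 + sum(scores)
    return [s / Z for s in scores] + [1.0 / Z]

# Hypothetical weights for K = 3 classes over 2 features.
W = [[0.5, -1.0], [0.2, 0.3]]
x = [1.0, 2.0]
probs = multinomial_lr_probs(W, x)
assert abs(sum(probs) - 1.0) < 1e-12
assert all(0.0 < p < 1.0 for p in probs)
```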
6.2.2 多项逻辑斯蒂回归数学表达的一种解释
l
o
g
i
t
(
P
(
Y
=
i
∣
X
)
)
=
l
o
g
P
(
Y
=
i
∣
X
)
P
(
Y
=
K
∣
X
)
=
l
o
g
P
(
Y
=
i
∣
X
)
1
−
∑
k
=
1
K
−
1
P
(
Y
=
k
∣
X
)
=
w
i
⋅
x
,
i
=
1
,
2
,
⋯
,
K
−
1
logit(P(Y=i|X))=log\frac{P(Y=i|X)}{P(Y=K|X)}=log\frac{P(Y=i|X)}{1-\sum_{k=1}^{K-1}P(Y=k|X)}=w_i\cdot x,\quad i=1,2,\cdots,K-1
logit(P(Y=i∣X))=logP(Y=K∣X)P(Y=i∣X)=log1−∑k=1K−1P(Y=k∣X)P(Y=i∣X)=wi⋅x,i=1,2,⋯,K−1
其中
w
i
w_i
wi是Y取值i时的权重。
Derivation:
$$\begin{aligned} P(Y=i|X) & = \exp(w_i\cdot x)P(Y=K|X),\quad i=1,2,\cdots,K-1\\ \sum_{i=1}^{K-1}P(Y=i|X) +P(Y=K|X)& = P(Y=K|X)\sum_{i=1}^{K-1}\exp(w_i\cdot x) +P(Y=K|X)=1\\ P(Y=K|X) &=\frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)} \\ P(Y=i|X) &= \frac{\exp(w_i\cdot x)}{1+\sum_{k=1}^{K-1}\exp(w_k\cdot x)},\quad i=1,2,\cdots,K-1 \end{aligned}$$
6.3 The Maximum Entropy Model
Widely used in natural language processing.
6.3.1 The Maximum Entropy Principle
- The maximum entropy principle: when learning a probabilistic model, among all feasible models, the one with the largest entropy is the best model.
- In practice, we select the maximum-entropy model from the set of models satisfying the constraints. That is, given the known facts, and with no further information, the remaining uncertainty should be spread evenly, with all outcomes equally likely; this is when the distribution is most disordered and the entropy is largest.
- Entropy:
$$H(Y) = -\sum_y P(y)\log P(y)$$
The entropy is maximized when $Y$ is uniformly distributed, in which case $H(P) = \log K$, where $K$ is the number of values $Y$ can take.
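This fact is easy to confirm numerically (a small sketch, not part of the original notes):

```python
import math

def entropy(probs):
    """H = -sum p*log(p), with 0*log(0) treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

K = 4
uniform = [1.0 / K] * K
skewed = [0.7, 0.1, 0.1, 0.1]

# The uniform distribution attains the maximum entropy log K.
assert abs(entropy(uniform) - math.log(K)) < 1e-12
assert entropy(skewed) < entropy(uniform)
```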
6.3.2 Constraints of the Maximum Entropy Model
- First consider the conditions the model must satisfy. Given the training set, we can form the empirical distributions of the joint distribution $P(X,Y)$ and the marginal distribution $P(X)$:
$$\begin{aligned} \tilde{P}(X=x,Y=y) &= \frac{v(X=x,Y=y)}{N} \\ \tilde{P}(X=x) &= \frac{v(X=x)}{N} \end{aligned}$$
where $v(\cdot)$ denotes the frequency (count) of the event in the training sample.
- A feature function $f(x,y)$ describes some fact relating input $x$ and output $y$:
$$f(x,y)=\begin{cases} 1, & x \text{ and } y \text{ satisfy the fact} \\ 0, & \text{otherwise}\end{cases}$$
Example: the word "take" has many senses, and the set of these senses forms the value range of $Y$; the many sentences containing the word form the input variable $X$. Then the condition "y = 'to ride', and 'take' is followed by the word 'bus' in the sentence" defines one feature function.
- Expectation of the feature function $f(x,y)$ with respect to the empirical distribution $\tilde{P}(x,y)$:
$$E_{\tilde{P}}(f)=\sum_{x,y}\tilde{P}(x,y)f(x,y)$$
- Expectation of the feature function $f(x,y)$ with respect to the model $P(Y|X)$ and the empirical distribution $\tilde{P}(X)$:
$$E_P(f)=\sum_{x,y}\tilde{P}(x)P(y|x)f(x,y)$$
- We require the model's feature probability to match the sample's estimate, i.e. $\sum_{x,y}\tilde{P}(x)P(y|x)f(x,y) = \sum_{x,y}\tilde{P}(x,y)f(x,y)$, and take this as a constraint.
- With $n$ feature functions there are $n$ constraints.
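As a tiny illustration (the sample and feature function below are made up, not from the original notes), the empirical feature expectation $E_{\tilde{P}}(f)$ is just a frequency-weighted count over the training pairs:

```python
from collections import Counter

# Toy training sample of (x, y) pairs; N is the sample size.
sample = [("a", 1), ("a", 1), ("a", 0), ("b", 1), ("b", 0), ("b", 0)]
N = len(sample)

def f(x, y):
    """Hypothetical binary feature: fires when x == 'a' and y == 1."""
    return 1 if (x == "a" and y == 1) else 0

# Empirical joint distribution P~(x,y) = v(x,y)/N, then E_P~(f).
counts = Counter(sample)
E_tilde = sum((c / N) * f(x, y) for (x, y), c in counts.items())
assert abs(E_tilde - 2 / 6) < 1e-12  # the feature fires in 2 of 6 pairs
```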
6.3.3 The Maximum Entropy Model
- The set of models satisfying all the constraints is
$$C \equiv \{P \mid E_{\tilde{P}}(f_i)=E_P(f_i),\ i=1,2,\cdots,n\}$$
$C$ is the feasible region.
- The conditional entropy of the conditional probability distribution $P(Y|X)$ is
$$H(P)=-\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x)$$
- The target maximum entropy model is the model in $C$ with the largest conditional entropy.
6.3.4 Learning the Maximum Entropy Model (Solving the Optimization Problem)
- Primal problem:
$$\begin{aligned} \max_{P \in C} \quad &H(P)=-\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) \\ s.t. \quad &E_{\tilde{P}}(f_i)=E_P(f_i),\quad i=1,2,\cdots,n \\ & \sum_y P(y|x)=1 \end{aligned}$$
- Rewritten as a standard minimization problem:
$$\begin{aligned} \min_{P \in C} \quad &-H(P)=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) \\ s.t. \quad &E_{\tilde{P}}(f_i)-E_P(f_i)=0,\quad i=1,2,\cdots,n \\ & 1-\sum_y P(y|x)=0 \end{aligned}$$
- Lagrangian:
$$L(P,w)=-H(P)+w_0\Big[1-\sum_y P(y|x)\Big] +\sum_{i=1}^n w_i\big(E_{\tilde{P}}(f_i)-E_P(f_i)\big)$$
- Convert the min-max problem on the Lagrangian into its dual.
Primal problem: $\min_{P\in C}\max_w L(P,w)$
Dual problem: $\max_w\min_{P\in C}L(P,w)$
- Since $-H(P)$ is convex, strong duality holds between the primal and dual problems, so their solutions coincide.
Convexity check:
$$\begin{aligned} -H(P)&=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x)\\ -H^\prime(P)&=\frac{\partial(-H(P))}{\partial P(y|x)} = \sum_{x,y}\tilde{P}(x)[\log P(y|x)+1] \\ -H^{\prime\prime}(P)&=\frac{\partial^2(-H(P))}{\partial P(y|x)^2}=\sum_{x,y}\tilde{P}(x)\frac{1}{P(y|x)} >0 \end{aligned}$$
- By the KKT conditions, first minimize over the model: set the derivative of $L(P,w)$ with respect to $P(y|x)$ to zero.
$$\begin{aligned} L(P,w)&=-H(P)+w_0\Big[1-\sum_y P(y|x)\Big] +\sum_{i=1}^n w_i\big(E_{\tilde{P}}(f_i)-E_P(f_i)\big) \\ &=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) + w_0\Big[1-\sum_y P(y|x)\Big] + \sum_{i=1}^n w_i\Big[\sum_{x,y}\tilde{P}(x,y)f_i(x,y) - \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y)\Big] \\ \frac{\partial L(P,w)}{\partial P(y|x)} &=\sum_{x,y}\tilde{P}(x)[\log P(y|x)+1] - \sum_y w_0 -\sum_{x,y}\tilde{P}(x)\sum_{i=1}^n w_i f_i(x,y) \\ &=\sum_{x,y}\tilde{P}(x)\Big[\log P(y|x)+1-w_0 - \sum_{i=1}^n w_i f_i(x,y)\Big] =0 \\ P(y|x) &=\frac{\exp(\sum_{i=1}^n w_i f_i(x,y))}{\exp(w_0-1)} \end{aligned}$$
- Using the constraint $\sum_y P(y|x)=1$,
$$\exp(w_0-1) = \sum_y \exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big)$$
$$\begin{aligned} P_w(y|x) &=\frac{1}{Z_w(x)} \exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big) \\ Z_w(x) &=\sum_y \exp\Big(\sum_{i=1}^n w_i f_i(x,y)\Big) \end{aligned}$$
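The closed form $P_w(y|x)$ can be evaluated directly once the weights and feature functions are fixed. A minimal sketch with hypothetical features and weights (nothing here comes from the original notes):

```python
import math

def maxent_conditional(weights, features, x, ys):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x).

    `features` is a list of feature functions f_i(x, y); all names and
    values here are illustrative.
    """
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
              for y in ys}
    Z = sum(scores.values())  # Z_w(x) normalizes over all y
    return {y: s / Z for y, s in scores.items()}

# Two hypothetical binary features over labels {0, 1}.
f1 = lambda x, y: 1 if (x == "a" and y == 1) else 0
f2 = lambda x, y: 1 if y == 0 else 0
P = maxent_conditional([1.5, 0.5], [f1, f2], "a", [0, 1])
assert abs(sum(P.values()) - 1.0) < 1e-12
assert P[1] > P[0]  # the larger weight on f1 favors y = 1 when x == "a"
```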
- Next, again by the KKT conditions, maximize over the parameters $w$ at the optimum in $P$:
$$\begin{aligned} L(P,w)&=\sum_{x,y}\tilde{P}(x)P(y|x)\log P(y|x) + \sum_{i=1}^n w_i\Big[\sum_{x,y}\tilde{P}(x,y)f_i(x,y) - \sum_{x,y}\tilde{P}(x)P(y|x)f_i(x,y)\Big] \\ \Psi(w)&=\sum_{x,y}\tilde{P}(x)P_w(y|x)\log P_w(y|x) + \sum_{i=1}^n w_i\Big[\sum_{x,y}\tilde{P}(x,y)f_i(x,y) -\sum_{x,y}\tilde{P}(x)P_w(y|x)f_i(x,y)\Big] \\ &=\sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n w_i f_i(x,y)\Big] + \sum_{x,y}\tilde{P}(x)P_w(y|x)\Big[\log P_w(y|x)-\sum_{i=1}^n w_i f_i(x,y)\Big] \\ &= \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n w_i f_i(x,y)\Big] - \sum_{x,y}\tilde{P}(x)P_w(y|x)\log Z_w(x)\\ &= \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n w_i f_i(x,y)\Big] - \sum_{x}\tilde{P}(x)\log Z_w(x) \end{aligned}$$
- It can be shown that this $L(P,w)$ is equivalent to the maximum likelihood estimate of the maximum entropy model $P_w(y|x)$ (the objective obtained when taking the extremum of the Lagrangian): $L(P_w)=\log\prod_{x,y}P_w(y|x)^{\tilde{P}(x,y)}$. That is, maximizing the dual function of the maximum entropy model is equivalent to maximum likelihood estimation.
6.4 Optimization Algorithms for Learning
Final learning objective of binomial logistic regression:
$$\max_w L(w) =\sum_{i=1}^N[y_i(w\cdot x_i)-\log(1+\exp(w\cdot x_i))]$$
Final learning objective of the maximum entropy model:
$$\max_w \Psi(w) = \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n w_i f_i(x,y)\Big] - \sum_{x}\tilde{P}(x)\log Z_w(x)$$
6.4.1 Improved Iterative Scaling for the Maximum Entropy Model
- We want to iterate on $w$. Suppose the update is $w \leftarrow w+\delta$; the change in the objective is
$$\Psi(w+\delta)-\Psi(w) = \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n\delta_i f_i(x,y)\Big] - \sum_x\tilde{P}(x)\log\frac{Z_{w+\delta}(x)}{Z_w(x)}$$
We seek the $\delta$ that maximizes this change.
- Using the inequality $-\log\alpha \geqslant 1-\alpha$ for $\alpha>0$, transform the expression:
$$\begin{aligned} \Psi(w+\delta)-\Psi(w) &\geqslant \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n\delta_i f_i(x,y)\Big] + 1 - \sum_x\tilde{P}(x)\frac{Z_{w+\delta}(x)}{Z_w(x)}\\ \frac{Z_{w+\delta}(x)}{Z_w(x)}&= \frac{\sum_y \exp(\sum_{i=1}^n (w_i+\delta_i) f_i(x,y))}{Z_w(x)} \\ &=\frac{\sum_y \exp(\sum_{i=1}^n w_i f_i(x,y)+\sum_{i=1}^n \delta_i f_i(x,y))}{Z_w(x)}\\ &=\frac{\sum_y \exp(\sum_{i=1}^n w_i f_i(x,y))\exp(\sum_{i=1}^n \delta_i f_i(x,y))}{Z_w(x)}\\ &=\sum_y\frac{\exp(\sum_{i=1}^n w_i f_i(x,y))}{Z_w(x)}\exp\Big(\sum_{i=1}^n \delta_i f_i(x,y)\Big) \\ &=\sum_y P_w(y|x)\exp\Big(\sum_{i=1}^n \delta_i f_i(x,y)\Big) \\ \Psi(w+\delta)-\Psi(w) &\geqslant \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n\delta_i f_i(x,y)\Big] + 1-\sum_x\tilde{P}(x)\sum_y P_w(y|x)\exp\Big(\sum_{i=1}^n \delta_i f_i(x,y)\Big) \end{aligned}$$
Write
$$A(\delta|w) = \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n\delta_i f_i(x,y)\Big] + 1-\sum_x\tilde{P}(x)\sum_y P_w(y|x)\exp\Big(\sum_{i=1}^n \delta_i f_i(x,y)\Big)$$
Then $A(\delta|w)$ is a lower bound on the change, and we look for the $\delta$ that maximizes this lower bound. We would like to differentiate $A(\delta|w)$ with respect to $\delta$, but the term $\exp(\sum_{i=1}^n \delta_i f_i(x,y))$ couples all the $\delta_i$ and survives differentiation, so we apply a further known inequality to transform $A(\delta|w)$.
- For a convex function $\psi$ and weights $a_i$ with $\sum_i a_i=1$ (Jensen's inequality), $\psi(\sum_i a_i x_i)\leqslant \sum_i a_i \psi(x_i)$. Applying this to $A(\delta|w)$:
$$\begin{aligned} A(\delta|w) &\geqslant \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n\delta_i f_i(x,y)\Big] + 1 - \sum_x\tilde{P}(x)\sum_y P_w(y|x)\sum_{i=1}^n\frac{f_i(x,y)}{\sum_{j=1}^n f_j(x,y)}\exp\Big(\delta_i\sum_{j=1}^n f_j(x,y)\Big) \\ \text{Define } B(\delta|w) &= \sum_{x,y}\tilde{P}(x,y)\Big[\sum_{i=1}^n\delta_i f_i(x,y)\Big] + 1 - \sum_x\tilde{P}(x)\sum_y P_w(y|x)\sum_{i=1}^n\frac{f_i(x,y)}{\sum_{j=1}^n f_j(x,y)}\exp\Big(\delta_i\sum_{j=1}^n f_j(x,y)\Big) \\ & \quad \Psi(w+\delta)-\Psi(w) \geqslant B(\delta|w) \end{aligned}$$
Take the partial derivative of the new lower bound $B(\delta|w)$ and set it to zero:
$$\frac{\partial B(\delta|w)}{\partial \delta_i} = \sum_{x,y}\tilde{P}(x,y)f_i(x,y)-\sum_x\tilde{P}(x)\sum_y P_w(y|x)f_i(x,y)\exp\Big(\delta_i\sum_{j=1}^n f_j(x,y)\Big)=0$$
from which $\delta_i$ can be solved.
- If $\delta_i$ has no closed-form expression, it can be found by Newton's method: for the equation $g(\delta)=0$, iterate
$$\delta^{(k+1)}=\delta^{(k)}-\frac{g(\delta^{(k)})}{g^\prime(\delta^{(k)})}$$
to obtain the solution $\delta$.
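A minimal sketch of that Newton iteration, run on a toy equation of the same shape as the IIS condition (constant minus an exponential of $\delta$). The constants are made up, and the exact root is known, so convergence can be checked:

```python
import math

def newton_solve(g, g_prime, delta0, tol=1e-10, max_iter=100):
    """Solve g(delta) = 0 via the iteration delta <- delta - g(delta)/g'(delta)."""
    delta = delta0
    for _ in range(max_iter):
        step = g(delta) / g_prime(delta)
        delta -= step
        if abs(step) < tol:
            break
    return delta

# Toy equation a - b * exp(delta * c) = 0, exact root delta = log(a/b) / c.
a, b, c = 2.0, 0.5, 3.0
g = lambda d: a - b * math.exp(d * c)
g_prime = lambda d: -b * c * math.exp(d * c)
delta = newton_solve(g, g_prime, 0.0)
assert abs(delta - math.log(a / b) / c) < 1e-8
```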
6.5 My Logistic Regression Implementation

```python
import numpy as np

class LogisticRegression:
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.w = self.training(X, Y)

    def training(self, X, Y, n=200, eta=0.01):
        """
        X, Y are lists: the training inputs and their 0/1 labels.
        n is the number of iterations; the bias is absorbed via wx + b = W.X
        by appending a constant 1 to each input. Returns w as an np.array.
        """
        # Augment each input to x = (x1, ..., xn, 1), without mutating the caller's lists.
        matX = np.array([list(x) + [1] for x in X], dtype=float)
        # Labels Y as a column vector.
        matY = np.array(Y, dtype=float).reshape((len(Y), 1))
        # w has the same shape as an augmented input, as a column vector.
        w = np.ones((matX.shape[1], 1))
        # Iterate n times; the update is eta * sum_i (y_i - sigmoid(w.x_i)) * x_i.
        for _ in range(n):
            w += eta * matX.T @ (matY - self.sigmoid(matX, w))
        return w

    def sigmoid(self, x, w):
        """x, w are np.arrays; returns sigmoid(x @ w) elementwise."""
        return 1.0 / (1 + np.exp(-(x @ w)))

    def predict(self, x):
        # Append the constant feature, again without modifying the caller's list.
        x = np.array(list(x) + [1], dtype=float).reshape((1, -1))
        p = self.sigmoid(x, self.w)
        return 1 if p > 0.5 else 0

X = [[3, 3, 3], [4, 3, 2], [2, 1, 2], [1, 1, 1], [-1, 0, 1], [2, -2, 1]]
Y = [1, 1, 1, 0, 0, 0]
lr = LogisticRegression(X, Y)
x = [1, 2, -2]
print(lr.predict(x))
```