Multi-class Logistic Regression
The article "Logistic Regression摘记" (Notes on Logistic Regression) described binary logistic regression in detail. This article describes how to implement multi-class logistic regression with the softmax function. In effect, this is a single-layer neural network (with no hidden layer) performing multi-class classification, whose output function is the softmax function.

1. The softmax function

For an input $\boldsymbol x$, the corresponding softmax output is the vector $\boldsymbol y=[y_1,\cdots,y_k,\cdots,y_K]^T$, which satisfies $\sum\limits_{k=1}^K y_k=1$.

(1) In a classification problem $(K>2)$, the softmax function $h(\cdot)$ gives the output components:

$$y_k=h(a_k)=\dfrac{e^{a_k}}{\sum\limits_{j=1}^K e^{a_j}}=\dfrac{e^{\boldsymbol{w}_k^{T}\boldsymbol{x}+b_k}}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}+b_j}}\ ,\ k=1,2,\cdots,K$$

This is a perceptron model whose activation function is softmax: $a_k=\boldsymbol{w}_k^{T}\boldsymbol{x}+b_k,\quad y_k=h(a_k)$

(2) For multi-class logistic regression: $a_j=\boldsymbol{w}_j^{T}\boldsymbol{x}+b_j$

where $\boldsymbol{x}\in R^D,\ \boldsymbol{w}_j=[w_{j,1},w_{j,2},\cdots,w_{j,D}]^T\ \ (j=1,2,\cdots,K)$.

If we write $\boldsymbol{w}_j^*=[\boldsymbol{w}_j^T,b_j]^T$ and $\boldsymbol{x}^*=[\boldsymbol{x}^T,1]^T$, then $a_j=\boldsymbol{w}_j^{T}\boldsymbol{x}+b_j={(\boldsymbol{w}_j^*)}^{T}\boldsymbol{x}^*$.

For convenience, the '$*$' can be dropped, writing directly: $a_j=\boldsymbol{w}_j^{T}\boldsymbol{x}$

(3) Writing the output component $y_k$ as a posterior probability:

$$y_k=p(y=k|\boldsymbol x)=h(a_k)=\dfrac{e^{\boldsymbol{w}_k^{T}\boldsymbol{x}}}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}}}\ ,\ k=1,2,\cdots,K$$

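The definitions in this section can be sketched in NumPy as follows. This is only an illustration: the helper name `softmax_probs` and the numeric weights are made up for the example, and the maximum is subtracted before exponentiating purely for numerical safety (it does not change $y_k$).

```python
import numpy as np

def softmax_probs(W_aug, x):
    """Return y_k = exp(a_k) / sum_j exp(a_j) with a_k = (w_k*)^T x*.

    W_aug has shape (K, D + 1): row k is w_k* = [w_k^T, b_k]^T.
    """
    x_aug = np.append(x, 1.0)      # x* = [x^T, 1]^T absorbs the bias
    a = W_aug @ x_aug              # activations a_1, ..., a_K
    a = a - np.max(a)              # shifting every a_k equally leaves y unchanged
    e = np.exp(a)
    return e / np.sum(e)

# Illustrative values only: K = 3 classes, D = 2 features.
W_aug = np.array([[1.0, -1.0, 0.5],
                  [0.0, 2.0, -0.5],
                  [-1.0, 0.5, 0.0]])
x = np.array([0.3, -0.2])
y = softmax_probs(W_aug, x)
print(y, y.sum())                  # K positive components that sum to 1
```

The last column of `W_aug` plays the role of the biases $b_k$, so the same function covers both the form with an explicit bias and the shortened form $a_j=\boldsymbol{w}_j^{T}\boldsymbol{x}$.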
2. Relationship to binary logistic regression

For comparison, binary logistic regression gives:
- the probability that $\boldsymbol{x}$ is a positive example: $p(y=1|\boldsymbol{x})=\dfrac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}$
- the probability that $\boldsymbol{x}$ is a negative example: $p(y=0|\boldsymbol{x})=1-\dfrac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}=\dfrac{e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}=\dfrac{1}{1+e^{(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}$

When $K=2$, the softmax function is in fact equivalent to binary logistic regression (taking $k\in\{0,1\}$):

$$\begin{cases} y_0=p(y=0|\boldsymbol x)=\dfrac{e^{\boldsymbol{w}_0^{T}\boldsymbol{x}}}{e^{\boldsymbol{w}_0^{T}\boldsymbol{x}}+e^{\boldsymbol{w}_1^{T}\boldsymbol{x}}}=\dfrac{1}{1+e^{(\boldsymbol{w}_1-\boldsymbol{w}_0)^{T}\boldsymbol{x}}}\\ \\ y_1=p(y=1|\boldsymbol x)=\dfrac{e^{\boldsymbol{w}_1^{T}\boldsymbol{x}}}{e^{\boldsymbol{w}_0^{T}\boldsymbol{x}}+e^{\boldsymbol{w}_1^{T}\boldsymbol{x}}}=\dfrac{1}{1+e^{-(\boldsymbol{w}_1-\boldsymbol{w}_0)^{T}\boldsymbol{x}}} \end{cases}$$

Letting $\hat{\boldsymbol{w}}=\boldsymbol{w}_1-\boldsymbol{w}_0$ (with the bias absorbed into $\hat{\boldsymbol{w}}$), the class posterior probabilities are exactly those of binary logistic regression.

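The equivalence for $K=2$ can be checked numerically. The sketch below uses random vectors as stand-ins for $\boldsymbol{w}_0$, $\boldsymbol{w}_1$ and $\boldsymbol{x}$ (biases assumed absorbed), and the helper names `softmax` and `sigmoid` are chosen for the example.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w0, w1, x = rng.normal(size=(3, 4))    # three random vectors in R^4

y0, y1 = softmax(np.array([w0 @ x, w1 @ x]))
w_hat = w1 - w0                        # the single weight vector of the binary model

print(y1, sigmoid(w_hat @ x))          # the two values agree
print(y0, 1.0 - sigmoid(w_hat @ x))    # and so do these
```

Only the difference $\boldsymbol{w}_1-\boldsymbol{w}_0$ matters, which is why the two-class softmax model has one redundant weight vector compared with the binary formulation.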
3. Error function

For multi-class logistic regression, the first step is to write down its error function.

Suppose the training set is $\{ (\boldsymbol{x}_n,c_n)\} _{n=1}^{N}$, where $\boldsymbol{x}_n\in R^{D},\ c_n\in \{1,2,\cdots,K\}$, and the parameters are $\boldsymbol W=(\boldsymbol{w}_1^T,\boldsymbol{w}_2^T,\cdots,\boldsymbol{w}_K^T,\boldsymbol b^T)^T$.

Binary logistic regression

Suppose the training samples are $\{ (\boldsymbol{x}_n,c_n)\} _{n=1}^{N}$, where $\boldsymbol{x}_n\in R^{D},\ c_n\in \{0,1\}$. The likelihood function is:

$$L(\boldsymbol{w},b)= \prod_{n=1}^N h(\boldsymbol{x}_n)^{c_n}\left[ 1-h(\boldsymbol{x}_n)\right] ^{1-c_n}\ ,\ h(\boldsymbol{x})=p(c=1|\boldsymbol{x})=\dfrac{1}{1+e^{-(\boldsymbol{w}^{T}\boldsymbol{x}+b)}}$$

The negative log-likelihood is taken as the error function, i.e. $l(\boldsymbol{w},b)=-\ln L(\boldsymbol{w},b)$.

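As a small illustration of the binary case, the sketch below evaluates $l(\boldsymbol{w},b)$ both from the summed log form and directly from the product $L(\boldsymbol{w},b)$. The function name `binary_nll` and all numeric values are invented for the example.

```python
import numpy as np

def binary_nll(w, b, X, c):
    """Negative log-likelihood l(w, b) = -ln L(w, b) for labels c in {0, 1}."""
    h = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # h(x_n) = p(c = 1 | x_n)
    return -np.sum(c * np.log(h) + (1 - c) * np.log(1 - h))

# Illustrative data: N = 3 samples, D = 2 features.
X = np.array([[0.5, 1.0], [-1.0, 0.2], [1.5, -0.3]])
c = np.array([1, 0, 1])
w = np.array([0.8, -0.4])
b = 0.1

# Direct evaluation of the product form, for comparison.
h = 1.0 / (1.0 + np.exp(-(X @ w + b)))
L = np.prod(h ** c * (1 - h) ** (1 - c))
print(binary_nll(w, b, X, c), -np.log(L))        # same value twice
```

Working with the sum of logarithms rather than the product avoids underflow when $N$ is large, which is one practical reason to minimize $-\ln L$ instead of maximizing $L$.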
3.1 The 1-of-K (one-hot) representation for multi-class regression

(1) Let the variable $c\in \{1,2,\cdots,K\}$ denote the class of the input $\boldsymbol x$.

(2) Introduce the target vector $\bold t=[0,\cdots,0,1,0,\cdots,0]^T\in R^K$ with $t_k=1,\ t_j=0\ (j\neq k)$, which means "the input $\boldsymbol x$ belongs to class $k$", or equivalently $c=k$.

(3) Let the vector $\boldsymbol y=[y_1,\cdots,y_k,\cdots,y_K]^T$ denote the softmax output for the input $\boldsymbol x$:

$$y_k=p(c=k|\boldsymbol x)=p(t_k=1|\boldsymbol x)=\prod_{j=1}^K p(t_j|\boldsymbol x)^{t_j}=\dfrac{e^{\boldsymbol{w}_k^{T}\boldsymbol{x}}}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}}}$$

In the product, every factor with $t_j=0$ equals 1, so only the factor for $j=k$ remains.

Clearly, $\sum\limits_{k=1}^K y_k=\sum\limits_{k=1}^K p(y=k|\boldsymbol x)=1$.

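A one-hot encoding of class labels might look like the sketch below. The helper name `one_hot` is an assumption for the example. Labels follow the text's convention $c\in\{1,\cdots,K\}$, so they are shifted by one to index zero-based arrays.

```python
import numpy as np

def one_hot(c, K):
    """Map class labels c_n in {1, ..., K} to rows t_n of an N x K matrix."""
    c = np.asarray(c)
    T = np.zeros((c.size, K))
    T[np.arange(c.size), c - 1] = 1.0    # column k-1 holds t_k (arrays are 0-based)
    return T

T = one_hot([3, 1, 2, 3], K=3)
print(T)
```

Each row contains exactly one 1, so the rows sum to 1 just as the softmax outputs do.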
3.2 Likelihood function of the training set

(1) For the $n$-th training sample $(\boldsymbol{x}_n,c_n)$, the softmax output is $\boldsymbol y_n=[y_{n1},\cdots,y_{nk},\cdots,y_{nK}]^T$, with

$$y_{nk}=p(c_n=k|\boldsymbol x_n)=p(t_{nk}=1|\boldsymbol x_n)$$

For the sample's own class $k=c_n$, this equals $\prod\limits_{j=1}^K p(t_{nj}|\boldsymbol x_n)^{t_{nj}}$, since every factor with $t_{nj}=0$ is 1.

(2) The likelihood function $L(\boldsymbol W)$ of the training set $\{ (\boldsymbol{x}_n,c_n)\} _{n=1}^{N}$ is:

$$\begin{aligned}L(\boldsymbol W)&= \prod_{n=1}^N p(c_n|\boldsymbol x_n) \\ &=\prod_{n=1}^N p(t_{n c_n}=1|\boldsymbol x_n) \\ &=\prod_{n=1}^N \prod_{k=1}^K p(t_{nk}|\boldsymbol x_n)^{t_{nk}} \end{aligned}$$

3.3 Cross-entropy error function

The cross-entropy error function of the training set $\{ (\boldsymbol{x}_n,c_n)\} _{n=1}^{N}$ is defined as:

$$\begin{aligned} l(\boldsymbol W)&=-\ln L(\boldsymbol W)=-\sum_{n=1}^N \sum_{k=1}^K t_{nk}\ln p(t_{nk}|\boldsymbol x_n)\\ &=-\sum_{n=1}^N \sum_{k=1}^K t_{nk}\ln y_{nk}\end{aligned}$$

Cross-entropy is used as the error function because:

(1) If the training sample $\boldsymbol x_n$ has class $c_n=k$, then only the $k$-th component of its target vector $\bold t_n$ is $t_{nk}=1$, while the other components are $t_{nj}=0\ (j\neq k)$.

(2) During training, $y_{nk}$ is the $k$-th component of the softmax output for $\boldsymbol x_n$, that is, the output component for the sample's correct class $k$.

(3) The larger the component $y_{nk}$ for the correct class $k$, the larger $\ln y_{nk}$, the smaller the cross-entropy, and hence the smaller the training error.

(4) Ideally, the component for the correct class satisfies $y_{nk}=1,\ \forall\ n$, in which case the cross-entropy is $0$, meaning there is no training error.

The squared error $\sum_n\|\boldsymbol y_n-\bold t_n\|^2$ can also be used as the error function.

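The following sketch evaluates the cross-entropy for a handful of made-up softmax outputs and shows that only the correct-class component of each sample contributes. The matrices `Y` and `T` are illustrative values, not results from the MNIST experiment.

```python
import numpy as np

def cross_entropy(Y, T):
    """l(W) = -sum_n sum_k t_nk * ln y_nk for softmax outputs Y and one-hot T."""
    return -np.sum(T * np.log(Y))

# Illustrative outputs for N = 3 samples and K = 3 classes (each row sums to 1).
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
T = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

loss = cross_entropy(Y, T)
# Equivalent form: pick y_nk for the correct class of each sample.
picked = -np.sum(np.log(Y[np.arange(3), T.argmax(axis=1)]))
print(loss, picked)

# Raising the correct-class output of the third sample lowers the error.
Y_better = Y.copy()
Y_better[2] = [0.05, 0.05, 0.9]
print(cross_entropy(Y_better, T))
```

Note that this version sums over samples, as in the formula above. Dividing by $N$ gives a per-sample average, which rescales the value but not the location of the minimum.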
4. Maximum likelihood estimation

To find the parameters $\boldsymbol W=(\boldsymbol{w}_1^T,\boldsymbol{w}_2^T,\cdots,\boldsymbol{w}_K^T,\boldsymbol b^T)^T$, maximum likelihood estimation is used again.

The training set can be split into $K$ subsets $C_1,\cdots,C_k,\cdots,C_K$, where every sample $\boldsymbol x_n$ in the $k$-th subset $C_k$ has class $c_n=k$ and its target vector $\bold t_n$ satisfies $t_{nk}=1,\ t_{nj}=0\ (j\neq k)$. The error function then becomes:

$$\begin{aligned} l(\boldsymbol W)&=-\sum_{n=1}^N \sum_{k=1}^K t_{nk}\ln y_{nk}\\ &=-\sum_{n\in C_1} t_{n1}\ln y_{n1}-\cdots-\sum_{n\in C_k} t_{nk} \ln y_{nk}-\cdots-\sum_{n\in C_K} t_{nK}\ln y_{nK} \\ &=-\sum_{n\in C_1} \ln y_{n1}-\cdots-\sum_{n\in C_k} \ln y_{nk}-\cdots-\sum_{n\in C_K} \ln y_{nK} \end{aligned}$$

The partial derivative of $l(\boldsymbol W)$ with respect to the parameter $\boldsymbol w_k$ splits into two parts:

(1) For the $k$-th term of $l(\boldsymbol W)$, $l_k(\boldsymbol W)=-\sum\limits_{n\in C_k} \ln y_{nk}$, the partial derivative with respect to $\boldsymbol w_k$ is

$$\begin{aligned} \dfrac{\partial l_k(\boldsymbol W)}{\partial \boldsymbol w_k}&=-\sum_{n\in C_k} \dfrac{1}{ y_{nk}} \dfrac{\partial y_{nk}}{\partial \boldsymbol w_k}\qquad\qquad (\boldsymbol{w}_k^{T}\boldsymbol{x}_n=\boldsymbol{x}_n^{T}\boldsymbol{w}_k)\\
&=-\sum_{n\in C_k} \dfrac{1}{ y_{nk}} \dfrac{(e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n})^{'}\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}-e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n}\left(\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n} \right)^{'}}{\left(\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}\right)^2}\\
&=-\sum_{n\in C_k} \dfrac{1}{ y_{nk}} \dfrac{(e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n})\boldsymbol{x}_n\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}-e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n}(e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n})\boldsymbol{x}_n}{\left(\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}\right)^2}\\
&=-\sum_{n\in C_k} \dfrac{1}{ y_{nk}} \dfrac{(e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n})\boldsymbol{x}_n}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}}\dfrac{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}-e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n}}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}}\\
&=-\sum_{n\in C_k} \dfrac{1}{ y_{nk}} y_{nk}\boldsymbol{x}_n(1-y_{nk})\\
&=-\sum_{n\in C_k}(1-y_{nk})\boldsymbol{x}_n
\end{aligned}$$

(2) For the $i$-th term of $l(\boldsymbol W)$ with $i\neq k$, $l_i(\boldsymbol W)=-\sum\limits_{n\in C_i} \ln y_{ni}$, the partial derivative with respect to $\boldsymbol w_k$ is

$$\begin{aligned} \dfrac{\partial l_i(\boldsymbol W)}{\partial \boldsymbol w_k}&=-\sum_{n\in C_i} \dfrac{1}{ y_{ni}} \dfrac{\partial y_{ni}}{\partial \boldsymbol w_k}\\
&=-\sum_{n\in C_i} \dfrac{1}{ y_{ni}} \dfrac{(e^{\boldsymbol{w}_i^{T}\boldsymbol{x}_n})^{'}\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}-e^{\boldsymbol{w}_i^{T}\boldsymbol{x}_n}\left(\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n} \right)^{'}}{\left(\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}\right)^2}\\
&=-\sum_{n\in C_i} \dfrac{1}{ y_{ni}} \dfrac{-e^{\boldsymbol{w}_i^{T}\boldsymbol{x}_n}(e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n})\boldsymbol{x}_n}{\left(\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}\right)^2}\\
&=-\sum_{n\in C_i} \dfrac{1}{ y_{ni}} \dfrac{e^{\boldsymbol{w}_i^{T}\boldsymbol{x}_n}}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}}\dfrac{-(e^{\boldsymbol{w}_k^{T}\boldsymbol{x}_n})\boldsymbol{x}_n}{\sum\limits_{j=1}^K e^{\boldsymbol{w}_j^{T}\boldsymbol{x}_n}}\\
&=-\sum_{n\in C_i} \dfrac{1}{ y_{ni}} y_{ni}(-y_{nk})\boldsymbol{x}_n\\
&=-\sum_{n\in C_i}(-y_{nk})\boldsymbol{x}_n
\end{aligned}$$

Here $(e^{\boldsymbol{w}_i^{T}\boldsymbol{x}_n})^{'}=0$ because $\boldsymbol w_i$ does not depend on $\boldsymbol w_k$ when $i\neq k$.

Combining the two results, and noting that $t_{nk}=1$ for $n\in C_k$ while $t_{nk}=0$ for $n\in C_i\ (i\neq k)$, both formulas can be written as:

$$\dfrac{\partial l(\boldsymbol W)}{\partial \boldsymbol w_k}=-\sum_{n=1}^N (t_{nk}-y_{nk})\boldsymbol{x}_n$$

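The combined gradient formula can be sanity-checked against central finite differences. The sketch below does this on small random data. The function names (`softmax_rows`, `loss`, `grad`) and the problem sizes are assumptions made for the example, and the weights are stored as a $D\times K$ matrix whose $k$-th column is $\boldsymbol w_k$.

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(W, X, T):
    """Cross-entropy l(W); column k of W is w_k (bias assumed absorbed into X)."""
    return -np.sum(T * np.log(softmax_rows(X @ W)))

def grad(W, X, T):
    """Analytic gradient: column k is -sum_n (t_nk - y_nk) x_n."""
    return -X.T @ (T - softmax_rows(X @ W))

rng = np.random.default_rng(0)
N, D, K = 6, 4, 3
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]    # random one-hot targets
W = rng.normal(size=(D, K))

# Central finite differences, perturbing one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for d in range(D):
    for k in range(K):
        step = np.zeros_like(W)
        step[d, k] = eps
        numeric[d, k] = (loss(W + step, X, T) - loss(W - step, X, T)) / (2 * eps)

print(np.max(np.abs(numeric - grad(W, X, T))))   # should be tiny
```

Stacking the $K$ per-class gradients as columns gives the single matrix expression `-X.T @ (T - Y)`, which is the form used by vectorized implementations.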
With gradient descent, the weight update formula is:

$$\boldsymbol{W}^{(m+1)}=\boldsymbol{W}^{(m)}-\alpha\dfrac{\partial l(\boldsymbol W)}{\partial \boldsymbol W}$$

where $\alpha$ is the step size of gradient descent.

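Before turning to MNIST, the update rule can be tried on a toy problem. The sketch below is an illustration under stated assumptions: the data are three synthetic Gaussian blobs with made-up centers, and the step size `alpha = 0.001` and the 200 iterations were picked for this toy data only.

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, D, per_class = 3, 2, 50
# Synthetic data: one unit-variance Gaussian blob per class (hypothetical centers).
centers = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
X = np.vstack([c + rng.normal(size=(per_class, D)) for c in centers])
labels = np.repeat(np.arange(K), per_class)
X = np.hstack([X, np.ones((len(X), 1))])      # append 1 to absorb the bias
T = np.eye(K)[labels]                         # one-hot targets

W = np.zeros((D + 1, K))
alpha = 0.001
losses = []
for m in range(200):
    Y = softmax_rows(X @ W)
    losses.append(-np.sum(T * np.log(Y)))     # cross-entropy before the update
    W = W - alpha * (-X.T @ (T - Y))          # W <- W - alpha * dl/dW

accuracy = np.mean(np.argmax(X @ W, axis=1) == labels)
print(losses[0], losses[-1], accuracy)
```

Because the gradient is summed over all samples rather than averaged, a suitable `alpha` shrinks as the data set grows or as the feature values get larger.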
Code implementation (MNIST dataset)
import numpy as np
from dataset.mnist import load_mnist


def softmax_train(train, target, alpha, num):
    """Fit softmax regression by batch gradient descent and return the weights."""
    # Append a constant 1 to every sample so the bias is absorbed into beta.
    xhat = np.concatenate((train, np.ones((len(train), 1))), axis=1)
    nparam = xhat.shape[1]               # 785 for flattened MNIST (784 pixels + 1)
    beta = np.random.rand(nparam, 10)    # 785 x 10, one column per class
    for i in range(num):
        wtx = np.dot(xhat, beta)
        # Subtract the row-wise maximum before exponentiating to avoid overflow.
        wtx1 = wtx - np.max(wtx, axis=1, keepdims=True)
        e_wtx = np.exp(wtx1)
        yx = e_wtx / np.sum(e_wtx, axis=1, keepdims=True)
        print(' #' + str(i + 1) + ' : ' + str(cross_entropy(yx, target)))
        # The gradient is -xhat.T @ (target - yx), so descent adds alpha * t2.
        t1 = target - yx
        t2 = np.dot(xhat.T, t1)
        beta = beta + alpha * t2
    return beta


def cross_entropy(yx, t):
    """Mean cross-entropy between softmax outputs yx and one-hot targets t."""
    sum1 = np.sum(yx * t, axis=1)        # probability given to the correct class
    ewx = np.log(sum1 + 0.000001)        # small constant guards against log(0)
    return -np.sum(ewx) / len(yx)


def classification(test, beta, test_t):
    """Predict a class for every test sample and print the fraction correct."""
    xhat = np.concatenate((test, np.ones((len(test), 1))), axis=1)
    wtx = np.dot(xhat, beta)
    # argmax yields exactly one label per row, even when two scores tie.
    output = np.argmax(wtx, axis=1)
    print("Percentage Correct: ", np.mean(output == test_t))
    return np.array(output, dtype=np.uint8)


if __name__ == '__main__':
    (x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)
    nread = 60000
    train_in = x_train[:nread, :]
    test_in = x_test[:10000, :]
    test_t = t_test[:10000]
    # One-hot encode the training labels.
    train_tgt = np.zeros((nread, 10))
    train_tgt[np.arange(nread), t_train[:nread]] = 1
    beta = softmax_train(train_in, train_tgt, 0.001, 60)
    print(beta)
    result = classification(test_in, beta, test_t)
Test results:
#1 : 5.626381119337011
#2 : 5.415158063701459
#3 : 10.959830171565791
#4 : 8.062787294189338
#5 : 7.4643357380759765
#6 : 9.070059164063883
#7 : 9.81079287953052
#8 : 7.13921201579068
#9 : 7.176904417794094
#10 : 4.607102717465571
#11 : 3.9215536116316625
#12 : 4.199011112147004
#13 : 4.135313269465135
#14 : 3.214738972020379
#15 : 2.804664146283606
#16 : 2.901161881757491
#17 : 2.9996749271603456
#18 : 2.609904566490558
#19 : 2.6169338357951197
#20 : 2.538795429964946
#21 : 2.7159497447897256
#22 : 2.634980803678192
#23 : 2.974848646434367
#24 : 3.1286179795674154
#25 : 3.2208869228881407
#26 : 2.548910343301664
#27 : 2.5298981152704743
#28 : 2.3826001247525035
#29 : 2.4498572463653243
#30 : 2.3521370651353837
#31 : 2.4309032741212664
#32 : 2.366133209606206
#33 : 2.4462922376053364
#34 : 2.3850487760328933
#35 : 2.4481429887352792
#36 : 2.370067560256672
#37 : 2.376729198498193
#38 : 2.297488373847759
#39 : 2.265126273640295
#40 : 2.258495714414137
#41 : 2.327524884607823
#42 : 2.3130200962416128
#43 : 2.290046983208286
#44 : 2.1465196716967805
#45 : 2.0969060851949677
#46 : 1.8901858209971119
#47 : 1.844354795879705
#48 : 1.6340799726564934
#49 : 1.60064459794013
#50 : 1.4667008762515674
#51 : 1.4453938385590863
#52 : 1.3767004735390218
#53 : 1.359619935503484
#54 : 1.3153462460865966
#55 : 1.309895715988472
#56 : 1.2799649790773286
#57 : 1.2807586745656392
#58 : 1.2559139323742572
#59 : 1.2582212637839076
#60 : 1.237819660093416
Weights:
[[7.69666472e-01 2.16009202e-01 9.81729719e-01 … 5.32453082e-01
7.88719040e-01 5.14326954e-01]
[3.90401951e-01 5.84040914e-01 7.94883641e-01 … 8.02009249e-01
3.29345264e-02 6.70861290e-01]
[8.69075434e-02 8.43381782e-01 4.77683466e-01 … 8.71965798e-01
4.47018470e-04 5.07498017e-01]
…
[7.96129468e-01 6.14364951e-01 8.32783158e-01 … 6.53493763e-01
2.06235991e-01 8.60469591e-01]
[1.67070291e-01 3.23211147e-02 2.41519794e-01 … 6.56026583e-01
5.98396521e-01 5.42304452e-01]
[8.43299673e-01 6.22843596e-01 6.05652099e-02 … 1.10339403e-01
1.61855811e-01 3.29385438e-01]]
Recognition rate:
Percentage Correct: 0.9037