Basic Model
Softmax regression is a generalization of logistic regression: it extends the logistic activation function to $C$ classes (where $C$ is the number of outputs of the neural network) rather than just two, making it a multi-class classifier. When $C = 2$, softmax regression reduces to logistic regression.
Logistic regression uses the sigmoid function to map $\mathbf{w}x+b$ into the interval $(0, 1)$, and its output is the probability that the sample's label equals 1. Softmax regression instead uses the softmax function to map $\mathbf{w}x+b$ into $[0, 1]$, and its output is a vector whose entries are the probabilities that the sample belongs to each label.
Suppose the softmax model has $n$ inputs. Write the weights as $w_i = (w_{i1}, w_{i2}, \cdots, w_{in})^T$ with bias $b_i$, $i = 1, 2, \ldots, K$, and the samples as $x^{(j)} = (x_{j1}, x_{j2}, \ldots, x_{jn})^T$, $j = 1, 2, \ldots, m$; there are $K$ classes and $m$ samples in total.
Define:
$$z_i = w_i^T x + b_i$$
$$h_w(x^{(j)}) = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_K \end{bmatrix} = \frac{1}{\sum_{i=1}^K e^{z_i}} \begin{bmatrix} e^{z_1} \\ e^{z_2} \\ \vdots \\ e^{z_K} \end{bmatrix}$$
There are $K$ classes in total. The class corresponding to the largest entry of the vector above is taken as the predicted class.
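As a concrete illustration, here is a minimal NumPy sketch of this forward pass; the names (`softmax`, `h`, `W`, `b`) and the sizes are hypothetical choices, not fixed by the text:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to every z_i.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def h(W, b, x):
    # W: (K, n) weights with rows w_i, b: (K,) biases, x: (n,) input.
    # Returns the probability vector [p_1, ..., p_K].
    z = W @ x + b              # z_i = w_i^T x + b_i
    return softmax(z)

# Hypothetical sizes: n = 4 inputs, K = 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
p = h(W, b, x)
print(p, p.sum())                       # entries in [0, 1], summing to 1
print("predicted class:", p.argmax())   # class with the largest probability
```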
Loss Function
The loss function for softmax classification is the negative log-likelihood, which we minimize:
$$\begin{aligned} L(w) &= - \log P(y^{(i)} \mid x^{(i)}; w) \\ &= -\log \prod_{k=1}^{K} \left(\frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\right)^{y_k} \\ &= -\sum_{k=1}^K y_k \log\left(\frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}\right) \end{aligned}$$
Note: $y_k = I\{y^{(j)} = k\}$ is the indicator function: when $y^{(j)} = k$, i.e., when the $j$-th sample belongs to the $k$-th class, the indicator equals 1. Equivalently, the label $y$ of a sample $x$ can be viewed as a vector $y = (y_1, y_2, \ldots, y_K)$ in which exactly one element is 1, e.g., $y = (1, 0, \ldots, 0)$.
Our goal is:
$$\min_w L(w)$$
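A short sketch of this loss in NumPy, reusing the `softmax` helper and the variables from the sketch above and assuming a one-hot label vector `y`:

```python
def cross_entropy(W, b, x, y):
    # L(w) = -sum_k y_k * log(s_k); with a one-hot y this is simply
    # the negative log-probability assigned to the true class.
    s = softmax(W @ x + b)
    return -np.sum(y * np.log(s))

y = np.array([1.0, 0.0, 0.0])   # hypothetical one-hot label (K = 3)
print(cross_entropy(W, b, x, y))
```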
Solving for the Optimal Parameters
We solve for the optimal parameters by gradient descent.
Define the $i$-th output for a given sample as:
$$s_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}, \quad i = 1, 2, \ldots, K$$
For a single sample:
$$\begin{aligned} \frac{\partial L}{\partial w_i} &= \frac{\partial L}{\partial z_i} \frac{\partial z_i}{\partial w_i} \\ \frac{\partial L}{\partial b_i} &= \frac{\partial L}{\partial z_i} \frac{\partial z_i}{\partial b_i} \end{aligned}$$
Clearly:
$$\frac{\partial z_i}{\partial w_i} = x, \qquad \frac{\partial z_i}{\partial b_i} = 1$$
So the core problem is to compute $\frac{\partial L}{\partial z_i}$:
$$\frac{\partial L}{\partial z_i} = \sum_{k=1}^K \left[ \frac{\partial L}{\partial s_k} \frac{\partial s_k}{\partial z_i} \right]$$
First compute $\frac{\partial L}{\partial s_k}$:
$$\frac{\partial L}{\partial s_k} = \frac{\partial \left(-\sum_{k=1}^K y_k \log s_k \right)}{\partial s_k} = - \frac{y_k}{s_k}$$
Next we compute $\frac{\partial s_k}{\partial z_i}$. First, recall the quotient rule for derivatives:
$$f(x) = \frac{g(x)}{h(x)}, \qquad f'(x) = \frac{g'(x) h(x) - g(x) h'(x)}{[h(x)]^2}$$
We then consider two cases:
(1) When $k \ne i$:
$$\begin{aligned} \frac{\partial s_k}{\partial z_i} &= \frac{\partial}{\partial z_i} \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \\ &= \frac{-e^{z_k} \cdot e^{z_i}}{\left(\sum_{j=1}^K e^{z_j}\right)^2} \\ &= -\frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \\ &= -s_k s_i \end{aligned}$$
(2) When $k = i$:
$$\begin{aligned} \frac{\partial s_k}{\partial z_i} &= \frac{\partial s_i}{\partial z_i} = \frac{\partial}{\partial z_i} \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \\ &= \frac{e^{z_i} \sum_{j=1}^K e^{z_j} - (e^{z_i})^2}{\left(\sum_{j=1}^K e^{z_j}\right)^2} \\ &= \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \cdot \frac{\sum_{j=1}^K e^{z_j} - e^{z_i}}{\sum_{j=1}^K e^{z_j}} \\ &= s_i (1 - s_i) \end{aligned}$$
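Both cases fold into one expression, $\frac{\partial s_k}{\partial z_i} = s_k(\delta_{ki} - s_i)$, which can be checked numerically. A small sketch comparing this analytic Jacobian against central finite differences, reusing the `softmax` helper above:

```python
def softmax_jacobian(z):
    # J[k, i] = ds_k/dz_i = s_k * (delta_{ki} - s_i):
    # s_i (1 - s_i) on the diagonal, -s_k s_i off the diagonal.
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = rng.normal(size=3)
J = softmax_jacobian(z)

# Independent check via central finite differences.
eps = 1e-6
J_num = np.zeros_like(J)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    J_num[:, i] = (softmax(zp) - softmax(zm)) / (2 * eps)

print(np.max(np.abs(J - J_num)))   # tiny, e.g. on the order of 1e-11
```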
Therefore:
$$\begin{aligned} \frac{\partial L}{\partial z_i} &= \sum_{k=1}^{K}\left[\frac{\partial L}{\partial s_k} \frac{\partial s_k}{\partial z_i}\right] = \sum_{k=1}^{K}\left[-\frac{y_k}{s_k} \frac{\partial s_k}{\partial z_i}\right] \\ &= -\frac{y_i}{s_i} \frac{\partial s_i}{\partial z_i} + \sum_{k=1, k \neq i}^{K}\left[-\frac{y_k}{s_k} \frac{\partial s_k}{\partial z_i}\right] \\ &= -\frac{y_i}{s_i} s_i\left(1 - s_i\right) + \sum_{k=1, k \neq i}^{K}\left[-\frac{y_k}{s_k} \cdot \left(-s_k s_i\right)\right] \\ &= y_i\left(s_i - 1\right) + \sum_{k=1, k \neq i}^{K} y_k s_i \\ &= -y_i + y_i s_i + \sum_{k=1, k \neq i}^{K} y_k s_i \\ &= -y_i + s_i \sum_{k=1}^{K} y_k \end{aligned}$$
As noted above, the label $y$ of a sample $x$ is a vector $y = (y_1, y_2, \ldots, y_K)$ with exactly one element equal to 1, e.g., $y = (1, 0, \ldots, 0)$, so $\sum_{k=1}^{K} y_k = 1$. Therefore:
$$\frac{\partial L}{\partial z_i} = s_i - y_i$$
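The same finite-difference technique confirms this compact result directly on the loss (a sketch reusing `softmax`, `W`, `b`, `x`, and `y` from the earlier sketches):

```python
z = W @ x + b
s = softmax(z)
grad_analytic = s - y          # dL/dz_i = s_i - y_i

# Numerical gradient of L(z) = -sum_k y_k log(softmax(z)_k).
eps = 1e-6
grad_num = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    grad_num[i] = (-np.sum(y * np.log(softmax(zp)))
                   + np.sum(y * np.log(softmax(zm)))) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_num)))   # agreement to ~1e-10
```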
So the final result is:
$$\frac{\partial L}{\partial w_i} = (s_i - y_i)x, \qquad \frac{\partial L}{\partial b_i} = s_i - y_i$$
The update rules are therefore:
$$w_i \leftarrow w_i - \eta (s_i - y_i)x, \qquad b_i \leftarrow b_i - \eta (s_i - y_i)$$
Iterate until convergence.
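Putting the pieces together, here is a minimal sketch of per-sample gradient-descent updates under these rules; the data, learning rate `eta`, and fixed epoch count (standing in for a convergence test) are hypothetical choices:

```python
def train(X, Y, K, eta=0.1, epochs=200):
    # X: (m, n) samples; Y: (m, K) one-hot labels.
    m, n = X.shape
    W = np.zeros((K, n))
    b = np.zeros(K)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            s = softmax(W @ x + b)
            W -= eta * np.outer(s - y, x)   # w_i <- w_i - eta (s_i - y_i) x
            b -= eta * (s - y)              # b_i <- b_i - eta (s_i - y_i)
    return W, b

# Toy usage: three well-separated clusters in the plane (K = 3, n = 2).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 30)
Y = np.eye(3)[labels]                      # one-hot labels
W, b = train(X, Y, K=3)
pred = np.array([np.argmax(softmax(W @ x + b)) for x in X])
print("training accuracy:", np.mean(pred == labels))
```

In practice the loop would stop once the loss change falls below a tolerance; the fixed epoch count simply keeps the sketch short.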