Understanding Softmax Regression in Depth
The earlier articles on Logistic regression and on shallow and deep fully connected networks all used binary classification as the running example (click my avatar and open the 机器学习与深度学习 category to find my other articles). A binary classification model is typically characterized by a sigmoid activation at the output layer and a single neuron in that layer.
This article explains how to use the Softmax activation function to solve multi-class problems. Before introducing fully connected networks for multi-class classification, we begin with the simplest multi-class model: Softmax regression.
1. The Label Format of Softmax Regression
In binary classification we used 0 and 1 to represent the two kinds of samples. For example, to build a network that decides whether a picture shows a cat, we can define "is a cat" as 1 and "is not a cat" as 0, as shown in the figure below.
Now suppose we want to know not merely whether the picture shows a cat, but which specific category it belongs to; the task then becomes multi-class classification. We adopt a vector-style label definition, as shown in the figure below:
This label format is index-based: the vector's length equals the number of classes, exactly one element is 1 while all others are 0, and the index of that 1 identifies the class. In the figure above, the 1 in the cat label sits at index 0, the dog's at index 1, the mouse's at index 2, and the snake's at index 3.
At this point one might ask: why not simply use 0, 1, 2, 3 as the labels; why define labels in such a peculiar form? The reason will be explained later.
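For reference, converting integer class indices into these vector labels takes only a few lines of NumPy. A minimal sketch (the helper name `to_one_hot` is mine, not from the original article):

```python
import numpy as np

def to_one_hot(indices, num_classes):
    """Turn integer class indices into one-hot column vectors."""
    one_hot = np.zeros((num_classes, len(indices)))
    one_hot[indices, np.arange(len(indices))] = 1
    return one_hot

# cat=0, dog=1, mouse=2, snake=3
print(to_one_hot([0, 1, 2, 3], num_classes=4))
# each column is one sample's label; column 0 is (1,0,0,0)^T, i.e. "cat"
```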
2. The Structure of Softmax Regression
The structure of Softmax regression is shown in the figure below.
The figure corresponds to a three-class problem and contains three neurons (six circles are drawn, but each Z and its A form a pair that together makes up one neuron). The job of the $i$-th neuron is to estimate the probability that the input sample belongs to class $i-1$. For example, in the figure, given input data $X$ the network outputs the column vector $A = (0.83, 0.12, 0.05)$: the predicted probability is 0.83 for class 0, 0.12 for class 1, and 0.05 for class 2. The network then sets the largest of these probabilities to 1 and the rest to 0, producing the output vector $(1, 0, 0)$, from which we conclude the sample belongs to class 0.
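The "set the largest probability to 1" step is simply an argmax. A tiny sketch of my own for illustration:

```python
import numpy as np

A = np.array([0.83, 0.12, 0.05])  # network output for one sample
pred = np.zeros_like(A)
pred[A.argmax()] = 1              # winner takes all
print(pred)                       # [1. 0. 0.] -> class 0
```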
So in principle the vector-style label better matches the physical meaning of Softmax. But that is not all: we will see later that this label format also establishes a connection between Softmax regression and Logistic regression.
This example used a single input sample. Note that in actual training there are many input samples, and the output labels generally take the following form.
$$\hat{Y}=\begin{pmatrix} 1 & 0 & \cdots & 1 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & \cdots & 0 \end{pmatrix}$$
3. The Principle of Softmax Regression
In the derivations below, whenever matrices of different dimensions are added or subtracted, Python-style broadcasting is assumed throughout.
3.1 Notation
Suppose there are $M$ training samples, the input has length $n_0$, and the output has length $n_1$. Let the learning rate be $\alpha$. Define the training data $X$ and the labels $Y$ in the following form:
$$X=\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_{n_0} \end{pmatrix}=\begin{pmatrix} x_1^{[1]} & x_1^{[2]} & \cdots & x_1^{[M]} \\ x_2^{[1]} & x_2^{[2]} & \cdots & x_2^{[M]} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n_0}^{[1]} & x_{n_0}^{[2]} & \cdots & x_{n_0}^{[M]} \end{pmatrix}$$
$$Y=\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{n_1} \end{pmatrix}=\begin{pmatrix} y_1^{[1]} & y_1^{[2]} & \cdots & y_1^{[M]} \\ y_2^{[1]} & y_2^{[2]} & \cdots & y_2^{[M]} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n_1}^{[1]} & y_{n_1}^{[2]} & \cdots & y_{n_1}^{[M]} \end{pmatrix}$$
In a multi-class problem, each column $\left( y_1^{[m]} \; y_2^{[m]} \; \cdots \; y_{n_1}^{[m]} \right)^T$ of the label matrix has exactly one entry $y_i^{[m]}$ equal to 1, with all others 0. Define the weight matrix $W$ and the bias vector $B$ in the following form:
$$W=\begin{pmatrix} W_1^T \\ W_2^T \\ \vdots \\ W_{n_1}^T \end{pmatrix}=\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n_0} \\ w_{21} & w_{22} & \cdots & w_{2n_0} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_11} & w_{n_12} & \cdots & w_{n_1n_0} \end{pmatrix}$$
$$B=\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n_1} \end{pmatrix}$$
3.2 Forward Propagation
First, define the Softmax activation function as:
$$g(z_1,z_2,\cdots,z_n)=\left( \frac{e^{z_1}}{\sum\limits_{i=1}^{n}e^{z_i}},\ \frac{e^{z_2}}{\sum\limits_{i=1}^{n}e^{z_i}},\ \cdots,\ \frac{e^{z_n}}{\sum\limits_{i=1}^{n}e^{z_i}} \right)^T$$
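As a quick sanity check of this definition, here is a minimal NumPy sketch. Subtracting `max(z)` before exponentiating is an addition of mine for numerical stability; it cancels in the ratio, so the result is unchanged (the implementation in Section 4 below omits it):

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array; shifting by max(z) avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

a = softmax(np.array([2.0, 1.0, 0.1]))
print(a)        # approx. [0.659 0.242 0.099]
print(a.sum())  # 1.0 -- the outputs form a probability distribution
```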
Forward propagation works much as in Logistic regression: first compute $Z$, then compute $A$. The computation is as follows:
$$Z=\begin{pmatrix} z_1^{[1]} & z_1^{[2]} & \cdots & z_1^{[M]} \\ z_2^{[1]} & z_2^{[2]} & \cdots & z_2^{[M]} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n_1}^{[1]} & z_{n_1}^{[2]} & \cdots & z_{n_1}^{[M]} \end{pmatrix}=\begin{pmatrix} \sum\limits_{i=1}^{n_0} w_{1i}x_i^{[1]}+b_1 & \sum\limits_{i=1}^{n_0} w_{1i}x_i^{[2]}+b_1 & \cdots & \sum\limits_{i=1}^{n_0} w_{1i}x_i^{[M]}+b_1 \\ \sum\limits_{i=1}^{n_0} w_{2i}x_i^{[1]}+b_2 & \sum\limits_{i=1}^{n_0} w_{2i}x_i^{[2]}+b_2 & \cdots & \sum\limits_{i=1}^{n_0} w_{2i}x_i^{[M]}+b_2 \\ \vdots & \vdots & \ddots & \vdots \\ \sum\limits_{i=1}^{n_0} w_{n_1i}x_i^{[1]}+b_{n_1} & \sum\limits_{i=1}^{n_0} w_{n_1i}x_i^{[2]}+b_{n_1} & \cdots & \sum\limits_{i=1}^{n_0} w_{n_1i}x_i^{[M]}+b_{n_1} \end{pmatrix}=WX+B$$
$$A=\begin{pmatrix} a_1^{[1]} & a_1^{[2]} & \cdots & a_1^{[M]} \\ a_2^{[1]} & a_2^{[2]} & \cdots & a_2^{[M]} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n_1}^{[1]} & a_{n_1}^{[2]} & \cdots & a_{n_1}^{[M]} \end{pmatrix}=\begin{pmatrix} g\left(z_1^{[1]},z_2^{[1]},\cdots,z_{n_1}^{[1]}\right) & g\left(z_1^{[2]},z_2^{[2]},\cdots,z_{n_1}^{[2]}\right) & \cdots & g\left(z_1^{[M]},z_2^{[M]},\cdots,z_{n_1}^{[M]}\right) \end{pmatrix}=g(Z)$$
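As a small self-check of the two forward steps, here is a sketch with toy dimensions of my own choosing ($n_0=2$, $n_1=3$, $M=4$) and random values:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, M = 2, 3, 4                      # features, classes, samples
W = rng.normal(size=(n1, n0)) * 0.01
B = rng.normal(size=(n1, 1)) * 0.01
X = rng.normal(size=(n0, M))             # one sample per column

Z = W @ X + B                            # shape (n1, M); B broadcasts across columns
E = np.exp(Z - Z.max(axis=0, keepdims=True))
A = E / E.sum(axis=0, keepdims=True)     # column-wise softmax
print(A.shape)                           # (3, 4)
print(A.sum(axis=0))                     # every column sums to 1
```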
In a multi-class problem the cost function differs from the binary case; it is defined as:
$$J=\frac{1}{M}\sum\limits_{m=1}^{M}\sum\limits_{i=1}^{n_1}-y_i^{[m]}\ln\left(a_i^{[m]}\right)$$
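When `A` and `Y` are the $n_1 \times M$ matrices defined above, the double sum collapses into one vectorized expression. A minimal sketch (the small `eps` guard against $\ln 0$ is my addition):

```python
import numpy as np

def cost(A, Y, eps=1e-12):
    """Cross-entropy cost J = -(1/M) * sum over m, i of y_i^[m] * ln(a_i^[m])."""
    M = Y.shape[1]
    return -np.sum(Y * np.log(A + eps)) / M

# toy check: 3 classes, 2 samples
A = np.array([[0.8, 0.1],
              [0.1, 0.7],
              [0.1, 0.2]])
Y = np.array([[1, 0],
              [0, 1],
              [0, 0]])
print(cost(A, Y))  # equals -(ln 0.8 + ln 0.7) / 2
```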
3.3 Backward Propagation
As with Logistic regression, backward propagation computes the gradients of the cost function with respect to the network parameters, $\frac{\partial J}{\partial W}$ and $\frac{\partial J}{\partial B}$. The computation proceeds as follows.
By the chain rule, first compute each $\frac{\partial J}{\partial a_i^{[m]}}$:
$$\frac{\partial J}{\partial a_i^{[m]}}=-\frac{1}{M}\frac{y_i^{[m]}}{a_i^{[m]}}$$
Next, compute each $\frac{\partial a_i^{[m]}}{\partial z_j^{[m]}}$. Two cases must be distinguished.
When $i \ne j$:
$$\frac{\partial a_i^{[m]}}{\partial z_j^{[m]}}=\frac{\partial}{\partial z_j^{[m]}}\left(\frac{e^{z_i^{[m]}}}{e^{z_1^{[m]}}+e^{z_2^{[m]}}+\cdots+e^{z_j^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}}\right)=-\frac{e^{z_i^{[m]}}e^{z_j^{[m]}}}{\left(e^{z_1^{[m]}}+e^{z_2^{[m]}}+\cdots+e^{z_j^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}\right)^2}$$

$$=-\frac{e^{z_i^{[m]}}}{e^{z_1^{[m]}}+e^{z_2^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}}\cdot\frac{e^{z_j^{[m]}}}{e^{z_1^{[m]}}+e^{z_2^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}}=-a_i^{[m]}a_j^{[m]}$$
When $i = j$:
$$\frac{\partial a_i^{[m]}}{\partial z_j^{[m]}}=\frac{\partial a_j^{[m]}}{\partial z_j^{[m]}}=\frac{\partial}{\partial z_j^{[m]}}\left(\frac{e^{z_j^{[m]}}}{e^{z_1^{[m]}}+e^{z_2^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}}\right)=\frac{e^{z_j^{[m]}}\left(e^{z_1^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}\right)-e^{z_j^{[m]}}e^{z_j^{[m]}}}{\left(e^{z_1^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}\right)^2}$$

$$=\frac{e^{z_j^{[m]}}}{e^{z_1^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}}-\left(\frac{e^{z_j^{[m]}}}{e^{z_1^{[m]}}+\cdots+e^{z_{n_1}^{[m]}}}\right)^2=a_j^{[m]}-\left(a_j^{[m]}\right)^2$$
Summarizing:
$$\frac{\partial a_i^{[m]}}{\partial z_j^{[m]}}=\begin{cases} -a_i^{[m]}a_j^{[m]} & i\ne j \\ a_j^{[m]}-\left(a_j^{[m]}\right)^2 & i=j \end{cases}$$
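In matrix form this case analysis says that the Jacobian of the softmax is $\operatorname{diag}(a)-aa^T$. A minimal sketch that builds it directly from a softmax output (the function name is mine):

```python
import numpy as np

def softmax_jacobian(a):
    """Jacobian da_i/dz_j of softmax, given its output vector a:
    a_j - a_j^2 on the diagonal (i == j), -a_i * a_j off it (i != j)."""
    return np.diag(a) - np.outer(a, a)

a = np.array([0.83, 0.12, 0.05])
print(softmax_jacobian(a))
```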
Substituting the above into $\frac{\partial J}{\partial Z}$ gives:
$$\frac{\partial J}{\partial Z}=\begin{pmatrix} \frac{\partial J}{\partial z_1^{[1]}} & \frac{\partial J}{\partial z_1^{[2]}} & \cdots & \frac{\partial J}{\partial z_1^{[M]}} \\ \frac{\partial J}{\partial z_2^{[1]}} & \frac{\partial J}{\partial z_2^{[2]}} & \cdots & \frac{\partial J}{\partial z_2^{[M]}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial J}{\partial z_{n_1}^{[1]}} & \frac{\partial J}{\partial z_{n_1}^{[2]}} & \cdots & \frac{\partial J}{\partial z_{n_1}^{[M]}} \end{pmatrix}=\begin{pmatrix} \sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[1]}}\frac{\partial a_i^{[1]}}{\partial z_1^{[1]}} & \sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[2]}}\frac{\partial a_i^{[2]}}{\partial z_1^{[2]}} & \cdots & \sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[M]}}\frac{\partial a_i^{[M]}}{\partial z_1^{[M]}} \\ \vdots & \vdots & \ddots & \vdots \\ \sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[1]}}\frac{\partial a_i^{[1]}}{\partial z_{n_1}^{[1]}} & \sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[2]}}\frac{\partial a_i^{[2]}}{\partial z_{n_1}^{[2]}} & \cdots & \sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[M]}}\frac{\partial a_i^{[M]}}{\partial z_{n_1}^{[M]}} \end{pmatrix}$$
Plugging in $\frac{\partial J}{\partial a_i^{[m]}}$ and $\frac{\partial a_i^{[m]}}{\partial z_j^{[m]}}$ yields:
$$\frac{\partial J}{\partial z_j^{[m]}}=\sum\limits_{i=1}^{n_1}\frac{\partial J}{\partial a_i^{[m]}}\frac{\partial a_i^{[m]}}{\partial z_j^{[m]}}=-\frac{1}{M}\left(\frac{y_1^{[m]}}{a_1^{[m]}}\left(-a_1^{[m]}a_j^{[m]}\right)+\frac{y_2^{[m]}}{a_2^{[m]}}\left(-a_2^{[m]}a_j^{[m]}\right)+\cdots+\frac{y_j^{[m]}}{a_j^{[m]}}\left(a_j^{[m]}-\left(a_j^{[m]}\right)^2\right)+\cdots+\frac{y_{n_1}^{[m]}}{a_{n_1}^{[m]}}\left(-a_{n_1}^{[m]}a_j^{[m]}\right)\right)$$

$$=-\frac{1}{M}\left(y_j^{[m]}-y_1^{[m]}a_j^{[m]}-y_2^{[m]}a_j^{[m]}-\cdots-y_{n_1}^{[m]}a_j^{[m]}\right)=-\frac{1}{M}\left(y_j^{[m]}-a_j^{[m]}\left(y_1^{[m]}+y_2^{[m]}+\cdots+y_{n_1}^{[m]}\right)\right)$$
Because we stipulated earlier that each column $\left( y_1^{[m]} \; y_2^{[m]} \; \cdots \; y_{n_1}^{[m]} \right)^T$ has exactly one entry $y_i^{[m]}$ equal to 1 and the rest 0, it follows that:
$$y_1^{[m]}+y_2^{[m]}+\cdots+y_{n_1}^{[m]}=1$$
Substituting this back gives:
$$\frac{\partial J}{\partial z_j^{[m]}}=-\frac{1}{M}\left(y_j^{[m]}-a_j^{[m]}\right)=\frac{1}{M}\left(a_j^{[m]}-y_j^{[m]}\right)$$
Collecting every $\frac{\partial J}{\partial z_j^{[m]}}$ into $\frac{\partial J}{\partial Z}$ then gives:
$$\frac{\partial J}{\partial Z}=\frac{1}{M}\left(A-Y\right)$$
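This remarkably clean result is easy to verify numerically. Below is a sketch of my own (all names are mine) comparing $(A-Y)/M$ against a central finite-difference estimate of $\partial J/\partial Z$:

```python
import numpy as np

def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def J(Z, Y):
    return -np.sum(Y * np.log(softmax_cols(Z))) / Y.shape[1]

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 5))
Y = np.eye(3)[:, rng.integers(0, 3, size=5)]  # random one-hot columns

analytic = (softmax_cols(Z) - Y) / Y.shape[1]

numeric = np.zeros_like(Z)
h = 1e-6
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += h
        Zm[i, j] -= h
        numeric[i, j] = (J(Zp, Y) - J(Zm, Y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # on the order of 1e-10
```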
Next, compute $\frac{\partial J}{\partial W}$ and $\frac{\partial J}{\partial B}$:
$$\frac{\partial J}{\partial W}=\begin{pmatrix} \frac{\partial J}{\partial w_{11}} & \frac{\partial J}{\partial w_{12}} & \cdots & \frac{\partial J}{\partial w_{1n_0}} \\ \frac{\partial J}{\partial w_{21}} & \frac{\partial J}{\partial w_{22}} & \cdots & \frac{\partial J}{\partial w_{2n_0}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial J}{\partial w_{n_11}} & \frac{\partial J}{\partial w_{n_12}} & \cdots & \frac{\partial J}{\partial w_{n_1n_0}} \end{pmatrix}=\begin{pmatrix} \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_1^{[m]}}\frac{\partial z_1^{[m]}}{\partial w_{11}} & \cdots & \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_1^{[m]}}\frac{\partial z_1^{[m]}}{\partial w_{1n_0}} \\ \vdots & \ddots & \vdots \\ \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_{n_1}^{[m]}}\frac{\partial z_{n_1}^{[m]}}{\partial w_{n_11}} & \cdots & \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_{n_1}^{[m]}}\frac{\partial z_{n_1}^{[m]}}{\partial w_{n_1n_0}} \end{pmatrix}$$

$$=\begin{pmatrix} \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_1^{[m]}}x_1^{[m]} & \cdots & \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_1^{[m]}}x_{n_0}^{[m]} \\ \vdots & \ddots & \vdots \\ \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_{n_1}^{[m]}}x_1^{[m]} & \cdots & \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_{n_1}^{[m]}}x_{n_0}^{[m]} \end{pmatrix}=\begin{pmatrix} \frac{\partial J}{\partial z_1^{[1]}} & \cdots & \frac{\partial J}{\partial z_1^{[M]}} \\ \vdots & \ddots & \vdots \\ \frac{\partial J}{\partial z_{n_1}^{[1]}} & \cdots & \frac{\partial J}{\partial z_{n_1}^{[M]}} \end{pmatrix}\begin{pmatrix} x_1^{[1]} & \cdots & x_{n_0}^{[1]} \\ \vdots & \ddots & \vdots \\ x_1^{[M]} & \cdots & x_{n_0}^{[M]} \end{pmatrix}=\frac{\partial J}{\partial Z}X^T$$
$$\frac{\partial J}{\partial B}=\begin{pmatrix} \frac{\partial J}{\partial b_1} \\ \frac{\partial J}{\partial b_2} \\ \vdots \\ \frac{\partial J}{\partial b_{n_1}} \end{pmatrix}=\begin{pmatrix} \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_1^{[m]}}\frac{\partial z_1^{[m]}}{\partial b_1} \\ \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_2^{[m]}}\frac{\partial z_2^{[m]}}{\partial b_2} \\ \vdots \\ \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_{n_1}^{[m]}}\frac{\partial z_{n_1}^{[m]}}{\partial b_{n_1}} \end{pmatrix}=\begin{pmatrix} \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_1^{[m]}} \\ \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_2^{[m]}} \\ \vdots \\ \sum\limits_{m=1}^{M}\frac{\partial J}{\partial z_{n_1}^{[m]}} \end{pmatrix}=\mathrm{sum}\!\left(\frac{\partial J}{\partial Z},\ \mathrm{axis}=1\right)$$
Finally, the parameters are updated as follows:
$$W=W-\alpha\frac{\partial J}{\partial W}$$

$$B=B-\alpha\frac{\partial J}{\partial B}$$
3.4 Algorithm Summary
Forward propagation:
$$Z=WX+B$$

$$A=g(Z)$$
$$J=\frac{1}{M}\sum\limits_{m=1}^{M}\sum\limits_{i=1}^{n_1}-y_i^{[m]}\ln\left(a_i^{[m]}\right)$$
Backward propagation:
$$\frac{\partial J}{\partial Z}=\frac{1}{M}\left(A-Y\right)$$

$$\frac{\partial J}{\partial W}=\frac{\partial J}{\partial Z}X^T$$

$$\frac{\partial J}{\partial B}=\mathrm{sum}\!\left(\frac{\partial J}{\partial Z},\ \mathrm{axis}=1\right)$$
Parameter updates:
$$W=W-\alpha\frac{\partial J}{\partial W}$$

$$B=B-\alpha\frac{\partial J}{\partial B}$$
We can notice a striking coincidence: the backward-propagation expressions of Softmax regression are exactly the same as those of Logistic regression! (See the earlier article on the principles of Logistic regression.) Softmax regression can therefore be regarded as a higher-dimensional Logistic regression, and the two are consistent during backward propagation. This is another reason for adopting the vector-style labels $Y$.
4. Code Implementation
```python
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 19 17:14:55 2019
@author: Iseno_V
"""
import numpy as np

# Softmax function (applied column-wise)
def g(z):
    t = np.sum(np.exp(z), axis=0, keepdims=True)
    return np.exp(z) / t

# Generate training/test data: four classes, one per quadrant of the plane
def createData(m):
    m = int(m / 4)  # samples per class
    D1 = np.random.rand(1, m) * 10
    D2 = np.random.rand(1, m) * 10
    X1 = np.vstack((D1, D2))            # first quadrant
    D1 = np.random.rand(1, m) * (-1) * 10
    D2 = np.random.rand(1, m) * 10
    X2 = np.vstack((D1, D2))            # second quadrant
    D1 = np.random.rand(1, m) * (-1) * 10
    D2 = np.random.rand(1, m) * (-1) * 10
    X3 = np.vstack((D1, D2))            # third quadrant
    D1 = np.random.rand(1, m) * 10
    D2 = np.random.rand(1, m) * (-1) * 10
    X4 = np.vstack((D1, D2))            # fourth quadrant
    X = np.hstack((X1, X2, X3, X4))
    Y = np.zeros((4, m * 4))            # one-hot labels, one column per sample
    for i in range(4):
        for j in range(m):
            Y[i][i * m + j] = 1
    return X, Y

# Basic network parameters
m = 10000                               # dataset size
n = [2, 4]                              # input dim (features) and output dim (classes)
X_train, Y_train = createData(m)        # training set
X_test, Y_test = createData(m)          # test set
W = np.random.rand(n[1], n[0]) * 0.01   # initialize weight matrix W
B = np.random.rand(n[1], 1) * 0.01      # initialize bias vector B
I = 1000                                # number of iterations
alpha = 0.001                           # learning rate

# Training
for i in range(I):
    # Forward propagation
    Z = np.dot(W, X_train) + B
    A = g(Z)
    # Print the cost function every 100 iterations
    if i % 100 == 0:
        J = 0
        for j in range(m):
            J += np.dot(Y_train[:, j].T, np.log(A[:, j]))
        J = -J / m
        print('step = ', i, ', cost function = ', J)
    # Backward propagation
    dz = 1 / m * (A - Y_train)
    dw = np.dot(dz, X_train.T)
    db = np.sum(dz, axis=1, keepdims=True)  # keepdims so db matches B's (n1, 1) shape
    W = W - alpha * dw
    B = B - alpha * db

# Testing
Z = np.dot(W, X_test) + B
A = g(Z)
corr = 0
for i in range(m):
    p = A[:, i].argmax()                # predicted class: largest probability
    if Y_test[p, i] == 1:
        corr += 1
rate = corr / m
print('accuracy = ', rate)
```
The output of running the code is shown in the figure below.
In short, Softmax regression is generally suited to linearly separable multi-class problems and is usually powerless against nonlinear ones. Nonlinear multi-class problems can be handled by using the Softmax activation function inside a fully connected network, which will be covered in the next article.