1. Multilayer Perceptron

This article considers only the case of a single hidden layer:

- The input layer has $L$ units (excluding the bias), $\boldsymbol x=[x_1,\cdots,x_i,\cdots,x_L]^T\in R^L$; the hidden layer has $M$ units (excluding the bias), with $a_j$ denoting the $j$-th hidden unit; the output layer has $N$ units, $\boldsymbol y=[y_1,\cdots,y_k,\cdots,y_N]^T\in R^N$.
- The input-layer weights are $\boldsymbol{v}=[v_{i\zeta}]$ and the hidden-layer weights are $\boldsymbol{w}=[w_{jk}]$.

For the $\zeta$-th hidden unit $a_{\zeta}$, let the activation function be $g(\cdot)$, typically the Sigmoid function. Then:
$$a_{\zeta}=g(\textcolor{crimson}{\alpha_{\zeta}})=g\left(\textcolor{crimson}{\sum_{i=0}^{L}x_{i}v_{i\zeta}}\right) \tag{1}$$
where $x_{0}=-1$ is the bias input.

The hidden-layer activation function $g(\cdot)$ is typically the Sigmoid function, $g(\alpha)=\dfrac{1}{1+e^{-\alpha}}$ (used throughout Section 3).
\qquad
对于第
k
k
k 个输出单元
y
k
y_k
yk,假设激活函数为
h
(
⋅
)
h(\cdot)
h(⋅),则:
$$y_{k}=h(\textcolor{blue}{\beta_k})=h\left(\textcolor{blue}{\sum_{j=0}^{M}a_jw_{jk}}\right) \tag{2}$$
where $a_0=-1$ is the bias unit.

The output-layer activation function $h(\cdot)$:
- For regression problems, $h(\cdot)$ is the identity function, i.e. $h(\beta_k)=\beta_k$.
- For classification problems, $h(\cdot)$ is the softmax function, i.e. $h(\beta_k)=\dfrac{e^{\beta_k}}{\sum_{n=1}^{N}e^{\beta_n}}=\dfrac{e^{\beta_k}}{ e^{\beta_1}+\cdots+e^{\beta_k}+\cdots+e^{\beta_N}}$.
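As a concrete illustration of equations $(1)$ and $(2)$, here is a minimal NumPy sketch of the forward pass for a single sample; the layer sizes and random data are illustrative assumptions, not part of the original text. The bias entries are placed last, following the matrix convention of Section 2.3.

```python
import numpy as np

L, M, N = 4, 3, 2                      # assumed illustrative layer sizes
rng = np.random.default_rng(0)
x = np.append(rng.random(L), -1.0)     # input plus bias x_0 = -1 (placed last here)
v = rng.random((L + 1, M))             # input-layer weights v
w = rng.random((M + 1, N))             # hidden-layer weights w

alpha = x @ v                          # hidden-node inputs, inner sum of eq. (1)
a = 1.0 / (1.0 + np.exp(-alpha))       # sigmoid g(.)
a = np.append(a, -1.0)                 # hidden bias a_0 = -1

beta = a @ w                           # output-node inputs, inner sum of eq. (2)
y = np.exp(beta) / np.exp(beta).sum()  # softmax h(.) for classification
```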
2. Forward Propagation

2.1 Illustration of the forward propagation process
A perceptron can only solve linearly separable classification problems. By stacking layers of perceptrons, one obtains the multilayer perceptron, which can represent far more complex decision boundaries.
[Figure: from Machine Learning - An Algorithmic Perspective, 2nd Edition, Fig. 4.9. Panel (a) can roughly be read as the "decision boundary" formed by a single sigmoid neuron; panels (b)-(d) show the "decision boundaries" formed by superposing different sigmoid neurons.]

[Figure: from Machine Learning - An Algorithmic Perspective, 2nd Edition, Fig. 4.10. Schematic of the multilayer perceptron learning process.]
- The full parameter set of the multilayer perceptron is $(\boldsymbol{v}, \boldsymbol{w})$.
- For a trained multilayer perceptron with parameters $(\boldsymbol{v}, \boldsymbol{w})$ and an input $\boldsymbol{x}$, the $k$-th output is the function $y_{k}(\boldsymbol{x},\boldsymbol{v}, \boldsymbol{w})$:
$$y_{k}(\boldsymbol{x},\boldsymbol{v}, \boldsymbol{w})=h \left( \sum_{j=0}^{M}w_{jk}\,g\left( \sum_{i=0}^{L}v_{ij}x_{i}\right) \right),\quad k=1,\cdots,N \tag{3}$$
2.2 Data representation in the Sequential and Batch training modes

- Sequential mode handles a single input at a time.

  Assume a single training example $\{\boldsymbol{x},\boldsymbol{t}\}$:
  - $\boldsymbol{x} \in R^{L+1}$ is the input, $\boldsymbol{x}=(x_{0},x_{1},\cdots,x_{L})$, where $x_{0}=-1$ is the bias;
  - $\boldsymbol{t} \in R^{N}$ is the desired output, $\boldsymbol{t}=(t_{1},t_{2},\cdots,t_{N})$;
  - the output of the multilayer perceptron is $\boldsymbol{y} \in R^{N}$, $\boldsymbol{y}=(y_{1},y_{2},\cdots,y_{N})$.

  The training error produced by this example is defined as $E=\dfrac{1}{2}\displaystyle\sum_{k=1}^{N}(y_{k}-t_{k})^{2}$.

- Batch mode handles several input samples at a time.

  Assume $P$ training examples $\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}_{p=1}^{P}$:
  - $\boldsymbol{x}^{(p)} \in R^{L+1}$ is the input, $\boldsymbol{x}^{(p)}=(x_{0}^{(p)},x_{1}^{(p)},\cdots,x_{L}^{(p)})$, where $x_{0}^{(p)}=-1$ is the bias;
  - $\boldsymbol{t}^{(p)} \in R^{N}$ is the desired output, $\boldsymbol{t}^{(p)}=(t_{1}^{(p)},t_{2}^{(p)},\cdots,t_{N}^{(p)})$;
  - the output of the multilayer perceptron is $\boldsymbol{y}^{(p)} \in R^{N}$, $\boldsymbol{y}^{(p)}=(y_{1}^{(p)},y_{2}^{(p)},\cdots,y_{N}^{(p)})$.

  The mean error over this batch is defined as $E=\dfrac{1}{2P}\displaystyle\sum_{p=1}^{P}\left(\sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})^{2}\right)$.
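Both error definitions map directly onto NumPy. The following sketch, with made-up example arrays, computes each one, assuming `y` and `t` are $P\times N$ arrays of outputs and targets:

```python
import numpy as np

# assumed illustrative data: P = 3 samples, N = 2 outputs
y = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

E_seq   = 0.5 * np.sum((y[0] - t[0]) ** 2)           # sequential: one sample
E_batch = np.sum((y - t) ** 2) / (2 * y.shape[0])    # batch: mean over the P samples
```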
2.3 Matrix representations

- Four data matrices

(1) Input matrix: row $p$ holds the $p$-th input $\boldsymbol x^{(p)}$, and the final column of $-1$'s holds the bias.

$$inputs:\quad\left[ \begin{matrix} x_{1}^{(1)} & \cdots & x_{i}^{(1)} & \cdots & x_{L}^{(1)} & -1 \\ \vdots & & \vdots & & \vdots & \vdots \\ x_{1}^{(p)} & \cdots & x_{i}^{(p)} & \cdots & x_{L}^{(p)} & -1 \\ \vdots & & \vdots& & \vdots & \vdots \\ x_{1}^{(P)} & \cdots & x_{i}^{(P)} & \cdots & x_{L}^{(P)} & -1 \end{matrix} \right]_{P\times (L+1)} \longleftarrow \boldsymbol x^{(p)}$$
(2) Hidden-layer matrix: row $p$ holds the hidden-node values $\boldsymbol{a}^{(p)}$ for the $p$-th input $\boldsymbol{x}^{(p)}$; the final column of $-1$'s holds the bias.

$$hidden:\quad\left[ \begin{matrix} a_{1}^{(1)} & \cdots & a_{j}^{(1)} & \cdots & a_{M}^{(1)} & -1 \\ \vdots & & \vdots & & \vdots & \vdots \\ a_{1}^{(p)} & \cdots & a_{j}^{(p)} & \cdots & a_{M}^{(p)} & -1 \\ \vdots & & \vdots& & \vdots & \vdots \\ a_{1}^{(P)} & \cdots & a_{j}^{(P)} & \cdots & a_{M}^{(P)} & -1 \end{matrix} \right]_{P\times (M+1)} \longleftarrow \boldsymbol{a}^{(p)}$$
Note: placing the bias $-1$ in the last column of both matrices simplifies the implementation; this convention is used for $\boldsymbol{x}^{(p)}$ and $\boldsymbol{a}^{(p)}$ throughout.
(3) Output matrix: row $p$ holds the output $\boldsymbol{y}^{(p)}$ for the $p$-th input $\boldsymbol{x}^{(p)}$.

$$output:\quad\left[ \begin{matrix} y_{1}^{(1)} & \cdots & y_{k}^{(1)} & \cdots & y_{N}^{(1)} \\ \vdots & & \vdots & & \vdots \\ y_{1}^{(p)} & \cdots & y_{k}^{(p)} & \cdots & y_{N}^{(p)} \\ \vdots & & \vdots& & \vdots \\ y_{1}^{(P)} & \cdots & y_{k}^{(P)} & \cdots & y_{N}^{(P)} \end{matrix} \right]_{P \times N} \longleftarrow \boldsymbol{y}^{(p)}$$
(4) Target matrix: row $p$ holds the target values $\boldsymbol{t}^{(p)}$ for the $p$-th input $\boldsymbol{x}^{(p)}$.

$$target:\quad\left[ \begin{matrix} t_{1}^{(1)} & \cdots & t_{k}^{(1)} & \cdots & t_{N}^{(1)} \\ \vdots & & \vdots & & \vdots \\ t_{1}^{(p)} & \cdots & t_{k}^{(p)} & \cdots & t_{N}^{(p)} \\ \vdots & & \vdots& & \vdots \\ t_{1}^{(P)} & \cdots & t_{k}^{(P)} & \cdots & t_{N}^{(P)} \end{matrix} \right]_{P \times N} \longleftarrow \boldsymbol{t}^{(p)}$$
- Two weight matrices

(1) Input weight matrix: column $j$ holds the weights connecting the input-layer nodes to the $j$-th hidden node; the last row holds the bias weights $v_{0j}$.

$$weights1:\quad\left[ \begin{matrix} v_{11} & \cdots & v_{1j} & \cdots & v_{1M} \\ \vdots & & \vdots & & \vdots \\ v_{i1} & \cdots & v_{ij} & \cdots & v_{iM} \\ \vdots & & \vdots& & \vdots \\ v_{L1} & \cdots & v_{Lj} & \cdots & v_{LM} \\ v_{01} & \cdots & v_{0j} & \cdots & v_{0M} \end{matrix} \right]_{(L+1) \times M}$$
(2) Hidden weight matrix: column $k$ holds the weights connecting the hidden nodes to the $k$-th output node; the last row holds the bias weights $w_{0k}$.

$$weights2:\quad\left[ \begin{matrix} w_{11} & \cdots & w_{1k} & \cdots & w_{1N} \\ \vdots & & \vdots & & \vdots \\ w_{j1} & \cdots & w_{jk} & \cdots & w_{jN} \\ \vdots & & \vdots& & \vdots \\ w_{M1} & \cdots & w_{Mk} & \cdots & w_{MN} \\ w_{01} & \cdots & w_{0k} & \cdots & w_{0N} \end{matrix} \right]_{(M+1) \times N}$$
2.4 Implementing the forward propagation process

With these matrices defined, the computation of equation $(3)$ proceeds as follows:
(1) Construct the $inputs_{P \times (L+1)}$ matrix by appending a bias column of $-1$'s to the $P \times L$ input data matrix:

$$\left[ \begin{matrix} x_{1}^{(1)} & \cdots & x_{i}^{(1)} & \cdots & x_{L}^{(1)} \\ \vdots & & \vdots & & \vdots \\ \textcolor{red}{x_{1}^{(p)}} & \cdots & \textcolor{red}{x_{i}^{(p)}} & \cdots & \textcolor{red}{x_{L}^{(p)}} \\ \vdots & & \vdots& & \vdots \\ x_{1}^{(P)} & \cdots & x_{i}^{(P)} & \cdots & x_{L}^{(P)} \end{matrix} \right]_{P\times L} \longleftarrow \text{the } p\text{-th training input } \boldsymbol{x}^{(p)}$$
(2) Compute the hidden-node input values $\boldsymbol{\alpha}^{(p)}$: row $p$ of the $P \times M$ matrix $inputs*weights1$.

$$\begin{aligned} & inputs_{P\times (L+1)}*weights1_{(L+1) \times M} \\ &= \left[ \begin{matrix} x_{1}^{(1)} & \cdots & x_{i}^{(1)} & \cdots & x_{L}^{(1)} & -1 \\ \vdots & & \vdots & & \vdots & \vdots \\ \textcolor{red}{x_{1}^{(p)}} & \cdots & \textcolor{red}{x_{i}^{(p)}} & \cdots & \textcolor{red}{x_{L}^{(p)}} & \textcolor{red}{-1} \\ \vdots & & \vdots& & \vdots & \vdots \\ x_{1}^{(P)} & \cdots & x_{i}^{(P)} & \cdots & x_{L}^{(P)} & -1 \end{matrix} \right] \left[ \begin{matrix} v_{11} & \cdots & \textcolor{red}{v_{1j}} & \cdots & v_{1M} \\ \vdots & & \vdots & & \vdots \\ v_{i1} & \cdots & \textcolor{red}{v_{ij}} & \cdots & v_{iM} \\ \vdots & & \vdots& & \vdots \\ v_{L1} & \cdots & \textcolor{red}{v_{Lj}} & \cdots & v_{LM} \\ v_{01} & \cdots & \textcolor{red}{v_{0j}} & \cdots & v_{0M} \end{matrix} \right] \\ &= \left[ \begin{matrix} \sum\limits_{i=0}^{L} x_{i}^{(1)}v_{i1} & \cdots & \sum\limits_{i=0}^{L} x_{i}^{(1)}v_{ij} & \cdots & \sum\limits_{i=0}^{L} x_{i}^{(1)}v_{iM} \\ \vdots & & \vdots & & \vdots \\ \sum\limits_{i=0}^{L} x_{i}^{(p)}v_{i1} & \cdots & \textcolor{red}{\sum\limits_{i=0}^{L} x_{i}^{(p)}v_{ij}} & \cdots & \sum\limits_{i=0}^{L} x_{i}^{(p)}v_{iM} \\ \vdots & & \vdots& & \vdots \\ \sum\limits_{i=0}^{L} x_{i}^{(P)}v_{i1} & \cdots & \sum\limits_{i=0}^{L} x_{i}^{(P)}v_{ij} & \cdots & \sum\limits_{i=0}^{L} x_{i}^{(P)}v_{iM} \end{matrix} \right] _{P\times M} \\ &= \left[ \begin{matrix} \alpha_{1}^{(1)} & \cdots & \alpha_{j}^{(1)} & \cdots & \alpha_{M}^{(1)} \\ \vdots & & \vdots & & \vdots \\ \alpha_{1}^{(p)} & \cdots & \textcolor{red}{\alpha_{j}^{(p)}} & \cdots & \alpha_{M}^{(p)} \\ \vdots & & \vdots& & \vdots \\ \alpha_{1}^{(P)} & \cdots & \alpha_{j}^{(P)} & \cdots & \alpha_{M}^{(P)} \end{matrix} \right] _{P\times M} \longleftarrow \text{hidden-node inputs for the } p\text{-th input } \boldsymbol{x}^{(p)} \end{aligned}$$
(3) Pass the hidden-node inputs through the Sigmoid activation $g(\cdot)$ to obtain the hidden-node outputs $\boldsymbol{a}^{(p)}=g(\boldsymbol{\alpha}^{(p)})$.

Hidden-node output matrix: $g(inputs*weights1)$

$$\begin{aligned} & g(inputs*weights1) \\ &=g\left( \left[ \begin{matrix} \alpha_{1}^{(1)} & \cdots & \alpha_{j}^{(1)} & \cdots & \alpha_{M}^{(1)} \\ \vdots & & \vdots & & \vdots \\ \alpha_{1}^{(p)} & \cdots & \textcolor{red}{\alpha_{j}^{(p)}} & \cdots & \alpha_{M}^{(p)} \\ \vdots & & \vdots& & \vdots \\ \alpha_{1}^{(P)} & \cdots & \alpha_{j}^{(P)} & \cdots & \alpha_{M}^{(P)} \end{matrix} \right] _{P\times M}\right) =\left[ \begin{matrix} a_{1}^{(1)} & \cdots & a_{j}^{(1)} & \cdots & a_{M}^{(1)} \\ \vdots & & \vdots & & \vdots \\ a_{1}^{(p)} & \cdots & \textcolor{red}{a_{j}^{(p)}} & \cdots & a_{M}^{(p)} \\ \vdots & & \vdots& & \vdots \\ a_{1}^{(P)} & \cdots & a_{j}^{(P)} & \cdots & a_{M}^{(P)} \end{matrix} \right] _{P\times M} \longleftarrow \text{hidden-node values for the } p\text{-th input } \boldsymbol{x}^{(p)} \end{aligned}$$
(4) Construct the $hidden_{P \times (M+1)}$ matrix by appending a bias column of $-1$'s to the $P \times M$ hidden-node output matrix:

$$hidden:\left[ \begin{matrix} a_{1}^{(1)} & \cdots & a_{j}^{(1)} & \cdots & a_{M}^{(1)} & -1 \\ \vdots & & \vdots & & \vdots & \vdots \\ \textcolor{blue}{a_{1}^{(p)}} & \cdots & \textcolor{blue}{a_{j}^{(p)}} & \cdots & \textcolor{blue}{a_{M}^{(p)}} &\textcolor{blue}{-1} \\ \vdots & & \vdots& & \vdots & \vdots \\ a_{1}^{(P)} & \cdots & a_{j}^{(P)} & \cdots & a_{M}^{(P)} & -1 \end{matrix} \right] \longleftarrow \text{hidden-node values } \boldsymbol{a}^{(p)} \text{ for the } p\text{-th input } \boldsymbol{x}^{(p)}$$
(5) Compute the output-node input values $\boldsymbol{\beta}^{(p)}$: row $p$ of the $P \times N$ matrix $hidden*weights2$.

$$\begin{aligned} & hidden_{P\times (M+1)}*weights2_{(M+1) \times N} \\ &= \left[ \begin{matrix} a_{1}^{(1)} & \cdots & a_{j}^{(1)} & \cdots & a_{M}^{(1)} & -1 \\ \vdots & & \vdots & & \vdots & \vdots \\ \textcolor{blue}{a_{1}^{(p)}} & \cdots & \textcolor{blue}{a_{j}^{(p)}} & \cdots & \textcolor{blue}{a_{M}^{(p)}} &\textcolor{blue}{-1} \\ \vdots & & \vdots& & \vdots & \vdots \\ a_{1}^{(P)} & \cdots & a_{j}^{(P)} & \cdots & a_{M}^{(P)} & -1 \end{matrix} \right] \left[ \begin{matrix} w_{11} & \cdots & \textcolor{blue}{w_{1k}} & \cdots & w_{1N} \\ \vdots & & \vdots & & \vdots \\ w_{j1} & \cdots & \textcolor{blue}{w_{jk}} & \cdots & w_{jN} \\ \vdots & & \vdots& & \vdots \\ w_{M1} & \cdots & \textcolor{blue}{w_{Mk}} & \cdots & w_{MN} \\ w_{01} & \cdots & \textcolor{blue}{w_{0k}} & \cdots & w_{0N} \end{matrix} \right] \\ &= \left[ \begin{matrix} \sum\limits_{j=0}^{M} a_{j}^{(1)}w_{j1} & \cdots & \sum\limits_{j=0}^{M} a_{j}^{(1)}w_{jk} & \cdots & \sum\limits_{j=0}^{M} a_{j}^{(1)}w_{jN} \\ \vdots & & \vdots & & \vdots \\ \sum\limits_{j=0}^{M} a_{j}^{(p)}w_{j1} & \cdots & \textcolor{blue}{\sum\limits_{j=0}^{M} a_{j}^{(p)}w_{jk}} & \cdots & \sum\limits_{j=0}^{M} a_{j}^{(p)}w_{jN} \\ \vdots & & \vdots& & \vdots \\ \sum\limits_{j=0}^{M} a_{j}^{(P)}w_{j1} & \cdots & \sum\limits_{j=0}^{M} a_{j}^{(P)}w_{jk} & \cdots & \sum\limits_{j=0}^{M} a_{j}^{(P)}w_{jN} \end{matrix} \right] _{P\times N} \\ &= \left[ \begin{matrix} \beta_{1}^{(1)} & \cdots & \beta_{k}^{(1)} & \cdots & \beta_{N}^{(1)} \\ \vdots & & \vdots & & \vdots \\ \beta_{1}^{(p)} & \cdots & \textcolor{blue}{\beta_{k}^{(p)}} & \cdots & \beta_{N}^{(p)} \\ \vdots & & \vdots& & \vdots \\ \beta_{1}^{(P)} & \cdots & \beta_{k}^{(P)} & \cdots & \beta_{N}^{(P)} \end{matrix} \right] _{P\times N} \longleftarrow \text{output-node inputs } \boldsymbol\beta^{(p)} \text{ for the } p\text{-th input } \boldsymbol{x}^{(p)} \end{aligned}$$
(6) Pass the output-node inputs through the activation $h(\cdot)$ to obtain the output-node values $\boldsymbol{y}^{(p)}=h(\boldsymbol{\beta}^{(p)})$.

Output matrix: $h(hidden*weights2)$

$$\begin{aligned} & h(hidden*weights2) \\ &=h \left( \left[ \begin{matrix} \beta_{1}^{(1)} & \cdots & \beta_{k}^{(1)} & \cdots & \beta_{N}^{(1)} \\ \vdots & & \vdots & & \vdots \\ \beta_{1}^{(p)} & \cdots & \textcolor{blue}{\beta_{k}^{(p)}} & \cdots & \beta_{N}^{(p)} \\ \vdots & & \vdots& & \vdots \\ \beta_{1}^{(P)} & \cdots & \beta_{k}^{(P)} & \cdots & \beta_{N}^{(P)} \end{matrix} \right] _{P\times N} \right) = \left[ \begin{matrix} y_{1}^{(1)} & \cdots & y_{k}^{(1)} & \cdots & y_{N}^{(1)} \\ \vdots & & \vdots & & \vdots \\ y_{1}^{(p)} & \cdots & \textcolor{blue}{y_{k}^{(p)}} & \cdots & y_{N}^{(p)} \\ \vdots & & \vdots& & \vdots \\ y_{1}^{(P)} & \cdots & y_{k}^{(P)} & \cdots & y_{N}^{(P)} \end{matrix} \right] _{P\times N} \longleftarrow \text{output } \boldsymbol{y}^{(p)}=[y_{1}^{(p)}, \cdots, y_{N}^{(p)}]^T \text{ for the } p\text{-th input } \boldsymbol{x}^{(p)} \end{aligned}$$
When $h(\cdot)$ is the identity function, the forward propagation process can be written in Python as:
```python
import numpy as np

# build the inputs matrix, P x (L+1): append the bias column of -1's
inputs = np.concatenate((inputs, -np.ones((np.shape(inputs)[0], 1))), axis=1)
hidden = np.dot(inputs, weights1)       # inputs*weights1 is P x M
hidden = 1.0/(1.0 + np.exp(-hidden))    # sigmoid activation g(.)
# build the hidden matrix, P x (M+1): append the bias column of -1's
hidden = np.concatenate((hidden, -np.ones((np.shape(inputs)[0], 1))), axis=1)
output = np.dot(hidden, weights2)       # output matrix, P x N
```
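For classification, $h(\cdot)$ is the softmax function instead; a minimal continuation of the snippet above (mirroring the softmax branch of `mlpfwd` in Section 4) is:

```python
# softmax h(.): exponentiate and normalise each row of the P x N output matrix
expout = np.exp(output)
output = expout / expout.sum(axis=1, keepdims=True)
```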
3. Error Backpropagation

The weights are trained by error-correction learning: the error is propagated backwards through the multilayer perceptron, and the weights of each layer are adjusted to reduce the training error. The adjustment uses gradient descent ($\eta$ is the learning rate):
Hidden-layer weights:

$$w_{jk}=w_{jk} + \Delta w_{jk} = w_{jk}-\eta \dfrac{\partial E}{\partial w_{jk}} \tag{4}$$

Input-layer weights:

$$v_{ij}=v_{ij} + \Delta v_{ij} = v_{ij}-\eta \dfrac{\partial E}{\partial v_{ij}} \tag{5}$$
3.1 Sequential mode

One training example enters at a time; after forward propagation produces the output, all weights $(\boldsymbol{v},\boldsymbol{w})$ are trained by backpropagating the error.
Assume a single training example $\{\boldsymbol{x},\boldsymbol{t}\}$: the input is $\boldsymbol{x}=(x_{1},\cdots,x_{L},-1)$, the corresponding target is $\boldsymbol{t}=(t_{1},t_{2},\cdots,t_{N})$, and the output of the multilayer perceptron is $\boldsymbol{y}=(y_{1},y_{2},\cdots,y_{N})$.
The training error produced by this example is:

$$E=\dfrac{1}{2}\sum_{k=1}^{N}(y_{k}-t_{k})^{2} \tag{6}$$
(a) Hidden-layer weight updates

By the chain rule:

$$\begin{aligned} \dfrac{\partial E}{ \partial w_{jk} } &= \dfrac{\partial E}{\partial y_{k}} \textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}} \dfrac{\partial \beta_{k}}{\partial w_{jk}} \\ &= (y_{k}-t_{k}) \textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}} \dfrac{\partial \beta_{k}}{\partial w_{jk}} \qquad \text{since } \beta_{k}=\sum_{j=0}^{M}a_{j}w_{jk} \\ &= (y_{k}-t_{k}) \textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}}\, a_j \end{aligned}$$
With a different activation function $y=h(\beta)$, the factor $\textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}}$ changes accordingly.

For classification, the output-layer activation $h(\cdot)$ is the softmax function, $y_{k}=h(\beta_{k})=\dfrac{e^{\beta_{k}}}{\sum_{n=1}^{N}e^{\beta_{n}}}$. Then:

$$\begin{aligned} \dfrac{\partial y_{k}}{\partial \beta_{k}} &= \left( \dfrac{e^{\beta_{k}}}{ e^{\beta_{1}}+\cdots+e^{\beta_{k}}+\cdots+e^{\beta_{N}}} \right)' \\ &= \dfrac{e^{\beta_{k}}\left( e^{\beta_{1}}+\cdots+e^{\beta_{k}}+\cdots+e^{\beta_{N}} \right)-e^{\beta_{k}}e^{\beta_{k}}}{\left( e^{\beta_{1}}+\cdots+e^{\beta_{k}}+\cdots+e^{\beta_{N}} \right)^{2}} \\ &=\dfrac{e^{\beta_{k}}}{ e^{\beta_{1}}+\cdots+e^{\beta_{k}}+\cdots+e^{\beta_{N}} } \cdot \dfrac{\left(e^{\beta_{1}}+\cdots+e^{\beta_{k}}+\cdots+e^{\beta_{N}} \right)-e^{\beta_{k}}}{ e^{\beta_{1}}+\cdots+e^{\beta_{k}}+\cdots+e^{\beta_{N}} } \\ &=y_{k}(1-y_{k}) \end{aligned}$$
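This diagonal softmax derivative is easy to sanity-check numerically. A minimal sketch, with arbitrarily chosen values, compares a finite-difference estimate against $y_k(1-y_k)$:

```python
import numpy as np

def softmax(beta):
    e = np.exp(beta)
    return e / e.sum()

beta = np.array([0.3, -1.2, 0.8])   # arbitrary example values
y = softmax(beta)
k, eps = 0, 1e-6
# perturb beta_k only, leaving the other components fixed
num = (softmax(beta + eps * np.eye(3)[k])[k] - y[k]) / eps
print(num, y[k] * (1 - y[k]))       # both approximately 0.227
```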
Therefore:

$$\dfrac{\partial E}{\partial w_{jk} }=(y_{k}-t_{k}) \textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}}\, a_j =(y_k-t_k)\textcolor{red}{y_k(1-y_k)}\,a_j \tag{7}$$
(b) Input-layer weight updates
The training error can be expanded as:

$$\begin{aligned} E &=\dfrac{1}{2}\sum_{k=1}^{N}(y_{k}-t_{k})^{2} \\ &=\dfrac{1}{2}(y_{1}-t_{1})^{2}+\cdots+\dfrac{1}{2}(y_{k}-t_{k})^{2}+\cdots+\dfrac{1}{2}(y_{N}-t_{N})^{2} \end{aligned}$$

where every $y_{k}=h(\beta_{k})=h\left( \sum_{j=0}^{M}a_{j}w_{jk} \right),\ k=1,2,\cdots,N$ contains $a_{j}$.
The backward chain of derivatives from the output to the input splits into two stages:

(1) Hidden-layer stage: $\dfrac{\partial y_{k}}{\partial a_{j}} = \textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}}\dfrac{\partial \beta_{k}} {\partial a_{j}} = y_{k}(1-y_{k})w_{jk},\quad k=1,2,\cdots,N$

(2) Input-layer stage: $\textcolor{red}{\dfrac{\partial a_j}{\partial \alpha_j}} \dfrac{\partial \alpha_j}{\partial v_{ij}}$

Here $\textcolor{red}{\dfrac{\partial y_{k}}{\partial \beta_k}}$ and $\textcolor{red}{\dfrac{\partial a_j}{\partial \alpha_j}}$ are the partial derivatives determined by the activation functions.
The complete chain rule is therefore:

$$\begin{aligned} \dfrac{\partial E}{ \partial v_{ij} } &= \textcolor{blue}{\dfrac{\partial E}{\partial y_1} \dfrac{\partial y_1}{\partial a_j}} \textcolor{FireBrick}{\dfrac{\partial a_j}{\partial \alpha_j} \dfrac{\partial \alpha_j}{\partial v_{ij}}} + \cdots + \textcolor{blue}{\dfrac{\partial E}{\partial y_{k}} \dfrac{\partial y_{k}}{\partial a_j}} \textcolor{FireBrick}{\dfrac{\partial a_j}{\partial \alpha_j} \dfrac{\partial \alpha_j}{\partial v_{ij}}} + \cdots +\textcolor{blue}{ \dfrac{\partial E}{\partial y_N} \dfrac{\partial y_N}{\partial a_j}} \textcolor{FireBrick}{\dfrac{\partial a_j}{\partial \alpha_j} \dfrac{\partial \alpha_j}{\partial v_{ij}}}\\ &= \left( \sum_{k=1}^{N} \dfrac{\partial E}{\partial y_{k}} \dfrac{\partial y_{k}}{\partial a_{j}} \right) \textcolor{FireBrick}{\dfrac{\partial a_{j}}{\partial \alpha_{j}} \dfrac{\partial \alpha_{j}}{\partial v_{ij}}} \\ &= \sum_{k=1}^{N}(y_{k}-t_{k})\, y_{k}(1-y_{k})\,w_{jk} \textcolor{FireBrick}{\dfrac{\partial a_{j}}{\partial \alpha_{j}} \dfrac{\partial \alpha_{j}}{\partial v_{ij}}} \qquad \text{since } \alpha_{j}=\sum_{i=0}^{L} x_{i} v_{ij} \\ &= \sum_{k=1}^{N}(y_{k}-t_{k})\, y_{k}(1-y_{k})\,w_{jk}\textcolor{FireBrick}{ \dfrac{\partial a_{j}}{\partial \alpha_{j}}\, x_i} \end{aligned}$$
If the hidden-layer activation $g(\cdot)$ is the Sigmoid function, $a_j=g(\alpha_j)=\dfrac{1}{1+e^{-\alpha_j}}$, then:

$$\begin{aligned} \dfrac{\partial a_{j}}{\partial \alpha_{j}} &= \left(\dfrac{1}{1+e^{-\alpha_{j}}} \right)' = \dfrac{e^{-\alpha_{j}}}{\left(1+e^{-\alpha_{j}} \right)^{2}} =\dfrac{1}{1+e^{-\alpha_{j}}} \cdot \dfrac{e^{-\alpha_{j}}}{1+e^{-\alpha_{j}}} =a_{j}(1-a_{j}) \end{aligned}$$
Therefore:

$$\begin{aligned} \dfrac{\partial E}{ \partial v_{ij} } &= \left( \sum_{k=1}^{N} \dfrac{\partial E}{\partial y_{k}} \dfrac{\partial y_{k}}{\partial a_{j}}\right) \textcolor{FireBrick}{ \dfrac{\partial a_{j}}{\partial \alpha_{j}} \dfrac{\partial \alpha_{j}}{\partial v_{ij}}} = \sum_{k=1}^{N}(y_{k}-t_{k})\, y_{k}(1-y_{k})\,w_{jk} \textcolor{FireBrick}{ a_{j}(1-a_{j})\, x_i} \end{aligned} \tag{8}$$
For convenience, define:

$$\textcolor{crimson}{\delta_{o}(k) = (y_{k}-t_{k})\,y_{k}(1-y_{k})} \tag{9}$$

$$\textcolor{blue}{\delta_{h}(j) = a_{j}(1-a_{j})\sum_{k=1}^{N} \delta_{o}(k)\,w_{jk}} \tag{10}$$
Equations $(7)$ and $(8)$ can then be rewritten as:

$$\dfrac{\partial E}{\partial w_{jk} }=(y_{k}-t_{k})\,y_{k}(1-y_{k})\,a_{j}=\textcolor{crimson}{\delta_{o}(k)}\,a_{j}$$

$$\dfrac{\partial E}{ \partial v_{ij} } = \sum_{k=1}^{N}(y_{k}-t_{k})\, y_{k}(1-y_{k})\,w_{jk}\, a_{j}(1-a_{j})\, x_{i}=\textcolor{crimson}{\delta_h(j)}\, x_{i} \tag{11}$$
The hidden-layer update $(4)$ and the input-layer update $(5)$ become:

$$\textcolor{crimson}{ w_{jk}=w_{jk}-\eta \dfrac{\partial E}{\partial w_{jk}}=w_{jk}-\eta\,\delta_{o}(k)\,a_{j}} \tag{12}$$

$$\textcolor{blue}{ v_{ij}=v_{ij}-\eta \dfrac{\partial E}{\partial v_{ij}}=v_{ij}-\eta\,\delta_{h}(j)\, x_{i} } \tag{13}$$
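Continuing the single-sample sketch from Section 1 (so `x`, `a`, `y`, `v`, `w` and `N` are assumed to already exist, with the bias entries last), one sequential update of equations $(9)$, $(10)$, $(12)$ and $(13)$ looks like:

```python
t = np.eye(N)[0]    # assumed one-hot target for this sample
eta = 0.1           # learning rate

delta_o = (y - t) * y * (1.0 - y)                        # eq. (9), element-wise
w -= eta * np.outer(a, delta_o)                          # eq. (12), all j and k at once
delta_h = a[:-1] * (1.0 - a[:-1]) * (w[:-1] @ delta_o)   # eq. (10), bias row/entry excluded
v -= eta * np.outer(x, delta_h)                          # eq. (13)
```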
For a batch of training samples $\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}_{p=1}^{P}$ with outputs $\boldsymbol y_{P\times N}$ and corresponding targets $\boldsymbol t_{P\times N}$:

$$\begin{aligned}(\boldsymbol y-\boldsymbol t)\boldsymbol y(\bold 1-\boldsymbol y)&=\footnotesize\left[ \begin{matrix} (y_1^{(1)}-t_1^{(1)})y_1^{(1)}(1-y_1^{(1)}) & \cdots & (y_{k}^{(1)}-t_{k}^{(1)})y_{k}^{(1)}(1-y_{k}^{(1)}) & \cdots &(y_N^{(1)}-t_N^{(1)})y_N^{(1)}(1-y_N^{(1)}) \\ \vdots & & \vdots & & \vdots \\ \textcolor{blue}{(y_1^{(p)}-t_1^{(p)})y_1^{(p)}(1-y_1^{(p)})} & \cdots & \textcolor{blue}{(y_{k}^{(p)}-t_{k}^{(p)})y_{k}^{(p)}(1-y_{k}^{(p)})} & \cdots & \textcolor{blue}{(y_N^{(p)}-t_N^{(p)})y_N^{(p)}(1-y_N^{(p)})} \\ \vdots & & \vdots& & \vdots \\ (y_1^{(P)}-t_1^{(P)})y_1^{(P)}(1-y_1^{(P)}) & \cdots & (y_{k}^{(P)}-t_{k}^{(P)})y_{k}^{(P)}(1-y_{k}^{(P)}) & \cdots &(y_N^{(P)}-t_N^{(P)})y_N^{(P)}(1-y_N^{(P)}) \end{matrix} \right] _{P\times N}\\ &=\footnotesize\left[ \begin{matrix} \delta_{o}^{(1)}(1) & \cdots & \delta_{o}^{(1)}(k) & \cdots & \delta_{o}^{(1)}(N) \\ \vdots & & \vdots & & \vdots \\ \textcolor{blue}{\delta_{o}^{(p)}(1)}& \cdots & \textcolor{blue}{\delta_{o}^{(p)}(k)} & \cdots & \textcolor{blue}{\delta_{o}^{(p)}(N)} \\ \vdots & & \vdots& & \vdots \\ \delta_{o}^{(P)}(1) & \cdots & \delta_{o}^{(P)}(k) & \cdots & \delta_{o}^{(P)}(N) \end{matrix} \right] _{P\times N} \longleftarrow \text{row for the } p\text{-th input } \boldsymbol{x}^{(p)}\end{aligned}$$

Here $(\boldsymbol y-\boldsymbol t)\boldsymbol y(\bold 1-\boldsymbol y)$ denotes element-wise array multiplication (same-position elements multiplied), not matrix multiplication.

Sequential mode corresponds to $P=1$ above: for a training sample $\{\boldsymbol{x},\boldsymbol{t}\}$,

$$(\boldsymbol y-\boldsymbol t)\boldsymbol y(\bold 1-\boldsymbol y)=[\textcolor{blue}{\delta_{o}(1)}, \cdots , \textcolor{blue}{\delta_{o}(k)} , \cdots , \textcolor{blue}{\delta_{o}(N)} ]$$
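In NumPy this element-wise product is simply `*`; a two-line sketch with assumed $P\times N$ arrays `y` and `t`:

```python
deltao = (y - t) * y * (1.0 - y)  # element-wise: '*' is NOT matrix multiplication
# a matrix product would instead be written y @ t.T or np.dot(y, t.T)
```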
Summary: in Sequential mode, the examples of the training set $\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}_{p=1}^{P}$ enter the multilayer perceptron one at a time, in some fixed order, each as $\{\boldsymbol{x},\boldsymbol{t}\}=\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}$.
The training procedure is as follows (a sketch of this loop follows the list):

1) Feed in a new training example $\{\boldsymbol{x},\boldsymbol{t}\}=\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}$, starting with $p=1$; following the steps of Section 2.4, i.e. equation $(3)$, compute the output $\boldsymbol{y}=\boldsymbol{y}^{(p)}$ (row $p$ of the matrices in Section 2.3); the resulting training error is given by equation $(6)$.
2) Evaluate equation $(9)$ to obtain $\delta_{o}(1),\delta_{o}(2),\cdots,\delta_{o}(N)$.
3) Evaluate equation $(12)$ to update all hidden-layer weights $w_{jk}$, $j\in 0\sim M$, $k\in 1\sim N$.
4) Evaluate equation $(10)$ to obtain $\delta_{h}(1),\delta_{h}(2),\cdots,\delta_{h}(M)$.
5) Evaluate equation $(13)$ to update all input-layer weights $v_{ij}$.
6) Return to step 1) with $p=p+1$; training ends when $p=P$.
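A compact sketch of steps 1) through 6), reusing the forward-pass and update fragments above; `X` ($P\times L$) and `T` ($P\times N$) are assumed training arrays, `v` is $(L+1)\times M$, `w` is $(M+1)\times N$, and the bias handling follows Section 2.3:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

def train_sequential(X, T, v, w, eta=0.1):
    """One pass over the data, updating (v, w) after every sample."""
    for x, t in zip(X, T):
        x = np.append(x, -1.0)                  # bias input, step 1)
        a = np.append(sigmoid(x @ v), -1.0)     # hidden values plus bias
        y = softmax(a @ w)                      # forward pass, eq. (3)
        delta_o = (y - t) * y * (1.0 - y)       # step 2): eq. (9)
        w -= eta * np.outer(a, delta_o)         # step 3): eq. (12)
        delta_h = a[:-1] * (1.0 - a[:-1]) * (w[:-1] @ delta_o)  # step 4): eq. (10)
        v -= eta * np.outer(x, delta_h)         # step 5): eq. (13)
    return v, w
```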
3.2 Batch mode

A batch of training samples (say $P$ of them) is fed in at once; forward propagation yields $P$ outputs, and all weights $(\boldsymbol{v},\boldsymbol{w})$ are then trained by backpropagating the error.
Assume $P$ training samples $\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}_{p=1}^{P}$. For the $p$-th sample:

(1) input sample (including the bias): $\boldsymbol{x}^{(p)}=(x_{1}^{(p)},\cdots,x_{L}^{(p)},-1)$

(2) corresponding one-hot target vector: $\boldsymbol{t}^{(p)}=(t_{1}^{(p)},t_{2}^{(p)},\cdots,t_{N}^{(p)})$

(3) softmax output of the MLP: $\boldsymbol{y}^{(p)}=(y_{1}^{(p)},y_{2}^{(p)},\cdots,y_{N}^{(p)})$
The mean error produced by this batch is:

$$E=\dfrac{1}{2P}\sum_{p=1}^{P}\left( \sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})^{2}\right) \tag{14}$$
(a) Hidden-layer weight updates
The mean batch error can be expanded as:

$$\begin{aligned} E&=\dfrac{1}{2P}\sum_{p=1}^{P}\left( \sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})^{2}\right) \\ &= \dfrac{1}{2P} \sum_{k=1}^{N}(y_{k}^{(1)}-t_{k}^{(1)})^{2}+\cdots+\dfrac{1}{2P} \sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})^{2}+\cdots+\dfrac{1}{2P} \sum_{k=1}^{N}(y_{k}^{(P)}-t_{k}^{(P)})^{2} \end{aligned}$$

Clearly, the output $\boldsymbol{y}^{(p)}$ of every training input $\boldsymbol{x}^{(p)}$ contributes to $\dfrac{\partial E}{ \partial w_{jk}}$.
By the chain rule:

$$\begin{aligned} \dfrac{\partial E}{ \partial w_{jk} } &= \dfrac{\partial E}{\partial y_{k}^{(1)}} \dfrac{\partial y_{k}^{(1)}}{\partial \beta_{k}^{(1)}} \dfrac{\partial \beta_{k}^{(1)}}{\partial w_{jk}} +\cdots+ \dfrac{\partial E}{\partial y_{k}^{(p)}} \dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}} \dfrac{\partial \beta_{k}^{(p)}}{\partial w_{jk}} + \cdots + \dfrac{\partial E}{\partial y_{k}^{(P)}} \dfrac{\partial y_{k}^{(P)}}{\partial \beta_{k}^{(P)}} \dfrac{\partial \beta_{k}^{(P)}}{\partial w_{jk}} \\ &= \sum_{p=1}^{P} \left[ \dfrac{\partial E}{\partial y_{k}^{(p)}} \dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}} \dfrac{\partial \beta_{k}^{(p)}}{\partial w_{jk}} \right] \qquad \text{since } y_{k}^{(p)}=h(\beta_{k}^{(p)})=h\left(\sum_{j=0}^{M}a_{j}^{(p)}w_{jk} \right) \\ &=\dfrac{1}{P} \sum_{p=1}^{P} \left[ (y_{k}^{(p)} -t_{k}^{(p)}) \dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}} \dfrac{\partial \beta_{k}^{(p)}}{\partial w_{jk}} \right] \qquad \text{since } \beta_{k}^{(p)}=\sum_{j=0}^{M}a_{j}^{(p)}w_{jk} \\ &=\dfrac{1}{P} \sum_{p=1}^{P} \left[ (y_{k}^{(p)} -t_{k}^{(p)}) \dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}}\, a_{j}^{(p)} \right] \end{aligned}$$
For classification, the output-layer activation $h(\cdot)$ is the softmax function, $y_{k}^{(p)}=h(\beta_{k}^{(p)})=\dfrac{e^{\beta_{k}^{(p)}}}{\sum_{n=1}^{N}e^{\beta_{n}^{(p)}}}$.

From the Sequential-mode result, $\dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}} =y_{k}^{(p)}(1-y_{k}^{(p)})$.

Therefore:

$$\dfrac{\partial E}{\partial w_{jk} }=\dfrac{1}{P}\sum_{p=1}^{P} \left[ (y_{k}^{(p)} -t_{k}^{(p)}) \dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}}\, a_{j}^{(p)} \right] = \dfrac{1}{P}\sum_{p=1}^{P} \left[ (y_{k}^{(p)} -t_{k}^{(p)})\, y_{k}^{(p)}(1-y_{k}^{(p)})\, a_{j}^{(p)} \right] \tag{15}$$
(b) Input-layer weight updates
Starting again from the expansion of $E$ given in (a) above, the Sequential-mode results can be applied directly to any single training example $\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}$:
$$\begin{aligned} E^{(p)} &=\dfrac{1}{2P}\sum_{k=1}^{N}(y_{k}^{(p)} -t_{k}^{(p)})^{2} \\ &=\dfrac{1}{2P}(y_{1}^{(p)}-t_{1}^{(p)})^{2}+\cdots+\dfrac{1}{2P}(y_{k}^{(p)}-t_{k}^{(p)})^{2}+\cdots+\dfrac{1}{2P}(y_{N}^{(p)}-t_{N}^{(p)})^{2} \end{aligned}$$
where every $y_{k}^{(p)}=h(\beta_{k}^{(p)})=h\left( \sum_{j=0}^{M}a_{j}^{(p)} w_{jk} \right),\ k=1,2,\cdots,N$ contains $a_{j}^{(p)}$.
Therefore:

$$\dfrac{\partial y_{k}^{(p)}}{\partial a_{j}^{(p)}} = \dfrac{\partial y_{k}^{(p)}}{\partial \beta_{k}^{(p)}}\dfrac{\partial \beta_{k}^{(p)}} {\partial a_{j}^{(p)}} = y_{k}^{(p)}(1-y_{k}^{(p)})\,w_{jk}$$
By the chain rule:

$$\begin{aligned} \dfrac{\partial E^{(p)}}{ \partial v_{ij} } &= \dfrac{\partial E^{(p)}}{\partial y_{1}^{(p)}} \dfrac{\partial y_{1}^{(p)}}{\partial a_{j}^{(p)}} \dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} \dfrac{\partial \alpha_{j}^{(p)}}{\partial v_{ij}} + \cdots + \dfrac{\partial E^{(p)}}{\partial y_{k}^{(p)}} \dfrac{\partial y_{k}^{(p)}}{\partial a_{j}^{(p)}} \dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} \dfrac{\partial \alpha_{j}^{(p)}}{\partial v_{ij}} + \cdots + \dfrac{\partial E^{(p)}}{\partial y_{N}^{(p)}} \dfrac{\partial y_{N}^{(p)}}{\partial a_{j}^{(p)}} \dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} \dfrac{\partial \alpha_{j}^{(p)}}{\partial v_{ij}}\\ &= \sum_{k=1}^{N} \dfrac{\partial E^{(p)}}{\partial y_{k}^{(p)}} \dfrac{\partial y_{k}^{(p)}}{\partial a_{j}^{(p)}} \dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} \dfrac{\partial \alpha_{j}^{(p)}}{\partial v_{ij}} \qquad \text{since } a_{j}^{(p)}=g(\alpha_{j}^{(p)})=g\left(\sum_{i=0}^{L} x_{i}^{(p)} v_{ij}\right) \\ &= \dfrac{1}{P}\sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})\, y_{k}^{(p)}(1-y_{k}^{(p)})\,w_{jk}\, \dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} \dfrac{\partial \alpha_{j}^{(p)}}{\partial v_{ij}} \\ &= \dfrac{1}{P}\sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})\, y_{k}^{(p)}(1-y_{k}^{(p)})\,w_{jk}\,\dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}}\, x_{i}^{(p)} \end{aligned}$$
With the Sigmoid hidden-layer activation $g(\cdot)$, $a_{j}^{(p)}=g(\alpha_{j}^{(p)})=\dfrac{1}{1+e^{-\alpha_{j}^{(p)}}}$, we obtain:

$$\dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} =a_{j}^{(p)}(1-a_{j}^{(p)})$$
Therefore:

$$\begin{aligned} \dfrac{\partial E^{(p)}}{ \partial v_{ij} } &= \sum_{k=1}^{N} \dfrac{\partial E^{(p)}}{\partial y_{k}^{(p)}} \dfrac{\partial y_{k}^{(p)}}{\partial a_{j}^{(p)}} \dfrac{\partial a_{j}^{(p)}}{\partial \alpha_{j}^{(p)}} \dfrac{\partial \alpha_{j}^{(p)}}{\partial v_{ij}} = \dfrac{1}{P}\sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})\, y_{k}^{(p)}(1-y_{k}^{(p)})\,w_{jk}\, a_{j}^{(p)}(1-a_{j}^{(p)})\, x_{i}^{(p)} \end{aligned} \tag{16}$$
For convenience, define:

$$\delta_{o}^{(p)}(k) = \dfrac{1}{P}(y_{k}^{(p)}-t_{k}^{(p)})\,y_{k}^{(p)}(1-y_{k}^{(p)}) \tag{17}$$

$$\delta_{h}^{(p)}(j) = a_{j}^{(p)}(1-a_{j}^{(p)}) \sum_{k=1}^{N} \delta_{o}^{(p)}(k)\,w_{jk} \tag{18}$$
Equation $(15)$ can then be rewritten as:

$$\begin{aligned} \dfrac{\partial E}{\partial w_{jk} }&= \sum_{p=1}^{P} \dfrac{\partial E^{(p)}}{\partial w_{jk} } =\dfrac{1}{P}\sum_{p=1}^{P}(y_{k}^{(p)}-t_{k}^{(p)})\,y_{k}^{(p)}(1-y_{k}^{(p)})\,a_{j}^{(p)} =\sum_{p=1}^{P}\delta_{o}^{(p)}(k)\,a_{j}^{(p)} \end{aligned} \tag{19}$$
Equation $(16)$ can be rewritten as:

$$\begin{aligned} \dfrac{\partial E^{(p)}}{ \partial v_{ij} } &= \dfrac{1}{P}\sum_{k=1}^{N}(y_{k}^{(p)}-t_{k}^{(p)})\, y_{k}^{(p)}(1-y_{k}^{(p)})\,w_{jk}\, a_{j}^{(p)}(1-a_{j}^{(p)})\, x_{i}^{(p)} \\ &= a_{j}^{(p)}(1-a_{j}^{(p)}) \sum_{k=1}^{N} \delta_{o}^{(p)}(k)\, w_{jk}\, x_{i}^{(p)} \\ &=\delta_{h}^{(p)}(j)\, x_{i}^{(p)} \end{aligned}$$

$$\dfrac{\partial E}{ \partial v_{ij} } = \sum_{p=1}^{P}\dfrac{\partial E^{(p)}}{\partial v_{ij} } = \sum_{p=1}^{P}\delta_{h}^{(p)}(j)\, x_{i}^{(p)} \tag{20}$$
The hidden-layer update $(4)$ and the input-layer update $(5)$ become:

$$w_{jk}=w_{jk}-\eta \dfrac{\partial E}{\partial w_{jk}}=w_{jk}-\eta \sum_{p=1}^{P}\delta_{o}^{(p)}(k)\,a_{j}^{(p)} \tag{21}$$

$$v_{ij}=v_{ij}-\eta \dfrac{\partial E}{\partial v_{ij}}=v_{ij}-\eta \sum_{p=1}^{P}\delta_{h}^{(p)}(j)\, x_{i}^{(p)} \tag{22}$$
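The sums over $p$ in equations $(19)$-$(22)$ are exactly matrix products of the Section 2.3 matrices, which is how the Section 4 code implements them. A minimal sketch, assuming `inputs` ($P\times(L+1)$), `hidden` ($P\times(M+1)$), `outputs` and `targets` ($P\times N$) come from the forward-pass code of Section 2.4, and `eta` is the learning rate:

```python
P = inputs.shape[0]
deltao = (outputs - targets) * outputs * (1.0 - outputs) / P                   # eq. (17)
deltah = hidden[:, :-1] * (1.0 - hidden[:, :-1]) * (deltao @ weights2[:-1].T)  # eq. (18)

weights2 -= eta * (hidden.T @ deltao)   # eq. (21): the sum over p becomes a matrix product
weights1 -= eta * (inputs.T @ deltah)   # eq. (22)
```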
Summary: in Batch mode, training proceeds as follows (compare the matrix representations introduced in Section 2.3):

1) All training data $\{\boldsymbol{x}^{(p)},\boldsymbol{t}^{(p)}\}_{p=1}^{P}$ enter the multilayer perceptron in order; following the steps of Section 2.4, i.e. equation $(3)$, compute the outputs $\{\boldsymbol{y}^{(p)}\}_{p=1}^{P}$; the mean training error over all $P$ examples is given by equation $(14)$.
2) Evaluate equation $(17)$ to obtain $\{\delta_{o}^{(p)}(k),k=1,2,\cdots,N \}_{p=1}^{P}$.
3) Evaluate equations $(19)$ and $(21)$ to update all hidden-layer weights $w_{jk}$.
4) Evaluate equation $(18)$ to obtain $\{\delta_{h}^{(p)}(j),j=1,2,\cdots,M \}_{p=1}^{P}$.
5) Evaluate equations $(20)$ and $(22)$ to update all input-layer weights $v_{ij}$.
6) Repeat 1)-5) a number of times, using, for example, early stopping to decide when to end training.
4. Algorithm Implementation

The Python implementation is taken from Chapter 4 of Machine Learning - An Algorithmic Perspective (2nd Edition) and was tested on the MNIST dataset. The key code fragments follow.

(1) Forward propagation of the input data through the MLP:
```python
def mlpfwd(self,inputs):
    """ Run the network forward """
    self.hidden = np.dot(inputs,self.weights1)              # hidden-node inputs, P x M
    self.hidden = 1.0/(1.0+np.exp(-self.beta*self.hidden))  # sigmoid g(.)
    # append the bias column of -1's, giving P x (M+1)
    self.hidden = np.concatenate((self.hidden,-np.ones((np.shape(inputs)[0],1))),axis=1)

    outputs = np.dot(self.hidden,self.weights2)             # output-node inputs, P x N

    # output activation h(.)
    if self.outtype == 'linear':
        return outputs
    elif self.outtype == 'logistic':
        return 1.0/(1.0+np.exp(-self.beta*outputs))
    elif self.outtype == 'softmax':
        normalisers = np.sum(np.exp(outputs),axis=1)*np.ones((1,np.shape(outputs)[0]))
        return np.transpose(np.transpose(np.exp(outputs))/normalisers)
    else:
        print("error")
```
(2) Batch-mode backpropagation training of all weights (the updates use a momentum term); `weights1` holds the input-layer weights and `weights2` the hidden-layer weights, matching the matrix representations of Section 2.3:
```python
def mlptrain(self,inputs,targets,eta,niterations):
    """ Train the thing """
    # Add the inputs that match the bias node
    inputs = np.concatenate((inputs,-np.ones((self.ndata,1))),axis=1)

    updatew1 = np.zeros((np.shape(self.weights1)))
    updatew2 = np.zeros((np.shape(self.weights2)))

    for n in range(niterations):
        self.outputs = self.mlpfwd(inputs)
        error = 0.5*np.sum((self.outputs-targets)**2)
        if (np.mod(n,100)==0):
            print("Iteration: ",n, " Error: ",error)

        # Different types of output neurons
        if self.outtype == 'linear':
            deltao = (self.outputs-targets)/self.ndata
        elif self.outtype == 'softmax':
            # (y-t)*y*(1-y)/P, element-wise: eq. (17)
            deltao = (self.outputs-targets)*(self.outputs*(-self.outputs)+self.outputs)/self.ndata
        else:
            print("error")

        # eq. (18): the bias column is sliced off below
        deltah = self.hidden*self.beta*(1.0-self.hidden)*(np.dot(deltao,np.transpose(self.weights2)))

        # eqs. (21) and (22) with momentum
        updatew1 = eta*(np.dot(np.transpose(inputs),deltah[:,:-1])) + self.momentum*updatew1
        updatew2 = eta*(np.dot(np.transpose(self.hidden),deltao)) + self.momentum*updatew2
        self.weights1 -= updatew1
        self.weights2 -= updatew2
```
(3) Early stopping determines how many repetitions of Batch-mode training to run (each repetition trains for 100 iterations by default):
```python
def earlystopping(self,inputs,targets,valid,validtargets,eta,niterations=100):
    # append the bias column to the validation inputs
    valid = np.concatenate((valid,-np.ones((np.shape(valid)[0],1))),axis=1)

    old_val_error1 = 100002
    old_val_error2 = 100001
    new_val_error = 100000

    count = 0
    # stop once the validation error has stopped decreasing over two checks
    while (((old_val_error1 - new_val_error) > 0.001) or ((old_val_error2 - old_val_error1)>0.001)):
        count+=1
        print(count)
        self.mlptrain(inputs,targets,eta,niterations)
        old_val_error2 = old_val_error1
        old_val_error1 = new_val_error
        validout = self.mlpfwd(valid)
        new_val_error = 0.5*np.sum((validtargets-validout)**2)

    print("Stopped", new_val_error,old_val_error1, old_val_error2)
    return new_val_error
```
Initialisation:
```python
def __init__(self,inputs,targets,nhidden,beta=1,momentum=0.9,outtype='logistic'):
    """ Constructor """
    # Set up network size
    self.nin = np.shape(inputs)[1]
    self.nout = np.shape(targets)[1]
    self.ndata = np.shape(inputs)[0]
    self.nhidden = nhidden

    self.beta = beta
    self.momentum = momentum
    self.outtype = outtype

    # Initialise network
    self.weights1 = (np.random.rand(self.nin+1,self.nhidden)-0.5)*2/np.sqrt(self.nin)
    self.weights2 = (np.random.rand(self.nhidden+1,self.nout)-0.5)*2/np.sqrt(self.nhidden)
```
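Note that each weight is drawn uniformly from $[-1/\sqrt{n},\ 1/\sqrt{n}]$, where $n$ is the number of nodes feeding the layer ($L$ for `weights1`, $M$ for `weights2`); keeping the initial weights small keeps the initial $\alpha$ and $\beta$ values near zero, where the sigmoid is most sensitive.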
Main program 1 (Batch mode):
```python
import pylab as pl
import numpy as np
import mlp
from dataset.mnist import load_mnist

(x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)

nread = 10000  # read 10000 training samples
# Just use the first few images
train_in = x_train[:nread,:]
train_tgt = np.zeros((nread,10))
for i in range(nread):
    train_tgt[i,t_train[i]] = 1
# Make sure you understand how it does it

test_in = x_test[:10000,:]  # read 10000 test samples
test_tgt = np.zeros((10000,10))
for i in range(nread):
    test_tgt[i,t_test[i]] = 1

# We will need the validation set (used for early stopping)
valid_in = x_train[nread:nread*2,:]  # read 10000 validation samples
valid_tgt = np.zeros((nread,10))
for i in range(nread):
    valid_tgt[i,t_train[nread+i]] = 1

for i in [20,50]:  # 20 and 50 hidden nodes respectively
    print("----- "+str(i))
    net = mlp.mlp(train_in,train_tgt,i,outtype='softmax')
    net.earlystopping(train_in,train_tgt,valid_in,valid_tgt,0.1)
    net.confmat(test_in,test_tgt)
```
Test results:
----- 20 hidden nodes
1
Iteration: 0 Error: 4548.353637882863
2
Iteration: 0 Error: 1487.0759421425435
3
Iteration: 0 Error: 896.5361419007859
4
Iteration: 0 Error: 677.5059512878594
5
Iteration: 0 Error: 573.3650508668917
6
Iteration: 0 Error: 494.1811634917851
… …
12
Iteration: 0 Error: 403.34490704933353
13
Iteration: 0 Error: 397.1756959426855
… …
30
Iteration: 0 Error: 354.4125339850196
31
Iteration: 0 Error: 353.305132917339
32
Iteration: 0 Error: 366.55628156059623
33
Iteration: 0 Error: 350.1651292612425
Stopped 857.3212342735881 857.0808281664059 854.2069505493689
Percentage Correct: 88.97 (recognition rate)
----- 50 hidden nodes
1
Iteration: 0 Error: 4575.487855803707
2
Iteration: 0 Error: 733.3021704371301
3
Iteration: 0 Error: 501.26273959844514
4
Iteration: 0 Error: 438.9892263452885
5
Iteration: 0 Error: 405.4943514143844
6
Iteration: 0 Error: 385.16563035185777
… …
16
Iteration: 0 Error: 300.94481634009423
17
Iteration: 0 Error: 296.32368821391657
… …
49
Iteration: 0 Error: 214.4658877695538
50
Iteration: 0 Error: 213.05563233179032
Stopped 698.1289903467069 697.9241528560016 697.6666825463668
Percentage Correct: 91.36 (recognition rate)
Main program 2 (mini-batch mode; the training and validation sets can be chosen freely):
```python
import pylab as pl
import numpy as np
import mlp
from dataset.mnist import load_mnist

# Read the dataset in (code from sheet)
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)

nread = 5000  # size of each mini-batch group; here 5000 samples per group

test_in = x_test[:,:]
test_tgt = np.zeros((10000,10))  # the test set holds 10000 samples
for i in range(10000):
    test_tgt[i,t_test[i]] = 1

# the 60000 training samples split into 12 groups; when group n trains, group n+1 validates
ntimes = 60000//nread
for n in range(ntimes):
    print(n)
    # training set
    train_in = x_train[nread*n:nread*(n+1),:]
    train_tgt = np.zeros((nread,10))
    for i in range(nread):
        train_tgt[i,t_train[nread*n+i]] = 1
    # validation set
    valid_tgt = np.zeros((nread,10))
    if n < ntimes-1:
        valid_in = x_train[nread*(n+1):nread*(n+2),:]
        for i in range(nread):
            valid_tgt[i,t_train[nread*(n+1)+i]] = 1
    else:
        valid_in = x_train[0:nread,:]
        for i in range(nread):
            valid_tgt[i,t_train[i]] = 1
    if n==0:
        net = mlp.mlp(train_in,train_tgt,40,outtype='softmax')  # 40 hidden nodes
    net.earlystopping(train_in,train_tgt,valid_in,valid_tgt,0.1)
    net.confmat(test_in,test_tgt)
```
Test results (the recognition rate improves step by step):
0
Percentage Correct: 88.66000000000001
1
Percentage Correct: 90.36
2
Percentage Correct: 90.86999999999999
3
Percentage Correct: 91.36
4
Percentage Correct: 91.67
5
Percentage Correct: 91.85
6
Percentage Correct: 92.46
7
Percentage Correct: 92.38
8
Percentage Correct: 92.75999999999999
9
Percentage Correct: 93.06
10
Percentage Correct: 93.07
11
Percentage Correct: 92.85