Chapter 2: How the Backpropagation Algorithm Works
At the heart of backpropagation is an expression for the partial derivative of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. It tells us how the behavior of the whole network changes as we change the weights and biases.
2.1 A fast matrix-based approach to computing the output of a neural network
- Componentwise notation
  - $w_{jk}^l$: the weight on the connection from the $k$-th neuron in layer $l-1$ to the $j$-th neuron in layer $l$
  - $b_j^l$: the bias of the $j$-th neuron in layer $l$
  - $a_j^l$: the activation of the $j$-th neuron in layer $l$
  - This gives: $a_j^l = \sigma\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right)$
  - $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$: the weighted input to the activation function of the $j$-th neuron in layer $l$
- Matrix notation
  - $w^l$: the weight matrix for layer $l$, of shape (number of neurons in layer $l$) × (number of neurons in layer $l-1$)
  - $b^l$: the bias vector for layer $l$, a column vector of shape (number of neurons in layer $l$) × 1
  - $a^l$: the activation vector, the output of each layer
  - This gives: $a^l = \sigma(w^l a^{l-1} + b^l)$
  - $z^l \equiv w^l a^{l-1} + b^l$: an intermediate quantity, called the weighted input to the neurons in layer $l$
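A minimal NumPy sketch of this matrix form. The layer sizes, the random initialization, and the `sigmoid` helper are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # assumed layer sizes: input, hidden, output
# w^l has shape (n_l, n_{l-1}); b^l is a column vector of shape (n_l, 1)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

a = rng.standard_normal((sizes[0], 1))  # a^1: the input activation
for w, b in zip(weights, biases):
    z = w @ a + b    # z^l = w^l a^{l-1} + b^l
    a = sigmoid(z)   # a^l = sigma(z^l)
print(a.shape)       # (2, 1): the output activations a^L
```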
2.2 Two assumptions about the cost function C
The quadratic cost function: $C = \frac{1}{2n}\sum_x \|y(x) - a^L(x)\|^2$, where $n$ is the total number of training examples, the sum runs over every training example $x$, $y(x)$ is the corresponding desired output, $L$ is the number of layers in the network, and $a^L(x)$ is the vector of output activations of the network when $x$ is the input.
- Assumption 1: the cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over cost functions $C_x$ for individual training examples $x$, with $C_x = \frac{1}{2}\|y - a^L\|^2$. The reason for this assumption: backpropagation actually computes $\partial C_x/\partial w$ and $\partial C_x/\partial b$ for a single example, and then recovers $\partial C/\partial w$ and $\partial C/\partial b$ by averaging over all examples.
- Assumption 2: the cost can be written as a function of the network's output $a^L$: $C = C(a^L)$. For the quadratic cost, $C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2}\sum_j (y_j - a_j^L)^2$.
2.3 The Hadamard product, $s \odot t$
The elementwise product of two matrices (or vectors) of the same shape: $(s \odot t)_j = s_j t_j$; for example, $[1, 2]^T \odot [3, 4]^T = [3, 8]^T$.
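In NumPy the Hadamard product is just the elementwise `*` operator; a two-line sanity check with arbitrary values:

```python
import numpy as np

s = np.array([1.0, 2.0])
t = np.array([3.0, 4.0])
print(s * t)  # [3. 8.] -- elementwise (Hadamard) product
```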
2.4 The four fundamental equations behind backpropagation, with derivations
Define $\delta_j^l \equiv \frac{\partial C}{\partial z_j^l}$, called the error of the $j$-th neuron in layer $l$.
2.4.1 An equation for the error in the output layer, $\delta^L$
- Componentwise: $\delta_j^L = \frac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L)$
- Hadamard (vector) form: $\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma'(z^L)$, where $\nabla_a C$ is defined as a vector. Here $\sigma'(z^L)$ simply applies the scalar derivative $\frac{\partial y}{\partial x} = \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right)$ to each entry of the vector $z^L$ (of dimension $n$, where $n$ is the number of neurons in layer $L$), giving an $n \times 1$ vector. This is not the matrix derivative $\frac{\partial a^L}{\partial z^L}$, whose result is $n \times n$ with all off-diagonal entries equal to 0.
- Componentwise proof:
$\frac{\partial C}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial \sigma(z_j^L)}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)$
where:
$\frac{\partial C}{\partial a_j^L} = \frac{\partial}{\partial a_j^L}\left(\frac{1}{2}\sum_j (y_j - a_j^L)^2\right) = a_j^L - y_j$
and, for the sigmoid $y = \frac{1}{1+e^{-x}}$,
$\frac{\partial y}{\partial x} = \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right) = y(1-y)$
- Matrix-level proof:
$\frac{\partial C}{\partial z^L} = \frac{\partial a^L}{\partial z^L}\frac{\partial C}{\partial a^L}$
Here $\frac{\partial a^L}{\partial z^L}$ has dimensions $n \times n$: all off-diagonal entries are 0 and the diagonal entries are $\sigma'(z^L)$. $\frac{\partial C}{\partial a^L}$ has dimensions $n \times 1$ and equals $(a^L - y)$. The product therefore has dimensions $(n \times n)(n \times 1) = n \times 1$, which is equivalent to $(a^L - y) \odot \sigma'(z^L)$.
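A quick numerical check of this equivalence, with assumed toy values: multiplying by the diagonal Jacobian $\mathrm{diag}(\sigma'(z^L))$ gives the same result as the Hadamard product with $\sigma'(z^L)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z_L = np.array([0.5, -1.0, 2.0])      # toy z^L
grad_a = np.array([0.1, -0.3, 0.2])   # toy (a^L - y)

jacobian = np.diag(sigmoid_prime(z_L))           # n x n, zero off-diagonal
delta_via_jacobian = jacobian @ grad_a           # (n x n)(n x 1) -> n x 1
delta_via_hadamard = grad_a * sigmoid_prime(z_L)
print(np.allclose(delta_via_jacobian, delta_via_hadamard))  # True
```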
2.4.2 An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$
$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
Think of this formula as moving the error backward through the network: it takes the error $\delta^{l+1}$ of layer $l+1$ and passes it back through layer $l$'s activation function $\sigma(z^l)$ to measure the error $\delta^l$ of layer $l$. In this way we first compute $\delta^L$, then $\delta^{L-1}$, and so on, backpropagating step by step through the entire network.
- Matrix-level proof:
$z^{l+1} = w^{l+1}\sigma(z^l) + b^{l+1}$
$\frac{\partial z^{l+1}}{\partial z^l} = \frac{\partial \sigma(z^l)}{\partial z^l}(w^{l+1})^T$
$\frac{\partial C}{\partial z^l} = \frac{\partial z^{l+1}}{\partial z^l}\frac{\partial C}{\partial z^{l+1}} = \frac{\partial \sigma(z^l)}{\partial z^l}(w^{l+1})^T\frac{\partial C}{\partial z^{l+1}}$
$\frac{\partial \sigma(z^l)}{\partial z^l}$ has dimensions $n \times n$: all off-diagonal entries are 0 and the diagonal entries are $\sigma'(z^l)$. $(w^{l+1})^T$ has dimensions $n \times m$ (where $m$ is the number of neurons in layer $l+1$), and $\frac{\partial C}{\partial z^{l+1}}$ has dimensions $m \times 1$. The product has dimensions $(n \times n)(n \times m)(m \times 1) = n \times 1$, which is equivalent to $\delta^l = ((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l)$.
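The same kind of shape check for this equation, with assumed sizes $n = 3$ (layer $l$) and $m = 4$ (layer $l+1$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
n, m = 3, 4                             # neurons in layer l and layer l+1
w_next = rng.standard_normal((m, n))    # w^{l+1}: m x n
delta_next = rng.standard_normal((m, 1))
z_l = rng.standard_normal((n, 1))

# delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
delta_l = (w_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)  # (3, 1) = n x 1, as the dimension count says
```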
2.4.3 The rate of change of the cost with respect to any bias in the network:
$\frac{\partial C}{\partial b_j^l} = \delta_j^l$
or in vector form, $\frac{\partial C}{\partial b} = \delta$
- Matrix-level proof:
$\frac{\partial C}{\partial b^l} = \frac{\partial z^l}{\partial b^l}\frac{\partial a^l}{\partial z^l}\frac{\partial C}{\partial a^l} = \frac{\partial (w^l a^{l-1} + b^l)}{\partial b^l}\frac{\partial \sigma(z^l)}{\partial z^l}\frac{\partial C}{\partial a^l} = I\,\frac{\partial \sigma(z^l)}{\partial z^l}\frac{\partial C}{\partial a^l} = \delta^l$
2.4.4 The rate of change of the cost with respect to any weight:
$\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$
- Matrix-level proof:
$\frac{\partial C}{\partial w^l} = \frac{\partial z^l}{\partial w^l}\frac{\partial a^l}{\partial z^l}\frac{\partial C}{\partial a^l} = \frac{\partial (w^l a^{l-1} + b^l)}{\partial w^l}\frac{\partial \sigma(z^l)}{\partial z^l}\frac{\partial C}{\partial a^l}$, where the last two factors combine to $\delta^l$; componentwise this yields $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$, or in matrix form $\frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T$.
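The matrix form is just an outer product. A small sketch with assumed shapes, checking one entry against the componentwise formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n_prev, n_cur = 4, 3
a_prev = rng.standard_normal((n_prev, 1))   # a^{l-1}
delta_l = rng.standard_normal((n_cur, 1))   # delta^l

grad_w = delta_l @ a_prev.T  # shape (3, 4), same as w^l
# entry (j, k) is a_k^{l-1} * delta_j^l, exactly the componentwise rule
print(np.allclose(grad_w[1, 2], a_prev[2, 0] * delta_l[1, 0]))  # True
```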
2.4.5 Summary
The four equations:
$\delta^L = \nabla_a C \odot \sigma'(z^L)$
$\delta^l = ((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l)$
$\frac{\partial C}{\partial b_j^l} = \delta_j^l$
$\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$
If the input activation $a^{l-1}$ is small, or if the output neuron has saturated, i.e. $\sigma(z^l)$ is close to 0 or 1 so that $\sigma'(z^l)$ is small, then $\frac{\partial C}{\partial w^l}$ will be small and the weight will learn slowly.
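Numerically, $\sigma'$ collapses as the weighted input moves away from 0, which is why a saturated neuron learns slowly; a short illustration with assumed values:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_prime(z))  # 0.25, ~0.105, ~0.0066, ~4.5e-5
```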
2.5 The backpropagation algorithm
2.5.1 Computing the gradient of the cost function
- Input x: set the corresponding activation $a^1$ for the input layer
- Feedforward: for each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$
- Output error $\delta^L$: compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma(z^L)(1 - \sigma(z^L))$
- Backpropagate the error: for each layer $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l)$
- Output: the gradient of the cost function is given by $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$ and $\frac{\partial C}{\partial b_j^l} = \delta_j^l$ (see the sketch after this list)
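Putting the five steps together, a sketch of a backprop routine for one training example, in the spirit of the book's `network.py`. The quadratic-cost gradient $(a^L - y)$ and the `sigmoid`/`sigmoid_prime` helpers are assumptions consistent with the notes above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (nabla_w, nabla_b): gradients of the quadratic cost
    C_x = 0.5*||y - a^L||^2 for a single training example (x, y)."""
    # Feedforward: store all z^l and a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output error: delta^L = (a^L - y) ⊙ sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    nabla_b[-1] = delta
    nabla_w[-1] = delta @ activations[-2].T
    # Backpropagate the error for l = L-1, ..., 2.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = delta @ activations[-l - 1].T
    return nabla_w, nabla_b
```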
2.5.2 Stochastic gradient descent
With $m$ the size of a randomly chosen mini-batch:
$w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T$
$b^l \rightarrow b^l - \frac{\eta}{m}\sum_x \delta^{x,l}$
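A sketch of the mini-batch update using the `backprop` function from the previous sketch; the learning rate `eta` and the batch contents are assumptions:

```python
import numpy as np

def update_mini_batch(batch, weights, biases, eta):
    """One SGD step: average the per-example gradients from backprop
    over the mini-batch, then step against the gradient."""
    m = len(batch)
    sum_w = [np.zeros(w.shape) for w in weights]
    sum_b = [np.zeros(b.shape) for b in biases]
    for x, y in batch:
        nabla_w, nabla_b = backprop(x, y, weights, biases)
        sum_w = [sw + nw for sw, nw in zip(sum_w, nabla_w)]
        sum_b = [sb + nb for sb, nb in zip(sum_b, nabla_b)]
    # w^l -> w^l - (eta/m) sum_x delta^{x,l} (a^{x,l-1})^T, likewise for b^l
    weights[:] = [w - (eta / m) * sw for w, sw in zip(weights, sum_w)]
    biases[:] = [b - (eta / m) * sb for b, sb in zip(biases, sum_b)]
```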
2.6 The big picture
Backpropagation can be viewed as a way of summing rates of change over all possible paths through the network. It cleverly tracks how changes to the weights and biases propagate forward until they reach the output layer and thereby affect the cost function.
- Supplement: matrix derivative basics
The derivative $\frac{\partial y}{\partial x}$ of an $m$-dimensional vector $y$ with respect to an $n$-dimensional vector $x$:
- Numerator layout: $y$ laid out along the rows and $x$ along the columns, giving an $m \times n$ matrix
$\frac{\partial y}{\partial x}=\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}\end{bmatrix}$
- Denominator layout: $y$ laid out along the columns and $x$ along the rows, giving an $n \times m$ matrix
$\frac{\partial y}{\partial x}=\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}\end{bmatrix}$
For the denominator layout:

| Condition (vector-by-vector derivative) | Expression | Result |
|---|---|---|
| $X$ | $\frac{\partial X}{\partial X}$ | $I$ |
| matrix $A$ does not depend on $X$ | $\frac{\partial AX}{\partial X}$ | $A^T$ |
| matrix $A$ does not depend on $X$ | $\frac{\partial X^T A}{\partial X}$ | $A$ |
| $a$ is a constant, $u = u(x)$ | $\frac{\partial au}{\partial X}$ | $a\frac{\partial u}{\partial X}$ |
| $a = a(x)$ is scalar-valued, $u = u(x)$ | $\frac{\partial au}{\partial X}$ | $a\frac{\partial u}{\partial X} + \frac{\partial a}{\partial X}u^T$ |
| matrix $A$, $u = u(x)$ | $\frac{\partial Au}{\partial X}$ | $\frac{\partial u}{\partial X}A^T$ |
| $u = u(x)$ | $\frac{\partial f(g(u))}{\partial X}$ | $\frac{\partial u}{\partial X}\frac{\partial g(u)}{\partial u}\frac{\partial f(g)}{\partial g}$ |
| $u = u(x)$, $v = v(x)$ | $\frac{\partial (u \cdot v)}{\partial X} = \frac{\partial (u^T v)}{\partial X}$ | $\frac{\partial u}{\partial X}v + \frac{\partial v}{\partial X}u$ |
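A finite-difference check, with assumed toy sizes, of the denominator-layout rule $\frac{\partial (Ax)}{\partial x} = A^T$ from the table:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))  # y = A x maps R^3 -> R^4
x = rng.standard_normal(3)
eps = 1e-6

# Denominator layout: J[i, j] = d y_j / d x_i, so J has shape (3, 4).
J = np.empty((3, 4))
for i in range(3):
    dx = np.zeros(3)
    dx[i] = eps
    J[i] = (A @ (x + dx) - A @ x) / eps
print(np.allclose(J, A.T, atol=1e-4))  # True: d(Ax)/dx = A^T
```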