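A Concise Tutorial on Neural Network Fundamentals, Part 0: Derivative Formulas for Basic Functions

Derivative Formulas for Basic Functions

Copyright © Microsoft Corporation. All rights reserved.
Distributed under the terms of the License.

For more Microsoft AI learning resources, see the Microsoft AI Education and Learning Community.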

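How to read this tutorial series

The tutorials contain many mathematical formulas, all written in LaTeX, so:

  1. To read online in a browser, use Chrome together with the Math display extension.

  2. Alternatively, clone the whole repository locally and browse it in VSCode. VSCode needs an extension that can render Markdown, such as Markdown Preview Enhanced.

This article mostly derives mathematical formulas that later parts may rely on, so it serves as theoretical background. Read it closely if you are interested; if you would rather get hands-on right away, feel free to skip it.

Now on to the main topic. Brace yourself!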

  1. $y = c$

$$y' = 0 \tag{1}$$

  2. $y = x^a$

$$y' = a x^{a-1} \tag{2}$$

  3. $y = \log_a x$

$$y' = \frac{1}{x}\log_a e = \frac{1}{x \ln a} \tag{3}$$

(because $\log_a e = \frac{1}{\log_e a} = \frac{1}{\ln a}$)

  4. $y = \ln x$

$$y' = \frac{1}{x} \tag{4}$$

  5. $y = a^x$

$$y' = a^x \ln a \tag{5}$$

  6. $y = e^x$

$$y' = e^x \tag{6}$$

  7. $y = e^{-x}$

$$y' = -e^{-x} \tag{7}$$

  8. Sine function $y = \sin(x)$

$$y' = \cos(x) \tag{8}$$

  9. Cosine function $y = \cos(x)$

$$y' = -\sin(x) \tag{9}$$

  10. Tangent function $y = \tan(x)$

$$y' = \sec^2(x) = \frac{1}{\cos^2 x} \tag{10}$$

  11. Cotangent function $y = \cot(x)$

$$y' = -\csc^2(x) \tag{11}$$

  12. Arcsine function $y = \arcsin(x)$

$$y' = \frac{1}{\sqrt{1-x^2}} \tag{12}$$

  13. Arccosine function $y = \arccos(x)$

$$y' = -\frac{1}{\sqrt{1-x^2}} \tag{13}$$

  14. Arctangent function $y = \arctan(x)$

$$y' = \frac{1}{1+x^2} \tag{14}$$

  15. Arccotangent function $y = \operatorname{arccot}(x)$

$$y' = -\frac{1}{1+x^2} \tag{15}$$

  16. Hyperbolic sine function $y = \sinh(x) = (e^x - e^{-x})/2$

$$y' = \cosh(x) \tag{16}$$

  17. Hyperbolic cosine function $y = \cosh(x) = (e^x + e^{-x})/2$

$$y' = \sinh(x) \tag{17}$$

  18. Hyperbolic tangent function $y = \tanh(x) = (e^x - e^{-x})/(e^x + e^{-x})$

$$y' = \operatorname{sech}^2(x) = 1 - \tanh^2(x) \tag{18}$$

  19. Hyperbolic cotangent function $y = \coth(x) = (e^x + e^{-x})/(e^x - e^{-x})$

$$y' = -\operatorname{csch}^2(x) \tag{19}$$

  20. Hyperbolic secant function $y = \operatorname{sech}(x) = 2/(e^x + e^{-x})$

$$y' = -\operatorname{sech}(x) \cdot \tanh(x) \tag{20}$$

  21. Hyperbolic cosecant function $y = \operatorname{csch}(x) = 2/(e^x - e^{-x})$

$$y' = -\operatorname{csch}(x) \cdot \coth(x) \tag{21}$$
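A quick way to build trust in a table like this is to compare a few entries against a numerical derivative. The sketch below is only an illustration, not part of the original tutorial: the helper names `numeric_derivative`, `TABLE` and `max_table_error`, the test point `x = 0.5` and the step `h` are all assumptions chosen for the example.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (function, closed-form derivative from the table, formula number)
TABLE = [
    (lambda x: x ** 3,           lambda x: 3 * x ** 2,                    2),
    (lambda x: math.log(x, 2),   lambda x: 1 / (x * math.log(2)),         3),
    (lambda x: 2 ** x,           lambda x: 2 ** x * math.log(2),          5),
    (math.tan,                   lambda x: 1 / math.cos(x) ** 2,         10),
    (math.asin,                  lambda x: 1 / math.sqrt(1 - x * x),     12),
    (math.atan,                  lambda x: 1 / (1 + x * x),              14),
    (math.tanh,                  lambda x: 1 - math.tanh(x) ** 2,        18),
    (lambda x: 1 / math.cosh(x), lambda x: -math.tanh(x) / math.cosh(x), 20),
]

def max_table_error(x=0.5):
    """Largest gap between the numeric and the closed-form derivative at x."""
    return max(abs(numeric_derivative(f, x) - df(x)) for f, df, _ in TABLE)

print("largest deviation:", max_table_error())
```

The test point must lie inside the domain of every function in the list (for example, `math.asin` needs $|x| < 1$ and the logarithm needs $x > 0$).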

Arithmetic rules for derivatives

  1. $$[u(x) + v(x)]' = u'(x) + v'(x) \tag{30}$$
  2. $$[u(x) - v(x)]' = u'(x) - v'(x) \tag{31}$$
  3. $$[u(x) \cdot v(x)]' = u'(x) \cdot v(x) + v'(x) \cdot u(x) \tag{32}$$
  4. $$\left[\frac{u(x)}{v(x)}\right]' = \frac{u'(x)v(x) - v'(x)u(x)}{v^2(x)} \tag{33}$$
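As a small sanity check of the product and quotient rules, the following sketch compares both with a finite-difference derivative. The choice $u(x) = \sin x$, $v(x) = x^2 + 1$ and all function names are assumptions made for this illustration only.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Illustrative choice of u and v together with their known derivatives
u, du = math.sin, math.cos
v = lambda x: x * x + 1.0
dv = lambda x: 2.0 * x

def product_rule(x):
    """Right-hand side of the product rule (32)."""
    return du(x) * v(x) + dv(x) * u(x)

def quotient_rule(x):
    """Right-hand side of the quotient rule (33)."""
    return (du(x) * v(x) - dv(x) * u(x)) / v(x) ** 2

x0 = 0.7
print(product_rule(x0), numeric_derivative(lambda t: u(t) * v(t), x0))
print(quotient_rule(x0), numeric_derivative(lambda t: u(t) / v(t), x0))
```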

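Partial derivatives

  1. $Z = f(x, y)$

The partial derivative of $Z$ with respect to $x$ can be understood as the derivative of $Z$ with respect to $x$ alone, with $y$ held constant:

$$Z'_x = f'_x(x,y) = \frac{\partial Z}{\partial x} \tag{40}$$

The partial derivative of $Z$ with respect to $y$ can be understood as the derivative of $Z$ with respect to $y$ alone, with $x$ held constant:

$$Z'_y = f'_y(x,y) = \frac{\partial Z}{\partial y} \tag{41}$$

For a function of two variables, the partial derivative has a geometric meaning: for any value $y = y_0$, slicing the surface of the function at $y = y_0$ gives the curve $Z = f(x, y_0)$, and the first derivative of this curve is the partial derivative of $Z$ with respect to $x$. In the same way, slicing at $x = x_0$ gives the partial derivative of $Z$ with respect to $y$.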
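The slicing picture translates directly into code: fix one variable, then differentiate the resulting one-variable curve. The function $f(x, y) = x^2 y + \sin y$ and the helper names below are assumptions picked for illustration; its partial derivatives are $2xy$ and $x^2 + \cos y$.

```python
import math

def f(x, y):
    """Example surface Z = f(x, y) (an arbitrary illustrative choice)."""
    return x * x * y + math.sin(y)

def partial_x(func, x0, y0, h=1e-6):
    """Slice the surface at y = y0, then differentiate the curve x -> func(x, y0)."""
    curve = lambda x: func(x, y0)
    return (curve(x0 + h) - curve(x0 - h)) / (2 * h)

def partial_y(func, x0, y0, h=1e-6):
    """Slice the surface at x = x0, then differentiate the curve y -> func(x0, y)."""
    curve = lambda y: func(x0, y)
    return (curve(y0 + h) - curve(y0 - h)) / (2 * h)

x0, y0 = 1.5, 0.3
print(partial_x(f, x0, y0), 2 * x0 * y0)               # numeric vs. 2xy
print(partial_y(f, x0, y0), x0 * x0 + math.cos(y0))    # numeric vs. x^2 + cos(y)
```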

Derivatives of composite functions (chain rule)

  1. If $y = f(u)$, $u = g(x)$:

$$y'_x = f'(u) \cdot u'(x) = y'_u \cdot u'_x = \frac{dy}{du} \cdot \frac{du}{dx} \tag{50}$$

  2. If $y = f(u)$, $u = g(v)$, $v = h(x)$:

$$\frac{dy}{dx} = f'(u) \cdot g'(v) \cdot h'(x) = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx} \tag{51}$$

  3. If $Z = f(U, V)$ becomes a composite function of $x, y$ through the intermediate variables $U = g(x,y)$, $V = h(x,y)$, that is, $Z = f[g(x,y), h(x,y)]$:

$$\frac{\partial Z}{\partial x} = \frac{\partial Z}{\partial U} \cdot \frac{\partial U}{\partial x} + \frac{\partial Z}{\partial V} \cdot \frac{\partial V}{\partial x} \tag{52}$$

$$\frac{\partial Z}{\partial y} = \frac{\partial Z}{\partial U} \cdot \frac{\partial U}{\partial y} + \frac{\partial Z}{\partial V} \cdot \frac{\partial V}{\partial y}$$
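The multivariable chain rule (52) can be checked on a concrete example. In this sketch the functions $U = xy$, $V = x + y^2$ and $Z = U^2 V$ are assumptions chosen only because their partial derivatives are easy to write by hand; the names `chain_rule_grad` and `numeric_grad` are likewise illustrative.

```python
def g(x, y):
    """Intermediate variable U = g(x, y)."""
    return x * y

def h(x, y):
    """Intermediate variable V = h(x, y)."""
    return x + y * y

def f(U, V):
    """Outer function Z = f(U, V)."""
    return U * U * V

def Z(x, y):
    return f(g(x, y), h(x, y))

def chain_rule_grad(x, y):
    """(dZ/dx, dZ/dy) assembled term by term as in formula (52)."""
    U, V = g(x, y), h(x, y)
    dZ_dU, dZ_dV = 2 * U * V, U * U
    dU_dx, dU_dy = y, x
    dV_dx, dV_dy = 1.0, 2 * y
    return (dZ_dU * dU_dx + dZ_dV * dV_dx,
            dZ_dU * dU_dy + dZ_dV * dV_dy)

def numeric_grad(x, y, eps=1e-6):
    """Central differences applied directly to the composite Z(x, y)."""
    return ((Z(x + eps, y) - Z(x - eps, y)) / (2 * eps),
            (Z(x, y + eps) - Z(x, y - eps)) / (2 * eps))

print(chain_rule_grad(1.2, 0.7))
print(numeric_grad(1.2, 0.7))
```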

Matrix derivatives

$A, B, X$ are all matrices.

$$B\frac{\partial (AX)}{\partial X} = A^TB \tag{60}$$

$$B\frac{\partial (XA)}{\partial X} = BA^T \tag{61}$$

$$\frac{\partial (X^TA)}{\partial X} = \frac{\partial (A^TX)}{\partial X} = A \tag{62}$$

$$\frac{\partial (A^TXB)}{\partial X} = AB^T \tag{63}$$

$$\frac{\partial (A^TX^TB)}{\partial X} = BA^T, \quad \frac{d(X^TAX)}{dX} = (A+A^T)X \tag{64}$$

$$\frac{dX^T}{dX} = I, \quad \frac{dX}{dX^T} = I, \quad \frac{d(X^TX)}{dX} = 2X \tag{65}$$

$$\frac{du}{dX^T} = \left(\frac{du^T}{dX}\right)^T$$

$$\frac{d(u^Tv)}{dx} = \frac{du^T}{dx}v + \frac{dv^T}{dx}u, \quad \frac{d(uv^T)}{dx} = \frac{du}{dx}v^T + u\frac{dv^T}{dx} \tag{66}$$

$$\frac{d(AB)}{dX} = \frac{dA}{dX}B + A\frac{dB}{dX} \tag{67}$$

$$\frac{d(u^TXv)}{dX} = uv^T, \quad \frac{d(u^TX^TXu)}{dX} = 2Xuu^T \tag{68}$$

$$\frac{d[(Xu-v)^T(Xu-v)]}{dX} = 2(Xu-v)u^T \tag{69}$$
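Formula (69) is the gradient of a least-squares residual, so it is a convenient one to verify numerically. The sketch below uses plain Python lists rather than a matrix library; the shapes (a $2 \times 3$ matrix $X$), the sample values, and the helper names are assumptions made for illustration.

```python
def matvec(X, u):
    """Matrix-vector product Xu for a list-of-rows matrix."""
    return [sum(x_ij * u_j for x_ij, u_j in zip(row, u)) for row in X]

def loss(X, u, v):
    """Scalar (Xu - v)^T (Xu - v)."""
    r = [a - b for a, b in zip(matvec(X, u), v)]
    return sum(t * t for t in r)

def analytic_grad(X, u, v):
    """Formula (69): 2 (Xu - v) u^T, an outer product with the shape of X."""
    r = [a - b for a, b in zip(matvec(X, u), v)]
    return [[2.0 * r_i * u_j for u_j in u] for r_i in r]

def numeric_grad(X, u, v, h=1e-6):
    """Perturb one entry of X at a time and take central differences."""
    G = [[0.0] * len(X[0]) for _ in X]
    for i in range(len(X)):
        for j in range(len(X[0])):
            Xp = [row[:] for row in X]
            Xm = [row[:] for row in X]
            Xp[i][j] += h
            Xm[i][j] -= h
            G[i][j] = (loss(Xp, u, v) - loss(Xm, u, v)) / (2 * h)
    return G

X = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]
u = [0.2, -0.4, 1.0]
v = [1.0, -2.0]
print(analytic_grad(X, u, v))
print(numeric_grad(X, u, v))
```

The same perturb-one-entry pattern can be reused to spot-check the other scalar-valued formulas in this list, such as (63) or (68).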

Derivatives of activation functions

Sigmoid function: $A = \frac{1}{1+e^{-Z}}$

Using formula (33), let $u = 1$, $v = 1 + e^{-Z}$:

$$
\begin{aligned}
A'_Z &= \frac{u'v - v'u}{v^2} = \frac{0 - (1+e^{-Z})'}{(1+e^{-Z})^2} \\
&= \frac{e^{-Z}}{(1+e^{-Z})^2} = \frac{1 + e^{-Z} - 1}{(1+e^{-Z})^2} \\
&= \frac{1}{1+e^{-Z}} - \left(\frac{1}{1+e^{-Z}}\right)^2 \\
&= A - A^2 = A(1-A)
\end{aligned} \tag{70}
$$
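The practical payoff of the result $A(1-A)$ is that the sigmoid derivative can be computed from the activation value alone, without another exponential. A minimal sketch, with function names chosen here for illustration:

```python
import math

def sigmoid(z):
    """A = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative via formula (70): A (1 - A)."""
    a = sigmoid(z)
    return a * (1.0 - a)

def sigmoid_grad_numeric(z, h=1e-6):
    """Central-difference derivative, used only as a cross-check."""
    return (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

for z in (-3.0, 0.0, 2.0):
    print(z, sigmoid_grad(z), sigmoid_grad_numeric(z))
```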

Tanh function: $A = \frac{e^{Z}-e^{-Z}}{e^{Z}+e^{-Z}}$

Using formula (33), let $u = e^{Z}-e^{-Z}$, $v = e^{Z}+e^{-Z}$:

$$
\begin{aligned}
A'_Z &= \frac{u'v - v'u}{v^2} \\
&= \frac{(e^{Z}-e^{-Z})'(e^{Z}+e^{-Z}) - (e^{Z}+e^{-Z})'(e^{Z}-e^{-Z})}{(e^{Z}+e^{-Z})^2} \\
&= \frac{(e^{Z}+e^{-Z})(e^{Z}+e^{-Z}) - (e^{Z}-e^{-Z})(e^{Z}-e^{-Z})}{(e^{Z}+e^{-Z})^2} \\
&= \frac{(e^{Z}+e^{-Z})^2 - (e^{Z}-e^{-Z})^2}{(e^{Z}+e^{-Z})^2} \\
&= 1 - \left(\frac{e^{Z}-e^{-Z}}{e^{Z}+e^{-Z}}\right)^2 = 1 - A^2
\end{aligned} \tag{71}
$$
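The tanh result $1 - A^2$ can be confirmed the same way. In this sketch `tanh_explicit` deliberately spells out the exponential definition used in the derivation instead of calling `math.tanh`; the function names are assumptions for this example.

```python
import math

def tanh_explicit(z):
    """A = (e^z - e^{-z}) / (e^z + e^{-z}), written out as in the derivation."""
    ez, emz = math.exp(z), math.exp(-z)
    return (ez - emz) / (ez + emz)

def tanh_grad(z):
    """Derivative via formula (71): 1 - A^2."""
    a = tanh_explicit(z)
    return 1.0 - a * a

def tanh_grad_numeric(z, h=1e-6):
    """Central-difference derivative, used only as a cross-check."""
    return (tanh_explicit(z + h) - tanh_explicit(z - h)) / (2 * h)

for z in (-1.5, 0.0, 0.8):
    print(z, tanh_grad(z), tanh_grad_numeric(z))
```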

Deriving the four equations of backpropagation

The well-known four equations of backpropagation are:

$$\delta^{L} = \nabla_{a}C \odot \sigma'(Z^L) \tag{80}$$
$$\delta^{l} = ((W^{l+1})^T\delta^{l+1}) \odot \sigma'(Z^l) \tag{81}$$
$$\frac{\partial C}{\partial b_j^l} = \delta_j^l \tag{82}$$
$$\frac{\partial C}{\partial w_{jk}^{l}} = a_k^{l-1}\delta_j^l \tag{83}$$

Below we use a simple fully connected neural network with two neurons to explain these four equations intuitively.

The inputs and outputs of each node are labeled as shown in the figure. With MSE as the loss function, the computations in this computational graph are:

$$e_{o1} = \frac{1}{2}(y - a_1^3)^2$$
$$a_1^3 = \mathrm{sigmoid}(z_1^3)$$
$$z_1^3 = w_{11}^2 \cdot a_1^2 + w_{12}^2 \cdot a_2^2 + b_1^3$$
$$a_1^2 = \mathrm{sigmoid}(z_1^2)$$
$$z_1^2 = w_{11}^1 \cdot a_1^1 + w_{12}^1 \cdot a_2^1 + b_1^2$$

Following the gradient descent principle used in backpropagation, we take the gradient of the loss. The computation is:

$$\frac{\partial e_{o1}}{\partial w_{11}^2} = \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}\frac{\partial z_1^3}{\partial w_{11}^2} = \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}a_1^2$$

$$\frac{\partial e_{o1}}{\partial w_{12}^2} = \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}\frac{\partial z_1^3}{\partial w_{12}^2} = \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}a_2^2$$

$$
\begin{aligned}
\frac{\partial e_{o1}}{\partial w_{11}^1} &= \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}\frac{\partial z_1^3}{\partial a_1^2}\frac{\partial a_1^2}{\partial z_1^2}\frac{\partial z_1^2}{\partial w_{11}^1} = \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}\frac{\partial z_1^3}{\partial a_1^2}\frac{\partial a_1^2}{\partial z_1^2}a_1^1 \\
&= \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}w_{11}^2\frac{\partial a_1^2}{\partial z_1^2}a_1^1
\end{aligned}
$$

Since $w_{12}^1$ enters the network only through $z_1^2 = w_{11}^1 \cdot a_1^1 + w_{12}^1 \cdot a_2^1 + b_1^2$, its gradient follows the same path through $a_1^2$:

$$
\begin{aligned}
\frac{\partial e_{o1}}{\partial w_{12}^1} &= \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}\frac{\partial z_1^3}{\partial a_1^2}\frac{\partial a_1^2}{\partial z_1^2}\frac{\partial z_1^2}{\partial w_{12}^1} = \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}\frac{\partial z_1^3}{\partial a_1^2}\frac{\partial a_1^2}{\partial z_1^2}a_2^1 \\
&= \frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}w_{11}^2\frac{\partial a_1^2}{\partial z_1^2}a_2^1
\end{aligned}
$$

In the expressions above, $\frac{\partial a}{\partial z}$ is the derivative of the activation function, i.e. the $\sigma'(z)$ term. The partial derivatives share the common factor $\frac{\partial e_{o1}}{\partial a_1^3}\frac{\partial a_1^3}{\partial z_1^3}$, which we record with the symbol $\delta$. In matrix form:

$$\delta^L = \left[\frac{\partial e_{o1}}{\partial a_i^L}\frac{\partial a_i^L}{\partial z_i^L}\right] = \nabla_{a}C \odot \sigma'(Z^L)$$

Here $[a_i]$ denotes a matrix whose elements are $a_i$, $\nabla_{a}C$ denotes the gradient of the loss $C$ with respect to $a$, and $\odot$ denotes the element-wise product of matrices (elements at corresponding positions are multiplied).

From the derivation above we obtain the recurrence for the $\delta$ matrix:

$$\delta^{L-1} = (W^L)^T\left[\frac{\partial e_{o1}}{\partial a_i^L}\frac{\partial a_i^L}{\partial z_i^L}\right] \odot \sigma'(Z^{L-1})$$

So during backpropagation it suffices to recurse layer by layer, computing each layer's $\delta^l$ from the $\delta^{l+1}$ of the layer above it.
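To see the $\delta$ recursion at work, the sketch below runs it on a tiny sigmoid network and compares the resulting weight gradients with finite differences. The 2-2-1 layout, the parameter values, and every name (`W1` for the input-to-hidden weights, `W2` for the hidden-to-output weights, `backprop`, `numeric_grad_W1`) are assumptions made for this illustration; the original figure is not reproduced here, so the layout may differ from it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    """Return hidden activations a2 and output activation a3."""
    z2 = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(W1, b1)]
    a2 = [sigmoid(z) for z in z2]
    z3 = sum(w * a for w, a in zip(W2, a2)) + b2
    return a2, sigmoid(z3)

def loss(W1, b1, W2, b2, x, y):
    """MSE loss e = 1/2 (y - a3)^2."""
    _, a3 = forward(W1, b1, W2, b2, x)
    return 0.5 * (y - a3) ** 2

def backprop(W1, b1, W2, b2, x, y):
    a2, a3 = forward(W1, b1, W2, b2, x)
    # (80): output delta = dC/da * sigma'(z), with sigma' = a (1 - a)
    delta3 = (a3 - y) * a3 * (1.0 - a3)
    # (81): push delta back through the weights, times sigma' of the hidden layer
    delta2 = [w * delta3 * a * (1.0 - a) for w, a in zip(W2, a2)]
    # (83): weight gradient = incoming activation * delta of the receiving node
    grad_W2 = [delta3 * a for a in a2]
    grad_W1 = [[d * xk for xk in x] for d in delta2]
    return grad_W1, grad_W2

def numeric_grad_W1(W1, b1, W2, b2, x, y, h=1e-6):
    """Central differences on each first-layer weight, as a cross-check."""
    grad = [[0.0] * len(row) for row in W1]
    for j in range(len(W1)):
        for k in range(len(W1[0])):
            Wp = [row[:] for row in W1]
            Wm = [row[:] for row in W1]
            Wp[j][k] += h
            Wm[j][k] -= h
            grad[j][k] = (loss(Wp, b1, W2, b2, x, y)
                          - loss(Wm, b1, W2, b2, x, y)) / (2 * h)
    return grad

W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.05, -0.05]
W2 = [0.7, -0.5]
b2 = 0.1
x, y = [1.0, 2.0], 1.0

grad_W1, grad_W2 = backprop(W1, b1, W2, b2, x, y)
numeric_W1 = numeric_grad_W1(W1, b1, W2, b2, x, y)
print(grad_W1)
print(numeric_W1)
```

By formula (82) the bias gradients are the deltas themselves, which is why the sketch does not compute them separately.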

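This is a fairly intuitive result, but the derivation above is not rigorous. Below we derive it again from stricter mathematical definitions, starting with a few additional definitions.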

Definition of the derivative of a scalar with respect to a matrix

Suppose $y$ is a scalar and $X$ is a matrix of size $N \times M$, with $y = f(X)$, where $f()$ is a function. Let us see how $df$ should be computed.

We start from the definition:

$$df = \sum_j^M\sum_i^N \frac{\partial f}{\partial x_{ij}}dx_{ij}$$

Next we introduce the trace of a matrix, which is the sum of its diagonal elements:

$$\mathrm{tr}(X) = \sum_i x_{ii}$$

With the trace available, can the gradient computation above be expressed in terms of it?

$$\frac{\partial f}{\partial X} = \begin{pmatrix} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} & \dots & \frac{\partial f}{\partial x_{1M}} \\ \frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} & \dots & \frac{\partial f}{\partial x_{2M}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{N1}} & \frac{\partial f}{\partial x_{N2}} & \dots & \frac{\partial f}{\partial x_{NM}} \end{pmatrix} \tag{90}$$

$$dX = \begin{pmatrix} dx_{11} & dx_{12} & \dots & dx_{1M} \\ dx_{21} & dx_{22} & \dots & dx_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ dx_{N1} & dx_{N2} & \dots & dx_{NM} \end{pmatrix} \tag{91}$$

Consider the diagonal elements of the product of the transpose of matrix (90) with matrix (91):

$$\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)_{jj} = \sum_i^N \frac{\partial f}{\partial x_{ij}} dx_{ij}$$

Therefore,

$$\mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right) = \sum_j^M\sum_i^N\frac{\partial f}{\partial x_{ij}}dx_{ij} = df = \mathrm{tr}(df) \tag{92}$$

The last equality holds because $df$ is a scalar, and the trace of a scalar is the scalar itself.
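Identity (92) can be made concrete with a small example. The sketch below assumes $f(X) = \sum_{ij} x_{ij}^2$, whose gradient is $2X$, and a hand-picked small perturbation `dX`; it then evaluates $df$ both as the double sum and as the trace. All names and values are illustrative assumptions.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def f(X):
    """Example scalar function of a matrix: the sum of squared entries."""
    return sum(x * x for row in X for x in row)

X = [[1.0, -2.0, 0.5], [3.0, 0.0, -1.0]]            # N = 2, M = 3
G = [[2.0 * x for x in row] for row in X]           # df/dX, laid out as in (90)
dX = [[1e-6, -2e-6, 3e-6], [-1e-6, 2e-6, 1e-6]]     # small perturbation, as in (91)

# Double-sum definition of df
df_sum = sum(g * d for g_row, d_row in zip(G, dX) for g, d in zip(g_row, d_row))
# Trace form (92): tr((df/dX)^T dX), an M x M product
df_trace = trace(matmul(transpose(G), dX))
# Actual change of f; equals df up to second-order terms in dX
X_new = [[x + d for x, d in zip(xr, dr)] for xr, dr in zip(X, dX)]
df_actual = f(X_new) - f(X)

print(df_sum, df_trace, df_actual)
```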

Some properties of the matrix trace and derivative

This section lists some properties of the matrix trace and derivative for reference in the derivations that follow. Impatient readers can simply take them as given.

$$d(X + Y) = dX + dY \tag{93}$$
$$d(XY) = (dX)Y + X(dY) \tag{94}$$
$$dX^T = (dX)^T \tag{95}$$
$$d(\mathrm{tr}(X)) = \mathrm{tr}(dX) \tag{96}$$
$$d(X \odot Y) = dX \odot Y + X \odot dY \tag{97}$$
$$d(f(X)) = f'(X) \odot dX \tag{98}$$
$$\mathrm{tr}(XY) = \mathrm{tr}(YX) \tag{99}$$
$$\mathrm{tr}(A^T (B \odot C)) = \mathrm{tr}((A \odot B)^T C) \tag{100}$$

The proofs of these properties are all similar; we take Eq. (94) as an example. Let

$$Z = XY$$

Then any element of $Z$ is

$$
\begin{aligned}
z_{ij} &= \sum_k x_{ik}y_{kj} \\
dz_{ij} &= \sum_k d(x_{ik}y_{kj}) \\
&= \sum_k (dx_{ik}) y_{kj} + \sum_k x_{ik} (dy_{kj}) \\
&= ((dX)Y)_{ij} + (X(dY))_{ij}
\end{aligned}
$$

So every element of $dZ$ equals the corresponding element of $(dX)Y + X(dY)$, and Eq. (94) holds.
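The two trace identities (99) and (100) are easy to confirm on small integer matrices, where the arithmetic is exact. The matrices and helper names below are assumptions chosen for illustration.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hadamard(A, B):
    """Element-wise product of two matrices of the same shape."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# (99): tr(XY) = tr(YX), even though XY is 2x2 and YX is 3x3
X = [[1, 2, 3], [4, 5, 6]]
Y = [[7, 8], [9, 10], [11, 12]]
lhs_99 = trace(matmul(X, Y))
rhs_99 = trace(matmul(Y, X))

# (100): tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C) for matrices of equal shape
A = [[1, -2, 3], [0, 4, -1]]
B = [[2, 1, 0], [5, -3, 2]]
C = [[-1, 3, 2], [4, 0, 6]]
lhs_100 = trace(matmul(transpose(A), hadamard(B, C)))
rhs_100 = trace(matmul(transpose(hadamard(A, B)), C))

print(lhs_99, rhs_99)
print(lhs_100, rhs_100)
```

Both sides of (100) reduce to $\sum_{ij} a_{ij} b_{ij} c_{ij}$, which is why the element-wise product can be moved from one factor to the other.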

Proofs of formulas related to neural networks:

  • First, consider a general case. Given $f = A^TXB$, where $A, B$ are constant vectors, we want $\frac{\partial f}{\partial X}$. The derivation is as follows.

    By Eq. (94),

    $$df = d(A^TXB) = d(A^TX)B + A^TX(dB) = d(A^TX)B + 0 = d(A^T)XB + A^T dX B = A^T dX B$$

    Since $df$ is a scalar and the trace of a scalar is the scalar itself, Eq. (99) gives

    $$df = \mathrm{tr}(df) = \mathrm{tr}(A^T dX B) = \mathrm{tr}(BA^T dX)$$

    By Eq. (92),

    $$\mathrm{tr}(df) = \mathrm{tr}\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)$$

    so we obtain

    $$\left(\frac{\partial f}{\partial X}\right)^T = BA^T$$
    $$\frac{\partial f}{\partial X} = AB^T \tag{101}$$

  • Now consider the fully connected layer:

    $$Y = WX + B$$

    Take one element of the fully connected layer:

    $$y = wX + b$$

    Here $w$ is one row of the weight matrix, of size $1 \times M$; $X$ is a vector of size $M \times 1$; and $y$ is a scalar. Appending an identity matrix of size 1 leaves the expression unchanged:

    $$y = (w^T)^T X I + b$$

    Applying Eq. (101), which follows from Eq. (92), with $A = w^T$ and $B = I$ gives

    $$\frac{\partial y}{\partial X} = w^T I^T = w^T$$

    Therefore, in the four equations of error propagation, when the error $\delta$ passed back from the layer above is propagated further, the chain rule gives

    $$\delta^{L-1} = (W^L)^T \delta^L \odot \sigma'(Z^{L-1})$$

    Similarly, if $y = wX + b$ is viewed as

    $$y = IwX + b$$

    then the same result, with $A = I$ and $B = X$, gives

    $$\frac{\partial y}{\partial w} = X^T$$

  • When softmax and cross-entropy are used to compute the loss:

    $$l = -Y^T \log(\mathrm{softmax}(Z))$$

    Here $Y$ is the label of the data and $Z$ is the output predicted by the network; both $Y$ and $Z$ have dimension $N \times 1$. $Z$ is passed through softmax to obtain probabilities. We want $\frac{\partial l}{\partial Z}$; the derivation is as follows:

    $$\mathrm{softmax}(Z) = \frac{\exp(Z)}{\boldsymbol{1}^T\exp(Z)}$$

    where $\boldsymbol{1}$ is the all-ones vector of dimension $N \times 1$. Substituting the softmax expression into the loss function gives

    $$
    \begin{aligned}
    dl &= -Y^T d(\log(\mathrm{softmax}(Z))) \\
    &= -Y^T d\left(\log\frac{\exp(Z)}{\boldsymbol{1}^T\exp(Z)}\right) \\
    &= -Y^T dZ + Y^T \boldsymbol{1}\, d(\log(\boldsymbol{1}^T\exp(Z)))
    \end{aligned} \tag{102}
    $$

    Next we simplify the second half of Eq. (102), using Eq. (98):

    $$d(\log(\boldsymbol{1}^T\exp(Z))) = \log'(\boldsymbol{1}^T\exp(Z))\, d(\boldsymbol{1}^T\exp(Z)) = \frac{\boldsymbol{1}^T(\exp(Z) \odot dZ)}{\boldsymbol{1}^T\exp(Z)}$$

    Using Eq. (100), we obtain

    $$
    \begin{aligned}
    \mathrm{tr}\left(Y^T \boldsymbol{1}\frac{\boldsymbol{1}^T(\exp(Z) \odot dZ)}{\boldsymbol{1}^T\exp(Z)}\right) &= \mathrm{tr}\left(Y^T \boldsymbol{1}\frac{(\boldsymbol{1} \odot \exp(Z))^T dZ}{\boldsymbol{1}^T\exp(Z)}\right) \\
    &= \mathrm{tr}\left(Y^T \boldsymbol{1}\frac{\exp(Z)^T dZ}{\boldsymbol{1}^T\exp(Z)}\right) = \mathrm{tr}(Y^T \boldsymbol{1}\, \mathrm{softmax}(Z)^T dZ)
    \end{aligned} \tag{103}
    $$

    Substituting Eq. (103) into Eq. (102) and taking the trace of both sides gives:

    $$dl = \mathrm{tr}(dl) = \mathrm{tr}(-Y^T dZ + Y^T\boldsymbol{1}\, \mathrm{softmax}(Z)^T dZ) = \mathrm{tr}\left(\left(\frac{\partial l}{\partial Z}\right)^T dZ\right)$$

    In a classification problem exactly one entry of a label is 1, so $Y^T\boldsymbol{1} = 1$, and therefore

    $$\frac{\partial l}{\partial Z} = \mathrm{softmax}(Z) - Y$$

    This is the formula used to compute the backpropagated error at the loss function.
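The final result, $\frac{\partial l}{\partial Z} = \mathrm{softmax}(Z) - Y$, is simple enough to verify directly. The following sketch assumes a three-class example with a one-hot label; the function names are illustrative, and subtracting the maximum inside `softmax` is a common numerical-stability step that is not part of the derivation above (it does not change the result).

```python
import math

def softmax(z):
    """softmax(Z) = exp(Z) / (1^T exp(Z)); shifted by max(z) for stability."""
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def cross_entropy(z, y):
    """l = -Y^T log(softmax(Z))."""
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

def analytic_grad(z, y):
    """Derived result: dl/dZ = softmax(Z) - Y."""
    return [pi - yi for pi, yi in zip(softmax(z), y)]

def numeric_grad(z, y, h=1e-6):
    """Central differences on each logit, as a cross-check."""
    grad = []
    for i in range(len(z)):
        zp, zm = z[:], z[:]
        zp[i] += h
        zm[i] -= h
        grad.append((cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h))
    return grad

z = [1.0, 2.0, 0.5]
y = [0.0, 1.0, 0.0]     # one-hot label, so Y^T 1 = 1
print(analytic_grad(z, y))
print(numeric_grad(z, y))
```

Because the softmax outputs sum to 1 and the label sums to 1, the components of this gradient always sum to zero.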

References

矩阵求导术 (Matrix Differentiation Techniques)
Contact us: msraeduhub@microsoft.com