Gradient Derivation for a Three-Layer Fully Connected Neural Network

This post works through the backpropagation algorithm in detail, using the MNIST dataset as an example. It covers the network architecture, the loss function, and the gradient computation: by deriving the gradient of the loss with respect to each layer's weights and biases, it shows how backpropagation updates the network parameters. Along the way it derives the relevant derivatives of the sigmoid activation, the softmax function, and the cross-entropy loss, using matrix-calculus (trace) techniques.

Dataset:

We use the MNIST dataset. In each sample $(x, y)$, $x$ is originally a $28 \times 28$ grayscale image; after normalizing the pixels, it is reshaped into a $(28 \cdot 28, 1)$ column vector. $y$ is originally an integer indicating the image's class; it is one-hot encoded into a $(10, 1)$ column vector (1 at the integer's position, 0 everywhere else).
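A minimal preprocessing sketch in NumPy, assuming raw arrays `images` (shape `(N, 28, 28)`, uint8) and `labels` (shape `(N,)`, integers 0–9) have already been loaded; these names are placeholders, not a specific loader's API:

```python
import numpy as np

def preprocess(images, labels):
    """Normalize pixels to [0, 1], flatten each image to a (784, 1)
    column vector, and one-hot encode each integer label to (10, 1)."""
    xs = [img.reshape(28 * 28, 1).astype(np.float64) / 255.0 for img in images]
    ys = []
    for label in labels:
        y = np.zeros((10, 1))
        y[label] = 1.0  # 1 at the true class, 0 elsewhere
        ys.append(y)
    return xs, ys
```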

Some functions:
$\sigma(x)=\frac{1}{1+e^{-x}}$
$\sigma'(x)=\sigma(x)\cdot(1-\sigma(x))$
$\mathrm{softmax}(x)=\frac{e^x}{\sum_{i=1}^{n}e^{x_i}}=\frac{e^x}{1^Te^x}$
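These map directly to NumPy. A minimal sketch; subtracting the max inside `softmax` is a standard numerical-stability guard, not part of the math above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def softmax(x):
    # Shifting by the max leaves the result unchanged but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)
```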

Notes:

  • $\odot$ denotes element-wise multiplication of two matrices of the same shape
  • Wrapping a scalar in a trace leaves it unchanged
  • The transpose of a scalar is itself
  • For the derivative of a scalar with respect to a matrix, the relation between derivative and differential gives $df(x) = \mathrm{tr}\left(\left(\frac{\partial f(x)}{\partial x}\right)^T d(x)\right)$ (a quick numerical sanity check follows below)
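As an illustrative check (not part of the derivation), the two identities relied on later — $1^T(u \odot v) = (1 \odot u)^T v = u^T v$, used to remove the $\odot$ in the softmax differential, and the cyclic property of the trace — can be verified with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal((5, 1)), rng.standard_normal((5, 1))
ones = np.ones((5, 1))

# 1^T (u ⊙ v) == u^T v
assert np.isclose((ones.T @ (u * v)).item(), (u.T @ v).item())

# Cyclic property of the trace: tr(ABC) == tr(CAB)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
```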

Network architecture:
$a_1 = w_1 \cdot x + b_1$
$h_1 = \sigma(a_1)$
$a_2 = w_2 \cdot h_1 + b_2$
$h_2 = \sigma(a_2)$
$a_3 = w_3 \cdot h_2 + b_3$
$y' = \mathrm{softmax}(a_3)$
$y'$ is the network's output and has the same shape as the label $y$.
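A forward-pass sketch following these equations, reusing `sigmoid` and `softmax` from above. The parameter shapes are assumptions for illustration: with hidden width `h` (a free choice), `w1: (h, 784)`, `w2: (h, h)`, `w3: (10, h)`, and biases as matching column vectors:

```python
def forward(x, params):
    """Forward pass; returns all intermediates needed for backprop."""
    w1, b1, w2, b2, w3, b3 = params
    a1 = w1 @ x + b1
    h1 = sigmoid(a1)
    a2 = w2 @ h1 + b2
    h2 = sigmoid(a2)
    a3 = w3 @ h2 + b3
    y_hat = softmax(a3)
    return a1, h1, a2, h2, a3, y_hat
```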

Loss function:
We use the cross-entropy loss:
$l(y', y) = -y^T \ln(y')$
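In code (the small `eps` inside the log is a common guard against $\ln 0$, an implementation detail rather than part of the formula):

```python
def cross_entropy(y_hat, y, eps=1e-12):
    # y is one-hot, so this picks out -log of the predicted
    # probability of the true class.
    return float(-(y.T @ np.log(y_hat + eps)))
```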

Gradient of the loss $l$ with respect to each parameter:
$$
\begin{aligned}
l &= -y^T \ln\frac{e^{a_3}}{1^T e^{a_3}} \\
  &= -y^T\left(a_3 - 1 \cdot \ln(1^T e^{a_3})\right) \\
  &= -y^T a_3 + \ln(1^T e^{a_3})
\end{aligned}
$$
where the last step uses $y^T 1 = 1$, since $y$ is one-hot.

$$
\begin{aligned}
dl &= \mathrm{tr}(-y^T d(a_3)) + \mathrm{tr}\left(\frac{1^T\left(e^{a_3} \odot d(a_3)\right)}{1^T e^{a_3}}\right) \\
   &= \mathrm{tr}(-y^T d(a_3)) + \mathrm{tr}\left(\frac{(1 \odot e^{a_3})^T d(a_3)}{1^T e^{a_3}}\right) \\
   &= \mathrm{tr}(-y^T d(a_3)) + \mathrm{tr}\left(\frac{(e^{a_3})^T d(a_3)}{1^T e^{a_3}}\right) \\
   &= \mathrm{tr}(-y^T d(a_3)) + \mathrm{tr}\left(\left(\frac{e^{a_3}}{1^T e^{a_3}}\right)^T d(a_3)\right) \\
   &= \mathrm{tr}(-y^T d(a_3)) + \mathrm{tr}\left(\mathrm{softmax}(a_3)^T d(a_3)\right) \\
   &= \mathrm{tr}\left[\left(\mathrm{softmax}(a_3)^T - y^T\right) d(a_3)\right] \\
   &= \mathrm{tr}\left[\left(\mathrm{softmax}(a_3) - y\right)^T d(a_3)\right]
\end{aligned}
$$

Therefore $\frac{\partial l}{\partial a_3} = \mathrm{softmax}(a_3) - y$.
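This gradient is easy to verify with a central finite-difference check; an illustrative sketch with an arbitrary logit vector and one-hot label, reusing `softmax` from above:

```python
import numpy as np

rng = np.random.default_rng(1)
a3 = rng.standard_normal((10, 1))
y = np.zeros((10, 1)); y[3] = 1.0  # arbitrary one-hot label

analytic = softmax(a3) - y

# Central finite differences on l(a3) = -y^T ln(softmax(a3))
numeric = np.zeros_like(a3)
eps = 1e-6
for i in range(10):
    e = np.zeros((10, 1)); e[i] = eps
    lp = -(y.T @ np.log(softmax(a3 + e))).item()
    lm = -(y.T @ np.log(softmax(a3 - e))).item()
    numeric[i] = (lp - lm) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```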

By the same reasoning,
$$
\begin{aligned}
dl &= \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(a_3)\right] \\
   &= \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(w_3)\,h_2\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_3}\right)^T w_3\,d(h_2)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(b_3)\right] \\
   &= \mathrm{tr}\left[h_2\left(\frac{\partial l}{\partial a_3}\right)^T d(w_3)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_3}\right)^T w_3\,d(h_2)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(b_3)\right]
\end{aligned}
$$
From the first term: $\frac{\partial l}{\partial w_3} = \frac{\partial l}{\partial a_3} h_2^T$
From the second term: $\frac{\partial l}{\partial h_2} = w_3^T \frac{\partial l}{\partial a_3}$; and since $\sigma(x)$ is applied element-wise, $\frac{\partial l}{\partial a_2} = \frac{\partial l}{\partial h_2} \odot \sigma'(a_2)$
From the third term: $\frac{\partial l}{\partial b_3} = \frac{\partial l}{\partial a_3}$

Likewise,
$$
\begin{aligned}
dl &= \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(a_2)\right] \\
   &= \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(w_2)\,h_1\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_2}\right)^T w_2\,d(h_1)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(b_2)\right] \\
   &= \mathrm{tr}\left[h_1\left(\frac{\partial l}{\partial a_2}\right)^T d(w_2)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_2}\right)^T w_2\,d(h_1)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(b_2)\right]
\end{aligned}
$$
From the first term: $\frac{\partial l}{\partial w_2} = \frac{\partial l}{\partial a_2} h_1^T$
From the second term: $\frac{\partial l}{\partial h_1} = w_2^T \frac{\partial l}{\partial a_2}$; and since $\sigma(x)$ is applied element-wise, $\frac{\partial l}{\partial a_1} = \frac{\partial l}{\partial h_1} \odot \sigma'(a_1)$
From the third term: $\frac{\partial l}{\partial b_2} = \frac{\partial l}{\partial a_2}$

And finally,
$$
\begin{aligned}
dl &= \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(a_1)\right] \\
   &= \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(w_1)\,x\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_1}\right)^T w_1\,d(x)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(b_1)\right] \\
   &= \mathrm{tr}\left[x\left(\frac{\partial l}{\partial a_1}\right)^T d(w_1)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_1}\right)^T w_1\,d(x)\right] + \mathrm{tr}\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(b_1)\right]
\end{aligned}
$$
From the first term: $\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial a_1} x^T$
From the second term: since the input $x$ is fixed, $d(x) = 0$ and the term vanishes
From the third term: $\frac{\partial l}{\partial b_1} = \frac{\partial l}{\partial a_1}$
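Collecting the formulas from all three layers, a backward-pass sketch, reusing the hypothetical `forward` and `sigmoid_prime` from above:

```python
def backward(x, y, params):
    """Gradients of the cross-entropy loss w.r.t. all parameters,
    following the derivation above."""
    w1, b1, w2, b2, w3, b3 = params
    a1, h1, a2, h2, a3, y_hat = forward(x, params)

    dl_da3 = y_hat - y                    # softmax(a3) - y
    dl_dw3 = dl_da3 @ h2.T
    dl_db3 = dl_da3

    dl_dh2 = w3.T @ dl_da3
    dl_da2 = dl_dh2 * sigmoid_prime(a2)   # element-wise: ⊙ σ'(a2)
    dl_dw2 = dl_da2 @ h1.T
    dl_db2 = dl_da2

    dl_dh1 = w2.T @ dl_da2
    dl_da1 = dl_dh1 * sigmoid_prime(a1)   # element-wise: ⊙ σ'(a1)
    dl_dw1 = dl_da1 @ x.T
    dl_db1 = dl_da1

    return dl_dw1, dl_db1, dl_dw2, dl_db2, dl_dw3, dl_db3
```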

At this point, the gradients of the loss $l$ with respect to all parameters $w_i$ and $b_i$ are expressed in terms of known matrices, and gradient descent simply updates the parameters as
$$w_i = w_i - lr \cdot \frac{\partial l}{\partial w_i}, \qquad b_i = b_i - lr \cdot \frac{\partial l}{\partial b_i}$$
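As a sketch, a single per-sample gradient-descent step using the `backward` above (`lr` is the learning rate):

```python
def sgd_step(x, y, params, lr=0.1):
    # params and the returned gradients share the same ordering:
    # (w1, b1, w2, b2, w3, b3)
    grads = backward(x, y, params)
    return tuple(p - lr * g for p, g in zip(params, grads))
```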

Code:
The code is a bit messy, please bear with me QAQ
Link

References:
矩阵求导术 (Matrix Derivative Techniques) — the matrix-calculus notes this derivation follows
