Week 6 hw3-1: Deriving Backpropagation for a Fully Connected Network


This took me a while to work out, so I'm writing it down.

The network in the assignment consists of several fully connected layers with ReLU activations; the output layer applies softmax, and the loss function is cross-entropy.

1. Notation


Suppose the network has $n$ layers. For $i<n$, the following equations hold:
$$
\begin{cases}
z^{i-1}_j=\text{ReLU}(x^{i-1}_j)\\
x^{i}_j=\text{ReLU}(\hat{z}^{i-1}_j)\\
\hat{z}_{j}^{i-1}=\sum\limits_{k=1}^{d_{i-1}}z^{i-1}_kw_{kj}^{i-1}+b_j^{i-1}
\end{cases}
$$
Also write:
$$
\begin{cases}
\mathbf{x}^{i-1}=[x_1^{i-1},x_2^{i-1},\dots,x_{d_{i-1}}^{i-1}]\\
\mathbf{x}^{i}=[x_1^{i},x_2^{i},\dots,x_{d_{i}}^{i}]\\
\mathbf{z}^{i-1}=[z_1^{i-1},z_2^{i-1},\dots,z_{d_{i-1}}^{i-1}]\\
\hat{\mathbf{z}}^{i-1}=[\hat{z}_1^{i-1},\hat{z}_2^{i-1},\dots,\hat{z}_{d_{i}}^{i-1}]\\
\mathbf{b}^{i-1}=[b_1^{i-1},b_2^{i-1},\dots,b_{d_i}^{i-1}]\\
\mathbf{W}^{i-1}=
\begin{pmatrix}
w_{11}^{i-1} & w_{12}^{i-1} & \dots & w_{1d_i}^{i-1}\\
w_{21}^{i-1} & w_{22}^{i-1} & \dots & w_{2d_i}^{i-1}\\
\vdots & \vdots & \ddots & \vdots\\
w_{d_{i-1}1}^{i-1} & w_{d_{i-1}2}^{i-1} & \dots & w_{d_{i-1}d_i}^{i-1}
\end{pmatrix}
\end{cases}
$$

In matrix form:
$$
\begin{cases}
\mathbf{z}^{i-1}=\text{ReLU}(\mathbf{x}^{i-1})\\
\mathbf{x}^i=\text{ReLU}(\hat{\mathbf{z}}^{i-1})\\
\hat{\mathbf{z}}^{i-1}=\mathbf{z}^{i-1}\mathbf{W}^{i-1}+\mathbf{b}^{i-1}
\end{cases}
$$
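As a sanity check, the per-layer forward rule above is a one-liner in numpy. This is my own sketch (the names `relu` and `forward_layer` are not from the assignment):

```python
import numpy as np

def relu(x):
    # ReLU applied elementwise
    return np.maximum(x, 0.0)

def forward_layer(z_prev, W, b):
    # z_hat^{i-1} = z^{i-1} W^{i-1} + b^{i-1};  x^i = ReLU(z_hat^{i-1})
    z_hat = z_prev @ W + b
    return z_hat, relu(z_hat)

# tiny example with d_{i-1} = 3 and d_i = 2
z_prev = np.array([1.0, 2.0, 0.5])
W = np.ones((3, 2))
b = np.zeros(2)
z_hat, x_next = forward_layer(z_prev, W, b)
```

Note that with row vectors the weight matrix multiplies on the right, matching the $\mathbf{z}^{i-1}\mathbf{W}^{i-1}$ convention above.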
When $i=n$, write $y_j$ for the true label at position $j$ and $\hat{y}_j$ for the prediction. We have:
$$
\begin{cases}
z^{n-1}_j=\text{ReLU}(x^{n-1}_j)\\
\hat{y}_j=[\text{softmax}(\hat{z}^{n-1}_1,\hat{z}^{n-1}_2,\dots,\hat{z}^{n-1}_{d_n})]_j\\
\hat{z}_{j}^{n-1}=\sum\limits_{k=1}^{d_{n-1}}z^{n-1}_kw_{kj}^{n-1}+b_j^{n-1}
\end{cases}
$$
Write $\mathbf{y}=[y_1,y_2,\dots,y_{d_n}]$ and $\hat{\mathbf{y}}=[\hat y_1,\hat y_2,\dots,\hat y_{d_n}]$, where $\mathbf{y}$ is a one-hot vector.

In matrix form:
$$
\begin{cases}
\mathbf{z}^{n-1}=\text{ReLU}(\mathbf{x}^{n-1})\\
\hat{\mathbf{y}}=\text{softmax}(\hat{\mathbf{z}}^{n-1})\\
\hat{\mathbf{z}}^{n-1}=\mathbf{z}^{n-1}\mathbf{W}^{n-1}+\mathbf{b}^{n-1}
\end{cases}
$$
The loss function is $J(\mathbf{y},\hat{\mathbf{y}})=-\sum\limits_{i=1}^{d_n}y_i\log \hat{y}_i$.
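The output mapping and the loss translate directly into code. A minimal sketch (the max-subtraction in `softmax` is a standard numerical-stability trick that does not change the mathematical result):

```python
import numpy as np

def softmax(z):
    # subtracting the max avoids overflow in exp; the ratio is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    # J(y, y_hat) = -sum_i y_i * log(y_hat_i), with y one-hot
    return -np.sum(y * np.log(y_hat))

y_hat = softmax(np.array([2.0, 1.0, 0.1]))
y = np.array([1.0, 0.0, 0.0])      # true class is the first position
loss = cross_entropy(y, y_hat)
```

Since $\mathbf{y}$ is one-hot, the loss reduces to $-\log\hat{y}_i$ for the true class $i$, as used later in the derivation.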

2. Derivation

Next we compute $\nabla J_{\mathbf{W}^l}$ and $\nabla J_{\mathbf{b}^l}$.

Drawing the computation graph, we get:
$$
\nabla J_{w^l_{ij}}=\frac{\partial J}{\partial \hat{z}_j^l}\frac{\partial \hat{z}_j^l}{\partial w_{ij}^l}=\frac{\partial J}{\partial \hat{z}_j^l}z^l_{i},\qquad
\nabla J_{b^l_{j}}=\frac{\partial J}{\partial \hat{z}_j^l}\frac{\partial \hat{z}_j^l}{\partial b^l_{j}}=\frac{\partial J}{\partial \hat{z}_j^l}
$$

In matrix form:
$$
\nabla J_{\mathbf{W}^l}=(\mathbf{z}^l)^{T}\nabla J_{\hat{\mathbf{z}}^l},\qquad
\nabla J_{\mathbf{b}^l}=\nabla J_{\hat{\mathbf{z}}^l}
$$
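In code, once $\nabla J_{\hat{\mathbf{z}}^l}$ is known, the two parameter gradients are an outer product and a copy. A sketch with made-up toy shapes:

```python
import numpy as np

# toy shapes: layer l has d_l = 3 inputs and d_{l+1} = 2 outputs
z_l = np.array([1.0, 0.0, 2.0])       # z^l, shape (3,)
dJ_dzhat = np.array([0.5, -0.5])      # grad of J w.r.t. z_hat^l, shape (2,)

# dJ/dW^l = (z^l)^T (dJ/dz_hat^l): an outer product, shape (3, 2)
dW = np.outer(z_l, dJ_dzhat)
# dJ/db^l is simply dJ/dz_hat^l
db = dJ_dzhat.copy()
```

Each entry `dW[i, j]` is $z_i^l \cdot \partial J/\partial\hat z_j^l$, matching the scalar formula above.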
So it suffices to compute $\nabla J_{\hat{\mathbf{z}}^l}$, which we do by building a recurrence. From the computation graph:
$$
\nabla J_{\hat{z}_{j}^l}=\left(\sum_{k=1}^{d_{l+2}}\frac{\partial J}{\partial \hat{z}_{k}^{l+1}}\frac{\partial \hat{z}_{k}^{l+1}}{\partial x_j^{l+1}}\right)\frac{\partial x_j^{l+1}}{\partial \hat{z}_{j}^l}=\left(\sum_{k=1}^{d_{l+2}}\frac{\partial J}{\partial \hat{z}_{k}^{l+1}}w_{jk}^{l+1}\right)\left.\dfrac{\mathrm{d}\,\text{ReLU}}{\mathrm{d}x}\right|_{x=\hat{z}_{j}^{l}}
$$
In matrix form (the product with $\text{ReLU}'$ is taken elementwise):
$$
\nabla J_{\hat{\mathbf{z}}^l}=\left(\nabla J_{\hat{\mathbf{z}}^{l+1}}(\mathbf{W}^{l+1})^{T}\right)\odot\text{ReLU}'(\hat{\mathbf{z}}^l)
$$
Therefore, once we have $\nabla J_{\hat{\mathbf{z}}^{n-1}}$, the computation is complete. We now compute $\nabla J_{\hat{\mathbf{z}}^{n-1}}$.

Since $\mathbf{y}$ is a one-hot vector, if the true class is $i$, then $J(\mathbf{y},\hat{\mathbf{y}})=-y_i\log\hat{y}_i=-\log\hat{y}_i$.

Drawing the computation graph, we get:
$$
\nabla J_{\hat{z}_j^{n-1}}=\frac{\partial J}{\partial \hat{z}_j^{n-1}}=\sum_{k=1}^{d_n}\frac{\partial J}{\partial\hat{y}_k}\frac{\partial\hat{y}_k}{\partial \hat{z}_j^{n-1}}=\frac{\partial J}{\partial\hat{y}_i}\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}=-\frac{1}{\hat{y}_i}\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}
$$
Since $\hat{y}_i=\dfrac{\exp(\hat{z}_i^{n-1})}{\sum_{k=1}^{d_n}\exp(\hat{z}_k^{n-1})}$, we split into two cases.

When $i\neq j$:
$$
\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}=-\frac{\exp(\hat{z}_i^{n-1})}{\left(\sum_{k=1}^{d_n}\exp(\hat{z}_k^{n-1})\right)^2}\exp(\hat{z}_j^{n-1})=-\hat{y}_i\hat{y}_j
$$
When $i=j$:
$$
\frac{\partial\hat{y}_i}{\partial \hat{z}_j^{n-1}}=\frac{\sum\limits_{k=1,k\neq i}^{d_n}\exp(\hat{z}_k^{n-1})}{\left(\sum\limits_{k=1}^{d_n}\exp(\hat{z}_k^{n-1})\right)^2}\exp(\hat{z}_i^{n-1})=(1-\hat{y}_i)\hat{y}_i
$$
Substituting back into $\nabla J_{\hat{z}_j^{n-1}}$, we get:
$$
\nabla J_{\hat{z}_j^{n-1}}=
\begin{cases}
\hat{y}_j, & i\neq j\\
\hat{y}_j-1, & i=j
\end{cases}
$$
In vector form this is simply $\nabla J_{\hat{\mathbf{z}}^{n-1}}=\hat{\mathbf{y}}-\mathbf{y}$. This completes the derivation.
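The final gradient, $\hat{\mathbf{y}}-\mathbf{y}$ componentwise, is easy to verify numerically against a finite-difference gradient. A quick check (all names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # cross-entropy of softmax(z) against one-hot y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -1.2, 0.8])
y = np.array([0.0, 1.0, 0.0])   # true class in the second position

analytic = softmax(z) - y       # the derived gradient: y_hat - y

# central finite differences for comparison
eps = 1e-6
numeric = np.array([
    (loss(z + eps * e, y) - loss(z - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
```

The two gradients should agree to several decimal places, confirming the case analysis above.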
