# The affine/linear transformation in detail, and gradient derivation for backpropagation through a fully connected layer

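## Abstract

The affine layer, also called the linear (transformation) layer, is commonly used as the fully connected layer in neural network architectures.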

## Related

Implementing the affine/linear transformation function and fully connected layer backpropagation in Python and in PyTorch, side by side

## 1. One definition of affine

$$
\begin{gathered}
x = (x_1,x_2,x_3,\cdots,x_k)\\[1ex]
w = (w_1,w_2,w_3,\cdots,w_k)\\[1ex]
\operatorname{affine}(x_i,w_i,b) = x_iw_i+b
\end{gathered}
$$

$$
\begin{gathered}
a^T=\operatorname{affine}(X,w,b) = Xw^T + b\\[1ex]
a^T=
\begin{pmatrix}
x_{11}&x_{12}&x_{13}&\cdots&x_{1k}\\
x_{21}&x_{22}&x_{23}&\cdots&x_{2k}\\
x_{31}&x_{32}&x_{33}&\cdots&x_{3k}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
x_{m1}&x_{m2}&x_{m3}&\cdots&x_{mk}
\end{pmatrix}
\begin{pmatrix}
w_1\\ w_2\\ w_3\\ \vdots\\ w_k
\end{pmatrix}
+b\\[1ex]
a= (a_1,a_2,a_3,\cdots,a_m)
\end{gathered}
$$

$$
\begin{gathered}
W_{n\times k} =
\begin{pmatrix}
w_{11}&w_{12}&w_{13}&\cdots&w_{1k}\\
w_{21}&w_{22}&w_{23}&\cdots&w_{2k}\\
w_{31}&w_{32}&w_{33}&\cdots&w_{3k}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
w_{n1}&w_{n2}&w_{n3}&\cdots&w_{nk}
\end{pmatrix}\\[1ex]
b_{1 \times n} = (b_1,b_2,b_3,\cdots,b_n)\\[1ex]
A_{m\times n} = \operatorname{affine}(X,W,b) = X_{m\times k}W^T_{n\times k} + b_{1 \times n}
\end{gathered}
$$

$$
a_{ij} =\sum_{t=1}^{k} x_{it} \cdot w_{jt} + b_j
$$

$$
a_{23} =\sum_{t=1}^{k} x_{2t} \cdot w_{3t} + b_3= x_{21}w_{31}+x_{22}w_{32}+x_{23}w_{33}+\cdots+x_{2k}w_{3k}+ b_3
$$
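The element-wise formula for a_ij translates directly into code. The following is a minimal plain-Python sketch, not the implementation from the related article; the function name `affine` and the sample values of `X`, `W` and `b` are made up for illustration. With NumPy arrays the same computation would be `X @ W.T + b`.

```python
def affine(X, W, b):
    """Affine map A = X W^T + b for X (m x k), W (n x k), b (length n)."""
    m, k, n = len(X), len(X[0]), len(W)
    # a_ij = sum_t x_it * w_jt + b_j
    return [[sum(X[i][t] * W[j][t] for t in range(k)) + b[j]
             for j in range(n)]
            for i in range(m)]


# Illustrative values only: m = 2 samples, k = 3 inputs, n = 4 outputs
X = [[1, 2, 3],
     [4, 5, 6]]
W = [[1, 0, -1],
     [2, 1, 0],
     [0, 1, 1],
     [1, 1, 1]]          # one row of weights per output
b = [10, 20, 30, 40]

A = affine(X, W, b)
print(A)        # [[8, 24, 35, 46], [8, 33, 41, 55]]
print(A[1][2])  # a_23 in the 1-based notation above: 0*4 + 1*5 + 1*6 + 30 = 41
```

Note that the code uses 0-based indices, so `A[1][2]` is the entry written a_23 in the formulas.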

## 2. Definition of the gradient

$$
\nabla e_{(3)} = \frac{\partial e}{\partial x}i+\frac{\partial e}{\partial y}j+\frac{\partial e}{\partial z}k
$$

$$
\nabla e_{(V)} = \frac{\partial e}{\partial x_1}I_1+\frac{\partial e}{\partial x_2}I_2+\frac{\partial e}{\partial x_3}I_3+\cdots+\frac{\partial e}{\partial x_t}I_t
$$
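As a concrete illustration of this definition, the gradient can be approximated numerically by perturbing one coordinate at a time. The sketch below uses central differences; the helper `numerical_gradient` and the example function `e` are hypothetical and chosen only so the result can be checked by hand.

```python
def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar function f at the point x (a list of floats)."""
    grad = []
    for t in range(len(x)):
        # Perturb only coordinate t, keeping all other coordinates fixed
        x_plus = x[:t] + [x[t] + h] + x[t + 1:]
        x_minus = x[:t] + [x[t] - h] + x[t + 1:]
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def e(v):
    # Hypothetical example function: e(x, y, z) = x^2 + 3y + yz
    x, y, z = v
    return x * x + 3 * y + y * z


g = numerical_gradient(e, [1.0, 2.0, 3.0])
# The analytic gradient is (2x, 3 + z, y), which is (2, 6, 2) at this point;
# g agrees with it up to floating-point rounding.
print(g)
```

Each component of `g` is the coefficient of one basis vector in the formula above.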

## 3. Gradient derivation in backpropagation

$$
\begin{gathered}
A_{m \times n} = X_{m\times k}{W_{n\times k}}^T + b_{1 \times n}\\[1ex]
e=\operatorname{forward}(A)
\end{gathered}
$$

### 3.1 Gradient of the loss e with respect to the matrix A

$$
\frac{de}{dA} =
\begin{pmatrix}
\partial e/ \partial a_{11}&\partial e/ \partial a_{12}&\partial e/ \partial a_{13}&\cdots& \partial e/ \partial a_{1n}\\
\partial e/ \partial a_{21}&\partial e/ \partial a_{22}&\partial e/ \partial a_{23}&\cdots& \partial e/ \partial a_{2n}\\
\partial e/ \partial a_{31}&\partial e/ \partial a_{32}&\partial e/ \partial a_{33}&\cdots& \partial e/ \partial a_{3n}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
\partial e/ \partial a_{m1}&\partial e/ \partial a_{m2}&\partial e/ \partial a_{m3}&\cdots& \partial e/ \partial a_{mn}
\end{pmatrix}
$$

$$
\begin{gathered}
\frac{\partial e}{\partial a_{ij}} = a_{ij}'\\[1ex]
\nabla e_{(A)}= \frac{de}{dA} =
\begin{pmatrix}
a_{11}'& a_{12}'& a_{13}'&\cdots& a_{1n}'\\
a_{21}'& a_{22}'& a_{23}'&\cdots& a_{2n}'\\
a_{31}'& a_{32}'& a_{33}'&\cdots& a_{3n}'\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
a_{m1}'& a_{m2}'& a_{m3}'&\cdots& a_{mn}'
\end{pmatrix}
\end{gathered}
$$

$$
\nabla e_{(A)}= (a_{11}', a_{12}',\cdots, a_{21}', a_{22}',\cdots,a_{m1}', a_{m2}',\cdots, a_{mn}')
$$
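The gradient with respect to A therefore has the same m × n layout as A itself, and flattening it row by row gives the vector form. The sketch below assumes a specific `forward`, half the sum of squared errors against a made-up target matrix `Y`; the text leaves `forward` unspecified, so this choice is only an example.

```python
def loss(A, Y):
    # Hypothetical forward(A): e = 0.5 * sum over i, j of (a_ij - y_ij)^2
    return 0.5 * sum((a - y) ** 2
                     for row_a, row_y in zip(A, Y)
                     for a, y in zip(row_a, row_y))


def loss_grad(A, Y):
    # For this loss, de/da_ij = a_ij - y_ij, stored with the same shape as A
    return [[a - y for a, y in zip(row_a, row_y)]
            for row_a, row_y in zip(A, Y)]


A = [[8.0, 24.0, 35.0],
     [8.0, 33.0, 41.0]]
Y = [[7.0, 25.0, 35.0],
     [10.0, 30.0, 41.0]]   # made-up targets

dA = loss_grad(A, Y)                    # matrix form of the gradient
flat = [g for row in dA for g in row]   # the same values as one long vector
print(dA)    # [[1.0, -1.0, 0.0], [-2.0, 3.0, 0.0]]
print(flat)  # [1.0, -1.0, 0.0, -2.0, 3.0, 0.0]
```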

### 3.2 Gradient of an element of A with respect to X

$$
A_{m \times n} = X_{m\times k}{W_{n\times k}}^T + b_{1 \times n}
$$

$$
\begin{gathered}
W_j=(w_{j1},w_{j2},w_{j3},\cdots,w_{jk})\\[1ex]
XW_j^T + b_j=
\begin{pmatrix}
a_{1j}\\ a_{2j}\\ a_{3j}\\ \vdots\\ a_{mj}
\end{pmatrix}
=A_{:,j}
\end{gathered}
$$

$$
\frac{d a_{ij}}{dX} =
\begin{pmatrix}
\partial a_{ij}/ \partial x_{11}&\partial a_{ij}/ \partial x_{12}&\partial a_{ij}/ \partial x_{13}&\cdots& \partial a_{ij}/ \partial x_{1k}\\
\partial a_{ij}/ \partial x_{21}&\partial a_{ij}/ \partial x_{22}&\partial a_{ij}/ \partial x_{23}&\cdots& \partial a_{ij}/ \partial x_{2k}\\
\partial a_{ij}/ \partial x_{31}&\partial a_{ij}/ \partial x_{32}&\partial a_{ij}/ \partial x_{33}&\cdots& \partial a_{ij}/ \partial x_{3k}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
\partial a_{ij}/ \partial x_{m1}&\partial a_{ij}/ \partial x_{m2}&\partial a_{ij}/ \partial x_{m3}&\cdots& \partial a_{ij}/ \partial x_{mk}
\end{pmatrix}
$$

$$
\begin{gathered}
\frac{\partial a_{ij}}{\partial x_{pq}} = x_{ij|pq}'\\[1ex]
\nabla {a_{ij}}_{(X)}=\frac{d a_{ij}}{dX} =
\begin{pmatrix}
x_{ij|11}'&x_{ij|12}'&x_{ij|13}'&\cdots&x_{ij|1k}'\\
x_{ij|21}'&x_{ij|22}'&x_{ij|23}'&\cdots&x_{ij|2k}'\\
x_{ij|31}'&x_{ij|32}'&x_{ij|33}'&\cdots&x_{ij|3k}'\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
x_{ij|m1}'&x_{ij|m2}'&x_{ij|m3}'&\cdots&x_{ij|mk}'
\end{pmatrix}
\end{gathered}
$$

### 3.3 Backpropagation with respect to X

$$
\begin{gathered}
a_{ij}= \sum_{t=1}^{k} x_{it}\cdot w_{jt} +b_j\\[1ex]
a_{ij}= x_{i1}w_{j1} +x_{i2}w_{j2} +\cdots+x_{iq}w_{jq} +\cdots+x_{ik}w_{jk} +b_j\\[1ex]
x_{ij|pq}'=\frac{\partial a_{ij}}{\partial x_{pq}} =
\begin{cases}
w_{jq}, & p = i\\
0, & p \neq i
\end{cases}
\end{gathered}
$$
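The case distinction says that a_ij depends only on row i of X, so the matrix of partial derivatives of a_ij with respect to X has row j of W in its row i and zeros everywhere else. A small sketch, using a hypothetical helper `d_a_d_X` and made-up weights:

```python
def d_a_d_X(i, j, W, m):
    """Partial derivatives of a_ij with respect to every x_pq (0-based indices).

    Returns an m x k matrix whose entry (p, q) is w_jq when p == i and 0 otherwise.
    """
    k = len(W[0])
    return [[W[j][q] if p == i else 0 for q in range(k)]
            for p in range(m)]


W = [[1, 0, -1],
     [2, 1, 0],
     [0, 1, 1],
     [1, 1, 1]]   # n = 4 rows, k = 3 columns (illustrative values)

# Derivative of a_23 (1-based) with respect to a 2 x 3 input X
J = d_a_d_X(i=1, j=2, W=W, m=2)
print(J)  # [[0, 0, 0], [0, 1, 1]]
```

Only the second row is non-zero, and it equals the third row of `W`.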

$$
\frac {\partial e}{\partial x_{pq}} =\sum_{i=1}^{m}\sum_{j=1}^{n} \frac {\partial e}{\partial a_{ij}}\frac {\partial a_{ij}}{\partial x_{pq}} =\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}' x_{ij|pq}'
$$

$$
\begin{gathered}
\frac {\partial e}{\partial x_{pq}}=\sum_{j=1}^{n} a_{pj}'w_{jq}\\[1ex]
\frac {d e}{d X}=
\begin{pmatrix}
\sum_{j=1}^{n} a_{1j}'w_{j1}&\sum_{j=1}^{n} a_{1j}'w_{j2}&\sum_{j=1}^{n} a_{1j}'w_{j3}&\cdots&\sum_{j=1}^{n} a_{1j}'w_{jk}\\[1ex]
\sum_{j=1}^{n} a_{2j}'w_{j1}&\sum_{j=1}^{n} a_{2j}'w_{j2}&\sum_{j=1}^{n} a_{2j}'w_{j3}&\cdots&\sum_{j=1}^{n} a_{2j}'w_{jk}\\[1ex]
\sum_{j=1}^{n} a_{3j}'w_{j1}&\sum_{j=1}^{n} a_{3j}'w_{j2}&\sum_{j=1}^{n} a_{3j}'w_{j3}&\cdots&\sum_{j=1}^{n} a_{3j}'w_{jk}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
\sum_{j=1}^{n} a_{mj}'w_{j1}&\sum_{j=1}^{n} a_{mj}'w_{j2}&\sum_{j=1}^{n} a_{mj}'w_{j3}&\cdots&\sum_{j=1}^{n} a_{mj}'w_{jk}
\end{pmatrix}
\end{gathered}
$$

$$
\frac {d e}{d X}=
\begin{pmatrix}
a_{11}'& a_{12}'& a_{13}'&\cdots& a_{1n}'\\
a_{21}'& a_{22}'& a_{23}'&\cdots& a_{2n}'\\
a_{31}'& a_{32}'& a_{33}'&\cdots& a_{3n}'\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
a_{m1}'& a_{m2}'& a_{m3}'&\cdots& a_{mn}'
\end{pmatrix}
\begin{pmatrix}
w_{11}&w_{12}&w_{13}&\cdots&w_{1k}\\
w_{21}&w_{22}&w_{23}&\cdots&w_{2k}\\
w_{31}&w_{32}&w_{33}&\cdots&w_{3k}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
w_{n1}&w_{n2}&w_{n3}&\cdots&w_{nk}
\end{pmatrix}
$$

$$
\frac {d e}{d X} =\nabla e_{(A)}W
$$
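In code, this result is a single matrix product of the upstream gradient (m × n) with W (n × k), giving a gradient with the same m × k shape as X. A minimal plain-Python sketch; the helper names and the values of `dA` and `W` are assumptions made for illustration:

```python
def matmul(P, Q):
    """Plain-Python matrix product of P (r x s) and Q (s x c)."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def affine_backward_X(dA, W):
    """de/dX = dA W for the layer A = X W^T + b, with dA (m x n) and W (n x k)."""
    return matmul(dA, W)


dA = [[1, -1, 0, 2],
      [0, 3, 1, -2]]      # made-up upstream gradient de/dA, m = 2, n = 4
W = [[1, 0, -1],
     [2, 1, 0],
     [0, 1, 1],
     [1, 1, 1]]           # n = 4, k = 3

dX = affine_backward_X(dA, W)
print(dX)  # [[1, 1, 1], [4, 2, -1]]
```

Entry (p, q) of `dX` is the sum over j of `dA[p][j] * W[j][q]`, which is the element formula derived above.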

### 3.4 Backpropagation with respect to W

$$
\begin{gathered}
a_{ij}= \sum_{t=1}^{k} x_{it}\cdot w_{jt} +b_j\\[1ex]
a_{ij}= x_{i1}w_{j1} +x_{i2}w_{j2} +\cdots+x_{iq}w_{jq} +\cdots+x_{ik}w_{jk} +b_j\\[1ex]
w_{ij|pq}'=\frac{\partial a_{ij}}{\partial w_{pq}} =
\begin{cases}
x_{iq}, & p = j \\
0, & p \neq j
\end{cases}\\[1ex]
\frac {\partial e}{\partial w_{pq}} = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac {\partial e}{\partial a_{ij}}\frac {\partial a_{ij}}{\partial w_{pq}} =\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}' w_{ij|pq}'\\[1ex]
\frac {\partial e}{\partial w_{pq}}=\sum_{i=1}^{m} a_{ip}'x_{iq}\\[1ex]
\frac {d e}{d W}=
\begin{pmatrix}
\sum_{i=1}^{m} a_{i1}'x_{i1}&\sum_{i=1}^{m} a_{i1}'x_{i2}&\sum_{i=1}^{m} a_{i1}'x_{i3}&\cdots&\sum_{i=1}^{m} a_{i1}'x_{ik}\\[1ex]
\sum_{i=1}^{m} a_{i2}'x_{i1}&\sum_{i=1}^{m} a_{i2}'x_{i2}&\sum_{i=1}^{m} a_{i2}'x_{i3}&\cdots&\sum_{i=1}^{m} a_{i2}'x_{ik}\\[1ex]
\sum_{i=1}^{m} a_{i3}'x_{i1}&\sum_{i=1}^{m} a_{i3}'x_{i2}&\sum_{i=1}^{m} a_{i3}'x_{i3}&\cdots&\sum_{i=1}^{m} a_{i3}'x_{ik}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
\sum_{i=1}^{m} a_{in}'x_{i1}&\sum_{i=1}^{m} a_{in}'x_{i2}&\sum_{i=1}^{m} a_{in}'x_{i3}&\cdots&\sum_{i=1}^{m} a_{in}'x_{ik}
\end{pmatrix}
\end{gathered}
$$

$$
\frac {d e}{d W}=
\begin{pmatrix}
a_{11}'& a_{21}'& a_{31}'&\cdots& a_{m1}'\\
a_{12}'& a_{22}'& a_{32}'&\cdots& a_{m2}'\\
a_{13}'& a_{23}'& a_{33}'&\cdots& a_{m3}'\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
a_{1n}'& a_{2n}'& a_{3n}'&\cdots& a_{mn}'
\end{pmatrix}
\begin{pmatrix}
x_{11}&x_{12}&x_{13}&\cdots&x_{1k}\\
x_{21}&x_{22}&x_{23}&\cdots&x_{2k}\\
x_{31}&x_{32}&x_{33}&\cdots&x_{3k}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
x_{m1}&x_{m2}&x_{m3}&\cdots&x_{mk}
\end{pmatrix}
$$

$$
\frac {d e}{d W} =\nabla e_{(A)}^TX
$$
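The weight gradient is thus the transposed upstream gradient (n × m) times X (m × k), which has the same n × k shape as W. The following sketch uses hypothetical helper names and made-up values for `dA` and `X`:

```python
def transpose(M):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    """Plain-Python matrix product of P (r x s) and Q (s x c)."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def affine_backward_W(dA, X):
    """de/dW = dA^T X for the layer A = X W^T + b, with dA (m x n) and X (m x k)."""
    return matmul(transpose(dA), X)


dA = [[1, -1, 0, 2],
      [0, 3, 1, -2]]      # made-up upstream gradient, m = 2, n = 4
X = [[1, 2, 3],
     [4, 5, 6]]           # m = 2, k = 3

dW = affine_backward_W(dA, X)
print(dW)  # [[1, 2, 3], [11, 13, 15], [4, 5, 6], [-6, -6, -6]]
```

Each entry `dW[p][q]` sums `dA[i][p] * X[i][q]` over the m samples, so the contributions of all rows of the batch accumulate into one gradient per weight.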

### 3.5 Gradient of e with respect to b

$$
\begin{gathered}
a_{ij}= \sum_{t=1}^{k} x_{it}\cdot w_{jt} +b_j\\[1ex]
b_{ij|q}'=\frac{\partial a_{ij}}{\partial b_{q}} =
\begin{cases}
1, & q = j\\
0, & q \neq j
\end{cases}\\[1ex]
\frac {\partial e}{\partial b_{q}} = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac {\partial e}{\partial a_{ij}}\frac {\partial a_{ij}}{\partial b_{q}} =\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}' b_{ij|q}'\\[1ex]
\frac {\partial e}{\partial b_{q}} = \sum_{i=1}^{m} a_{iq}'\cdot 1 \\[1ex]
\frac {d e}{d b} = \left(\sum_{i=1}^{m} a_{i1}',\sum_{i=1}^{m} a_{i2}',\sum_{i=1}^{m} a_{i3}', \cdots ,\sum_{i=1}^{m} a_{in}'\right)
\end{gathered}
$$

$$
\frac {de}{db}=\operatorname{sum}(\nabla e_{(A)},\; axis=0)
$$
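Summing over axis 0 means adding up each column of the upstream gradient, one sum per output unit. A plain-Python sketch with a made-up `dA`; with a NumPy array the equivalent call would be `dA.sum(axis=0)`:

```python
def affine_backward_b(dA):
    """de/db: column sums of the upstream gradient dA (m x n), giving a length-n list."""
    return [sum(col) for col in zip(*dA)]


dA = [[1, -1, 0, 2],
      [0, 3, 1, -2]]      # made-up upstream gradient, m = 2, n = 4

db = affine_backward_b(dA)
print(db)  # [1, 2, 1, 0]
```

The result has n entries, matching the shape of b.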

## 4. Another definition of affine

$$
\begin{gathered}
A_{m\times n} = \operatorname{affine}(X,W,b) = X_{m\times k}W_{k\times n} + b_{1 \times n}\\[1ex]
a_{ij}= \sum_{t=1}^{k} x_{it}\cdot w_{tj} +b_j
\end{gathered}
$$
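Under this second convention W is stored as k × n, one column per output, so no transpose appears in the forward pass. A minimal sketch; the name `affine_kn` and the sample values are made up for illustration:

```python
def affine_kn(X, W, b):
    """Affine map A = X W + b for X (m x k), W (k x n), b (length n)."""
    m, k, n = len(X), len(W), len(W[0])
    # a_ij = sum_t x_it * w_tj + b_j
    return [[sum(X[i][t] * W[t][j] for t in range(k)) + b[j]
             for j in range(n)]
            for i in range(m)]


X = [[1, 2, 3],
     [4, 5, 6]]            # m = 2, k = 3
W = [[1, 2, 0, 1],
     [0, 1, 1, 1],
     [-1, 0, 1, 1]]        # k = 3 rows, n = 4 columns
b = [10, 20, 30, 40]

A = affine_kn(X, W, b)
print(A)  # [[8, 24, 35, 46], [8, 33, 41, 55]]
```

The two definitions describe the same layer: a k × n matrix here plays the role of the transpose of the n × k matrix in Section 1.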

### 4.1 Backpropagation with respect to X

$$
\begin{gathered}
a_{ij}= x_{i1}w_{1j} +x_{i2}w_{2j} +\cdots+x_{iq}w_{qj} +\cdots+x_{ik}w_{kj} +b_j\\[1ex]
x_{ij|pq}'=\frac{\partial a_{ij}}{\partial x_{pq}} =
\begin{cases}
w_{qj}, & p = i\\
0, & p \neq i
\end{cases}
\end{gathered}
$$

$$
\frac {\partial e}{\partial x_{pq}} =\sum_{i=1}^{m}\sum_{j=1}^{n} \frac {\partial e}{\partial a_{ij}}\frac {\partial a_{ij}}{\partial x_{pq}} =\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}' x_{ij|pq}'
$$

$$
\begin{gathered}
\frac {\partial e}{\partial x_{pq}}=\sum_{j=1}^{n} a_{pj}'w_{qj}\\[1ex]
\frac {d e}{d X}=
\begin{pmatrix}
\sum_{j=1}^{n} a_{1j}'w_{1j}&\sum_{j=1}^{n} a_{1j}'w_{2j}&\sum_{j=1}^{n} a_{1j}'w_{3j}&\cdots&\sum_{j=1}^{n} a_{1j}'w_{kj}\\[1ex]
\sum_{j=1}^{n} a_{2j}'w_{1j}&\sum_{j=1}^{n} a_{2j}'w_{2j}&\sum_{j=1}^{n} a_{2j}'w_{3j}&\cdots&\sum_{j=1}^{n} a_{2j}'w_{kj}\\[1ex]
\sum_{j=1}^{n} a_{3j}'w_{1j}&\sum_{j=1}^{n} a_{3j}'w_{2j}&\sum_{j=1}^{n} a_{3j}'w_{3j}&\cdots&\sum_{j=1}^{n} a_{3j}'w_{kj}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
\sum_{j=1}^{n} a_{mj}'w_{1j}&\sum_{j=1}^{n} a_{mj}'w_{2j}&\sum_{j=1}^{n} a_{mj}'w_{3j}&\cdots&\sum_{j=1}^{n} a_{mj}'w_{kj}
\end{pmatrix}
\end{gathered}
$$

$$
\frac {d e}{d X}=
\begin{pmatrix}
a_{11}'& a_{12}'& a_{13}'&\cdots& a_{1n}'\\
a_{21}'& a_{22}'& a_{23}'&\cdots& a_{2n}'\\
a_{31}'& a_{32}'& a_{33}'&\cdots& a_{3n}'\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
a_{m1}'& a_{m2}'& a_{m3}'&\cdots& a_{mn}'
\end{pmatrix}
\begin{pmatrix}
w_{11}&w_{21}&w_{31}&\cdots&w_{k1}\\
w_{12}&w_{22}&w_{32}&\cdots&w_{k2}\\
w_{13}&w_{23}&w_{33}&\cdots&w_{k3}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
w_{1n}&w_{2n}&w_{3n}&\cdots&w_{kn}
\end{pmatrix}
$$

$$
\frac {d e}{d X} =\nabla e_{(A)}W^T
$$
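With W stored as k × n, the input gradient needs the transpose of W so that the shapes (m × n)(n × k) line up. A plain-Python sketch with hypothetical helper names and made-up values:

```python
def transpose(M):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    """Plain-Python matrix product of P (r x s) and Q (s x c)."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def affine_kn_backward_X(dA, W):
    """de/dX = dA W^T for the layer A = X W + b, with dA (m x n) and W (k x n)."""
    return matmul(dA, transpose(W))


dA = [[1, -1, 0, 2],
      [0, 3, 1, -2]]       # made-up upstream gradient, m = 2, n = 4
W = [[1, 2, 0, 1],
     [0, 1, 1, 1],
     [-1, 0, 1, 1]]        # k = 3, n = 4

dX = affine_kn_backward_X(dA, W)
print(dX)  # [[1, 1, 1], [4, 2, -1]]
```

Entry (p, q) of `dX` is the sum over j of `dA[p][j] * W[q][j]`, matching the element formula for this convention.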

### 4.2 Backpropagation with respect to W

$$
\begin{gathered}
a_{ij}= x_{i1}w_{1j} +x_{i2}w_{2j} +\cdots+x_{ip}w_{pj} +\cdots+x_{ik}w_{kj} +b_j\\[1ex]
w_{ij|pq}'=\frac{\partial a_{ij}}{\partial w_{pq}} =
\begin{cases}
x_{ip}, & q = j \\
0, & q \neq j
\end{cases}\\[1ex]
\frac {\partial e}{\partial w_{pq}} = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac {\partial e}{\partial a_{ij}}\frac {\partial a_{ij}}{\partial w_{pq}} =\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}' w_{ij|pq}'
\end{gathered}
$$

$$
\begin{gathered}
\frac {\partial e}{\partial w_{pq}}=\sum_{i=1}^{m} a_{iq}'x_{ip}\\[1ex]
\frac {d e}{d W}=
\begin{pmatrix}
\sum_{i=1}^{m} a_{i1}'x_{i1}&\sum_{i=1}^{m} a_{i2}'x_{i1}&\sum_{i=1}^{m} a_{i3}'x_{i1}&\cdots&\sum_{i=1}^{m} a_{in}'x_{i1}\\[1ex]
\sum_{i=1}^{m} a_{i1}'x_{i2}&\sum_{i=1}^{m} a_{i2}'x_{i2}&\sum_{i=1}^{m} a_{i3}'x_{i2}&\cdots&\sum_{i=1}^{m} a_{in}'x_{i2}\\[1ex]
\sum_{i=1}^{m} a_{i1}'x_{i3}&\sum_{i=1}^{m} a_{i2}'x_{i3}&\sum_{i=1}^{m} a_{i3}'x_{i3}&\cdots&\sum_{i=1}^{m} a_{in}'x_{i3}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
\sum_{i=1}^{m} a_{i1}'x_{ik}&\sum_{i=1}^{m} a_{i2}'x_{ik}&\sum_{i=1}^{m} a_{i3}'x_{ik}&\cdots&\sum_{i=1}^{m} a_{in}'x_{ik}
\end{pmatrix}
\end{gathered}
$$
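The element formula for the weight gradient under this convention can be coded directly as a sum over the m samples. Entry (p, q) of the result is the dot product of column p of X with column q of the upstream gradient, so the whole matrix equals the transpose of X times the upstream gradient. The sketch below uses a hypothetical function name and made-up values:

```python
def affine_kn_backward_W(dA, X):
    """de/dW for the layer A = X W + b, with dA (m x n) and X (m x k).

    Implements de/dw_pq = sum_i dA[i][q] * X[i][p]; the result is k x n, like W.
    """
    m, k, n = len(X), len(X[0]), len(dA[0])
    return [[sum(dA[i][q] * X[i][p] for i in range(m))
             for q in range(n)]
            for p in range(k)]


dA = [[1, -1, 0, 2],
      [0, 3, 1, -2]]       # made-up upstream gradient, m = 2, n = 4
X = [[1, 2, 3],
     [4, 5, 6]]            # m = 2, k = 3

dW = affine_kn_backward_W(dA, X)
print(dW)  # [[1, 11, 4, -6], [2, 13, 5, -6], [3, 15, 6, -6]]
```

The result is k × n, the same shape as W in this definition, and is the transpose of the n × k weight gradient obtained under the first definition for the same `dA` and `X`.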