Backpropagation Gradient Derivation for a Two-Layer Fully Connected Network (Matrix Form, Sigmoid, Mean Squared Error)

Solving for Derivatives

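The body of this post is in English (switching between Chinese and English input methods while writing text and typing formulas was too painful -_-), but the wording is very simple. I hope you can read it through, and I hope even more that it helps you!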

$$
\begin{aligned}
h &= XW_1 + b_1 \\
h_{sigmoid} &= sigmoid(h) \\
Y_{pred} &= h_{sigmoid}W_2 + b_2 \\
f &= ||Y-Y_{pred}||^2_F
\end{aligned} \tag{1}
$$
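The forward pass of Eq.(1) can be sketched in NumPy as follows. This is only an illustration: the sizes `N`, `d`, `hidden`, `out`, the random data, and the choice to store each bias as a row vector that is broadcast over the rows are assumptions, not something the derivation fixes.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass of Eq.(1); returns h, h_sigmoid, Y_pred."""
    h = X @ W1 + b1                # (N, hidden); b1 is broadcast over the rows
    h_sigmoid = sigmoid(h)         # (N, hidden)
    Y_pred = h_sigmoid @ W2 + b2   # (N, out); b2 is broadcast over the rows
    return h, h_sigmoid, Y_pred

def loss(Y, Y_pred):
    """Squared Frobenius norm of the residual, f in Eq.(1)."""
    return np.sum((Y - Y_pred) ** 2)

# Hypothetical sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
N, d, hidden, out = 5, 4, 3, 2
X = rng.normal(size=(N, d))
Y = rng.normal(size=(N, out))
W1 = rng.normal(size=(d, hidden))
b1 = np.zeros((1, hidden))
W2 = rng.normal(size=(hidden, out))
b2 = np.zeros((1, out))

h, h_sigmoid, Y_pred = forward(X, W1, b1, W2, b2)
f = loss(Y, Y_pred)
print(h.shape, Y_pred.shape, f)
```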

Solve for the derivatives of the following variables.
$$
\frac{\partial f}{\partial W_2} \,\,\, \frac{\partial f}{\partial b_2} \,\,\, \frac{\partial f}{\partial W_1} \,\,\, \frac{\partial f}{\partial b_1} \tag{2}
$$

Derivation of the gradients

$$
f = ||Y-Y_{pred}||^2_F = tr((Y-Y_{pred})^T(Y-Y_{pred})) \tag{3}
$$

$$
\begin{aligned}
\mathrm{d}f &= \mathrm{d}\left\{ tr[(Y-Y_{pred})^T(Y-Y_{pred})] \right \} \\
&= tr \left \{ \mathrm{d} [(Y-Y_{pred})^T(Y-Y_{pred})] \right \} \\
&= tr\left \{ [\mathrm{d} (Y-Y_{pred})^T](Y-Y_{pred})+(Y-Y_{pred})^T \mathrm{d} (Y-Y_{pred})\right \} \\
&= tr[-(\mathrm{d} Y_{pred}^T)(Y-Y_{pred}) - (Y-Y_{pred})^T \mathrm{d} Y_{pred}] \\
&= 2tr[(Y_{pred}-Y)^T \mathrm{d} Y_{pred}]
\end{aligned} \tag{4}
$$

In the last step, the two terms inside the trace are transposes of each other. Since $tr(A^T) = tr(A)$, we have $tr[(\mathrm{d}Y_{pred}^T)(Y-Y_{pred})] = tr[(Y-Y_{pred})^T \mathrm{d}Y_{pred}]$, so the two terms contribute equally. $Y$ is the fixed target, so $\mathrm{d}Y = 0$.

According to the relationship between gradient and differential (see the section "The relationship between matrix differentiation and derivatives", Eq.(25)), we obtain $\frac{\partial f}{\partial Y_{pred}}$ as follows.

$$
\frac{\partial f}{\partial Y_{pred}} = 2(Y_{pred} - Y) \tag{5}
$$
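As a quick sanity check of Eqs.(3) to (5), the sketch below compares the squared Frobenius norm with its trace form, and compares the actual change in $f$ under a small perturbation with the first-order prediction $tr[(2(Y_{pred}-Y))^T \mathrm{d}Y_{pred}]$. The shapes and the random data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 3))               # hypothetical targets
Y_pred = rng.normal(size=(4, 3))          # hypothetical predictions
dY_pred = 1e-6 * rng.normal(size=(4, 3))  # small perturbation of Y_pred

def f(P):
    """Loss as a function of the prediction matrix P."""
    return np.linalg.norm(Y - P, "fro") ** 2

R = Y - Y_pred
trace_form = np.trace(R.T @ R)            # right-hand side of Eq.(3)

grad = 2.0 * (Y_pred - Y)                 # Eq.(5)
df_actual = f(Y_pred + dY_pred) - f(Y_pred)
df_pred = np.trace(grad.T @ dY_pred)      # last line of Eq.(4)

print(f(Y_pred), trace_form)
print(df_actual, df_pred)
```

The two differentials agree up to a second-order term of size $||\mathrm{d}Y_{pred}||_F^2$.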

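$$
\frac{\partial f}{\partial b_2} = \boldsymbol{1}^T \frac{\partial f}{\partial Y_{pred}} = 2 \cdot \boldsymbol{1}^T (Y_{pred} - Y) \tag{6}
$$

where $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$ and $N$ is the number of rows (samples) of $X$. This assumes $b_2$ is a row vector added to every row of $h_{sigmoid}W_2$, so that $\mathrm{d}Y_{pred} = \boldsymbol{1}\,\mathrm{d}b_2$ when only $b_2$ varies, and $\mathrm{d}f = tr(\frac{\partial f}{\partial Y_{pred}}^T \boldsymbol{1}\,\mathrm{d}b_2) = tr((\boldsymbol{1}^T \frac{\partial f}{\partial Y_{pred}})^T \mathrm{d}b_2)$. In other words, the gradient of $b_2$ is the sum of the rows of $\frac{\partial f}{\partial Y_{pred}}$.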

$$
\begin{aligned}
\mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\
&= tr(\frac{\partial f}{\partial Y_{pred}}^T h_{sigmoid} \mathrm{d}W_2) \\
&= tr((h_{sigmoid}^T\frac{\partial f}{\partial Y_{pred}})^T \mathrm{d}W_2)
\end{aligned} \tag{7}
$$

where the differential $\mathrm{d}Y_{pred}$, with only $W_2$ varying, is derived in Eq.(8).

$$
\begin{aligned}
\mathrm{d}Y_{pred} &= \mathrm{d}(h_{sigmoid}W_2+b_2) \\
&= \mathrm{d}(h_{sigmoid}W_2) \\
&= (\mathrm{d}h_{sigmoid})W_2 + h_{sigmoid}\mathrm{d}W_2 \\
&= h_{sigmoid}\mathrm{d}W_2
\end{aligned} \tag{8}
$$

The last step holds because $h_{sigmoid}$ is not a function of $W_2$, so $\mathrm{d}h_{sigmoid} = 0$ here.

$$
\frac{\partial f}{\partial W_2} = h_{sigmoid}^T \frac{\partial f}{\partial Y_{pred}} = 2 h_{sigmoid}^T (Y_{pred} - Y) \tag{9}
$$

$$
h_{sigmoid} = sigmoid(h) = \frac{1}{1+e^{-h}} \tag{10}
$$

$$
\begin{aligned}
\mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\
&= tr(\frac{\partial f}{\partial Y_{pred}}^T(\mathrm{d}h_{sigmoid})W_2) \\
&= tr(W_2\frac{\partial f}{\partial Y_{pred}}^T\mathrm{d}h_{sigmoid}) \\
&= tr\left \{ \left[\frac{\partial f}{\partial Y_{pred}} W_2^T\right]^T (h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h) \right \} \\
&= tr\left \{ \left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T \mathrm{d}h \right \}
\end{aligned} \tag{11}
$$

where the differential of $sigmoid$ is given in Eq.(12), and the step from the fourth to the fifth line of Eq.(11) uses $tr(A^T(B\circ C)) = tr((A \circ B)^T C)$.

$$
\mathrm{d} h_{sigmoid} = h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h \tag{12}
$$

$$
\begin{aligned}
\frac{\partial f}{\partial b_1} &= \boldsymbol{1}^T \frac{\partial f}{\partial h} \\
&= \boldsymbol{1}^T \left[ \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right] \\
&= 2 \cdot \boldsymbol{1}^T \left[ \left((Y_{pred} - Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right]
\end{aligned} \tag{13}
$$

where $\frac{\partial f}{\partial h}$ is read off from the last line of Eq.(11), and $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$ ($N$ is the number of rows of $X$). This assumes $b_1$ is a row vector added to every row of $XW_1$, so that $\mathrm{d}h = \boldsymbol{1}\,\mathrm{d}b_1$ when only $b_1$ varies, and $\mathrm{d}f = tr(\frac{\partial f}{\partial h}^T \boldsymbol{1}\,\mathrm{d}b_1) = tr((\boldsymbol{1}^T \frac{\partial f}{\partial h})^T \mathrm{d}b_1)$.

$$
\begin{aligned}
\mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\
&= tr(\frac{\partial f}{\partial Y_{pred}}^T(\mathrm{d}h_{sigmoid})W_2) \\
&= tr(W_2\frac{\partial f}{\partial Y_{pred}}^T\mathrm{d}h_{sigmoid}) \\
&= tr\left(\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T \mathrm{d}h\right) \\
&= tr\left(\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T X \mathrm{d}W_1\right) \\
&= tr \left \{ \left [X^T \left( \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right) \right ]^T \mathrm{d}W_1 \right \}
\end{aligned} \tag{14}
$$

$$
\begin{aligned}
\frac{\partial f}{\partial W_1} &= X^T \left[ \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right] \\
&= 2X^T \left[ \left((Y_{pred} - Y)W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right]
\end{aligned} \tag{15}
$$

In summary, the derivative expressions of each variable are as follows.

$$
\begin{aligned}
\frac{\partial f}{\partial W_2} &= 2 h_{sigmoid}^T (Y_{pred} - Y) \\
\frac{\partial f}{\partial b_2} &= 2 \cdot \boldsymbol{1}^T (Y_{pred} - Y) \\
\frac{\partial f}{\partial W_1} &= 2X^T \left[ \left((Y_{pred} - Y)W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right] \\
\frac{\partial f}{\partial b_1} &= 2 \cdot \boldsymbol{1}^T \left[ \left((Y_{pred} - Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right]
\end{aligned} \tag{16}
$$

Here $\boldsymbol{1}$ is an $N \times 1$ column vector of ones ($N$ is the number of rows of $X$), assuming each bias is a row vector added to every row; the matrix products are evaluated before the Hadamard products.
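The four gradients can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes the biases are stored as row vectors of shape `(1, hidden)` and `(1, out)` that are broadcast over the `N` rows (so each bias gradient is a sum over rows), and the helper names `loss`, `grads`, and `numeric_grad` as well as all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params, X, Y):
    """f = ||Y - Y_pred||_F^2 for params = [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    Y_pred = sigmoid(X @ W1 + b1) @ W2 + b2
    return np.sum((Y - Y_pred) ** 2)

def grads(params, X, Y):
    """Analytic gradients, returned in the order [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    hs = sigmoid(X @ W1 + b1)
    Y_pred = hs @ W2 + b2
    G = 2.0 * (Y_pred - Y)                 # df/dY_pred
    Gh = (G @ W2.T) * hs * (1.0 - hs)      # df/dh: matrix product first, then Hadamard
    return [X.T @ Gh,                      # df/dW1
            Gh.sum(axis=0, keepdims=True), # df/db1: sum over the N rows
            hs.T @ G,                      # df/dW2
            G.sum(axis=0, keepdims=True)]  # df/db2: sum over the N rows

def numeric_grad(params, k, X, Y, eps=1e-6):
    """Central-difference gradient of the loss with respect to params[k]."""
    P = params[k]
    num = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        old = P[idx]
        P[idx] = old + eps
        f_plus = loss(params, X, Y)
        P[idx] = old - eps
        f_minus = loss(params, X, Y)
        P[idx] = old                       # restore the entry
        num[idx] = (f_plus - f_minus) / (2 * eps)
    return num

# Hypothetical sizes and random data.
rng = np.random.default_rng(0)
N, d, hidden, out = 5, 4, 3, 2
X = rng.normal(size=(N, d))
Y = rng.normal(size=(N, out))
params = [rng.normal(size=(d, hidden)), rng.normal(size=(1, hidden)),
          rng.normal(size=(hidden, out)), rng.normal(size=(1, out))]

analytic = grads(params, X, Y)
max_errs = [np.max(np.abs(analytic[k] - numeric_grad(params, k, X, Y)))
            for k in range(4)]
print(max_errs)
```

If the analytic formulas are right, every entry of `max_errs` should be tiny compared with the size of the gradients themselves.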

Reference formulas

Basic differential formulas

$$
\begin{aligned}
\mathrm{d}(X \pm Y) &= \mathrm{d}X \pm \mathrm{d} Y \\
\mathrm{d}(XY) &= (\mathrm{d}X) Y + X\mathrm{d}Y \\
\mathrm{d}(X^T) &= (\mathrm{d}X)^T \\
\mathrm{d} tr(X) &= tr(\mathrm{d}X)
\end{aligned} \tag{17}
$$

Element-wise formulas

$$
\begin{aligned}
\mathrm{d}(X \circ Y) &= \mathrm{d}X \circ Y + X \circ \mathrm{d}Y \\
\mathrm{d} \sigma(X) &= \sigma^{\prime}(X) \circ \mathrm{d}X
\end{aligned} \tag{18}
$$

where $\sigma$ is an element-wise function and $\sigma^{\prime}(X)$ is its element-wise derivative; see the example in Eq.(19). Note that $\circ$ denotes element-wise multiplication, i.e., the Hadamard product, which can also be written as $\odot$.

$$
\begin{aligned}
X &= \left [ \begin{matrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{matrix} \right ] \\
\mathrm{d} sin(X) &= \left [ \begin{matrix} cos(X_{11})\mathrm{d}X_{11} & cos(X_{12})\mathrm{d}X_{12} \\ cos(X_{21})\mathrm{d}X_{21} & cos(X_{22})\mathrm{d}X_{22} \end{matrix} \right ] = cos(X) \circ \mathrm{d}X
\end{aligned} \tag{19}
$$
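Both element-wise rules can be checked to first order with small perturbations. The sketch below uses random $2 \times 2$ matrices and perturbations of size about $10^{-6}$; these values are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 2))
Y = rng.normal(size=(2, 2))
dX = 1e-6 * rng.normal(size=(2, 2))  # small perturbation of X
dY = 1e-6 * rng.normal(size=(2, 2))  # small perturbation of Y

# Product rule for the Hadamard product: d(X ∘ Y) = dX ∘ Y + X ∘ dY
prod_actual = (X + dX) * (Y + dY) - X * Y
prod_pred = dX * Y + X * dY
prod_err = np.max(np.abs(prod_actual - prod_pred))

# Element-wise function rule with sin: d sin(X) = cos(X) ∘ dX
sin_actual = np.sin(X + dX) - np.sin(X)
sin_pred = np.cos(X) * dX
sin_err = np.max(np.abs(sin_actual - sin_pred))

print(prod_err, sin_err)
```

Both errors are second order in the perturbation, so they are far smaller than the perturbation itself.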

Properties of the matrix trace

$$
\begin{aligned}
a &= tr(a) \\
tr(A^T) &= tr(A) \\
tr(A \pm B) &= tr(A) \pm tr(B) \\
tr(AB) &= tr(BA) \\
tr(A^TB) &= tr(B^TA) = \sum_{i,j}{A_{ij}B_{ij}} \\
tr(A^T(B \circ C)) &= tr((A \circ B)^T C) = \sum_{i,j}{A_{ij}B_{ij}C_{ij}}
\end{aligned} \tag{20}
$$

where $a$ is a scalar. In the fourth identity of Eq.(20), $A$ and $B^T$ have the same shape. In the fifth identity, $A$ and $B$ have the same shape, which differs from the fourth. In the sixth identity, $A$, $B$, and $C$ all have the same shape.
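The trace identities used in the derivation are easy to confirm numerically. In this sketch the matrices are random and their shapes are arbitrary assumptions; `D` is a hypothetical matrix with the shape of $A^T$, used for the cyclic identity.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 4))
D = rng.normal(size=(4, 3))  # same shape as A^T

# tr(AD) = tr(DA): the products are 3x3 and 4x4, yet the traces match
cyclic_ok = np.isclose(np.trace(A @ D), np.trace(D @ A))

# tr(A^T B) = tr(B^T A) = sum_ij A_ij B_ij
inner_ok = (np.isclose(np.trace(A.T @ B), np.trace(B.T @ A))
            and np.isclose(np.trace(A.T @ B), np.sum(A * B)))

# tr(A^T (B ∘ C)) = tr((A ∘ B)^T C) = sum_ij A_ij B_ij C_ij
hadamard_ok = (np.isclose(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C))
               and np.isclose(np.trace(A.T @ (B * C)), np.sum(A * B * C)))

print(cyclic_ok, inner_ok, hadamard_ok)
```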

Derivative of sigmoid when the input is a matrix

Let $\sigma: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n}$ apply the $sigmoid$ function to each element.

$$
\sigma(X) = \frac{1}{1+exp(-X)} \tag{21}
$$

$$
\begin{aligned}
\mathrm{d} \sigma(X) &= \frac{-1[exp(-X) \circ \mathrm{d}(-X)]}{(1+exp(-X))^2} \\
&= \frac{-1[exp(-X) \circ (\boldsymbol{-1}) \circ \mathrm{d}X]}{(1+exp(-X))^2} \\
&= \frac{\boldsymbol{1} \circ exp(-X) \circ \mathrm{d}X}{(1+exp(-X))^2} \\
&= \frac{\boldsymbol{1}}{1+exp(-X)} \circ \frac{ exp(-X) + \boldsymbol{1} - \boldsymbol{1} }{1+exp(-X)} \circ \mathrm{d}X \\
&= \sigma(X) \circ (\boldsymbol{1} - \sigma(X)) \circ \mathrm{d}X
\end{aligned} \tag{22}
$$

where $1$ and $\boldsymbol{1}$ both denote the all-ones matrix of the same shape as $X$, and the fractions denote element-wise division.
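The closed form $\sigma(X) \circ (\boldsymbol{1} - \sigma(X))$ can be compared both with the intermediate form $exp(-X) / (1+exp(-X))^2$ and with a central-difference estimate. This sketch uses the standard logistic function $1/(1+exp(-X))$; the test matrix and the step `eps` are assumptions for illustration.

```python
import numpy as np

def sigmoid(X):
    """Element-wise logistic function 1 / (1 + exp(-X))."""
    return 1.0 / (1.0 + np.exp(-X))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))

closed_form = sigmoid(X) * (1.0 - sigmoid(X))        # sigma ∘ (1 - sigma)
intermediate = np.exp(-X) / (1.0 + np.exp(-X)) ** 2  # exp(-X) / (1 + exp(-X))^2

eps = 1e-6
central_diff = (sigmoid(X + eps) - sigmoid(X - eps)) / (2 * eps)

algebra_err = np.max(np.abs(closed_form - intermediate))
numeric_err = np.max(np.abs(closed_form - central_diff))
print(algebra_err, numeric_err)
```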

Differentiation and derivatives

Derivative of a scalar with respect to a scalar

$$
\mathrm{d}f = f^{\prime}(x) \mathrm{d}x \tag{23}
$$

Derivative of a scalar with respect to a vector (multivariate differential)

$$
\mathrm{d} f = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \mathrm{d} x_i = \frac{\partial f}{\partial x}^T \mathrm{d} x \tag{24}
$$

As shown in Eq.(24), the total differential $\mathrm{d} f$ is the inner product of the gradient vector $\frac{\partial f}{\partial x}$ ($n \times 1$) and the differential vector $\mathrm{d}x$ ($n \times 1$). The first equality is the total differential formula, and the second expresses the relationship between gradient and differential.

The relationship between matrix differentiation and derivatives

$$
\mathrm{d} f = \sum_{i=1}^{m}{\sum_{j=1}^{n}{\frac{\partial f}{\partial X_{ij}} \mathrm{d} X_{ij}}} = tr \left (\frac{\partial f}{\partial X}^T \mathrm{d}X \right ) \tag{25}
$$

where, the second equal sign refers to the fifth equation of Eq.(20).
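To make the vector and matrix relations concrete, the sketch below uses a hypothetical scalar function, the sum of cubes of all entries, whose gradient is three times the element-wise square. It checks that the inner-product form, the double-sum form, and the trace form all predict the actual change in the function to first order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Vector case, Eq.(24): f(x) = sum_i x_i^3 with gradient 3 x^2
x = rng.normal(size=(4, 1))
dx = 1e-6 * rng.normal(size=(4, 1))
g = 3.0 * x ** 2
df_vec_actual = np.sum((x + dx) ** 3) - np.sum(x ** 3)
df_vec_pred = (g.T @ dx).item()          # gradient^T times differential

# Matrix case, Eq.(25): f(X) = sum_ij X_ij^3 with gradient 3 X^2
X = rng.normal(size=(3, 2))
dX = 1e-6 * rng.normal(size=(3, 2))
G = 3.0 * X ** 2
df_mat_actual = np.sum((X + dX) ** 3) - np.sum(X ** 3)
df_mat_sum = np.sum(G * dX)              # double sum over i, j
df_mat_trace = np.trace(G.T @ dX)        # trace form

print(df_vec_actual, df_vec_pred)
print(df_mat_actual, df_mat_sum, df_mat_trace)
```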
