Solving for Derivatives
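Although the body text is in English (switching between Chinese and English input methods while writing prose and typing formulas was too painful -_-), the wording is kept very simple. I hope you can read it through, and even more that it is useful to you!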
$$
\begin{aligned}
h &= XW_1 + b_1 \\
h_{sigmoid} &= sigmoid(h) \\
Y_{pred} &= h_{sigmoid}W_2 + b_2 \\
f &= ||Y-Y_{pred}||^2_F
\end{aligned} \tag{1}
$$
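The forward pass of Eq.(1) can be sketched in NumPy as follows. The sizes `N`, `d_in`, `hidden`, `out`, the random data, and the function names are assumptions made only for illustration; the sketch also assumes that the rows of `X` are samples and that `b1` and `b2` are row vectors broadcast over those rows.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, Y):
    """Forward pass of Eq.(1); returns the intermediates and the loss f."""
    h = X @ W1 + b1                 # (N, hidden); b1 broadcasts over rows
    h_sigmoid = sigmoid(h)          # (N, hidden)
    Y_pred = h_sigmoid @ W2 + b2    # (N, out); b2 broadcasts over rows
    f = np.sum((Y - Y_pred) ** 2)   # squared Frobenius norm
    return h, h_sigmoid, Y_pred, f

# Hypothetical sizes and random data, chosen only for illustration.
N, d_in, hidden, out = 5, 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_in))
Y = rng.normal(size=(N, out))
W1 = rng.normal(size=(d_in, hidden))
b1 = rng.normal(size=(1, hidden))
W2 = rng.normal(size=(hidden, out))
b2 = rng.normal(size=(1, out))

h, h_sigmoid, Y_pred, f = forward(X, W1, b1, W2, b2, Y)
print(h.shape, h_sigmoid.shape, Y_pred.shape, f)
```

Writing the shapes down early makes it easier to check every gradient later: each gradient must have the same shape as the variable it belongs to.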
Solve for the derivatives of the following variables.
$$
\frac{\partial f}{\partial W_2} \,\,\, \frac{\partial f}{\partial b_2} \,\,\, \frac{\partial f}{\partial W_1} \,\,\, \frac{\partial f}{\partial b_1} \tag{2}
$$
Derivation of the derivatives
$$
f = ||Y-Y_{pred}||^2_F = tr((Y-Y_{pred})^T(Y-Y_{pred})) \tag{3}
$$
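As a quick sanity check of Eq.(3), the squared Frobenius norm, the trace form, and the plain sum of squared entries should agree. The matrices below are random placeholders, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 2))        # hypothetical targets
Y_pred = rng.normal(size=(5, 2))   # hypothetical predictions

R = Y - Y_pred
f_norm = np.linalg.norm(R, "fro") ** 2   # ||Y - Y_pred||_F^2
f_trace = np.trace(R.T @ R)              # tr((Y - Y_pred)^T (Y - Y_pred))
f_sum = np.sum(R ** 2)                   # sum of squared entries
print(f_norm, f_trace, f_sum)
```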
$$
\begin{aligned}
\mathrm{d}f &= \mathrm{d}\left\{ tr[(Y-Y_{pred})^T(Y-Y_{pred})] \right \} \\
&= tr \left \{ \mathrm{d} [(Y-Y_{pred})^T(Y-Y_{pred})] \right \} \\
&= tr\left \{ [\mathrm{d} (Y-Y_{pred})^T](Y-Y_{pred})+(Y-Y_{pred})^T \mathrm{d} (Y-Y_{pred})\right \} \\
&= tr[-(\mathrm{d} Y_{pred}^T)(Y-Y_{pred}) - (Y-Y_{pred})^T \mathrm{d} Y_{pred}] \\
&= 2tr[(Y_{pred}-Y)^T \mathrm{d} Y_{pred}]
\end{aligned} \tag{4}
$$
where $Y$ is a constant, so $\mathrm{d}(Y-Y_{pred}) = -\mathrm{d}Y_{pred}$. The last step uses $tr(A^T) = tr(A)$: the trace of $-(\mathrm{d}Y_{pred}^T)(Y-Y_{pred})$ equals the trace of its transpose, $-(Y-Y_{pred})^T\mathrm{d}Y_{pred}$, so the two terms combine.
According to the relationship between the gradient and the differential (see "The relationship between matrix differentiation and derivatives", Eq.(25)), we obtain $\frac{\partial f}{\partial Y_{pred}}$ as follows.
$$
\frac{\partial f}{\partial Y_{pred}} = 2(Y_{pred} - Y) \tag{5}
$$
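One way to gain confidence in Eq.(5) is to compare it with central finite differences. The sketch below is only a numerical check under assumed random data; `loss` and `numerical_grad` are hypothetical helper names.

```python
import numpy as np

def loss(Y_pred, Y):
    """f = ||Y - Y_pred||_F^2."""
    return np.sum((Y - Y_pred) ** 2)

def numerical_grad(func, A, eps=1e-6):
    """Central finite differences of a scalar function func at the matrix A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        G[idx] = (func(A_plus) - func(A_minus)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
Y = rng.normal(size=(5, 2))        # hypothetical targets
Y_pred = rng.normal(size=(5, 2))   # hypothetical predictions

analytic = 2 * (Y_pred - Y)                              # Eq.(5)
numeric = numerical_grad(lambda P: loss(P, Y), Y_pred)   # finite differences
max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

The two gradients should agree up to floating-point round-off, since $f$ is quadratic in $Y_{pred}$.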
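$$
\frac{\partial f}{\partial b_2} = \boldsymbol{1}^T \frac{\partial f}{\partial Y_{pred}} = 2\,\boldsymbol{1}^T (Y_{pred} - Y) \tag{6}
$$
where $\boldsymbol{1}$ is an all-ones column vector of shape $N \times 1$, with $N$ the number of rows (samples) of $X$. This assumes that $b_2$ is a row vector added to every row of $h_{sigmoid}W_2$, i.e. $Y_{pred} = h_{sigmoid}W_2 + \boldsymbol{1}b_2$, so that $\mathrm{d}Y_{pred} = \boldsymbol{1}\,\mathrm{d}b_2$ and $\mathrm{d}f = tr(\frac{\partial f}{\partial Y_{pred}}^T \boldsymbol{1}\,\mathrm{d}b_2) = tr((\boldsymbol{1}^T\frac{\partial f}{\partial Y_{pred}})^T \mathrm{d}b_2)$.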
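The bias gradient can be checked numerically. The sketch below assumes that $b_2$ is a $1 \times out$ row vector broadcast over the $N$ rows of $h_{sigmoid}W_2$; under that assumption the gradient is the column sum of $2(Y_{pred}-Y)$, i.e. $2\,\boldsymbol{1}^T(Y_{pred}-Y)$ with an $N \times 1$ all-ones vector. All sizes and data are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and random data.
rng = np.random.default_rng(3)
N, d_in, hidden, out = 5, 3, 4, 2
X = rng.normal(size=(N, d_in))
Y = rng.normal(size=(N, out))
W1 = rng.normal(size=(d_in, hidden))
b1 = rng.normal(size=(1, hidden))
W2 = rng.normal(size=(hidden, out))
b2 = rng.normal(size=(1, out))

h_sigmoid = sigmoid(X @ W1 + b1)

def loss_b2(b2_value):
    """Loss as a function of b2 only; everything else is held fixed."""
    return np.sum((Y - (h_sigmoid @ W2 + b2_value)) ** 2)

Y_pred = h_sigmoid @ W2 + b2
analytic = 2 * np.ones((1, N)) @ (Y_pred - Y)   # 2 * 1^T (Y_pred - Y), shape (1, out)

eps = 1e-6
numeric = np.zeros_like(b2)
for j in range(out):
    step = np.zeros_like(b2)
    step[0, j] = eps
    numeric[0, j] = (loss_b2(b2 + step) - loss_b2(b2 - step)) / (2 * eps)

print(analytic, numeric)
```

Note that the result has the same shape as `b2`, which is a useful check on where the all-ones vector must sit in the product.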
$$
\begin{aligned}
\mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\
&= tr(\frac{\partial f}{\partial Y_{pred}}^T h_{sigmoid} \mathrm{d}W_2) \\
&= tr((h_{sigmoid}^T\frac{\partial f}{\partial Y_{pred}})^T \mathrm{d}W_2)
\end{aligned} \tag{7}
$$
where the derivation of $\mathrm{d}Y_{pred}$ is shown in Eq.(8).
$$
\begin{aligned}
\mathrm{d}Y_{pred} &= \mathrm{d}(h_{sigmoid}W_2+b_2) \\
&= \mathrm{d}(h_{sigmoid}W_2) \\
&= (\mathrm{d}h_{sigmoid})W_2 + h_{sigmoid}\mathrm{d}W_2 \\
&= h_{sigmoid}\mathrm{d}W_2
\end{aligned} \tag{8}
$$
where only $W_2$ is varied: $b_2$ is held fixed and $h_{sigmoid}$ is not a function of $W_2$, so $\mathrm{d}b_2 = 0$ and $\mathrm{d}h_{sigmoid} = 0$.
$$
\frac{\partial f}{\partial W_2} = h_{sigmoid}^T \frac{\partial f}{\partial Y_{pred}} = 2 h_{sigmoid}^T (Y_{pred} - Y) \tag{9}
$$
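Eq.(9) can be compared against finite differences in the same way. This is a minimal sketch with assumed sizes and random data; `loss_W2` is a hypothetical helper that treats everything except $W_2$ as constant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and random data.
rng = np.random.default_rng(4)
N, d_in, hidden, out = 5, 3, 4, 2
X = rng.normal(size=(N, d_in))
Y = rng.normal(size=(N, out))
W1 = rng.normal(size=(d_in, hidden))
b1 = rng.normal(size=(1, hidden))
W2 = rng.normal(size=(hidden, out))
b2 = rng.normal(size=(1, out))

h_sigmoid = sigmoid(X @ W1 + b1)   # does not depend on W2

def loss_W2(W2_value):
    """Loss as a function of W2 only."""
    return np.sum((Y - (h_sigmoid @ W2_value + b2)) ** 2)

Y_pred = h_sigmoid @ W2 + b2
analytic = 2 * h_sigmoid.T @ (Y_pred - Y)   # Eq.(9), shape (hidden, out)

eps = 1e-6
numeric = np.zeros_like(W2)
for idx in np.ndindex(*W2.shape):
    step = np.zeros_like(W2)
    step[idx] = eps
    numeric[idx] = (loss_W2(W2 + step) - loss_W2(W2 - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```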
$$
h_{sigmoid} = sigmoid(h) = \frac{1}{1+e^{-h}} \tag{10}
$$
$$
\begin{aligned}
\mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\
&= tr(\frac{\partial f}{\partial Y_{pred}}^T(\mathrm{d}h_{sigmoid})W_2) \\
&= tr(W_2\frac{\partial f}{\partial Y_{pred}}^T\mathrm{d}h_{sigmoid}) \\
&= tr\left \{ \left[\frac{\partial f}{\partial Y_{pred}} W_2^T\right]^T (h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h) \right \} \\
&= tr\left \{ \left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T \mathrm{d}h \right \}
\end{aligned} \tag{11}
$$
where the differential of $sigmoid$ is given in Eq.(12), and the step from the fourth to the fifth line of Eq.(11) uses $tr(A^T(B\circ C)) = tr((A \circ B)^T C)$ from Eq.(20). Here $W_2$ and $b_2$ are held fixed, so $\mathrm{d}Y_{pred} = (\mathrm{d}h_{sigmoid})W_2$.
$$
\mathrm{d} h_{sigmoid} = h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h \tag{12}
$$
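Eq.(11) implies that the gradient with respect to the pre-activation is $\frac{\partial f}{\partial h} = \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})$. The sketch below checks this numerically by treating $h$ as a free variable; the sizes, data, and helper names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_grad(func, A, eps=1e-6):
    """Central finite differences of a scalar function func at the matrix A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        G[idx] = (func(A_plus) - func(A_minus)) / (2 * eps)
    return G

# Hypothetical sizes and random data; h is treated as the free variable.
rng = np.random.default_rng(5)
N, hidden, out = 5, 4, 2
h = rng.normal(size=(N, hidden))
Y = rng.normal(size=(N, out))
W2 = rng.normal(size=(hidden, out))
b2 = rng.normal(size=(1, out))

def loss_from_h(h_value):
    """Loss as a function of the pre-activation h."""
    return np.sum((Y - (sigmoid(h_value) @ W2 + b2)) ** 2)

h_sigmoid = sigmoid(h)
Y_pred = h_sigmoid @ W2 + b2
G = 2 * (Y_pred - Y)                                  # Eq.(5)
analytic = (G @ W2.T) * h_sigmoid * (1 - h_sigmoid)   # last line of Eq.(11)
numeric = numerical_grad(loss_from_h, h)
print(np.max(np.abs(analytic - numeric)))
```

The matrix product `G @ W2.T` is evaluated first and the element-wise products are applied afterwards, which is the grouping the trace identity requires.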
$$
\begin{aligned}
\frac{\partial f}{\partial h} &= \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \\
\frac{\partial f}{\partial b_1} &= \boldsymbol{1}^T \frac{\partial f}{\partial h} \\
&= 2\,\boldsymbol{1}^T \left[ \left((Y_{pred} - Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right]
\end{aligned} \tag{13}
$$
where $\boldsymbol{1}$ is an all-ones column vector of shape $N \times 1$, with $N$ the number of rows (samples) of $X$. This assumes that $b_1$ is a row vector added to every row of $XW_1$, i.e. $h = XW_1 + \boldsymbol{1}b_1$, so that $\mathrm{d}h = \boldsymbol{1}\,\mathrm{d}b_1$ when only $b_1$ is varied.
$$
\begin{aligned}
\mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\
&= tr(\frac{\partial f}{\partial Y_{pred}}^T(\mathrm{d}h_{sigmoid})W_2) \\
&= tr(W_2\frac{\partial f}{\partial Y_{pred}}^T\mathrm{d}h_{sigmoid}) \\
&= tr\left(\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T \mathrm{d}h\right) \\
&= tr\left(\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T X \mathrm{d}W_1\right) \\
&= tr \left \{ \left [X^T \left(\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right) \right ]^T \mathrm{d}W_1 \right \}
\end{aligned} \tag{14}
$$
Here $\mathrm{d}h = X\,\mathrm{d}W_1$, since $X$ and $b_1$ are held fixed.
$$
\begin{aligned}
\frac{\partial f}{\partial W_1} &= X^T \left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right] \\
&= 2X^T\left[\left((Y_{pred} - Y)W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]
\end{aligned} \tag{15}
$$
In summary, the derivative expressions of each variable are as follows.
$$
\begin{aligned}
\frac{\partial f}{\partial W_2} &= 2 h_{sigmoid}^T (Y_{pred} - Y) \\
\frac{\partial f}{\partial b_2} &= 2\,\boldsymbol{1}^T(Y_{pred} - Y) \\
\frac{\partial f}{\partial W_1} &= 2X^T\left[\left((Y_{pred} - Y)W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right] \\
\frac{\partial f}{\partial b_1} &= 2\,\boldsymbol{1}^T\left[\left((Y_{pred} - Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]
\end{aligned} \tag{16}
$$
where $\boldsymbol{1}$ is an all-ones column vector of shape $N \times 1$ ($N$ is the number of rows of $X$), so $\boldsymbol{1}^T(\cdot)$ sums over the rows; this assumes $b_1$ and $b_2$ are row vectors added to every row.
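The four gradients can be collected into one backward pass and checked together against finite differences. This is a sketch under assumptions, not a reference implementation: rows of `X` are samples, `b1` and `b2` are row vectors broadcast over the rows (so the bias gradients are column sums, i.e. $\boldsymbol{1}^T(\cdot)$ with an $N \times 1$ all-ones vector), and the names `loss`, `backward`, and `numerical_grad` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(p, X, Y):
    """Forward pass of Eq.(1) returning only the scalar loss f."""
    Y_pred = sigmoid(X @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return np.sum((Y - Y_pred) ** 2)

def backward(p, X, Y):
    """Analytic gradients of f with respect to W1, b1, W2, b2."""
    h_sigmoid = sigmoid(X @ p["W1"] + p["b1"])
    Y_pred = h_sigmoid @ p["W2"] + p["b2"]
    G = 2 * (Y_pred - Y)                                   # df/dY_pred
    dh = (G @ p["W2"].T) * h_sigmoid * (1 - h_sigmoid)     # df/dh
    return {
        "W2": h_sigmoid.T @ G,
        "b2": G.sum(axis=0, keepdims=True),    # 1^T G
        "W1": X.T @ dh,
        "b1": dh.sum(axis=0, keepdims=True),   # 1^T dh
    }

def numerical_grad(p, name, X, Y, eps=1e-6):
    """Central finite differences with respect to the parameter p[name]."""
    A = p[name]
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = loss(p, X, Y)
        A[idx] = old - eps
        f_minus = loss(p, X, Y)
        A[idx] = old                           # restore the entry
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G

# Hypothetical sizes and random data.
rng = np.random.default_rng(6)
N, d_in, hidden, out = 5, 3, 4, 2
X = rng.normal(size=(N, d_in))
Y = rng.normal(size=(N, out))
params = {
    "W1": rng.normal(size=(d_in, hidden)),
    "b1": rng.normal(size=(1, hidden)),
    "W2": rng.normal(size=(hidden, out)),
    "b2": rng.normal(size=(1, out)),
}

grads = backward(params, X, Y)
errors = {
    name: np.max(np.abs(grads[name] - numerical_grad(params, name, X, Y)))
    for name in params
}
print(errors)
```

Each entry of `errors` should be close to zero if the analytic expressions and the finite-difference estimates agree.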
Reference formulas
Basic differential formulas
$$
\begin{aligned}
\mathrm{d}(X \pm Y) &= \mathrm{d}X \pm \mathrm{d} Y \\
\mathrm{d}(XY) &= (\mathrm{d}X) Y + X\mathrm{d}Y \\
\mathrm{d}(X^T) &= (\mathrm{d}X)^T \\
\mathrm{d}\, tr(X) &= tr(\mathrm{d}X)
\end{aligned} \tag{17}
$$
Element-wise formulas
$$
\begin{aligned}
\mathrm{d}(X \circ Y) &= \mathrm{d}X \circ Y + X \circ \mathrm{d}Y \\
\mathrm{d} \sigma(X) &= \sigma^{\prime}(X) \circ \mathrm{d}X
\end{aligned} \tag{18}
$$
where $\sigma$ is an element-wise function and $\sigma^{\prime}(X)$ is its element-wise derivative; Eq.(19) gives an example. Note that $\circ$ denotes element-wise multiplication, i.e. the Hadamard product, which can also be written $\odot$.
$$
\begin{aligned}
X &= \left [ \begin{matrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{matrix} \right ] \\
\mathrm{d} sin(X) &= \left [ \begin{matrix} cos(X_{11})\mathrm{d}X_{11} & cos(X_{12})\mathrm{d}X_{12} \\ cos(X_{21})\mathrm{d}X_{21} & cos(X_{22})\mathrm{d}X_{22} \end{matrix} \right ] = cos(X) \circ \mathrm{d}X
\end{aligned} \tag{19}
$$
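The element-wise rule of Eq.(19) can be illustrated numerically: for a small perturbation $\mathrm{d}X$, the actual change of $sin(X)$ should match $cos(X) \circ \mathrm{d}X$ up to second-order terms. The matrix and the perturbation below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(2, 2))            # hypothetical 2 x 2 input
dX = 1e-6 * rng.normal(size=(2, 2))    # small perturbation standing in for dX

actual = np.sin(X + dX) - np.sin(X)    # true change of sin(X)
predicted = np.cos(X) * dX             # cos(X) Hadamard dX, Eq.(19)
max_err = np.max(np.abs(actual - predicted))
print(max_err)
```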
Properties of the matrix trace
$$
\begin{aligned}
a &= tr(a) \\
tr(A^T) &= tr(A) \\
tr(A \pm B) &= tr(A) \pm tr(B) \\
tr(AB) &= tr(BA) \\
tr(A^TB) &= tr(B^TA) = \sum_{i,j}{A_{ij}B_{ij}} \\
tr(A^T(B \circ C)) &= tr((A \circ B)^T C) = \sum_{i,j}{A_{ij}B_{ij}C_{ij}}
\end{aligned} \tag{20}
$$
where $a$ is a scalar; $A$ and $B^T$ have the same shape in the fourth equation of Eq.(20); and $A$, $B$ and $C$ have the same shape in the sixth equation of Eq.(20). Note that in the fifth equation of Eq.(20) $A$ and $B$ themselves have the same shape, which differs from the fourth equation.
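The fourth, fifth, and sixth identities of Eq.(20) are easy to confirm on random matrices, paying attention to the shape requirements just described. The shapes used here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Fourth identity: A and B^T share a shape, so AB and BA are both square.
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))
fourth = (np.trace(A @ B), np.trace(B @ A))

# Fifth and sixth identities: P, Q, R all share the same shape.
P = rng.normal(size=(3, 4))
Q = rng.normal(size=(3, 4))
R = rng.normal(size=(3, 4))
fifth = (np.trace(P.T @ Q), np.trace(Q.T @ P), np.sum(P * Q))
sixth = (np.trace(P.T @ (Q * R)), np.trace((P * Q).T @ R), np.sum(P * Q * R))

print(fourth, fifth, sixth)
```

In each tuple all entries should coincide up to floating-point round-off.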
Derivative of the sigmoid when the input is a matrix
Let $\sigma$: $\mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n}$ apply the $sigmoid$ function to each element.
$$
\sigma(X) = \frac{1}{1+exp(-X)} \tag{21}
$$
$$
\begin{aligned}
\mathrm{d} \sigma(X) &= \frac{-1[exp(-X) \circ \mathrm{d}(-X)]}{(1+exp(-X))^2} \\
&= \frac{-1[exp(-X) \circ (\boldsymbol{-1}) \circ \mathrm{d}X]}{(1+exp(-X))^2} \\
&= \frac{\boldsymbol{1} \circ exp(-X) \circ \mathrm{d}X}{(1+exp(-X))^2} \\
&= \frac{\boldsymbol{1}}{1+exp(-X)} \circ \frac{ exp(-X) + \boldsymbol{1} - \boldsymbol{1} }{1+exp(-X)} \circ \mathrm{d}X \\
&= \sigma(X) \circ (\boldsymbol{1} - \sigma(X)) \circ \mathrm{d}X
\end{aligned} \tag{22}
$$
where $1$ and $\boldsymbol{1}$ both denote the all-ones matrix of the same shape as $X$, and all fractions are element-wise divisions.
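The result $\sigma^{\prime}(X) = \sigma(X) \circ (\boldsymbol{1} - \sigma(X))$ can be compared with an element-wise central difference. The sketch uses the standard logistic form $1/(1+exp(-X))$ and a random matrix as a placeholder input.

```python
import numpy as np

def sigmoid(X):
    """Element-wise logistic function 1 / (1 + exp(-X))."""
    return 1.0 / (1.0 + np.exp(-X))

rng = np.random.default_rng(9)
X = rng.normal(size=(2, 3))     # hypothetical matrix input

S = sigmoid(X)
analytic = S * (1 - S)          # sigma(X) Hadamard (1 - sigma(X))

eps = 1e-6
# Each output entry depends only on the matching input entry,
# so all entries can be perturbed at once.
numeric = (sigmoid(X + eps) - sigmoid(X - eps)) / (2 * eps)
max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```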
Differentiation and derivatives
Derivative of a scalar with respect to a scalar
$$
\mathrm{d}f = f^{\prime}(x) \mathrm{d}x \tag{23}
$$
Derivative of a scalar with respect to a vector (multivariate differential)
$$
\mathrm{d} f = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \mathrm{d} x_i = \frac{\partial f}{\partial x}^T \mathrm{d} x \tag{24}
$$
As shown in Eq.(24), the total differential $\mathrm{d}f$ is the inner product of the gradient vector $\frac{\partial f}{\partial x}$ ($n \times 1$) and the differential vector $\mathrm{d}x$ ($n \times 1$). The first equality is the total differential formula, and the second expresses the relationship between the gradient and the differential.
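A small numerical illustration of Eq.(24): for a tiny step $\mathrm{d}x$, the change in $f$ is approximately the inner product of the gradient and $\mathrm{d}x$. The function `f`, its hand-derived gradient `grad_f`, and the chosen point are all hypothetical examples, not taken from the text.

```python
import numpy as np

def f(x):
    """Hypothetical scalar function of a 3-vector."""
    return np.sin(x[0]) * x[1] + x[2] ** 2

def grad_f(x):
    """Gradient of f, derived by hand."""
    return np.array([np.cos(x[0]) * x[1], np.sin(x[0]), 2 * x[2]])

x = np.array([0.5, -1.2, 2.0])
dx = 1e-6 * np.array([1.0, -2.0, 0.5])   # small step standing in for dx

df_linear = grad_f(x) @ dx               # gradient^T dx, Eq.(24)
df_actual = f(x + dx) - f(x)             # true change in f
gap = abs(df_actual - df_linear)
print(df_linear, df_actual, gap)
```

The gap is of second order in the step size, which is why it shrinks much faster than the step itself.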
The relationship between matrix differentiation and derivatives
$$
\mathrm{d} f = \sum_{i=1}^{m}{\sum_{j=1}^{n}{\frac{\partial f}{\partial X_{ij}} \mathrm{d} X_{ij}}} = tr \left (\frac{\partial f}{\partial X}^T \mathrm{d}X \right ) \tag{25}
$$
where the second equality follows from the fifth equation of Eq.(20).
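Eq.(25) can be illustrated in the same spirit for a matrix argument. The sketch uses a hypothetical function $f(X) = ||XW - Y||_F^2$, whose gradient $2(XW - Y)W^T$ follows from the trace technique used throughout; `W`, `Y`, and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))    # hypothetical constant matrix
Y = rng.normal(size=(3, 2))    # hypothetical constant matrix

def f(X_value):
    """Hypothetical scalar function f(X) = ||X W - Y||_F^2."""
    return np.sum((X_value @ W - Y) ** 2)

grad = 2 * (X @ W - Y) @ W.T              # df/dX, same shape as X
dX = 1e-6 * rng.normal(size=X.shape)      # small perturbation standing in for dX

df_trace = np.trace(grad.T @ dX)          # tr(grad^T dX), Eq.(25)
df_sum = np.sum(grad * dX)                # double sum over all entries
df_actual = f(X + dX) - f(X)              # true change in f
print(df_trace, df_sum, df_actual)
```

The trace form and the double sum are the same number, and both approximate the true change to first order.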