Backpropagation Gradient Derivation for a Two-Layer Fully Connected Network (Matrix Form, Sigmoid, Mean Squared Error (MSE))

dd_cs_ccc

Last modified: 2022-10-29 12:41:05

Category: Machine Learning. Tags: matrix, machine learning, linear algebra

First published: 2022-10-29 12:38:05

Article link: https://blog.csdn.net/qq_43561370/article/details/127585541

Solving for Derivatives

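The body of this post is written in English (switching between Chinese and English input methods while typing prose and formulas was too painful -_-), but the wording is kept very simple. I hope you can read it through, and even more that it helps you!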

$\begin{aligned} h &= XW_1 + b_1 \\ h_{sigmoid} &= sigmoid(h) \\ Y_{pred} &= h_{sigmoid}W_2 + b_2 \\ f &= ||Y-Y_{pred}||^2_F \end{aligned} \tag{1}$
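As a rough illustration of Eq.(1), the forward pass can be sketched in NumPy. The sizes `N`, `n_in`, `hidden`, `n_out` and the random data are assumptions made only for this example, and the biases are assumed to be 1-D vectors that broadcast across the rows.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, Y):
    """Forward pass of Eq.(1); the biases broadcast across the N rows."""
    h = X @ W1 + b1                  # (N, hidden)
    h_sigmoid = sigmoid(h)           # (N, hidden)
    Y_pred = h_sigmoid @ W2 + b2     # (N, n_out)
    f = np.sum((Y - Y_pred) ** 2)    # squared Frobenius norm
    return h, h_sigmoid, Y_pred, f

# Hypothetical sizes and random data, chosen only for illustration.
N, n_in, hidden, n_out = 5, 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(N, n_in))
Y = rng.normal(size=(N, n_out))
W1 = rng.normal(size=(n_in, hidden))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(hidden, n_out))
b2 = rng.normal(size=n_out)

h, h_sigmoid, Y_pred, f = forward(X, W1, b1, W2, b2, Y)
print(h.shape, h_sigmoid.shape, Y_pred.shape, f)
```

Keeping the intermediate values `h` and `h_sigmoid` is convenient because the gradients reuse them.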

Solve for the derivatives of $f$ with respect to the following variables.
$\frac{\partial f}{\partial W_2} \,\,\, \frac{\partial f}{\partial b_2} \,\,\, \frac{\partial f}{\partial W_1} \,\,\, \frac{\partial f}{\partial b_1} \tag{2}$

Derivation of the derivatives

$||Y-Y_{pred}||^2_F = tr((Y-Y_{pred})^T(Y-Y_{pred})) \tag{3}$

$\begin{aligned} \mathrm{d}f &= \mathrm{d}\left\{ tr[(Y-Y_{pred})^T(Y-Y_{pred})] \right \} \\ &= tr \left \{ \mathrm{d} [(Y-Y_{pred})^T(Y-Y_{pred})] \right \} \\ &= tr\left \{ [\mathrm{d} (Y-Y_{pred})^T](Y-Y_{pred})+(Y-Y_{pred})^T \mathrm{d} (Y-Y_{pred})\right \} \\ &= tr[-(\mathrm{d} Y_{pred}^T)(Y-Y_{pred}) - (Y-Y_{pred})^T \mathrm{d} Y_{pred}] \\ &= 2tr[(Y_{pred}-Y)^T \mathrm{d} Y_{pred}] \end{aligned} \tag{4}$

where the last step uses $tr(A^T) = tr(A)$: the matrix $(\mathrm{d}Y_{pred}^T)(Y-Y_{pred})$ is the transpose of $(Y-Y_{pred})^T\mathrm{d}Y_{pred}$, so the two terms have the same trace.

According to the relationship between gradient and differential (The relationship between matrix differentiation and derivatives), we can obtain the result of $\frac{\partial f}{\partial Y_{pred}}$ as follows.
$\frac{\partial f}{\partial Y_{pred}} = 2(Y_{pred} - Y) \tag{5}$
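One way to sanity-check Eq.(5) is a central finite-difference comparison. The sketch below uses small random matrices (an assumption for illustration only) and perturbs one entry of $Y_{pred}$ at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 3))        # hypothetical targets
Y_pred = rng.normal(size=(4, 3))   # hypothetical predictions

def loss(P):
    """f = ||Y - P||_F^2 as in Eq.(3)."""
    return np.sum((Y - P) ** 2)

analytic = 2.0 * (Y_pred - Y)      # Eq.(5)

eps = 1e-6
numeric = np.zeros_like(Y_pred)
for i in range(Y_pred.shape[0]):
    for j in range(Y_pred.shape[1]):
        E = np.zeros_like(Y_pred)
        E[i, j] = eps              # perturb a single entry
        numeric[i, j] = (loss(Y_pred + E) - loss(Y_pred - E)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

The printed difference should be close to zero, limited only by floating-point rounding.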

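$\frac{\partial f}{\partial b_2} = \left(\frac{\partial f}{\partial Y_{pred}}\right)^T \boldsymbol{1} = 2(Y_{pred} - Y)^T \, \boldsymbol{1} \tag{6}$

where $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$, and $N$ is the number of rows (samples) of $X$. Here $b_2$ is treated as a column vector that is added to every row of $h_{sigmoid}W_2$, so $\mathrm{d}Y_{pred} = \boldsymbol{1}\,\mathrm{d}b_2^T$ and $\mathrm{d}f = tr(\frac{\partial f}{\partial Y_{pred}}^T \boldsymbol{1}\,\mathrm{d}b_2^T) = tr((\frac{\partial f}{\partial Y_{pred}}^T \boldsymbol{1})^T \mathrm{d}b_2)$. In other words, the gradient with respect to $b_2$ sums $\frac{\partial f}{\partial Y_{pred}}$ over the samples.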

$\begin{aligned} \mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\ &= tr(\frac{\partial f}{\partial Y_{pred}}^T h_{sigmoid} \mathrm{d}W_2) \\ &= tr((h_{sigmoid}^T\frac{\partial f}{\partial Y_{pred}})^T \mathrm{d}W_2) \end{aligned} \tag{7}$

where the derivation of $\mathrm{d}Y_{pred}$ is shown in Eq.(8).
$\begin{aligned} \mathrm{d}Y_{pred} &= \mathrm{d}(h_{sigmoid}W_2+b_2) \\ &= \mathrm{d}(h_{sigmoid}W_2) \\ &= (\mathrm{d}h_{sigmoid})W_2 + h_{sigmoid}\mathrm{d}W_2 \\ &= h_{sigmoid}\mathrm{d}W_2 \end{aligned} \tag{8}$
where $b_2$ is held fixed and $h_{sigmoid}$ is not a function of $W_2$, so $\mathrm{d}b_2 = 0$ and $\mathrm{d}h_{sigmoid} = 0$.

$\frac{\partial f}{\partial W_2} = h_{sigmoid}^T \frac{\partial f}{\partial Y_{pred}} = 2 h_{sigmoid}^T (Y_{pred} - Y) \tag{9}$

$h_{sigmoid} = sigmoid(h) = \frac{1}{1+e^{-h}} \tag{10}$

$\begin{aligned} \mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\ &= tr(\frac{\partial f}{\partial Y_{pred}}^T(\mathrm{d}h_{sigmoid})W_2) \\ &= tr(W_2\frac{\partial f}{\partial Y_{pred}}^T\mathrm{d}h_{sigmoid}) \\ &= tr\left \{ [\frac{\partial f}{\partial Y_{pred}} W_2^T]^T (h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h) \right \} \\ &= tr\left \{ [(\frac{\partial f}{\partial Y_{pred}} W_2^T) \circ h_{sigmoid} \circ (1-h_{sigmoid})]^T \mathrm{d}h \right \} \end{aligned} \tag{11}$

where $W_2$ and $b_2$ are held fixed in the second line, so $\mathrm{d}Y_{pred} = (\mathrm{d}h_{sigmoid})W_2$. The differential of $sigmoid$ is given in Eq.(12), and the step from the fourth to the fifth line of Eq.(11) uses $tr(A^T(B\circ C)) = tr((A \circ B)^T C)$.

$\mathrm{d} h_{sigmoid} = h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h \tag{12}$

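$\begin{aligned} \frac{\partial f}{\partial h} &= \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \\ \frac{\partial f}{\partial b_1} &= \left(\frac{\partial f}{\partial h}\right)^T \boldsymbol{1} \\ &= \left[ \left(2(Y_{pred} - Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right]^T \boldsymbol{1} \end{aligned} \tag{13}$

where $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$, and $N$ is the number of rows (samples) of $X$. As with $b_2$, $b_1$ is treated as a column vector added to every row of $XW_1$, so $\mathrm{d}h = \boldsymbol{1}\,\mathrm{d}b_1^T$ and the gradient sums $\frac{\partial f}{\partial h}$ over the samples.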

$\begin{aligned} \mathrm{d} f &= tr(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}) \\ &= tr(\frac{\partial f}{\partial Y_{pred}}^T(\mathrm{d}h_{sigmoid})W_2) \\ &= tr(W_2\frac{\partial f}{\partial Y_{pred}}^T\mathrm{d}h_{sigmoid}) \\ &= tr([(\frac{\partial f}{\partial Y_{pred}} W_2^T) \circ h_{sigmoid} \circ (1-h_{sigmoid})]^T \mathrm{d}h) \\ &= tr([(\frac{\partial f}{\partial Y_{pred}} W_2^T) \circ h_{sigmoid} \circ (1-h_{sigmoid})]^T X \mathrm{d}W_1) \\ &= tr \left \{ \left [X^T \left( (\frac{\partial f}{\partial Y_{pred}} W_2^T) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right) \right ]^T \mathrm{d}W_1 \right \} \end{aligned} \tag{14}$

$\begin{aligned} \frac{\partial f}{\partial W_1} &= X^T \left[ \left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right] \\ &= 2X^T \left[ \left((Y_{pred} - Y)W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right] \end{aligned} \tag{15}$

In summary, the derivative expressions of each variable are as follows.

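$\begin{aligned} \frac{\partial f}{\partial W_2} &= 2 h_{sigmoid}^T (Y_{pred} - Y) \\ \frac{\partial f}{\partial b_2} &= 2(Y_{pred} - Y)^T \, \boldsymbol{1} \\ \frac{\partial f}{\partial W_1} &= 2X^T \left[ \left((Y_{pred} - Y)W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right] \\ \frac{\partial f}{\partial b_1} &= \left[ \left(2(Y_{pred} - Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid}) \right]^T \boldsymbol{1} \end{aligned} \tag{16}$

where $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$ ($N$ is the number of rows of $X$), and the Hadamard products are taken after the matrix product $(Y_{pred} - Y)W_2^T$, which has the same shape as $h_{sigmoid}$.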
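The four gradients can be sketched in NumPy and compared against central finite differences. This is a minimal check under stated assumptions, not a reference implementation: the sizes and random data are hypothetical, the biases are 1-D vectors broadcast across the $N$ rows (so their gradients sum over the sample axis), and the Hadamard products are applied to the matrix product $(Y_{pred} - Y)W_2^T$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, Y, W1, b1, W2, b2):
    """f = ||Y - Y_pred||_F^2 for the two-layer network of Eq.(1)."""
    h_sigmoid = sigmoid(X @ W1 + b1)
    Y_pred = h_sigmoid @ W2 + b2
    return np.sum((Y - Y_pred) ** 2)

def grads(X, Y, W1, b1, W2, b2):
    """Analytic gradients; biases are assumed to broadcast over rows."""
    h_sigmoid = sigmoid(X @ W1 + b1)
    Y_pred = h_sigmoid @ W2 + b2
    G = 2.0 * (Y_pred - Y)                           # df/dY_pred, (N, n_out)
    dW2 = h_sigmoid.T @ G                            # (hidden, n_out)
    db2 = G.sum(axis=0)                              # G^T 1, summed over samples
    D = (G @ W2.T) * h_sigmoid * (1.0 - h_sigmoid)   # df/dh, (N, hidden)
    dW1 = X.T @ D                                    # (n_in, hidden)
    db1 = D.sum(axis=0)                              # D^T 1, summed over samples
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def numeric_grad(fn, P, eps=1e-6):
    """Central finite differences of the scalar fn() w.r.t. array P (in place)."""
    g = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        old = P[idx]
        P[idx] = old + eps
        f_plus = fn()
        P[idx] = old - eps
        f_minus = fn()
        P[idx] = old                                 # restore the entry
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
N, n_in, hidden, n_out = 6, 3, 4, 2
X = rng.normal(size=(N, n_in))
Y = rng.normal(size=(N, n_out))
params = {
    "W1": rng.normal(size=(n_in, hidden)),
    "b1": rng.normal(size=hidden),
    "W2": rng.normal(size=(hidden, n_out)),
    "b2": rng.normal(size=n_out),
}

f = lambda: loss(X, Y, params["W1"], params["b1"], params["W2"], params["b2"])
analytic = grads(X, Y, params["W1"], params["b1"], params["W2"], params["b2"])
errors = {k: np.max(np.abs(analytic[k] - numeric_grad(f, params[k]))) for k in params}
print(errors)
```

If the formulas are right, every entry of `errors` should be tiny. A large error on only `b1` or `b2` usually means the sum was taken over the wrong axis.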

Reference formula

Basic differential formula

$\begin{aligned} \mathrm{d}(X \pm Y) &= \mathrm{d}X \pm \mathrm{d} Y \\ \mathrm{d}(XY) &= (\mathrm{d}X) Y + X\mathrm{d}Y \\ \mathrm{d}(X^T) &= (\mathrm{d}X)^T \\ \mathrm{d} tr(X) &= tr(\mathrm{d}X) \\ \end{aligned} \tag{17}$

Element-wise formula

$\begin{aligned} \mathrm{d}(X \circ Y) &= \mathrm{d}X \circ Y + X \circ \mathrm{d}Y \\ \mathrm{d} \sigma(X) &= \sigma^{\prime}(X) \circ \mathrm{d}X \end{aligned} \tag{18}$

where $\sigma$ is an element-wise function and $\sigma^{\prime}(X)$ is its element-wise derivative. You can refer to the following example. Note that $\circ$ means element-wise multiplication, i.e., the Hadamard product, which can also be denoted as $\odot$.

$\begin{aligned} X &= \left [ \begin{matrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{matrix} \right ] \\ \mathrm{d} sin(X) &= \left [ \begin{matrix} cos(X_{11})\mathrm{d}X_{11} & cos(X_{12})\mathrm{d}X_{12} \\ cos(X_{21})\mathrm{d}X_{21} & cos(X_{22})\mathrm{d}X_{22} \end{matrix} \right ] = cos(X) \circ \mathrm{d}X \\ \end{aligned} \tag{19}$
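The element-wise rule can be checked numerically: for a small perturbation $\mathrm{d}X$, the change in $sin(X)$ should match $cos(X) \circ \mathrm{d}X$ up to second-order terms. The matrix and the perturbation below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 2))
dX = 1e-6 * rng.normal(size=(2, 2))   # small hypothetical perturbation

actual = np.sin(X + dX) - np.sin(X)   # true change in sin(X)
predicted = np.cos(X) * dX            # cos(X) ∘ dX; '*' is the Hadamard product

max_err = np.max(np.abs(actual - predicted))
print(max_err)
```

Note that `*` on NumPy arrays is element-wise, while `@` is the matrix product.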

The properties of the trace of matrix

$\begin{aligned} a &= tr(a) \\ tr(A^T) &= tr(A) \\ tr(A \pm B) &= tr(A) \pm tr(B) \\ tr(AB) &= tr(BA) \\ tr(A^TB) &= tr(B^TA) = \sum_{i,j}{A_{ij}B_{ij}} \\ tr(A^T(B \circ C)) &= tr((A \circ B)^T C) = \sum_{i,j}{A_{ij}B_{ij}C_{ij}} \\ \end{aligned} \tag{20}$

where $a$ is a scalar. In the fourth equation of Eq.(20), $A$ and $B^T$ have the same shape. In the fifth equation, $A$ and $B$ have the same shape, which is different from the fourth equation. In the sixth equation, $A$, $B$, and $C$ all have the same shape.
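The trace identities used in the derivation can be spot-checked on random matrices. The shapes below are arbitrary assumptions; `M` plays the role of a matrix with the shape of $A^T$ for the cyclic property.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 4))
M = rng.normal(size=(4, 3))   # same shape as A^T

# tr(AB) = tr(BA), with B^T shaped like A
cyclic = np.isclose(np.trace(A @ M), np.trace(M @ A))
# tr(A^T B) = tr(B^T A) = sum_ij A_ij B_ij
inner = np.isclose(np.trace(A.T @ B), np.trace(B.T @ A)) and \
        np.isclose(np.trace(A.T @ B), np.sum(A * B))
# tr(A^T (B ∘ C)) = tr((A ∘ B)^T C) = sum_ij A_ij B_ij C_ij
hadamard = np.isclose(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C)) and \
           np.isclose(np.trace(A.T @ (B * C)), np.sum(A * B * C))

print(cyclic, inner, hadamard)
```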

Derivative of sigmoid assuming the input is a matrix

Let $\sigma$ : $\mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n}$ apply the $sigmoid$ function to each element.

$\sigma(X) = \frac{1}{1+exp(-X)} \tag{21}$

$\begin{aligned} \mathrm{d} \sigma(X) &= \frac{-1[exp(-X) \circ \mathrm{d}(-X)]}{(1+exp(-X))^2} \\ &= \frac{-1[exp(-X) \circ (\boldsymbol{-1}) \circ \mathrm{d}X]}{(1+exp(-X))^2} \\ &= \frac{\boldsymbol{1} \circ exp(-X) \circ \mathrm{d}X}{(1+exp(-X))^2} \\ &= \frac{\boldsymbol{1}}{1+exp(-X)} \circ \frac{ exp(-X) + \boldsymbol{1} - \boldsymbol{1} }{1+exp(-X)} \circ \mathrm{d}X \\ &= \sigma(X) \circ (\boldsymbol{1} - \sigma(X)) \circ \mathrm{d}X \end{aligned} \tag{22}$

where $1$ and $\boldsymbol{1}$ both denote all-ones matrices of the same shape as $X$, and the divisions are element-wise.
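As a quick numerical check of the result $\sigma^{\prime}(X) = \sigma(X) \circ (\boldsymbol{1} - \sigma(X))$, the sketch below compares it with a central finite difference, using the standard definition $\sigma(X) = 1/(1+exp(-X))$ and a random input chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))

eps = 1e-6
# sigmoid acts element-wise, so shifting every entry by eps at once is valid.
numeric = (sigmoid(X + eps) - sigmoid(X - eps)) / (2 * eps)
analytic = sigmoid(X) * (1.0 - sigmoid(X))

max_err = np.max(np.abs(numeric - analytic))
print(max_err)
```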

Differentiation and derivatives

Derivative of a scalar with respect to a scalar

$\mathrm{d}f = f^{\prime}(x) \mathrm{d}x \tag{23}$

Derivative of a scalar with respect to a vector (Multivariate Differential)

$\mathrm{d} f = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \mathrm{d} x_i = \frac{\partial f}{\partial x}^T \mathrm{d} x \tag{24}$

As shown in Eq.(24), the total differential $\mathrm{d} f$ is the inner product of the gradient vector $\frac{\partial f}{\partial x} (n \times 1)$ and the differential vector $\mathrm{d}x (n \times 1)$ . The first equality is the total differential formula, and the second equality is the relationship between the gradient and the differential.

The relationship between matrix differentiation and derivatives

$\mathrm{d} f = \sum_{i=1}^{m}{\sum_{j=1}^{n}{\frac{\partial f}{\partial X_{ij}} \mathrm{d} X_{ij}}} = tr \left (\frac{\partial f}{\partial X}^T \mathrm{d}X \right ) \tag{25}$
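To make Eq.(25) concrete, the sketch below uses a hypothetical function $f(X) = ||AX - B||_F^2$ (not part of the derivation above), whose gradient is $2A^T(AX - B)$, and compares the actual change in $f$ with $tr(\frac{\partial f}{\partial X}^T \mathrm{d}X)$ for a small random $\mathrm{d}X$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))           # hypothetical fixed matrices
B = rng.normal(size=(4, 2))
X = rng.normal(size=(3, 2))
dX = 1e-6 * rng.normal(size=(3, 2))   # small perturbation

def f(X):
    return np.sum((A @ X - B) ** 2)

G = 2.0 * A.T @ (A @ X - B)           # gradient df/dX, same shape as X

trace_form = np.trace(G.T @ dX)       # tr(G^T dX)
double_sum = np.sum(G * dX)           # sum_ij G_ij dX_ij
actual = f(X + dX) - f(X)             # true change, equal up to O(|dX|^2)

print(trace_form, double_sum, actual)
```

The trace form and the double sum agree to rounding error, and both match the actual change to first order in $\mathrm{d}X$.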