title: Mathematical Process of Back Propagation Derivatives in a Neural Network
date: 2020-04-06 12:25:09
tags: Machine Learning
mathjax: true
I took several days to figure out the process of Back Propagation in a Neural Network, and after I finally worked through it, I review and record the process here.
Generally Speaking
As we all know, in order to do some mathematics on a Neural Network, we need to define some notation first.
We define the Training Set as

$$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\dots,(x^{(m)},y^{(m)})\}$$

which means we have $m$ samples in the training set in total.
The neural network we consider has $n_l$ layers in total; the $1^{st}$ layer is the Input Layer and the last layer, i.e. the $n_l^{th}$ layer, is the Output Layer. The number of neurons in layer $i$ is $S_i$. So the dimension of $y^{(i)}$ is $S_{n_l}$, i.e. $y^{(i)}=[y^{(i)}_1,y^{(i)}_2,y^{(i)}_3,\dots,y^{(i)}_{S_{n_l}}]^T$, and the dimension of $x^{(i)}$ is $S_1$, i.e. $x^{(i)}=[x^{(i)}_1,x^{(i)}_2,x^{(i)}_3,\dots,x^{(i)}_{S_1}]^T$. The connection weight from node $i$ in layer $l-1$ to node $j$ in layer $l$ is defined as $w^{(l)}_{ji}$, and the bias of layer $l$ is defined as $b^{(l)}$, in which $l\in\{n_l, n_l-1, \dots, 2\}$. That means in this neural network we have parameters $w^{(l)}$ and $b^{(l)}$ for every layer except the input layer.
The Cost Function for a particular sample $(x,y)$, where $h_{w,b}(x)$ denotes the output of the network, is defined in formula $(1)$:
$$J(w,b;x,y)=\frac{1}{2}||h_{w,b}(x)-y||^2\tag{1}$$
We use the Mean Squared Error as the error criterion. For the whole set $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\dots,(x^{(m)},y^{(m)})\}$, the Cost Function is defined as:
$$J(w,b)=\Big[\sum^{m}_{i=1}J(w,b;x^{(i)},y^{(i)})\Big]+\frac{\lambda}{2}\sum_{l=2}^{n_l}\sum_{i=1}^{S_{l-1}}\sum_{j=1}^{S_l}\big(w^{(l)}_{ji}\big)^2\tag{2}$$
The second term on the right of the equation, $\frac{\lambda}{2}\sum_{l=2}^{n_l}\sum_{i=1}^{S_{l-1}}\sum_{j=1}^{S_l}(w^{(l)}_{ji})^2$, is actually a regularization term added in order to avoid the "overfitting problem", and its derivative with respect to any parameter is simple. So in the following sections let's just ignore it.
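Formula $(2)$ maps directly to code. Here is a minimal sketch (the function name `cost` and the argument layout are illustrative, not from the original text), assuming the network outputs for all $m$ samples have already been computed and are stored one column per sample:

```python
import numpy as np

def cost(weights, h, y_batch, lam):
    """Sketch of formula (2): the squared-error sum over m samples plus
    the regularization term (lambda/2) * sum of all squared weights.
    `weights` is a list of weight matrices, `h` holds the network outputs
    h_{w,b}(x^{(i)}) and `y_batch` the targets, one column per sample."""
    data_term = 0.5 * np.sum((h - y_batch) ** 2)                 # sum_i J(w,b; x^(i), y^(i))
    reg_term = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)  # (lambda/2) * sum of w^2
    return data_term + reg_term
```

As in the text, setting `lam = 0` drops the regularization term and leaves only the data term used in the rest of the derivation.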
The target we want to solve is $\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial{J(w,b)}}{\partial b^{(l)}_{i}}$, in order to optimize those parameters with some method (like Gradient Descent). It is actually pretty hard to calculate $\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial{J(w,b)}}{\partial b^{(l)}_{i}}$ directly, since that requires a huge amount of matrix-derivative work. So let's concentrate on an easier way.
Error Term $\delta$
According to "Neural Networks and Deep Learning", based on the chain rule, formulas $(3)$ and $(4)$ are defined:
$$\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}\tag{3}$$
$$\frac{\partial{J(w,b)}}{\partial b^{(l)}_{j}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}\tag{4}$$
We can use an Error Term $\delta$ to calculate the above partial derivatives more easily:

$$\delta^{(l)}_{j}=\frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\tag{5}$$
in which

$$z^{(l)}=w^{(l)}a^{(l-1)}+b^{(l)}$$

$$a^{(l)}=f(z^{(l)})$$
Here we define the function $f$ as the Activation Function (such as sigmoid, tanh, etc.).
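The two recurrences above are exactly the forward pass. A minimal sketch, assuming the sigmoid as $f$ and weights/biases stored in dicts keyed by layer index (the names `forward`, `w`, `b` are illustrative):

```python
import numpy as np

def f(z):
    """Activation function; the sigmoid is used here as one common choice."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """Compute z^(l) = w^(l) a^(l-1) + b^(l) and a^(l) = f(z^(l))
    for l = 2 .. n_l; layer 1 is the input layer, so a^(1) = x.
    `w[l]` has shape (S_l, S_{l-1}) and `b[l]` shape (S_l,)."""
    a, z = {1: x}, {}
    n_l = max(w)  # index of the output layer
    for l in range(2, n_l + 1):
        z[l] = w[l] @ a[l - 1] + b[l]
        a[l] = f(z[l])
    return z, a
```

Note that, matching the text, there is no `w[1]`: the first weight matrix is `w[2]`, which maps the input layer to the second layer.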
To calculate $\delta$, we start from the output layer and work back toward the input layer step by step.
Calculation of Output Layer $\delta^{(n_l)}_i$
Calculating the output layer $\delta^{(n_l)}_i$ is relatively easy:

$$\delta^{(n_l)}_i=-(y_i-a_i^{(n_l)})f'(z_i^{(n_l)})\tag{6}$$
Proof
In the proof, we denote the Training Set as a single sample $(x,y)$.
$$\begin{aligned} \delta^{(n_l)}_i&=\frac{\partial}{\partial z^{(n_l)}_i}J(w,b) \\ &=\frac{\partial}{\partial z^{(n_l)}_i}J(w,b;x,y) \\ &=\frac{\partial}{\partial z^{(n_l)}_i}\frac{1}{2}||y-h_{w,b}(x)||^2 \\ &=\frac{\partial}{\partial z^{(n_l)}_i}\frac{1}{2}\sum_{j=1}^{S_{n_l}}\big(y_j-f(z_j^{(n_l)})\big)^2 \\ &=-\big(y_i-f(z_i^{(n_l)})\big)f'(z_i^{(n_l)}) \\ &=-(y_i-a_i^{(n_l)})f'(z^{(n_l)}_i) \end{aligned}$$
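Formula $(6)$ can be computed for the whole output layer at once. A sketch, again assuming the sigmoid as $f$ (so that $f'(z)=f(z)(1-f(z))$; the function names are illustrative):

```python
import numpy as np

def f(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    """Derivative of the sigmoid: f'(z) = f(z) * (1 - f(z))."""
    s = f(z)
    return s * (1.0 - s)

def output_delta(y, z_out):
    """Formula (6): delta_i = -(y_i - a_i) * f'(z_i), element-wise,
    where a_i = f(z_out_i) is the output-layer activation."""
    return -(y - f(z_out)) * f_prime(z_out)
```

For example, with $z^{(n_l)}=0$ the activation is $0.5$ and $f'(0)=0.25$, so a target of $1$ gives $\delta = -(1-0.5)\cdot 0.25 = -0.125$.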
Calculation of the $n_l-1$ Layer $\delta^{(n_l-1)}_{i}$
For $l\in\{n_l-1, n_l-2, \dots, 2\}$ (we cannot let $l=1$, since the parameter $w$ between the input layer and the second layer is defined as $w^{(2)}$ and there is no $w^{(1)}$ parameter), the $\delta$ can be computed as:
$$\delta_i^{(l)}=\Big(\sum_{j=1}^{S_{l+1}}w_{ji}^{(l+1)}\delta_j^{(l+1)}\Big)f'(z_i^{(l)})\tag{7}$$
To prove that, we first prove the case of $\delta_i^{(n_l-1)}$ in the $n_l-1$ layer:
$$\delta_i^{(n_l-1)}=\Big(\sum_{j=1}^{S_{n_l}}w_{ji}^{(n_l)}\delta_j^{(n_l)}\Big)f'(z_i^{(n_l-1)})\tag{8}$$
Proof
$$\begin{aligned} \delta_i^{(n_l-1)}&=\frac{\partial}{\partial z_i^{(n_l-1)}}J(w,b;x,y) \\ &=\sum_{j=1}^{S_{n_l}}\frac{\partial J(w,b;x,y)}{\partial z_j^{(n_l)}}\frac{\partial z_j^{(n_l)}}{\partial z_i^{(n_l-1)}} \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}\frac{\partial}{\partial z_i^{(n_l-1)}}\Big(\sum_{k=1}^{S_{n_l-1}}w_{jk}^{(n_l)}f(z_k^{(n_l-1)})+b_j^{(n_l)}\Big) \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}w_{ji}^{(n_l)}f'(z_i^{(n_l-1)}) \\ &=\Big(\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}w_{ji}^{(n_l)}\Big)f'(z_i^{(n_l-1)}) \end{aligned}$$
Based on the above mathematical process, we can conclude formula $(9)$:
$$\delta_i^{(n_l-1)}=\Big(\sum_{j=1}^{S_{n_l}}\delta^{(n_l)}_jw_{ji}^{(n_l)}\Big)f'(z_i^{(n_l-1)})\tag{9}$$
Calculation of Other Layers $\delta^{(l)}_{i}$
Using formula $(9)$, the same argument applies layer by layer, so we can replace $n_l-1$ with any layer $l$:
$$\delta_i^{(l)}=\Big(\sum_{j=1}^{S_{l+1}}\delta^{(l+1)}_jw_{ji}^{(l+1)}\Big)f'(z_i^{(l)})\tag{10}$$
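Formula $(10)$ is the backward recursion that gives backpropagation its name. The sum over $j$ is exactly a multiplication by the transpose of $w^{(l+1)}$, so one layer's step can be sketched as (the function name and argument order are illustrative):

```python
import numpy as np

def hidden_delta(w_next, delta_next, z, f_prime):
    """Formula (10): delta_i^(l) = (sum_j w_ji^(l+1) delta_j^(l+1)) * f'(z_i^(l)).
    `w_next` is w^(l+1) with shape (S_{l+1}, S_l), `delta_next` is delta^(l+1),
    `z` is z^(l), and `f_prime` is the derivative of the activation function."""
    return (w_next.T @ delta_next) * f_prime(z)
```

Applying this repeatedly from layer $n_l-1$ down to layer $2$ yields every $\delta^{(l)}$ from the single output-layer $\delta^{(n_l)}$ of formula $(6)$.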
Calculation of the Derivatives with Respect to $w^{(l)}_{ji}$ and $b^{(l)}$
According to formulas $(3)$ and $(4)$, we need to calculate $\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}$:
$$\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}=\frac{\partial \Big(\sum_{k=1}^{S_{l-1}}w_{jk}^{(l)}a_k^{(l-1)}+b_j^{(l)}\Big)}{\partial w^{(l)}_{ji}}=a_i^{(l-1)}\tag{11}$$
$$\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}=\frac{\partial \Big(\sum_{k=1}^{S_{l-1}}w_{jk}^{(l)}a_k^{(l-1)}+b_j^{(l)}\Big)}{\partial b^{(l)}_{j}}=1\tag{12}$$
Substituting formulas $(11)$ and $(12)$ into $(3)$ and $(4)$, together with the definition $(5)$, we can solve $\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial{J(w,b)}}{\partial b^{(l)}_{j}}$:
$$\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}=\delta_j^{(l)}a_i^{(l-1)}\tag{13}$$
$$\frac{\partial{J(w,b)}}{\partial b^{(l)}_{j}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}=\delta_j^{(l)}\cdot1\tag{14}$$
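Putting the output-layer error term, the backward recursion, and the two derivative formulas together gives the whole algorithm for one sample. A sketch under the same assumptions as the earlier snippets (sigmoid activation; `w[l]`, `b[l]` stored in dicts keyed by layer index; names illustrative):

```python
import numpy as np

def f(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, w, b):
    """One full forward + backward pass for a single sample (x, y).
    Returns dJ/dw[l] and dJ/db[l] for every layer l = 2 .. n_l."""
    # Forward pass: z^(l) = w^(l) a^(l-1) + b^(l), a^(l) = f(z^(l)).
    a, z = {1: x}, {}
    n_l = max(w)
    for l in range(2, n_l + 1):
        z[l] = w[l] @ a[l - 1] + b[l]
        a[l] = f(z[l])
    fp = {l: a[l] * (1.0 - a[l]) for l in z}  # sigmoid: f'(z) = f(z)(1 - f(z))
    # Backward pass.
    delta = {n_l: -(y - a[n_l]) * fp[n_l]}                       # output error term
    for l in range(n_l - 1, 1, -1):
        delta[l] = (w[l + 1].T @ delta[l + 1]) * fp[l]           # backward recursion
    grad_w = {l: np.outer(delta[l], a[l - 1]) for l in delta}    # dJ/dw_ji = delta_j * a_i
    grad_b = {l: delta[l] for l in delta}                        # dJ/db_j = delta_j
    return grad_w, grad_b
```

A useful sanity check is to compare one of the returned gradients against a central finite difference of the cost $\frac{1}{2}||h_{w,b}(x)-y||^2$; the two should agree to many decimal places.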