To minimize the cost function, we want small changes in the weights to lead to small changes in the output. We can then exploit this property to adjust the weights step by step, bringing the network ever closer to the behavior we want. This is also one reason why we need a smooth activation function rather than a {0, 1} step function.
1. The intuitive idea
To first order, $\Delta Cost \approx \sum_j \frac{\partial Cost}{\partial w_j} \Delta w_j$, so $\Delta w_j$ should be small enough for this approximation to hold.

We want to decrease the cost, i.e. $\Delta Cost < 0$. Therefore $\Delta w_j$ should have the opposite sign of $\frac{\partial Cost}{\partial w_j}$.

Here, let $\Delta w = -\eta \nabla_w Cost$, where $\eta$ is the learning rate. Choosing $\eta = \epsilon / \|\nabla_w Cost\|$ minimizes $\Delta Cost$ at each update while fixing $\epsilon = \|\Delta w\|$ ($\|\Delta w\|$ is kept small to ensure the approximation holds).
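As a minimal sketch of this update rule (the function names and the toy quadratic cost are my own illustration, not from these notes):

```python
import numpy as np

def gradient_descent_step(w, grad_cost, epsilon=0.01):
    """One update: Delta_w = -eta * grad(Cost), with eta = epsilon / ||grad||
    so that ||Delta_w|| = epsilon stays small (keeping the linear
    approximation of Cost valid)."""
    grad = grad_cost(w)
    eta = epsilon / (np.linalg.norm(grad) + 1e-12)  # avoid division by zero
    return w - eta * grad

# Toy example: Cost(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(5):
    w = gradient_descent_step(w, lambda w: 2 * w)
print(w)  # moves toward the minimum at the origin
```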
2. Back-Propagation
Warm-up:

$\nabla_w Cost$, i.e. $\frac{\partial Cost}{\partial w}$, tells us how quickly the cost changes when we update $w$.
2.1 Introduction to Notations
$w^l_{jk}$: the weight of the connection from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.

$a^l_j$: the activation or output of the $j^{th}$ neuron in the $l^{th}$ layer.

$z^l_j$: the weighted input to the $j^{th}$ neuron in the $l^{th}$ layer.

$$a^l_j = f\Big(\sum_k w^l_{jk}\, a^{l-1}_k\Big)$$
This can be vectorized as:

$$a^l = f(w^l a^{l-1})$$

(In the picture above, $w^3$ is a $2 \times 4$ matrix, whose first row corresponds to the first neuron and second row to the second neuron in layer 3; $a^3$ is a $2 \times 1$ matrix, i.e. a vector.)

For convenience, we also define the weighted input as a vector:

$$z^l = w^l a^{l-1}, \quad \text{so that} \quad a^l = f(z^l)$$
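A NumPy sketch of this vectorized forward step (the 4-to-2 layer shapes mirror the $2 \times 4$ example above; `sigmoid` is one possible choice of smooth $f$, and biases are omitted to match the equations in these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shapes matching the text: layer l-1 has 4 neurons, layer l has 2,
# so w^l is 2x4 and a^{l-1} is 4x1.
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 4))       # w^l
a_prev = rng.standard_normal((4, 1))  # a^{l-1}

z = w @ a_prev   # z^l = w^l a^{l-1}
a = sigmoid(z)   # a^l = f(z^l), a 2x1 vector
```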
2.2 The Fundamental Equations
To begin, we define the error in a neuron as:

$$\delta^l_j \equiv \frac{\partial Cost}{\partial z^l_j}$$

This definition makes sense. Suppose we add a little change $\Delta z^l_j$ to the weighted input $z^l_j$ of a neuron. Its output becomes $f(z^l_j + \Delta z^l_j)$ instead of $f(z^l_j)$. The change propagates through the later layers, and the overall change in the cost is approximately $\frac{\partial Cost}{\partial z^l_j} \Delta z^l_j$.

In this case, if $\frac{\partial Cost}{\partial z^l_j}$ is close to zero, no small change to $z^l_j$ can alter the cost much, so we can regard this neuron as near optimal. In this heuristic sense, $\frac{\partial Cost}{\partial z^l_j}$ is a measure of the error in the neuron.
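One can sanity-check this interpretation numerically: perturb a single $z$ by a tiny $\Delta z$ and compare the actual change in the cost against the prediction $\delta \, \Delta z$ (a finite-difference sketch; the one-neuron setup with a quadratic cost is just my illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron feeding a quadratic cost: Cost(z) = 0.5 * (f(z) - y)^2
z, y, dz = 0.3, 1.0, 1e-5
cost = lambda z: 0.5 * (sigmoid(z) - y) ** 2

# Analytic error: delta = dCost/dz = (f(z) - y) * f'(z)
delta = (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))

print(cost(z + dz) - cost(z))  # actual change in the cost
print(delta * dz)              # predicted change, nearly identical
```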
- An equation for the error in the output layer:

$$\delta^L = \frac{\partial Cost}{\partial z^L} \quad (L \text{ denotes the last layer})$$

$$\delta^L = \frac{\partial Cost}{\partial a^L} \frac{\partial a^L}{\partial z^L} = \frac{\partial Cost}{\partial a^L} f'(z^L) = \nabla_{a^L} Cost \odot f'(z^L)$$

($\odot$ denotes the componentwise, or Hadamard, product)

- An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:

$$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot f'(z^l)$$
More specifically, for an individual neuron:

$$\delta^l_j = \frac{\partial Cost}{\partial z^l_j} = \sum_k \frac{\partial Cost}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$

Meanwhile, since $z^{l+1}_k = \sum_j w^{l+1}_{kj} f(z^l_j)$,

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} f'(z^l_j)$$

So

$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k f'(z^l_j)$$

which is the componentwise form of the matrix equation above.
Computing $(w^l)^T \delta^l$ to obtain $\delta^{l-1}$ can be thought of as propagating the error backward through the network. We then take the Hadamard product with $f'(z^{l-1})$, which moves the error backward through the activation function in layer $l-1$.
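Putting the two error equations together, here is a sketch of the backward sweep (assuming the lists `ws` and `zs` came from a forward pass, and a quadratic cost so that $\nabla_{a^L} Cost = a^L - y$; the names and scaffolding are mine, not from these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backward_errors(ws, zs, a_last, y):
    """Return [delta^2, ..., delta^L] given weights ws = [w^2, ..., w^L]
    and weighted inputs zs = [z^2, ..., z^L] from a forward pass."""
    # Output layer: delta^L = grad_a Cost (*) f'(z^L); quadratic cost here.
    delta = (a_last - y) * sigmoid_prime(zs[-1])
    deltas = [delta]
    # Backward: delta^l = ((w^{l+1})^T delta^{l+1}) (*) f'(z^l)
    for w_next, z in zip(reversed(ws[1:]), reversed(zs[:-1])):
        delta = (w_next.T @ delta) * sigmoid_prime(z)
        deltas.insert(0, delta)
    return deltas
```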
- An equation for the rate of change of the cost with respect to any weight in the network:
$$\frac{\partial Cost}{\partial w^l_{jk}} = \frac{\partial Cost}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \, a^{l-1}_k$$

For a less index-heavy version:

$$\frac{\partial Cost}{\partial w} = a_{in} \, \delta_{out}$$

where it is understood that $a_{in}$ is the activation of the neuron feeding into the weight $w$, and $\delta_{out}$ is the error of the neuron that the weight $w$ feeds into.
Note that if $a_{in}$ is small, $\frac{\partial Cost}{\partial w}$ is also small; we say the weight learns slowly, meaning it does not change much during gradient descent. Recall from the graph of the sigmoid function that $\sigma$ becomes very flat at both tails, where its derivative is approximately 0. In this case it is common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly).
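A quick numerical illustration of saturation (the printed values are simply $\sigma'(z)$ at a few points):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_prime(z))
# sigma'(0) = 0.25, but sigma'(10) is about 4.5e-5: a saturated neuron
# contributes almost no gradient, so its incoming weights barely move.
```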
Pseudocode for updating $w$ (where $m$ is the batch size):

$$w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} \big(a^{x,l-1}\big)^T$$
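A runnable sketch of this mini-batch update, reusing `sigmoid` and `backward_errors` from the sketch above (again my own scaffolding under the same no-bias assumption, not the author's code):

```python
import numpy as np

def update_mini_batch(ws, batch, eta):
    """Apply w^l <- w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T
    over a batch of (x, y) pairs of column vectors."""
    m = len(batch)
    grads = [np.zeros_like(w) for w in ws]
    for x, y in batch:
        # Forward pass: record activations a^l and weighted inputs z^l.
        activations, zs = [x], []
        for w in ws:
            zs.append(w @ activations[-1])
            activations.append(sigmoid(zs[-1]))
        # Backward pass: errors delta^l for every layer.
        deltas = backward_errors(ws, zs, activations[-1], y)
        for l, delta in enumerate(deltas):
            grads[l] += delta @ activations[l].T  # delta^l (a^{l-1})^T
    return [w - (eta / m) * g for w, g in zip(ws, grads)]
```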
What's clever about backpropagation is that it enables us to compute all the partial derivatives $\partial Cost / \partial w_j$ simultaneously, using just one forward pass through the network followed by one backward pass.