How the backpropagation algorithm works

I’ll explain a fast algorithm for computing the gradients of the cost function, an algorithm known as backpropagation.

Warm up: a fast matrix-based approach to computing the output from a neural network

We’ll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. With these notations, the activation $a^l_j$ of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation

$$a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right), \tag{1}$$

where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer.

To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons, that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector, $b^l$. You can probably guess how this works - the components of the bias vector are just the values $b^l_j$, one component for each neuron in the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$. With these notations in mind, Equation (1) can be rewritten in the beautiful and compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{2}$$

Defining $z^l \equiv w^l a^{l-1} + b^l$, we call $z^l$ the weighted input to the neurons in layer $l$.
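
To make the vectorized form (2) concrete, here is a minimal NumPy sketch of the feedforward computation; the layer sizes and the names weights and biases are made up for illustration and are not part of the original text.

import numpy as np

def sigmoid(z):
    """The sigmoid function, applied elementwise."""
    return 1.0/(1.0+np.exp(-z))

# Hypothetical network with layer sizes [3, 4, 2].  weights[l] has shape
# (neurons in layer l+1, neurons in layer l), matching the w^l_{jk} convention.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def feedforward(a):
    """Apply a^l = sigma(w^l a^{l-1} + b^l) layer by layer, as in Equation (2).
    The input a is a column vector of activations for the first layer."""
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a)+b)
    return a

print(feedforward(np.ones((3, 1))))  # a 2x1 vector of output activations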

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network: $C = C(a^L)$.
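
As a worked example (my illustration, not spelled out above), the quadratic cost satisfies both assumptions:

$$C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \qquad C_x = \frac{1}{2} \|y(x) - a^L\|^2.$$

It is an average over per-example costs $C_x$, and each $C_x$ depends on the network only through the output activations $a^L$; the desired output $y(x)$ is a fixed label, not something the network produces.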

The Hadamard product, s⊙t

We use $s \odot t$ to denote the elementwise product of two vectors $s$ and $t$. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$.
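
In NumPy the Hadamard product is just the ordinary * operator on arrays; a quick (made-up) illustration:

import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])
print(s * t)  # elementwise product: [ 4 10 18]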

The four fundamental equations behind backpropagation

We define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{3}$$

An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j), \tag{BP1}$$

or, in matrix-based form,

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}$$
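
For instance, with the quadratic cost used as the example above, $\nabla_a C = a^L - y$, so (BP1a) becomes

$$\delta^L = (a^L - y) \odot \sigma'(z^L),$$

which is exactly what the cost_derivative and sigmoid_prime calls compute in the code listing later in this section.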

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l). \tag{BP2}$$

By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network: we first use (BP1) to compute $\delta^L$, then apply (BP2) repeatedly to work backward through the layers.

An equation for the rate of change of the cost with respect to any bias in the network: In particular:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}$$

An equation for the rate of change of the cost with respect to any weight in the network: In particular:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4}$$
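
In vectorized form (a restatement of (BP3) and (BP4), not a new result), the bias gradient for layer $l$ is just $\delta^l$ itself, and the weight gradient is the outer product

$$\frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T,$$

which is the np.dot(delta, activations[-l-1].transpose()) expression that appears in the backprop code below.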

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:

  1. Input x: Set the corresponding activation $a^1$ for the input layer.
  2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
  3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
  4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
  5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

Examining the algorithm you can see why it’s called backpropagation: we compute the error vectors $\delta^l$ backward, starting from the final layer.

As I’ve described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it’s common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples.
  2. For each training example x: Set the corresponding input activation $a^{x,1}$, and perform the following steps:
    • Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
    • Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
    • Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
  3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.

The code for backpropagation

import numpy as np

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y) 

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
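
For orientation, here is a hypothetical usage sketch. It assumes the rest of the book's Network class (in particular a constructor taking a list of layer sizes, as in the full network.py), which is not reproduced in the excerpt above.

import numpy as np

# Hypothetical: Network([...]) builds random weights and biases as in network.py.
net = Network([784, 30, 10])

# A fake mini-batch of 10 examples: inputs are 784x1 column vectors and
# desired outputs are 10x1 one-hot column vectors.
rng = np.random.default_rng(0)
mini_batch = []
for _ in range(10):
    x = rng.random((784, 1))
    y = np.zeros((10, 1))
    y[rng.integers(10)] = 1.0
    mini_batch.append((x, y))

# One gradient-descent step with learning rate eta = 3.0.
net.update_mini_batch(mini_batch, eta=3.0)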

In what sense is backpropagation a fast algorithm?

Let’s consider another approach to computing the gradient:

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}, \tag{4}$$

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction. In other words, we can estimate $\partial C/\partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (4). The same idea will let us compute the partial derivatives $\partial C/\partial b$ with respect to the biases.

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C/\partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that’s a total of a million and one passes through the network.

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (4)! And so even though backpropagation appears superficially more complex than the approach based on (4), it’s actually much, much faster.
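
Although the approach based on (4) is far too slow for training, it is still useful as a sanity check on a backpropagation implementation. Here is a minimal sketch under made-up assumptions: a single-layer network with a quadratic cost, comparing the analytic gradient given by (BP1) and (BP4) against the finite-difference estimate from Equation (4).

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

# A made-up single-layer network and quadratic cost, just for the comparison.
rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))   # input activation
y = rng.standard_normal((2, 1))   # desired output
w = rng.standard_normal((2, 3))   # weights
b = rng.standard_normal((2, 1))   # biases

def cost(w):
    a = sigmoid(np.dot(w, x)+b)
    return 0.5*np.sum((a-y)**2)

# Analytic gradient from (BP1) and (BP4): delta = (a - y) ⊙ σ'(z), dC/dw = delta x^T.
z = np.dot(w, x)+b
delta = (sigmoid(z)-y)*sigmoid_prime(z)
grad_backprop = np.dot(delta, x.transpose())

# Finite-difference estimate (4): one extra cost evaluation per weight, plus C(w).
eps = 1e-6
grad_numeric = np.zeros_like(w)
base = cost(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        w_eps = w.copy()
        w_eps[j, k] += eps
        grad_numeric[j, k] = (cost(w_eps)-base)/eps

print(np.max(np.abs(grad_backprop-grad_numeric)))  # should be tiny, around 1e-6 or smaller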

Backpropagation: the big picture
