How the backpropagation algorithm works

I’ll explain a fast algorithm for computing the gradients of the cost function, an algorithm known as backpropagation.

Warm up: a fast matrix-based approach to computing the output from a neural network

We’ll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. With these notations, the activation $a^l_j$ of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation

$$a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right), \tag{1}$$

where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer.

To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons, that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector, $b^l$. You can probably guess how this works - the components of the bias vector are just the values $b^l_j$, one component for each neuron in the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$. With these notations in mind, Equation (1) can be rewritten in the beautiful and compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{2}$$

Defining $z^l \equiv w^l a^{l-1} + b^l$, we call $z^l$ the weighted input to the neurons in layer $l$.
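
To make the vectorized form (2) concrete, here is a minimal NumPy sketch of the feedforward computation; the layer sizes and the names weights and biases are made up for illustration and are not part of the original text.

import numpy as np

def sigmoid(z):
    """The sigmoid function, applied elementwise."""
    return 1.0/(1.0+np.exp(-z))

# Hypothetical network with layer sizes [3, 4, 2].  weights[l] has shape
# (neurons in layer l+1, neurons in layer l), matching the w^l_{jk} convention.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def feedforward(a):
    """Apply a^l = sigma(w^l a^{l-1} + b^l) layer by layer, as in Equation (2).
    The input a is a column vector of activations for the first layer."""
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a)+b)
    return a

print(feedforward(np.ones((3, 1))))  # a 2x1 vector of output activations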

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network: $C = C(a^L)$.
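
As a worked example (my illustration, not spelled out above), the quadratic cost satisfies both assumptions:

$$C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \qquad C_x = \frac{1}{2} \|y(x) - a^L\|^2.$$

It is an average over per-example costs $C_x$, and each $C_x$ depends on the network only through the output activations $a^L$; the desired output $y(x)$ is a fixed label, not something the network produces.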

The Hadamard product, s⊙t

We use $s \odot t$ to denote the elementwise product of two vectors $s$ and $t$. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$.
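
In NumPy the Hadamard product is just the ordinary * operator on arrays; a quick (made-up) illustration:

import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])
print(s * t)  # elementwise product: [ 4 10 18]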

The four fundamental equations behind backpropagation

We define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{3}$$

An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j), \tag{BP1}$$

or, in matrix-based form,

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}$$
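
For instance, with the quadratic cost used as the example above, $\nabla_a C = a^L - y$, so (BP1a) becomes

$$\delta^L = (a^L - y) \odot \sigma'(z^L),$$

which is exactly what the cost_derivative and sigmoid_prime calls compute in the code listing later in this section.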

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l). \tag{BP2}$$

By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network: we first use (BP1) to compute $\delta^L$, then apply (BP2) repeatedly to work backward through the layers.

An equation for the rate of change of the cost with respect to any bias in the network: In particular:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}$$

An equation for the rate of change of the cost with respect to any weight in the network: In particular:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4}$$
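
In vectorized form (a restatement of (BP3) and (BP4), not a new result), the bias gradient for layer $l$ is just $\delta^l$ itself, and the weight gradient is the outer product

$$\frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T,$$

which is the np.dot(delta, activations[-l-1].transpose()) expression that appears in the backprop code below.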

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:

  1. Input x: Set the corresponding activation $a^1$ for the input layer.
  2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
  3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
  4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
  5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

Examining the algorithm you can see why it’s called backpropagation: we compute the error vectors $\delta^l$ backward, starting from the final layer.

As I’ve described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it’s common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples.
  2. For each training example x: Set the corresponding input activation $a^{x,1}$, and perform the following steps:
    • Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
    • Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
    • Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
  3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.

The code for backpropagation

import numpy as np

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y) 

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
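
For orientation, here is a hypothetical usage sketch. It assumes the rest of the book's Network class (in particular a constructor taking a list of layer sizes, as in the full network.py), which is not reproduced in the excerpt above.

import numpy as np

# Hypothetical: Network([...]) builds random weights and biases as in network.py.
net = Network([784, 30, 10])

# A fake mini-batch of 10 examples: inputs are 784x1 column vectors and
# desired outputs are 10x1 one-hot column vectors.
rng = np.random.default_rng(0)
mini_batch = []
for _ in range(10):
    x = rng.random((784, 1))
    y = np.zeros((10, 1))
    y[rng.integers(10)] = 1.0
    mini_batch.append((x, y))

# One gradient-descent step with learning rate eta = 3.0.
net.update_mini_batch(mini_batch, eta=3.0)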

In what sense is backpropagation a fast algorithm?

Let’s consider another approach to computing the gradient:

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}, \tag{4}$$

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction. In other words, we can estimate $\partial C/\partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (4). The same idea will let us compute the partial derivatives $\partial C/\partial b$ with respect to the biases.

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C/\partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that’s a total of a million and one passes through the network.

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (4)! And so even though backpropagation appears superficially more complex than the approach based on (4), it’s actually much, much faster.
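
Although the approach based on (4) is far too slow for training, it is still useful as a sanity check on a backpropagation implementation. Here is a minimal sketch under made-up assumptions: a single-layer network with a quadratic cost, comparing the analytic gradient given by (BP1) and (BP4) against the finite-difference estimate from Equation (4).

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

# A made-up single-layer network and quadratic cost, just for the comparison.
rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))   # input activation
y = rng.standard_normal((2, 1))   # desired output
w = rng.standard_normal((2, 3))   # weights
b = rng.standard_normal((2, 1))   # biases

def cost(w):
    a = sigmoid(np.dot(w, x)+b)
    return 0.5*np.sum((a-y)**2)

# Analytic gradient from (BP1) and (BP4): delta = (a - y) ⊙ σ'(z), dC/dw = delta x^T.
z = np.dot(w, x)+b
delta = (sigmoid(z)-y)*sigmoid_prime(z)
grad_backprop = np.dot(delta, x.transpose())

# Finite-difference estimate (4): one extra cost evaluation per weight, plus C(w).
eps = 1e-6
grad_numeric = np.zeros_like(w)
base = cost(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        w_eps = w.copy()
        w_eps[j, k] += eps
        grad_numeric[j, k] = (cost(w_eps)-base)/eps

print(np.max(np.abs(grad_backprop-grad_numeric)))  # should be tiny, around 1e-6 or smaller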

Backpropagation: the big picture
