Neural Networks and Deep Learning (Chapter 1) (Part 6)

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we’ll need is a data set to learn from - a so-called training data set. We’ll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST’s name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States’ National Institute of Standards and Technology. Here’s a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we’ll ask it to recognize images which aren’t in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We’ll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn’t see during training.

We'll use the notation x to denote a training input. It'll be convenient to regard each training input x as a 28×28 = 784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by y = y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x) = (0,0,0,0,0,0,1,0,0,0)^T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
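As a concrete sketch of this representation (using NumPy, with made-up pixel values rather than a real MNIST image), a training input and its desired output might look like this:

```python
import numpy as np

# An illustrative stand-in for one MNIST training example (not real data):
# a 28x28 greyscale image flattened into a 784-dimensional column vector x,
# and the desired output y(x) for the digit 6 as a 10-dimensional column vector.
image = np.random.rand(28, 28)   # pretend scanned digit, grey values in [0, 1]
x = image.reshape(784, 1)        # 28 * 28 = 784-dimensional input vector

digit = 6
y = np.zeros((10, 1))
y[digit] = 1.0                   # y(x) = (0,0,0,0,0,0,1,0,0,0)^T
```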

What we’d like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we’re achieving this goal we define a cost function (sometimes referred to as a loss or objective function; we use the term cost function throughout this book, but you should note the other terminology, since it’s often used in research papers and other discussions of neural networks):

$$C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}$$

Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. Of course, the output a depends on x, w and b, but to keep the notation simple I haven’t explicitly indicated this dependence. The notation ∥v∥ just denotes the usual length function for a vector v. We’ll call C the quadratic cost function; it’s also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that C(w,b) is non-negative, since every term in the sum is non-negative. Furthermore, the cost C(w,b) becomes small, i.e., C(w,b) ≈ 0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x. So our training algorithm has done a good job if it can find weights and biases so that C(w,b) ≈ 0. By contrast, it’s not doing so well when C(w,b) is large - that would mean that y(x) is not close to the output a for a large number of inputs. So the aim of our training algorithm will be to minimize the cost C(w,b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We’ll do that using an algorithm known as gradient descent.
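To make Equation (6) concrete, here is a small sketch of computing the quadratic cost, assuming the desired outputs y(x) and the network's actual outputs a for every training input have already been collected into two lists (which is an assumption for illustration, not how we will actually organize the code later):

```python
import numpy as np

def quadratic_cost(desired_outputs, network_outputs):
    """Compute C(w, b) as in Equation (6): the average over all training
    inputs of (1/2) * || y(x) - a ||^2.  Both arguments are assumed to be
    lists of 10-dimensional NumPy vectors of the same length n."""
    n = len(desired_outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for y, a in zip(desired_outputs, network_outputs)) / (2 * n)
```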

Why introduce the quadratic cost? After all, aren’t we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won’t cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That’s why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn’t this a rather ad hoc choice? Perhaps if we chose a different cost function we’d get a totally different set of minimizing weights and biases? This is a valid concern, and later we’ll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we’ll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w,b). This is a well-posed problem, but it’s got a lot of distracting structure as currently posed - the interpretation of w and b as weights and biases, the σ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we’re going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we’re going to imagine that we’ve simply been given a function of many variables and we want to minimize that function. We’re going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we’ll come back to the specific function we want to minimize for neural networks.

Okay, let’s suppose we’re trying to minimize some function, C(v). This could be any real-valued function of many variables, v = v1, v2, …. Note that I’ve replaced the w and b notation by v to emphasize that this could be any function - we’re not specifically thinking in the neural networks context any more. To minimize C(v) it helps to imagine C as a function of just two variables, which we’ll call v1 and v2:

What we’d like is to find where C achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I’ve perhaps shown slightly too simple a function! A general function, C, may be a complicated function of many variables, and it won’t usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where C is an extremum. With some luck that might work when C is a function of just one or a few variables. But it’ll turn into a nightmare when we have many more variables. And for neural networks we’ll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won’t work!

(After asserting that we’ll gain insight by imagining C as a function of just two variables, I’ve turned around twice in two paragraphs and said, “hey, but what if it’s a function of many more than two variables?” Sorry about that. Please believe me when I say that it really does help to imagine C as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it’s appropriate to use each picture, and when it’s not.)

Okay, so calculus doesn’t work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn’t be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We’d randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C - those derivatives would tell us everything we need to know about the local “shape” of the valley, and therefore how our ball should roll.

Based on what I’ve just written, you might suppose that we’ll be trying to write down Newton’s equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we’re not going to take the ball-rolling analogy quite that seriously - we’re devising an algorithm to minimize C, not developing an accurate simulation of the laws of physics! The ball’s-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let’s simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let’s think about what happens when we move the ball a small amount Δv1 in the v1 direction, and a small amount Δv2 in the v2 direction. Calculus tells us that C changes as follows:
$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}$$
We’re going to find a way of choosing Δv1 and Δv2 so as to make ΔC negative; i.e., we’ll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define Δv to be the vector of changes in v, Δv ≡ (Δv1, Δv2)^T, where T is again the transpose operation, turning row vectors into column vectors. We’ll also define the gradient of C to be the vector of partial derivatives, (∂C/∂v1, ∂C/∂v2)^T. We denote the gradient vector by ∇C, i.e.:
$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}$$
In a moment we’ll rewrite the change ΔC in terms of Δv and the gradient, ∇C. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the ∇C notation for the first time, people sometimes wonder how they should think about the ∇ symbol. What, exactly, does ∇ mean? In fact, it’s perfectly fine to think of ∇C as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, ∇ is just a piece of notational flag-waving, telling you “hey, ∇C is a gradient vector”. There are more advanced points of view where ∇ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won’t need such points of view.
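To see Equations (7) and (8) in action, here is a small numerical sanity check. It uses a made-up example function C(v1, v2) = v1² + 3v2² and an arbitrary small step, so the specific function and numbers are illustrative assumptions, not anything from the text above:

```python
import numpy as np

# An illustrative (assumed) example function and its gradient.
def C(v):
    v1, v2 = v
    return v1**2 + 3 * v2**2

def grad_C(v):
    v1, v2 = v
    return np.array([2 * v1, 6 * v2])   # (∂C/∂v1, ∂C/∂v2), as in Equation (8)

v = np.array([1.0, 2.0])
delta_v = np.array([0.01, -0.02])        # a small move (Δv1, Δv2)

exact_change = C(v + delta_v) - C(v)     # the true ΔC
approx_change = grad_C(v) @ delta_v      # Equation (7): ∇C · Δv

print(exact_change, approx_change)       # the two values should be close
```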

With these definitions, the expression (7) for ΔC can be rewritten as
$$\Delta C \approx \nabla C \cdot \Delta v. \tag{9}$$
This equation helps explain why ∇C is called the gradient vector: ∇C relates changes in v to changes in C, just as we’d expect something called a gradient to do. But what’s really exciting about the equation is that it lets us see how to choose Δv so as to make ΔC negative. In particular, suppose we choose
$$\Delta v = -\eta \nabla C, \tag{10}$$
where η is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that ΔC ≈ −η∇C⋅∇C = −η∥∇C∥². Because ∥∇C∥² ≥ 0, this guarantees that ΔC ≤ 0, i.e., C will always decrease, never increase, if we change v according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we’ll take Equation (10) to define the “law of motion” for the ball in our gradient descent algorithm. That is, we’ll use Equation (10) to compute a value for Δv, then move the ball’s position v by that amount:
$$v \to v' = v - \eta \nabla C. \tag{11}$$
Then we’ll use this update rule again, to make another move. If we keep doing this, over and over, we’ll keep decreasing C until - we hope - we reach a global minimum.
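As a minimal sketch of what repeatedly applying the update rule (11) looks like in code, here is a bare gradient descent loop on the same assumed example function C(v1, v2) = v1² + 3v2² used above; the learning rate, starting point, and number of steps are arbitrary choices for illustration:

```python
import numpy as np

def grad_C(v):
    """Gradient of the example function C(v1, v2) = v1**2 + 3*v2**2."""
    return np.array([2 * v[0], 6 * v[1]])

eta = 0.05                       # learning rate (an arbitrary small choice)
v = np.array([1.0, 2.0])         # starting position for the imaginary "ball"

for step in range(200):
    v = v - eta * grad_C(v)      # Equation (11): v -> v' = v - eta * grad C

print(v)                         # should now be very close to the minimum at (0, 0)
```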

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient ∇C, and then to move in the opposite direction, “falling down” the slope of the valley. We can visualize it like this:

Notice that with this rule gradient descent doesn’t reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It’s only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing Δv just says “go down, right now”. That’s still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate η to be small enough that Equation (9) is a good approximation. If we don’t, we might end up with ΔC > 0, which obviously would not be good! At the same time, we don’t want η to be too small, since that will make the changes Δv tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, η is often varied so that Equation (9) remains a good approximation, but the algorithm isn’t too slow. We’ll see later how this works.
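To see the tradeoff concretely, here is a tiny sketch (an assumed toy example, not from the text) comparing a modest learning rate with an overly large one on the one-variable function C(v) = v²; with the large rate the steps overshoot the minimum and C grows instead of shrinking:

```python
def descend(eta, steps=10, v=1.0):
    """Run gradient descent on C(v) = v**2, whose gradient is 2*v."""
    for _ in range(steps):
        v = v - eta * 2 * v
    return v

print(descend(eta=0.1))   # converges towards the minimum at v = 0
print(descend(eta=1.5))   # too large: |v| grows every step, so C increases
```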

I’ve explained gradient descent when C is a function of just two variables. But, in fact, everything works just as well even when C is a function of many more variables. Suppose in particular that C is a function of m variables, v1, …, vm. Then the change ΔC in C produced by a small change Δv = (Δv1, …, Δvm)^T is
$$\Delta C \approx \nabla C \cdot \Delta v, \tag{12}$$
where the gradient ∇C is the vector
$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}$$
Just as for the two variable case, we can choose
$$\Delta v = -\eta \nabla C, \tag{14}$$
and we’re guaranteed that our (approximate) expression (12) for ΔC will be negative. This gives us a way of following the gradient to a minimum, even when C is a function of many variables, by repeatedly applying the update rule
$$v \to v' = v - \eta \nabla C. \tag{15}$$
You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position v in order to find a minimum of the function C. The rule doesn’t always work - several things can go wrong and prevent gradient descent from finding the global minimum of C, a point we’ll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we’ll find that it’s a powerful way of minimizing the cost function, and so helping the net learn.
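Nothing about the loop changes when there are many variables; the update rule (15) is simply applied to every component at once. Here is a sketch on an assumed example cost with a thousand variables, C(v) = Σᵢ cᵢ vᵢ², where the coefficients, starting point, and learning rate are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                               # number of variables (arbitrary)
c = rng.uniform(0.5, 2.0, size=m)      # assumed example: C(v) = sum_i c_i * v_i**2

v = rng.normal(size=m)                 # random starting point
eta = 0.1

for _ in range(500):
    grad = 2 * c * v                   # ∂C/∂v_i = 2 * c_i * v_i
    v = v - eta * grad                 # Equation (15), applied to all m components at once

print(np.sum(c * v**2))                # the cost, which should now be very small
```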

Indeed, there’s even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let’s suppose that we’re trying to make a move Δv in position so as to decrease C as much as possible. This is equivalent to minimizing ΔC ≈ ∇C⋅Δv. We’ll constrain the size of the move so that ∥Δv∥ = ε for some small fixed ε > 0. In other words, we want a move that is a small step of a fixed size, and we’re trying to find the movement direction which decreases C as much as possible. It can be proved that the choice of Δv which minimizes ∇C⋅Δv is Δv = −η∇C, where η = ε/∥∇C∥ is determined by the size constraint ∥Δv∥ = ε. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C.

Exercises

Prove the assertion of the last paragraph. Hint: If you’re not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.
I explained gradient descent when C is a function of two variables, and when it’s a function of more than two variables. What happens when C is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of C, and this can be quite costly. To see why it’s costly, suppose we want to compute all the second partial derivatives ∂²C/∂vj∂vk. If there are a million such vj variables then we’d need to compute something like a trillion (i.e., a million squared) second partial derivatives (actually, more like half a trillion, since ∂²C/∂vj∂vk = ∂²C/∂vk∂vj - still, you get the point)! That’s going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we’ll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights wk and biases bl which minimize the cost in Equation (6). To see how this works, let’s restate the gradient descent update rule, with the weights and biases replacing the variables vj. In other words, our “position” now has components wk and bl, and the gradient vector ∇C has corresponding components ∂C/∂wk and ∂C/∂bl. Writing out the gradient descent update rule in terms of components, we have
$$w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k}, \tag{16}$$
$$b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}$$
By repeatedly applying this update rule we can “roll down the hill”, and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.
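As a minimal sketch of what the update rules (16) and (17) could look like in code, suppose the network's weights and biases are stored as lists of NumPy arrays, one per layer; the gradient arrays are assumed to come from elsewhere (for example, the backpropagation procedure described later in the book), so this is only the parameter-update step, not a full learning algorithm:

```python
def update_parameters(weights, biases, grad_w, grad_b, eta):
    """Apply Equations (16) and (17): move every weight and bias a small
    step against its partial derivative.  `weights` and `biases` are lists
    of NumPy arrays; `grad_w` and `grad_b` are matching lists holding
    ∂C/∂w and ∂C/∂b (assumed to be computed elsewhere)."""
    new_weights = [w - eta * gw for w, gw in zip(weights, grad_w)]
    new_biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
    return new_weights, new_biases
```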

There are a number of challenges in applying the gradient descent rule. We’ll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let’s look back at the quadratic cost in Equation (6). Notice that this cost function has the form C = (1/n)∑x Cx, that is, it’s an average over costs Cx ≡ ∥y(x)−a∥²/2 for individual training examples. In practice, to compute the gradient ∇C we need to compute the gradients ∇Cx separately for each training input, x, and then average them, ∇C = (1/n)∑x ∇Cx. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient ∇C by computing ∇Cx for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number m of randomly chosen training inputs. We’ll label those random training inputs X1, X2, …, Xm, and refer to them as a mini-batch. Provided the sample size m is large enough we expect that the average value of the ∇CXj will be roughly equal to the average over all ∇Cx, that is,
$$\frac{\sum_{j=1}^{m} \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}$$
where the second sum is over the entire set of training data. Swapping sides we get
$$\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j}, \tag{19}$$
confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
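Here is a small sketch of the estimate in Equation (19): draw a random mini-batch of indices, compute the per-example gradients, and average them. The function `grad_Cx` is a hypothetical stand-in for whatever routine computes ∇Cx, and the data shapes are assumptions for illustration:

```python
import numpy as np

def estimate_gradient(grad_Cx, training_inputs, m, rng):
    """Estimate ∇C as in Equation (19) by averaging ∇C_x over a random
    mini-batch of m training inputs.  `grad_Cx(x)` is a stand-in for the
    routine that computes the gradient for a single training example."""
    batch = rng.choice(len(training_inputs), size=m, replace=False)
    grads = [grad_Cx(training_inputs[j]) for j in batch]
    return sum(grads) / m
```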

To connect this explicitly to learning in neural networks, suppose wk and bl denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,
$$w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k}, \tag{20}$$
$$b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}$$
where the sums are over all the training examples Xj in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we’ve exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
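To show how mini-batches and epochs fit together, here is a skeleton of the outer training loop: each epoch shuffles the data, slices it into mini-batches, and applies the updates (20) and (21) once per mini-batch. The `update_mini_batch` argument is a hypothetical helper standing in for the per-batch update code, which we haven't written yet:

```python
import random

def sgd(training_data, epochs, mini_batch_size, eta, update_mini_batch):
    """Skeleton of stochastic gradient descent.  `training_data` is a list
    of training examples; `update_mini_batch(mini_batch, eta)` is assumed
    to apply the update rules (20)-(21) using that mini-batch."""
    n = len(training_data)
    for epoch in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k + mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            update_mini_batch(mini_batch, eta)
```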

Incidentally, it’s worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor 1/n. People sometimes omit the 1/n, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn’t known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the 1/m term out the front of the sums. Conceptually this makes little difference, since it’s equivalent to rescaling the learning rate η. But when doing detailed comparisons of different work it’s worth watching out for.

We can think of stochastic gradient descent as being like political polling: it’s much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size n = 60,000, as in MNIST, and choose a mini-batch size of (say) m = 10, this means we’ll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won’t be perfect - there will be statistical fluctuations - but it doesn’t need to be perfect: all we really care about is moving in a general direction that will help decrease C, and that means we don’t need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it’s the basis for most of the learning techniques we’ll develop in this book.

Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, x, we update our weights and biases according to the rules wk → wk' = wk − η∂Cx/∂wk and bl → bl' = bl − η∂Cx/∂bl. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost C is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: “Hey, I have to be able to visualize all these extra dimensions”. And they may start to worry: “I can’t think in four dimensions, let alone five (or five million)”. Is there some special ability they’re missing, some ability that “real” supermathematicians have? Of course, the answer is no. Even most professional mathematicians can’t visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what’s going on. That’s exactly what we did above: we used an algebraic (rather than visual) representation of ΔC to figure out how to move so as to decrease C. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we’re accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won’t go into more detail here, but if you’re interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.
