Neural Networks and Deep Learning - 神经网络与深度学习 - Overfitting and regularization - 过拟合和正则化
Neural Networks and Deep Learning
http://neuralnetworksanddeeplearning.com/index.html
神经网络与深度学习
https://legacy.gitbook.com/book/hit-scir/neural-networks-and-deep-learning-zh_cn/details
https://hit-scir.gitbooks.io/neural-networks-and-deep-learning-zh_cn/content/
1. Overfitting and regularization - 过拟合和正则化
The Nobel prizewinning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. “Four” was the answer. Fermi replied: “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
诺贝尔物理学奖获得者 Enrico Fermi 曾被问及对一个数学模型的看法,那是一些同事提出的、用来解决某个悬而未决的重要物理问题的模型。虽然模型与实验结果契合得很好,但 Fermi 还是有所怀疑。他问模型中有多少个可以自由设置的参数,得到的答案是 4 个。Fermi 回复说:“我记得我的朋友 Johnny von Neumann 常说,用四个参数我可以拟合一头大象,用五个参数我可以让它摇动鼻子。”
Nobel [nəʊˈbel]:n. 诺贝尔
prizewinning [ˈpraɪzwɪnɪŋ]:adj. 有得奖可能的,已获奖的
physicist [ˈfɪzɪsɪst]:n. 物理学家,唯物论者
mathematical [ˌmæθəˈmætɪkl]:adj. 数学的,数学上的,精确的
colleague [ˈkɒliːɡ]:n. 同事,同僚
skeptical ['skeptɪkəl]:adj. 怀疑的,(哲学) 怀疑论的,不可知论的
wiggle [ˈwɪɡl]:v. 使摆动,使扭动 n. 摆动,扭动
trunk [trʌŋk]:n. 树干,躯干,象鼻,汽车车尾的行李箱 vt. 把...放入旅行箱内 adj. 干线的,躯干的,箱子的
flaw [flɔː]:n. 瑕疵,裂纹,一阵狂风,短暂的风暴 v. (使) 有裂纹,(使) (证件等) 因有缺陷而失效,暴露缺点
The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.
这段引文来自 Freeman Dyson 的一篇精彩文章 A meeting with Enrico Fermi,他正是提出那个有缺陷模型的人之一。四参数大象的例子可以在这里找到。
The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that doesn’t make it a good model. It may just mean there’s enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn’t been exposed to before.
Fermi 的这个观点是在表明:有大量自由参数的模型能够描述一个足够宽泛的现象。即使这样的模型与现有的数据吻合得很好,这也不能说它是一个好的模型。这仅仅只能说明,有足够自由度的模型基本上可以描述任何给定大小的数据集,但是它并没有真正地洞察到现象背后的本质。这就会导致该模型在现有的数据上表现得很好,但是却不能普及到新的情况上。判断一个模型真正好坏的方法,是看其对未知情况的预测能力。
phenomena [fə'nɒmɪnə]:n. 现象 (phenomenon 的复数)
genuine [ˈdʒenjuɪn]:adj. 真实的,真正的,诚恳的
phenomenon [fəˈnɒmɪnən]: n. 现象,奇迹,杰出的人才
suspicious [səˈspɪʃəs]:adj. 可疑的,怀疑的,多疑的
Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That’s a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Fermi 和 von Neumann 对有四个参数的模型都抱有怀疑的态度。然而我们用来分类 MNIST 数字的 30 个隐藏神经元网络就有将近 24000 个参数!这个数量非常庞大。我们的 100 个隐藏神经元网络有接近 80000 个参数,而最先进的深度神经网络有时包含数百万,甚至数十亿个参数。我们应该相信这些结果吗?
Let’s sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We’ll use our 30 hidden neuron network, with its 23,860 parameters. But we won’t train the network using all 50,000 MNIST training images. Instead, we’ll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We’ll train in a similar way to before, using the cross-entropy cost function, with a learning rate of $\eta = 0.5$ and a mini-batch size of 10. However, we’ll train for 400 epochs, a somewhat larger number than before, because we’re not using as many training examples. Let’s use network2 to look at the way the cost function changes:
下面让我们构造一种情况来突出泛化问题的严重性,即让我们的网络不能很好地泛化到新数据上。我们仍使用含有 23860 个参数、有 30 个隐藏神经元的网络,但不用全部 50000 张 MNIST 训练图像,而仅仅使用前 1000 张。用这个较小的训练集能让泛化问题更加明显。我们依然用交叉熵代价函数来训练,学习率为 $\eta = 0.5$,mini-batch 的大小为 10。不过这次我们将训练 400 个 epoch,比之前多一些,因为我们使用的训练样例较少。让我们用 network2 看一下代价函数的变化趋势:
sharpen [ˈʃɑːpən]:vt. 削尖,磨快,使敏捷,加重 vi. 尖锐,变锋利
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True, monitor_training_cost=True)
Using the results we can plot the way the cost changes as the network learns (This and the next four graphs were generated by the program overfitting.py.):
根据结果,我们能绘制网络学习成本的变化趋势 (本图和接下来四幅图都是通过运行 overfitting.py 产生的。):
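For readers who want to reproduce a plot like this, here is a minimal sketch in the spirit of overfitting.py. It assumes that network2.Network.SGD returns the four monitored lists in the order (evaluation_cost, evaluation_accuracy, training_cost, training_accuracy), as in the book's network2.py, and that matplotlib is installed; treat it as an illustration rather than the exact plotting code behind the figures.

import matplotlib.pyplot as plt
import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
# Train on the first 1,000 images and record the training cost per epoch.
_, _, training_cost, _ = net.SGD(
    training_data[:1000], 400, 10, 0.5,
    evaluation_data=test_data,
    monitor_evaluation_accuracy=True,
    monitor_training_cost=True)

# Zoom in on epochs 200-399, as in the figure described above.
plt.plot(range(200, 400), training_cost[200:], color="#2A6EA6")
plt.xlabel("Epoch")
plt.ylabel("Cost on the training data")
plt.show()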
This looks encouraging, showing a smooth decrease in the cost, just as we expect. Note that I’ve only shown training epochs 200 through 399. This gives us a nice up-close view of the later stages of learning, which, as we’ll see, turns out to be where the interesting action is.
正如我们期望的那样,代价在平滑地下降。请注意,图中仅仅展示了第 200 到第 399 个 epoch。这让我们能够近距离观察学习的后期阶段,而我们稍后会看到,这正是有趣之处所在。
encourage [ɪnˈkʌrɪdʒ]:vt. 鼓励,怂恿,激励,支持
Let’s now look at how the classification accuracy on the test data changes over time:
现在让我们来看看测试数据上的分类准确率随着时间是怎样变化的:
Again, I’ve zoomed in quite a bit. In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop. If we just look at that cost, it appears that our model is still getting “better”. But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it’s not useful learning. We say the network is overfitting or overtraining beyond epoch 280.
同样,我选取了图表的一部分。在前 200 epochs (未显示) 准确率上升到接近 82%。然后学习的效果就逐渐放缓。最后,在 epoch 280 附近,分类准确率几乎停止改善。之后的学习仅仅在 epoch 280 就达到的准确率附近有一些小的随机波动。与上一个图表对比,我们会发现,训练数据的代价函数值是持续下降的。如果我们只关注代价函数,模型似乎一直在“改进”。但测试精度结果表明:此时的改进只是一种错觉。正如 Fermi 所不喜欢的模型那样,在 epoch 280 之后,我们网络的学习不能够再很好地推广到测试数据上。所以此时的学习是无用的。在 epoch 280 之后,我们称此时的网络是过拟合 (overfitting) 或过训练 (overtraining) 的。
gradually [ˈɡrædʒuəli]:adv. 逐步地,渐渐地
stochastic [stɒ'kæstɪk]:adj. 随机的,猜测的
fluctuation [ˌflʌktʃuˈeɪʃn]:n. 起伏,波动
illusion [ɪˈluːʒn]:n. 幻觉,错觉,错误的观念或信仰
You might wonder if the problem here is that I’m looking at the cost on the training data, as opposed to the classification accuracy on the test data. In other words, maybe the problem is that we’re making an apples and oranges comparison. What would happen if we compared the cost on the training data with the cost on the test data, so we’re comparing similar measures? Or perhaps we could compare the classification accuracy on both the training data and the test data? In fact, essentially the same phenomenon shows up no matter how we do the comparison. The details do change, however. For instance, let’s look at the cost on the test data:
你也许会想,这里的问题是否出在我比较的是训练数据上的代价与测试数据上的分类准确率,也就是说,这是一种风马牛不相及的比较。如果我们比较训练数据的代价和测试数据的代价,即比较同类的指标,会发生什么呢?或者我们比较训练数据和测试数据上的分类精度又会怎样?事实上,无论我们怎样作比较,本质上都会出现同样的现象,不过细节确实会有所不同。例如,让我们来看看测试数据上的代价变化:
We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better. This is another sign that our model is overfitting. It poses a puzzle, though, which is whether we should regard epoch 15 or epoch 280 as the point at which overfitting is coming to dominate learning? From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test data is no more than a proxy for classification accuracy. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.
从图中可以看到,测试数据上的代价在 epoch 15 之前一直在降低,但之后它反而开始变差,尽管训练数据上的代价还在持续降低。这是另外一个表明我们的模型过拟合的迹象。不过这也带来一个难题:我们应该把 epoch 15 还是 epoch 280 视为过拟合开始主导学习的时间点?从实际应用的角度来看,我们真正关心的是提高测试数据上的分类精度,而测试数据上的代价只不过是分类精度的一个替代指标。因此,把 epoch 280 视为过拟合开始主导我们神经网络学习的转折点最为合理。
puzzle ['pʌz(ə)l]:n. 谜,疑问,智力游戏,不解之谜 v. 迷惑,使困惑
dominate ['dɒmɪneɪt]:v. 控制,支配,左右,影响
practical [ˈpræktɪkl]:adj. 实际的,实用性的
Another sign of overfitting may be seen in the classification accuracy on the training data:
在训练数据的分类精度中也可以看到过拟合的迹象:
The accuracy rises all the way up to 100 percent. That is, our network correctly classifies all 1,000 training images! Meanwhile, our test accuracy tops out at just 82.27 percent. So our network really is learning about peculiarities of the training set, not just recognizing digits in general. It’s almost as though our network is merely memorizing the training set, without understanding digits well enough to generalize to the test set.
准确率一直上升到 100%。也就是说,网络能正确分类全部 1000 张训练图像!而与此同时,测试准确率最高只有 82.27%。所以我们的网络实际上只是在学习训练集的特性,而不是在学习普遍意义上的数字识别。就好像网络仅仅是在记忆训练集,而没有对数字形成足够的理解,因而无法推广到测试集上。
peculiarity [pɪˌkjuːliˈærəti]:n. 特性,特质,怪癖,奇特
Overfitting is a major problem in neural networks. This is especially true in modern networks, which often have very large numbers of weights and biases. To train effectively, we need a way of detecting when overfitting is going on, so we don’t overtrain. And we’d like to have techniques for reducing the effects of overfitting.
过拟合是神经网络中一个主要的问题。尤其是在含有大量权重和偏差参数的现代网络中。为了更有效地训练,我们需要一种能够检测过拟合发生时间的方法,这样就不会发生过度训练。此外,我们也需要一种能减少过拟合影响的技术。
The obvious way to detect overfitting is to use the approach above, keeping track of accuracy on the test data as our network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. Of course, strictly speaking, this is not necessarily a sign of overfitting. It might be that accuracy on the test data and the training data both stop improving at the same time. Still, adopting this strategy will prevent overfitting.
检测过拟合一个显而易见的方法就是使用上面的方式,在网络训练过程中跟踪测试数据上的准确率。如果测试数据上的精度不再提高,就应该停止训练。当然,严格地说,这不一定就是过拟合的迹象,也可能是测试数据和训练数据上的精度恰好同时停止提升。尽管如此,采用这个策略还是能防止过拟合。
In fact, we’ll use a variation on this strategy. Recall that when we load in the MNIST data we load in three data sets:
事实上,我们将采用这种策略的一个变种。回想一下,我们曾载入了三个 MNIST 数据集:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
Up to now we’ve been using the training_data and test_data, and ignoring the validation_data. The validation_data contains 10,000 images of digits, images which are different from the 50,000 images in the MNIST training set, and the 10,000 images in the MNIST test set. Instead of using the test_data to prevent overfitting, we will use the validation_data. To do this, we’ll use much the same strategy as was described above for the test_data. That is, we’ll compute the classification accuracy on the validation_data at the end of each epoch. Once the classification accuracy on the validation_data has saturated, we stop training. This strategy is called early stopping. Of course, in practice we won’t immediately know when the accuracy has saturated. Instead, we continue training until we’re confident that the accuracy has saturated.
到目前为止,我们一直在使用 training_data 和 test_data,忽视了 validation_data。validation_data 包含 10000 张数字图像,它们既不同于 MNIST 训练集中的 50000 张图像,也不同于 MNIST 测试集中的 10000 张图像。我们将使用 validation_data 而不是 test_data 来防止过拟合。为此,我们将使用与上面针对 test_data 所描述的大致相同的策略,也就是说,在每个 epoch 结束时计算 validation_data 上的分类精度。一旦 validation_data 上的分类精度达到饱和,就停止训练。这种策略叫做提前终止 (early stopping)。当然在实践中,我们并不能立即知道精度什么时候已经饱和,而是会继续训练,直到确信精度已经饱和为止。
saturate [ˈsætʃəreɪt]:vt. 浸透,使湿透,使饱和,使充满 adj. 浸透的,饱和的,深颜色的
pessimistic [ˌpesɪˈmɪstɪk]:adj. 悲观的,厌世的,悲观主义的
plateau [ˈplætəʊ]:n. 高原,稳定水平,托盘,平顶女帽 vi. 达到平衡,达到稳定时期
magnitude [ˈmæɡnɪtjuːd]:n. 大小,量级,震级,重要,光度
aggressive [əˈɡresɪv]:adj. 侵略性的,好斗的,有进取心的,有闯劲的
trial [ˈtraɪəl]:n. 试验,审讯,努力,磨炼 adj. 试验的,审讯的
It requires some judgment to determine when to stop. In my earlier graphs I identified epoch 280 as the place at which accuracy saturated. It’s possible that was too pessimistic. Neural networks sometimes plateau for a while in training, before continuing to improve. I wouldn’t be surprised if more learning could have occurred even after epoch 400, although the magnitude of any further improvement would likely be small. So it’s possible to adopt more or less aggressive strategies for early stopping.
决定何时停止需要一些判断。在前面的图表中,我把 epoch 280 认定为精度饱和的位置,这有可能过于悲观了。神经网络在训练过程中有时会停滞一段时间,然后才继续改善。如果在 epoch 400 之后还能有更多的学习发生,我也不会感到惊讶,不过任何进一步提高的幅度可能都会很小。因此,提前终止的策略可以采取得或多或少激进一些。
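To make the idea of a more or less aggressive stopping rule concrete, here is a hedged sketch of "no-improvement-in-patience-epochs" early stopping. The helpers train_one_epoch and accuracy_on are hypothetical stand-ins (they are not part of network2) for one epoch of mini-batch SGD and an accuracy evaluation on the held-out data.

def train_with_early_stopping(net, training_data, validation_data,
                              max_epochs=400, patience=10):
    """Stop when validation accuracy hasn't improved for `patience` epochs."""
    best_accuracy, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(net, training_data)            # hypothetical: one pass of mini-batch SGD
        accuracy = accuracy_on(net, validation_data)   # hypothetical: fraction classified correctly
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            break                                      # accuracy appears to have saturated
    return best_epoch, best_accuracy

A larger patience gives a less aggressive rule (more tolerance for temporary plateaus), while a smaller patience gives a more aggressive one.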
Why use the validation_data to prevent overfitting, rather than the test_data? In fact, this is part of a more general strategy, which is to use the validation_data to evaluate different trial choices of hyper-parameters such as the number of epochs to train for, the learning rate, the best network architecture, and so on. We use such evaluations to find and set good values for the hyper-parameters. Indeed, although I haven’t mentioned it until now, that is, in part, how I arrived at the hyper-parameter choices made earlier in this book.
为什么要用 validation_data 而不是 test_data 来防止过拟合呢?事实上,这是一个更一般策略的一部分,即用 validation_data 来评估不同的超参数试验选择 (例如训练的 epoch 数、学习率、最佳网络结构等等)。我们通过这样的评估来寻找并设定合适的超参数值。实际上,虽然直到现在我才提到这一点,但这也部分解释了我在本书前面是如何选定那些超参数的。
Of course, that doesn’t in any way answer the question of why we’re using the validation_data to prevent overfitting, rather than the test_data. Instead, it replaces it with a more general question, which is why we’re using the validation_data rather than the test_data to set good hyper-parameters? To understand why, consider that when setting hyper-parameters we’re likely to try many different choices for the hyper-parameters. If we set the hyper-parameters based on evaluations of the test_data it’s possible we’ll end up overfitting our hyper-parameters to the test_data. That is, we may end up finding hyper-parameters which fit particular peculiarities of the test_data, but where the performance of the network won’t generalize to other data sets. We guard against that by figuring out the hyper-parameters using the validation_data. Then, once we’ve got the hyper-parameters we want, we do a final evaluation of accuracy using the test_data. That gives us confidence that our results on the test_data are a true measure of how well our neural network generalizes. To put it another way, you can think of the validation data as a type of training data that helps us learn good hyper-parameters. This approach to finding good hyper-parameters is sometimes known as the hold out method, since the validation_data is kept apart or “held out” from the training_data.
当然,上面的解释并不能回答为什么我们用 validation_data 而不是 test_data 来防止过拟合,它实际上把问题换成了一个更一般的问题:为什么用 validation_data 而不是 test_data 来设置超参数?要理解这一点,请考虑在设置超参数时,我们很可能会尝试许多不同的超参数选择。如果基于 test_data 的评估结果来设置超参数,有可能我们最终会让超参数对 test_data 过拟合。也就是说,我们或许只是找到了适合 test_data 具体特性的超参数,而网络的性能并不能推广到其它数据集。通过用 validation_data 来确定超参数可以避免这种情况。然后,一旦我们得到了想要的超参数,就用 test_data 做最后的精度评估。这让我们相信 test_data 上的结果能够真正体现网络的泛化能力。换句话说,你可以把验证数据看作一种帮助我们学习合适超参数的训练数据。由于 validation_data 是从 training_data 中分离出来或“留出”的,这种寻找好的超参数的方法有时被称为留出法 (hold out method)。
Now, in practice, even after evaluating performance on the test_data we may change our minds and want to try another approach - perhaps a different network architecture - which will involve finding a new set of hyper-parameters. If we do this, isn’t there a danger we’ll end up overfitting to the test_data as well? Do we need a potentially infinite regress of data sets, so we can be confident our results will generalize? Addressing this concern fully is a deep and difficult problem. But for our practical purposes, we’re not going to worry too much about this question. Instead, we’ll plunge ahead, using the basic hold out method, based on the training_data, validation_data, and test_data, as described above.
在实践中,即便在对 test_data 进行了性能评估之后,我们也可能改变主意,想尝试另一种方法 (例如另一种网络结构),这又需要寻找一批新的超参数。如果我们这样做了,难道不也有最终对 test_data 过拟合的风险吗?难道我们需要潜在无穷无尽的数据集,才能确信我们的结果能够泛化吗?完全解决这个问题是一个深刻而困难的课题。不过出于实践目的,我们不会过分担心这个问题,而是继续使用上面提到的、基于 training_data、validation_data 和 test_data 的基本留出法。
plunge [plʌndʒ]:v. 暴跌,使突然前冲 (或下落),骤降,突降 n. 跳水,突然跌落,突然分离,骤减
We’ve been looking so far at overfitting when we’re just using 1,000 training images. What happens when we use the full training set of 50,000 images? We’ll keep all the other parameters the same (30 hidden neurons, learning rate 0.5, mini-batch size of 10), but train using all 50,000 images for 30 epochs. Here’s a graph showing the results for the classification accuracy on both the training data and the test data. Note that I’ve used the test data here, rather than the validation data, in order to make the results more directly comparable with the earlier graphs.
目前为止,我们已经看过了只用 1000 张训练图片的过拟合状况。而当用全部的 50000 图片训练集时又会发生什么呢?保持相同的参数 (30 个隐藏神经元,学习率为 0.5,mini-batch 的大小为 10),使用全部 50000 张图片,迭代 30 epochs。下面是显示训练数据和测试数据分类精度的图表。注意我在这里用了测试数据,而不是验证数据,是为了与前面的图片做更直接的比较。
As you can see, the accuracy on the test and training data remain much closer together than when we were using 1,000 training examples. In particular, the best classification accuracy of 97.86 percent on the training data is only 2.53 percent higher than the 95.33 percent on the test data. That’s compared to the 17.73 percent gap we had earlier! Overfitting is still going on, but it’s been greatly reduced. Our network is generalizing much better from the training data to the test data. In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option.
正如你看到的,相比使用 1000 张训练样例,使用 50000 张训练样例时,测试数据和训练数据上的准确率要接近得多。特别地,训练数据上最高的分类精度 97.86% 只比测试数据上的 95.33% 高出 2.53 个百分点,而之前的差距是 17.73%!过拟合仍然存在,但已经大大降低了。我们的网络能从训练数据更好地泛化到测试数据。一般来说,增加训练数据的数量是降低过拟合的最好方法之一。有了足够的训练数据,即使是非常庞大的网络也很难过拟合。不幸的是,训练数据的获取往往成本高昂或者很困难,因此这通常不是一个现实的选择。
1.1 Regularization - 正则化
Increasing the amount of training data is one way of reducing overfitting. Are there other ways we can reduce the extent to which overfitting occurs? One possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we’d only adopt reluctantly.
增加训练数据的数量是减轻过拟合的一种方法。那么,还有没有别的办法能够减轻过拟合发生的程度呢?一种可能的方法是减小网络的规模。然而,大型网络比小型网络有更大的潜力,所以这是一个我们不得已才会采用的选择。
extent [ɪkˈstent]:n. 程度,范围,长度
reluctantly [rɪ'lʌktəntlɪ]:adv. 不情愿地,嫌恶地
Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or L2 regularization. The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here’s the regularized cross-entropy:
幸运的是,即便网络和训练数据都固定不变,我们还有其它能够减轻过拟合的技术,也就是所谓的正则化 (regularization) 技术。在本节中,我将介绍最常用的正则化技术之一,它有时被称为权重衰减 (weight decay) 或 L2 正则化 (L2 regularization)。L2 正则化的思想是,在代价函数中加入一个额外的项,称为正则化项。下面是正则化之后的交叉熵:
\begin{aligned} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2 \tag{1} \end{aligned}
The first term is just the usual expression for the cross-entropy. But we’ve added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor $\lambda / 2n$, where $\lambda > 0$ is known as the regularization parameter, and $n$ is, as usual, the size of our training set. I’ll discuss later how $\lambda$ is chosen. It’s also worth noting that the regularization term doesn’t include the biases. I’ll also come back to that below.
第一项是常规的交叉熵表达式。但我们加入了第二项,也就是网络中所有权值的平方和。它由因子 $\lambda / 2n$ 进行缩放,其中 $\lambda > 0$ 被称为正则化参数 (regularization parameter),$n$ 和往常一样是训练集的大小。我稍后会讨论如何选择 $\lambda$。同样值得注意的是,正则化项不包括偏置,对此我也会在后面再作说明。
Of course, it’s possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:
当然,我们也可以对其它的代价函数进行正则化,例如平方代价。正则化的方法与上面类似:
\begin{aligned} C = \frac{1}{2n} \sum_x \|y-a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2 \tag{2} \end{aligned}
In both cases we can write the regularized cost function as
在两种情况中,我们都能把正则化的代价函数写成:
\begin{aligned} C = C_0 + \frac{\lambda}{2n} \sum_w w^2 \tag{3} \end{aligned}
where $C_0$ is the original, unregularized cost function.
其中 $C_0$ 是原本的、未正则化的代价函数。
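As a concrete reading of Equation (3), here is a small sketch of how the L2 penalty might be added to an unregularized cost, with the weights given as a list of NumPy matrices (the layout used by the book's network2.py); unregularized_cost stands in for $C_0$.

import numpy as np

def regularized_cost(unregularized_cost, weights, lmbda, n):
    """C = C_0 + (lmbda / (2*n)) * sum of squared weights, as in Equation (3).

    `weights` is a list of weight matrices; biases are deliberately left out."""
    l2_penalty = 0.5 * (lmbda / n) * sum(np.sum(w ** 2) for w in weights)
    return unregularized_cost + l2_penalty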
quadratic [kwɒˈdrætɪk]:adj. 二次的,n. 二次方程式
intuitively [ɪnˈtjuːɪtɪvli]:adv. 直观地,直觉地
compromise [ˈkɒmprəmaɪz]:n. 妥协,和解,妥协 (或折中) 方案,达成妥协 v. 妥协,折中,违背 (原则),达不到 (标准),(因行为不当) 使陷入危险,名誉受损
obvious [ˈɒbviəs]:adj. 明显的,显著的,平淡无奇的
Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of $\lambda$: when $\lambda$ is small we prefer to minimize the original cost function, but when $\lambda$ is large we prefer small weights.
直观来说,正则化的作用是让网络在其它条件相同时偏好学习更小的权值。只有当较大的权值能显著改进代价函数的第一部分时,它们才会被允许。换句话说,正则化可以视作一种在寻找小权值和最小化原本代价函数之间进行折中的方法。这一折中的两个要素的相对重要性由 $\lambda$ 的值决定:当 $\lambda$ 较小时,我们偏好最小化原本的代价函数;而当 $\lambda$ 较大时,我们偏好更小的权值。
Now, it’s really not at all obvious why making this kind of compromise should help reduce overfitting! But it turns out that it does. We’ll address the question of why it helps in the next section. But first, let’s work through an example showing that regularization really does reduce overfitting.
为什么这种折中能够减少过拟合?其中的原因其实并不显而易见!但事实证明它的确有效。我们将在下一节探讨为什么它有帮助。在此之前,先让我们通过一个例子展示正则化确实能够减小过拟合。
partial [ˈpɑːʃl]:adj. 局部的,偏爱的,不公平的
derivative [dɪ'rɪvətɪv]:n. 衍生物,派生词,衍生字,派生物 adj. 模仿他人的,缺乏独创性的
To construct such an example, we first need to figure out how to apply our stochastic gradient descent learning algorithm in a regularized neural network. In particular, we need to know how to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ for all the weights and biases in the network. Taking the partial derivatives of Equation (3) gives
为了构建这样一个例子,我们首先要弄清楚如何将随机梯度下降学习算法应用于正则化的神经网络中。特别地,我们需要知道如何对网络中所有的权值和偏置计算偏导数 $\partial C / \partial w$ 和 $\partial C / \partial b$。对等式 (3) 求偏导可得:
\begin{aligned} \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \tag{4} \end{aligned}
\begin{aligned} \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b} \tag{5} \end{aligned}
The $\partial C_0 / \partial w$ and $\partial C_0 / \partial b$ terms can be computed using backpropagation, as described in the last chapter. And so we see that it’s easy to compute the gradient of the regularized cost function: just use backpropagation, as usual, and then add $\frac{\lambda}{n} w$ to the partial derivative of all the weight terms. The partial derivatives with respect to the biases are unchanged, and so the gradient descent learning rule for the biases doesn’t change from the usual rule:
正如上一章所述,其中 $\partial C_0 / \partial w$ 和 $\partial C_0 / \partial b$ 可由反向传播计算。于是我们发现计算正则化代价函数的梯度相当简单:只要照常使用反向传播,然后把 $\frac{\lambda}{n} w$ 加到所有权值项的偏导数中即可。偏置的偏导数保持不变,所以偏置的梯度下降学习规则与通常的规则相同:
\begin{aligned} b & \rightarrow b - \eta \frac{\partial C_0}{\partial b} \tag{6} \end{aligned}
The learning rule for the weights becomes:
而权值的学习规则变为 ($n$ 是训练集的大小):
\begin{aligned} w & \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w \tag{7} \\ & = \left(1-\frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w} \tag{8} \end{aligned}
unstoppably:adv. 无法 (使) 停下来地
This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight $w$ by a factor $1 - \frac{\eta \lambda}{n}$. This rescaling is sometimes referred to as weight decay, since it makes the weights smaller. At first glance it looks as though this means the weights are being driven unstoppably toward zero. But that’s not right, since the other term may lead the weights to increase, if so doing causes a decrease in the unregularized cost function.
这与通常的梯度下降学习规则完全一样,区别仅在于我们先用因子 $1 - \frac{\eta \lambda}{n}$ 对权值 $w$ 进行缩放。这种缩放有时也被称作权重衰减 (weight decay),因为它使权重变小。乍看之下,权值似乎会被不停地减小直到为 0,但实际上并不是这样,因为如果能减小未正则化的代价函数的话,式中的另外一项可能会让权值增加。
Okay, that’s how gradient descent works. What about stochastic gradient descent? Well, just as in unregularized stochastic gradient descent, we can estimate $\partial C_0 / \partial w$ by averaging over a mini-batch of $m$ training examples. Thus the regularized learning rule for stochastic gradient descent becomes (c.f. Equation (20))
好的,这就是梯度下降的工作方式。那么随机梯度下降呢?与未正则化的随机梯度下降一样,我们可以在包含 $m$ 个训练样例的 mini-batch 上取平均来估计 $\partial C_0 / \partial w$。因此,随机梯度下降的正则化学习规则就变成了 (参考等式 (20)):
\begin{aligned} w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w} \tag{9} \end{aligned}
where the sum is over training examples $x$ in the mini-batch, and $C_x$ is the (unregularized) cost for each training example. This is exactly the same as the usual rule for stochastic gradient descent, except for the $1-\frac{\eta \lambda}{n}$ weight decay factor. Finally, and for completeness, let me state the regularized learning rule for the biases. This is, of course, exactly the same as in the unregularized case (c.f. Equation (21)),
其中的求和是对 mini-batch 中的所有训练样例 $x$ 进行的,$C_x$ 是每个样例对应的 (未正则化的) 代价。这与通常的随机梯度下降规则完全一样,除了权重衰减因子 $1-\frac{\eta \lambda}{n}$。最后,为了完整起见,我再给出偏置的正则化学习规则。毫无疑问,它与未正则化的情况 (参考等式 (21)) 完全一样:
\begin{aligned} b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b} \tag{10} \end{aligned}
where the sum is over training examples $x$ in the mini-batch.
其中的求和是对 mini-batch 中的所有训练样例 $x$ 进行的。
\begin{aligned} w_k & \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20} \end{aligned}
\begin{aligned} b_l & \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21} \end{aligned}
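For the stochastic case, the mini-batch versions (Equations (9), (10), (20), (21)) might be sketched as follows; grads_w and grads_b are illustrative names (not part of network2) for the per-example backpropagated gradients, one list of layer gradients per training example x in the mini-batch, and n is the size of the full training set.

def regularized_sgd_step(weights, biases, grads_w, grads_b, eta, lmbda, n):
    """Mini-batch update with L2 regularization (decay factor 1 - eta*lmbda/n)."""
    m = len(grads_w)  # mini-batch size
    new_weights = [(1 - eta * lmbda / n) * w
                   - (eta / m) * sum(gw[i] for gw in grads_w)
                   for i, w in enumerate(weights)]
    new_biases = [b - (eta / m) * sum(gb[i] for gb in grads_b)
                  for i, b in enumerate(biases)]
    return new_weights, new_biases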
Let’s see how regularization changes the performance of our neural network. We’ll use a network with 30 hidden neurons, a mini-batch size of 10, a learning rate of 0.5, and the cross-entropy cost function. However, this time we’ll use a regularization parameter of $\lambda = 0.1$. Note that in the code, we use the variable name lmbda, because lambda is a reserved word in Python, with an unrelated meaning. I’ve also used the test_data again, not the validation_data. Strictly speaking, we should use the validation_data, for all the reasons we discussed earlier. But I decided to use the test_data because it makes the results more directly comparable with our earlier, unregularized results. You can easily change the code to use the validation_data instead, and you’ll find that it gives similar results.
让我们看看正则化如何改变神经网络的表现。我们将使用一个包含 30 个隐藏神经元的网络,mini-batch 大小为 10,学习率为 0.5,并以交叉熵作为代价函数。然而,这次我们设置正则化参数 $\lambda = 0.1$。注意在代码中,我们使用变量名 lmbda,因为 lambda 是 Python 的保留字,其含义与此无关。我也再次使用了 test_data 而非 validation_data。严格来说,我们应该使用 validation_data,详细原因我们在此前已经解释过。不过我决定用 test_data,因为它能让结果与我们此前未正则化的结果进行更直接的比较。对代码稍作改动你即可改用 validation_data,并且你将发现得到的结果很相似。
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
... evaluation_data=test_data, lmbda = 0.1,
... monitor_evaluation_cost=True, monitor_evaluation_accuracy=True,
... monitor_training_cost=True, monitor_training_accuracy=True)
The cost on the training data decreases over the whole time, much as it did in the earlier, unregularized case (This and the next two graphs were produced with the program overfitting.py.):
训练数据上的代价在整个训练过程中持续下降,几乎与此前未正则化的情形一样 (这张图和下两张图的结果是由程序 overfitting.py 产生的):
But this time the accuracy on the test_data continues to increase for the entire 400 epochs:
但这次它在 test_data 上的准确率在 400 epochs 中保持提升:
Clearly, the use of regularization has suppressed overfitting. What’s more, the accuracy is considerably higher, with a peak classification accuracy of 87.1 percent, compared to the peak of 82.27 percent obtained in the unregularized case. Indeed, we could almost certainly get considerably better results by continuing to train past 400 epochs. It seems that, empirically, regularization is causing our network to generalize better, and considerably reducing the effects of overfitting.
显然,应用正则化抑制了过拟合。同时,准确率也显著提升了:分类准确率峰值为 87.1%,高于未正则化情形中的峰值 82.27%。实际上,如果继续训练超过 400 个 epoch,我们几乎肯定可以得到更好的结果。从经验来看,正则化让我们的网络泛化得更好,并显著减弱了过拟合效应。
empirically [ɪmˈpɪrɪkli]:adv. 以经验为主地
What happens if we move out of the artificial environment of just having 1,000 training images, and return to the full 50,000 image training set? Of course, we’ve seen already that overfitting is much less of a problem with the full 50,000 images. Does regularization help any further? Let’s keep the hyper-parameters the same as before - 30 epochs, learning rate 0.5, mini-batch size of 10. However, we need to modify the regularization parameter. The reason is because the size $n$ of the training set has changed from $n=1,000$ to $n=50,000$, and this changes the weight decay factor $1 - \frac{\eta \lambda}{n}$. If we continued to use $\lambda = 0.1$ that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to $\lambda = 5.0$.
如果我们离开仅有 1000 张训练图片的人为环境,回到有 50000 张图片的完整训练集,结果又会怎么样呢?当然,我们已经看到,对于全部 50000 张图片来说,过拟合不再是个大问题了。那么正则化还能提供更多帮助吗?让我们把超参数保持跟以前一样:30 个 epoch,学习率为 0.5,mini-batch 大小为 10。然而,我们需要调整正则化参数。原因是,训练集的大小 $n$ 已经从 $n=1000$ 变成了 $n=50000$,而这就改变了权重衰减因子 $1 - \frac{\eta \lambda}{n}$。如果我们继续使用 $\lambda = 0.1$,权重衰减将会大大减小,正则化的效果也会相应减弱。我们通过改用 $\lambda = 5.0$ 来进行补偿。
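A quick arithmetic check of the per-step decay factor $1 - \eta \lambda / n$ under these hyper-parameters shows why the compensation is needed:

# Per-step weight-decay factor 1 - eta*lambda/n for the settings discussed above.
eta = 0.5
for n, lmbda in [(1000, 0.1), (50000, 0.1), (50000, 5.0)]:
    print(n, lmbda, 1 - eta * lmbda / n)
# n=1,000,  lambda=0.1 -> 0.99995   (the earlier experiments)
# n=50,000, lambda=0.1 -> 0.999999  (almost no decay per step)
# n=50,000, lambda=5.0 -> 0.99995   (decay per step comparable to before)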
Okay, let’s train our network, stopping first to re-initialize the weights:
好的,让我们训练我们的网络,并在此之前重新初始化权值:
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5,
... evaluation_data=test_data, lmbda = 5.0,
... monitor_evaluation_accuracy=True, monitor_training_accuracy=True)
We obtain the results:
There’s lots of good news here. First, our classification accuracy on the test data is up, from 95.49 percent when running unregularized, to 96.49 percent. That’s a big improvement. Second, we can see that the gap between results on the training and test data is much narrower than before, running at under a percent. That’s still a significant gap, but we’ve obviously made substantial progress reducing overfitting.
这里有很多好消息。首先,我们在测试数据上的分类准确率提升了,从未正则化时的 95.49% 提高到 96.49%,这是一个重大的提升。第二,我们能看到训练数据与测试数据的结果之间的差距相比之前也大大缩小了,在一个百分点之下。虽然这仍然是一个明显的差距,但我们显然已经在减弱过拟合方面取得了实质性的进展。
Finally, let’s see what test classification accuracy we get when we use 100 hidden neurons and a regularization parameter of $\lambda = 5.0$. I won’t go through a detailed analysis of overfitting here, this is purely for fun, just to see how high an accuracy we can get when we use our new tricks: the cross-entropy cost function and L2 regularization.
最后,让我们看看当我们使用 100 个隐藏神经元并设置 $\lambda = 5.0$ 时的测试分类准确率。我不会在这里对过拟合做详细分析,这纯粹是为了好玩,只为了看看使用我们的新技巧 (交叉熵代价函数和 L2 正则化) 能让我们得到多高的准确率。
>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)
The final result is a classification accuracy of 97.92 percent on the validation data. That’s a big jump from the 30 hidden neuron case. In fact, tuning just a little more, to run for 60 epochs at $\eta = 0.1$ and $\lambda = 5.0$ we break the 98 percent barrier, achieving 98.04 percent classification accuracy on the validation data. Not bad for what turns out to be 152 lines of code!
最终结果是,分类准确率在验证数据上达到了 97.92%。相对于 30 个隐藏神经元的情形,这是一个飞跃。事实上,只要再稍作调整,以 $\eta = 0.1$ 和 $\lambda = 5.0$ 训练 60 个 epoch,我们就能突破 98% 的壁垒,在验证数据上达到 98.04% 的准确率。对于仅仅 152 行代码来说,这是个不错的结果!
I’ve described regularization as a way to reduce overfitting and to increase classification accuracies. In fact, that’s not the only benefit. Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I’ve found that the unregularized runs will occasionally get “stuck”, apparently caught in local minima of the cost function. The result is that different runs sometimes provide quite different results. By contrast, the regularized runs have provided much more easily replicable results.
我已经把正则化描述为一种减弱过拟合并提升分类准确率的方法。事实上,这并不是它唯一的好处。经验表明,在多次运行我们的 MNIST 网络并使用不同的 (随机) 权值初始化时,我发现未正则化的运行偶尔会被“卡住”,似乎陷入了代价函数的局部极小值中。结果就是不同的运行有时会给出相当不同的结果。相比之下,正则化的运行能给出更容易复现的结果。
Why is this going on? Heuristically, if the cost function is unregularized, then the length of the weight vector is likely to grow, all other things being equal. Over time this can lead to the weight vector being very large indeed. This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long. I believe this phenomenon is making it hard for our learning algorithm to properly explore the weight space, and consequently harder to find good minima of the cost function.
为什么会这样呢?启发式地看,如果代价函数没有正则化,那么在其它条件相同时,权重向量的长度倾向于增长。随着时间推移,权重向量会变得非常大。这可能导致权重向量被卡在大致相同的方向上,因为当向量长度很长时,梯度下降带来的改变只会使方向发生微小的变化。我相信这一现象使我们的学习算法难以恰当地探索权重空间,因而难以为代价函数找到好的极小值。
occasionally [əˈkeɪʒnəli]:adv. 偶尔,间或
minima ['mɪnɪmə]:n. 极小值 (minimum的复数),最小数
replicable [ˈreplɪkəbl]:adj. 能复现的,可复制的
stick [stɪk]:v. 粘贴,刺,忍受,戳 n. 条,枯枝,枝条,柴火棍儿
heuristically:启发式地
consequently [ˈkɒnsɪkwəntli]:adv. 因此,结果,所以
1.2 Why does regularization help reduce overfitting? - 为什么正则化能够降低过拟合?
We’ve seen empirically that regularization helps reduce overfitting. That’s encouraging but, unfortunately, it’s not obvious why regularization helps! A standard story people tell to explain what’s going on is along the following lines: smaller weights are, in some sense, lower complexity, and so provide a simpler and more powerful explanation for the data, and should thus be preferred. That’s a pretty terse story, though, and contains several elements that perhaps seem dubious or mystifying. Let’s unpack the story and examine it critically. To do that, let’s suppose we have a simple data set for which we wish to build a model:
我们通过实验发现正则化能帮助减少过拟合。这是令人高兴的事,然而不幸的是,我们没有明显的证据证明为什么正则化可以起到这个效果!一个大家经常说起的解释是:在某种程度上,越小的权重复杂度越低,因此能够更简单且更有效地描绘数据,所以我们倾向于选择这样的权重。尽管这是个很简短的解释,却也包含了一些疑点。让我们来更加仔细地探讨一下这个解释。假设我们要对一个简单的数据集建立模型:
encourage [ɪnˈkʌrɪdʒ]:vt. 鼓励,怂恿,激励,支持
terse [tɜːs]:adj. 简洁的,精练的,扼要的
dubious [ˈdjuːbiəs]:adj. 可疑的,暧昧的,无把握的,半信半疑的
mystify [ˈmɪstɪfaɪ]:vt. 使神秘化,使迷惑,使困惑
critically ['krɪtɪklɪ]:adv. 精密地,危急地,严重地,批评性地,用钻研眼光地,很大程度上,极为重要地
Implicitly, we’re studying some real-world phenomenon here, with $x$ and $y$ representing real-world data. Our goal is to build a model which lets us predict $y$ as a function of $x$. We could try using neural networks to build such a model, but I’m going to do something even simpler: I’ll try to model $y$ as a polynomial in $x$. I’m doing this instead of using neural nets because using polynomials will make things particularly transparent. Once we’ve understood the polynomial case, we’ll translate to neural networks. Now, there are ten points in the graph above, which means we can find a unique 9th-order polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data exactly. Here’s the graph of that polynomial:
这里我们隐含地在研究某个现实世界的现象,$x$ 和 $y$ 代表现实世界的数据。我们的目标是构建一个模型,能够基于 $x$ 预测 $y$。我们可以尝试用神经网络来构建这样的模型,但我将采用更简单的方法:把 $y$ 建模为关于 $x$ 的多项式。之所以不用神经网络,是因为多项式模型会让事情格外透明。一旦我们理解了多项式的情况,就可以把结论迁移到神经网络上。现在,上图中有十个点,这意味着我们可以找到一个唯一的 9 阶多项式 $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ 来精确拟合这些数据。这是该多项式的图像:
I won’t show the coefficients explicitly, although they are easy to find using a routine such as Numpy’s polyfit. You can view the exact form of the polynomial in the source code for the graph if you’re curious. It’s the function p(x) defined starting on line 14 of the program which produces the graph.
尽管使用 Numpy 的 polyfit 这样的例程很容易找到这些系数,但我不会把它们明确写出来。如果你好奇,可以在绘制该图的源代码中查看多项式的确切形式,即生成该图的程序中从第 14 行开始定义的函数 p(x)。
http://neuralnetworksanddeeplearning.com/js/polynomial_model.js
explicitly [ɪkˈsplɪsɪtli]:adv. 明确地,明白地
polynomial [,pɒlɪ'nəʊmɪəl]:adj. 多项的,多词的,多项式的 n. 多项式,多词拉丁学名,表示任何多项数之和
curious [ˈkjʊəriəs]:adj. 好奇的,有求知欲的,古怪的,爱挑剔的
That provides an exact fit. But we can also get a good fit using the linear model $y = 2x$:
它精确拟合了数据。但是我们也可以使用线性模型 $y = 2x$ 来得到一个较好的拟合:
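A hedged sketch of the two fits using NumPy's polyfit, mentioned above; the ten data points here are made up to lie roughly on $y = 2x$ plus noise (they are not the points from the book's figure), so the numbers are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.5, 5.0, 10)
y = 2 * x + rng.normal(scale=0.3, size=x.shape)   # illustrative data, not the book's

p9 = np.polyfit(x, y, 9)   # 9th-order polynomial: passes (essentially) through every point
p1 = np.polyfit(x, y, 1)   # linear model, close to y = 2x

x_far = 100.0              # extrapolate far beyond the data
print(np.polyval(p9, x_far))   # dominated by the x^9 term: an enormous value
print(np.polyval(p1, x_far))   # roughly 2 * 100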
Which of these is the better model? Which is more likely to be true? And which model is more likely to generalize well to other examples of the same underlying real-world phenomenon?
这两个哪个才是更好的模型呢?哪个更贴近真实情况?另外哪个模型能更好地泛化该问题的其它数据呢?
These are difficult questions. In fact, we can’t determine with certainty the answer to any of the above questions, without much more information about the underlying real-world phenomenon. But let’s consider two possibilities: (1) the 9th order polynomial is, in fact, the model which truly describes the real-world phenomenon, and the model will therefore generalize perfectly; (2) the correct model is $y = 2x$, but there’s a little additional noise due to, say, measurement error, and that’s why the model isn’t an exact fit.
这些问题很难回答。事实上,在没有更多关于背后现实现象的信息时,我们无法确定地回答上述任何一个问题。但是让我们考虑两种可能性:(1) 9 阶多项式事实上真正描述了现实现象,因此这个模型可以完美泛化;(2) 正确的模型是 $y = 2x$,但存在一些由测量误差等引入的额外噪声,因此这个模型没有精确拟合数据。
It’s not a priori possible to say which of these two possibilities is correct. (Or, indeed, if some third possibility holds). Logically, either could be true. And it’s not a trivial difference. It’s true that on the data provided there’s only a small difference between the two models. But suppose we want to predict the value of $y$ corresponding to some large value of $x$, much larger than any shown on the graph above. If we try to do that there will be a dramatic difference between the predictions of the two models, as the 9th order polynomial model comes to be dominated by the $x^9$ term, while the linear model remains, well, linear.
我们无法先验地判断这两种可能 (或者甚至还有第三种可能) 哪一个是正确的。从逻辑上讲,任何一个都可能是正确的,并且这两者的差异并非微不足道。诚然,在已提供的数据上,这两个模型只有很细微的差别。但是假设我们想预测某个远大于上图中任何取值的 $x$ 所对应的 $y$ 值,那么两个模型的预测结果将有巨大的差异:9 阶多项式模型将由 $x^9$ 项主导,而线性模型依旧还是线性的。
priori:先验的
trivial [ˈtrɪviəl]:adj. 不重要的,琐碎的,琐细的
dominate [ˈdɒmɪneɪt]:vt. 控制,支配,占优势,在...中占主要地位 vi. 占优势,处于支配地位
One point of view is to say that in science we should go with the simpler explanation, unless compelled not to. When we find a simple model that seems to explain many data points we are tempted to shout “Eureka!” After all, it seems unlikely that a simple explanation should occur merely by coincidence. Rather, we suspect that the model must be expressing some underlying truth about the phenomenon. In the case at hand, the model $y = 2x + {\rm noise}$ seems much simpler than $y = a_0 x^9 + a_1 x^8 + \ldots$. It would be surprising if that simplicity had occurred by chance, and so we suspect that $y = 2x + {\rm noise}$ expresses some underlying truth. In this point of view, the 9th order model is really just learning the effects of local noise. And so while the 9th order model works perfectly for these particular data points, the model will fail to generalize to other data points, and the noisy linear model will have greater predictive power.
有一种观点认为,在科学上,除非迫不得已,我们都应该采用更简单的解释。当我们找到一个看起来能解释很多数据点的简单模型时,我们会忍不住大喊:“找到啦!”毕竟,一个简单的解释似乎不太可能仅仅因为巧合而出现。相反,我们猜测这个模型一定表达了关于这个现象的某种潜在真理。在当前的例子中,模型 $y = 2x + {\rm noise}$ 看起来要比 $y = a_0 x^9 + a_1 x^8 + \ldots$ 简单得多。如果这种简单性只是偶然出现,那才令人吃惊,因此我们猜测 $y = 2x + {\rm noise}$ 表达了某种潜在真理。从这个观点来看,9 阶模型其实只是学习到了局部噪声的影响。因此,虽然 9 阶模型对这些特定数据点拟合得很完美,但它不能很好地泛化到其它数据点上,而带噪声的线性模型会有更强的预测能力。
suspect [ˈsʌspekt; (for v.) səˈspekt]:v. 怀疑,猜想 n. 嫌疑犯 adj. 靠不住的,可疑的
compel [kəmˈpel]:vt. 强迫,迫使,强使发生
tempt [tempt]:vt. 诱惑,引起,冒...的风险,使感兴趣
eureka [juˈriːkə]:int. 我发现了,我找到了
Let’s see what this point of view means for neural networks. Suppose our network mostly has small weights, as will tend to happen in a regularized network. The smallness of the weights means that the behaviour of the network won’t change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Think of it as a way of making it so single pieces of evidence don’t matter too much to the output of the network. Instead, a regularized network learns to respond to types of evidence which are seen often across the training set. By contrast, a network with large weights may change its behaviour quite a bit in response to small changes in the input. And so an unregularized network can use large weights to learn a complex model that carries a lot of information about the noise in the training data. In a nutshell, regularized networks are constrained to build relatively simple models based on patterns seen often in the training data, and are resistant to learning peculiarities of the noise in the training data. The hope is that this will force our networks to do real learning about the phenomenon at hand, and to generalize better from what they learn.
我们来看看这种观点对神经网络来说意味着什么。设想我们的网络大部分都有较小的权重,正如在正则化网络中常出现的那样。小权重意味着网络的行为不会因为我们随意更改了一些输入而改变太多。这使得它不容易学习到数据中局部噪声。可以把它想象成一种能使孤立的数据不会过多影响网络输出的方法,相反地,一个正则化的网络会学习去响应一些经常出现在整个训练集中的实例。与之相对的是,如果输入有一些小的变化,一个拥有大权重的网络会大幅改变其行为来响应变化。因此一个未正则化的网络可以利用大权重来学习得到训练集中包含了大量噪声信息的复杂模型。概括来说,正则化网络能够限制在对训练数据中常见数据构建出相对简单的模型,并且对训练数据中的各种各样的噪声有较好的抵抗能力。所以我们希望它能使我们的网络真正学习到问题中的现象的本质,并且能更好的进行泛化。
evidence [ˈevɪdəns]:n. 证据,证明,迹象,明显 vt. 证明
tend [tend]:vi. 趋向,倾向,照料,照顾 vt. 照料,照管
respond [rɪˈspɒnd]:vi. 回答,作出反应,承担责任 vt. 以...回答 n. 应答,唱和
nutshell [ˈnʌtʃel]:n. 坚果的外壳,小的东西,小容器 vt. 概括
resistant [rɪˈzɪstənt]:adj. 抵抗的,反抗的,顽固的 n. 抵抗者
peculiarity [pɪˌkjuːliˈærəti] :n. 特性,特质,怪癖,奇特
phenomenon [fəˈnɒmɪnən]:n. 现象,奇迹,杰出的人才
With that said, this idea of preferring simpler explanation should make you nervous. People sometimes refer to this idea as “Occam’s Razor”, and will zealously apply it as though it has the status of some general scientific principle. But, of course, it’s not a general scientific principle. There is no a priori logical reason to prefer simple explanations over more complex explanations. Indeed, sometimes the more complex explanation turns out to be correct.
按照这种说法,你可能会对这种更倾向简单模型的想法感到紧张。人们有时把这种想法称作奥卡姆剃刀,并且就好像它是科学原理一样,热情地应用它。然而,它并不是一个普遍成立的科学原理。并不存在一个先验的符合逻辑的理由倾向于简单的模型,而不是复杂的模型。实际上,有时候更复杂的模型反而是正确的。
nervous [ˈnɜːvəs]:adj. 神经的,紧张不安的,强健有力的
razor [ˈreɪzə(r)]:n. 剃 (须) 刀 v. 用剃刀剃,用剃刀刮
zealously [ˈzeləsli]:adv. 热心地,积极地
Let me describe two examples where more complex explanations have turned out to be correct. In the 1940s the physicist Marcel Schein announced the discovery of a new particle of nature. The company he worked for, General Electric, was ecstatic, and publicized the discovery widely. But the physicist Hans Bethe was skeptical. Bethe visited Schein, and looked at the plates showing the tracks of Schein’s new particle. Schein showed Bethe plate after plate, but on each plate Bethe identified some problem that suggested the data should be discarded. Finally, Schein showed Bethe a plate that looked good. Bethe said it might just be a statistical fluke. Schein: “Yes, but the chance that this would be statistics, even according to your own formula, is one in five.” Bethe: “But we have already looked at five plates.” Finally, Schein said: “But on my plates, each one of the good plates, each one of the good pictures, you explain by a different theory, whereas I have one hypothesis that explains all the plates, that they are [the new particle].” Bethe replied: “The sole difference between your and my explanations is that yours is wrong and all of mine are right. Your single explanation is wrong, and all of my multiple explanations are right.” Subsequent work confirmed that Nature agreed with Bethe, and Schein’s particle is no more.
让我介绍两个最终证明复杂解释才是正确的例子。在 1940 年代,物理学家 Marcel Schein 宣布发现了一种新的自然粒子。他工作所在的通用电气公司欣喜若狂,并广泛宣传了这一发现。但是物理学家 Hans Bethe 对此表示怀疑。Bethe 拜访了 Schein,并查看了展示 Schein 新粒子轨迹的底片。Schein 把底片一张一张地展示给 Bethe,但是 Bethe 在每一张底片上都发现了一些问题,这些问题表明该数据应该被舍弃。最后,Schein 给 Bethe 展示了一张看起来不错的底片。Bethe 说它可能只是一个统计上的侥幸。Schein 说:“是的,但即便按照你自己的公式,这是统计侥幸的几率也只有五分之一。”Bethe 说:“但我们已经看过了五张底片。”最后,Schein 说:“可是在我的底片上,每一张好的底片、每一幅好的图像,你都用不同的理论来解释,而我只用一个假设就能解释所有的底片,那就是它们是新粒子。”Bethe 回答道:“你我的解释之间的唯一区别在于,你的是错误的,而我的全都是正确的。你那个单一的解释是错误的,而我那些多个解释都是正确的。”随后的研究证实,大自然站在 Bethe 一边,Schein 的粒子也就不复存在了。
The story is related by the physicist Richard Feynman in an interview with the historian Charles Weiner.
这个故事来自于物理学家 Richard Feynman 和 历史学家 Charles Weiner 的一次访谈中。
plate [pleɪt]:n. 碟,金属板,金属牌,感光底片 vt. 电镀,给...装甲
fluke [fluːk]:n. 侥幸,锚爪,意外的挫折 vt. 侥幸成功,意外受挫 vi. 侥幸成功
skeptical ['skeptɪkəl]:adj. 怀疑的,(哲学) 怀疑论的,不可知论的
ecstatic [ɪkˈstætɪk]:adj. 狂喜的,入迷的 n. 狂喜的人
particle [ˈpɑːtɪkl]:n. 颗粒,质点,极小量,小品词
As a second example, in 1859 the astronomer Urbain Le Verrier observed that the orbit of the planet Mercury doesn’t have quite the shape that Newton’s theory of gravitation says it should have. It was a tiny, tiny deviation from Newton’s theory, and several of the explanations preferred at the time boiled down to saying that Newton’s theory was more or less right, but needed a tiny alteration. In 1916, Einstein showed that the deviation could be explained very well using his general theory of relativity, a theory radically different to Newtonian gravitation, and based on much more complex mathematics. Despite that additional complexity, today it’s accepted that Einstein’s explanation is correct, and Newtonian gravity, even in its modified forms, is wrong. This is in part because we now know that Einstein’s theory explains many other phenomena which Newton’s theory has difficulty with. Furthermore, and even more impressively, Einstein’s theory accurately predicts several phenomena which aren’t predicted by Newtonian gravity at all. But these impressive qualities weren’t entirely obvious in the early days. If one had judged merely on the grounds of simplicity, then some modified form of Newton’s theory would arguably have been more attractive.
另一个例子是,1859 年天文学家 Urbain Le Verrier 发现水星轨道的形状与牛顿引力理论所预言的并不完全相符。这与牛顿的理论只有极小极小的偏差,而当时一些受到青睐的解释归结为:牛顿的理论大体上是正确的,只是需要一些微小的修正。1916 年,爱因斯坦表明这一偏差可以用他的广义相对论很好地解释,这一理论与牛顿引力理论有着根本的不同,并且建立在复杂得多的数学之上。尽管有这些额外的复杂性,今天我们已经接受爱因斯坦的解释是正确的,而牛顿的引力理论,即便是修正过的形式,也是错误的。这在一定程度上是因为我们现在知道,爱因斯坦的理论解释了许多牛顿理论难以解释的现象。此外,更令人印象深刻的是,爱因斯坦的理论准确预测了一些牛顿引力理论完全没有预测到的现象。但这些令人印象深刻的优点在早期并不是显而易见的。如果仅以简单性为依据来判断,那么某种修正形式的牛顿理论可以说会更有吸引力。
deviation [ˌdiːviˈeɪʃn]:n. 偏差,误差,背离
gravitation [ˌɡrævɪˈteɪʃn]:n. 重力,万有引力,地心吸力
astronomer [əˈstrɒnəmə(r)]:n. 天文学家
orbit [ˈɔːbɪt]:n. 轨道,眼眶,势力范围,生活常规 vi. 盘旋,绕轨道运行 vt. 绕...轨道而行
planet [ˈplænɪt]:n. 行星
mercury [ˈmɜːrkjəri]:n. 汞,水银,水星
boil [bɔɪl]:v. 煮沸,(使) 沸腾,达到沸点,用沸水煮或烫洗,对...施以烹刑,翻腾,发火 n. 沸点,沸腾,激昂,疖,皮下脓肿,突然浮上来食鱼饵
alteration [ˌɔːltəˈreɪʃn]:n. 修改,改变,变更
radically [ˈrædɪkli]:adv. 根本上,彻底地,以激进的方式
impressive [ɪmˈpresɪv]:adj. 感人的,令人钦佩的,给人以深刻印象的
simplicity [sɪmˈplɪsəti]:n. 朴素,简易,天真,愚蠢
arguably [ˈɑːɡjuəbli]:adv. 可论证地,可争辩地,正如可提出证据加以证明的那样地
attractive [əˈtræktɪv]:adj. 吸引人的,有魅力的,引人注目的
moral [ˈmɒrəl]:adj. 道德的,精神上的,品性端正的 n. 道德,寓意
subtle [ˈsʌtl]:adj. 微妙的,精细的,敏感的,狡猾的,稀薄的
regime [reɪˈʒiːm]:n. 政权,政体,社会制度,管理体制
caution [ˈkɔːʃn]:n. 小心,谨慎,警告,警示 vt. 警告
There are three morals to draw from these stories. First, it can be quite a subtle business deciding which of two explanations is truly “simpler”. Second, even if we can make such a judgment, simplicity is a guide that must be used with great caution! Third, the true test of a model is not simplicity, but rather how well it does in predicting new phenomena, in new regimes of behaviour.
这些故事有三个意义。第一,判断两个解释哪个才是真正的简单是一个非常微妙的事情。第二,即便我们能做出这样的判断,简单是一个必须非常谨慎使用的指标。第三,真正测试一个模型的不是简单与否,更重要在于它在预测新的情况时表现如何。
With that said, and keeping the need for caution in mind, it’s an empirical fact that regularized neural networks usually generalize better than unregularized networks. And so through the remainder of the book we will make frequent use of regularization. I’ve included the stories above merely to help convey why no-one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize. Indeed, researchers continue to write papers where they try different approaches to regularization, compare them to see which works better, and attempt to understand why different approaches work better or worse. And so you can view regularization as something of a kludge. While it often helps, we don’t have an entirely satisfactory systematic understanding of what’s going on, merely incomplete heuristics and rules of thumb.
尽管需要保持谨慎,但经验表明,正则化的神经网络通常要比未正则化的网络泛化能力更好。因此本书的剩余部分我们将频繁地使用正则化。我讲上面那些故事,只是为了帮助说明为什么至今还没有人提出一个完全令人信服的理论来解释正则化为什么能帮助网络泛化。事实上,研究人员仍在不断发表论文,尝试不同的正则化方法,比较哪种效果更好,并试图理解为什么不同的方法效果有好有坏。所以你可以把正则化看作某种拼凑的办法 (kludge)。虽然它经常有帮助,但我们对其中的机理并没有一套完全令人满意的系统性理解,有的仅仅是不完整的启发式方法和经验法则。
kludge [kluːdʒ; klʌdʒ]:n. 杂牌电脑,组装机 (等于 kluge)
heuristic [hjuˈrɪstɪk]:adj. 启发式的,探索的 n. 启发式教育法,探索性步骤
thumb [θʌm]:n. 拇指,(手套的) 拇指部分 v. 翻阅,以拇指拨弄,作搭车手势
systematic [ˌsɪstəˈmætɪk]:adj. 系统的,体系的,有系统的,分类的,一贯的,惯常的
There’s a deeper set of issues here, issues which go to the heart of science. It’s the question of how we generalize. Regularization may give us a computational magic wand that helps our networks generalize better, but it doesn’t give us a principled understanding of how generalization works, nor of what the best approach is.
这存在一个更深层的问题,一个科学的核心问题。就是我们怎么去泛化这一问题。正则化可以给我们一个计算魔法棒来帮助我们的网络更好的泛化,但是它并没有给我们一个原则性的解释泛化是如何工作的,也没有告诉我们最好的方法是什么。
These issues go back to the problem of induction, famously discussed by the Scottish philosopher David Hume in “An Enquiry Concerning Human Understanding” (1748). The problem of induction has been given a modern machine learning form in the no-free lunch theorem (link) of David Wolpert and William Macready (1997).
这些问题可以追溯到归纳问题 (problem of induction),苏格兰哲学家 David Hume 在《An Enquiry Concerning Human Understanding》(1748) 中对此有著名的论述。David Wolpert 和 William Macready (1997) 的没有免费午餐定理 (no-free-lunch theorem,见链接) 则赋予了归纳问题一种现代机器学习的形式。
Scottish [ˈskɒtɪʃ]:adj. 苏格兰的,苏格兰人的,苏格兰语的 n. 苏格兰人,苏格兰语
philosopher [fəˈlɒsəfə(r)]:n. 哲学家,哲人
This is particularly galling because in everyday life, we humans generalize phenomenally well. Shown just a few images of an elephant a child will quickly learn to recognize other elephants. Of course, they may occasionally make mistakes, perhaps confusing a rhinoceros for an elephant, but in general this process works remarkably accurately. So we have a system - the human brain - with a huge number of free parameters. And after being shown just one or a few training images that system learns to generalize to other images. Our brains are, in some sense, regularizing amazingly well! How do we do it? At this point we don’t know. I expect that in years to come we will develop more powerful techniques for regularization in artificial neural networks, techniques that will ultimately enable neural nets to generalize well even from small data sets.
这尤其令人烦恼,因为在日常生活中,我们人类的泛化能力好得惊人。只要给一个孩子看几张大象的图片,他就能很快学会辨认其它大象。当然,他们偶尔也会犯错,也许会把犀牛误认成大象,但总体来说这个过程非常准确。因此我们拥有一个系统 - 人的大脑 - 它有着海量的自由参数。而在只看过一张或几张训练图像之后,这个系统就能学会泛化到其它图像。从某种意义上说,我们的大脑正则化得非常好!我们是怎么做到的?目前我们还不知道。我预计未来几年人工神经网络领域将开发出更强大的正则化技术,这些技术最终能使神经网络即便从很小的数据集也能很好地泛化。
remarkably [rɪ'mɑːkəblɪ]:adv. 显著地,非常地,惊人地,引人注目地
elephant [ˈelɪfənt]:n. 象,大号图画纸
rhinoceros [raɪˈnɒsərəs]:n. 犀牛
particularly [pəˈtɪkjələli]:adv. 异乎寻常地,特别是,明确地
galling [ˈɡɔːlɪŋ]:adj. 难堪的,使烦恼的,擦伤人的
In fact, our networks already generalize better than one might a priori expect. A network with 100 hidden neurons has nearly 80,000 parameters. We have only 50,000 images in our training data. It’s like trying to fit an 80,000th degree polynomial to 50,000 data points. By all rights, our network should overfit terribly. And yet, as we saw earlier, such a network actually does a pretty good job generalizing. Why is that the case? It’s not well understood. It has been conjectured that “the dynamics of gradient descent learning in multilayer nets has a `self-regularization’ effect”. This is exceptionally fortunate, but it’s also somewhat disquieting that we don’t understand why it’s the case. In the meantime, we will adopt the pragmatic approach and use regularization whenever we can. Our neural networks will be the better for it.
事实上,我们的网络的泛化能力已经好于人们先验的预期。一个具有 100 个隐藏神经元的网络有近 80000 个参数,而我们的训练数据中只有 50000 张图像。这就好比试图用一个 80000 次多项式去拟合 50000 个数据点。按理来说,我们的网络应该过拟合得非常严重。然而,正如我们所见,这样的网络事实上泛化得相当好。为什么会是这样?目前还没有很好的解释。有人推测“多层网络中梯度下降学习的动力学有一种‘自正则化’效应”。这是非常幸运的,但我们不理解其中的原因,这也多少令人不安。与此同时,我们将采取务实的做法,尽可能地使用正则化。我们的神经网络将因此变得更好。
Gradient-Based Learning Applied to Document Recognition, Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998)
Let me conclude this section by returning to a detail which I left unexplained earlier: the fact that L2 regularization doesn’t constrain the biases. Of course, it would be easy to modify the regularization procedure to regularize the biases. Empirically, doing this often doesn’t change the results very much, so to some extent it’s merely a convention whether to regularize the biases or not. However, it’s worth noting that having a large bias doesn’t make a neuron sensitive to its inputs in the same way as having large weights. And so we don’t need to worry about large biases enabling our network to learn the noise in our training data. At the same time, allowing large biases gives our networks more flexibility in behaviour - in particular, large biases make it easier for neurons to saturate, which is sometimes desirable. For these reasons we don’t usually include bias terms when regularizing.
最后,让我回到之前没有解释的一个细节来结束本节:L2 正则化没有约束偏置 (biases)。当然,修改正则化过程来把偏置也正则化是很容易的。但根据经验,这样做往往不会明显地改变结果,所以在一定程度上,是否对偏置做正则化仅仅是一个习惯问题。不过值得注意的是,拥有较大的偏置并不会像拥有大权重那样使神经元对输入变得敏感,所以我们不必担心较大的偏置会让网络学习到训练数据中的噪声。同时,允许较大的偏置使我们的网络在行为上更为灵活 - 特别是,较大的偏置使神经元更容易饱和,而这有时是我们想要的。出于这些原因,我们在正则化时通常不包含偏置项。
veteran [ˈvetərən]:n. 老兵,老手,富有经验的人,老运动员 adj. 经验丰富的,老兵的
apron [ˈeɪprən]:n. 围裙,停机坪,舞台口 vt. 着围裙于,围绕
terribly [ˈterəbli]:adv. 非常,可怕地,极度地
pragmatic [præɡˈmætɪk]:adj. 实际的,实用主义的
fortunate [ˈfɔːtʃənət]:adj. 幸运的,侥幸的,吉祥的,带来幸运的
exceptionally [ɪkˈsepʃənəli]:adv. 异常地,特殊地,例外地
constrain [kənˈstreɪn]:vt. 驱使,强迫,束缚
1.3 Other techniques for regularization - 其它正则化技术
There are many regularization techniques other than L2 regularization. In fact, so many techniques have been developed that I can’t possibly summarize them all. In this section I briefly describe three other approaches to reducing overfitting: L1 regularization, dropout, and artificially increasing the training set size. We won’t go into nearly as much depth studying these techniques as we did earlier. Instead, the purpose is to get familiar with the main ideas, and to appreciate something of the diversity of regularization techniques available.
除了 L2 外还有很多正则化技术。实际上,正是由于数量众多,这里也不会将所有的都列举出来。在本节,我简要地给出三种减轻过度拟合的其他的方法:L1 正则化、弃权 (dropout) 和人为增加训练样本。我们不会像上面介绍得那么深入。其实,目的只是想让读者熟悉这些主要的思想,然后来体会一下正则化技术的多样性。
L1 regularization: In this approach we modify the unregularized cost function by adding the sum of the absolute values of the weights:
L1 regularization: 这个方法是在未正则化的代价函数上加上一个权重绝对值的和:
\begin{aligned} C = C_0 + \frac{\lambda}{n} \sum_w |w| \tag{11} \end{aligned}
Intuitively, this is similar to L2 regularization, penalizing large weights, and tending to make the network prefer small weights. Of course, the L1 regularization term isn’t the same as the L2 regularization term, and so we shouldn’t expect to get exactly the same behaviour. Let’s try to understand how the behaviour of a network trained using L1 regularization differs from a network trained using L2 regularization.
凭直觉地看,这和 L2 正则化相似,惩罚大的权重,倾向于让网络优先选择小的权重。当然,L1 正则化和 L2 正则化并不相同,所以我们不应该期望从 L1 正则化得到完全同样的行为。让我们来试着理解使用 L1 正则化训练的网络和 L2 正则化训练的网络所不同的行为。
penalize ['pi:nəlaɪz]:vt. 处罚,处刑,使不利
intuitively [ɪnˈtjuːɪtɪvli]:adv. 直观地,直觉地
To do that, we’ll look at the partial derivatives of the cost function. Differentiating (11) we obtain:
首先,我们会研究一下代价函数的偏导数。对公式 (11) 求导我们有:
\begin{aligned} \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm sgn}(w) \tag{12} \end{aligned}
where sgn(w) is the sign of w, that is, +1 if w is positive, and −1 if w is negative. Using this expression, we can easily modify backpropagation to do stochastic gradient descent using L1 regularization. The resulting update rule for an L1 regularized network is
其中 sgn(w) 就是 w 的正负号,即 w 是正数时为 +1,而 w 为负数时为 −1。使用这个表达式,我们可以轻易地修改反向传播,从而使用基于 L1 正则化的随机梯度下降进行学习。对 L1 正则化的网络进行更新的规则就是
\begin{aligned} w \rightarrow w' = w - \frac{\eta \lambda}{n} \, \text{sgn}(w) - \eta \frac{\partial C_0}{\partial w} \tag{13} \end{aligned}
where, as per usual, we can estimate ∂C0/∂w using a mini-batch average, if we wish. Compare that to the update rule for L2 regularization (c.f. Equation (15)),
其中和往常一样,如果需要,我们可以用一个 mini-batch 的均值来估计 ∂C0/∂w。对比 L2 正则化的更新规则 (参见方程 (15)),
\begin{aligned} w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n}\right) - \eta \frac{\partial C_0}{\partial w} \tag{14} \end{aligned}
\begin{aligned} w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w} \tag{15} \end{aligned}
In both expressions the effect of regularization is to shrink the weights. This accords with our intuition that both kinds of regularization penalize large weights. But the way the weights shrink is different. In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.
在两种情形下,正则化的效果都是缩小权重。这符合我们的直觉:两种正则化都惩罚大的权重。但权重缩小的方式不同。在 L1 正则化中,权重以一个常量为步长向 0 缩小;在 L2 正则化中,权重缩小的量和 w 成比例。所以,当某个权重的绝对值 |w| 很大时,L1 正则化对权重的缩小远比 L2 正则化要少;相反,当 |w| 很小时,L1 正则化对权重的缩小要比 L2 正则化多得多。最终的结果就是:L1 正则化倾向于将网络的权重集中在相对少量的高重要度连接上,而其他权重则被驱向 0。
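To make this comparison concrete, here is a small sketch in plain Python that prints the shrinkage each rule applies in a single update step to a large weight and to a small one. The values of eta, lmbda, and n are illustrative choices, not taken from any experiment in the text.

```python
# Compare the per-step shrinkage applied by L1 and L2 regularization.
# eta, lmbda, and n below are illustrative values only.
eta, lmbda, n = 0.5, 5.0, 50000

def l1_shrink(w):
    """Constant step toward zero: (eta * lmbda / n) * sgn(w)."""
    sgn = (w > 0) - (w < 0)
    return eta * lmbda / n * sgn

def l2_shrink(w):
    """Step proportional to the weight: (eta * lmbda / n) * w."""
    return eta * lmbda / n * w

for w in (10.0, 0.01):
    print(f"w = {w}: L1 shrinks by {l1_shrink(w):.7f}, L2 shrinks by {l2_shrink(w):.7f}")
```

For the large weight the L2 step is much bigger than the constant L1 step, while for the small weight the situation reverses, matching the description above.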
I've glossed over an issue in the above discussion, which is that the partial derivative ∂C/∂w isn't defined when w = 0. The reason is that the function |w| has a sharp "corner" at w = 0, and so isn't differentiable at that point. That's okay, though. What we'll do is just apply the usual (unregularized) rule for stochastic gradient descent when w = 0. That should be okay - intuitively, the effect of regularization is to shrink weights, and obviously it can't shrink a weight which is already 0. To put it more precisely, we'll use Equations (12) and (13) with the convention that sgn(0) = 0. That gives a nice, compact rule for doing stochastic gradient descent with L1 regularization.
我在上面的讨论中其实忽略了一个问题:在 w = 0 时,偏导数 ∂C/∂w 没有定义。原因在于函数 |w| 在 w = 0 处有一个尖锐的"拐角",因此在该点不可微。不过也没有关系。我们要做的就是在 w = 0 时应用通常的 (无正则化的) 随机梯度下降规则。这应该没有问题:凭直觉来看,正则化的效果就是缩小权重,显然不能对一个已经是 0 的权重再进行缩小。更准确地说,我们将使用方程 (12) 和 (13),并约定 sgn(0) = 0。这样就给出了一条漂亮而紧凑的规则,用来进行采用 L1 正则化的随机梯度下降学习。
corner [ˈkɔːnə(r)]:n. 角落,拐角处,地区,偏僻处,困境,窘境 vi. 囤积,相交成角 vt. 垄断,迫至一隅,使陷入绝境,把...难住
differentiable [,dɪfə'renʃɪəb(ə)l]:adj. 可微的,可辨的,可区分的
gloss [ɡlɒs]:n. 光亮,用以产生光泽的物质,光泽涂料,亮光漆 v. 在...上作注释 (或评注)
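A minimal NumPy sketch of the weight update in equation (13) might look as follows. It assumes the gradient estimate nabla_C0 has already been computed by backpropagation over a mini-batch (not shown), and it relies on np.sign returning 0 at w = 0, which matches the convention sgn(0) = 0 above.

```python
import numpy as np

def l1_update(w, nabla_C0, eta, lmbda, n):
    """One L1-regularized gradient step, as in equation (13).

    w        -- weight matrix
    nabla_C0 -- estimate of dC0/dw from backpropagation over a mini-batch (not shown)
    eta      -- learning rate
    lmbda    -- regularization parameter
    n        -- size of the training set
    """
    # np.sign(0) == 0, which implements the convention sgn(0) = 0.
    return w - (eta * lmbda / n) * np.sign(w) - eta * nabla_C0
```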
Dropout: Dropout is a radically different technique for regularization. Unlike L1 and L2 regularization, dropout doesn’t rely on modifying the cost function. Instead, in dropout we modify the network itself. Let me describe the basic mechanics of how dropout works, before getting into why it works, and what the results are.
弃权 (Dropout) 是一种与前面截然不同的正则化技术。和 L1、L2 正则化不同,弃权并不依赖于对代价函数的修改。相反,在弃权中,我们修改的是网络本身。在介绍它为什么有效以及效果如何之前,让我先描述一下弃权的基本工作机制。
radically [ˈrædɪkli]:adv. 根本上,彻底地,以激进的方式
Suppose we’re trying to train a network (假设我们尝试训练一个网络):
In particular, suppose we have a training input x and corresponding desired output y. Ordinarily, we'd train by forward-propagating x through the network, and then backpropagating to determine the contribution to the gradient. With dropout, this process is modified. We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched. After doing this, we'll end up with a network along the following lines. Note that the dropout neurons, i.e., the neurons which have been temporarily deleted, are still ghosted in:
特别地,假设我们有一个训练输入 x 和对应的目标输出 y。通常我们会在网络中前向传播 x,然后进行反向传播来确定对梯度的贡献。使用弃权技术,这个过程就被修改了。我们会从随机 (临时) 地删除网络中一半的隐藏神经元开始,同时让输入层和输出层的神经元保持不变。在此之后,我们会得到一个大致如下的网络。注意那些被弃权的神经元,即那些临时被删除的神经元,在图中仍用虚影表示:
ghost [ɡəʊst]:n. 鬼,幽灵,鬼魂,一点 v. 悄悄地行进
ordinarily [ˈɔːdnrəli]:adv. 通常地,一般地
We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. After doing this over a mini-batch of examples, we update the appropriate weights and biases. We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.
我们通过修改后的网络前向传播输入 x,然后同样通过这个修改后的网络反向传播结果。在一个 mini-batch (小批量) 的若干样本上进行这些步骤后,我们对相关的权重和偏置进行更新。然后重复这个过程:首先恢复被弃权的神经元,然后选择一个新的随机隐藏神经元子集进行删除,在另一个 mini-batch 上估计梯度,再更新网络中的权重和偏置。
By repeating this process over and over, our network will learn a set of weights and biases. Of course, those weights and biases will have been learnt under conditions in which half the hidden neurons were dropped out. When we actually run the full network that means that twice as many hidden neurons will be active. To compensate for that, we halve the weights outgoing from the hidden neurons.
通过一遍又一遍地重复这个过程,我们的网络会学到一组权重和偏置。当然,这些权重和偏置是在一半隐藏神经元被弃权的情况下学到的。当我们实际运行完整的网络时,这意味着有两倍数量的隐藏神经元处于激活状态。为了补偿这一点,我们将从隐藏神经元出去的权重减半。
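A schematic NumPy sketch of this bookkeeping is shown below: during training we randomly zero half of a hidden layer's activations, and at test time the full layer is used while the weights going out of that layer are halved. This is only meant to illustrate the mechanics just described, not to reproduce any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_hidden(activations, training):
    """Apply dropout with probability 0.5 to a hidden layer's activations."""
    if training:
        # Temporarily "delete" half the hidden neurons by zeroing their outputs.
        mask = rng.random(activations.shape) < 0.5
        return activations * mask
    # At test time every hidden neuron is active; no masking here.
    # Instead the outgoing weights are halved (see below).
    return activations

def halve_outgoing_weights(w_out):
    """Compensate at test time for training with half the hidden neurons dropped."""
    return 0.5 * w_out
```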
This dropout procedure may seem strange and ad hoc. Why would we expect it to help with regularization? To explain what’s going on, I’d like you to briefly stop thinking about dropout, and instead imagine training neural networks in the standard way (no dropout). In particular, imagine we train several different neural networks, all using the same training data. Of course, the networks may not start out identical, and as a result after training they may sometimes give different results. When that happens we could use some kind of averaging or voting scheme to decide which output to accept. For instance, if we have trained five networks, and three of them are classifying a digit as a “3”, then it probably really is a “3”. The other two networks are probably just making a mistake. This kind of averaging scheme is often found to be a powerful (though expensive) way of reducing overfitting. The reason is that the different networks may overfit in different ways, and averaging may help eliminate that kind of overfitting.
这个弃权过程可能看起来很奇怪,像是随意拼凑的。为什么我们会指望它能起到正则化的作用呢?为了解释这里发生了什么,我希望你先暂时不考虑弃权,而是想象我们以标准方式 (不用弃权) 训练神经网络。特别地,想象我们训练几个不同的神经网络,都使用同一份训练数据。当然,这些网络的初始状态可能不同,因此训练之后它们有时会给出不同的结果。出现这种情况时,我们可以使用某种平均或者投票的方式来决定接受哪个输出。例如,如果我们训练了五个网络,其中三个把某个数字分类成 "3",那它很可能就是 "3",另外两个网络很可能只是犯了错误。这种平均的方式通常是一种强大 (尽管代价昂贵) 的减轻过度拟合的方法。原因在于不同的网络可能以不同的方式过度拟合,而平均可以帮助消除这类过度拟合。
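For instance, the voting scheme mentioned above might look like the following sketch, where `predictions` is a hypothetical list of digit labels output by five separately trained networks for the same input image.

```python
from collections import Counter

# Hypothetical outputs of five independently trained networks for one input image.
predictions = [3, 3, 3, 8, 5]

# Majority vote: accept the digit that most of the networks agree on.
voted_digit, votes = Counter(predictions).most_common(1)[0]
print(voted_digit, votes)  # -> 3 3
```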
What’s this got to do with dropout? Heuristically, when we dropout different sets of neurons, it’s rather like we’re training different neural networks. And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting.
那么这和弃权有什么关系呢?启发式地看,当我们弃权掉不同的神经元集合时,就有点像我们在训练不同的神经网络。所以,弃权过程就像是在对大量不同网络的效果取平均。不同的网络会以不同的方式过度拟合,所以我们希望,弃权的净效果将会减轻过度拟合。
A related heuristic explanation for dropout is given in one of the earliest papers to use the technique: “This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.” In other words, if we think of our network as a model which is making predictions, then we can think of dropout as a way of making sure that the model is robust to the loss of any individual piece of evidence. In this, it’s somewhat similar to L1 and L2 regularization, which tend to reduce weights, and thus make the network more robust to losing any individual connection in the network.
一个相关的启发式解释出现在最早使用这项技术的论文之一中:"这项技术减少了神经元之间复杂的共适应,因为一个神经元不能依赖特定其他神经元的存在。因此,它被迫去学习那些在与其他神经元的许多不同随机子集结合时仍然有用的、更加健壮的特征。"换言之,如果我们把神经网络看作一个进行预测的模型,那么可以把弃权看作一种确保模型在丢失任何个别证据时依然健壮的方式。从这个角度看,弃权和 L1、L2 正则化有相似之处:后者倾向于减小权重,从而使网络在丢失任何个别连接时更加健壮。
ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
Of course, the true measure of dropout is that it has been very successful in improving the performance of neural networks. The original paper introducing the technique applied it to many different tasks. For us, it’s of particular interest that they applied dropout to MNIST digit classification, using a vanilla feedforward neural network along lines similar to those we’ve been considering. The paper noted that the best result anyone had achieved up to that point using such an architecture was 98.4 percent classification accuracy on the test set. They improved that to 98.7 percent accuracy using a combination of dropout and a modified form of L2 regularization. Similarly impressive results have been obtained for many other tasks, including problems in image and speech recognition, and natural language processing. Dropout has been especially useful in training large, deep networks, where the problem of overfitting is often acute.
当然,对弃权技术的真正检验是它在提升神经网络性能上已经取得了巨大成功。引入这项技术的原始论文把它应用到了许多不同的任务上。对我们来说,特别有意思的是他们把弃权应用在了 MNIST 数字分类上,使用的是一个和我们一直在考虑的类似的普通前馈神经网络。论文指出,当时使用这样的网络结构所取得的最好结果是测试集上 98.4% 的分类准确率。他们使用弃权和一种改进形式的 L2 正则化的组合把准确率提高到了 98.7%。在很多其他任务上也取得了类似的令人印象深刻的结果,包括图像识别、语音识别和自然语言处理中的问题。弃权技术在训练大规模深度网络时尤其有用,因为在这样的网络中过度拟合问题往往特别严重。
vanilla [vəˈnɪlə]:n. 香子兰,香草 adj. 香草味的
Improving neural networks by preventing co-adaptation of feature detectors by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper discusses a number of subtleties that I have glossed over in this brief introduction. (注意这篇论文讨论了许多细微之处,这些我在这个简要的介绍里敷衍地带过了。)
Artificially expanding the training data: We saw earlier that our MNIST classification accuracy dropped down to percentages in the mid-80s when we used only 1,000 training images. It's not surprising that this is the case, since less training data means our network will be exposed to fewer variations in the way human beings write digits. Let's try training our 30 hidden neuron network with a variety of different training data set sizes, to see how performance varies. We train using a mini-batch size of 10, a learning rate η = 0.5, a regularization parameter λ = 5.0, and the cross-entropy cost function. We will train for 30 epochs when the full training data set is used, and scale up the number of epochs proportionally when smaller training sets are used. To ensure the weight decay factor remains the same across training sets, we will use a regularization parameter of λ = 5.0 when the full training data set is used, and scale down λ proportionally when smaller training sets are used. (This and the next two graphs are produced with the program more_data.py.)
人为扩展训练数据:我们前面看到,当只使用 1,000 幅训练图像时,MNIST 分类准确率下降到了 85% 左右。这种情况并不奇怪,因为更少的训练数据意味着我们的网络接触到的人类手写数字的变化更少。让我们用不同大小的训练数据集来训练 30 个隐藏神经元的网络,看看性能如何变化。我们使用大小为 10 的 mini-batch、学习率 η = 0.5、正则化参数 λ = 5.0 以及交叉熵代价函数进行训练。我们在使用全部训练数据时训练 30 个 epochs,并随着训练数据量的减少而成比例地增加 epochs 的数量。为了确保权重衰减因子在不同训练集上保持相同,我们会在使用全部训练数据时使用正则化参数 λ = 5.0,然后在使用更小的训练集时成比例地降低 λ 的值。(本图及接下来的两幅图由程序 more_data.py 生成。)
artificially [ˌɑːtɪˈfɪʃəli]:adv. 人工地,人为地,不自然地
expand [ɪkˈspænd]:vt. 扩张,使膨胀,详述 vi. 发展,张开,展开
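The proportional scaling just described can be written down directly. The sketch below assumes the interface of the book's mnist_loader and network2 modules (net.SGD(training_data, epochs, mini_batch_size, eta, lmbda=...)); the actual more_data.py may differ in its details, so treat this as an outline of the experiment rather than a copy of it.

```python
# Sketch: train the 30-hidden-neuron network on nested subsets of the training data,
# scaling epochs up and lambda down in proportion to the subset size.
import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

full_size = 50000
for size in [100, 1000, 5000, 10000, 50000]:
    net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
    epochs = int(30 * full_size / size)   # more passes over smaller training sets
    lmbda = 5.0 * size / full_size        # keeps the weight decay factor eta*lmbda/n fixed
    net.SGD(training_data[:size], epochs, 10, 0.5, lmbda=lmbda)
```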
As you can see, the classification accuracies improve considerably as we use more training data. Presumably this improvement would continue still further if more data was available. Of course, looking at the graph above it does appear that we’re getting near saturation. Suppose, however, that we redo the graph with the training set size plotted logarithmically:
如你所见,使用更多的训练数据时,分类准确率有了相当大的提升。如果有更多的数据,这种提升很可能还会继续。当然,从上面的图来看,我们似乎已经接近饱和。然而,假设我们把训练集大小换成对数坐标,重画此图如下:
logarithmically:adv. 用对数
It seems clear that the graph is still going up toward the end. This suggests that if we used vastly more training data - say, millions or even billions of handwriting samples, instead of just 50,000 - then we’d likely get considerably better performance, even from this very small network.
可以很明显地看出,图形到了最后仍在上升。这表明,如果我们使用远远更多的训练数据,比如说百万甚至十亿级的手写样本,而不是仅仅 50,000 个,那么即使是这样一个很小的网络,我们也很可能得到明显更好的性能。
Obtaining more training data is a great idea. Unfortunately, it can be expensive, and so is not always possible in practice. However, there’s another idea which can work nearly as well, and that’s to artificially expand the training data. Suppose, for example, that we take an MNIST training image of a five,
获取更多的训练数据是个很好的想法。不幸的是,这样做代价可能很大,在实践中并不总是可行。不过,还有另一个想法几乎能达到同样的效果,那就是人为扩展训练数据。例如,假设我们取一幅 MNIST 中数字 5 的训练图像,
and rotate it by a small amount, let's say 15 degrees (将其旋转一个小的角度,比如说 15 度):
It’s still recognizably the same digit. And yet at the pixel level it’s quite different to any image currently in the MNIST training data. It’s conceivable that adding this image to the training data might help our network learn more about how to classify digits. What’s more, obviously we’re not limited to adding just this one image. We can expand our training data by making many small rotations of all the MNIST training images, and then using the expanded training data to improve our network’s performance.
这还是会被识别为同样的数字。但是在像素层面,它和 MNIST 训练数据中现有的任何一幅图像都不相同。可以想见,把这幅图像加入训练数据可能帮助我们的网络学到更多关于如何分类数字的知识。而且,显然我们不限于只增加这一幅图像。我们可以对所有的 MNIST 训练图像做许多小角度的旋转来扩展训练数据,然后使用扩展后的训练数据来提升网络的性能。
conceivable [kənˈsiːvəbl]:adj. 可想像的,可相信的
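One simple way to generate such a rotated variant, assuming the digit is stored as a flattened 28×28 NumPy array (as in the book's data format) and that SciPy is available, is sketched below; the expanded-data experiments cited later may well have used different tooling.

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_digit(image_784, angle=15):
    """Rotate a flattened 28x28 MNIST image by `angle` degrees and flatten it again."""
    img = image_784.reshape(28, 28)
    # reshape=False keeps the output 28x28; empty corners are filled with 0 (background).
    rotated = rotate(img, angle=angle, reshape=False, mode="constant", cval=0.0)
    return np.clip(rotated, 0.0, 1.0).reshape(784, 1)
```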
This idea is very powerful and has been widely used. Let’s look at some of the results from a paper which applied several variations of the idea to MNIST. One of the neural network architectures they considered was along similar lines to what we’ve been using, a feedforward network with 800 hidden neurons and using the cross-entropy cost function. Running the network with the standard MNIST training data they achieved a classification accuracy of 98.4 percent on their test set. But then they expanded the training data, using not just rotations, as I described above, but also translating and skewing the images. By training on the expanded data set they increased their network’s accuracy to 98.9 percent. They also experimented with what they called “elastic distortions”, a special type of image distortion intended to emulate the random oscillations found in hand muscles. By using the elastic distortions to expand the data they achieved an even higher accuracy, 99.3 percent. Effectively, they were broadening the experience of their network by exposing it to the sort of variations that are found in real handwriting.
这个想法非常强大,并且已经被广泛使用。让我们看看一篇论文中的一些结果,这篇论文把这个想法的几种变体应用到了 MNIST 上。他们考虑的其中一种网络结构和我们一直使用的类似:一个拥有 800 个隐藏神经元、使用交叉熵代价函数的前馈神经网络。在标准的 MNIST 训练数据上运行这个网络,他们在测试集上得到了 98.4% 的分类准确率。然后他们扩展了训练数据,不仅使用了上面描述的旋转,还对图像进行了平移和倾斜。通过在扩展后的数据集上训练,他们把网络的准确率提升到了 98.9%。他们还试验了所谓的"弹性扭曲",这是一种特殊的图像扭曲方法,旨在模拟手部肌肉的随机抖动。通过使用弹性扭曲来扩展数据,他们达到了更高的准确率:99.3%。实际上,他们是通过让网络接触真实手写中会出现的各种变化来扩展网络的经验。
elastic [ɪˈlæstɪk]:adj. 有弹性的,灵活的,易伸缩的 n. 松紧带,橡皮圈
distortion [dɪˈstɔːʃn]:n. 变形,失真,扭曲,曲解
Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).
Variations on this idea can be used to improve performance on many learning tasks, not just handwriting recognition. The general principle is to expand the training data by applying operations that reflect real-world variation. It’s not difficult to think of ways of doing this. Suppose, for example, that you’re building a neural network to do speech recognition. We humans can recognize speech even in the presence of distortions such as background noise. And so you can expand your data by adding background noise. We can also recognize speech if it’s sped up or slowed down. So that’s another way we can expand the training data. These techniques are not always used - for instance, instead of expanding the training data by adding noise, it may well be more efficient to clean up the input to the network by first applying a noise reduction filter. Still, it’s worth keeping the idea of expanding the training data in mind, and looking for opportunities to apply it.
这个想法的各种变体可以用来提升许多学习任务上的性能,而不只是手写识别。一般的原则是,通过应用能反映真实世界变化的操作来扩展训练数据。想出这样的方法并不困难。例如,假设你在构建一个进行语音识别的神经网络。即使存在背景噪声之类的干扰,我们人类也能识别语音,所以你可以通过加入背景噪声来扩展数据。我们同样能识别被加速或减速的语音,所以这是另一种扩展训练数据的方式。这些技术并不总是有用。例如,与其通过加入噪声来扩展训练数据,不如先应用一个降噪滤波器来清理网络的输入,这样可能更有效率。尽管如此,记住扩展训练数据这个想法,并寻找应用它的机会,仍然是值得的。
opportunity [ˌɒpəˈtjuːnəti]:n. 时机,机会
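To illustrate the speech example, a minimal sketch of noise-based augmentation might look like the following, where `clean_audio` stands for a hypothetical 1-D NumPy array of waveform samples; a real pipeline would more likely mix in recorded background noise than synthetic Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_background_noise(clean_audio, noise_level=0.01):
    """Return a noisy copy of a waveform, as one crude way to expand speech training data."""
    noise = rng.normal(scale=noise_level, size=clean_audio.shape)
    return clean_audio + noise
```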
1.4 Exercise - 练习
As discussed above, one way of expanding the MNIST training data is to use small rotations of training images. What’s a problem that might occur if we allow arbitrarily large rotations of training images?
arbitrarily [ˌɑːbɪˈtrerəli; ˈɑːbɪtrəli]:adv. 武断地,反复无常地,专横地
An aside on big data and what it means to compare classification accuracies: Let’s look again at how our neural network’s accuracy varies with training set size:
关于大数据和其对分类准确率的影响的题外话:让我们再看看神经网络准确率随着训练集大小变化的情况:
Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let’s try using the support vector machines (SVM) which we met briefly back in Chapter 1. As was the case in Chapter 1, don’t worry if you’re not familiar with SVMs, we don’t need to understand their details. Instead, we’ll use the SVM supplied by the scikit-learn library. Here’s how SVM performance varies as a function of training set size. I’ve plotted the neural net results as well, to make comparison easy (This graph was produced with the program more_data.py (as were the last few graphs).):
假设我们使用别的机器学习技术来分类数字,而不是神经网络。例如,让我们试试使用支持向量机 (SVM),我们在第一章已经简要介绍过它。正如第一章中的情况,不要担心你熟不熟悉 SVM,我们不进行深入的讨论。我们使用 scikit-learn 库 (http://scikit-learn.org/stable/) 提供的 SVM 替换神经网络。下面是 SVM 模型的性能随着训练数据集的大小变化的情况,我也画出了神经网络的结果来方便对比。
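A minimal sketch of that comparison is given below, using scikit-learn's SVC with its out-of-the-box settings; the variable names train_images, train_labels, test_images, and test_labels are placeholders for however the MNIST arrays are loaded, and the actual more_data.py may organize things differently.

```python
from sklearn import svm

def svm_accuracy(train_images, train_labels, test_images, test_labels, size):
    """Train an out-of-the-box SVM on the first `size` training examples, return test accuracy."""
    clf = svm.SVC()                                   # default settings, as described in the text
    clf.fit(train_images[:size], train_labels[:size])
    return clf.score(test_images, test_labels)
```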
Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That’s nice, although you shouldn’t read too much into it, since I just used the out-of-the-box settings from scikit-learn’s SVM, while we’ve done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
这幅图中首先让你注意到的可能是,我们的神经网络在每一种训练集大小下都超过了 SVM。这很好,不过你也不应对此过分解读,因为我只是直接使用了 scikit-learn 中 SVM 的默认设置,而我们对神经网络已经做了相当多的改进工作。图中更微妙也更有趣的事实是:如果我们用 50,000 幅图像训练 SVM,那么它的性能 (94.48% 的准确率) 实际上已经超过了用 5,000 幅图像训练的神经网络 (93.24% 的准确率)。换言之,更多的训练数据有时可以弥补所用机器学习算法之间的差距。
strike [straɪk]:v. 罢工,击,攻击,打 n. 罢工,打击,袭击,罢课
Something even more interesting can occur. Suppose we’re trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don’t see that above - it would require the two graphs to cross - but it does happen. The correct response to the question “Is algorithm A better than algorithm B?” is really: “What training data set are you using?”
还会出现更有趣的现象。假设我们试着用两种机器学习算法 (算法 A 和算法 B) 去解决一个问题。有时会发生这样的情况:在一个训练集上算法 A 胜过算法 B,而在另一个训练集上算法 B 又胜过算法 A。上面我们并没有看到这种情况,因为那要求两条曲线交叉,但它确实会发生。对"算法 A 是不是比算法 B 好?"这个问题的正确回应其实是:"你用的是什么训练数据集?"
Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).
All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. “Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y” is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It’s entirely possible that the “improvement” due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It’s fine to look for better algorithms, but make sure you’re not focusing on better algorithms to the exclusion of easy wins getting more or better training data.
在进行开发或者阅读研究论文时,这都是需要牢记的警示。很多论文专注于寻找新的技巧,在标准基准数据集上榨取更好的性能。"我们的超赞技术在标准基准 Y 上带来了 X% 的性能提升"是一种典型的研究声明。这类声明通常确实很有趣,但必须理解为仅适用于所使用的特定训练数据集。想象另一种历史:最初创建该基准数据集的人获得了更多的研究经费,他们可能会用这些额外的钱去收集更多的训练数据。完全有可能,超赞技术带来的"提升"在更大的数据集上就消失了。换言之,所谓的提升可能只是历史的偶然。尤其是在实际应用中,要记住的信息是:我们既想要更好的算法,也想要更好的训练数据。寻找更好的算法很好,但要确保你没有只专注于更好的算法,而忽略了获取更多、更好的训练数据这种容易取得的收益。
1.5 Problem - 问题
(Research problem) How do our machine learning algorithms perform in the limit of very large data sets? For any given algorithm it’s natural to attempt to define a notion of asymptotic performance in the limit of truly big data. A quick-and-dirty approach to this problem is to simply try fitting curves to graphs like those shown above, and then to extrapolate the fitted curves out to infinity. An objection to this approach is that different approaches to curve fitting will give different notions of asymptotic performance. Can you find a principled justification for fitting to some particular class of curves? If so, compare the asymptotic performance of several different machine learning algorithms.
(研究问题) 在数据集非常大的极限情况下,我们的机器学习算法表现如何?对于任何给定的算法,很自然地会尝试定义它在真正的大数据极限下的渐近性能这一概念。解决这个问题的一个粗略快速的方法是,简单地对上面那样的图形尝试曲线拟合,然后把拟合出的曲线外推到无穷大。对这种方法的一个反对意见是,不同的曲线拟合方式会给出不同的渐近性能概念。你能为拟合某一类特定曲线找到有原则的依据吗?如果能,请比较几种不同机器学习算法的渐近性能。
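A quick-and-dirty starting point for this problem might look like the sketch below, which fits a simple saturating curve acc(n) = a − b·n^(−c) to (training-set size, accuracy) pairs with scipy.optimize.curve_fit and reads off the asymptote a. Both the data points and the choice of curve family are illustrative assumptions, not results from the text; the whole point of the research problem is to justify (or replace) such a choice.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (training set size, test accuracy) pairs -- illustrative only.
sizes = np.array([100, 1000, 5000, 10000, 50000], dtype=float)
accuracies = np.array([0.70, 0.85, 0.93, 0.95, 0.96])

def saturating(n, a, b, c):
    """One possible curve family: accuracy approaches the asymptote a as n grows."""
    return a - b * n ** (-c)

params, _ = curve_fit(saturating, sizes, accuracies, p0=[1.0, 1.0, 0.3])
a, b, c = params
print(f"estimated asymptotic accuracy: {a:.3f}")
```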
1.6 Summing up - 总结
We’ve now completed our dive into overfitting and regularization. Of course, we’ll return again to the issue. As I’ve mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there’s a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
我们现在已经介绍完了过度拟合和正则化。当然,我们会重回这个问题。正如我已经提过了几次,过度拟合是神经网络中一个重要的问题,尤其是计算机越来越强大,我们有训练更大的网络的能力时。我们有迫切的需要来开发出强大的正则化技术来减轻过度拟合,而这也是当前研究的极其活跃的领域。