一些异常的数据点穿插到另一组数据的区域,从而对边界产生了干扰。
Some outliers penetrate the area of theother group and disturb the boundary.
换句话说,这个数据集合中包含了很多噪声。
In other words, this data contains muchnoise.
问题是机器学习无法区分有用的数据与噪声。
The problem is that there is no way forMachine Learning to distinguish this.
由于机器学习考虑所有的数据,包括噪声,它最终可能会产生一个不正确的模型(在以上例子中这个模型是一条曲线)。
As Machine Learning considers all the data,even the noise, it ends up producing an improper model (a curve in this case).
这将导致因小失大。
This would be penny-wise and pound-foolish.
正如你可能注意到的,训练数据并不完美,可能包含不同数量(程度)的噪声。
As you may notice here, the training datais not perfect and may contain varying amounts of noise.
如果你相信训练数据的每个元素都是正确的,并且精确地拟合模型,那么你将得到一个具有较低泛化能力的模型。
If you believe that every element of thetraining data is correct and fits the model precisely, you will get a modelwith lower generalizability.
这就被称为过度拟合。
This is called overfitting.
当然,由于它本身的特性,机器学习应该尽一切努力从训练数据中分析出一个优秀的模型。
Certainly, because of its nature, MachineLearning should make every effort to derive an excellent model from thetraining data.
然而,训练数据的工作模型可能无法正确地反映现场数据的特征。
However, a working model of the trainingdata may not reflect the field data properly.
但这并不意味着我们应该使模型比训练数据更不精确。
This does not mean that we should make themodel less accurate than the training data on purpose.
这将破坏机器学习的基本策略。
This will undermine the fundamentalstrategy of Machine Learning.
现在我们面临一个进退两难的问题:减少训练数据的拟合误差会导致过度拟合,从而降低泛化性。
Now we face a dilemma—reducing the error ofthe training data leads to overfitting that degrades generalizability.
我们该怎么办?
What do we do?
当然,我们要正视这个问题!
We confront it, of course!
下一节将介绍防止过度拟合的技术。
The next section introduces the techniquesthat prevent overfitting.
正视过度拟合(Confronting Overfitting)
过度拟合显著影响机器学习的性能水平。
Overfitting significantly affects the levelof performance of Machine Learning.
我们可以看出谁是职业选手,谁是业余选手,他们在处理过度拟合时会采用他们各自的方法。
We can tell who is a pro and who is anamateur by watching their respective approaches in dealing with overfitting.
本节介绍两种用于过度拟合的典型方法:正则化和验证。
This section introduces two typical methodsused to confront overfitting: regularization and validation.
正则化是一种试图尽可能简单地构造模型结构的数值方法。
Regularization is a numerical method thatattempts to construct a model structure as simple as possible.
简化模型可以避免在低成本下过度拟合的影响。
The simplified model can avoid the effectsof overfitting at the small cost of performance.
上一节的数据分组问题可以作为一个很好的例子。
The grouping problem of the previoussection can be used as a good example.
复杂模型(或曲线)往往是过度拟合的。
The complex model (or curve) tends to beoverfitting.
相比之下,虽然简单曲线未能正确地分类某些数据点,但却更好地反映了分组的整体特征。
In contrast, although it fails to classifycorrectly some points, the simple curve reflects the overall characteristics ofthe group much better.
如果你理解它是如何运作的,那么现在就已经足够了。
If you understand how it works, that isenough for now.
——本文译自Phil Kim所著的《Matlab Deep Learning》
更多精彩文章请关注微信号: