Exploring the Simple, Satisfying Math Behind Regularization

Regularization is used throughout machine learning, from simple regression algorithms to complex neural networks, to prevent models from overfitting. Let’s explore through the lens of simple mathematics the incredibly satisfying and beautiful ways regularization keeps models in check.

Consider a very simple regression task. There are six parameters (coefficients) represented as beta, and five inputs represented as x. The output of the model is simply the sum of each coefficient multiplied by its input (plus an intercept term, beta-0).

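Written out explicitly, with x1 through x5 denoting the five inputs, the model takes the form:

    f(x) = β0 + β1·x1 + β2·x2 + β3·x3 + β4·x4 + β5·x5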

To humanize this problem, let’s say that we are trying to predict a student’s score on a test (the output of f(x)) based on five factors: 1) how much time they spend on homework (hw), 2) how many hours of sleep they got (sleep), 3) their score on the last test (score), 4) their current class GPA (gpa), and 5) if they ate food before taking the test (food).

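In the student example, the same model reads as follows (the assignment of each coefficient to each feature is just one illustrative ordering):

    f(x) = β0 + β1·hw + β2·sleep + β3·score + β4·gpa + β5·food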

The goal of the regression model is to minimize a ‘loss’, which attempts to quantify the error. This is usually done with the mean squared error. For example, if we have a dataset of inputs x and targets y, for each data point we would evaluate the difference in predicted target and actual target, square it, and average this among all items. For the purpose of brevity, it can be expressed shorthand like this (E for ‘expected value’ or average).

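In symbols:

    l = E[(f(x) − y)²]

where f(x) is the predicted target and y is the actual target.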

Hence, linear regression is trained to optimize its parameters along this loss function l. However, note that our model has many features, which usually correlates with higher complexity, similar to increasing the degree of a polynomial. Are all of the features relevant? If they aren’t, they may just be providing another degree of freedom for the model to overfit.

Say that a student’s current GPA and whether or not they had food that morning turn out to provide minimal benefit while causing the model to overfit. If those coefficients are set to 0, then those features are eliminated from the model altogether.

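For instance, if β4 and β5 happen to be the gpa and food coefficients (the same illustrative ordering as before), setting them to zero collapses the model to:

    f(x) = β0 + β1·hw + β2·sleep + β3·score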

This results in a much simpler model that is less able to overfit to the data. In a sense, lowering coefficients towards zero acts as a form of feature selection.

Regularization is predicated on the idea that larger parameters generally lead to overfitting, because they a) are not zero, and so add another variable/degree of freedom to the model, and b) can cause big swings and unnatural volatility in predictions. This high variance is a telltale sign of overfitting.

Let’s explore two types of regularization: L1 and L2 regularization.

L1 regularization slightly alters the loss function used in linear regression by adding a ‘regularization term’, in this case ‘λE[|β|]’. Let’s break this down: E[|β|] means ‘average of the absolute value of the parameters’, and λ is a scaling factor that determines how much the average of the parameters should contribute to the total loss.

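Putting the two pieces together, the full L1-regularized loss is:

    l = E[(f(x) − y)²] + λE[|β|]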

This encourages, overall, smaller parameters. When the model adjusts its coefficients to reduce the loss, it must determine whether a feature is valuable enough to keep, that is, whether it increases the predictive power by more than it increases the regularization term. Such features are fundamentally more profitable to keep.

Then, less useful features are discarded and the model becomes simpler.

L2 regularization is essentially the same, but parameters are squared before they are averaged. Hence, the regularization term is the average of the squared parameters. This tries to accomplish the same task of encouraging the model to reduce the overall value of its coefficients, but with different results.

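The L2-regularized loss simply swaps the absolute value for a square:

    l = E[(f(x) − y)²] + λE[β²]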

It’s worth noting that regular linear regression is a special case of L1/L2 regularization in which the lambda parameter equals 0 and the regularization term is essentially cancelled out.

To explore the different effects L1 and L2 regularization have, we need to look at their derivatives, or the slope of their functions at a certain point. To simplify, we can represent:

  • L1 regularization as y = x. Derivative is 1.

  • L2 regularization as y = x². Derivative is 2x.

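In other words, for a single positive parameter β, the L1 penalty |β| falls at a constant rate of 1 as β shrinks, while the L2 penalty β² falls at a rate of 2β, a rate that itself shrinks as β approaches 0.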

This means that:

  • In L1 regularization, if a parameter decreases from 5 to 4, the corresponding regularization term decreases by 5 − 4 = 1.

  • In L2 regularization, if a parameter decreases from 5 to 4, the corresponding regularization term decreases by 25 − 16 = 9.

While in L1 regularization the reward for reducing parameters is constant, in L2 regularization the reward gets smaller and smaller as the parameter nears zero. Going from a parameter value of 5 to 4 yields a decrease of 9, but reducing from 1 to 0 only yields an improvement of 1.

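As a quick sanity check, here is a minimal sketch in plain Python (the function names are purely illustrative) that reproduces the rewards above:

    def l1_penalty(beta):
        return abs(beta)

    def l2_penalty(beta):
        return beta ** 2

    # Reward (penalty reduction) for shrinking a parameter from 5 to 4
    print(l1_penalty(5) - l1_penalty(4))  # 1
    print(l2_penalty(5) - l2_penalty(4))  # 9

    # Reward for shrinking a parameter from 1 to 0
    print(l1_penalty(1) - l1_penalty(0))  # 1
    print(l2_penalty(1) - l2_penalty(0))  # 1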

As a note, remember that the model only cares about relative rewards. The absolute value of a reward is irrelevant to us because the lambda parameter can always scale it up or down. What is important is how much of a decrease or increase a model will gain from a certain change in parameter.

Hence, in using L2, the model may decide that it is worth ‘keeping a feature’ (not discarding, or reducing the parameter to 0) because:

  • It provides a substantial amount of predictive power (decreases the first term in the loss, E[(f(x)−y)²]).

  • There isn’t much gain to be had from reducing the parameter’s value because it is already close to 0.

  • Reducing the parameter would sacrifice the gain in the first term in exchange for a much smaller gain in the second term of the loss function (λE[β²]).

So, in general:

  • L1 regularization will produce simpler models with fewer features, since it provides consistent rewards to reduce parameter values. It can be thought of as a ‘natural’ method of variable selection, and can, for example, remove multicollinear variables.

  • L2 regularization will produce more complex models with parameters near but likely not at zero, since it provides diminishing rewards to reduce parameter values. It’s able to learn more complex data patterns.

Both regularization methods reduce the model’s ability to overfit by preventing each of the parameters from having too big an effect on the final result, but lead to two different outcomes.

If you’re looking for a simple and lightweight model, L1 regularization is the way to go. It takes a no-nonsense approach towards eliminating variables that have no profound impact on the output. In regression, this is known as ‘LASSO regression’, and is available in standard libraries like scikit-learn.

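As a minimal sketch of what this might look like in scikit-learn (the toy data below is invented purely for illustration; alpha plays the role of λ):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Toy data: 100 samples, five features (think hw, sleep, score, gpa, food)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only two features matter

    lasso = Lasso(alpha=0.1)  # alpha is the regularization strength (the lambda above)
    lasso.fit(X, y)
    print(lasso.coef_)  # coefficients of the irrelevant features are typically driven to exactly 0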

On the other hand, if your task is more complicated, for example with regularization in neural networks, it’s likely a bad idea to use L1, which could kill numerous parameters (weights) by setting them to zero. L2 regularization is usually recommended in neural networks because it acts as a guardrail but doesn’t interfere too much with the complex workings of neurons.

L2-regularized regression is known as ‘ridge regression’, and is available in standard libraries. In neural networks, dropout is often preferred to L2 regularization as more ‘natural’, but there are many use cases in which the latter or both should be used.

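A corresponding sketch with ridge regression, using the same invented data, shows the contrast: the coefficients shrink toward zero but are rarely exactly zero.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    ridge = Ridge(alpha=1.0)  # alpha again controls the regularization strength
    ridge.fit(X, y)
    print(ridge.coef_)  # small but generally nonzero weights remain on the irrelevant features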

Key Points

  • Regularization prevents overfitting in machine learning models by reducing the overall impact any feature can have on the outcome.

  • Both L1 and L2 regularization continuously pressure the model to reduce its parameters. The former gives constant rewards, while the latter gives diminishing rewards the closer the parameter is to 0.

  • Regularization forces the model to repeatedly compare the predictive power a feature brings to how much it increases the regularization term, which results in the model choosing to keep more important features.

  • L1 will result in many less relevant variables being eliminated altogether (coefficient set to 0), whereas L2 will result in less relevant variables still existing but with small coefficients.

Thanks for reading!

Translated from: https://towardsdatascience.com/exploring-the-simple-satisfying-math-behind-regularization-2c947755d19f
