Deep Learning: Regularization (Part 1)

Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization. In fact, developing more effective regularization strategies has been one of the major research efforts in the field.

  • We defined regularization as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”
  • There are many regularization strategies.
    (1) Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values.
    (2) Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values.
    (3) Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge.
    (4) Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization.
  • In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance.
  • An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
  • When we discussed generalization and overfitting, we focused on three situations:
    (1) excluded the true data generating process—corresponding to underfitting and inducing bias
    (2) matched the true data generating process
    (3) included the generating process but also many other possible generating processes—the overfitting regime where variance rather than bias dominates the estimation error.
    The goal of regularization is to take a model from the third regime into the second regime.
  • Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).
  • What this means is that controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, we might find—and indeed in practical deep learning scenarios, we almost always do find—that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J~:

J~(θ;X,y)=J(θ;X,y)+αΩ(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J(θ; X, y).

  • When our training algorithm minimizes the regularized objective function J~ , it will decrease both the original objective J on the training data and some measure of the size of the parameters θ (or some subset of the parameters).
  • Different choices for the parameter norm Ω can result in different solutions being preferred.
  • We note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. Reasons:
    (1) The biases typically require less data to fit accurately than the weights.
    (2) Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions.
    (3) Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized.
    (4) Also, regularizing the bias parameters can introduce a significant amount of underfitting.

  • We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters. A minimal sketch of this convention follows.
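
To make the convention concrete, here is a minimal NumPy sketch of the regularized objective J~(θ; X, y) = J(θ; X, y) + αΩ(w), in which only the weight vector w enters the penalty and the bias b is left unregularized. The helper names (penalized_objective, squared_error, l2_penalty) and the synthetic data are illustrative assumptions, not something defined in the text.

```python
import numpy as np

def squared_error(w, b, X, y):
    """Unregularized objective J(theta; X, y): mean squared error of a linear model."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual ** 2)

def l2_penalty(w):
    """One possible norm penalty Omega(w): here (1/2)||w||_2^2."""
    return 0.5 * np.dot(w, w)

def penalized_objective(w, b, X, y, alpha, omega=l2_penalty):
    """Regularized objective J~(theta; X, y) = J(theta; X, y) + alpha * Omega(w).

    Only the weights w are penalized; the bias b is left unregularized,
    as discussed in the bullet points above."""
    return squared_error(w, b, X, y) + alpha * omega(w)

# Tiny synthetic example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
w, b = rng.normal(size=5), 0.0
print(penalized_objective(w, b, X, y, alpha=0.1))
```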

L2 Parameter Regularization

One of the simplest and most common kinds of parameter norm penalty is the L2 parameter norm penalty, commonly known as weight decay:

Ω(θ) = (1/2)||w||₂²

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:
J~(w; X, y) = (α/2)wᵀw + J(w; X, y)

with the corresponding parameter gradient
∇_w J~(w; X, y) = αw + ∇_w J(w; X, y)

To take a single gradient step to update the weights, we perform this update:
w ← w − ϵ(αw + ∇_w J(w; X, y))

Written another way, the update is:
w ← (1 − ϵα)w − ϵ ∇_w J(w; X, y)

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update.
L2 regularization can also be seen as causing the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
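
As a numerical check of the rewritten update rule, the following NumPy sketch (the squared-error loss and the names grad_J and weight_decay_step are assumptions made for illustration) verifies that shrinking w by the factor (1 − ϵα) and then taking the usual gradient step produces exactly the same weights as stepping along the gradient of the penalized objective.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the unregularized squared-error objective J(w; X, y)."""
    return X.T @ (X @ w - y) / len(y)

def weight_decay_step(w, X, y, alpha, eps):
    """One step of the rewritten update: multiplicatively shrink the weights
    by (1 - eps * alpha), then apply the usual gradient update."""
    return (1.0 - eps * alpha) * w - eps * grad_J(w, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=50)
w = rng.normal(size=3)

alpha, eps = 0.3, 0.1
direct = w - eps * (alpha * w + grad_J(w, X, y))    # w <- w - eps*(alpha*w + grad J)
rewritten = weight_decay_step(w, X, y, alpha, eps)  # w <- (1 - eps*alpha)*w - eps*grad J
print(np.allclose(direct, rewritten))  # True: the two forms are the same update
```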

L1 Regularization

Another option is to use L1 regularization. Formally, L1 regularization on the model parameter w is defined as:

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|

that is, as the sum of absolute values of the individual parameters.
As with L2 weight decay, L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function J~(w;X,y) is given by
J~(w; X, y) = α||w||₁ + J(w; X, y)

with the corresponding gradient (actually, sub-gradient):
∇_w J~(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is simply the sign of w applied element-wise.
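
A minimal NumPy sketch of this subgradient update follows (the squared-error loss, the synthetic data, and the names grad_J and l1_subgradient_step are illustrative assumptions). It also hints at the behavior discussed in the bullet points below: the weights on features that do not influence the target are pushed toward zero.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the unregularized squared-error objective J(w; X, y)."""
    return X.T @ (X @ w - y) / len(y)

def l1_subgradient_step(w, X, y, alpha, eps):
    """One subgradient step on J~(w; X, y) = alpha * ||w||_1 + J(w; X, y).

    The penalty contributes alpha * sign(w): a constant-magnitude push
    toward zero, regardless of how large each weight currently is."""
    return w - eps * (alpha * np.sign(w) + grad_J(w, X, y))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only the first and third features actually influence y.
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.05 * rng.normal(size=200)

w = rng.normal(size=4)
for _ in range(2000):
    w = l1_subgradient_step(w, X, y, alpha=0.5, eps=0.05)
print(np.round(w, 3))  # weights on the irrelevant features end up near zero
```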

  • We can see immediately that the effect of L1 regularization is quite different from that of L2 regularization. Specifically, the regularization contribution to the gradient no longer scales linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ).
  • In comparison to L2 regularization, L1 regularization results in a solution that is more sparse.
  • The sparsity property induced by L1 regularization has been used extensively as a feature selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.
  • We saw that many regularization strategies can be interpreted as MAP Bayesian inference:
    (1) L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.
    (2) For L1 regularization, the penalty αΩ(w) = α Σᵢ |wᵢ| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution over w ∈ ℝⁿ (the expansion is written out below):
    log p(w) = Σᵢ log Laplace(wᵢ; 0, 1/α) = −α||w||₁ + n log α − n log 2
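
For completeness, here is the expansion behind that log-prior, using the Laplace density Laplace(wᵢ; 0, 1/α) = (α/2) exp(−α|wᵢ|):

log p(w) = Σᵢ log[(α/2) exp(−α|wᵢ|)] = Σᵢ (log α − log 2 − α|wᵢ|) = −α||w||₁ + n log α − n log 2

The terms n log α − n log 2 do not depend on w, so maximizing this log-prior alongside the likelihood penalizes the cost function by exactly α||w||₁.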