Deep Learning: Regularization (Part 1)

Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization. In fact, developing more effective regularization strategies has been one of the major research efforts in the field.

  • We defined regularization as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”
  • There are many regularization strategies.
    (1) Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values.
    (2) Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values.
    (3) Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge.
    (4) Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization.
  • In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance.
  • An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
  • When we discussed generalization and overfitting, we focused on three situations:
    (1) excluded the true data generating process—corresponding to underfitting and inducing bias
    (2) matched the true data generating process
    (3) included the generating process but also many other possible generating processes—the overfitting regime where variance rather than bias dominates the estimation error.
    The goal of regularization is to take a model from the third regime into the second regime.
  • Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).
  • What this means is that controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, we might find—and indeed in practical deep learning scenarios, we almost always do find—that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

Parameter Norm Penalties

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J~:

J~(θ;X,y)=J(θ;X,y)+αΩ(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term Ω relative to the standard objective function J(θ; X, y).

  • When our training algorithm minimizes the regularized objective function J~ , it will decrease both the original objective J on the training data and some measure of the size of the parameters θ (or some subset of the parameters).
  • Different choices for the parameter norm Ω can result in different solutions being preferred.
  • We note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. Reasons:
    (1) The biases typically require less data to fit accurately than the weights.
    (2) Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions.
    (3) Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized.
    (4) Also, regularizing the bias parameters can introduce a significant amount of underfitting.

  • We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters. A minimal sketch of this convention follows.
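
To make the convention concrete, here is a minimal NumPy sketch of the regularized objective J~(θ; X, y) = J(θ; X, y) + αΩ(w), in which only the weight vector w enters the penalty and the bias b is left unregularized. The helper names (penalized_objective, squared_error, l2_penalty) and the synthetic data are illustrative assumptions, not something defined in the text.

```python
import numpy as np

def squared_error(w, b, X, y):
    """Unregularized objective J(theta; X, y): mean squared error of a linear model."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual ** 2)

def l2_penalty(w):
    """One possible norm penalty Omega(w): here (1/2)||w||_2^2."""
    return 0.5 * np.dot(w, w)

def penalized_objective(w, b, X, y, alpha, omega=l2_penalty):
    """Regularized objective J~(theta; X, y) = J(theta; X, y) + alpha * Omega(w).

    Only the weights w are penalized; the bias b is left unregularized,
    as discussed in the bullet points above."""
    return squared_error(w, b, X, y) + alpha * omega(w)

# Tiny synthetic example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
w, b = rng.normal(size=5), 0.0
print(penalized_objective(w, b, X, y, alpha=0.1))
```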

L2 Parameter Regularization

One of the simplest and most common kinds of parameter norm penalty is the L2 parameter norm penalty, commonly known as weight decay:

Ω(θ) = (1/2)||w||₂²

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:
J~(w; X, y) = (α/2)wᵀw + J(w; X, y)

with the corresponding parameter gradient
∇_w J~(w; X, y) = αw + ∇_w J(w; X, y)

To take a single gradient step to update the weights, we perform this update:
w ← w − ϵ(αw + ∇_w J(w; X, y))

Written another way, the update is:
w ← (1 − ϵα)w − ϵ ∇_w J(w; X, y)

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update.
L2 regularization can also be seen as causing the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
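
As a numerical check of the rewritten update rule, the following NumPy sketch (the squared-error loss and the names grad_J and weight_decay_step are assumptions made for illustration) verifies that shrinking w by the factor (1 − ϵα) and then taking the usual gradient step produces exactly the same weights as stepping along the gradient of the penalized objective.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the unregularized squared-error objective J(w; X, y)."""
    return X.T @ (X @ w - y) / len(y)

def weight_decay_step(w, X, y, alpha, eps):
    """One step of the rewritten update: multiplicatively shrink the weights
    by (1 - eps * alpha), then apply the usual gradient update."""
    return (1.0 - eps * alpha) * w - eps * grad_J(w, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=50)
w = rng.normal(size=3)

alpha, eps = 0.3, 0.1
direct = w - eps * (alpha * w + grad_J(w, X, y))    # w <- w - eps*(alpha*w + grad J)
rewritten = weight_decay_step(w, X, y, alpha, eps)  # w <- (1 - eps*alpha)*w - eps*grad J
print(np.allclose(direct, rewritten))  # True: the two forms are the same update
```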

L1 Regularization

Another option is to use L1 regularization. Formally, L1 regularization on the model parameter w is defined as:

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|

that is, as the sum of absolute values of the individual parameters.
As with L2 weight decay, L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function J~(w;X,y) is given by
J~(w; X, y) = α||w||₁ + J(w; X, y)

with the corresponding gradient (actually, sub-gradient):
∇_w J~(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is simply the sign of w applied element-wise.
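
A minimal NumPy sketch of this subgradient update follows (the squared-error loss, the synthetic data, and the names grad_J and l1_subgradient_step are illustrative assumptions). It also hints at the behavior discussed in the bullet points below: the weights on features that do not influence the target are pushed toward zero.

```python
import numpy as np

def grad_J(w, X, y):
    """Gradient of the unregularized squared-error objective J(w; X, y)."""
    return X.T @ (X @ w - y) / len(y)

def l1_subgradient_step(w, X, y, alpha, eps):
    """One subgradient step on J~(w; X, y) = alpha * ||w||_1 + J(w; X, y).

    The penalty contributes alpha * sign(w): a constant-magnitude push
    toward zero, regardless of how large each weight currently is."""
    return w - eps * (alpha * np.sign(w) + grad_J(w, X, y))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only the first and third features actually influence y.
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + 0.05 * rng.normal(size=200)

w = rng.normal(size=4)
for _ in range(2000):
    w = l1_subgradient_step(w, X, y, alpha=0.5, eps=0.05)
print(np.round(w, 3))  # weights on the irrelevant features end up near zero
```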

  • We can see immediately that the effect of L1 regularization is quite different from that of L2 regularization. Specifically, the regularization contribution to the gradient no longer scales linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ).
  • In comparison to L2 regularization, L1 regularization results in a solution that is more sparse.
  • The sparsity property induced by L1 regularization has been used extensively as a feature selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used.
  • We saw that many regularization strategies can be interpreted as MAP Bayesian inference:
    (1) L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.
    (2) For L1 regularization, the penalty αΩ(w) = α Σᵢ |wᵢ| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution over w ∈ ℝⁿ (the expansion is written out below):
    log p(w) = Σᵢ log Laplace(wᵢ; 0, 1/α) = −α||w||₁ + n log α − n log 2
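
For completeness, here is the expansion behind that log-prior, using the Laplace density Laplace(wᵢ; 0, 1/α) = (α/2) exp(−α|wᵢ|):

log p(w) = Σᵢ log[(α/2) exp(−α|wᵢ|)] = Σᵢ (log α − log 2 − α|wᵢ|) = −α||w||₁ + n log α − n log 2

The terms n log α − n log 2 do not depend on w, so maximizing this log-prior alongside the likelihood penalizes the cost function by exactly α||w||₁.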