Why L1 norm for sparse models?

This post examines the role of L1 and L2 regularization in machine learning models, in particular how they affect the weight values and push a model towards sparsity. During gradient-descent weight updates, L1 regularization is more effective at driving the weights of unimportant features all the way to zero, whereas L2 regularization tends to shrink all weights more uniformly.

Explanation 1

Consider the vector $\vec{x} = (1, \varepsilon) \in \mathbb{R}^2$, where $\varepsilon > 0$ is small. The $l_1$ and $l_2$ norms of $\vec{x}$, respectively, are given by

$$\|\vec{x}\|_1 = 1 + \varepsilon, \qquad \|\vec{x}\|_2^2 = 1 + \varepsilon^2$$

Now say that, as part of some regularization procedure, we are going to reduce the magnitude of one of the elements of $\vec{x}$ by $\delta \leq \varepsilon$. If we change $x_1$ to $1 - \delta$, the resulting norms are

$$\|\vec{x}(\delta, 0)\|_1 = 1 - \delta + \varepsilon, \qquad \|\vec{x}(\delta, 0)\|_2^2 = 1 - 2\delta + \delta^2 + \varepsilon^2$$

On the other hand, reducing x2 by δ gives norms

$$\|\vec{x}(0, \delta)\|_1 = 1 - \delta + \varepsilon, \qquad \|\vec{x}(0, \delta)\|_2^2 = 1 - 2\varepsilon\delta + \delta^2 + \varepsilon^2$$

The thing to notice here is that, for an $l_2$ penalty, regularizing the larger term $x_1$ results in a much greater reduction in norm than doing so to the smaller term $x_2 \approx 0$. For the $l_1$ penalty, however, the reduction is the same. Thus, when penalizing a model using the $l_2$ norm, it is highly unlikely that anything will ever be set to zero, since the reduction in $l_2$ norm going from $\varepsilon$ to $0$ is almost nonexistent when $\varepsilon$ is small. On the other hand, the reduction in $l_1$ norm is always equal to $\delta$, regardless of the quantity being penalized.

Another way to think of it: it's not so much that $l_1$ penalties encourage sparsity, but that $l_2$ penalties in some sense discourage sparsity by yielding diminishing returns as elements are moved closer to zero.
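
To see this numerically, here is a minimal NumPy sketch of the comparison above; the particular values of $\varepsilon$ and $\delta$ are an arbitrary illustrative choice, not something from the original argument.

```python
import numpy as np

eps, delta = 0.01, 0.01          # small values chosen only for illustration

x = np.array([1.0, eps])
x_shrink_large = np.array([1.0 - delta, eps])   # reduce the large element x1
x_shrink_small = np.array([1.0, eps - delta])   # reduce the small element x2

l1 = lambda v: np.abs(v).sum()
l2_sq = lambda v: (v ** 2).sum()

# The l1 reduction is delta in both cases ...
print(l1(x) - l1(x_shrink_large))    # ~0.01
print(l1(x) - l1(x_shrink_small))    # ~0.01

# ... while the squared-l2 reduction is far larger when shrinking x1.
print(l2_sq(x) - l2_sq(x_shrink_large))   # 2*delta - delta^2       ~ 0.0199
print(l2_sq(x) - l2_sq(x_shrink_small))   # 2*eps*delta - delta^2   ~ 0.0001
```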

Explanation 2

By a sparse model, we mean a model in which many of the weights are 0. Let us therefore reason about how L1 regularization is more likely to create zero weights.

Consider a model consisting of the weights $(w_1, w_2, \dots, w_m)$.

With L1 regularization, you penalize the model by a loss function $L_1(w) = \sum_i |w_i|$.

With L2 regularization, you penalize the model by a loss function $L_2(w) = \frac{1}{2}\sum_i w_i^2$.

If using gradient descent, you will iteratively make the weights change in the opposite direction of the gradient with a step size η . Let us look at the gradients:

$\frac{dL_1(w)}{dw} = \mathrm{sign}(w)$, where $\mathrm{sign}(w) = \left(\frac{w_1}{|w_1|}, \frac{w_2}{|w_2|}, \dots, \frac{w_m}{|w_m|}\right)$

$\frac{dL_2(w)}{dw} = w$
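
As a small illustration (a sketch, not any particular library's implementation), the two penalties and their gradients can be written directly in NumPy; the function names below are my own.

```python
import numpy as np

def l1_penalty(w):
    return np.abs(w).sum()

def l1_grad(w):
    # sign(w): +1 or -1 elementwise; np.sign returns 0 at w_i == 0
    return np.sign(w)

def l2_penalty(w):
    return 0.5 * (w ** 2).sum()

def l2_grad(w):
    return w

w = np.array([5.0, -0.3, 0.001])
print(l1_grad(w))   # [ 1. -1.  1.]   constant magnitude, even for tiny weights
print(l2_grad(w))   # [5.0, -0.3, 0.001]   shrinks as the weight shrinks
```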

If we plot the loss function and its derivative for a model consisting of just a single parameter, it looks like this for L1:

[Figure: the L1 loss $|w_1|$ and its derivative $\mathrm{sign}(w_1)$, which has constant magnitude 1]

And like this for L2:

[Figure: the L2 loss $\frac{1}{2}w_1^2$ and its derivative $w_1$, which decreases linearly towards 0]
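
The original figures are not reproduced here; a short matplotlib sketch along the following lines (single parameter $w_1$, axis range chosen arbitrarily) produces equivalent plots.

```python
import numpy as np
import matplotlib.pyplot as plt

# Each penalty and its derivative for a single weight w1.
w1 = np.linspace(-3, 3, 601)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(w1, np.abs(w1), label=r"$L_1(w_1) = |w_1|$")
axes[0].plot(w1, np.sign(w1), label=r"$dL_1/dw_1 = \mathrm{sign}(w_1)$")
axes[0].set_title("L1 penalty")
axes[0].legend()

axes[1].plot(w1, 0.5 * w1 ** 2, label=r"$L_2(w_1) = \frac{1}{2}w_1^2$")
axes[1].plot(w1, w1, label=r"$dL_2/dw_1 = w_1$")
axes[1].set_title("L2 penalty")
axes[1].legend()

plt.show()
```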

Notice that for $L_1$, the gradient is either 1 or -1, except when $w_1 = 0$. That means that L1 regularization will move any weight towards 0 with the same step size, regardless of the weight's value. In contrast, you can see that the L2 gradient decreases linearly towards 0 as the weight goes towards 0. Therefore, L2 regularization will also move any weight towards 0, but it will take smaller and smaller steps as the weight approaches 0.

Try to imagine that you start with a model with $w_1 = 5$ and use $\eta = \frac{1}{2}$. In the following picture, you can see how gradient descent with L1 regularization makes 10 updates $w_1 := w_1 - \eta \cdot \frac{dL_1(w)}{dw} = w_1 - 0.5 \cdot 1$, until it reaches a model with $w_1 = 0$:

[Figure: $w_1$ decreasing from 5 to 0 in equal steps of 0.5 under L1 regularization]

In contrast, with L2 regularization where $\eta = \frac{1}{2}$, the gradient is $w_1$, causing every step to move only halfway towards 0. That is, we make the update $w_1 := w_1 - \eta \cdot \frac{dL_2(w)}{dw} = w_1 - 0.5 \cdot w_1$.

Therefore, the model never reaches a weight of 0, regardless of how many steps we take:

[Figure: $w_1$ halving at every step under L2 regularization, approaching but never reaching 0]

Note that L2 regularization can make a weight reach zero if the step size $\eta$ is so large that it reaches or overshoots zero in a single step. However, the loss function will also contain a term measuring the error of the model with respect to the given weights, and that term will also affect the gradient and hence the change in the weights. What is shown in this example is just how the two types of regularization, on their own, contribute to a change in the weights.
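
The two update rules above are easy to replay in a few lines. This sketch tracks only the regularization term and ignores the data-fit term of the loss, exactly as in the example, so it is an illustration rather than a full training loop.

```python
eta = 0.5   # step size from the example

# L1: subtract eta * sign(w1) each step -> reaches 0 after exactly 10 steps
w1 = 5.0
for _ in range(10):
    w1 -= eta * (1.0 if w1 > 0 else -1.0)
print(w1)   # 0.0

# L2: subtract eta * w1 each step -> halves every step, never exactly 0
w1 = 5.0
for _ in range(10):
    w1 -= eta * w1
print(w1)   # 5 * 0.5**10 ~ 0.00488
```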
