Distiller: Regularization

Yan_Joy


Regularization

The book Deep Learning¹ defines regularization as follows:

“any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.”

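PyTorch's optimizers use $l_2$ parameter regularization to limit the capacity of the model (i.e. to reduce variance).

In general, we can write this as:
$loss(W;x;y) = loss_D(W;x;y) + \lambda_R R(W)$
and specifically:
$loss(W;x;y) = loss_D(W;x;y) + \lambda_R \lVert W \rVert_2^2$

where $W$ is the collection of all weight elements in the network (i.e. model.parameters()), $loss(W;x;y)$ is the total training loss, and $loss_D(W;x;y)$ is the data loss (i.e. the error of the objective function, also called the loss function, or criterion in Distiller's sample image classifier compression application).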

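import torch.nn as nn
import torch.optim as optim

# weight_decay is the l2 regularization strength applied by the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0001)
criterion = nn.CrossEntropyLoss()
...
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    # data loss only; the l2 term is added inside optimizer.step()
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()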

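$\lambda_R$ is a scalar called the regularization strength; it balances the data error against the regularization error. In PyTorch, it is set through the optimizer's weight_decay argument.

$\lVert W \rVert_2^2$ is the square of the $l_2$-norm of $W$, and as such is a measure of the magnitude (size) of the tensor:
$\lVert W \rVert_2^2 = \sum_{l=1}^{L} \sum_{i=1}^{n} |w_{l,i}|^2 \;\; \text{where} \; n = \text{torch.numel}(w_l)$

$L$ is the number of layers in the network.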
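
The squared norm above is straightforward to compute from model.parameters(). The following is a minimal sketch, not Distiller code: squared_l2_norm is a hypothetical helper and the small nn.Linear model is only for illustration. It also checks one detail that is easy to miss: plain SGD adds weight_decay * w to each gradient, which matches an explicit penalty of (weight_decay / 2) * ||W||_2^2, so the two ways of writing the strength differ by a constant factor of 2.

```python
import torch
import torch.nn as nn


def squared_l2_norm(model: nn.Module) -> torch.Tensor:
    """Sum of squared elements over every parameter tensor of the model."""
    return sum(p.pow(2).sum() for p in model.parameters())


torch.manual_seed(0)
model = nn.Linear(4, 3)  # toy model, for illustration only
weight_decay = 1e-4

# Explicit penalty whose gradient equals what SGD's weight_decay adds:
# d/dw [(weight_decay / 2) * ||W||_2^2] = weight_decay * w
penalty = 0.5 * weight_decay * squared_l2_norm(model)
penalty.backward()

# Largest deviation between the autograd gradient and weight_decay * w
max_diff = max(
    (p.grad - weight_decay * p.detach()).abs().max().item()
    for p in model.parameters()
)
print(f"||W||_2^2 = {squared_l2_norm(model).item():.4f}")
print(f"max |grad - weight_decay * w| = {max_diff:.2e}")
```

Note that model.parameters() includes biases as well as weights, so this sketch penalizes both.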

The qualitative differences between the $l_2$-norm and the squared $l_2$-norm are explained in the book Deep Learning.
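
One way to see the difference, given here as a short worked derivation rather than a quotation from the book, is to compare the gradients of the two penalties for a single weight vector $w$:
$\nabla_w \lVert w \rVert_2^2 = 2w$
$\nabla_w \lVert w \rVert_2 = \frac{w}{\lVert w \rVert_2} \quad (w \neq 0)$
The squared norm pulls each element toward zero in proportion to its own value, so the pull fades as the element shrinks. The plain norm pulls the whole vector with unit magnitude regardless of how small it already is.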

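Sparsity and Regularization

We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods.

In Dense-Sparse-Dense (DSD)², pruning is used as a regularizer to improve model accuracy: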

“Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.”

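Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the $l_1$-norm, $\lVert W \rVert_1$.

$\lVert W \rVert_1 = l_1(W) = \sum_{i=1}^{|W|} |w_i|$

$l_2$-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to exactly zero. $l_1$-norm regularization sets some of the parameter elements to zero, which limits the model's capacity while making the model simpler. This is sometimes referred to as feature selection, and it gives us another interpretation of pruning.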
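
The contrast between shrinking and zeroing can be illustrated with a small, dependency-free sketch. It uses proximal update steps, which are a swapped-in textbook technique and not how Distiller or PyTorch's SGD apply the penalties; the function names, the weights, and the step size are all made up for illustration.

```python
def l2_shrink(weights, step):
    """Proximal step for the penalty step * w**2: rescales every weight toward zero."""
    return [w / (1.0 + 2.0 * step) for w in weights]


def l1_shrink(weights, step):
    """Proximal step for the penalty step * |w| (soft-thresholding)."""
    shrunk = []
    for w in weights:
        if w > step:
            shrunk.append(w - step)
        elif w < -step:
            shrunk.append(w + step)
        else:
            shrunk.append(0.0)  # weights within `step` of zero become exactly zero
    return shrunk


weights = [0.8, -0.05, 0.02, -0.6, 0.09]  # illustrative values
step = 0.1

l2_result = l2_shrink(weights, step)
l1_result = l1_shrink(weights, step)

print("l2:", l2_result)
print("l1:", l1_result)
print("exact zeros after l2:", sum(1 for w in l2_result if w == 0.0))
print("exact zeros after l1:", sum(1 for w in l1_result if w == 0.0))
```

In this sketch the squared penalty only rescales the weights, while the absolute-value penalty removes the three small ones entirely, which is the feature-selection behavior described above.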

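One of Distiller's Jupyter notebooks explains how the $l_1$-norm regularizer induces sparsity, and how it interacts with $l_2$-norm regularization.

If we set weight_decay to zero and use $l_1$-norm regularization, we have:
$loss(W;x;y) = loss_D(W;x;y) + \lambda_R \lVert W \rVert_1$
If we use both regularizers, we have:
$loss(W;x;y) = loss_D(W;x;y) + \lambda_{R_2} \lVert W \rVert_2^2 + \lambda_{R_1} \lVert W \rVert_1$

The class distiller.L1Regularizer implements $l_1$-norm regularization, and regularization can of course also be applied through a schedule.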

l1_regularizer = distiller.s(model.parameters())
...
# lambda_r is the l1 regularization strength (`lambda` itself is a reserved word in Python)
loss = criterion(output, target) + lambda_r * l1_regularizer()
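
The combined objective with both penalties can also be written directly in plain PyTorch, independently of Distiller's regularizer classes. The following is a minimal sketch under that assumption: regularized_loss, the toy model, the random data, and the strength values are all hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


def regularized_loss(data_loss, params, lambda_l1=0.0, lambda_l2=0.0):
    """Return loss_D + lambda_l2 * ||W||_2^2 + lambda_l1 * ||W||_1."""
    params = list(params)  # model.parameters() is a generator; reuse it twice
    l1 = sum(p.abs().sum() for p in params)
    l2 = sum(p.pow(2).sum() for p in params)
    return data_loss + lambda_l2 * l2 + lambda_l1 * l1


torch.manual_seed(0)
model = nn.Linear(8, 3)  # toy classifier, for illustration only
criterion = nn.CrossEntropyLoss()
# weight_decay=0 so the optimizer does not add a second l2 term
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.0)

inputs = torch.randn(16, 8)
targets = torch.randint(0, 3, (16,))

optimizer.zero_grad()
data_loss = criterion(model(inputs), targets)
loss = regularized_loss(data_loss, model.parameters(), lambda_l1=1e-3, lambda_l2=1e-4)
loss.backward()
optimizer.step()
print(f"data loss {data_loss.item():.4f}, total loss {loss.item():.4f}")
```

Keeping both strengths as explicit arguments makes it easy to switch either penalty off by passing zero.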

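Group Regularization

In Group Regularization, we penalize entire groups of parameter elements instead of individual elements. As a result, an entire group is either sparsified (i.e. all of its elements are zero) or not. The group structures must be defined in advance.

$loss(W;x;y) = loss_D(W;x;y) + \lambda_R R(W) + \lambda_g \sum_{l=1}^{L} R_g(W_l^{(G)})$
Here $W_l^{(G)}$ denotes the parameter groups of layer $l$. Let us denote all of the weight elements in group $g$ as $w^{(g)}$.

$R_g(W_l^{(G)}) = \sum_{g=1}^{G} \lVert w^{(g)} \rVert_2 = \sum_{g=1}^{G} \sqrt{\sum_{i=1}^{|w^{(g)}|} (w_i^{(g)})^2}$
where $w^{(g)} \in w^{(l)}$ and $|w^{(g)}|$ is the number of elements in $w^{(g)}$.

$\lambda_g \sum_{l=1}^{L} R_g(W_l^{(G)})$ is called the Group Lasso regularizer. Just as in $l_1$-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups).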
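
As a concrete case, the groups can be the output filters of a convolution layer. The sketch below is not Distiller's GroupLassoRegularizer: filter_group_lasso is a hypothetical helper that assumes the usual Group Lasso form, i.e. the sum of the (unsquared) $l_2$-norms of the groups, and the small eps inside the square root is an added assumption to keep the gradient finite for all-zero groups.

```python
import torch
import torch.nn as nn


def filter_group_lasso(conv_weight: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sum of l2 norms, one per output filter of a 4D weight (out, in, kh, kw)."""
    per_filter_sq = conv_weight.pow(2).flatten(start_dim=1).sum(dim=1)
    return (per_filter_sq + eps).sqrt().sum()


torch.manual_seed(0)
conv = nn.Conv2d(3, 4, kernel_size=3, bias=False)  # toy layer with 4 filters
with torch.no_grad():
    conv.weight[1].zero_()  # one group that is already fully sparse

penalty = filter_group_lasso(conv.weight)
per_filter = conv.weight.detach().flatten(start_dim=1).norm(dim=1)

print("per-filter magnitudes:", per_filter)
print("group lasso penalty:", penalty.item())
```

The zeroed filter contributes (almost) nothing to the penalty, while every other filter contributes its whole magnitude, so the penalty acts on filters as units and not on individual elements.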

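Group Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and can therefore help improve inference speed.

Huizi-et-al-2017³ provides an overview of some of the different groups: kernel, channel, filter, layer, and so on. Structures such as matrix columns and rows can also be used, as can variously shaped structures (block sparsity) and even intra kernel strided sparsity⁴.

distiller.GroupLassoRegularizer currently implements most of these groups, and new groups can easily be added.

References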

1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
2. Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally. DSD: Dense-Sparse-Dense Training for Deep Neural Networks. arXiv:1607.04381v2, 2017.
3. Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. arXiv:1705.08922v3, 2017.
4. Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. arXiv:1512.08571, 2015.
