LAYER-ADAPTIVE SPARSITY FOR THE MAGNITUDE-BASED PRUNING (LAMP): Translation

算小法白

Last modified: 2024-01-18 11:48:54

Tags: pruning, algorithms, machine learning

First published: 2024-01-13 14:23:01

Article link: https://blog.csdn.net/m0_68178753/article/details/135560983

LAYER-ADAPTIVE SPARSITY FOR THE MAGNITUDE-BASED PRUNING

ABSTRACT

Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, a simple magnitude-based pruning achieves state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on “how to choose,” the layerwise sparsities are mostly selected algorithm-by-algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score; the score is a rescaled version of weight magnitude that incorporates the model-level l2 distortion incurred by pruning, and does not require any hyperparameter tuning or heavy computation. Under various image classification setups, LAMP consistently outperforms popular existing schemes for layerwise sparsity selection. Furthermore, we observe that LAMP continues to outperform baselines even in weight-rewinding setups, while the connectivity-oriented layerwise sparsity (the strongest baseline overall) performs worse than a simple global magnitude-based pruning in this case. Code: https://github.com/jaeho-lee/layer-adaptive-sparsity

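Summary: The abstract first notes the practicality of simple magnitude-based pruning, then points out an open problem with it: there is no consensus on how to choose the layerwise sparsity. The paper therefore proposes layer-adaptive magnitude-based pruning (LAMP). Experiments show that the proposed method outperforms earlier ones.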

1 INTRODUCTION

Neural network pruning is an art of removing “unimportant weights” from a model, with an intention to meet practical constraints (Han et al., 2015), mitigate overfitting (Hanson & Pratt, 1988), enhance interpretability (Mozer & Smolensky, 1988), or deepen our understanding of neural network training (Frankle & Carbin, 2019). Yet, the importance of a weight is still a vaguely defined notion, and thus a wide range of pruning algorithms based on various importance scores has been proposed. One popular approach is to estimate the loss increment from removing the target weight and use it as an importance score, e.g., Hessian-based approximations (LeCun et al., 1989; Hassibi & Stork, 1993; Dong et al., 2017), coreset-based estimates (Baykal et al., 2019; Mussay et al., 2020), convex optimization (Aghasi et al., 2017), and operator distortion (Park et al., 2020). Other approaches include on-the-fly regularization (Louizos et al., 2018; Xiao et al., 2019), Bayesian methods (Molchanov et al., 2017; Louizos et al., 2017; Dai et al., 2018), and reinforcement learning (Lin et al., 2017).

Recent discoveries (Gale et al., 2019; Evci et al., 2020) demonstrate that, given an appropriate choice of layerwise sparsity, simply pruning on the basis of weight magnitude yields a surprisingly powerful unstructured pruning scheme. For instance, Gale et al. (2019) evaluate the performance of magnitude-based pruning (MP; Han et al. (2015); Zhu & Gupta (2018)) with an extensive hyperparameter tuning, and show that MP achieves comparable or better performance than state-of-the-art pruning algorithms that use more complicated importance scores. To arrive at such a performance level, the authors introduce the following handcrafted heuristic: Leave the first convolutional layer fully dense, and prune up to only 80% of weights from the last fully-connected layer; the heuristic is motivated by the sparsity pattern from other state-of-the-art algorithms (Molchanov et al., 2017) and additional experimental/architectural observations.

Unfortunately, there is an apparent lack of consensus on “how to choose the layerwise sparsity” for the magnitude-based pruning. Instead, the layerwise sparsity is selected mostly on an algorithm-by-algorithm basis. One common method is the global MP criterion (see, e.g., Morcos et al. (2019)), where the layerwise sparsity is automatically determined by using a single global threshold on weight magnitude. Lin et al. (2020) propose a magnitude-based pruning algorithm using a feedback signal, using a heuristic rule of keeping the last fully connected layer dense. A recent work by Evci et al. (2020) proposes a magnitude-based dynamic sparse training method, adopting layerwise sparsity inspired from the network science approach toward neural network pruning (Mocanu et al., 2018).

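Figure 1: The LAMP score is the squared weight magnitude, normalized by the sum of all “surviving weights” in the layer. Global pruning by LAMP is equivalent to layerwise magnitude-based pruning with an automatically selected layerwise sparsity.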

Contributions. In search of a “go-to” layerwise sparsity for MP, we take a model-level distortion
minimization perspective towards MP. We build on the observation of Dong et al. (2017); Park
et al. (2020) that each neural network layer can be viewed as an operator, and MP is a choice that incurs minimum l2 distortion to the operator output (given a worst-case input signal). We bring the perspective further to examine the “model-level” distortion incurred by pruning a layer; preceding layers scale the input signal to the target layer, and succeeding layers scale the output distortion.

Based on the distortion minimization framework, we propose a novel importance score for global
pruning, coined LAMP (Layer-Adaptive Magnitude-based Pruning). The LAMP score is a rescaled
weight magnitude, approximating the model-level distortion from pruning. Importantly, the LAMP
score is designed to approximate the distortion on the model being pruned, i.e., all connections with a smaller LAMP score than the target weight are already pruned. Global pruning with the LAMP score is equivalent to the MP with an automatically determined layerwise sparsity. At the same time, pruning with LAMP keeps the benefits of MP intact; the LAMP score is efficiently computable, hyperparameter-free, and does not rely on any model-specific knowledge.

We validate the effectiveness of LAMP under a diverse experimental setup, encompassing various convolutional neural network architectures (VGG-16, ResNet-18/34, DenseNet-121, EfficientNet-B0) and various image datasets (CIFAR-10/100, SVHN, Restricted ImageNet). In all considered setups, LAMP consistently outperforms the baseline layerwise sparsity selection schemes. We also perform additional ablation studies with one-shot pruning and weight-rewinding setup to confirm that LAMP performs reliably well under a wider range of scenarios.

Organization. In Section 2, we briefly describe existing methods to choose the layerwise sparsity
for magnitude-based pruning. In Section 3, we formally introduce LAMP and describe how the l2
distortion minimization perspective motivates the LAMP score. In Section 4, we empirically validate the effectiveness and versatility of LAMP. In Section 5, we take a closer look at the layerwise sparsity discovered by LAMP and compare with baseline methods and previously proposed handcrafted heuristics. In Section 6, we summarize our findings and discuss future directions. Appendices include the experimental details (Appendix A), complexity analysis (Appendix B), derivation of the LAMP score (Appendix C), additional experiments on Transformer (Appendix D), and detailed experimental results with standard deviations (Appendix E).

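Summary: The introduction first presents neural network pruning as a technique that removes “unimportant weights” from a model in order to meet practical constraints, mitigate overfitting, enhance interpretability, or deepen our understanding of neural network training. It then notes that the importance of a weight is still a vaguely defined notion and describes one currently popular approach. Recent work finds that, with a suitable choice of layerwise sparsity, simply pruning by weight magnitude yields an effective unstructured pruning scheme. The paper proposes LAMP, an importance score for global pruning that approximates the model-level distortion caused by pruning within a distortion minimization framework; experiments show that it outperforms the baseline layerwise sparsity selection schemes across a variety of setups.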

2 RELATED WORK

This section gives a (necessarily non-exhaustive) survey of various layerwise sparsity selection
schemes used for magnitude-based pruning algorithms. Magnitude-based pruning of neural networks dates back to the early works of Janowsky (1989); LeCun et al. (1989), and has been actively studied again under the context of model compression since the work of Han et al. (2015). In Han et al. (2015), the authors propose an iterative pruning scheme where the layerwise pruning threshold is determined by the standard-deviation-based heuristic. Zhu & Gupta (2018) propose a uniform pruning algorithm with a carefully tuned gradual pruning schedule combined with weight re-growths. Gale et al. (2019) refine the algorithm by adding a heuristic constraint of keeping the first convolutional layer fully dense and keeping at least 20% of the weights surviving in the last fully-connected layer.

MP has also been widely used in the context of “pruning at initialization.” Frankle & Carbin (2019)
combine MP with weight rewinding to discover efficiently trainable subnetworks: for small nets,
the authors employ uniform layerwise sparsity, but use different rates for convolutional layers and
fully-connected layers (with an added heuristic on the last fully-connected layer); for larger nets,
authors use global MP. Morcos et al. (2019) consider transferring the “winning ticket” initializations, using the global MP. Evci et al. (2020) propose a training scheme for sparsely initialized neural networks, where the layerwise sparsity is given by the Erdős-Rényi kernel method; the method generalizes the scheme initially proposed by Mocanu et al. (2018) to convolutional neural networks.

We note that there is a line of results on the trainable layerwise sparsity; we refer the interested
readers to the recent work of Kusupati et al. (2020) for a concise survey. However, we do not make direct comparisons to these methods, as our primary purpose is to deliver an easy-to-use layerwise sparsity selection scheme without requiring the modification of training objective, or an extensive hyperparameter tuning.

We also note that we focus on the unstructured sparsity. While such unstructured pruning techniques have been considered less practical (compared to structured pruning), several recent breakthroughs provide promising methods to bridge this gap; see Gale et al. (2020); Elsen et al. (2020).

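Summary: This section briefly reviews the history, the current state, and some recent research on layerwise sparsity selection schemes for neural network pruning algorithms.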

3 LAYER-ADAPTIVE MAGNITUDE-BASED PRUNING (LAMP)

We now formally introduce the Layer-Adaptive Magnitude-based Pruning (LAMP) score. Consider a depth-d feedforward neural network with weight tensors $W^{(1)}, \ldots, W^{(d)}$ associated with each fully-connected/convolutional layer. For fully-connected layers, corresponding weight tensors are two-dimensional matrices, and for 2d convolutional layers, corresponding tensors are four-dimensional. To give a unified definition of the LAMP score for both fully-connected and convolutional layers, we assume that each weight tensor is unrolled (or flattened) to a one-dimensional vector. For each of these unrolled vectors, we assume (without loss of generality) that the weights are sorted in an ascending order according to the given index map, i.e., |W[u]| ≤ |W[v]| holds whenever u < v, where W[u] denotes the entry of W mapped by the index u.

The LAMP score for the u-th index of the weight tensor W is then defined as

$$\mathrm{score}(u; W) := \frac{(W[u])^2}{\sum_{v \ge u} (W[v])^2}. \tag{1}$$

Informally, the LAMP score (Eq. 1) measures the relative importance of the target connection among all surviving connections belonging to the same layer, where the connections with a smaller weight magnitude (in the same layer) have already been pruned. As a consequence, two connections with identical weight magnitudes have different LAMP scores, depending on the index map being used.
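
The definition, squared magnitude divided by the squared mass of all weights at or above the same sorted position, can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' released implementation; the function name `lamp_scores` and the choice of a stable sort as the index map for tied magnitudes are assumptions made here.

```python
import numpy as np

def lamp_scores(weight: np.ndarray) -> np.ndarray:
    """Return the LAMP score of every entry of `weight`, in the original shape.

    Hypothetical helper: score(u) = W[u]^2 / sum_{v >= u} W[v]^2, where the
    indices u, v refer to positions after sorting by ascending magnitude.
    """
    flat_sq = weight.reshape(-1).astype(np.float64) ** 2
    # Ascending index map; a stable sort fixes the order of tied magnitudes.
    order = np.argsort(flat_sq, kind="stable")
    sorted_sq = flat_sq[order]
    # suffix_sum[u] = sum of sorted_sq[v] for v >= u, i.e. the squared mass of
    # the weights still surviving once everything smaller has been pruned.
    suffix_sum = np.cumsum(sorted_sq[::-1])[::-1]
    sorted_scores = np.divide(
        sorted_sq, suffix_sum, out=np.zeros_like(sorted_sq), where=suffix_sum > 0
    )
    # Scatter the scores back to the original positions.
    scores = np.empty_like(sorted_scores)
    scores[order] = sorted_scores
    return scores.reshape(weight.shape)

w = np.array([[1.0, -2.0], [2.0, 3.0]])
print(lamp_scores(w))
```

In this toy tensor the entries -2 and 2 have the same magnitude but receive different scores (4/17 and 4/13), because the index map places one before the other; this is the tie behaviour described in the paragraph above.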

Once the LAMP score is computed, we globally prune the connections with smallest LAMP scores until the desired global sparsity constraint is met; the procedure is equivalent to performing MP with an automatically selected layerwise sparsity. To see this, it suffices to check that

$$(W[u])^2 > (W[v])^2 \;\Rightarrow\; \mathrm{score}(u; W) > \mathrm{score}(v; W) \tag{2}$$

holds for any weight tensor W and a pair of indices u, v. From the definition of the LAMP score
(Eq. 1), it is easy to see that the logical relation (2) holds. Indeed, for the connection with a larger
weight magnitude, the denominator of Eq. 1 is smaller, while the numerator is larger.
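
The global pruning step can be sketched as below: pool the LAMP scores of all layers, keep the top fraction, and read off the resulting per-layer masks. This is an illustrative sketch under stated assumptions, not the authors' code; `lamp_scores`, `lamp_global_masks`, and `keep_fraction` are names introduced here, and ties at the threshold are all kept, so slightly more than the requested fraction may survive.

```python
import numpy as np

def lamp_scores(weight):
    """Hypothetical helper: squared weight over the squared mass of all
    weights at or above the same position in ascending-magnitude order."""
    sq = weight.reshape(-1).astype(np.float64) ** 2
    order = np.argsort(sq, kind="stable")
    suffix = np.cumsum(sq[order][::-1])[::-1]
    scores = np.empty_like(sq)
    # Guard against division by zero for all-zero tails.
    scores[order] = sq[order] / np.maximum(suffix, np.finfo(np.float64).tiny)
    return scores.reshape(weight.shape)

def lamp_global_masks(weights, keep_fraction):
    """Keep the globally highest-scoring connections; return one mask per layer."""
    scores = [lamp_scores(w) for w in weights]
    pooled = np.sort(np.concatenate([s.reshape(-1) for s in scores]))
    n_keep = max(1, int(round(keep_fraction * pooled.size)))
    threshold = pooled[-n_keep]
    return [s >= threshold for s in scores]

# Two toy "layers" whose weights differ in scale by a factor of 100.
layers = [np.array([0.1, 0.2, 0.3, 0.4]), np.array([10.0, 20.0, 30.0, 40.0])]
masks = lamp_global_masks(layers, keep_fraction=0.5)
layerwise_survival = [float(m.mean()) for m in masks]
print(masks, layerwise_survival)
```

Because the score is normalized within each layer, rescaling a whole layer does not change its scores, and within each layer the surviving set is exactly the largest-magnitude weights, which is what layerwise MP would keep at that layer's sparsity.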

We note that the global pruning with respect to the LAMP score is not identical to the global pruning with respect to the magnitude score |W[u]| (or $(W[u])^2$, equivalently). Indeed, in each layer, there exists exactly one connection with the LAMP score of 1, which is the maximum LAMP score possible. In other words, LAMP keeps at least one surviving connection in each layer. The same does not hold for the global pruning with respect to the weight magnitude score.
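
The claim about the score of 1 can be checked directly from the definition. As a short worked step, assume the layer has n entries sorted in ascending magnitude with a nonzero largest weight W[n] (the symbol n is introduced here only for illustration):

$$\mathrm{score}(n; W) = \frac{(W[n])^2}{\sum_{v \ge n} (W[v])^2} = \frac{(W[n])^2}{(W[n])^2} = 1, \qquad \mathrm{score}(u; W) = \frac{(W[u])^2}{(W[u])^2 + \sum_{v > u} (W[v])^2} < 1 \quad \text{for } u < n.$$

So the largest weight of every layer ties for the globally maximal score, whereas under a single global magnitude threshold a layer whose weights are all small can be removed entirely.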

We also note that the LAMP score is easy-to-use. Similar to the vanilla MP, the LAMP score does not have any hyperparameter to be tuned, and is easily implementable via elementary tensor operations. Furthermore, the LAMP score can be computed with only a minimal computational overhead; the sorting of squared weight magnitudes required to compute the denominator in Eq. 1 is already a part of typical vanilla MP algorithms. For more discussions, see Appendix B.

3.1 DESIGN MOTIVATION: MINIMIZING OUTPUT l2 DISTORTION

The LAMP score (Eq. 1) is motivated by the following observation: The layerwise MP is the solution of the layerwise minimization of Frobenius distortion incurred by pruning, which can be viewed as a relaxation of the output l2 distortion minimization with respect to the worst-case input. This observation leads us to the speculation “Reducing the pruning-incurred l2 distortion of the model output with respect to the worst-case output may be beneficial to the performance of the retrained model (and perhaps that is why MP works well in practice).” This speculation is not entirely new; the optimal brain damage (OBD; (LeCun et al., 1989)) is also designed around a similar philosophy of loss minimization, without a complete understanding on how the benefit of loss minimization seems to pertain after retraining.

Nevertheless, we use this speculation as a guideline to design LAMP as a natural extension of
layerwise MP to a global pruning scheme with an automatically determined layerwise sparsity. To
make arguments a bit more formal, consider a depth-d fully-connected neural net, whose output
given the input x is

$$f(x; W^{(1:d)}) = W^{(d)} \sigma\big(W^{(d-1)} \sigma\big(\cdots W^{(2)} \sigma\big(W^{(1)} x\big) \cdots\big)\big), \tag{3}$$

where σ denotes the ReLU activation, $W^{(i)}$ denotes the weight matrix for the i-th layer, and
$W^{(1:d)} = (W^{(1)}, \ldots, W^{(d)})$ denotes the set of weight matrices.

Viewing MP as a relaxed layerwise l2 distortion minimization. We first focus on a single fully-connected layer (instead of a full model), and consider the problem of minimizing the pruning-incurred l2 distortion in the layer output, given the worst-case input signal. We then observe that the problem can be relaxed to the minimization of Frobenius distortion in the weight tensor, whose solution coincides with the layerwise MP. Formally, let $\xi \in \mathbb{R}^{n}$ be an input vector to a fully-connected layer with the weight tensor $W \in \mathbb{R}^{m \times n}$. We want to prune the tensor to $\widetilde{W} := M \odot W$, where M is an m × n binary matrix (i.e., having only 0s and 1s as its entries) satisfying some predefined sparsity constraint $\|M\|_0 \le \kappa$ imposed by the operational constraints (e.g., model size requirements). We wish to find the pruning mask M that incurs the minimum l2 distortion in the output given the worst-case l2-bounded input, i.e.,

$$\min_{M:\ \|M\|_0 \le \kappa}\ \sup_{\|\xi\|_2 \le 1} \big\| W\xi - (M \odot W)\xi \big\|_2. \tag{4}$$

The minimax distortion (4) upper-bounds the minimum expected l2 distortion for any distribution
of ξ supported on the unit ball, and thus can be viewed as a data-oblivious version of the pruning
algorithms designed for loss minimization (using squared loss). By the definition of the spectral
norm, Eq. 4 is equivalent to

$$\min_{M:\ \|M\|_0 \le \kappa} \big\| W - M \odot W \big\|, \tag{5}$$

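where $\|\cdot\|$ denotes the spectral norm. Using the fact that $\|A\| \le \|A\|_F$ holds for any matrix A (where $\|\cdot\|_F$ denotes the Frobenius norm), the optimization (5) can be relaxed to the Frobenius distortion minimization

$$\min_{M:\ \|M\|_0 \le \kappa} \big\| W - M \odot W \big\|_F = \min_{M:\ \|M\|_0 \le \kappa} \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} (1 - M_{ij})\, W_{ij}^2 \Big)^{1/2}, \tag{6}$$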

where $W_{ij}, M_{ij}$ denote (i, j)-th entries of W, M, respectively. From the right-hand side of Eq. 6,
we see that the layerwise MP, i.e., setting $M_{ij} = 1$ for (i, j) pairs with top-κ largest $|W_{ij}|$, is the
optimal choice to minimize the Frobenius distortion incurred by pruning. This observation motivates us to view the layerwise MP as the (approximate) solution of the output l2 distortion minimization procedure, and speculate the connection between the small output l2 distortion and the favorable performance of the pruned-retrained subnetwork (given the unreasonable effectiveness of seemingly naive MP as demonstrated by Gale et al. (2019)).
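
The two facts used in this argument, that the spectral norm never exceeds the Frobenius norm and that keeping the κ largest-magnitude entries minimizes the Frobenius distortion, can be checked numerically on a tiny matrix. The sketch below is only a sanity check by brute force; the helper names, the 3 × 3 random matrix, and κ = 4 are arbitrary choices made here.

```python
import itertools
import numpy as np

def magnitude_mask(weight, kappa):
    """Binary mask keeping the `kappa` largest-magnitude entries of `weight`."""
    flat = np.abs(weight).reshape(-1)
    keep = np.argsort(flat)[flat.size - kappa:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(weight.shape)

def frobenius_distortion(weight, mask):
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(weight - mask * weight)

def spectral_distortion(weight, mask):
    # ord=2 on a 2-D array gives the largest singular value.
    return np.linalg.norm(weight - mask * weight, ord=2)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
kappa = 4
mp_mask = magnitude_mask(W, kappa)

# Brute force over every mask with exactly kappa ones (126 candidates here).
best = min(
    frobenius_distortion(W, np.isin(np.arange(W.size), idx).reshape(W.shape))
    for idx in itertools.combinations(range(W.size), kappa)
)

print("MP Frobenius distortion:", frobenius_distortion(W, mp_mask))
print("best over all masks:    ", best)
print("MP spectral distortion: ", spectral_distortion(W, mp_mask))
```

Masks with fewer than κ ones can only increase the distortion, so searching masks with exactly κ ones is enough for this check.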

LAMP: greedy, relaxed minimization of model output distortion. Building on this speculation,
we now ask the following question: “How can we select the layerwise sparsity of MP to have small model-level output distortion?” To formalize, we consider the minimization

$$\min_{\sum_{i=1}^{d} \|M^{(i)}\|_0 \le \kappa}\ \sup_{\|x\|_2 \le 1} \big\| f(x; W^{(1:d)}) - f(x; \widetilde{W}^{(1:d)}) \big\|_2, \tag{7}$$

where κ denotes the model-level sparsity constraint imposed by the operational requirements and
$\widetilde{W}^{(i)} := M^{(i)} \odot W^{(i)}$ denotes the pruned version of the i-th layer weight matrix.

Due to the nonlinearities from the activation functions, it is difficult to solve Eq. 7 exactly. Instead,
we consider the following greedy procedure: At each step, we (a) approximate the distortion incurred by pruning a single connection, (b) remove the connection with the smallest score, and then (c) go back to step (a) and re-compute the scores based on the pruned model.

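Once we assume that only one connection is pruned at a single iteration of the step (a), we can use the following upper bound of the model output distortion to relax the optimization (7): With $\widetilde{W}^{(i)}$ denoting a pruned version of $W^{(i)}$, we have

$$\sup_{\|x\|_2 \le 1} \big\| f(x; W^{(1:d)}) - f(x; W^{(1:i-1)}, \widetilde{W}^{(i)}, W^{(i+1:d)}) \big\|_2 \le \frac{\|W^{(i)} - \widetilde{W}^{(i)}\|_F}{\|W^{(i)}\|_F} \cdot \prod_{j=1}^{d} \|W^{(j)}\|_F \tag{8}$$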

(see Appendix C for a derivation). Despite the sub-optimalities from the relaxation, considering the
right-hand side of Eq. 8 provides two favorable properties. First, the right-hand side is free of any
activation function, and is equivalent to the layerwise MP. Second, the score can be computed in
advance, i.e., does not require re-computing after pruning each connection. In particular, the product term $\prod_{j=1}^{d} \| W^{(j)} \|_{F}$ does not affect the pruning decision, and the denominator can be pre-computed with the cumulative sum $\sum_{v \ge u} (W^{(i)}[v])^2$ for each index u for $W^{(i)}$. This computational trick leads us to the LAMP score (1).
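
As a numerical sanity check of the upper bound used in this relaxation (the Frobenius distortion of the pruned layer, relative to that layer's Frobenius norm, times the product of all layers' Frobenius norms), the sketch below prunes one connection of a small random bias-free ReLU network and compares sampled output distortions against the bound. The layer sizes, the random seed, and the helper `forward` are assumptions for illustration, and random sampling only gives a lower estimate of the true supremum.

```python
import numpy as np

def forward(x, weights):
    """Bias-free fully-connected ReLU network: W_d relu(... relu(W_1 x))."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(w @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 6)), rng.normal(size=(8, 8)), rng.normal(size=(4, 8))]

# Prune the single smallest-magnitude connection of one chosen layer.
layer = 1
pruned = [w.copy() for w in weights]
idx = np.unravel_index(np.argmin(np.abs(pruned[layer])), pruned[layer].shape)
pruned[layer][idx] = 0.0

# Right-hand side of the bound: relative Frobenius distortion of the pruned
# layer times the product of the Frobenius norms of all layers.
fro = [np.linalg.norm(w) for w in weights]
bound = np.linalg.norm(weights[layer] - pruned[layer]) / fro[layer] * np.prod(fro)

# Left-hand side, estimated from inputs sampled on the unit sphere.
xs = rng.normal(size=(1000, 6))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
worst = max(np.linalg.norm(forward(x, weights) - forward(x, pruned)) for x in xs)

print("largest sampled distortion:", worst)
print("upper bound:               ", bound)
```

The bound is typically loose, which is fine for its purpose here: only the ranking of connections it induces is used, and the product term is shared by all of them.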

4 EXPERIMENTS & ANALYSES

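Figure 2: Sparsity-accuracy tradeoff curves of VGG-16, ResNet-18, DenseNet-121, and EfficientNet-B0. All models are iteratively pruned and retrained on the CIFAR-10 dataset.

To empirically validate the effectiveness of the proposed method, we compare LAMP with the following layerwise sparsity selection schemes for magnitude-based pruning:

• Global. A single global threshold on weight magnitude is imposed across all layers to meet the global sparsity constraint, and the layerwise sparsity is automatically determined by this threshold; see, e.g., Morcos et al. (2019).

• Uniform. Every layer is pruned to the same layerwise sparsity level, which in turn equals the global sparsity constraint; see, e.g., Zhu & Gupta (2018).

• Uniform+. Same as Uniform, but with two additional constraints: (1) the first convolutional layer is left unpruned, and (2) at least 20% of the connections are kept in the last fully-connected layer; this heuristic rule is proposed by Gale et al. (2019).

• Erdős-Rényi kernel. An extension of the Erdős-Rényi method (originally given by Mocanu et al. (2018)) that accounts for convolutional layers, proposed by Evci et al. (2020). The number of nonzero parameters of the sparse convolutional layer is scaled proportionally to $1 - \frac{n^{l-1} + n^{l} + w^{l} + h^{l}}{n^{l-1} \cdot n^{l} \cdot w^{l} \cdot h^{l}}$, where $n^{l}$ denotes the number of neurons in the l-th layer, and $w^{l}, h^{l}$ denote the width and height of the l-th layer convolution kernel.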
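
To make the difference between the first two baselines concrete, the sketch below computes how many weights each layer keeps under Uniform and under Global for the same overall budget. It is a toy illustration only: the function names and the two hand-made "layers" are assumptions introduced here, and Uniform+ and the Erdős-Rényi kernel are left out.

```python
import numpy as np

def uniform_keep_counts(weights, keep_fraction):
    """Uniform: every layer keeps the same fraction of its own weights."""
    return [int(round(keep_fraction * w.size)) for w in weights]

def global_keep_counts(weights, keep_fraction):
    """Global: one magnitude threshold shared by all layers."""
    magnitudes = np.concatenate([np.abs(w).reshape(-1) for w in weights])
    n_keep = int(round(keep_fraction * magnitudes.size))
    if n_keep == 0:
        return [0 for _ in weights]
    threshold = np.sort(magnitudes)[-n_keep]
    # Ties at the threshold are all kept in this sketch.
    return [int(np.sum(np.abs(w) >= threshold)) for w in weights]

# Two toy layers whose weights differ in scale by a factor of 100.
layers = [np.array([0.1, 0.2, 0.3, 0.4]), np.array([10.0, 20.0, 30.0, 40.0])]
uniform_counts = uniform_keep_counts(layers, keep_fraction=0.5)
global_counts = global_keep_counts(layers, keep_fraction=0.5)
print("Uniform keeps per layer:", uniform_counts)
print("Global keeps per layer: ", global_counts)
```

With these toy values, the single global threshold removes the small-scale layer entirely while leaving the large-scale layer untouched, which is the failure mode that a per-layer normalization such as the LAMP score is designed to avoid.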

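As a default setup, we perform five independent trials for each baseline method, and in each trial we use iterative pruning and retraining (Han et al., 2015): we prune 20% of the surviving weights at each iteration. For the Restricted ImageNet experiments, we provide results from four trials. For clarity of presentation, we report only the averages for the figures in the main text. The standard deviations of the five-seed results in Section 4.1 are given in Appendix E. In addition, Appendix D reports further experimental results on language modeling tasks (Penn Treebank and WT-2) with Transformers (Vaswani et al., 2017).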
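
Pruning 20% of the surviving weights per iteration means the surviving fraction shrinks geometrically, as 0.8 raised to the number of iterations. A small sketch of this arithmetic follows; the function name `surviving_fraction` and the chosen iteration counts are illustrative assumptions.

```python
def surviving_fraction(n_iterations: int, prune_rate: float = 0.2) -> float:
    """Fraction of weights left after repeatedly pruning `prune_rate` of the survivors."""
    return (1.0 - prune_rate) ** n_iterations

for k in (3, 6, 9, 12, 15, 18, 21, 24, 27):
    print(f"after {k:2d} iterations: {100 * surviving_fraction(k):.2f}% of weights survive")
```

Under this schedule, every third iteration lands (after rounding) on the survival levels quoted for Figure 5, from 51.2% at three iterations down to 0.24% at twenty-seven, and the 1.44% and 1.15% levels quoted in the results correspond to 19 and 20 iterations.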

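Summary of observations. From the experimental results (Figures 2 to 4), we observe that LAMP consistently outperforms all other baselines in terms of the sparsity-accuracy tradeoff. The performance gap between LAMP and the baseline methods appears more pronounced for modern network architectures (e.g., EfficientNet-B0). We also observe that LAMP performs well under the weight-rewinding setup, whereas the strongest baseline (Erdős-Rényi kernel) appears sensitive to such rewinding.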

4.1 MAIN RESULTS

Our main experimental results are on image classification models. We explore a diverse set of model architectures and datasets, with a base setup of VGG-16 (Simonyan & Zisserman, 2015) trained on CIFAR-10 (Krizhevsky & Hinton, 2009) dataset. In particular, our experiments cover the following models and datasets.

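Figure 3: Sparsity-accuracy tradeoff curves of pruned models trained on SVHN and CIFAR-100 (with VGG-16) and Restricted ImageNet (with ResNet-34).

Models. We consider four network architectures for the image classification experiments: (1) VGG-16 (Simonyan & Zisserman, 2015) adapted for CIFAR-10, with batch normalization layers and a single fully-connected layer (as used in Liu et al. (2019); Frankle & Carbin (2019)); (2) ResNet-20/34 (He et al., 2016); (3) DenseNet-121 (Huang et al., 2017); (4) EfficientNet-B0 (Tan & Le, 2019). For all four models, we prune only the weight tensors of fully-connected and convolutional units. Biases and batch normalization layers are left unpruned.

Datasets. We consider the following datasets: CIFAR-10/100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and Restricted ImageNet (Tsipras et al., 2019). All datasets except Restricted ImageNet are used to train VGG-16; Restricted ImageNet is used to train ResNet-34.

Other details. The detailed experimental setup is given in Appendix A.

In Figure 2, we present the sparsity-accuracy tradeoff curves for four different model architectures trained on the CIFAR-10 dataset. The first observation is that LAMP achieves the best tradeoff; on all four models, LAMP consistently outperforms the baseline methods. We also observe that the Erdős-Rényi kernel method outperforms the other baselines on VGG-16, ResNet-20, and EfficientNet-B0, but fails to do so on DenseNet-121. Furthermore, the gap between LAMP and the Erdős-Rényi kernel method seems to widen as the model architecture becomes more complex. The gap is especially notable on EfficientNet-B0, where mobile inverted bottleneck convolutional layers replace conventional convolutional modules. In particular, when only 1.44% of the weights survive, LAMP reaches a test accuracy of 88.1%, whereas the Erdős-Rényi kernel reaches 77.8%. Finally, we observe that the heuristic of Gale et al. (2019) appears to improve over Uniform MP.

In Figure 3, we present the tradeoff curves for three additional datasets: SVHN, CIFAR-100, and Restricted ImageNet. As in Figure 2, we observe that LAMP outperforms all other baselines, and the Erdős-Rényi kernel remains the most competitive baseline.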

4.2 ABLATIONS: ONE-SHOT PRUNING, WEIGHT REWINDING, AND SNIP

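Modern magnitude-based pruning algorithms are often used in conjunction with customized pruning schedules (e.g., Zhu & Gupta (2018)) or weight rewinding (e.g., Frankle & Carbin (2019); Renda et al. (2020)). To confirm that LAMP works reliably together with such techniques, we conduct the following additional experiments.

• One-shot pruning. As an extreme case of pruning schedules, we test a scheme that runs only a single training-pruning-retraining cycle instead of iterating over multiple cycles. We test one-shot pruning on VGG-16 trained on the CIFAR-10 dataset.

• Weight rewinding. After pruning, we rewind the remaining weights to their values at an early epoch, as in the “lottery ticket” experiments of Frankle & Carbin (2019). We perform iterative magnitude-based pruning (IMP) on VGG-16, using the warm-up step and training schedule described in Frankle et al. (2020).

• SNIP. As an additional experiment, we test whether LAMP can provide a general-purpose layerwise sparsity for pruning schemes other than MP. We test with the SNIP score (Lee et al., 2019) under the “pruning at initialization” setup. The baselines are similarly modified to use the SNIP score. We use the Conv-6 model (see Frankle & Carbin (2019) for more details on the model) on the CIFAR-10 dataset, with a batch size of 128.

Figure 4: Sparsity-accuracy tradeoff curves under the one-shot pruning, weight rewinding, and SNIP setups. The one-shot pruning and weight rewinding experiments are done on VGG-16 trained on the CIFAR-10 dataset. The SNIP experiment is done on Conv-6 trained on CIFAR-10.

Again, other experimental details are given in Appendix A.

The results of all three experiments are given in Figure 4. For one-shot pruning, we confirm that LAMP comfortably stays ahead of the other baselines, as in the iterative pruning case. We note that the gap between one-shot and iterative pruning is very small for LAMP; when 1.15% of all prunable weights survive, iterative LAMP brings only a 1.09% accuracy gain over one-shot LAMP. In contrast, the gain from iteration for Uniform MP is 41.62% at the same sparsity level.

In the weight rewinding experiment, we observe that LAMP still outperforms the baseline methods. We also note that the Global baseline tends to perform well in this case, even outperforming the Erdős-Rényi kernel method in the low-sparsity regime. This phenomenon seems related to the observation of Zhou et al. (2019) that the initial and final weights of a model are highly correlated; global MP may help retain connections with larger initial magnitudes, which play an important role in signal propagation at initialization (Lee et al., 2020).

In the SNIP experiment, we observe that LAMP achieves performance similar to Global SNIP. Recalling that the SNIP score is designed for global pruning (Lee et al., 2019), such high performance of LAMP is rather unexpected. We suspect that this is because LAMP is also designed for “output distortion minimization,” which is similar in spirit to “gradient distortion minimization.”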

5 LAYERWISE SPARSITY: GLOBAL MP, ERDŐS-RÉNYI KERNEL, AND LAMP

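With the effectiveness of LAMP confirmed, we take a closer look at the layerwise sparsity discovered by LAMP. We focus on answering two questions. Q1: Is the layerwise sparsity extracted by LAMP similar to empirically constructed heuristics, such as the one given by Gale et al. (2019)? Q2: Does the LAMP sparsity pattern have any other defining characteristics that could guide the design of (sparse) network architectures?

In Figure 5, we plot the layerwise survival rates and the numbers of nonzero weights discovered by iteratively pruning VGG-16 (trained on CIFAR-10) with Global MP, Erdős-Rényi kernel, and LAMP. Layerwise survival rates are given for global survival rates of {51.2%, 26.2%, 13.4%, 6.87%, 3.52%} (from light to dark). The numbers of nonzero weights are plotted for pruned models whose total fraction of surviving weights is {3.52%, 1.80%, 0.92%, 0.47%, 0.24%}.

We observe that the LAMP sparsity follows a trend similar to the sparsity levels given by the Erdős-Rényi kernel method. In particular, both methods tend to keep the first and last layers of the neural network relatively dense; this property is reminiscent of the handcrafted heuristic given by Gale et al. (2019): keep the first convolutional layer unpruned, and prune at most 80% from the last fully-connected layer. While Global MP also leaves most of the last fully-connected layer unpruned, the first convolutional layer is pruned quickly. The LAMP sparsity differs from the Erdős-Rényi kernel sparsity in two respects.

• Although LAMP shows a tendency to keep the first and last layers relatively unpruned, the tendency is milder. When 3.52% of the weights survive, LAMP keeps about 79% and about 62% of the weights unpruned in the first and last layers, respectively, whereas the Erdős-Rényi kernel does not prune any weights from these two layers.

• LAMP tends to keep the number of nonzero weights relatively uniform across layers at extreme sparsity levels (in fact, the first observation can be understood as a consequence of the second). In contrast, the Erdős-Rényi kernel method keeps the relative ratios constant, regardless of the global sparsity level.

Figure 5: Layerwise statistics of VGG-16 iteratively pruned on CIFAR-10. Top: layerwise survival rates of models with {51.2%, 26.2%, 13.4%, 6.87%, 3.52%} of weights surviving. Bottom: numbers of nonzero weights of models with {3.52%, 1.80%, 0.92%, 0.47%, 0.24%} of weights surviving.

Following the second observation, we conjecture that, given a global sparsity constraint on a neural network, having a similar number of nonzero connections in each layer may be a necessary condition for guaranteeing the maximal memorization capacity of the network (see, e.g., Yun et al. (2019)). A theoretical study of the approximability of such sparse neural networks could be an interesting direction for future research, potentially leading to more principled and robust pruning algorithms.

As a side note, we observe that the layerwise sparsity discovered by LAMP behaves similarly to that of AMC (He et al., 2018), which uses a reinforcement learning agent to search over the space of all available layerwise sparsities. We provide more details in Appendix F.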

6 CONCLUSION

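In this paper, we study the problem of layerwise sparsity for magnitude-based pruning schemes. The proposed method, called LAMP (Layer-Adaptive Magnitude-based Pruning), is motivated by the l2 distortion minimization perspective on magnitude-based pruning, and provides consistent performance gains across a variety of models and datasets. Furthermore, LAMP performs reliably well when combined with one-shot pruning schedules or weight rewinding, which makes it an attractive candidate for a “go-to” layerwise sparsity for magnitude-based pruning. Taking a deeper look at the layerwise sparsity discovered by LAMP, we observe that LAMP automatically recovers the handcrafted rules for layerwise sparsity. Moreover, we observe that LAMP tends to keep the number of nonzero weights relatively uniform across layers as we consider more extreme sparsity levels. We conjecture that such uniformity in the number of unpruned weights may be a necessary condition for the maximal expressive power of sparse neural networks.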

ACKNOWLEDGMENTS
This work was supported in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), in part by Samsung Advanced Institute of Technology (SAIT), and in part by the Defense Challengeable Future Technology Program of the Agency for Defense Development, Republic of Korea.
