RETHINKING THE VALUE OF NETWORK PRUNING: Notes

https://download.csdn.net/download/weixin_44543648/18515920

ABSTRACT:

  1. training a large, over-parameterized model is often not necessary to obtain an efficient final model
  2. learned “important” weights of the large model are typically not useful for the small pruned model
  3. the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm.

1. Training a large, over-parameterized model is usually not necessary to obtain an efficient final model.
2. The "important" weights learned in the large model are of little use to the pruned small model.
3. The architecture of the pruned model matters far more than the "important" weights inherited from the large model.

INTRODUCTION

A typical procedure of network pruning consists of three stages: 1) train a large, over-parameterized model (sometimes there are pretrained models available), 2) prune the trained large model according to a certain criterion, and 3) fine-tune the pruned model to regain the lost performance.

A typical network pruning procedure consists of three stages:
1) train a large, over-parameterized model (sometimes a pretrained model is available),
2) prune the trained large model according to a certain criterion,
3) fine-tune the pruned model to recover the lost performance.
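Below is a minimal PyTorch sketch of this three-stage pipeline. The helper names `build_large_model`, `prune_by_criterion`, and `train_loader` are hypothetical placeholders, not from the paper; the training loop is a generic supervised routine.

```python
import torch.nn as nn
import torch.optim as optim

def train(model, loader, epochs, lr):
    """Generic supervised training loop, used for stage 1 and stage 3."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model

# Stage 1: train a large, over-parameterized model.
large_model = train(build_large_model(), train_loader, epochs=160, lr=0.1)

# Stage 2: prune the trained model according to a chosen criterion
# (e.g. channel weight norms); this returns a smaller model.
pruned_model = prune_by_criterion(large_model, prune_ratio=0.5)

# Stage 3: fine-tune the pruned model with a small learning rate
# to regain the performance lost by pruning.
pruned_model = train(pruned_model, train_loader, epochs=40, lr=0.001)
```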

Generally, there are two common beliefs behind this pruning procedure. First, it is believed that starting with training a large, over-parameterized network is important (Luo et al., 2017; Carreira-Perpinán & Idelbayev, 2018), as it provides a high-performance model (due to stronger representation & optimization power) from which one can safely remove a set of redundant parameters without significantly hurting the accuracy. Therefore, this is usually believed, and reported, to be superior to directly training a smaller network from scratch (Li et al., 2017; Luo et al., 2017; He et al., 2017b; Yu et al., 2018) – a commonly used baseline approach. Second, both the pruned architecture and its associated weights are believed to be essential for obtaining the final efficient model (Han et al., 2015). Thus most existing pruning techniques choose to fine-tune a pruned model instead of training it from scratch. The preserved weights after pruning are usually considered to be critical, as how to accurately select the set of important weights is a very active research topic in the literature (Molchanov et al., 2016; Li et al., 2017; Luo et al., 2017; He et al., 2017b; Liu et al., 2017; Suau et al., 2018).
In this work, the authors show that both beliefs are not necessarily true. First, for structured pruning methods with predefined target network architectures (Figure 2), directly training the small target model from random initialization can achieve the same, if not better, performance as the model obtained from the three-stage pipeline. In this case, starting with a large model is not necessary and one could instead directly train the target model from scratch. Second, for structured pruning methods with auto-discovered target networks, training the pruned model from scratch can also achieve comparable or even better performance than fine-tuning. This observation shows that for these pruning methods, what matters more may be the obtained architecture, instead of the preserved weights, even though training the large model is needed to find that target architecture.

There are two common beliefs behind network pruning:
1. Starting by training a large, over-parameterized network is believed to be important, because it provides a high-performance model (due to stronger representation and optimization power) from which a set of redundant parameters can be safely removed without significantly hurting accuracy.
2. Both the pruned architecture and its associated weights are believed to be essential for obtaining the final efficient model.
As a result, most existing pruning techniques choose to fine-tune the pruned model rather than train it from scratch.
However, the authors' work shows that these beliefs are not necessarily true. They find that:
1. Directly training the randomly initialized small target model can reach the same or similar performance as the model obtained from the three-stage pipeline. In this case, starting from a large model is unnecessary; the target model can be trained from scratch directly, without first training a large, over-parameterized model.
2. For structured pruning methods with automatically discovered target networks, training the pruned model from scratch also achieves performance comparable to, or even better than, fine-tuning.

This observation suggests that for these pruning methods, what matters more may be the obtained architecture rather than the preserved weights, even though training the large model is needed to find that target architecture.

For an unstructured pruning method (Han et al., 2015) that prunes individual parameters, we found that training from scratch can mostly achieve comparable accuracy with pruning and fine-tuning on smaller-scale datasets, but fails to do so on the large-scale ImageNet benchmark. Note that in some cases, if a pretrained large model is already available, pruning and fine-tuning from it can save the training time required to obtain the efficient model.

For unstructured pruning, training from scratch can mostly match the accuracy of pruning plus fine-tuning on smaller datasets, but fails to do so on the large-scale ImageNet benchmark. Note that in some cases, if a pretrained large model is already available, pruning and fine-tuning from it can save the training time required to obtain the efficient model.

BACKGROUND

Those large models can be infeasible to store and run in real time on embedded systems. To address this issue, many methods have been proposed, such as low-rank approximation of weights (Denton et al., 2014; Lebedev et al., 2014), weight quantization (Courbariaux et al., 2016; Rastegari et al., 2016), knowledge distillation (Hinton et al., 2014; Romero et al., 2015) and network pruning (Han et al., 2015; Li et al., 2017), among which network pruning has gained notable attention due to its competitive performance and compatibility.

To address the problems of large models, existing approaches include low-rank approximation of weights (Denton et al., 2014; Lebedev et al., 2014), weight quantization (Courbariaux et al., 2016; Rastegari et al., 2016), knowledge distillation (Hinton et al., 2014; Romero et al., 2015), and network pruning (Han et al., 2015; Li et al., 2017).

One major branch of network pruning methods is individual weight pruning, and it dates back to Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi & Stork, 1993), which prune weights based on the Hessian of the loss function. More recently, Han et al. (2015) proposes to prune network weights with small magnitude, and this technique is further incorporated into the "Deep Compression" pipeline (Han et al., 2016b) to obtain highly compressed models. Srinivas & Babu (2015) proposes a data-free algorithm to remove redundant neurons iteratively. Molchanov et al. (2017) uses Variational Dropout (P. Kingma et al., 2015) to prune redundant weights. Louizos et al. (2018) learns sparse networks through L0-norm regularization based on stochastic gates. However, one drawback of these unstructured pruning methods is that the resulting weight matrices are sparse, which cannot lead to compression and speedup without dedicated hardware/libraries (Han et al., 2016a).

One major branch of network pruning methods is individual weight pruning. For example, Han et al. (2015) propose pruning network weights with small magnitude, a technique further incorporated into the "Deep Compression" pipeline (Han et al., 2016b) to obtain highly compressed models. Srinivas & Babu (2015) propose a data-free algorithm that removes redundant neurons iteratively. Molchanov et al. (2017) use Variational Dropout (P. Kingma et al., 2015) to prune redundant weights. Louizos et al. (2018) learn sparse networks through L0-norm regularization based on stochastic gates.

However, a drawback of these unstructured pruning methods is that the resulting weight matrices are sparse, which cannot bring compression and speedup without dedicated hardware/libraries.
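For concreteness, here is a minimal sketch of magnitude-based unstructured pruning in the spirit of Han et al. (2015): weights whose absolute value falls below one global threshold are zeroed with a binary mask. This is an illustrative reading of the idea, not the authors' released implementation; note that the pruned tensors keep their dense shape, which is exactly why speedup requires dedicated hardware or sparse libraries.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float):
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    # Pool all conv/linear weights to derive one global threshold.
    prunable = [m for m in model.modules()
                if isinstance(m, (nn.Conv2d, nn.Linear))]
    all_weights = torch.cat([m.weight.detach().abs().flatten()
                             for m in prunable])
    k = max(1, int(sparsity * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values

    masks = []
    for m in prunable:
        mask = (m.weight.detach().abs() > threshold).float()
        m.weight.data.mul_(mask)  # sparse values, but still a dense tensor
        masks.append(mask)        # reused to keep zeros fixed during fine-tuning
    return masks
```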

In contrast, structured pruning methods prune at the level of channels or even layers. Since the original convolution structure is still preserved, no dedicated hardware/libraries are required to realize the benefits. Among structured pruning methods, channel pruning is the most popular, since it operates at the most fine-grained level while still fitting in conventional deep learning frameworks. Some heuristic methods include pruning channels based on their corresponding filter weight norm (Li et al., 2017) and the average percentage of zeros in the output (Hu et al., 2016). Group sparsity is also widely used to smooth the pruning process after training (Wen et al., 2016; Alvarez & Salzmann, 2016; Lebedev & Lempitsky, 2016; Zhou et al., 2016). Liu et al. (2017) and Ye et al. (2018) impose sparsity constraints on channel-wise scaling factors during training, whose magnitudes are then used for channel pruning. Huang & Wang (2018) uses a similar technique to prune coarser structures such as residual blocks. He et al. (2017b) and Luo et al. (2017) minimize the next layer's feature reconstruction error to determine which channels to keep. Similarly, Yu et al. (2018) optimizes the reconstruction error of the final response layer and propagates an "importance score" for each channel. Molchanov et al. (2016) uses Taylor expansion to approximate each channel's influence on the final loss and prunes accordingly. Suau et al. (2018) analyzes the intrinsic correlation within each layer and prunes redundant channels. Chin et al. (2018) proposes a layer-wise compensate filter pruning algorithm to improve commonly-adopted heuristic pruning metrics. He et al. (2018a) proposes to allow pruned filters to recover during the training process. Lin et al. (2017); Wang et al. (2017) prune certain structures in the network based on the current input.

In contrast, structured pruning methods prune at the level of channels or even layers. Since the original convolution structure is preserved (only the number of channels changes, so the network framework itself does not need to change), no dedicated hardware/libraries are required to realize the benefits. Existing approaches include:
1. Pruning channels based on the corresponding filter weight norm, or on the average percentage of zeros in the output (a code sketch of the first variant is given after this list).
2. Group sparsity, widely used to smooth the pruning process after training.
3. Imposing sparsity constraints on channel-wise scaling factors during training, whose magnitudes are then used for channel pruning.
4. Minimizing the next layer's feature reconstruction error to determine which channels to keep.
5. Optimizing the reconstruction error of the final response layer and propagating an "importance score" for each channel.
6. Using Taylor expansion to approximate each channel's influence on the final loss.
7. Analyzing the intrinsic correlation within each layer and pruning redundant channels.
8. A layer-wise compensate filter pruning algorithm that improves commonly adopted heuristic pruning metrics.
9. Allowing pruned filters to recover during the training process.
10. Pruning certain structures in the network based on the current input.
11. The Lottery Ticket Hypothesis: certain connections, together with their randomly initialized weights, can achieve accuracy comparable to the original network when trained in isolation.
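As a concrete example of item 1, below is a minimal sketch of channel selection by filter L1 norm in the spirit of Li et al. (2017): rank each layer's output channels by the L1 norm of their filters and keep the top fraction. Building the physically smaller network from the kept indices is omitted, and the helper names are our own, not code from the cited work.

```python
import torch
import torch.nn as nn

def keep_channels_by_l1(conv: nn.Conv2d, keep_ratio: float) -> torch.Tensor:
    """Return sorted indices of the output channels to keep in one conv layer."""
    # conv.weight has shape (out_channels, in_channels, kH, kW);
    # the L1 norm over the last three dims scores each filter.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep_idx = torch.argsort(scores, descending=True)[:n_keep]
    return torch.sort(keep_idx).values

def channel_plan(model: nn.Module, keep_ratio: float = 0.5):
    """Decide, layer by layer, which channels survive pruning."""
    return {name: keep_channels_by_l1(m, keep_ratio)
            for name, m in model.named_modules()
            if isinstance(m, nn.Conv2d)}
```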

Zhu & Gupta (2018) show that a small dense model cannot reach the same accuracy as a pruned large sparse model with the same memory footprint. By contrast, this paper finds that fine-tuning the pruned model with inherited weights is no better than training it from scratch; the resulting pruned architecture is more likely what brings the benefit.

METHODOLOGY

We first divide network pruning methods into two categories. In a pruning pipeline, the target pruned model's architecture can be determined by either a human (i.e., predefined) or the pruning algorithm (i.e., automatic).

Network pruning methods fall into two categories: in a pruning pipeline, the architecture of the pruned target model can be determined either by a human (i.e., predefined) or by the pruning algorithm (i.e., automatic).

When a human predefines the target architecture, a common criterion is the ratio of channels to prune in each layer. For example, we may want to prune 50% of the channels in each layer of VGG. In this case, no matter which specific channels are pruned, the pruned target architecture remains the same, because the pruning algorithm only locally prunes the least important 50% of channels in each layer. In practice, the ratio in each layer is usually selected through empirical studies or heuristics. Examples of predefined structured pruning include Li et al. (2017), Luo et al. (2017), He et al. (2017b) and He et al. (2018a).

When the target architecture is automatically determined by a pruning algorithm, it is usually based on a pruning criterion that globally compares the importance of structures (e.g., channels) across layers. Examples of automatic structured pruning include Liu et al. (2017), Huang & Wang (2018), Molchanov et al. (2016) and Suau et al. (2018).

When a human predefines the pruned structure, the common approach is to set a channel pruning ratio for each layer, which is rather restrictive; when the architecture is determined automatically, the importance of channels is usually compared globally across layers.
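The difference can be made concrete with a small sketch. The predefined variant below fixes the keep ratio per layer, while the automatic variant applies one global threshold to the BatchNorm scaling factors across all layers, in the spirit of Network Slimming (Liu et al., 2017), so the per-layer widths fall out of the global comparison. Both helpers are our own illustration, not code from the paper.

```python
import torch
import torch.nn as nn

def predefined_widths(model: nn.Module, keep_ratio: float = 0.5):
    """Predefined: keep a fixed fraction of channels in every layer."""
    return {name: max(1, int(keep_ratio * m.num_features))
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}

def automatic_widths(model: nn.Module, prune_fraction: float = 0.5):
    """Automatic: compare |gamma| of all BN layers against one global threshold."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    k = max(1, int(prune_fraction * gammas.numel()))
    threshold = gammas.kthvalue(k).values
    return {name: int((m.weight.detach().abs() > threshold).sum())
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```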

Unstructured pruning (Han et al., 2015; Molchanov et al., 2017; Louizos et al., 2018) also falls in the category of automatic methods, where the positions of pruned weights are determined by the training process and the pruning algorithm, and it is usually not possible to predefine the positions of zeros before training starts.

Unstructured pruning (Han et al., 2015; Molchanov et al., 2017; Louizos et al., 2018) also belongs to the automatic category: the positions of the pruned weights are determined by the training process and the pruning algorithm, and it is usually impossible to predefine the positions of the zeros before training starts.

The rest of the paper describes the training setup for the experiments, which is not covered in detail in these notes.
