Model Pruning: Rethinking the Value of Network Pruning

Article link: https://blog.csdn.net/zhangjunhit/article/details/83506306

Rethinking the Value of Network Pruning
https://github.com/Eric-mingjie/rethinking-network-pruning

This post summarizes the paper above, which re-examines the value of pruning network models.

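Current deep learning models generally carry a heavy computational cost. Reducing that cost while preserving as much of the network's performance as possible is an important research topic.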

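The standard pruning pipeline has three stages: 1) train a large, over-parameterized network to obtain the best performance, which serves as the baseline; 2) prune the large model according to some criterion; 3) fine-tune the pruned model on the dataset.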

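Two common beliefs underlie this pipeline:
1) Starting from a large, over-parameterized network is considered important. Pruning with the large model's performance as the baseline is generally believed to work better than training a small model from scratch.
2) Both the pruned architecture and its retained weights are considered important, which is why most current methods fine-tune the pruned model: the preserved weights after pruning are usually considered to be critical.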

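Through extensive experiments, the paper reaches two rather surprising conclusions:
1) If the small target model is determined in advance, it can be trained directly on the dataset, and the resulting performance is no worse than that of the fine-tuned model.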

First, for pruning algorithms with predefined target network architectures (Figure 2), directly training the small target model from random initialization can achieve the same, if not better, performance, as the model obtained from the three-stage pipeline. In this case, starting with a large model is not necessary and one could instead directly train the target model from scratch.

2) When the target model is not determined in advance, training the pruned model from scratch also gives performance no worse than fine-tuning.
For pruning algorithms without a predefined target network, training the pruned model from scratch can also achieve comparable or even better performance than fine-tuning. This observation shows that for these pruning algorithms, what matters is the obtained architecture, instead of the preserved weights.

The pruning process may therefore essentially be a search for an efficient network architecture:
our results suggest that the value of automatic pruning algorithms may lie in identifying efficient structures and performing implicit architecture search, rather than selecting “important” weights.

Figure: predefined and non-predefined (automatically discovered) target architectures.

Predefined target architectures: an example makes this concrete. Suppose we prune 50% of the channels in each layer of VGG. Whichever specific channels are removed, the final network architecture is the same, because the pruning algorithm removes the least important 50% of channels in every layer. The ratio in each layer is usually selected through empirical studies or heuristics.
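
A minimal sketch of why such a target is "predefined": the pruned widths depend only on the original widths and the ratio, not on which channels are judged unimportant. The width list below is the usual VGG-16 convolutional configuration. The helper name `predefined_target_widths` is hypothetical and does not come from the paper's code.

```python
# Convolutional layer widths of a standard VGG-16 (13 conv layers).
VGG16_CONV_WIDTHS = [64, 64, 128, 128, 256, 256, 256,
                     512, 512, 512, 512, 512, 512]


def predefined_target_widths(widths, prune_ratio):
    """Per-layer widths after removing `prune_ratio` of the channels in every layer.

    Hypothetical helper: the result never looks at the weights, which is
    exactly what makes the target architecture known before pruning.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    return [w - int(w * prune_ratio) for w in widths]


target = predefined_target_widths(VGG16_CONV_WIDTHS, 0.5)
print(target)
```

Since the target widths are known up front, a network with these widths can be built and trained from random initialization without ever training the large model.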

A network model can be characterized by the following metrics:
model size, memory footprint, the number of computation operations (FLOPs) and power usage
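
To make two of these metrics concrete, the sketch below counts the weights and the multiply-accumulate operations of a single convolution layer. It assumes a square kernel, no bias, and that one multiply-accumulate counts as one operation. Counting conventions for FLOPs differ between papers, so treat the absolute numbers as illustrative.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates for one forward pass producing an h_out x w_out map."""
    return conv_params(c_in, c_out, k) * h_out * w_out


# Halving both the input and output channels of an interior layer
# cuts its weights and its compute by a factor of four.
full = conv_macs(256, 256, 3, 56, 56)
half = conv_macs(128, 128, 3, 56, 56)
ratio = full / half
print(ratio)  # 4.0
```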

The paper uses three datasets and three standard network architectures:
CIFAR-10, CIFAR-100, and ImageNet
VGG, ResNet, and DenseNet

Six pruning methods are evaluated:
L1-norm based Channel Pruning (Li et al., 2017)
ThiNet (Luo et al., 2017)
Regression based Feature Reconstruction (He et al., 2017b)
Network Slimming (Liu et al., 2017)
Sparse Structure Selection (Huang & Wang, 2018)
Non-structured Weight Pruning (Han et al., 2015)

Training Budget. One crucial question is how long we should train the small pruned model from scratch.
When the small model is trained from scratch, the training length (the number of epochs) becomes a key choice.

The paper tries two settings:
Scratch-E denotes training the small pruned model for the same number of epochs as the large model.
Scratch-B denotes training the small pruned model with the same computation budget as the large model's training.
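
The two budgets can be sketched as follows. The proportional scaling is an assumption of this sketch: it gives the small model as many extra epochs as its per-epoch FLOPs saving allows. The optional `max_multiplier` cap is a hypothetical knob added here for illustration, and the epoch and FLOPs figures are made-up inputs.

```python
def scratch_e_epochs(large_epochs):
    """Scratch-E: same number of epochs as the large model."""
    return large_epochs


def scratch_b_epochs(large_epochs, large_flops, pruned_flops, max_multiplier=None):
    """Scratch-B: scale epochs so total compute roughly matches large-model training.

    Assumes compute per epoch is proportional to the model's FLOPs.
    `max_multiplier` is a hypothetical cap on the scaling factor.
    """
    if large_flops <= 0 or pruned_flops <= 0:
        raise ValueError("FLOPs must be positive")
    multiplier = large_flops / pruned_flops
    if max_multiplier is not None:
        multiplier = min(multiplier, max_multiplier)
    return round(large_epochs * multiplier)


epochs_e = scratch_e_epochs(160)
epochs_b = scratch_b_epochs(160, large_flops=4.0e8, pruned_flops=2.0e8)
print(epochs_e, epochs_b)  # 160 320
```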

4 Experiments
4.1 Predefined target architectures

L1-norm based Channel Pruning (Li et al., 2017):
In each layer, a certain percentage of channels with smaller L1-norm of its filter weights will be pruned
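
A minimal sketch of this criterion, assuming each filter is given as a flat list of its weights. The function name and the toy weights are illustrative only.

```python
def l1_norm_keep_indices(filters, prune_ratio):
    """Return indices of the output channels kept after L1-norm pruning.

    filters: list of flattened filter weights, one list per output channel.
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    n_prune = int(len(filters) * prune_ratio)
    # Rank channels by ascending L1 norm; the first n_prune are removed.
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    return [i for i in range(len(filters)) if i not in pruned]


filters = [
    [0.9, -0.8, 0.7],      # L1 norm 2.4
    [0.01, 0.02, -0.01],   # L1 norm 0.04
    [-0.5, 0.4, 0.3],      # L1 norm 1.2
    [0.05, -0.05, 0.1],    # L1 norm 0.2
]
keep = l1_norm_keep_indices(filters, 0.5)
print(keep)  # [0, 2]
```

The number of kept channels is fixed by `prune_ratio` alone. The weights only decide which channels survive, not how many.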

ThiNet (Luo et al., 2017) greedily prunes the channel that has the smallest effect on the next layer’s activation values
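
The greedy idea can be sketched as below. This is a simplified illustration, not ThiNet's actual implementation. It assumes we already have a small table `contrib[i][j]` holding the contribution of input channel `j` to the `i`-th sampled next-layer output value, and it measures "effect" as the squared sum of the removed contributions. Both the data layout and the function name are assumptions of this sketch.

```python
def greedy_prune_channels(contrib, n_prune):
    """Greedily pick n_prune channels whose removal changes the sampled
    next-layer outputs the least (squared-error measure)."""
    n_channels = len(contrib[0])
    removed = []
    # Running sum, per sample, of the contributions of removed channels.
    partial = [0.0] * len(contrib)
    while len(removed) < n_prune:
        best_j, best_err = None, float("inf")
        for j in range(n_channels):
            if j in removed:
                continue
            # Error if channel j were removed in addition to those already chosen.
            err = sum((p + row[j]) ** 2 for p, row in zip(partial, contrib))
            if err < best_err:
                best_j, best_err = j, err
        removed.append(best_j)
        partial = [p + row[best_j] for p, row in zip(partial, contrib)]
    return sorted(removed)


contrib = [
    [1.0, 0.1, -0.1, 2.0],
    [0.5, -0.2, 0.2, 1.0],
]
removed = greedy_prune_channels(contrib, 2)
print(removed)  # [1, 2]
```

In the toy data, channels 1 and 2 cancel each other, so removing both leaves the sampled outputs unchanged. A per-channel magnitude criterion would not see that interaction.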

Regression based Feature Reconstruction (He et al., 2017b)
prunes channels by minimizing the feature map reconstruction error of the next layer

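The conclusion: once the target architecture is fixed, training the small model from scratch matches or beats fine-tuning it. When the scratch-trained model is given the same computation budget as the large model's training (Scratch-B), its performance is generally better than that of the fine-tuned model.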

4.2 Automatically discovered target architectures

Network Slimming (Liu et al., 2017):
imposes L1-sparsity on channel-wise scaling factors from Batch Normalization layers (Ioffe & Szegedy, 2015) during training, and prunes channels with lower scaling factors afterward.
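
The sketch below shows why this yields an automatically discovered architecture: a single threshold is applied across all layers, so the surviving width of each layer depends on the learned scaling factors. It assumes a global percentile threshold over the absolute scaling factors. The toy values and the function name are illustrative, and the sparsity-regularized training itself is not shown.

```python
def slimming_widths(bn_scales, prune_ratio):
    """Per-layer channel counts left after pruning by a global threshold.

    bn_scales: list of per-layer lists of BN scaling factors (gamma).
    """
    all_abs = sorted(abs(g) for layer in bn_scales for g in layer)
    n_prune = int(len(all_abs) * prune_ratio)
    if n_prune == 0:
        return [len(layer) for layer in bn_scales]
    # Channels whose |gamma| is at or below this value are pruned.
    threshold = all_abs[n_prune - 1]
    return [sum(1 for g in layer if abs(g) > threshold) for layer in bn_scales]


bn_scales = [
    [0.9, 0.8, 0.01, 0.7],
    [0.02, 0.03, 0.6, 0.04],
    [0.5, 0.05, 0.4, 0.3],
]
widths = slimming_widths(bn_scales, 0.5)
print(widths)  # [3, 1, 2]
```

Half of the channels are removed overall, but the layers end up with different widths, which could not have been written down before training.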

Sparse Structure Selection (Huang & Wang, 2018):
uses sparsified scaling factors to prune structures, and can be seen as a generalization of Network Slimming. Other than channels, pruning can be on residual blocks in ResNet or groups in ResNeXt (Xie et al., 2017).

Non-structured Weight Pruning (Han et al., 2015):
prunes individual weights that have small magnitudes. This pruning granularity leaves the weight matrices sparse, hence it is commonly referred to as non-structured weight pruning.
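
A minimal sketch of magnitude-based weight pruning on a small weight matrix stored as nested lists. The per-matrix threshold and the function name are assumptions of this sketch.

```python
def magnitude_prune(weights, prune_ratio):
    """Zero out the smallest-magnitude weights of a 2-D weight matrix.

    Returns (pruned_weights, mask), where mask is 1 for kept weights.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * prune_ratio)
    # With nothing to prune, use a threshold below every magnitude.
    threshold = flat[n_prune - 1] if n_prune > 0 else -1.0
    mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask


weights = [
    [0.5, -0.01, 0.3],
    [0.02, -0.7, 0.05],
]
pruned, mask = magnitude_prune(weights, 0.5)
print(mask)    # [[1, 0, 1], [0, 1, 0]]
print(pruned)  # [[0.5, 0.0, 0.3], [0.0, -0.7, 0.0]]
```

The matrix keeps its shape and only becomes sparse, so the layer widths are unchanged. This is the difference from the channel-level methods above.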