PyTorch Distiller Weights Pruning Algorithms

Pruning - Neural Network Distiller: https://intellabs.github.io/distiller/algo_pruning.html

Magnitude pruning:

        This is the most basic pruner: it applies a thresholding function, thresh(·), on each element, wi, of a weights tensor. A different threshold can be used for each layer's weights tensor. Because the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family.

        The most basic pruning strategy. A thresholding function is applied to each element of the weights tensor; each layer's weights tensor may use a different threshold.
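        As a rough illustration of the idea (a minimal sketch, not Distiller's actual pruner class or API), an element-wise magnitude mask can be written in a few lines of PyTorch; the threshold value here is something you would pick per layer:

import torch

def magnitude_prune(weights, threshold):
    """Zero out every element whose absolute value is below `threshold`."""
    mask = (weights.abs() > threshold).float()   # 1.0 where |w| > threshold, else 0.0
    return weights * mask                        # pruned copy of the tensor

w = torch.randn(64, 3, 3, 3)                     # e.g. a conv layer's weight tensor
pruned = magnitude_prune(w, threshold=0.1)
print("sparsity:", (pruned == 0).float().mean().item())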

Sensitivity pruning:

        Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully-connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. A diagram in the original documentation shows the distribution of the weights tensors of the first convolutional layer and the first fully-connected layer in TorchVision's pre-trained Alexnet model; both have an approximately Gaussian distribution.

        Finding a threshold for each layer is tedious, since the average magnitude of each layer's elements differs from layer to layer. We can exploit the fact that the weights of convolutional and fully-connected layers follow an approximately Gaussian distribution.

        We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (σ) of the tensor. Thus, if we set the threshold to s∗σ, then basically we are thresholding s∗68% of the tensor elements.

        The standard deviation σ of the weights tensor can serve as a kind of normalizing factor. If the tensor's elements are normally distributed, about 68% of them have an absolute value smaller than the standard deviation. So if we set the threshold to s∗σ, we effectively filter out roughly s∗68% of the parameters.

        How do we choose this s multiplier?

        In Learning both Weights and Connections for Efficient Neural Networks the authors write:

        "We used the sensitivity results to find each layer’s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer’s weights

        So the results of executing a pruning sensitivity analysis on the tensor give us a good starting guess for s. Sensitivity analysis is an empirical method, and we still have to spend time honing in on the exact multiplier value.

        The key is choosing s. The result of the sensitivity analysis serves as a starting value for s; sensitivity analysis is an empirical method.

Method of Operation

  1. Start by running a pruning sensitivity analysis on the model.
  2. Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.

Method of operation:

        First run a pruning sensitivity analysis on the model, then use the results to set each layer's threshold. Rather than setting a threshold directly, use a sensitivity parameter multiplied by the standard deviation of the weights.
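        A minimal sketch of the sensitivity-based threshold described above (our own helper, not Distiller's SensitivityPruner): the per-layer threshold is s multiplied by the standard deviation of that layer's weights, with s taken from the sensitivity analysis.

import torch

def sensitivity_prune(weights, s):
    threshold = s * weights.std()                # per-tensor threshold = s * sigma
    mask = (weights.abs() > threshold).float()
    return weights * mask

w = torch.randn(256, 512)                        # stand-in for a layer's weight tensor
pruned = sensitivity_prune(w, s=0.875)           # s comes from sensitivity analysis
print("sparsity:", (pruned == 0).float().mean().item())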

Schedule

        In their paper Song Han et al. use iterative pruning and change the value of the s multiplier at each pruning step. Distiller's SensitivityPruner works differently: the value of s is set once, based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are "pulled" toward the center of the distribution and thus more elements get pruned.

        This actually works quite well, as can be seen in a TensorBoard screen-capture from Alexnet training in the original documentation, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate.

Schedule:

        Song Han et al. change the value of s at every pruning step. Distiller's SensitivityPruner works differently: the value is set only once, based on the standard deviation of the weights tensor computed the first time we prune. As the tensor is pruned, more elements are pulled toward the center of the distribution, so more elements get pruned. This works quite well.

Example:

        We use a simple iterative-pruning schedule such as: Prune every second epoch starting at epoch 0, and ending at epoch 38. This excerpt from alexnet.schedule_sensitivity.yaml shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML:

pruners:
  my_pruner:
    class: 'SensitivityPruner'
    sensitivities:
      'features.module.0.weight': 0.25
      'features.module.3.weight': 0.35
      'features.module.6.weight': 0.40
      'features.module.8.weight': 0.45
      'features.module.10.weight': 0.55
      'classifier.1.weight': 0.875
      'classifier.4.weight': 0.875
      'classifier.6.weight': 0.625

policies:
  - pruner:
      instance_name : 'my_pruner'
    starting_epoch: 0
    ending_epoch: 38
    frequency: 2

        A simple iterative pruning example: prune every 2 epochs, from epoch 0 through epoch 38. The YAML configuration above conveys this schedule.

Level Pruner

        Class SparsityLevelParameterPruner uses a similar method to get around specifying explicit thresholding magnitudes. Instead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criterion based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level.

The level pruner does not specify a threshold directly; instead you specify a target sparsity level, expressed as a fraction.

        This pruner is much more stable compared to SensitivityPruner because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's SensitivityPruner is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier s, and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires many more hyper-parameters (this is the reason we have not implemented it thus far).

        Level pruning is more stable than sensitivity pruning because the target sparsity level is not tied to the actual element magnitudes. Distiller's sensitivity pruner is not always stable, since the final sparsity depends on the convergence pattern of the weights-tensor distribution. Song Han's method of using several values of s and recalculating the standard deviation at each pruning phase is probably what gives it stability.

        To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each layer.

        Sensitivity analysis can be used to make a better guess at a suitable sparsity level.

Method of Operation

  1. Sort the weights in the specified layer by their absolute values. 
  2. Mask to zero the smallest magnitude weights until the desired sparsity level is reached.

Sort the weights of the layer by magnitude, then zero them from smallest to largest until the specified sparsity level is reached.
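        A minimal sketch of that procedure (not Distiller's SparsityLevelParameterPruner itself): find the magnitude of the k-th smallest weight, where k is the target sparsity times the number of elements, and mask everything at or below it.

import torch

def level_prune(weights, sparsity):
    """sparsity=0.5 means 50% of the elements are zeroed."""
    k = int(sparsity * weights.numel())          # how many weights to remove
    if k == 0:
        return weights.clone()
    # magnitude of the k-th smallest |w| becomes the cut-off
    threshold = weights.abs().flatten().kthvalue(k).values
    mask = (weights.abs() > threshold).float()
    return weights * mask

w = torch.randn(128, 256)
pruned = level_prune(w, sparsity=0.5)
print((pruned == 0).float().mean().item())       # approximately 0.5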

Splicing Pruner

        In Dynamic Network Surgery for Efficient DNNs, Guo et al. propose that network pruning and splicing work in tandem. A SplicingPruner is a pruner that both prunes and splices connections; it works best with a Dynamic Network Surgery schedule, which, for example, configures the PruningPolicy to mask weights only during the forward pass.

The splicing pruner both prunes and splices connections; it works best with a Dynamic Network Surgery schedule.

 

Automated Gradual Pruner (AGP)

        In To prune, or not to prune: exploring the efficacy of pruning for model compression, authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in AutomatedGradualPruner.

        "We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value si (usually 0) to a final sparsity value sfover a span of n pruning steps. The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.""

        The paper introduces an automated gradual pruning algorithm: sparsity is increased gradually from an initial value si to the final value sf. Early on, when connections are abundant and redundant, the network is pruned quickly; later, as fewer and fewer weights remain, the amount pruned each step shrinks.
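        The sparsity function the quote refers to as equation (1) is, to the best of my reading of the paper, s_t = s_f + (s_i - s_f)·(1 - (t - t0)/(n·dt))^3, evaluated at pruning steps t = t0, t0 + dt, ..., t0 + n·dt. A small sketch of that schedule follows; it is a plain transcription of the formula, not Distiller's AutomatedGradualPruner code.

# AGP sparsity schedule: s_t = s_f + (s_i - s_f) * (1 - (t - t0)/(n * dt)) ** 3
def agp_sparsity(t, s_i, s_f, t0, n, dt):
    progress = min(max((t - t0) / (n * dt), 0.0), 1.0)   # clamp to [0, 1]
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

# Example: ramp from 0% to 90% sparsity over 30 steps of one epoch each.
for epoch in (0, 5, 15, 30):
    print(epoch, round(agp_sparsity(epoch, s_i=0.0, s_f=0.9, t0=0, n=30, dt=1), 3))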

You can play with the scheduling parameters in the agp_schedule.ipynb notebook.

The authors describe AGP:

  • Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity.
  • Doesn't require much hyper-parameter tuning
  • Shown to perform well across different models
  • Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.

        The automated gradual pruning algorithm prunes the smallest-magnitude weights to reach a preset network sparsity. It does not require much hyper-parameter tuning, performs well across many models, and makes few assumptions about the network structure.

RNN Pruner

        The authors of Exploring Sparsity in Recurrent Neural Networks, Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, "propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network." They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers.

An algorithm that reduces the network's parameters by pruning weights during the initial training of the network, using a gradual pruning schedule reminiscent of AGP to prune RNNs element-wise.

Structure Pruners

        Element-wise pruning can create very sparse models which can be compressed to consume a smaller memory footprint and less bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners remove entire "structures", such as kernels, filters, and even entire feature-maps.

        Element-wise pruning can create very sparse models that use less memory and bandwidth, but without specialized hardware that can compute on the sparse representation of the tensors, there is no speedup of the computation. Structure pruning removes entire structures, such as kernels, filters, and even whole feature maps.

Structure Ranking Pruners

        Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below). In Pruning Filters for Efficient ConvNets the authors use filter ranking, with one-shot pruning followed by fine-tuning.

        Structure-ranking pruners rank the structures in a tensor by some criterion and then prune the tensor to a specified level. In principle these are one-shot pruners, but they can be combined with an automatic pruning-level schedule such as AGP. In Pruning Filters for Efficient ConvNets the authors prune once and then fine-tune.

The authors of Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation:

First, after sweeping through the full training set several times the weights become relatively stable — they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)

L1RankedStructureParameterPruner

        The L1RankedStructureParameterPruner pruner calculates the magnitude of some "structure", orders all of the structures based on some magnitude function, and the m lowest-ranking structures are pruned away. This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude. The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the L1-norm of the structure, and at the same time it is a good proxy of the L1-norm. Basically, you can think of mean(abs(t)) as a form of normalization of the structure's L1-norm by the length of the structure. L1RankedStructureParameterPruner currently prunes weight filters, channels, and rows (for linear layers).

        The L1-ranked pruner computes a magnitude for each structure, orders the structures by that magnitude function, and prunes the m lowest-ranking structures. It ranks structures by the mean of their absolute values, which does not depend on the structure's size and is therefore easier to compare; it can be viewed as a length-normalized L1-norm.
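        A minimal sketch of that ranking criterion (our own re-implementation of the idea, not Distiller's class): score each output filter of a convolution by mean(abs(filter)) and zero the m lowest-scoring filters.

import torch

def prune_lowest_filters(weights, num_to_prune):
    """weights: conv weight of shape (out_channels, in_channels, kH, kW)."""
    scores = weights.abs().mean(dim=(1, 2, 3))   # one score per output filter
    prune_idx = scores.argsort()[:num_to_prune]  # indices of the lowest-ranking filters
    pruned = weights.clone()
    pruned[prune_idx] = 0.0                      # mask whole filters to zero
    return pruned

w = torch.randn(64, 3, 3, 3)
pruned = prune_lowest_filters(w, num_to_prune=16)   # zero 16 of the 64 filters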

ActivationAPoZRankedFilterPruner

        The ActivationAPoZRankedFilterPruner pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters. This method is called Network Trimming from the research paper: "Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures", Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016 https://arxiv.org/abs/1607.03250

        Filters are ranked by the mean APoZ (average percentage of zeros) of their activation channels, and a specified percentage of filters is pruned.
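        A minimal sketch of the APoZ criterion (illustrative shapes and names, not Distiller's implementation): for each channel, count how often its post-ReLU activations are zero, and mark the channels with the highest APoZ for pruning.

import torch

def apoz_scores(activations):
    """activations: post-ReLU feature maps of shape (N, C, H, W).
    Returns one APoZ score per channel (higher = more zeros)."""
    zeros = (activations == 0).float()
    return zeros.mean(dim=(0, 2, 3))             # average over batch and spatial dims

acts = torch.relu(torch.randn(8, 64, 28, 28))    # fake activations for 64 filters
scores = apoz_scores(acts)
prune_idx = scores.argsort(descending=True)[:16] # the 16 filters with the most zeros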

GradientRankedFilterPruner

        The GradientRankedFilterPruner tries to assess the importance of weight filters using the product of their gradients and the filter values.

        Filter importance is estimated as the product of the gradient and the filter weights.
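        A minimal sketch of that criterion (an illustration of the idea, not Distiller's GradientRankedFilterPruner code): score each filter by the magnitude of (weight * gradient) summed over the filter, and treat the lowest scores as the least important.

import torch

def gradient_filter_scores(weight, grad):
    """weight, grad: shape (out_channels, in_channels, kH, kW)."""
    return (weight * grad).abs().sum(dim=(1, 2, 3))

w = torch.randn(64, 3, 3, 3, requires_grad=True)
loss = (w ** 2).sum()                            # toy loss, only to produce gradients
loss.backward()
scores = gradient_filter_scores(w.detach(), w.grad)
prune_idx = scores.argsort()[:16]                # candidates for pruning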

RandomRankedFilterPruner

        For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking. The RandomRankedFilterPruner pruner can be used for this purpose.

        Random structure ranking, used as a baseline for comparing the effect of other ranking pruners.

Automated Gradual Pruner (AGP) for Structures

        The idea of a mathematical formula controlling the sparsity level growth is very useful and StructuredAGP extends the implementation to structured pruning.

Pruner Compositions

        Pruners can be combined to create new pruning schemes. Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions. For each of these, we use AGP to decide how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods:

  • L1RankedStructureParameterPruner_AGP
  • ActivationAPoZRankedFilterPruner_AGP
  • GradientRankedFilterPruner_AGP
  • RandomRankedFilterPruner_AGP

        Pruning methods can be combined into new schemes. In particular, AGP is combined with the filter-ranking classes: AGP decides how many filters to prune at each step, and one of the filter-ranking algorithms chooses which filters to remove.
