Deep Dive into Multi-bit Weighted Quantization for CNNs

A joint work with Ido Glanz

Reducing neural network complexity and memory consumption has become a broad field of research, aiming both to run complex deep models on edge devices and to enable faster, and potentially more accurate, inference on a variety of new tasks.

As part of this effort, quantization has become a common and effective tool, yet it often requires running on the complete original dataset and re-training the network, which is not always feasible; furthermore, a quantization scheme that suits one task may perform poorly on another.

Below we investigate the quantization of neural networks' weight matrices under a weighted quantization optimization scheme, motivated by the assumption that not all weights were created equal: capturing some of them more accurately than others during quantization could yield more accurate compressed models than vanilla quantization schemes.

Quantization

First, let's briefly discuss what quantization is all about in the context of model compression.

In recent years, many papers have addressed model compression with quantization, many of them achieving state-of-the-art results in terms of compression rate, complexity, and retained accuracy.

But what is quantization?

The process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers).
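As a toy illustration of that definition (ours, not part of the method discussed below), uniform quantization simply snaps each value to the nearest of a small number of evenly spaced levels; the bit-width and data here are arbitrary:

import numpy as np

def uniform_quantize(x, num_bits=4):
    # Snap each value to the nearest of 2^num_bits evenly spaced levels.
    levels = 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / levels
    codes = np.round((x - x_min) / scale)        # discrete integer codes in [0, levels]
    return codes * scale + x_min                 # de-quantized approximation

w = np.random.randn(64, 64).astype(np.float32)
w_hat = uniform_quantize(w)
print(np.abs(w - w_hat).max())                   # worst-case quantization error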

In neural network compression, quantization is used to reduce the memory consumption of the model weights, and sometimes also to simplify the mathematical operations and thus reduce computational complexity. A classic example is when we want to run our models on edge devices and need to account for limited memory, compute resources, and power.

In addition, quantization has become popular paired with hardware optimized for deep learning application tool-chains, such as NVIDIA TensorRT, Xilinx, and many (many) more, which optimize the matrix multiplications in a low-bit form.

Quantization techniques: Alternating Multi-bit Quantization

As mentioned above, quantization has become a broad field of research in recent years and many quantization techniques have been suggested. Here we focus on the work of Chen Xu et al., "Alternating Multi-bit Quantization for Recurrent Neural Networks" [1], which formulates multi-bit quantization as an optimization problem: it separates the binary codes (-1/1 matrices) from the coefficients and alternates between computing the binary codes and computing the coefficients (freezing each in turn) to derive a high-accuracy quantization for a given number of bits.

This type of quantization decomposes a weight matrix into coefficient vectors (in 16 or 32 bits) and binary matrices (-1/1), which are then multiplied and accumulated to form an approximation of the original weight matrix (i.e., under a 2-bit constraint we would have 2 coefficients and 2 binary matrices; each matrix is multiplied by its coefficient and the results are summed). See Figure 1 for a clearer visualization.

[Figure 1: Illustration of quantized matrix-vector multiplication]

Quantization as implemented by Chen Xu et al. is done as follows: given a weight matrix W and a desired number of bits, the matrix is decomposed into a linear combination of coefficients (in high resolution, e.g. 16 or 32 bits) and binary -1/1 matrices of the original size of W.

To do so, we first initialize the binary matrices and coefficients with a greedy approximation (as suggested by Guo et al., 2017 [2]). Roughly, we iterate over the binary matrices, initializing each one as the sign of the residual of the weight it should express (the residual starts as the original weight and at each step decreases by that step's coefficient multiplied by the sign), with each coefficient set to the mean absolute value of the residual, as follows:

[Equation: B_i = sign(R_{i-1}), alpha_i = mean(|R_{i-1}|), R_i = R_{i-1} - alpha_i * B_i, with R_0 = W]
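A minimal NumPy sketch of that greedy initialization (our own illustration of the procedure, not the authors' code; for simplicity it uses a single coefficient per tensor rather than per output filter):

import numpy as np

def greedy_init(w, num_bits):
    # Greedy multi-bit approximation: w ~ sum_i alpha_i * B_i, with B_i in {-1, +1}.
    residual = w.copy()
    alphas, binaries = [], []
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1                       # keep entries strictly in {-1, +1}
        alpha = np.abs(residual).mean()     # coefficient = mean magnitude of the residual
        alphas.append(alpha)
        binaries.append(b)
        residual = residual - alpha * b     # remove what this bit already explains
    return np.array(alphas), np.stack(binaries)

w = np.random.randn(64, 3, 3).astype(np.float32)
alphas, Bs = greedy_init(w, num_bits=3)
w_hat = np.tensordot(alphas, Bs, axes=1)    # reconstruction of the weight tensor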

The next step refines the greedy approximation using the equation below, and then alternates: recalculating the binary matrices using binary search trees (or a closed-form solution when using 2 or fewer bits), and refitting the coefficients in turn.

[Equation: with the binary codes fixed, the coefficients are refit by least squares, alpha = (B^T B)^{-1} B^T w, where B stacks the vectorized binary matrices as columns]
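And a hedged sketch of the alternating refinement, continuing from the greedy initialization above (again ours, with a brute-force enumeration of the 2^k sign patterns standing in for the paper's binary-search-tree lookup):

from itertools import product
import numpy as np

def refit_alphas(w, Bs):
    # With the binary codes frozen, refit the coefficients by least squares.
    A = Bs.reshape(Bs.shape[0], -1).T                        # (num_weights, num_bits)
    alphas, *_ = np.linalg.lstsq(A, w.ravel(), rcond=None)
    return alphas

def refit_binaries(w, alphas):
    # With the coefficients frozen, give each weight the closest of the 2^k sign patterns.
    k = len(alphas)
    codes = np.array(list(product([-1.0, 1.0], repeat=k)))   # (2^k, k)
    values = codes @ alphas                                   # value represented by each pattern
    idx = np.abs(w.ravel()[:, None] - values[None, :]).argmin(axis=1)
    return codes[idx].T.reshape((k,) + w.shape)

for _ in range(3):                                            # a few alternations usually suffice
    alphas = refit_alphas(w, Bs)
    Bs = refit_binaries(w, alphas)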

Compression estimation

For the sake of brevity, recall that an n-bit quantization means, per output filter, a set of n coefficient vectors (alpha) with the length of the input-channel dimension (64, for example) plus n binary matrices of the original shape (e.g. 64x3x3), the B tensor. For example, with 3-bit quantization, a CNN kernel of [64, 64, 3, 3] float32 elements decomposes into 3x[64, 64, 3, 3] binary elements plus 3x[64, 64, 1] float32/16 elements. Comparing memory, we end up at about 42% of the original size.

[Figure: Memory comparison for different kernels and varying bit-widths]
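As a quick sanity check of that figure (a back-of-the-envelope estimate of ours, assuming 32-bit coefficients), the ratio can be computed directly from the tensor shapes:

def quantized_size_ratio(shape, num_bits, coeff_bits=32, orig_bits=32):
    # shape = (out_ch, in_ch, kH, kW); one coefficient per (out_ch, in_ch) pair per bit.
    out_ch, in_ch, kh, kw = shape
    original = out_ch * in_ch * kh * kw * orig_bits
    binary = num_bits * out_ch * in_ch * kh * kw * 1           # 1 bit per element per B_i
    coeffs = num_bits * out_ch * in_ch * coeff_bits
    return (binary + coeffs) / original

print(quantized_size_ratio((64, 64, 3, 3), num_bits=3))        # ~0.43 of the original size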

Weighted refined greedy approximation

Under the theoretical assumption that not all weights were created equal (in terms of importance), we first need a way to push the quantization process to put more "effort" into the elements of the filter that matter more to the overall layer activation output. In other words, we acknowledge that each weight has a different importance, so we care less if some of them are poorly captured by the quantization algorithm.

To do so, in addition to the quantization minimization process described above, we incorporate an importance weighting (or highlighting) of the soon-to-be-compressed weight matrix, pushing towards a quantized representation biased by the given weights and therefore capturing the "important" weights better than others (we will elaborate later on what "important" means).

Let's define U, a heat-map matrix:

[Equations: definition of the heat-map matrix U]

Hence,

[Equation: the resulting weighted quantization objective]

Thus, if we feed the above algorithm a weighting matrix with the shape of the original matrix (i.e., a heat-map-like matrix) pointing to the elements it should quantize more accurately (e.g. because they play a greater role in the layer's activation output), we would theoretically obtain a quantized representation that serves that purpose better.
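A minimal sketch of how such a heat-map U can enter the fit (our interpretation of the weighted scheme, shown only for the coefficient step): the least-squares refit simply becomes a weighted least-squares refit, so elements with a larger U pull the approximation towards themselves. The binary-code step can be weighted analogously.

import numpy as np

def refit_alphas_weighted(w, Bs, U):
    # Weighted least squares: minimize sum_i U_i * (w_i - sum_j alpha_j * B_{j,i})^2.
    A = Bs.reshape(Bs.shape[0], -1).T                  # (num_weights, num_bits)
    sw = np.sqrt(U.ravel())                            # fold the element weights into both sides
    alphas, *_ = np.linalg.lstsq(A * sw[:, None], w.ravel() * sw, rcond=None)
    return alphas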

But which are the important weights?

This is probably the most interesting question: how can we spot the weights that matter more to the model's task (e.g. classifying objects or translating text)?

Let's start with a recap of Han et al.'s [3] work on pruning and quantization. In the pruning part, they observed that if we plot a histogram of the weight values, most of the weights lie around zero, and if we zero-out some percentage of them, the network's accuracy barely decreases; those weights therefore have less impact on the overall network.

In the quantization case, the minimization scheme tries to make the distance between the weights before and after quantization as small as possible (i.e., keep the matrix as close as possible to the original). Since most of the weights lie around zero, the quantization scheme focuses on this region even though it contributes little to the network's performance.

L1-Norm Weighting

Inspired by the pruning observation above, we try a method of weighting the kernels using an L1 norm over the weight matrix and feeding its output to the quantization module as a weighting term. The underlying assumption of this scheme is that larger-valued weights are more important, so quantizing these values accurately should be a higher priority.
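A short sketch of one way to build such a heat-map (the exact granularity, per kernel here, is our assumption): take the per-kernel L1 norms of the layer, normalize them, and broadcast them back to the weight shape so they can be fed to the weighted fit above.

import numpy as np

def l1_heatmap(w):
    # w: conv weight of shape (out_ch, in_ch, kH, kW); one importance score per 2D kernel.
    norms = np.abs(w).sum(axis=(2, 3), keepdims=True)   # per-kernel L1 norm
    U = norms / norms.mean()                            # normalize so the average weight is 1
    return np.broadcast_to(U, w.shape).copy()           # same shape as the weight tensor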

Self-attention Weighting

Another weighting approach, inspired by the rise of attention models (and more specifically self-attention schemes), is to implement a self-attention module that processes each weight matrix to generate "heatmaps" of the relative importance of the different kernels. These heatmaps then serve as regression weights for the quantization process described above and are trained against the activation outputs of the original network.

But we will cover this optimization method and others in the next article.

Experiments

Like everything in the deep learning field, theory is fine, but no one will believe you without experiments. :)

We will evaluate the weighted quantization scheme on three types of computer vision tasks:

Image classification

To evaluate the quantization scheme for image classification, we trained a ResNet18 and a ResNet50 on the CIFAR10 dataset, reaching a top-1 accuracy of 89% (with more time and a hyperparameter search it would probably be possible to obtain better results).

Now that we have a trained model, we can apply the multi-bit weighted quantization and compare it to vanilla quantization (i.e., without weighting) to check whether it makes any difference.
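In practice, the post-training procedure amounts to walking over the convolutional layers of the trained model and swapping each weight tensor for its quantized reconstruction. A rough PyTorch sketch (the helper names are ours: quantize_tensor stands in for the alternating multi-bit fit sketched earlier, and l1_heatmap for the L1 weighting described above):

import torch
import torch.nn as nn
import torchvision

def quantize_model(model, num_bits=3, weighted=True):
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            w = module.weight.data.cpu().numpy()
            U = l1_heatmap(w) if weighted else None             # importance heat-map (or none)
            w_hat = quantize_tensor(w, num_bits, U)             # hypothetical multi-bit reconstruction
            module.weight.data = torch.from_numpy(w_hat).float().to(module.weight.device)
    return model

model = torchvision.models.resnet18(num_classes=10)
# model.load_state_dict(torch.load("resnet18_cifar10.pt"))      # hypothetical trained checkpoint
quantize_model(model, num_bits=3, weighted=True)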

In the table below we tested our model with 2-4 bit quantization on all the CNN layers.

[Table: ResNet18 classification accuracy]
[Table: ResNet50 classification accuracy]

As can be seen, we were able to obtain an increase in performance when using the L1 weighting method.

Super-resolution

Next we would like to evaluate different tasks, such as super-resolution and style transfer. We tested them on the COCO dataset (a large-scale dataset usually used for object detection, segmentation, and captioning) and on the publicly available WIKI faces dataset of face images.

To evaluate super-resolution and style-transfer networks we cannot measure accuracy (there is no such notion in this domain), so we use the MSE score:

[Equation: MSE loss for super-resolution evaluation, MSE = (1/N) * sum_i (y_i - y_hat_i)^2]

But evaluating high-resolution outputs with MSE alone is not enough: frames that are still fairly blurry can achieve a small error yet look noticeably bad to the human eye. So we need an additional metric, and together the two give a better evaluation.

The Structural Similarity Index Measure (SSIM) is a more perceptually meaningful comparison used for measuring the similarity between two images.

Structural Similarity Index:

[Equation: SSIM mathematical definition, SSIM(x, y) = ((2*mu_x*mu_y + c1)(2*sigma_xy + c2)) / ((mu_x^2 + mu_y^2 + c1)(sigma_x^2 + sigma_y^2 + c2))]
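For completeness, a short evaluation sketch (assuming scikit-image is available; structural_similarity is its SSIM implementation, with channel_axis named multichannel in older releases) comparing a reference frame against the quantized network's output:

import numpy as np
from skimage.metrics import structural_similarity

def evaluate_pair(img_ref, img_test):
    # img_ref, img_test: HxWxC uint8 images on the same scale.
    mse = np.mean((img_ref.astype(np.float64) - img_test.astype(np.float64)) ** 2)
    ssim = structural_similarity(img_ref, img_test, channel_axis=-1, data_range=255)
    return mse, ssim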

In the table below we tested our model using the two metrics above, with 3-bit quantization on all the CNN layers.

[Table: MSE and SSIM for the original network, vanilla quantization, and L1-weighted quantization]

Visual results

[Figure: Visual results for super-resolution with L1-weighted quantization]

Style transfer

Similarly to super-resolution, we also evaluated the style-transfer model.

[Figure: Visual results for style transfer with L1-weighted quantization]

As can be seen, for the SR and ST tasks as well, weighted quantization improves both the MSE and the SSIM relative to vanilla quantization, without any noticeable loss in quality compared to the original model.

Conclusions

Model compression and weight quantization is a large and challenging task, more relevant than ever with the move to edge computing and the use of neural networks in more and more applications; as such, it is being researched by many and is the subject of many recent papers. The above is just the tip of the iceberg of what we believe could be further researched and leveraged to obtain better post-training compression and to enable the use of neural networks in ever more applications.

[1] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating multi-bit quantization for recurrent neural networks,” arXiv preprint arXiv:1802.00150, 2018.

[2] Y. Guo, A. Yao, H. Zhao, and Y. Chen, “Network sketching: Exploiting binary structure in deep cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955–5963, 2017.

[3] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.

Originally published at: https://medium.com/@matanweks/deep-dive-into-multi-bit-weighted-quantization-for-cnns-d2723afdc5db
