How to Quantize Neural Networks with TensorFlow (this is a 2016 blog post)

I’m pleased to say that we’ve been able to release a first version of TensorFlow’s quantized eight bit support. I was pushing hard to get it in before the Embedded Vision Summit, because it’s especially important for low-power and mobile devices, so it’s exciting to get it out there. All this documentation will be appearing on the main TensorFlow site also, but since I’ve talked so much about why eight-bit is important here, I wanted to give an overview of what we’ve released in this post too.

When modern neural networks were being developed, the biggest challenge was getting them to work at all! That meant that accuracy and speed during training were the top priorities. Using floating point arithmetic was the easiest way to preserve accuracy, and GPUs were well-equipped to accelerate those calculations, so it’s natural that not much attention was paid to other numerical formats.

These days, we actually have a lot of models being deployed in commercial applications. The computation demands of training grow with the number of researchers, but the cycles needed for inference expand in proportion to users. That means pure inference efficiency has become a burning issue for a lot of teams.

That is where quantization comes in. It’s an umbrella term that covers a lot of different techniques to store numbers and perform calculations on them in more compact formats than 32-bit floating point. I am going to focus on eight-bit fixed point, for reasons I’ll go into more detail on later.
A note on fixed-point vs. floating-point representation:
https://blog.csdn.net/qq_39378221/article/details/80658082
The textbook definition of floating-point representation is to encode a scale factor in the data in a suitable form, so that the position of the radix point can float as needed. A computer's word length is limited, so we can't devote an arbitrary number of bits to every value; a huge number like 2×10^99 simply can't be expressed in fixed point. Floating point extends the representable range within a limited number of bits while still keeping a reasonable amount of effective precision.

Why does Quantization Work?
Training neural networks is done by applying many tiny nudges to the weights, and these small increments typically need floating point precision to work (though there are research efforts to use quantized representations here too).

Taking a pre-trained model and running inference is very different. One of the magical qualities of deep networks is that they tend to cope very well with high levels of noise in their inputs. If you think about recognizing an object in a photo you’ve just taken, the network has to ignore all the CCD noise, lighting changes, and other non-essential differences between it and the training examples it’s seen before, and focus on the important similarities instead. This ability means that they seem to treat low-precision calculations as just another source of noise, and still produce accurate results even with numerical formats that hold less information.

Why Quantize?
Neural network models can take up a lot of space on disk, with the original AlexNet being over 200 MB in float format for example. Almost all of that size is taken up with the weights for the neural connections, since there are often many millions of these in a single model. Because they’re all slightly different floating point numbers, simple compression formats like zip don’t compress them well. They are arranged in large layers though, and within each layer the weights tend to be normally distributed within a certain range, for example -3.0 to 6.0.
How zip compression works: there are two kinds of repetition in computer data, and zip compresses both.
One is phrase-level repetition, i.e. repeats of three bytes or more. zip encodes such a repeat with two numbers: the distance from the current position back to the earlier occurrence, and the length of the repeat. If each number takes one byte, the data is compressed; this part is easy to understand.
The other is single-byte repetition: a byte has only 256 possible values, so this kind of repetition is unavoidable.
The simplest motivation for quantization is to shrink file sizes by storing the min and max for each layer, and then compressing each float value to an eight-bit integer representing the closest real number in a linear set of 256 within the range. For example with the -3.0 to 6.0 range, a 0 byte would represent -3.0, a 255 would stand for 6.0, and 128 would represent about 1.5. I’ll go into the exact calculations later, since there’s some subtleties, but this means you can get the benefit of a file on disk that’s shrunk by 75%, and then convert back to float after loading so that your existing floating-point code can work without any changes.
Here I had a question: does this only shrink the file? Is everything still converted back to float at inference time?
If so, it would do nothing for inference itself!
I think the point is that the model is converted to the quantized form and then converted back to float after loading, so the existing floating-point code stays unchanged.
That's the only reading that makes sense to me.
For example: post-training quantization is one approach, and quantization-aware training is another.
The former quantizes a float pb model that has already been trained; the latter applies quantization during training.
The former is the more practical option, of course, since you usually can't retrain a large model yourself.
After converting a pb model to tflite, the file size drops to about 1/4 of the original; the pb file itself is unchanged.
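To make the mapping concrete, here is a minimal NumPy sketch of the linear scheme described above. The helper names are mine, not the actual TensorFlow conversion code: each layer stores its float min and max, every weight is mapped to the nearest of 256 evenly spaced values, and loading converts back to float so existing float code runs unchanged.

```python
import numpy as np

def quantize_layer(weights):
    """Map a float array onto 256 evenly spaced values between its min and max."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0          # assumes the layer isn't constant
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, w_min, w_max

def dequantize_layer(q, w_min, w_max):
    """Recover approximate floats so existing floating-point code runs unchanged."""
    scale = (w_max - w_min) / 255.0
    return q.astype(np.float32) * scale + w_min

weights = np.random.normal(loc=1.5, scale=1.5, size=(1000,)).astype(np.float32)
q, w_min, w_max = quantize_layer(weights)
restored = dequantize_layer(q, w_min, w_max)
print(q.nbytes / weights.nbytes)         # 0.25: the uint8 payload is a quarter of float32
print(np.abs(restored - weights).max())  # worst-case error is about half a quantization step
```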

Another reason to quantize is to reduce the computational resources you need to do the inference calculations, by running them entirely with eight-bit inputs and outputs. This is a lot more difficult since it requires changes everywhere you do calculations, but offers a lot of potential rewards. Fetching eight-bit values only requires 25% of the memory bandwidth of floats, so you’ll make much better use of caches and avoid bottlenecking on RAM access. You can also typically use SIMD operations that do many more operations per clock cycle. In some cases you’ll have a DSP chip available that can accelerate eight-bit calculations too, which can offer a lot of advantages.

Moving calculations over to eight bit will help you run your models faster, and use less power (which is especially important on mobile devices). It also opens the door to a lot of embedded systems that can’t run floating point code efficiently, so it can enable a lot of applications in the IoT world.

Why Not Train in Lower Precision Directly?
There have been some experiments training at lower bit depths, but the results seem to indicate that you need higher than eight bit to handle the back propagation and gradients. That makes implementing the training more complicated, and so starting with inference made sense. We also already have a lot of float models already that we use and know well, so being able to convert them directly is very convenient.
My understanding is that quantization-aware training quantizes during the training process.
I've also come across the term "fake quantization".
Putting that together with this section, I think fake quantization means training still computes in floating point, and only the saved model uses fixed point.

How Does the Quantization Process Work?
We’ve implemented quantization by writing equivalent eight-bit versions of operations that are commonly used during inference. These include convolution, matrix multiplication, activation functions, pooling operations and concatenation. The conversion script first replaces all the individual ops it knows about with quantized equivalents. These are small sub-graphs that have conversion functions before and after to move the data between float and eight-bit. Below is an example of what they look like. First here’s the original Relu operation, with float inputs and outputs:

Then, this is the equivalent converted subgraph, still with float inputs and outputs, but with internal conversions so the calculations are done in eight bit.
I don't quite understand why the inputs and outputs here have to stay in float.
Why not use eight bits all the way through? Is it to keep a wider representable range?
The wider the numeric range, the more information it can carry.
[Figure: the converted Relu subgraph, with Quantize and Dequantize ops wrapping the eight-bit computation]
The min and max operations actually look at the values in the input float tensor, and then feeds them into the Dequantize operation that converts the tensor into eight-bits. There’s more details on how the quantized representation works later on.
I think "Dequantize" above is a typo by the author and should be "Quantize": this step converts the float tensor into eight bits, so it quantizes first and dequantizes later.
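The real converted subgraph uses TensorFlow’s quantized kernels, but the idea can be sketched in plain NumPy (the helper names below are mine, not the actual ops): quantize the float input, run a Relu directly on the eight-bit values by clamping at the code point that represents 0.0, then dequantize, and the result is close to the ordinary float Relu.

```python
import numpy as np

def quantize(x, x_min, x_max):
    scale = (x_max - x_min) / 255.0
    return np.round((x - x_min) / scale).astype(np.uint8)

def dequantize(q, x_min, x_max):
    scale = (x_max - x_min) / 255.0
    return q.astype(np.float32) * scale + x_min

def quantized_relu(q, x_min, x_max):
    # Relu clamps everything below 0.0; in the quantized domain, 0.0 sits at the
    # code point round(-x_min / scale), so we clamp to that value instead.
    scale = (x_max - x_min) / 255.0
    zero_point = np.uint8(np.round(-x_min / scale))
    return np.maximum(q, zero_point)

x = np.linspace(-3.0, 6.0, 7, dtype=np.float32)   # float input to the subgraph
x_min, x_max = -3.0, 6.0                          # what the Min/Max ops would report
y = dequantize(quantized_relu(quantize(x, x_min, x_max), x_min, x_max), x_min, x_max)
print(np.round(y, 2))   # close to np.maximum(x, 0.0), up to quantization error
```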

Once the individual operations have been converted, the next stage is to remove unnecessary conversions to and from float. If there are consecutive sequences of operations that all have float equivalents, then there will be a lot of adjacent Dequantize/Quantize ops. This stage spots that pattern, recognizes that they cancel each other out, and removes them, like this:
[Figure: adjacent Dequantize/Quantize pairs cancel out and are removed from the graph]
This makes sense: the output of the matmul feeds straight into the relu, so the eight-bit output of QuantizedMatMul doesn't need to be dequantized and then re-quantized before entering QuantizedRelu; it can be passed in directly.
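The rewrite pass can be pictured as a simple pattern scan over the sequence of ops. Here is a toy sketch of that idea (a flat list of op names, nothing like the real TensorFlow graph transform):

```python
def remove_redundant_conversions(ops):
    """Drop adjacent Dequantize -> Quantize pairs, since they cancel each other out."""
    result, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "Dequantize" and ops[i + 1] == "Quantize":
            i += 2  # the pair goes 8-bit -> float -> 8-bit, so both can be skipped
        else:
            result.append(ops[i])
            i += 1
    return result

graph = ["Quantize", "QuantizedMatMul", "Dequantize", "Quantize", "QuantizedRelu", "Dequantize"]
print(remove_redundant_conversions(graph))
# ['Quantize', 'QuantizedMatMul', 'QuantizedRelu', 'Dequantize']
```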

Applied on a large scale to models where all of the operations have quantized equivalents, this gives a graph where all of the tensor calculations are done in eight bit, without having to convert to float.
When I converted an MNIST model that uses a CNN, I found that the RandomUniform op behind keep_prob (dropout) is not supported.
I'll need to figure out later how to deal with unsupported ops like this.
I was using tf-nightly 1.13 at the time.
https://blog.csdn.net/yuanlulu/article/details/85731488

What Representation is Used for Quantized Tensors?
We approach converting floating-point arrays of numbers into eight-bit representations as a compression problem. We know that the weights and activation tensors in trained neural network models tend to have values that are distributed across comparatively small ranges (for example you might have -15 to +15 for weights, -500 to 1000 for activations on an image model, though the exact numbers will vary). We also know from experiment that neural nets tend to be very robust in the face of noise, and so the noise-like error produced by quantizing down to a small set of values will not hurt the precision of the overall results very much. We also want to pick a representation that’s easy to perform calculations on, especially the large matrix multiplications that form the bulk of the work that’s needed to run a model.

These led us to pick a representation that has two floats to store the overall minimum and maximum values that are represented by the lowest and highest quantized value. Each entry in the quantized array represents a float value in that range, distributed linearly between the minimum and maximum. For example, if we have minimum = -10.0, and maximum = 30.0f, and an eight-bit array, here’s what the quantized values represent:
[Figure: for min = -10.0 and max = 30.0, the quantized value 0 represents -10.0, 255 represents 30.0, and 128 represents roughly 10.0]
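The mapping is just a linear interpolation between the stored min and max; here is a quick check of those three values (plain Python, illustrative only):

```python
range_min, range_max = -10.0, 30.0
step = (range_max - range_min) / 255.0   # each quantized level covers ~0.157 of float range

for q in (0, 128, 255):
    print(q, range_min + q * step)       # 0 -> -10.0, 128 -> ~10.08, 255 -> 30.0
```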
The advantages of this format are that it can represent arbitrary magnitudes of ranges, they don’t have to be symmetrical, it can represent signed and unsigned values, and the linear spread makes doing multiplications straightforward. There are alternatives like Song Han’s code books that can use lower bit depths by non-linearly distributing the float values across the representation, but these tend to be more expensive to calculate on.
The advantage of having a strong and clear definition of the quantized format is that it’s always possible to convert back and forth from float for operations that aren’t quantization-ready, or to inspect the tensors for debugging purposes. One implementation detail in TensorFlow that we’re hoping to improve in the future is that the minimum and maximum float values need to be passed as separate tensors to the one holding the quantized values, so graphs can get a bit dense!
Here I wonder whether you could simply leave the parts of a graph with unsupported ops unquantized; I don't know if that's possible.
What I've seen people suggest is that if an op isn't supported, you implement and contribute that op yourself.

How do we Determine Ranges?

The nice thing about the minimum and maximum ranges is that they can often be pre-calculated. Weight parameters are constants known at load time, so their ranges can also be stored as constants. We often know the ranges for inputs (for example, images are usually RGB values in the range 0.0 to 255.0), and many activation functions have known ranges too. This can avoid having to analyze the outputs of an operation to determine the range, which we need to do for math ops like convolution or matrix multiplication which produce 32-bit accumulated results from 8-bit inputs.

If you’re doing any kind of arithmetic on 8-bit inputs, you’ll naturally start to accumulate results that have more than 8 bits of precision. If you add two 8 bit values, the result needs 9 bits. If you multiply two 8 bit numbers, you get 16 bits in the output. If you total up a series of 8-bit multiplications, like we do for matrix multiplication, the results grow beyond 16 bits, with the accumulator typically needing at least 20 to 25 bits, depending on how long the dot products involved are.
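As a rough rule of thumb, a dot product of N such products needs about 16 + ceil(log2(N)) accumulator bits, which is where the 20 to 25 bit figure comes from. A small illustrative calculation (my own arithmetic, not anything from the TensorFlow code):

```python
import math

def accumulator_bits(dot_product_length, input_bits=8):
    """Bits needed to sum `dot_product_length` products of two unsigned
    `input_bits`-wide values without overflow."""
    product_bits = 2 * input_bits                           # 8 x 8 -> 16 bits
    return product_bits + math.ceil(math.log2(dot_product_length))

for n in (16, 256, 4096):
    print(n, accumulator_bits(n))   # 16 -> 20, 256 -> 24, 4096 -> 28 bits
```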

This can be an issue for our quantization approach, since we need to take an output that’s much wider than 8 bits and shrink it down to feed into the next operation. One way to do it for matrix multiplies would be to calculate the largest and smallest possible output values, assuming all of the input values were at extremes. This is safe, since we know mathematically that no results can fall outside this range, but in practice most weights and activation values are much more evenly distributed. This means that the actual range of values we see is much smaller than the theoretical one, so if we used the larger bounds we’d be wasting a lot of our 8 bits on numbers that never appeared. Instead, we use the QuantizeDownAndShrinkRange operator to take a 32-bit accumulated tensor, analyze it to understand the actual ranges used, and rescale so that the 8-bit output tensor uses that range effectively. There are strategies that involve observing the actual minimums and maximums encountered with large sets of training data, and hard-coding those to avoid analyzing the buffer for ranges every time, but we don’t currently include that optimization.
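Conceptually, the re-quantization step looks something like the following NumPy sketch (my own approximation, not the actual QuantizeDownAndShrinkRange kernel): look at the range the 32-bit accumulator actually used, then rescale into eight bits so the whole 0–255 range is spent on values that really occur.

```python
import numpy as np

def requantize_down(acc32):
    """Rescale a 32-bit accumulated tensor into 8 bits using its observed range."""
    actual_min, actual_max = float(acc32.min()), float(acc32.max())
    scale = (actual_max - actual_min) / 255.0     # assumes the tensor isn't constant
    q8 = np.round((acc32 - actual_min) / scale).astype(np.uint8)
    return q8, actual_min, actual_max             # min/max travel alongside the tensor

acc = np.random.randint(-50_000, 120_000, size=(4, 4), dtype=np.int32)
q8, out_min, out_max = requantize_down(acc)
print(out_min, out_max, q8.dtype)   # the 8-bit output now uses the real range effectively
```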

How is the Rounding Done?
One of the hardest and most subtle problems we hit during quantization was the accumulation of biases. As I mentioned above, neural networks are very resilient to noise, but unless you’re very careful with rounding it’s easy to introduce biases in a single direction that build up during computation and wreck the final accuracy. You can see the final formula in the code, but the important part was that we needed to subtract the rounded version of the minimum from the rounded version of the float input value, rather than subtracting float minimum from the input and then rounding.
So it's essentially a rounded value minus another rounded value, which to some degree cancels out the accumulated bias that rounding would otherwise introduce.
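A tiny numeric experiment makes the point visible. This is illustrative only, not the exact TensorFlow formula: both variants reconstruct with the same integer zero point, and the naive order of operations leaves a constant offset that would accumulate over many operations.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.uniform(-1.0, 5.0, size=100_000)

range_min, range_max = -1.23, 6.0      # a minimum that doesn't land exactly on the grid
scale = (range_max - range_min) / 255.0
zero = np.round(range_min / scale)     # rounded version of the minimum

# Recommended: subtract the rounded minimum from the rounded input.
q_good = np.round(v / scale) - zero
# Naive: subtract the float minimum first, then round.
q_bad = np.round((v - range_min) / scale)

# Reconstruct with the same integer zero point and compare the mean error.
print(((q_good + zero) * scale - v).mean())  # close to zero
print(((q_bad + zero) * scale - v).mean())   # a consistent offset of roughly +0.01
```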

What’s Next?
We’ve found that we can get extremely good performance on mobile and embedded devices by using eight-bit arithmetic rather than floating-point. You can see the framework we use to optimize matrix multiplications at gemmlowp. We still need to apply all the lessons we’ve learned to the TensorFlow ops to get maximum performance on mobile, but we’re actively working on that. Right now, this quantized implementation is a reasonably fast and accurate reference implementation that we’re hoping will enable wider support for our eight-bit models on a wider variety of devices.

If you’re interested, I highly recommend digging through the quantization code in TensorFlow, especially looking at the kernels that implement quantized ops. These all include reference implementations that we’re hoping will help portability to new hardware devices.

We also hope that this demonstration will encourage the community to explore what’s possible with low-precision neural networks. Thanks to everyone who helped put the quantization support together, it’s been great getting this out there!
