TensorFlow Lite Quantization Principles

1 : Principles

The quantization formula:

r = S (q − z)

Here:

  • r is the real value (usually float32)
  • q is its quantized representation as a B-bit integer (uint8, uint32, etc.)
  • S (float32) and z (uint) are the factors by which we scale and shift the number line. z is the quantized ‘zero-point’ which will always map back exactly to 0.f.

Deriving the parameters:

Consider a floating point variable with range (Xmin, Xmax) that needs to be quantized to the range (0, N_levels − 1), where N_levels = 256 for 8 bits of precision. We derive two parameters: the scale (∆) and the zero-point (z), which map the floating point values to integers. The scale specifies the step size of the quantizer, and floating point zero maps to the zero-point. The zero-point is an integer, ensuring that zero is quantized with no error. This is important so that common operations like zero padding do not introduce quantization error.
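The derivation above can be sketched in a few lines of Python (the function names are illustrative, not from any library): choose the scale and zero-point from the float range, then quantize and dequantize a value.

```python
# Sketch of the scale/zero-point derivation above (names are illustrative).
# Maps floats in (x_min, x_max) to integers in [0, n_levels - 1].

def choose_qparams(x_min, x_max, n_levels=256):
    # Widen the range to contain 0 so that 0.0 maps exactly to an integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (n_levels - 1)   # step size (the Delta above)
    zero_point = int(round(-x_min / scale))    # integer, so q(0.0) is exact
    return scale, zero_point

def quantize(r, scale, zero_point, n_levels=256):
    q = int(round(r / scale)) + zero_point
    return max(0, min(n_levels - 1, q))        # clamp to [0, 255]

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)            # r = S * (q - z)

scale, zp = choose_qparams(-1.0, 1.0)
q = quantize(0.5, scale, zp)
print(q, dequantize(q, scale, zp))
```

Note that `quantize(0.0, ...)` round-trips to exactly 0.0, which is the zero-padding property the paragraph above calls out.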

Quantized convolution:
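Following the second paper listed under "References", the core of a quantized convolution (or matmul) can be sketched by substituting r = S (q − z) into the multiply-accumulate:

```latex
% Each operand in the accumulation is expressed as r = S (q - z):
r_3 = \sum_i r_1^{(i)} r_2^{(i)}
    = \sum_i S_1 \left(q_1^{(i)} - z_1\right) S_2 \left(q_2^{(i)} - z_2\right)

% Writing the output in its own quantized form r_3 = S_3 (q_3 - z_3)
% and solving for the integer output q_3:
q_3 = z_3 + \frac{S_1 S_2}{S_3} \sum_i \left(q_1^{(i)} - z_1\right)\left(q_2^{(i)} - z_2\right)
```

The only non-integer quantity left is the multiplier S1 S2 / S3, which in practice is applied as a fixed-point multiply and shift, so the accumulation itself runs entirely in integer arithmetic.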

2 : Quantization Approaches

1 . Post Training Quantization

(1) . Weight only quantization

(2) . Quantizing weights and activations

2 . Quantization Aware Training

1 . Post Training Quantization

In many cases, it is desirable to reduce the model size by compressing weights and/or quantizing both weights and activations for faster inference, without having to re-train the model. Post-training quantization techniques are simpler to use and allow for quantization with limited data.

Strictly speaking, with Post Training Quantization the computation is still carried out in float rather than int, so it can only reduce the model size; it does not improve speed.

(1) . Weight only quantization

A simple approach is to only reduce the precision of the weights of the network to 8 bits from float. Since only the weights are quantized, this can be done without requiring any validation data.

In this mode, the model's weights are quantized (compressed) to uint8, but during computation the weights are dequantized back to float.
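A minimal sketch of this mode (illustrative NumPy, not the TFLite kernels themselves): the weights are stored as uint8, roughly a 4x size reduction versus float32, but are dequantized back to float before the actual multiply, so the arithmetic stays in float.

```python
import numpy as np

def quantize_weights(w):
    # Per-tensor affine quantization of the weights to uint8 for storage.
    w_min, w_max = min(w.min(), 0.0), max(w.max(), 0.0)
    scale = (w_max - w_min) / 255.0
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def matmul_with_quantized_weights(x, q, scale, zero_point):
    # Dequantize first: the compute itself is still float.
    w_approx = scale * (q.astype(np.float32) - zero_point)
    return x @ w_approx

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3)).astype(np.float32)
x = rng.standard_normal((2, 4)).astype(np.float32)
q, s, z = quantize_weights(w)
print(np.abs(x @ w - matmul_with_quantized_weights(x, q, s, z)).max())
```

The printed value is the (small) accuracy cost of storing the weights in 8 bits; no calibration data is needed because only the weight range is used.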

(2) .Quantizing weights and activations

One can quantize a floating point model to 8-bit precision by calculating the quantizer parameters for all the quantities to be quantized. Since activations need to be quantized, calibration data is required in order to estimate the dynamic ranges of the activations.

In this mode, on top of weight quantization, kernels that support quantized execution quantize their inputs first, perform the activation computation in the quantized domain, and then dequantize back to float32; unsupported kernels simply compute in float32. This is somewhat faster than doing everything in float32.
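The calibration step can be sketched as follows (illustrative NumPy, with hypothetical helper names): the activation range comes from sample data, after which a supported kernel can accumulate in integers and dequantize only once at the end.

```python
import numpy as np

def qparams(x_min, x_max, n_levels=256):
    # Affine quantizer parameters, with 0.0 forced into the range.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (n_levels - 1)
    return scale, int(round(-x_min / scale))

def quant(x, scale, zp):
    return np.clip(np.round(x / scale) + zp, 0, 255).astype(np.int32)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 3)).astype(np.float32)
calibration = rng.standard_normal((100, 4)).astype(np.float32)  # sample inputs

# Calibration: the dynamic range of the activations comes from the data.
s_x, z_x = qparams(calibration.min(), calibration.max())
s_w, z_w = qparams(w.min(), w.max())

x = calibration[:2]
q_x, q_w = quant(x, s_x, z_x), quant(w, s_w, z_w)

# Integer-only accumulation, then a single float rescale by S1 * S2:
acc = (q_x - z_x) @ (q_w - z_w)
y = s_x * s_w * acc
print(np.abs(x @ w - y).max())
```

Inputs outside the calibrated range would be clipped, which is why the quality of the calibration data matters in this mode.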

2 . Quantization Aware Training

Quantization-aware training models quantization during training and can provide higher accuracies than post-training quantization schemes.

We model the effect of quantization using simulated quantization operations on both weights and activations. For the backward pass, we use the straight-through estimator to model quantization. Note that we use simulated quantized weights and activations for both forward and backward pass calculations.

In this mode, besides quantizing the weights, simulated quantization is performed during training to determine the max and min output of each op, so that the entire computation runs in uint8 not only during training but also at inference time. This not only compresses the model but also speeds up computation.
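The simulated ("fake") quantization mentioned above can be sketched like this (illustrative NumPy, not the TensorFlow ops): the forward pass rounds values onto the uint8 grid defined by the recorded (min, max), while the backward pass uses the straight-through estimator, i.e. it passes gradients through unchanged inside the range and zeroes them for clipped values, since round() itself has zero gradient almost everywhere.

```python
import numpy as np

def fake_quant(x, x_min, x_max, n_levels=256):
    # Forward pass: quantize then immediately dequantize, so the output is
    # float but constrained to the uint8 grid.
    scale = (max(x_max, 0.0) - min(x_min, 0.0)) / (n_levels - 1)
    zp = round(-min(x_min, 0.0) / scale)
    q = np.clip(np.round(x / scale) + zp, 0, n_levels - 1)
    return scale * (q - zp)

def fake_quant_grad(upstream, x, x_min, x_max):
    # Straight-through estimator: gradient ~= 1 inside [x_min, x_max],
    # 0 where the forward pass clipped.
    inside = (x >= x_min) & (x <= x_max)
    return upstream * inside

x = np.array([-1.2, -0.5, 0.0, 0.3, 0.9], dtype=np.float32)
xq = fake_quant(x, -1.0, 1.0)
print(xq)
print(fake_quant_grad(np.ones_like(x), x, -1.0, 1.0))
```

Because the forward pass already sees quantized values, the trained weights adapt to the precision loss, which is where the accuracy advantage over post-training schemes comes from.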

 

3 : Generating the .tflite File

https://blog.csdn.net/qq_16564093/article/details/78996563
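The linked post walks through the full workflow. As a rough sketch, the TF 1.x-era `tflite_convert` CLI could produce a fully quantized model from a frozen graph trained with fake-quant ops; the file names and the `input`/`output` tensor names below are placeholders for your own model.

```shell
# Convert a frozen GraphDef (trained with fake-quant ops) into a fully
# quantized .tflite file. Tensor names below are placeholders.
tflite_convert \
  --graph_def_file=frozen_model.pb \
  --output_file=model.tflite \
  --input_arrays=input \
  --output_arrays=output \
  --inference_type=QUANTIZED_UINT8 \
  --mean_values=128 \
  --std_dev_values=127
```

`--mean_values`/`--std_dev_values` describe how the uint8 input maps back to float; they must match the preprocessing used during training.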

4 : Caveats

(1) . Currently supported aware-quant ops: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/quantize.cc

(2) . Keras does not currently support aware-quant; that will have to wait until TensorFlow 2.0.

 

5 : References

(1) https://arxiv.org/pdf/1806.08342.pdf

(2) https://arxiv.org/pdf/1712.05877.pdf
