Paper Background
Paper: Pareto-Optimal Quantized ResNet Is Mostly 4-bit
Authors' contacts: amirali.abdolrashidi@email.ucr.edu, {fwanglisa, shivaniagrawal, malmaud, rybakov, cleichner, lewg}@google.com
A joint publication by Google and UC Riverside (UCR).
Venue: CVPR Workshops 2021
Abstract
In this work, we use ResNet as a case study to systematically investigate the effects of quantization on inference compute cost-quality tradeoff curves. Our results suggest that for each bfloat16 ResNet model, there are quantized models with lower cost and higher accuracy; in other words, the bfloat16 compute cost-quality tradeoff curve is Pareto-dominated by the 4-bit and 8-bit curves, with models primarily quantized to 4-bit yielding the best Pareto curve.
Introduction
Some quantization algorithms add new hyperparameters and introduce extra trainable parameters, which motivated the authors to focus on methods that provide clear benefits while adding as little complexity as possible.
The authors set out to understand how different quantization precisions affect the compute cost vs. accuracy trade-off curve, and to find a simple strategy for compressing a model under varying compute cost and quality requirements.
Background
The case study is the ResNet-50 v1.5 model from MLPerf.
Quantization Details
The focus is the trade-off between cost and quality at different precisions, and on practical, hardware-friendly ways to quantize the model.
Both activations (A) and weights (W) are quantized, with emphasis on 4-bit and 8-bit, using quantization-aware training (QAT).
Uniform Quantization
Both A and W are quantized per-channel.
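A minimal sketch of symmetric uniform quantization with per-channel clipping bounds, in the spirit of the setup above; the function name `quantize_dequantize`, the `channel_axis` argument, and the epsilon guard are illustrative assumptions, not the paper's library API.

```python
import jax.numpy as jnp

def quantize_dequantize(x, bits, channel_axis=-1):
    """Fake-quantize x to signed `bits`-bit values with per-channel scales."""
    # Per-channel clipping bound: max(|x|) reduced over all other axes.
    axes = tuple(a for a in range(x.ndim) if a != channel_axis % x.ndim)
    bound = jnp.max(jnp.abs(x), axis=axes, keepdims=True)
    max_int = 2.0 ** (bits - 1) - 1.0          # 7 for 4-bit, 127 for 8-bit
    scale = jnp.maximum(bound, 1e-12) / max_int
    # Snap to the integer grid, then map back to floats for QAT.
    return jnp.clip(jnp.round(x / scale), -max_int, max_int) * scale
```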
Calibration of Clipping Bounds
Automatically picking bounds during training
Quantizing activations:
- During the first N training steps, activations are not quantized; instead max(abs(x)) is computed and its exponential moving average (EMA) is tracked.
- The clipping bound is then set to that EMA value and activation quantization is enabled. Notably, the activation clipping bounds are calibrated only once during training; the authors found that calibrating repeatedly creates a feedback loop, which makes the model sensitive to the EMA hyperparameters. N itself is a hyperparameter, but the model is not very sensitive to its choice: anywhere from 10% to 40% of the training steps works, and the authors use 20%.
For weights, max(abs(w)) is used directly as the clipping bound (a sketch of the calibration follows below).
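A minimal sketch of the one-shot activation-bound calibration, assuming hypothetical names `ema_decay` and `calib_steps`; plain Python control flow is used for clarity (real training code would carry the EMA in the model state and branch outside the jitted step).

```python
import jax.numpy as jnp

def activation_bound(ema, x, step, ema_decay=0.999, calib_steps=1000):
    """Track an EMA of max(abs(x)); freeze it as the clip bound afterwards."""
    if step < calib_steps:
        batch_max = jnp.max(jnp.abs(x))
        ema = ema_decay * ema + (1.0 - ema_decay) * batch_max
        return ema, None   # still calibrating: activations stay unquantized
    return ema, ema        # bound was calibrated once and stays frozen
```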
Quantization Library in JAX and Flax
Built on the JAX framework and the Flax library.
features:
- Quantized JAX dot and Flax layers
- Multiple quantization strategies
- Flexible configuration system
- Support for unsigned and signed quantization
- What you train is what you serve
The Accurate Quantized Training (AQT) method guarantees that the forward pass at training time is identical to the forward pass at inference time: matrix multiplications are carried out on integer operands, which preserves the quality of the quantized model. It also saves training time and simplifies the compilation logic, since there is no need to convert the training graph into a separate inference graph.
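A minimal sketch of the AQT idea, where the matmul operates on integer-valued tensors so the training forward pass matches serving; this illustrates the concept only and is not the actual API of the paper's library. The scales are assumed to come from the calibration described earlier.

```python
import jax.numpy as jnp

def aqt_dot(x, w, x_scale, w_scale, bits=8):
    """Quantize both operands, multiply on the integer grid, rescale once."""
    max_int = 2.0 ** (bits - 1) - 1.0
    xq = jnp.clip(jnp.round(x / x_scale), -max_int, max_int)
    wq = jnp.clip(jnp.round(w / w_scale), -max_int, max_int)
    # On real hardware xq and wq would be int8/int4 tensors fed to an integer
    # matmul unit; the single float rescale happens on the accumulator output.
    return jnp.dot(xq, wq) * (x_scale * w_scale)
```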
Cost Models for Quantized Neural Networks
- Compute cost can be modeled as scaling linearly with the bit width.
- An energy-based model is quadratic instead: multiplying two n-bit numbers requires on the order of n*n adders.
- The cost of the whole network is approximated as the sum of the per-layer costs (a sketch follows this list).
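A small sketch of this cost model; the MAC counts below are made-up placeholders, not the paper's numbers.

```python
def mac_cost(w_bits, a_bits, quadratic=True):
    # Quadratic (energy) model: an n-bit by n-bit multiply needs ~n*n adders.
    # The linear alternative charges cost proportional to the bit width.
    return w_bits * a_bits if quadratic else max(w_bits, a_bits)

# (MACs, weight bits, activation bits) per layer; placeholder values.
layers = [(1_000_000, 8, 8), (50_000_000, 4, 4), (2_000_000, 8, 8)]

# Whole-network cost is approximated as the sum of per-layer costs.
total = sum(macs * mac_cost(wb, ab) for macs, wb, ab in layers)
print(f"relative compute cost: {total:,}")
```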
Experiments and Results
The authors introduce a global filter multiplier, analogous to a pruning coefficient: the filter count of every conv layer is scaled by it to adjust the parameter count. Nine values are used: {0.5, 0.62, 0.75, 0.87, 1.0, 1.25, 1.5, 1.75, 2.0}. Combined with four quantization precisions (4-bit with 8-bit first and last layers, 4-bit, 8-bit, and bfloat16), each plot therefore requires 9 × 4 = 36 training runs.
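A quick sketch enumerating the sweep; the precision labels are shorthand for the four schemes above.

```python
from itertools import product

filter_multipliers = [0.5, 0.62, 0.75, 0.87, 1.0, 1.25, 1.5, 1.75, 2.0]
precisions = ["4bit_first_last_8bit", "4bit", "8bit", "bfloat16"]

configs = list(product(filter_multipliers, precisions))
assert len(configs) == 36   # one trained model per point on the tradeoff plot
```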
Based on the experimental results, the authors then propose a compression method with few hyperparameters (a configuration sketch follows the list):
- Quantize all layers to 4 bits, and first and last layers (conv init and dense) to 8 bits.
- Change the number of parameters with a global filter multiplier to achieve the desired tradeoff based on the compute cost/memory cost and quality requirements
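The recipe expressed as a minimal configuration sketch; the key names (`conv_init`, `dense`, `layer_overrides`) are illustrative, not from the paper's code.

```python
def make_recipe(filter_multiplier):
    """One-knob compression recipe: fix the precision scheme, scale the width."""
    return {
        "default_bits": 4,                       # quantize all layers to 4 bits
        "layer_overrides": {"conv_init": 8,      # except the first conv layer
                            "dense": 8},         # and the final dense layer
        "filter_multiplier": filter_multiplier,  # single cost/quality knob
    }
```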
Conclusion
For any bfloat16 model, quantization can always reduce compute cost while improving accuracy.
4-bit quantization (with 8-bit first and last layers) beats 8-bit at every cost level, so the authors conjecture that 4 bits is a better quantization width than 8 bits.
A simple recipe for compressing models via quantization is proposed.