Paper Background
Paper: Pareto-Optimal Quantized ResNet Is Mostly 4-bit
Authors' contacts: amirali.abdolrashidi@email.ucr.edu, {fwanglisa, shivaniagrawal, malmaud, rybakov, cleichner, lewg}@google.com
A joint publication by Google and UC Riverside (UCR).
Venue: CVPR Workshops 2021
Abstract
In this work, we use ResNet as a case study to systematically investigate the effects of quantization on inference compute cost-quality tradeoff curves. Our results suggest that for each bfloat16 ResNet model, there are quantized models with lower cost and higher accuracy; in other words, the bfloat16 compute cost-quality tradeoff curve is Pareto-dominated by the 4-bit and 8-bit curves, with models primarily quantized to 4-bit yielding the best Pareto curve.
Introduction
Some quantization algorithms add new hyperparameters and introduce extra trainable parameters, which motivated the authors to focus on methods that provide clear benefits while adding as little complexity as possible.
The authors set out to understand how different quantization precisions affect the compute cost vs. accuracy trade-off curve, and to find a simple strategy for compressing a model under varying compute cost and quality requirements.
Background
The case study is the ResNet-50 v1.5 model from MLPerf.
Quantization Details
The focus is the trade-off between cost and quality at different precisions, and on practical, hardware-friendly ways to quantize the model.
Both activations (A) and weights (W) are quantized, with emphasis on 4-bit and 8-bit, using quantization-aware training (QAT).
Uniform Quantization
Both A and W are quantized per-channel.
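A minimal sketch of symmetric uniform quantization with per-channel clipping bounds, in the spirit of the setup above; the function name `quantize_dequantize`, the `channel_axis` argument, and the epsilon guard are illustrative assumptions, not the paper's library API.

```python
import jax.numpy as jnp

def quantize_dequantize(x, bits, channel_axis=-1):
    """Fake-quantize x to signed `bits`-bit values with per-channel scales."""
    # Per-channel clipping bound: max(|x|) reduced over all other axes.
    axes = tuple(a for a in range(x.ndim) if a != channel_axis % x.ndim)
    bound = jnp.max(jnp.abs(x), axis=axes, keepdims=True)
    max_int = 2.0 ** (bits - 1) - 1.0          # 7 for 4-bit, 127 for 8-bit
    scale = jnp.maximum(bound, 1e-12) / max_int
    # Snap to the integer grid, then map back to floats for QAT.
    return jnp.clip(jnp.round(x / scale), -max_int, max_int) * scale
```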
Calibration of Clipping Bounds
Automatically picking bounds during training
Quantizing activations:
- During the first N training steps, activations are not quantized; instead max(abs(x)) is computed and its exponential moving average (EMA) is tracked.
- The clipping bound is then set to that EMA value and activation quantization is enabled. Notably, the activation clipping bounds are calibrated only once during training; the authors found that calibrating repeatedly creates a feedback loop, which makes the model sensitive to the EMA hyperparameters. N itself is a hyperparameter, but the model is not very sensitive to its choice: anywhere from 10% to 40% of the training steps works, and the authors use 20%.
For weights, max(abs(w)) is used directly as the clipping bound (a sketch of the calibration follows below).
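A minimal sketch of the one-shot activation-bound calibration, assuming hypothetical names `ema_decay` and `calib_steps`; plain Python control flow is used for clarity (real training code would carry the EMA in the model state and branch outside the jitted step).

```python
import jax.numpy as jnp

def activation_bound(ema, x, step, ema_decay=0.999, calib_steps=1000):
    """Track an EMA of max(abs(x)); freeze it as the clip bound afterwards."""
    if step < calib_steps:
        batch_max = jnp.max(jnp.abs(x))
        ema = ema_decay * ema + (1.0 - ema_decay) * batch_max
        return ema, None   # still calibrating: activations stay unquantized
    return ema, ema        # bound was calibrated once and stays frozen
```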
Quantization Library in JAX and Flax
Built on the JAX framework and the Flax library.
features:
- Quantized JAX dot and Flax layers
- Multiple quantization strategies
- Flexible configuration system
- Support for unsigned and signed quantization
- What you train is what you serve
The Accurate Quantized Training (AQT) method guarantees that the forward pass at training time is identical to the forward pass at inference time: matrix multiplications are carried out on integer operands, which preserves the quality of the quantized model. It also saves training time and simplifies the compilation logic, since there is no need to convert the training graph into a separate inference graph.
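A minimal sketch of the AQT idea, where the matmul operates on integer-valued tensors so the training forward pass matches serving; this illustrates the concept only and is not the actual API of the paper's library. The scales are assumed to come from the calibration described earlier.

```python
import jax.numpy as jnp

def aqt_dot(x, w, x_scale, w_scale, bits=8):
    """Quantize both operands, multiply on the integer grid, rescale once."""
    max_int = 2.0 ** (bits - 1) - 1.0
    xq = jnp.clip(jnp.round(x / x_scale), -max_int, max_int)
    wq = jnp.clip(jnp.round(w / w_scale), -max_int, max_int)
    # On real hardware xq and wq would be int8/int4 tensors fed to an integer
    # matmul unit; the single float rescale happens on the accumulator output.
    return jnp.dot(xq, wq) * (x_scale * w_scale)
```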
Cost Models for Quantized Neural Networks
- Compute cost can be modeled as scaling linearly with the bit width.
- An energy-based model is quadratic instead: multiplying two n-bit numbers requires on the order of n*n adders.
- The cost of the whole network is approximated as the sum of the per-layer costs (a sketch follows this list).
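A small sketch of this cost model; the MAC counts below are made-up placeholders, not the paper's numbers.

```python
def mac_cost(w_bits, a_bits, quadratic=True):
    # Quadratic (energy) model: an n-bit by n-bit multiply needs ~n*n adders.
    # The linear alternative charges cost proportional to the bit width.
    return w_bits * a_bits if quadratic else max(w_bits, a_bits)

# (MACs, weight bits, activation bits) per layer; placeholder values.
layers = [(1_000_000, 8, 8), (50_000_000, 4, 4), (2_000_000, 8, 8)]

# Whole-network cost is approximated as the sum of per-layer costs.
total = sum(macs * mac_cost(wb, ab) for macs, wb, ab in layers)
print(f"relative compute cost: {total:,}")
```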
Experiments and Results
The authors introduce a global filter multiplier, analogous to a pruning coefficient: the filter count of every conv layer is scaled by it to adjust the parameter count. Nine values are used: {0.5, 0.62, 0.75, 0.87, 1.0, 1.25, 1.5, 1.75, 2.0}. Combined with four quantization precisions (4-bit with 8-bit first and last layers, 4-bit, 8-bit, and bfloat16), each plot therefore requires 9 × 4 = 36 training runs.
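A quick sketch enumerating the sweep; the precision labels are shorthand for the four schemes above.

```python
from itertools import product

filter_multipliers = [0.5, 0.62, 0.75, 0.87, 1.0, 1.25, 1.5, 1.75, 2.0]
precisions = ["4bit_first_last_8bit", "4bit", "8bit", "bfloat16"]

configs = list(product(filter_multipliers, precisions))
assert len(configs) == 36   # one trained model per point on the tradeoff plot
```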
Based on the experimental results, the authors then propose a compression method with few hyperparameters (a configuration sketch follows the list):
- Quantize all layers to 4 bits, and first and last layers (conv init and dense) to 8 bits.
- Change the number of parameters with a global filter multiplier to achieve the desired tradeoff based on the compute cost/memory cost and quality requirements
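The recipe expressed as a minimal configuration sketch; the key names (`conv_init`, `dense`, `layer_overrides`) are illustrative, not from the paper's code.

```python
def make_recipe(filter_multiplier):
    """One-knob compression recipe: fix the precision scheme, scale the width."""
    return {
        "default_bits": 4,                       # quantize all layers to 4 bits
        "layer_overrides": {"conv_init": 8,      # except the first conv layer
                            "dense": 8},         # and the final dense layer
        "filter_multiplier": filter_multiplier,  # single cost/quality knob
    }
```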
Conclusion
For any bfloat16 model, quantization can always reduce compute cost while improving accuracy.
4-bit quantization (with 8-bit first and last layers) beats 8-bit at every cost level, so the authors conjecture that 4 bits is a better quantization width than 8 bits.
A simple recipe for compressing models via quantization is proposed.