Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

大星小辰

Category: model quantization. Tags: deep learning, model quantization

Article link: https://blog.csdn.net/qq_28306361/article/details/101512378

Introduction

The paper introduces Ristretto, a framework for automated neural network approximation. It is open source and built on Caffe.

Convolutional Neural Networks

Layer Types

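Convolution layer: extracts features, but requires a large amount of computation; the number of operations is far larger than the number of parameters in the layer.

Fully connected layer: also extracts features, but accounts for a large share of the model's parameters.

Rectified Linear Unit (ReLU): lets the model learn non-linear features.

Normalization layers: Local Response Normalization (LRN) normalizes the feature maps; Batch Normalization is the other common choice. The values inside these layers differ from those in other layers by a large factor (on the order of $2^{14}$), so the paper concentrates on quantizing the parameters of the other layers.

Pooling: reduces the spatial size of the feature maps and encodes translation invariance; it also reduces the number of parameters and the computational complexity. Max pooling is the usual choice, with average pooling and L2-norm pooling as alternatives. Because the operation is simple, this layer is not quantized.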

Computational Complexity and Memory Requirements

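The complexity of deep CNNs comes from two parts: convolutional layers account for more than 90% of the arithmetic operations, while fully connected layers hold more than 90% of the network parameters. An efficient CNN accelerator therefore has to provide enough computational throughput and enough memory bandwidth that the processing units never sit idle. For this reason the paper focuses on how to quantize these two layer types.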
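To make the split between compute-heavy and parameter-heavy layers concrete, here is a minimal sketch that counts parameters and multiply-accumulate operations (MACs) for one convolutional and one fully connected layer. The function names and the layer shapes are my own illustrative assumptions, not figures taken from the paper.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a k x k convolution (bias included in params)."""
    weights = c_out * c_in * k * k
    params = weights + c_out
    macs = weights * h_out * w_out  # every kernel is applied at each output position
    return params, macs


def fc_cost(n_in, n_out):
    """Parameters and MACs of a fully connected layer (bias included in params)."""
    weights = n_in * n_out
    return weights + n_out, weights  # each weight is used exactly once


# Hypothetical shapes, chosen only for illustration.
conv_params, conv_macs = conv_cost(c_in=3, c_out=96, k=11, h_out=55, w_out=55)
fc_params, fc_macs = fc_cost(n_in=9216, n_out=4096)
print("conv: params =", conv_params, "MACs =", conv_macs)
print("fc:   params =", fc_params, "MACs =", fc_macs)
```

With these shapes the convolution has roughly a thousand times fewer parameters than the fully connected layer but needs several times more MACs, which is the imbalance described above.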

Neural Networks With Limited Numerical Precision

For a given full-precision network, quantization proceeds in three steps:

1: Quantization of the layer input and weights to reduced precision format (using m and n bits for number representation, respectively)

2: Perform the MAC (multiplication-and-accumulation) operations using the quantized values

3: The final result is again quantized
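The three steps above can be sketched for a single output neuron as follows. This is only a simulation in floating point under my own assumptions: the `quantize` helper, the fixed-point format it uses, and the default bit-widths and fractional lengths are all hypothetical choices, not values from the paper.

```python
def quantize(x, frac_bits, total_bits):
    """Round x to a signed fixed-point grid with saturation (assumed format)."""
    step = 2.0 ** -frac_bits
    q_max = (2 ** (total_bits - 1) - 1) * step
    q_min = -(2 ** (total_bits - 1)) * step
    q = round(x / step) * step
    return min(q_max, max(q_min, q))


def quantized_dot(inputs, weights, m=8, n=8, frac_in=4, frac_w=7, frac_out=4):
    # Step 1: quantize inputs to m bits and weights to n bits.
    xq = [quantize(x, frac_in, m) for x in inputs]
    wq = [quantize(w, frac_w, n) for w in weights]
    # Step 2: multiply-and-accumulate on the quantized values.
    acc = sum(x * w for x, w in zip(xq, wq))
    # Step 3: quantize the accumulated result again.
    return quantize(acc, frac_out, m)


print(quantized_dot([1.03, -2.5], [0.5, 0.25]))  # -0.125
```

The input 1.03 is rounded to 1.0 in step 1, so the result differs slightly from the full-precision dot product of -0.11.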

Rounding Schemes

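Round nearest even:
$\operatorname{round}(x)= \left\{\begin{array}{ll}{\lfloor x\rfloor,} & {\text { if }\lfloor x\rfloor \leq x \leq \lfloor x\rfloor+\frac{\epsilon}{2}} \\ {\lfloor x\rfloor+\epsilon,} & {\text { if }\lfloor x\rfloor+\frac{\epsilon}{2}<x \leq \lfloor x\rfloor+\epsilon}\end{array}\right. \tag{1}$
Here $\epsilon$ is the quantization step and $\lfloor x\rfloor$ is the largest multiple of $\epsilon$ that is less than or equal to $x$. This scheme is deterministic, so the paper uses it in the inference/test phase.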

Round stochastic:
$\operatorname{round}(x)= \left\{ \begin{array}{ll} {\lfloor x\rfloor,} & {\text { w.p. }\ \ 1-\frac{x-\lfloor x\rfloor}{\epsilon}} \\ {\lfloor x\rfloor+\epsilon,} & {\text { w.p. }\ \ \frac{x-\lfloor x\rfloor}{\epsilon}} \end{array} \right. \tag{2}$
Here w.p. stands for "with probability". The expected rounding error of this scheme is zero, i.e. $\mathbb{E}[\operatorname{round}(x)]=x$. The paper uses this rounding mode when fine-tuning the quantized network.

For training, a full-precision network is trained first, then quantized, and finally fine-tuned.
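A minimal sketch of the two rounding schemes, assuming $\lfloor x\rfloor$ means the largest multiple of the step $\epsilon$ not exceeding $x$. The function names are mine. The loop at the end checks empirically that stochastic rounding is unbiased.

```python
import math
import random


def round_nearest(x, eps):
    """Deterministic rounding to the nearest multiple of eps, as in Eq. (1)."""
    lower = math.floor(x / eps) * eps
    return lower if x - lower <= eps / 2 else lower + eps


def round_stochastic(x, eps, rng=random):
    """Round up with probability (x - floor(x)) / eps, as in Eq. (2)."""
    lower = math.floor(x / eps) * eps
    p_up = (x - lower) / eps
    return lower + eps if rng.random() < p_up else lower


rng = random.Random(0)  # fixed seed so the experiment is repeatable
samples = [round_stochastic(0.3, 0.25, rng) for _ in range(20000)]
print(round_nearest(0.3, 0.25))     # 0.25
print(sum(samples) / len(samples))  # close to 0.3 on average
```

Every individual stochastic result is either 0.25 or 0.5, yet the average stays near 0.3, which is why this mode suits fine-tuning while the deterministic mode suits inference.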

Related Work

Network Approximation

This section reviews several ways of obtaining an approximated network, including fixed point approximation, network pruning and shared weights, and binary networks.

Accelerator

This part covers hardware acceleration approaches. Since it is unrelated to the algorithms, I do not summarize it here.

Fixed Point Approximation

Baseline Convolutional Neural Networks

The baselines used in the experiments are LeNet, CIFAR-10 Full, CaffeNet, GoogLeNet, and SqueezeNet.

Fixed Point Format

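A fixed point number is written as $[\operatorname{IL}.\operatorname{FL}]$, where $\operatorname{IL}$ and $\operatorname{FL}$ are the integer length and the fractional length. Representing one value therefore takes $\operatorname{IL}+\operatorname{FL}$ bits. With round-nearest and two's complement representation, the largest representable value is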
$x_{max}=2^{IL-1}-2^{-FL} \tag{3}$
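As a quick check of Eq. (3), the sketch below converts real values to two's complement integer codes and back. The helper names are my own, and the Q9.7 format is used only as an example.

```python
def fixed_point_range(il, fl):
    """Smallest and largest value of a two's complement [IL.FL] number."""
    x_max = 2.0 ** (il - 1) - 2.0 ** -fl  # Eq. (3)
    x_min = -(2.0 ** (il - 1))
    return x_min, x_max


def to_fixed(x, il, fl):
    """Integer code of x in [IL.FL], rounded to nearest and saturated."""
    code = round(x * 2 ** fl)
    lo, hi = -(2 ** (il + fl - 1)), 2 ** (il + fl - 1) - 1
    return max(lo, min(hi, code))


def from_fixed(code, fl):
    """Real value represented by an integer code."""
    return code / 2 ** fl


print(fixed_point_range(9, 7))               # (-256.0, 255.9921875)
print(from_fixed(to_fixed(1000.0, 9, 7), 7))  # saturates at 255.9921875
print(from_fixed(to_fixed(0.003, 9, 7), 7))   # too small, becomes 0.0
```

The last two lines show the two failure modes of a single fixed-point format: large values saturate and small values collapse to zero.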

Dynamic Range of Parameters and Layer Outputs

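Dynamic Range in Small CNN: For LeNet, the author finds that parameter values are smaller than layer outputs. 99% of the network parameters lie between $2^0$ and $2^{-10}$, whereas for the fully connected layers 99% of the layer outputs lie between $2^{-4}$ and $2^5$.

Dynamic Range in Large CNN: For CaffeNet, parameter values are again smaller than layer outputs, but the gap between the two is larger. The author finds that 16-bit quantization (Q9.7) performs best, even though some layer outputs cannot be represented (0.46%) and 21.23% of the values are truncated to 0. The author concludes: "Similarly to the analysis with LeNet, large layer outputs are more important than small parameters."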

Results

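This section reports the quantization results: the quantization schemes and accuracy comparisons for LeNet on MNIST, CIFAR-10 Full on CIFAR-10, and CaffeNet on ImageNet. (The results figure from the original post is not available here.)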

Dynamic Fixed Point Approximation

Mixed Precision Fixed Point

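Different parts of the network are quantized at different precisions. In the paper's figure for this section (not reproduced here), $m$ and $n$ denote the number of bits used for a layer's outputs and weights, respectively.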

Dynamic Fixed Point

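Different parts of a CNN have different value ranges. In a large layer the outputs are the result of many accumulations, so the network parameters are much smaller than the layer outputs, while plain fixed point can only cover a limited range. Dynamic fixed point is a good solution to this problem. In dynamic fixed point, each number is represented as:
$n=(-1)^s \cdot 2^{-FL}\sum_{i=0}^{B-2}{2^i \cdot x_i} \tag{4}$
Here $B$ is the bit-width, $s$ is the sign bit, $FL$ is the fractional length, and $x$ are the mantissa bits. For each layer of the network, the values are split into two groups, one for the layer outputs and one for the weights. The two groups are quantized differently: each group gets its own fractional length (the paper illustrates this with a figure that is not reproduced here).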
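Choice of Number Format: To avoid saturation, the author uses enough bits for the integer part. For a set of values $S$, the integer length $IL$ is:
$IL=\left\lceil\log _{2}\left(\max _{S} x+1\right)\right\rceil \tag{5}$
This $IL$ is used when quantizing layer outputs. For weights, $IL$ is reduced by one, because experiments showed this to work slightly better.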
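Here is a small sketch of how a per-group fractional length could be derived from Eq. (5) and then used to quantize that group. Two things are my own assumptions rather than statements from the post: I take the maximum over absolute values, and I follow the layout of Eq. (4) (one sign bit plus $B-1$ magnitude bits), so that $FL = B - 1 - IL$.

```python
import math


def integer_length(values):
    """Eq. (5), applied to the largest magnitude in the group (assumption)."""
    return math.ceil(math.log2(max(abs(v) for v in values) + 1))


def dynamic_fixed_point(values, bit_width, is_weight=False):
    """Quantize one group; returns (fractional length, quantized values)."""
    il = integer_length(values)
    if is_weight:
        il -= 1  # weights use one integer bit less, as described above
    fl = bit_width - 1 - il  # assumed bookkeeping: one bit is the sign
    step = 2.0 ** -fl
    hi = (2 ** (bit_width - 1) - 1) * step  # largest magnitude in Eq. (4)
    quantized = [min(hi, max(-hi, round(v / step) * step)) for v in values]
    return fl, quantized


# Hypothetical values for one layer: outputs are large, weights are small.
print(dynamic_fixed_point([5.3, -120.7, 0.02], 8))
print(dynamic_fixed_point([0.4, -0.05, 0.003], 8, is_weight=True))
```

With the same 8-bit width the outputs end up with a fractional length of 0 and the weights with a fractional length of 7, which is the point of giving each group its own format.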

Results

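Impact of Dynamic Fixed Point: In the author's experiments on CaffeNet/AlexNet, both fixed point and dynamic fixed point work well at 18 bits. When the bit-width is reduced further, the accuracy of fixed point drops sharply while dynamic fixed point stays relatively stable. Dynamic fixed point is therefore the better choice for a large network like this one.

Quantization of Individual Network Parts: Using the three networks mentioned above, the author quantizes one part of the network at a time (layer outputs, convolutional kernels, fully connected layers) to 8-bit dynamic fixed point and measures the accuracy drop. Quantizing the layer outputs or the convolutional kernels costs only 0.3%, while quantizing the FC layers costs 0.9%.

Fine-tuned Dynamic Fixed Point Networks: The author analyzes accuracy after fine-tuning and finds that small networks lose little accuracy while large networks lose more. In my view this is because the large networks are evaluated on ImageNet; results on small test sets cannot fully settle the question.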

Minifloat Approximation

Motivation

Networks are trained in floating point, so would representing the values with smaller floating point numbers make the model smaller?

IEEE-754 Single Precision Standard

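According to the IEEE-754 standard, single precision numbers have 1 sign bit, 8 exponent bits, and 23 mantissa bits. The mantissa has an implicit leading bit set to 1, and the stored exponent is the true exponent plus a bias of 127. An exponent field of all zeros or all ones has a special meaning. All zeros represents either the number 0 or a denormalized number, depending on the mantissa bits. All ones represents positive/negative infinity or NaN.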
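The field layout can be inspected directly with Python's standard `struct` module. The sketch below splits a single precision value into its three fields and rebuilds a normal number from them; the helper names are mine.

```python
import struct


def decode_float32(x):
    """Split x (stored as IEEE-754 single precision) into its bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # stored with a bias of 127
    mantissa = bits & 0x7FFFFF      # 23 explicit mantissa bits
    return sign, exponent, mantissa


def rebuild_normal(sign, exponent, mantissa):
    """Value of a normal number: implicit leading 1, exponent minus bias."""
    return (-1.0) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


print(decode_float32(1.0))           # (0, 127, 0)
print(decode_float32(-0.15625))      # (1, 124, 2097152)
print(decode_float32(float("inf")))  # (0, 255, 0): all-ones exponent
print(rebuild_normal(1, 124, 2097152))  # -0.15625
```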

Minifloat Number Format

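When quantizing to fewer bits, the IEEE-754 layout can no longer be used as is, so the paper shrinks the exponent bias according to the number of bits assigned to the exponent:
$bias=2^{bits-1}-1 \tag{6}$
Here $bits$ is the number of bits assigned to the exponent (8 exponent bits give the IEEE-754 bias of 127). The format does not support denormalized numbers, positive/negative infinity, or NaN. Infinity is replaced by saturated numbers and denormalized numbers are replaced by 0; since the forward pass contains no division, NaN does not occur. Finally, the number of exponent bits and mantissa bits is found automatically by search.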

Network-specific Choice of Number Format: For the exponent, the author uses enough bits to avoid saturation.
$bits=\left\lceil\log _{2}\left(\log_{2}\left(\max _{S} x+1\right)\right)+1\right\rceil \tag{7}$
Here $S$ is the set of values being approximated.
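The following is a rough sketch of minifloat rounding built on Eq. (6) and Eq. (7) as written in this post. Several details are assumptions of mine and are not confirmed by the post: the usable exponent range (I take an IEEE-like range from `1 - bias` to `bias`), the use of round-to-nearest for the mantissa, and all function names.

```python
import math


def exponent_bias(exp_bits):
    return 2 ** (exp_bits - 1) - 1  # Eq. (6)


def exponent_bits(max_abs):
    """Eq. (7) as written above; max_abs must be positive."""
    return math.ceil(math.log2(math.log2(max_abs + 1)) + 1)


def minifloat_quantize(x, exp_bits, man_bits):
    """Round x to a minifloat; no denormals, infinity, or NaN."""
    if x == 0.0:
        return 0.0
    bias = exponent_bias(exp_bits)
    e_min, e_max = 1 - bias, bias  # assumed exponent range
    sign = -1.0 if x < 0 else 1.0
    frac, exp = math.frexp(abs(x))  # abs(x) = frac * 2**exp, 0.5 <= frac < 1
    e = exp - 1                     # exponent with a leading 1 before the point
    if e < e_min:
        return 0.0                  # would be denormal: replaced by 0
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** e_max
    scale = 2.0 ** (e - man_bits)   # spacing of representable values near x
    q = round(abs(x) / scale) * scale
    return sign * min(q, max_val)   # would overflow: saturate instead


print(exponent_bits(120.0))              # 4
print(minifloat_quantize(1.1, 4, 3))     # 1.125
print(minifloat_quantize(1e-9, 4, 3))    # 0.0
print(minifloat_quantize(-1e6, 4, 3))    # -240.0
```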

Data Path for Accelerator

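The network weights and layer inputs go through MAC operations. The inputs are in minifloat format, and the result of each multiplication is 3 bits wider than its inputs. The additions are carried out in full precision, and the accumulated result is finally quantized back to minifloat. The paper shows the whole data path in a figure (not reproduced here).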

Results

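The three CNN models mentioned above are quantized to 12-, 8-, and 6-bit minifloat. The results are acceptable, but somewhat worse than dynamic fixed point, and minifloat also needs more bits.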

Turning Multiplications Into Bit shifts

Multiplier-free Arithmetic

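The author points out that multipliers take up a lot of chip area, and therefore proposes replacing multiplications by using integer power of two weights. These weights can be seen as minifloat numbers with no mantissa bits, so each multiplication turns into a bit shift. For an ordinary convolution the output is computed as $z_i=\sum_{j}{x_{j} \cdot w_{j}}$. First, each parameter is approximated by the nearest power of two, as in Eq. $(8)$; the output can then be approximated with Eq. $(9)$. (I did not fully follow this part.)
$e_{j}=\operatorname{round}(\log_{2}(w_j)) \tag{8}$

$z_i\approx \sum_{j}{\left(x_j \ll e_j\right)} \tag{9}$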
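My reading of Eq. (8) and Eq. (9), as a minimal sketch on integer activations: the exponent is taken from the weight's magnitude, the sign is applied separately, and a negative exponent becomes a right shift. These details and the function names are my own assumptions, since the post only gives the two formulas.

```python
import math


def pow2_exponent(w):
    """Eq. (8) applied to the magnitude of a non-zero weight."""
    return round(math.log2(abs(w)))


def shift_mac(xs, ws):
    """Approximate sum(x * w) using shifts only; xs are integers."""
    acc = 0
    for x, w in zip(xs, ws):
        if w == 0:
            continue  # a zero weight contributes nothing
        e = pow2_exponent(w)
        term = x << e if e >= 0 else x >> -e  # negative exponent: shift right
        acc += term if w > 0 else -term       # sign handled outside the shift
    return acc


xs = [64, 32, 16]
ws = [0.5, -0.25, 0.13]  # 0.13 is approximated by 2**-3 = 0.125
print(shift_mac(xs, ws))                    # 26
print(sum(x * w for x, w in zip(xs, ws)))   # exact value, about 26.08
```

Since most weights have magnitude below 1, the exponents are negative in practice and the hardware operation is a right shift.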

Maximal Number of Shifts

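Most weights in the network lie in $[-1, 1]$, and most of them are close to 0. Quantizing to powers of two, however, has a fairly large effect on weights that are very close to +1 and -1. The author represents each power-of-two weight with 4 bits: the first bit is the sign, which leaves 8 different exponents, with the smallest magnitude being $2^{-8}$. Weights smaller than that have very little effect on network performance.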
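A possible 4-bit encoding, as a sketch only. The post says there are 8 exponents with a minimum of $2^{-8}$; I assume they form the contiguous range $2^{-8}$ to $2^{-1}$, and the bit layout (sign in the top bit, exponent index in the lower three) is also my own choice.

```python
import math

MIN_EXP, MAX_EXP = -8, -1  # assumed contiguous range of the 8 exponents


def encode_pow2(w):
    """4-bit code: bit 3 is the sign, bits 0-2 index the exponent."""
    sign = 1 if w < 0 else 0
    if w == 0:
        e = MIN_EXP  # zero has no code in this sketch; use the smallest magnitude
    else:
        e = min(MAX_EXP, max(MIN_EXP, round(math.log2(abs(w)))))
    return (sign << 3) | (e - MIN_EXP)


def decode_pow2(code):
    """Inverse of encode_pow2."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * 2.0 ** ((code & 0b111) + MIN_EXP)


print(decode_pow2(encode_pow2(0.9)))     # 0.5
print(decode_pow2(encode_pow2(-0.001)))  # -0.00390625
```

Under this assumed range a weight of 0.9 can only be stored as 0.5, which illustrates why weights near +1 and -1 are affected the most.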

Data Path for Accelerator

I did not fully follow this part, but it is similar to the earlier data path section and again describes the data flow.

Result

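Small networks lose little accuracy. Large networks lose somewhat more (3 to 4 points), but the author argues that this is still a good result given that only 4 bits are used for the exponent.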

Comparison of Different Approximations

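Fixed Point Approximation: has the lowest requirements in energy and development time, but also the worst performance.

Dynamic Fixed Point Approximation: gives the best performance and combines the advantages of fixed point and minifloat.

Minifloat Approximation: performs better than fixed point, but when the bit-width is reduced and too few bits are left for the exponent, performance drops sharply.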

Summary

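Dynamic fixed point comes out on top. It shows the best performance at low bit-widths, and although it needs more chip area than pure fixed point arithmetic, it is well suited to accelerating neural networks in hardware.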

Ristretto: An Approximation Framework for Deep CNNs

From Caffe to Ristretto

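Wikipedia: Ristretto is ‘a short shot of espresso coffee made with the normal amount of ground coffee but extracted with about half the amount of water’. In the same spirit, the framework removes the redundant parts of a CNN while keeping its predictive power. Its input and output are a Caffe prototxt and the model parameters.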

Quantization Flow

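Ristretto can condense any 32-bit floating point model to fixed point, minifloat, or integer power of two parameters. The paper then walks through Ristretto's quantization flow (the flow diagram is not reproduced here).

After that, the paper describes Ristretto's forward pass, backward pass, and fine-tuning mechanism; using it is much like using Caffe. For future work, the author proposes three directions: Network Pruning, Binary Networks, and C-Code Generation.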
