Back to Simplicity: How to Train Accurate BNNs from Scratch?

大星小辰

Category: Model Quantization

Article link: https://blog.csdn.net/qq_28306361/article/details/102382921

Introduction

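The main contributions are:

1: The paper gives a concrete recipe for training a high-accuracy binary network and shows that several earlier techniques work less well than assumed.

2: It proposes general guidelines for designing BNNs and, building on them, introduces BinaryDenseNet.

3: It provides open-source code.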

Related Work

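Recent work falls into three main categories: compact network design, networks with quantized weights, and networks with quantized weights and activations.

Compact Network Design: replaces $3\times 3$ filters with $1\times 1$ filters and uses depth-wise separable convolution and channel shuffling. However, these approaches still depend on GPUs and are not accelerated on CPUs.

Quantized Weights and Real-valued Activations: BinaryConnect (BC), Binary Weight Network (BWN), and Trained Ternary Quantization (TTQ). Memory usage drops and the accuracy loss is small, but the speedup is limited.

Quantized Weights and Activations: DoReFa-Net, High-Order Residual Quantization (HORQ), and SYQ achieve good results with 1-bit weights and multi-bit activations.

Binary Weights and Activations: Binarized Neural Network (BNN), XNOR-Net, and ABC-Nets.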

Study on Common Techniques

Implementation of Binary Layers

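Values are binarized with the sign function, and gradients are propagated backward with the straight-through estimator (STE):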
$\operatorname{sign}(x)=\left\{\begin{array}{l}{+1 \text { if } x \geq 0} \\ {-1 \text { otherwise }}\end{array}\right.\tag{1}$

$\begin{array}{c}{\text { Forward: } r_{o}=\operatorname{sign}\left(r_{i}\right)} \\ {\text { Backward: } \frac{\partial c}{\partial r_{i}}=\frac{\partial c}{\partial r_{o}} 1_{\left|r_{i}\right| \leq t_{\text {clip }}}}\end{array}\tag{2}$
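
Equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration on lists of floats, not the authors' implementation; the function names and the value `t_clip = 1.0` are assumptions made for the example.

```python
def sign(x):
    """Eq. (1): +1 for x >= 0, -1 otherwise."""
    return 1.0 if x >= 0 else -1.0


def binarize_forward(inputs):
    """Forward pass of Eq. (2): elementwise sign."""
    return [sign(r) for r in inputs]


def ste_backward(inputs, grad_outputs, t_clip=1.0):
    """Backward pass of Eq. (2): copy the incoming gradient where
    |r_i| <= t_clip and zero it elsewhere.
    t_clip = 1.0 is an assumed value for this sketch."""
    return [g if abs(r) <= t_clip else 0.0
            for r, g in zip(inputs, grad_outputs)]


r_i = [-1.5, -0.3, 0.0, 0.7, 2.0]      # real-valued inputs
grad_o = [0.1, 0.2, 0.3, 0.4, 0.5]     # gradient w.r.t. the binarized outputs

r_o = binarize_forward(r_i)            # [-1.0, -1.0, 1.0, 1.0, 1.0]
grad_i = ste_backward(r_i, grad_o)     # [0.0, 0.2, 0.3, 0.4, 0.0]
print(r_o, grad_i)
```

Inputs whose magnitude exceeds `t_clip` receive no gradient, which is the role of the indicator term in Eq. (2).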

Scaling Methods

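Based on their experiments, the authors conclude that the BatchNorm (BN) layer already provides the scaling effect, so scaling followed by BN performs the same as BN alone. They therefore do not use a scaling factor.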
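
One way to see why a scaling factor can become redundant is that normalization cancels any positive constant multiplied into its input. The sketch below assumes a single channel, a positive scalar `alpha` (a hypothetical value), and a simplified batch norm without the learnable affine parameters. It illustrates the arithmetic only and does not reproduce the authors' experiment.

```python
import math


def batch_norm(values, eps=1e-5):
    """Simplified batch norm over one channel: subtract the batch mean
    and divide by the batch standard deviation (no learnable gamma/beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


activations = [-3.0, -1.0, 1.0, 5.0]   # hypothetical pre-BN outputs
alpha = 0.37                           # hypothetical positive scaling factor

plain = batch_norm(activations)
scaled = batch_norm([alpha * v for v in activations])

# The two results agree up to the small effect of eps.
max_diff = max(abs(a - b) for a, b in zip(plain, scaled))
print(max_diff)
```

The remaining difference comes only from `eps`, which is not scaled along with the variance.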

Full-Precision Pre-Training

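The authors compare three training setups: training fully from scratch, fine-tuning a full-precision ResNetE18 pre-trained with ReLU as the activation function, and fine-tuning one pre-trained with clip as the activation function. Clip performs worst, and training from scratch is slightly better than fine-tuning from the ReLU model. The authors attribute this to the fact that BNNs do not use ReLU, so the pre-trained model is a poor fit.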

Backward Pass of the Sign Function
$\frac{\partial c}{\partial r_{i}}=\frac{\partial c}{\partial r_{o}} 1_{\left|r_{i}\right| \leq t_{\text {clip }}} \cdot\left\{\begin{array}{l}{2-2 r_{i} \text { if } r_{i} \geq 0} \\ {2+2 r_{i} \text { otherwise. }}\end{array}\right.\tag{3}$
This modified gradient (ApproxSign) seems to help mainly when fine-tuning; in other settings it makes little difference.
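
Equation (3) differs from the plain STE only in the extra piecewise-linear factor. A minimal sketch in plain Python follows, with `t_clip = 1.0` and the function name assumed for illustration.

```python
def approxsign_backward(inputs, grad_outputs, t_clip=1.0):
    """Backward pass of Eq. (3): the STE mask |r_i| <= t_clip multiplied by
    (2 - 2*r_i) for r_i >= 0 and (2 + 2*r_i) otherwise.
    t_clip = 1.0 is an assumed value for this sketch."""
    grads = []
    for r, g in zip(inputs, grad_outputs):
        if abs(r) > t_clip:
            grads.append(0.0)                  # outside the clipping range
        elif r >= 0:
            grads.append(g * (2.0 - 2.0 * r))
        else:
            grads.append(g * (2.0 + 2.0 * r))
    return grads


r_i = [-1.5, -0.5, 0.0, 0.25, 1.0]
grad_o = [1.0, 1.0, 1.0, 1.0, 1.0]

grad_i = approxsign_backward(r_i, grad_o)   # [0.0, 1.0, 2.0, 1.5, 0.0]
print(grad_i)
```

The factor is largest at zero and falls linearly to zero at magnitude 1, so inputs near the sign threshold receive the strongest gradient.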

Proposed Approach

Golden Rules for Training Accurate BNNs

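The core idea is maintaining rich information flow of the network.

Not every real-valued network is suitable for binarization. Compact networks, for example, are not, because the two design philosophies are mutually exclusive: one is eliminating redundancy, the other is compensating information loss.

Avoid the bottleneck design (bottleneck: a $1\times 1$ convolution used to reduce dimensionality).

To preserve the information flow, seriously consider using full-precision downsampling layers.

Shortcut connections are especially important for BNNs.

To overcome bottlenecks in the information flow, appropriately increase the width and depth of the network.

The earlier techniques (scaling factor, ApproxSign, full-precision pre-training) bring little benefit; the network can simply be trained from scratch.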

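Consider the drawback of BNNs: in theory, their information density is 32 times lower than that of a full-precision network, so it has to be compensated by other means:

1: Use shortcut connections.

2: Reduce bottlenecks.

3: Keep certain key layers in full precision.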

ResNetE

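Two changes are made to ResNet:
1: The bottleneck layers are removed: the three filters (kernel sizes 1, 3, 1) are replaced by two $3\times3$ filters. (This increases the model size and the number of parameters.)
2: A full-precision downsampling convolution layer is used.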
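
The parameter increase from the first change can be estimated with simple counting. The sketch below assumes bias-free convolutions, an outer width of 256 channels, and a bottleneck that reduces the width by a factor of 4. All of these values are hypothetical and chosen only to make the comparison concrete.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size


def bottleneck_params(channels, reduction=4):
    """1x1 (reduce) -> 3x3 -> 1x1 (expand); reduction=4 is an assumption."""
    mid = channels // reduction
    return (conv_params(channels, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, channels, 1))


def two_conv3x3_params(channels):
    """Two 3x3 convolutions that keep the channel count unchanged."""
    return 2 * conv_params(channels, channels, 3)


channels = 256                               # hypothetical block width
bottleneck = bottleneck_params(channels)     # 69632
basic = two_conv3x3_params(channels)         # 1179648
print(bottleneck, basic, basic / bottleneck)
```

At equal outer width, the two $3\times3$ convolutions hold roughly 17 times as many weights as the bottleneck in this example. The actual increase depends on the channel widths used in the network.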

BinaryDenseNet

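Since ResNet works well, the authors also try DenseNet, which has even more shortcut connections than ResNet. However, removing the bottleneck layers turns out not to work well for DenseNet. The authors attribute this to the limited representation capacity of binary layers. There are two ways to address it: increase the growth rate parameter k, which is the number of newly concatenated features from each layer, or use more blocks.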
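
The effect of the growth rate k and of the number of concatenating layers on the feature count can be sketched as follows. The function name and all numeric values are hypothetical, and the example shows only the channel arithmetic, not which option the authors found to work better.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Channel count after each layer of a dense block: every layer
    concatenates growth_rate new feature maps onto its input."""
    return [in_channels + growth_rate * i for i in range(num_layers + 1)]


base = dense_block_channels(64, 32, 6)
# [64, 96, 128, 160, 192, 224, 256]

wider = dense_block_channels(64, 64, 6)     # option 1: larger growth rate k
deeper = dense_block_channels(64, 32, 12)   # option 2: more layers, same k

print(base[-1], wider[-1], deeper[-1])      # 256 448 448
```

Both options add more features to the concatenated output; they differ in how many features each individual layer contributes.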

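Another difference between BinaryDenseNet and ResNetE lies in the downsampling layer. Again there are two options. The first is to use a full-precision downsampling layer; to reduce computation, $1\times 1\text{-Conv}\rightarrow \text{AvgPool}$ is replaced by $\text{MaxPool}\rightarrow \text{ReLU}\rightarrow 1\times 1\text{-Conv}$. The second is to use a binary downsampling conv-layer with a lower reduction rate, or even no reduction at all, in place of the full-precision layer.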
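
The saving from moving the pooling in front of the convolution can be estimated by counting multiply-accumulate operations (MACs) of the $1\times 1$ convolution. The sketch assumes $2\times 2$ pooling with stride 2 and ignores the cost of the pooling and ReLU themselves; the feature-map size and channel counts are hypothetical.

```python
def conv1x1_macs(height, width, in_channels, out_channels):
    """Multiply-accumulates of a 1x1 convolution over a feature map."""
    return height * width * in_channels * out_channels


def conv_then_pool_macs(h, w, c_in, c_out):
    """1x1-Conv -> AvgPool: the convolution runs at full resolution."""
    return conv1x1_macs(h, w, c_in, c_out)


def pool_then_conv_macs(h, w, c_in, c_out, stride=2):
    """MaxPool -> ReLU -> 1x1-Conv: the convolution runs on the pooled map.
    stride=2 is an assumption for this sketch."""
    return conv1x1_macs(h // stride, w // stride, c_in, c_out)


h, w, c_in, c_out = 56, 56, 256, 128          # hypothetical shapes
before = conv_then_pool_macs(h, w, c_in, c_out)   # 102760448
after = pool_then_conv_macs(h, w, c_in, c_out)    # 25690112
print(before, after, before // after)             # ratio is 4
```

Under these assumptions, pooling first cuts the cost of the full-precision convolution by a factor of 4, because the convolution sees a feature map with a quarter of the spatial positions.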