Translation of "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks"

Original paper: https://arxiv.org/abs/1603.05279

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi†∗

Allen Institute for AI, University of Washington. {mohammadr,vicenteor}@allenai.org, {pjreddie,ali}@cs.washington.edu

Abstract. We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.

1 Introduction

Deep neural networks (DNN) have shown significant improvements in several application domains including computer vision and speech recognition. In computer vision, a particular type of DNN, known as Convolutional Neural Networks (CNN), have demonstrated state-of-the-art results in object recognition [1,2,3,4] and detection [5,6,7].

Convolutional neural networks show reliable results on object recognition and detection that are useful in real world applications. Concurrent to the recent progress in recognition, interesting advancements have been happening in virtual reality (VR by Oculus) [8], augmented reality (AR by HoloLens) [9], and smart wearable devices. Putting these two pieces together, we argue that it is the right time to equip smart portable devices with the power of state-of-the-art recognition systems. However, CNN-based recognition systems need large amounts of memory and computational power. While they perform well on expensive, GPU-based machines, they are often unsuitable for smaller devices like cell phones and embedded electronics.

For example, AlexNet [1] has 61M parameters (249MB of memory) and performs 1.5B high precision operations to classify one image. These numbers are even higher for deeper CNNs, e.g., VGG [2] (see section 4.1). These models quickly overtax the limited storage, battery power, and compute capabilities of smaller devices like cell phones.

In this paper, we introduce simple, efficient, and accurate approximations to CNNs by binarizing the weights and even the intermediate representations in convolutional neural networks. Our binarization method aims at finding the best approximations of the convolutions using binary operations. We demonstrate that our way of binarizing neural networks results in ImageNet classification accuracy numbers that are comparable to standard full precision networks while requiring significantly less memory and fewer floating point operations.

We study two approximations: Neural networks with binary weights and XNOR-Networks. In Binary-Weight-Networks all the weight values are approximated with binary values. A convolutional neural network with binary weights is significantly smaller (~32×) than an equivalent network with single-precision weight values. In addition, when weight values are binary, convolutions can be estimated by only addition and subtraction (without multiplication), resulting in ~2× speed up. Binary-weight approximations of large CNNs can fit into the memory of even small, portable devices while maintaining the same level of accuracy (See Section 4.1 and 4.2).
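
As a concrete illustration of why binary weights remove the multiplications, consider a single dot product between one receptive field and one filter. The short NumPy sketch below is illustrative only (the array sizes and names are not from the paper): with weights restricted to {+1, -1}, the dot product reduces to adding the inputs aligned with +1 and subtracting those aligned with -1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(27)      # one flattened receptive field (e.g., 3x3x3)
w = rng.standard_normal(27)      # real-valued filter weights
b = np.sign(w)                   # binary weights in {+1, -1}

full = x @ w                                 # needs 27 multiplications
binary = x[b > 0].sum() - x[b < 0].sum()     # only additions and subtractions

# The add/subtract form equals the dot product with the binary weights.
assert np.isclose(binary, x @ b)
```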

To take this idea further, we introduce XNOR-Networks where both the weights and the inputs to the convolutional and fully connected layers are approximated with binary values. Binary weights and binary inputs allow an efficient way of implementing convolutional operations. If all of the operands of the convolutions are binary, then the convolutions can be estimated by XNOR and bitcounting operations [11]. XNOR-Nets result in accurate approximation of CNNs while offering ~58× speed up in CPUs (in terms of number of the high precision operations). This means that XNOR-Nets can enable real-time inference in devices with small memory and no GPUs (inference in XNOR-Nets can be done very efficiently on CPUs).

To the best of our knowledge this paper is the first attempt to present an evaluation of binary neural networks on large-scale datasets like ImageNet. Our experimental results show that our proposed method for binarizing convolutional neural networks outperforms the state-of-the-art network binarization method of [11] by a large margin (16.3%) on top-1 image classification in the ImageNet challenge ILSVRC2012. Our contribution is two-fold: First, we introduce a new way of binarizing the weight values in convolutional neural networks and show the advantage of our solution compared to state-of-the-art solutions. Second, we introduce XNOR-Nets, a deep neural network model with binary weights and binary inputs and show that XNOR-Nets can obtain similar classification accuracies compared to standard networks while being significantly more efficient. Our code is available at: http://allenai.org/plato/xnornet

2 Related Work

Deep neural networks often suffer from over-parametrization and large amounts of redundancy in their models. This typically results in inefficient computation and memory usage [12]. Several methods have been proposed to address efficient training and inference in deep neural networks.

Shallow networks: Estimating a deep neural network with a shallower model reduces the size of a network. Early theoretical work by Cybenko shows that a network with a large enough single hidden layer of sigmoid units can approximate any decision boundary [13]. In several areas (e.g., vision and speech), however, shallow networks cannot compete with deep models [14]. [15] trains a shallow network on SIFT features to classify the ImageNet dataset. They show it is difficult to train shallow networks with a large number of parameters. In order to get similar accuracy, the number of parameters in the shallow network must be close to the number of parameters in the deep network. They do this by first training a state-of-the-art deep model, and then training a shallow model to mimic the deep model. [16] provides empirical evidence on small datasets (e.g., CIFAR-10) that shallow nets are capable of learning the same functions as deep nets. These methods are different from our approach because we use the standard deep architectures, not the shallow estimations.

Compressing pre-trained deep networks: Pruning redundant, non-informative weights in a previously trained network reduces the size of the network at inference time. Weight decay [17] was an early method for pruning a network. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] use the Hessian of the loss function to prune a network by reducing the number of connections. Recently [20] reduced the number of parameters by an order of magnitude in several state-of-the-art neural networks by pruning. [21] proposed to reduce the number of activations for compression and acceleration. Deep compression [22] reduces the storage and energy required to run inference on large networks so they can be deployed on mobile devices. They remove the redundant connections and quantize weights so that multiple connections share the same weight, and then they use Huffman coding to compress the weights. HashedNets [23] uses a hash function to reduce model size by randomly grouping the weights, such that connections in a hash bucket use a single parameter value. Matrix factorization has been used by [24,25]. We are different from these approaches because we do not use a pretrained network. We train binary networks from scratch.

Designing compact layers: Designing compact blocks at each layer of a deep network can help to save memory and computational costs. Replacing the fully connected layer with global average pooling was examined in the Network in Network architecture [26], GoogLenet [3] and Residual-Net [4], which achieved state-of-the-art results on several benchmarks. The bottleneck structure in Residual-Net [4] has been proposed to reduce the number of parameters and improve speed. Decomposing 3 × 3 convolutions into two 1 × 1 convolutions is used in [27] and resulted in state-of-the-art performance on object recognition. Replacing 3 × 3 convolutions with 1 × 1 convolutions is used in [28] to create a very compact neural network that can achieve ~50× reduction in the number of parameters while obtaining high accuracy. Our method is different from this line of work because we use the full network (not the compact version) but with binary parameters.

Quantizing parameters: High precision parameters are not very important in achieving high performance in deep networks. [29] proposed to quantize the weights of fully connected layers in a deep network by vector quantization techniques. They showed just thresholding the weight values at zero only decreases the top-1 accuracy on ILSVRC2012 by less than 10%. [30] proposed a provably polynomial time algorithm for training a sparse network with +1/0/-1 weights. A fixed-point implementation of 8-bit integers was compared with 32-bit floating point activations in [31]. Another fixed-point network with ternary weights and 3-bit activations was presented by [32]. Quantizing a network with L2 error minimization achieved better accuracy on the MNIST and CIFAR-10 datasets in [33]. [34] proposed a back-propagation process by quantizing the representations at each layer of the network. To convert some of the remaining multiplications into binary shifts, the neurons get restricted values of power-of-two integers. In [34] they carry the full precision weights during the test phase, and only quantize the neurons during the back-propagation process, and not during the forward-propagation. Our work is similar to these methods since we are quantizing the parameters in the network. But our quantization is the extreme scenario of +1 and -1.

Network binarization: These works are the most related to our approach. Several methods attempt to binarize the weights and the activations in neural networks. The performance of highly quantized networks (e.g., binarized) was believed to be very poor due to the destructive property of binary quantization [35]. Expectation BackPropagation (EBP) in [36] showed high performance can be achieved by a network with binary weights and binary activations. This is done by a variational Bayesian approach that infers networks with binary weights and neurons. A fully binary network at run time is presented in [37] using a similar approach to EBP, showing significant improvement in energy efficiency. In EBP the binarized parameters were only used during inference. BinaryConnect [38] extended the probabilistic idea behind EBP. Similar to our approach, BinaryConnect uses the real-valued version of the weights as a key reference for the binarization process. The real-valued weights are updated using the back-propagated error by simply ignoring the binarization in the update. BinaryConnect achieved state-of-the-art results on small datasets (e.g., CIFAR-10, SVHN). Our experiments show that this method is not very successful on large-scale datasets (e.g., ImageNet). BinaryNet [11] proposes an extension of BinaryConnect, where both weights and activations are binarized. Our method is different from them in the binarization method and the network structure. We also compare our method with BinaryNet on ImageNet, and our method outperforms BinaryNet by a large margin. [39] argued that the noise introduced by weight binarization provides a form of regularization, which could help to improve test accuracy. This method binarizes weights while maintaining full precision activations. [40] proposed fully binary training and testing in an array of committee machines with randomized input. [41] retrains a previously trained neural network with binary weights and binary inputs.

3 Binary Convolutional Neural Network

We represent an L-layer CNN architecture with a triplet ⟨I, W, ∗⟩. I is a set of tensors, where each element I = I_l (l = 1, ..., L) is the input tensor for the l-th layer of the CNN (green cubes in figure 1). W is a set of tensors, where each element of this set W = W_lk (k = 1, ..., K^l) is the k-th weight filter in the l-th layer of the CNN. K^l is the number of weight filters in the l-th layer of the CNN. ∗ represents a convolutional operation with I and W as its operands. I ∈ R^(c×w_in×h_in), where (c, w_in, h_in) represents channels, width and height respectively. W ∈ R^(c×w×h), where w ≤ w_in and h ≤ h_in. We propose two variations of binary CNN: Binary-weights, where the elements of W are binary tensors, and XNOR-Networks, where the elements of both I and W are binary tensors.

3.1 Binary-Weight-Networks

In order to constrain a convolutional neural network ⟨I, W, ∗⟩ to have binary weights, we estimate the real-value weight filter W ∈ W using a binary filter B ∈ {+1, -1}^(c×w×h) and a scaling factor α ∈ R^+ such that W ≈ αB. A convolutional operation can be approximated by:

I ∗ W ≈ (I ⊕ B) α    (1)

where ⊕ indicates a convolution without any multiplication. Since the weight values are binary, we can implement the convolution with additions and subtractions. The binary weight filters reduce memory usage by a factor of ~32× compared to single-precision filters. We represent a CNN with binary weights by ⟨I, B, A, ⊕⟩, where B is a set of binary tensors and A is a set of positive real scalars, such that B = B_lk is a binary filter and α = A_lk is a scaling factor and W_lk ≈ A_lk B_lk.

Estimating binary weights: 
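
This excerpt omits the paper's derivation; its result is that the optimal binary filter is B = sign(W) and the optimal scaling factor is the mean of the absolute weight values, α = ‖W‖₁ / n. A minimal NumPy sketch of that estimate (the filter shape and variable names are illustrative):

```python
import numpy as np

def binarize_filter(W):
    """Approximate a real-valued filter W by alpha * B with B in {+1, -1}.

    Uses the closed-form estimate reported in the paper:
    B = sign(W), alpha = mean(|W|).
    """
    B = np.sign(W)
    alpha = np.abs(W).mean()
    return alpha, B

W = np.random.default_rng(1).standard_normal((3, 3, 3))   # a c x w x h filter
alpha, B = binarize_filter(W)
print(np.linalg.norm(W - alpha * B))   # approximation error of alpha * B
```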

Training Binary-Weights-Networks: 

Algorithm 1 demonstrates our procedure for training a CNN with binary weights. First, we binarize the weight filters at each layer by computing B and A. Then we call forward propagation using binary weights and their corresponding scaling factors, where all the convolutional operations are carried out by equation 1. Then, we call backward propagation, where the gradients are computed with respect to the estimated weight filters W. Lastly, the parameters and the learning rate get updated by an update rule, e.g., SGD update with momentum or ADAM [42].
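
A minimal PyTorch-style sketch of this training loop follows. It is an illustration of the textual description above, not a reproduction of the paper's Algorithm 1: the toy model, the random data, and the per-filter scaling are placeholders, and only convolutional weights are binarized.

```python
import torch
import torch.nn as nn

# Toy model standing in for a real CNN; only the conv weights are binarized here.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 14 * 14, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
conv_weights = [m.weight for m in model.modules() if isinstance(m, nn.Conv2d)]

for step in range(10):                       # toy loop over random data
    x = torch.randn(4, 3, 16, 16)
    y = torch.randint(0, 10, (4,))

    # 1) Binarize: replace each real-valued filter W by alpha * sign(W).
    real_copies = [w.detach().clone() for w in conv_weights]
    with torch.no_grad():
        for w in conv_weights:
            alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)   # one scale per filter
            w.copy_(alpha * w.sign())

    # 2) Forward and backward passes run with the binarized weights.
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()

    # 3) Restore the real-valued weights, then apply the update to them.
    with torch.no_grad():
        for w, real in zip(conv_weights, real_copies):
            w.copy_(real)
    optimizer.step()
```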

Once the training is finished, there is no need to keep the real-value weights, because at inference we only perform forward propagation with the binarized weights.

3.2 XNOR-Networks

So far, we managed to find binary weights and a scaling factor to estimate the real-value weights. The inputs to the convolutional layers are still real-value tensors. Now, we explain how to binarize both weights and inputs, so convolutions can be implemented efficiently using XNOR and bitcounting operations. This is the key element of our XNOR-Networks. In order to constrain a convolutional neural network ⟨I, W, ∗⟩ to have binary weights and binary inputs, we need to enforce binary operands at each step of the convolutional operation. A convolution consists of repeating a shift operation and a dot product. The shift operation moves the weight filter over the input, and the dot product performs element-wise multiplications between the values of the weight filter and the corresponding part of the input. If we express the dot product in terms of binary operations, the convolution can be approximated using binary operations. The dot product between two binary vectors can be implemented by XNOR-bitcounting operations [11]. In this section, we explain how to approximate the dot product between two vectors in R^n by a dot product between two vectors in {+1, -1}^n. Next, we demonstrate how to use this approximation for estimating a convolutional operation between two tensors.
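
To make the XNOR/bitcount idea concrete, the sketch below packs two {+1, -1} vectors into Python integers (bit 1 for +1, bit 0 for -1) and recovers their dot product from a single XNOR and a popcount. This is only an illustration; a real implementation would pack bits into machine words and use a hardware popcount.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64
x = rng.choice([-1, 1], size=n)      # binarized input vector
w = rng.choice([-1, 1], size=n)      # binarized weight vector

# Encode +1 as bit 1 and -1 as bit 0, packing each vector into one integer.
x_bits = int("".join("1" if v > 0 else "0" for v in x), 2)
w_bits = int("".join("1" if v > 0 else "0" for v in w), 2)

# XNOR marks the positions where the signs agree; bitcount counts the agreements.
xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)
agreements = bin(xnor).count("1")

# dot(x, w) = (#agreements) - (#disagreements) = 2 * agreements - n
assert int(x @ w) == 2 * agreements - n
```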

Binary Dot Product:

Binary Convolution: 

Training XNOR-Networks:

Binary Gradient: 

4 Experiments

We evaluate our method by analyzing its efficiency and accuracy. We measure the efficiency by computing the computational speedup (in terms of number of high precision operations) achieved by our binary convolution vs. standard convolution. To measure accuracy, we perform image classification on the large-scale ImageNet dataset. This paper is the first work that evaluates binary neural networks on the ImageNet dataset. Our binarization technique is general; we can use any CNN architecture. We evaluate AlexNet [1] and two deeper architectures in our experiments. We compare our method with two recent works on binarizing neural networks: BinaryConnect [38] and BinaryNet [11]. The classification accuracy of our binary-weight-network version of AlexNet is as accurate as the full precision version of AlexNet. This classification accuracy outperforms competitors on binary neural networks by a large margin. We also present an ablation study, where we evaluate the key elements of our proposed method: computing scaling factors and our block structure for binary CNNs. We show that our method of computing the scaling factors is important to reach high accuracy.

4.1 Efficiency Analysis

In a standard convolution, the total number of operations is c·N_W·N_I, where c is the number of channels, N_W = w·h and N_I = w_in·h_in. Note that some modern CPUs can fuse the multiplication and addition as a single cycle operation. On those CPUs, Binary-Weight-Networks do not deliver a speed up. Our binary approximation of convolution (equation 11) has c·N_W·N_I binary operations and N_I non-binary operations. With the current generation of CPUs, we can perform 64 binary operations in one clock cycle of the CPU, therefore the speedup can be computed as

S = (c·N_W·N_I) / ((1/64)·c·N_W·N_I + N_I) = (64·c·N_W) / (c·N_W + 64)

The speedup depends on the channel size and filter size but not the input size. In figure 4-(b-c) we illustrate the speedup achieved by changing the number of channels and filter size. While changing one parameter, we fix the other parameters as follows: c = 256, n_I = 14² and n_W = 3² (the majority of convolutions in the ResNet [4] architecture have this structure). Using our approximation of convolution we gain 62.27× theoretical speedup, but in our CPU implementation with all of the overheads, we achieve ~58× speedup in one convolution (excluding the process for memory allocation and memory access). With a small channel size (c = 3) and filter size (N_W = 1 × 1) the speedup is not considerably high. This motivates us to avoid binarization at the first and last layer of a CNN. In the first layer the channel size is 3 and in the last layer the filter size is 1 × 1. A similar strategy was used in [11]. Figure 4-a shows the required memory for three different CNN architectures (AlexNet, VGG-19, ResNet-18) with binary and double precision weights. Binary-weight networks are so small that they can be easily fitted into portable devices. BinaryNet [11] is in the same order of memory and computation efficiency as our method. In Figure 4, we show an analysis of computation and memory cost for a binary convolution. The same analysis is valid for BinaryNet and BinaryConnect. The key difference of our method is using a scaling factor, which does not change the order of efficiency while providing a significant improvement in accuracy.
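
As a quick check of the numbers above, the small sketch below evaluates the speedup expression under the stated assumption of 64 binary operations per clock cycle:

```python
def speedup(c, n_w, n_i, bits_per_cycle=64):
    """Theoretical speedup of binary vs. standard convolution:
    S = (c * n_w * n_i) / (c * n_w * n_i / bits_per_cycle + n_i)."""
    standard_ops = c * n_w * n_i
    binary_cycles = standard_ops / bits_per_cycle + n_i
    return standard_ops / binary_cycles

print(round(speedup(c=256, n_w=3 * 3, n_i=14 * 14), 2))   # -> 62.27
```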

4.2 Image Classification

We evaluate the performance of our proposed approach on the task of natural image classification. So far, in the literature, binary neural network methods have presented their evaluations on either limited domain or simplified datasets, e.g., CIFAR-10, MNIST, SVHN. To compare with state-of-the-art vision, we evaluate our method on ImageNet (ILSVRC2012). ImageNet has ~1.2M train images from 1K categories and 50K validation images. The images in this dataset are natural images with reasonably high resolution compared to the CIFAR and MNIST datasets, which have relatively small images. We report our classification performance using Top-1 and Top-5 accuracies. We adopt three different CNN architectures as our base architectures for binarization: AlexNet [1], Residual Networks (known as ResNet) [4], and a variant of GoogLenet [3]. We compare our Binary-weight-network (BWN) with BinaryConnect (BC) [38] and our XNOR-Networks (XNOR-Net) with BinaryNeuralNet (BNN) [11]. BinaryConnect (BC) is a method for training a deep neural network with binary weights during forward and backward propagations. Similar to our approach, they keep the real-value weights during the parameter update step. Our binarization is different from BC. The binarization in BC can be either deterministic or stochastic. We use the deterministic binarization for BC in our comparisons because the stochastic binarization is not efficient. The same evaluation settings have been used and discussed in [11]. BinaryNeuralNet (BNN) [11] is a neural network with binary weights and activations during inference and gradient computation in training. In concept, this is a similar approach to our XNOR-Network, but the binarization method and the network structure in BNN are different from ours. Their training algorithm is similar to BC and they used deterministic binarization in their evaluations.

CIFAR-10: BC and BNN showed near state-of-the-art performance on the CIFAR-10, MNIST, and SVHN datasets. BWN and XNOR-Net on CIFAR-10 using the same network architecture as BC and BNN achieve error rates of 9.88% and 10.17% respectively. In this paper we explore the possibility of obtaining near state-of-the-art results on a much larger and more challenging dataset (ImageNet).

AlexNet: [1] is a CNN architecture with 5 convolutional layers and two fully-connected layers. This architecture was the first CNN architecture that was shown to be successful on the ImageNet classification task. This network has 61M parameters. We use AlexNet coupled with batch normalization layers [43].

Train: In each iteration of training, images are resized to have 256 pixels at their smaller dimension and then a random crop of 224 × 224 is selected for training. We run the training algorithm for 16 epochs with batch size equal to 512. We use negative log-likelihood over the soft-max of the outputs as our classification loss function. In our implementation of AlexNet we do not use the Local-Response-Normalization (LRN) layer. We use SGD with momentum = 0.9 for updating parameters in BWN and BC. For XNOR-Net and BNN we used ADAM [42]. ADAM converges faster and usually achieves better accuracy for binary inputs [11]. The learning rate starts at 0.1 and we apply a learning-rate-decay = 0.01 every 4 epochs.
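
A hedged sketch of how these optimizer settings might look in PyTorch (the parameters are placeholders, and the decay factor 0.01 is read here as a multiplicative gamma, which is one plausible interpretation of the text; the paper's actual training code may differ):

```python
import torch

params = [torch.nn.Parameter(torch.randn(8, 3, 3, 3))]   # placeholder parameters

# BWN / BC: SGD with momentum 0.9 and an initial learning rate of 0.1.
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)
# (XNOR-Net / BNN would use torch.optim.Adam instead, as noted above.)

# Decay the learning rate every 4 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.01)

for epoch in range(16):
    # ... one training epoch over the 224 x 224 crops would run here ...
    scheduler.step()
```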

Test: At inference time, we use the 224 × 224 center crop for forward propagation.

Figure 5 demonstrates the classification accuracy for training and inference along the training epochs for top-1 and top-5 scores. The dashed lines represent training accuracy and the solid lines show the validation accuracy. In all of the epochs our method outperforms BC and BNN by a large margin (~17%). Table 1 compares our final accuracy with BC and BNN. We found that the scaling factors for the weights (α) are much more effective than the scaling factors for the inputs (β). Removing β reduces the accuracy by a small margin (less than 1% top-1 on AlexNet).

Binary Gradient: Using XNOR-Net with binary gradient, the accuracy of top-1 will drop only by 1.4%.

Residual Net: We use the ResNet-18 proposed in [4] with short-cut type B.

Train: In each training iteration, images are resized randomly between 256 and 480 pixels on the smaller dimension and then a random crop of 224 × 224 is selected for training. We run the training algorithm for 58 epochs with batch size equal to 256 images. The learning rate starts at 0.1 and we use a learning-rate decay equal to 0.01 at epochs 30 and 40.

Test: At inference time, we use the 224 × 224 center crop for forward propagation.

Figure 6 demonstrates the classification accuracy (Top-1 and Top-5) along the epochs for training and inference. The dashed lines represent training and the solid lines represent inference. Table 2 shows our final accuracy by BWN and XNOR-Net.

GoogLenet Variant: We experiment with a variant of GoogLenet [3] that uses a similar number of parameters and connections but only straightforward convolutions, with no branching. It has 21 convolutional layers with filter sizes alternating between 1 × 1 and 3 × 3.

Train: Images are resized randomly between 256 and 320 pixels on the smaller dimension and then a random crop of 224 × 224 is selected for training. We run the training algorithm for 80 epochs with batch size of 128. The learning rate starts at 0.1 and we use polynomial rate decay, β = 4.

Test: At inference time, we use a center crop of 224 × 224.

4.3 Ablation Studies

There are two key differences between our method and the previous network binarization methods: the binarization technique and the block structure in our binary CNN.

For binarization, we find the optimal scaling factors at each iteration of training. For the block structure, we order the layers in a block in a way that decreases the quantization loss for training XNOR-Net. Here, we evaluate the effect of each of these elements on the performance of the binary networks. Instead of computing the scaling factor α using equation 6, one can consider α as a network parameter. In other words, a layer after binary convolution multiplies the output of the convolution by a scalar parameter for each filter. This is similar to computing the affine parameters in batch normalization. Table 3-a compares the performance of a binary network with the two ways of computing the scaling factors. As we mentioned in section 3.2, the typical block structure in a CNN is not suitable for binarization. Table 3-b compares the standard block structure C-B-A-P (Convolution, Batch Normalization, Activation, Pooling) with our structure B-A-C-P (where A is the binary activation).
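
A minimal PyTorch-style sketch contrasting the two block orderings compared in Table 3-b. The binary activation here is a plain sign function used as a structural stand-in (no straight-through gradient), and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class BinActive(nn.Module):
    """Stand-in binary activation: sign of the input (structure only)."""
    def forward(self, x):
        return torch.sign(x)

c_in, c_out = 64, 64

# Typical block: Convolution -> BatchNorm -> Activation -> Pooling (C-B-A-P).
cbap = nn.Sequential(
    nn.Conv2d(c_in, c_out, 3, padding=1),
    nn.BatchNorm2d(c_out),
    BinActive(),
    nn.MaxPool2d(2),
)

# XNOR-Net block: BatchNorm -> BinActivation -> Convolution -> Pooling (B-A-C-P),
# so the convolution sees normalized, binarized inputs before pooling.
bacp = nn.Sequential(
    nn.BatchNorm2d(c_in),
    BinActive(),
    nn.Conv2d(c_in, c_out, 3, padding=1),
    nn.MaxPool2d(2),
)

x = torch.randn(1, c_in, 16, 16)
print(cbap(x).shape, bacp(x).shape)   # both: torch.Size([1, 64, 8, 8])
```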

5 Conclusion

We introduce simple, efficient, and accurate binary approximations for neural networks. We train a neural network that learns to find binary values for weights, which reduces the size of the network by ~32× and provides the possibility of loading very deep neural networks into portable devices with limited memory. We also propose an architecture, XNOR-Net, that uses mostly bitwise operations to approximate convolutions. This provides ~58× speed up and enables the possibility of running inference of state-of-the-art deep neural networks on CPUs (rather than GPUs) in real-time.

Acknowledgements

This work is in part supported by ONR N00014-13-1-0720, NSF IIS-1338054, the Allen Distinguished Investigator Award, and the Allen Institute for Artificial Intelligence.

