MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

We present a class of efficient models called MobileNets for mobile and embedded vision applications.

MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks.

We introduce two simple global hyperparameters that efficiently trade off between latency and accuracy. These hyperparameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes and large scale geo-localization.


1. Introduction
Convolutional neural networks have become ubiquitous in computer vision ever since AlexNet [19] popularized deep convolutional neural networks by winning the ImageNet Challenge: ILSVRC 2012 [24]. The general trend has been to make deeper and more complicated networks in order to achieve higher accuracy [27, 31, 29, 8]. However, these advances to improve accuracy are not necessarily making networks more efficient with respect to size and speed. In many real world applications such as robotics, self-driving cars and augmented reality, the recognition tasks need to be carried out in a timely fashion on a computationally limited platform.

This paper describes an efficient network architecture and a set of two hyper-parameters in order to build very small, low latency models that can be easily matched to the design requirements for mobile and embedded vision applications. 

Section 2 reviews prior work in building small models. Section 3 describes the MobileNet architecture and two hyper-parameters, width multiplier and resolution multiplier, to define smaller and more efficient MobileNets. Section 4 describes experiments on ImageNet as well as a variety of different applications and use cases. Section 5 closes with a summary and conclusion.

2. Prior Work
There has been rising interest in building small and efficient neural networks in the recent literature, e.g. [16, 34, 12, 36, 22]. Many different approaches can be generally categorized into either compressing pretrained networks or training small networks directly. This paper proposes a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application. MobileNets primarily focus on optimizing for latency but also yield small networks. Many papers on small networks focus only on size but do not consider speed.
MobileNets are built primarily from depthwise separable convolutions initially introduced in [26] and subsequently used in Inception models [13] to reduce the computation in the first few layers. Flattened networks [16] build a network out of fully factorized convolutions and showed the potential of extremely factorized networks. Independent of this current paper, Factorized Networks [34] introduces a similar factorized convolution as well as the use of topological connections. Subsequently, the Xception network [3] demonstrated how to scale up depthwise separable filters to outperform Inception V3 networks. Another small network is Squeezenet [12] which uses a bottleneck approach to design a very small network. Other reduced computation networks include structured transform networks [28] and deep fried convnets [37].

A different approach for obtaining small networks is shrinking, factorizing or compressing pretrained networks. Compression based on product quantization [36], hashing [2], and pruning, vector quantization and Huffman coding [5] have been proposed in the literature. Additionally, various factorizations have been proposed to speed up pretrained networks [14, 20]. Another method for training small networks is distillation [9] which uses a larger network to teach a smaller network. It is complementary to our approach and is covered in some of our use cases in section 4. Another emerging approach is low bit networks [4, 22, 11].


3. MobileNet Architecture
In this section we first describe the core layers that MobileNet is built on, which are depthwise separable filters. We then describe the MobileNet network structure and conclude with descriptions of the two model shrinking hyper-parameters: width multiplier and resolution multiplier.
3.1. Depthwise Separable Convolution
The MobileNet model is based on depthwise separable convolutions, a form of factorized convolution which factorizes a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution. For MobileNets the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines inputs into a new set of outputs in one step. The depthwise separable convolution splits this into two layers, a separate layer for filtering and a separate layer for combining. This factorization has the effect of drastically reducing computation and model size. Figure 2 shows how a standard convolution 2(a) is factorized into a depthwise convolution 2(b) and a 1 × 1 pointwise convolution 2(c).
A standard convolutional layer takes as input a D_F × D_F × M feature map F and produces a D_G × D_G × N feature map G, where D_F is the spatial width and height of a square input feature map, M is the number of input channels (input depth), D_G is the spatial width and height of a square output feature map and N is the number of output channels (output depth).
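To make the saving concrete, the two costs can be compared with a short Python sketch (the layer shape here is an arbitrary illustrative choice, and the helpers simply evaluate the standard convolution cost and equation (6) below with α = 1):

```python
def standard_conv_mult_adds(dk, m, n, df):
    # Standard convolution filters and combines in one step:
    # DK * DK * M * N * DF * DF multiply-adds.
    return dk * dk * m * n * df * df

def separable_conv_mult_adds(dk, m, n, df):
    # Depthwise filtering plus 1x1 pointwise combining
    # (equation (6) below with alpha = 1).
    return dk * dk * m * df * df + m * n * df * df

# Illustrative layer: 3x3 kernel, 512 input and output channels, 14x14 map.
dk, m, n, df = 3, 512, 512, 14
std = standard_conv_mult_adds(dk, m, n, df)
sep = separable_conv_mult_adds(dk, m, n, df)
print(std, sep, std / sep)  # roughly an 8-9x reduction, ~1 / (1/n + 1/dk**2)
```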


3.2. Network Structure and Training
The MobileNet structure is built on depthwise separable convolutions as mentioned in the previous section except for the first layer which is a full convolution. By defining the network in such simple terms we are able to easily explore network topologies to find a good network. The MobileNet architecture is defined in Table 1. All layers are followed by a batchnorm [13] and ReLU nonlinearity with the exception of the final fully connected layer which has no nonlinearity and feeds into a softmax layer for classification. 

Figure 3 contrasts a layer with regular convolutions, batchnorm and ReLU nonlinearity to the factorized layer with depthwise convolution, 1 × 1 pointwise convolution as well as batchnorm and ReLU after each convolutional layer.  Down sampling is handled with strided convolution in the depthwise convolutions as well as in the first layer. 

A final average pooling reduces the spatial resolution to 1 before the fully connected layer. 

Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers.
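As a sketch of how compactly this structure can be expressed, the following tf.keras code assumes the standard Table 1 filter counts and strides (illustrative function names; this is not the released implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, stride):
    # The first layer: a full 3x3 convolution, then batchnorm and ReLU.
    x = layers.Conv2D(filters, 3, strides=stride, padding='same', use_bias=False)(x)
    return layers.ReLU()(layers.BatchNormalization()(x))

def dw_separable(x, filters, stride=1):
    # Depthwise 3x3 (strided for down sampling) then 1x1 pointwise,
    # each followed by batchnorm and ReLU (Figure 3, right).
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.ReLU()(layers.BatchNormalization()(x))
    x = layers.Conv2D(filters, 1, use_bias=False)(x)
    return layers.ReLU()(layers.BatchNormalization()(x))

def mobilenet(num_classes=1000):
    inputs = tf.keras.Input((224, 224, 3))
    x = conv_bn_relu(inputs, 32, stride=2)
    for filters, stride in ([(64, 1), (128, 2), (128, 1), (256, 2), (256, 1),
                             (512, 2)] + [(512, 1)] * 5 + [(1024, 2), (1024, 1)]):
        x = dw_separable(x, filters, stride)
    x = layers.GlobalAveragePooling2D()(x)   # average pool to 1x1
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)   # 1 + 13*2 + 1 = 28 layers
```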

It is not enough to simply define networks in terms of a small number of Mult-Adds. 

It is also important to make sure these operations can be efficiently implemented.


For instance, unstructured sparse matrix operations are not typically faster than dense matrix operations until a very high level of sparsity. Our model structure puts nearly all of the computation into dense 1 × 1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map it to a GEMM. For instance, this approach is used in the popular Caffe package [15]. 1 × 1 convolutions do not require this reordering in memory and can be implemented directly with GEMM which is one of the most optimized numerical linear algebra algorithms. MobileNet spends 95% of its computation time in 1 × 1 convolutions which also have 75% of the parameters as can be seen in Table 2. Nearly all of the additional parameters are in the fully connected layer.
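Concretely, the reason a 1 × 1 convolution is a single GEMM can be seen in a few lines of numpy (shapes illustrative): the feature map only needs a reshape, not an im2col reordering.

```python
import numpy as np

# A 1x1 convolution over an H x W x M feature map producing N channels
# is exactly the matrix product (H*W, M) @ (M, N) -> (H*W, N).
h, w, m, n = 14, 14, 512, 512
feature_map = np.random.randn(h, w, m)
weights = np.random.randn(m, n)                  # the 1x1 kernel is an M x N matrix

out = feature_map.reshape(h * w, m) @ weights    # one GEMM, no im2col
out = out.reshape(h, w, n)
```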
MobileNet models were trained in TensorFlow [1] using RMSprop [33] with asynchronous gradient descent similar to Inception V3 [31]. However, contrary to training large models we use less regularization and data augmentation techniques because small models have less trouble with overfitting. When training MobileNets we do not use side heads or label smoothing, and additionally reduce the amount of image distortion by limiting the size of the small crops that are used in large Inception training [31]. Additionally, we found that it was important to put very little or no weight decay (l2 regularization) on the depthwise filters since there are so few parameters in them. For the ImageNet benchmarks in the next section all models were trained with the same training parameters regardless of the size of the model.
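The rule of putting weight decay only on the non-depthwise weights could be expressed in tf.keras as below; the l2 coefficient is a hypothetical placeholder rather than a value reported here.

```python
from tensorflow.keras import layers, regularizers

L2 = regularizers.l2(4e-5)  # hypothetical coefficient, for illustration only

def pointwise_conv(x, filters):
    # Pointwise (and full) convolutions carry most parameters: regularize them.
    return layers.Conv2D(filters, 1, use_bias=False, kernel_regularizer=L2)(x)

def depthwise_conv(x, stride=1):
    # Depthwise filters hold very few parameters: no weight decay on them.
    return layers.DepthwiseConv2D(3, strides=stride, padding='same',
                                  use_bias=False)(x)
```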

3.3. Width Multiplier: Thinner Models
Although the base MobileNet architecture is already small and low latency, many times a specific use case or application may require the model to be smaller and faster. In order to construct these smaller and less computationally expensive models we introduce a very simple parameter α called width multiplier. The role of the width multiplier α is to thin a network uniformly at each layer. For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN.
The computational cost of a depthwise separable convolution with width multiplier α is:
D_K · D_K · αM · D_F · D_F + αM · αN · D_F · D_F    (6)
where α ∈ (0, 1] with typical settings of 1, 0.75, 0.5 and 0.25. α = 1 is the baseline MobileNet and α < 1 are reduced MobileNets. Width multiplier has the effect of reducing computational cost and the number of parameters quadratically by roughly α². Width multiplier can be applied to any model structure to define a new smaller model with a reasonable accuracy, latency and size trade-off. It is used to define a new reduced structure that needs to be trained from scratch.
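The roughly quadratic effect of α can be checked numerically from equation (6); in this sketch the layer shape is illustrative.

```python
def separable_cost(alpha, dk=3, m=512, n=512, df=14):
    # Equation (6): both channel counts scale with alpha, so the dominant
    # pointwise term (alpha*M * alpha*N) shrinks roughly as alpha**2.
    am, an = round(alpha * m), round(alpha * n)
    return dk * dk * am * df * df + am * an * df * df

base = separable_cost(1.0)
for alpha in (1.0, 0.75, 0.5, 0.25):
    print(alpha, separable_cost(alpha) / base)  # ~1.00, ~0.57, ~0.25, ~0.066
```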


3.4. Resolution Multiplier: Reduced Representation
The second hyper-parameter to reduce the computational cost of a neural network is a resolution multiplier ρ. We apply this to the input image and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we implicitly set ρ by setting the input resolution.
We can now express the computational cost for the core layers of our network as depthwise separable convolutions with width multiplier α and resolution multiplier ρ:
D_K · D_K · αM · ρD_F · ρD_F + αM · αN · ρD_F · ρD_F    (7)

where ρ ∈ (0, 1], which is typically set implicitly so that the input resolution of the network is 224, 192, 160 or 128. ρ = 1 is the baseline MobileNet and ρ < 1 are reduced computation MobileNets. Resolution multiplier has the effect of reducing computational cost by ρ².
As an example we can look at a typical layer in MobileNet and see how depthwise separable convolutions, width multiplier and resolution multiplier reduce the cost and parameters. Table 3 shows the computation and number of parameters for a layer as the architecture shrinking methods are sequentially applied to it. The first row shows the Mult-Adds and parameters for a full convolutional layer with an input feature map of size 14 × 14 × 512 and a kernel K of size 3 × 3 × 512 × 512. We will look in detail at the trade offs between resources and accuracy in the next section.
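The sequential shrinking can be reproduced from equation (7) for this layer; the printed Mult-Adds should roughly match the values reported in Table 3 (a sketch under the counting convention above).

```python
def full_conv_mult_adds(dk=3, m=512, n=512, df=14):
    # Standard convolution on the 14 x 14 x 512 layer, 3x3x512x512 kernel.
    return dk * dk * m * n * df * df

def separable_mult_adds(alpha=1.0, rho=1.0, dk=3, m=512, n=512, df=14):
    # Equation (7): alpha thins the channels, rho shrinks the feature map.
    am, an, rdf = round(alpha * m), round(alpha * n), round(rho * df)
    return dk * dk * am * rdf * rdf + am * an * rdf * rdf

print(full_conv_mult_adds())                         # convolution: ~462M
print(separable_mult_adds())                         # depthwise separable: ~52.3M
print(separable_mult_adds(alpha=0.75))               # + width multiplier: ~29.6M
print(separable_mult_adds(alpha=0.75, rho=160/224))  # + resolution multiplier: ~15.1M
```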


4. Experiments
In this section we first investigate the effects of depthwise convolutions as well as the choice of shrinking by reducing the width of the network rather than the number of layers.

We then show the trade offs of reducing the network based on the two hyper-parameters: width multiplier and resolution multiplier, and compare results to a number of popular models.

We then investigate MobileNets applied to a number of different applications.

4.1. Model Choices
First we show results for MobileNet with depthwise separable convolutions compared to a model built with full convolutions.

In Table 4 we see that using depthwise separable convolutions compared to full convolutions only reduces accuracy by 1% on ImageNet while saving tremendously on mult-adds and parameters.

We next show results comparing thinner models with width multiplier to shallower models using fewer layers.

To make MobileNet shallower, the 5 layers of separable filters with feature size 14 × 14 × 512 in Table 1 are removed.

Table 5 shows that, at similar computation and number of parameters, making MobileNets thinner is 3% better than making them shallower.

4.2. Model Shrinking Hyperparameters

Table 6 shows the accuracy, computation and size trade offs of shrinking the MobileNet architecture with the width multiplier α.

Accuracy drops off smoothly until the architecture is made too small at α = 0.25.

Table 7 shows the accuracy, computation and size trade offs for different resolution multipliers by training MobileNets with reduced input resolutions. Accuracy drops off smoothly across resolution.

Figure 4 shows the trade off between ImageNet Accuracy and computation for the 16 models made from the cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}.

Results are log linear with a jump when models get very small at α = 0.25.

Figure 5 shows the trade off between ImageNet Accuracy and number of parameters for the 16 models made from the cross product of width multiplier α ∈ {1, 0.75, 0.5, 0.25} and resolutions {224, 192, 160, 128}.

Table 8 compares full MobileNet to the original GoogleNet [30] and VGG16 [27]. MobileNet is nearly as accurate as VGG16 while being 32 times smaller and 27 times less compute intensive.

It is more accurate than GoogleNet while being smaller and more than 2.5 times less computation.

Table 9 compares a reduced MobileNet with width multiplier α = 0.5 and reduced resolution 160 × 160. Reduced MobileNet is 4% better than AlexNet [19] while being 45× smaller and 9.4× less compute than AlexNet.

It is also 4% better than Squeezenet [12] at about the same size and 22× less computation.

4.3. Fine Grained Recognition

We train MobileNet for fine grained recognition on the Stanford Dogs dataset [17].

We extend the approach of [18] and collect an even larger but noisy training set than [18] from the web.

We use the noisy web data to pretrain a fine grained dog recognition model and then fine tune the model on the Stanford Dogs training set.

Results on the Stanford Dogs test set are in Table 10.

MobileNet can almost achieve the state of the art results from [18] at greatly reduced computation and size.

4.4. Large Scale Geolocalization

PlaNet [35] casts the task of determining where on earth a photo was taken as a classification problem.

The approach divides the earth into a grid of geographic cells that serve as the target classes and trains a convolutional neural network on millions of geo-tagged photos.

PlaNet has been shown to successfully localize a large variety of photos and to outperform Im2GPS [6, 7] that addresses the same task.

We re-train PlaNet using the MobileNet architecture on the same data.

While the full PlaNet model based on the Inception V3 architecture [31] has 52 million parameters and 5.74 billion mult-adds, the MobileNet model has only 13 million parameters (the usual 3 million for the body and 10 million for the final layer) and 0.58 billion mult-adds.
MobileNet模型只有1300万个参数,通常是300万个参数,最后一个参数是1000万个参数,0.58百万个参数。

As shown in Tab. 11, the MobileNet version delivers only slightly decreased performance compared to PlaNet despite being much more compact. Moreover, it still outperforms Im2GPS by a large margin.

4.5. Face Attributes

Another use-case for MobileNet is compressing large systems with unknown or esoteric training procedures.

In a face attribute classification task, we demonstrate a synergistic relationship between MobileNet and distillation [9], a knowledge transfer technique for deep networks.

We seek to reduce a large face attribute classifier with 75 million parameters and 1600 million Mult-Adds.

The classifier is trained on a multi-attribute dataset similar to YFCC100M [32].

We distill a face attribute classifier using the MobileNet architecture.

Distillation [9] works by training the classifier to emulate the outputs of a larger model instead of the ground-truth labels, hence enabling training from large (and potentially infinite) unlabeled datasets.

Marrying the scalability of distillation training and the parsimonious parameterization of MobileNet, the end system not only requires no regularization (e.g. weight-decay and early-stopping), but also demonstrates enhanced performance.

It is evident from Tab. 12 that the MobileNet-based classifier is resilient to aggressive model shrinking: it achieves a similar mean average precision across attributes (mean AP) as the in-house model while consuming only 1% of the Mult-Adds.

4.6. Object Detection

MobileNet can also be deployed as an effective base network in modern object detection systems.

We report results for MobileNet trained for object detection on COCO data based on the recent work that won the 2016 COCO challenge [10].

In Table 13, MobileNet is compared to VGG and Inception V2 [13] under both the Faster-RCNN [23] and SSD [21] frameworks.

In our experiments, SSD is evaluated with 300 input resolution (SSD 300) and Faster-RCNN is compared with both 300 and 600 input resolution (Faster- RCNN 300, Faster-RCNN 600).

The Faster-RCNN model evaluates 300 RPN proposal boxes per image.

The models are trained on COCO train+val excluding 8k minival images and evaluated on minival.

For both frameworks, MobileNet achieves comparable results to other networks with only a fraction of computational complexity and model size.

4.7. Face Embeddings

The FaceNet model is a state of the art face recognition model [25].

It builds face embeddings based on the triplet loss.

To build a mobile FaceNet model we use distillation to train, by minimizing the squared differences of the outputs of FaceNet and MobileNet on the training data.
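A minimal sketch of this objective is below; facenet and mobilenet_embedder are hypothetical placeholders for the teacher and student models.

```python
import tensorflow as tf

def embedding_distillation_loss(teacher_emb, student_emb):
    # Train the small MobileNet to reproduce the large FaceNet's output
    # embeddings on the training data, squared-difference style.
    return tf.reduce_mean(tf.square(teacher_emb - student_emb))

# One step (sketch): the teacher's outputs are fixed targets.
# loss = embedding_distillation_loss(tf.stop_gradient(facenet(images)),
#                                    mobilenet_embedder(images))
```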

Results for very small MobileNet models can be found in Table 14.

5. Conclusion

We proposed a new model architecture called MobileNets based on depthwise separable convolutions.

We investigated some of the important design decisions leading to an efficient model.

We then demonstrated how to build smaller and faster MobileNets using width multiplier and resolution multiplier by trading off a reasonable amount of accuracy to reduce size and latency.

We then compared different MobileNets to popular models demonstrating superior size, speed and accuracy characteristics.

We concluded by demonstrating MobileNet’s effectiveness when applied to a wide variety of tasks.

As a next step to help adoption and exploration of MobileNets, we plan on releasing models in TensorFlow.
