EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
【Google Research, Brain Team 2019】
Recommended reference: the blog post EfficientNet网络详解 (a detailed walk-through of the EfficientNet network)
Abstract
- In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance.
- The method uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient.
- Neural architecture search is used to design a relatively simple baseline network, which is then scaled up with the compound coefficient.
1. Introduction
1.1 Model Scaling
- width: increase the number of channels in the network
- depth: increase the number of layers in the network
- resolution: increase the resolution of the network's input
- compound scaling: increase the network's width, depth, and resolution together
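As a toy illustration of the four options above (not code from the paper), the three knobs can be seen as multipliers on a single stage description, and compound scaling simply applies all three at once. The `scale_stage` helper and the config keys are made up for this sketch:

```python
# Toy sketch: the three scaling knobs applied to one conv-stage config.
# width      multiplies the number of channels,
# depth      multiplies the number of repeated layers,
# resolution multiplies the input spatial size.

def scale_stage(stage, width=1.0, depth=1.0, resolution=1.0):
    """Return a scaled copy of a stage config (rounded to integers)."""
    return {
        "channels": int(round(stage["channels"] * width)),
        "layers": int(round(stage["layers"] * depth)),
        "input_size": int(round(stage["input_size"] * resolution)),
    }

base = {"channels": 32, "layers": 2, "input_size": 224}
print(scale_stage(base, width=1.5))                              # wider
print(scale_stage(base, depth=2.0))                              # deeper
print(scale_stage(base, width=1.5, depth=2.0, resolution=1.15))  # compound
```

Compound scaling is just the last call: all three multipliers applied to the same baseline.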
2. Related Work
- ConvNet Accuracy: GoogleNet, SENet, GPipe
- ConvNet Efficiency: model compression, SqueezeNets, MobileNets, ShuffleNets, neural architecture search
【NAS was initially used to design efficient mobile-size ConvNets; this paper applies it to large networks for the first time】
Recently, neural architecture search becomes increasingly popular in designing efficient mobile-size ConvNets (Tan et al., 2019; Cai et al., 2019). However, it is unclear how to apply these techniques for larger models that have much larger design space and much more expensive tuning cost.
- Model Scaling: ResNet (depth), WideResNet and MobileNets (width)
3. Compound Model Scaling
3.1 Problem Formulation
- General formulation of a ConvNet:
- Fix the per-stage operators Fi and search over each layer's Li, Ci, Hi, Wi:
By fixing Fi, model scaling simplifies the design problem for new resource constraints, but it still remains a large design space to explore different Li, Ci, Hi, Wi for each layer.
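For reference, the formulation referred to above can be written out explicitly (reconstructed from the paper's Eq. 1 and Eq. 2; \(\bigodot\) denotes stage-wise composition and hatted symbols are the fixed baseline values):

```latex
% A ConvNet as a composition of repeated stages (Eq. 1):
\mathcal{N} = \bigodot_{i=1\ldots s} \mathcal{F}_i^{\,L_i}\bigl(X_{\langle H_i,\, W_i,\, C_i \rangle}\bigr)

% Model scaling as a constrained optimization problem (Eq. 2):
\max_{d,\,w,\,r} \; \mathrm{Accuracy}\bigl(\mathcal{N}(d, w, r)\bigr)
\quad \text{s.t.} \quad
\mathcal{N}(d, w, r) = \bigodot_{i=1\ldots s}
  \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}
  \bigl(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\bigr),
\qquad
\mathrm{Memory}(\mathcal{N}) \le \text{target memory}, \quad
\mathrm{FLOPS}(\mathcal{N}) \le \text{target FLOPS}
```

Here d, w, r are the depth, width, and resolution coefficients that scale the fixed baseline network.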
3.2 Scaling Dimensions
- Depth: increasing network depth captures richer, more complex features that generalize well to other tasks, but very deep networks face vanishing gradients and become hard to train. Although techniques such as skip connections (He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015) alleviate the training problem, the accuracy gain of very deep networks diminishes.
- Width: wider networks tend to be able to capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have difficulties in capturing higher-level features.
- Resolution: with higher resolution input images, ConvNets can potentially capture more fine-grained patterns, but the accuracy gain diminishes for very high resolutions, and larger images also increase the computational cost.
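The cost side of these three dimensions can be checked with a back-of-the-envelope FLOP count for a stack of 3x3 conv layers (a toy model, not the paper's code): FLOPs grow linearly with depth but quadratically with width and resolution.

```python
# Approximate multiply-adds for a stack of same-size conv layers:
# one layer costs H * W * C_in * C_out * k * k, and the stack
# multiplies that by the number of layers.

def conv_stack_flops(layers, channels, resolution, kernel=3):
    per_layer = resolution * resolution * channels * channels * kernel * kernel
    return layers * per_layer

base = conv_stack_flops(layers=4, channels=64, resolution=56)
print(conv_stack_flops(8, 64, 56) / base)    # 2x depth      -> 2.0
print(conv_stack_flops(4, 128, 56) / base)   # 2x width      -> 4.0
print(conv_stack_flops(4, 64, 112) / base)   # 2x resolution -> 4.0
```

This is why doubling width or resolution is roughly twice as expensive as doubling depth, which motivates balancing all three.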
3.3 Compound Scaling
- Intuitively, for higher resolution images, we should increase network depth, such that the larger receptive fields can help capture similar features that include more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, in order to capture more fine-grained patterns with more pixels in high-resolution images. These intuitions suggest that we need to coordinate and balance the different scaling dimensions rather than rely on conventional single-dimension scaling.
- compound scaling method:
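The compound scaling method uses a single coefficient \(\phi\) to scale all three dimensions at once (the paper's Eq. 3; \(\alpha, \beta, \gamma\) are constants determined by a small grid search on the baseline):

```latex
\text{depth: } d = \alpha^{\phi}, \qquad
\text{width: } w = \beta^{\phi}, \qquad
\text{resolution: } r = \gamma^{\phi}
```
```latex
\text{s.t.} \quad
\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\qquad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1
```

Since FLOPs scale with \(d \cdot w^{2} \cdot r^{2}\), total FLOPs under this scheme grow by about \((\alpha \cdot \beta^{2} \cdot \gamma^{2})^{\phi} \approx 2^{\phi}\).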
4. EfficientNet Architecture
4.1 EfficientNet-B0 baseline network
4.2 MBConv
The MBConv structure consists of:
- a 1x1 pointwise convolution (expansion, with BN and Swish)
- a kxk depthwise convolution (with BN and Swish); the value of k can be read off the EfficientNet-B0 architecture, which uses only 3x3 and 5x5
- an SE (squeeze-and-excitation) module
- a 1x1 pointwise convolution (dimension reduction, with BN but no Swish)
- a Dropout layer

The shortcut connection exists only when the feature map entering the MBConv block has the same shape as the block's output.
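A minimal PyTorch sketch of the block described above. The expansion ratio, the SE reduction ratio, and the use of plain `nn.Dropout` in place of the official implementation's stochastic-depth drop-connect are simplifying assumptions; `nn.SiLU` is PyTorch's name for Swish:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, se_channels):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, se_channels, 1)   # squeeze
        self.fc2 = nn.Conv2d(se_channels, channels, 1)   # excite
    def forward(self, x):
        s = x.mean((2, 3), keepdim=True)                 # global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s                                     # channel-wise rescale

class MBConv(nn.Module):
    def __init__(self, c_in, c_out, kernel=3, stride=1, expand=6, drop=0.2):
        super().__init__()
        c_mid = c_in * expand
        # shortcut only when input and output shapes match
        self.use_shortcut = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),       # 1x1 expand
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_mid, kernel, stride,      # kxk depthwise
                      padding=kernel // 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            SqueezeExcite(c_mid, max(1, c_in // 4)),     # SE module
            nn.Conv2d(c_mid, c_out, 1, bias=False),      # 1x1 project, BN only
            nn.BatchNorm2d(c_out),
        )
        self.dropout = nn.Dropout(drop)
    def forward(self, x):
        out = self.block(x)
        if self.use_shortcut:
            out = x + self.dropout(out)
        return out

x = torch.randn(1, 16, 32, 32)
print(MBConv(16, 16)(x).shape)               # shortcut branch: same shape
print(MBConv(16, 24, stride=2)(x).shape)     # torch.Size([1, 24, 16, 16])
```

Note how the projection conv has BN but no activation, matching the list above.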
4.3 Compound Scaling Method
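For EfficientNet-B0 the paper's grid search found α = 1.2, β = 1.1, γ = 1.15; the larger models fix these constants and increase φ. A quick sketch of the resulting multipliers (the official models also round channel and layer counts, and the exact φ per B1–B7 variant is not reproduced here, so this is only illustrative):

```python
# Compound scaling with the paper's grid-searched constants.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Depth, width, and resolution multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs grow roughly like (alpha * beta^2 * gamma^2)^phi ~= 2^phi
print(ALPHA * BETA**2 * GAMMA**2)   # ~1.92, close to the constraint of 2
for phi in (0, 1, 2, 3):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Each increment of φ roughly doubles the FLOPs while splitting the extra capacity across all three dimensions.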
5. Experiments
5.1 EfficientNet Performance Results on ImageNet
5.2 Scaling Up MobileNets and ResNet
5.3 Class Activation Map (CAM)
- In order to further understand why the compound scaling method is better than the others, the paper compares class activation maps of models scaled along different single dimensions against the compound-scaled model.
- Images are randomly picked from ImageNet validation set. As shown in the figure, the model with compound scaling tends to focus on more relevant regions with more object details, while other models are either lack of object details or unable to capture all objects in the images.