EfficientFormer: Vision Transformers at MobileNet Speed

Abstract

Background
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices.
Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory.

Raising the question and this paper's strategy
This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance?

(1)To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs.

(2)Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm.

(3) Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer.

ps: "Latency-driven slimming" is a model-optimization method whose main goal is to reduce inference time, i.e., the time the model needs to go from receiving an input to producing an output.

Brief conclusion
Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency.
Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

Introduction

Background on ViT
The transformer architecture [1], initially designed for Natural Language Processing (NLP) tasks, introduces the Multi-Head Self Attention (MHSA) mechanism that allows the network to model long-term dependencies and is easy to parallelize.

In this context, Dosovitskiy et al. [2] adapt the attention mechanism to 2D images and propose Vision Transformer (ViT): the input image is divided into non-overlapping patches, and the inter-patch representations are learned through MHSA without inductive bias.

Other ViT research
ViTs demonstrate promising results compared to convolutional neural networks (CNNs) on computer vision tasks. Following this success, several efforts explore the potential of ViT by improving training strategies [3, 4, 5], introducing architecture changes [6, 7], redesigning attention mechanisms [8, 9], and elevating the performance of various vision tasks such as classification [10, 11, 12], segmentation [13, 14], and detection [15, 16].

Problems with ViT: the latency issue

On the downside, transformer models are usually times slower than competitive CNNs [17, 18].
There are many factors that limit the inference speed of ViT, including the massive number of parameters, quadratic-increasing computation complexity with respect to token length, non-foldable normalization layers, and lack of compiler level optimizations (e.g., Winograd for CNN [19]).

The high latency makes transformers impractical for real-world applications on resource-constrained hardware, such as augmented or virtual reality applications on mobile devices and wearables.
As a result, lightweight CNNs [20, 21, 22] remain the default choice for real-time inference.

Research background on the latency problem of transformers
To alleviate the latency bottleneck of transformers, many approaches have been proposed.
For instance, some efforts consider designing new architectures or operations by changing the linear layers with convolutional layers (CONV) [23], combining self-attention with MobileNet blocks [24], or introducing sparse attention [25, 26, 27] to reduce the computational cost, while other efforts leverage network searching algorithms [28] or pruning [29] to improve efficiency.

However, the question of whether ViT can run at MobileNet speed remains unresolved
Although the computation-performance trade-off has been improved by existing works,
the fundamental question that relates to the applicability of transformer models remains unanswered: Can powerful vision transformers run at MobileNet speed and become a default option for edge applications?

Contributions of this work
This work provides a study towards the answer through the following contributions:

(1) First, we revisit the design principles of ViT and its variants through latency analysis (Sec. 3). Following existing work [18], we utilize iPhone 12 as the testbed and publicly available CoreML [30] as the compiler, since the mobile device is widely used and the results can be easily reproduced.

(2)Second, based on our analysis, we identify inefficient designs and operators in ViT and propose a new dimension-consistent design paradigm for vision transformers (Sec. 4.1).

(3) Third, starting from a supernet with the new design paradigm, we propose a simple yet effective latency-driven slimming method to obtain a new family of models, namely, EfficientFormers (Sec. 4.2). We directly optimize for inference speed instead of MACs or number of parameters [31, 32, 33].

ps: The supernet (i.e., the design paradigm above) contains many candidate sub-network structures; it lets us search a large network space for the optimal architecture.

Summary of experimental results
Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K [34] classification task with only 1.6 ms inference time (averaged over 1,000 runs), which runs as fast as MobileNetV2×1.4 and wields 4.5% higher top-1 accuracy (more results in Fig. 1 and Tab. 1).
The promising results demonstrate that latency is no longer an obstacle for the widespread adoption of vision transformers.
Our largest model, EfficientFormer-L7, achieves 83.3% accuracy with only 7.0 ms latency, outperforming ViT×MobileNet hybrid designs (MobileViT-XS, 74.8%, 7.2 ms) by a large margin.

Additionally, we observe superior performance by employing EfficientFormer as the backbone in image detection and segmentation benchmarks (Tab. 2).
We provide a preliminary answer to the aforementioned question, ViTs can achieve ultra fast inference speed and wield powerful performance at the same time.
We hope our EfficientFormer can serve as a strong baseline and inspire followup works on the edge deployment of vision transformers.

2 Related Work

Related research on Transformers and ViT
Transformers are initially proposed to handle the learning of long sequences in NLP tasks [1].

Dosovitskiy et al. [2] and Carion et al. [15] adapt the transformer architecture to classification and detection, respectively, and achieve competitive performance against CNN counterparts with stronger training techniques and larger-scale datasets.

DeiT [3] further improves the training pipeline with the aid of distillation, eliminating the need for large-scale pretraining [35].
Inspired by the competitive performance and global receptive field of transformer models, follow-up works are proposed to refine the architecture [36, 37], explore the relationship between CONV nets and ViT [38, 39, 40], and adapt ViT to different computer vision tasks [13, 41, 42, 43, 44, 45, 46].

Other research efforts explore the essence of attention mechanism and propose insightful variants of token mixer, e.g., local attention [8], spatial MLP [47, 48], and pooling-mixer [6].

The lightweight-deployment problem of ViT and related research
Despite the success in most vision tasks, ViT-based models cannot compete with the well-studied lightweight CNNs [21, 49] when the inference speed is the major concern [50, 51, 52], especially on resource-constrained edge devices [17].

To accelerate ViT, many approaches have been introduced with different methodologies, such as proposing new architectures or modules [53, 54, 55, 56, 57, 58], re-thinking self-attention and sparse-attention mechanisms [59, 60, 61, 62, 63, 64, 65], and utilizing search algorithms that are widely explored in CNNs to find smaller and faster ViTs [66, 28, 29, 67].

(1)Recently, LeViT [23] proposes a CONV-clothing design to accelerate vision transformer. However, in order to perform MHSA, the 4D features need to be frequently reshaped into flat patches, which is still expensive to compute on edge resources (Fig. 2).

(2)Likewise, MobileViT [18] introduces a hybrid architecture that combines lightweight MobileNet blocks (with point-wise and depth-wise CONV) and MHSA blocks; the former is placed at early stages in the network pipeline to extract low-level features, while the latter is placed in late stages to enjoy the global receptive field. Similar approach has been explored by several works [24, 28] as a straightforward strategy to reduce computation.

3 On-Device Latency Analysis of Vision Transformers

Analyzing model latency on edge devices
Most existing approaches optimize the inference speed of transformers through computation complexity (MACs) or throughput (images/sec) obtained from server GPUs [23, 28]. However, such metrics do not reflect the real on-device latency.

[Figure 2: latency profiling of various models and operators on iPhone 12; different colors denote the latency of different modules]
Analyzing how each part of the model contributes to latency
To have a clear understanding of which operations and design choices slow down the inference of ViTs on edge devices, we perform a comprehensive latency analysis over a number of models and operations, as shown in Fig. 2, whereby the following observations are drawn.

(1) Patch embedding with large kernel and stride is a speed bottleneck on mobile devices.
Patch embedding is often implemented with a non-overlapping convolution layer that has a large kernel size and stride [3, 55].
A common belief is that the computation cost of the patch embedding layer in a transformer network is unremarkable or negligible [2, 6]. However, our comparison in Fig. 2 between models with large kernel and stride for patch embedding, i.e., DeiT-S [3] and PoolFormer-S24 [6], and the models without it, i.e., LeViT-256 [23] and EfficientFormer, shows that patch embedding is instead a speed bottleneck on mobile devices. (large-kernel patch embedding drags the network down)

①Large-kernel convolutions are not well supported by most compilers and cannot be accelerated through existing algorithms like Winograd [19]. (large-kernel convolutions cannot be accelerated)
②Alternatively, the non-overlapping patch embedding can be replaced by a convolution stem with fast downsampling [68, 69, 23] that consists of several hardware-efficient 3 × 3 convolutions (Fig. 3), as illustrated in the sketch below. (a stack of 3×3 convolutions is faster)

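To make the contrast concrete, below is a minimal PyTorch sketch (channel counts are illustrative, not the paper's exact configuration) of a ViT-style non-overlapping patch embedding versus a convolution stem built from hardware-efficient 3×3, stride-2 convolutions:

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: a single non-overlapping convolution with large
# kernel and stride (16x16, stride 16), which is poorly supported by mobile
# compilers and cannot be accelerated with Winograd.
vit_patch_embed = nn.Conv2d(3, 192, kernel_size=16, stride=16)

# Convolution stem with fast downsampling: two 3x3, stride-2 convolutions,
# each followed by BN and an activation; these operators are hardware friendly.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 24, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(24),
    nn.GELU(),
    nn.Conv2d(24, 48, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(48),
    nn.GELU(),
)

x = torch.randn(1, 3, 224, 224)
# Note the two variants downsample by different factors (1/16 vs 1/4);
# the point here is the operator choice, not an equivalent replacement.
print(vit_patch_embed(x).shape)  # [1, 192, 14, 14]
print(conv_stem(x).shape)        # [1, 48, 56, 56]
```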

(2) Consistent feature dimension is important for the choice of token mixer. MHSA is not necessarily a speed bottleneck.

ps: The token mixer is a key component of vision transformer models that fuses information across spatial locations; choosing different token mixers yields different families of general vision models.

Recent work extends ViT-based models to the MetaFormer architecture [6] consisting of MLP blocks and unspecified token mixers. Selecting a token mixer is an essential design choice when building ViT-based models.
The options are many—the conventional MHSA mixer with a global receptive field, more sophisticated shifted window attention [8], or a non-parametric operator like pooling [6].

We narrow the comparison down to two token mixers, pooling and MHSA, where we choose the former for its simplicity and efficiency and the latter for its better performance. More complicated token mixers such as shifted window attention are not yet supported by most public mobile compilers, so we leave them out of our scope. In addition, we do not use depth-wise convolution to replace pooling, since we focus on building the architecture without the help of lightweight convolutions.

To understand the latency of the two token mixers, we perform the following two comparisons:
①First, by comparing PoolFormer-s24 [6] and LeViT-256 [23], we observe that the Reshape operation is a bottleneck for LeViT-256. The majority of LeViT-256 is implemented with CONV on 4D tensor, requiring frequent reshaping operations when forwarding features into MHSA since the attention has to be performed on patchified 3D tensor (discarding the extra dimension of attention heads). The extensive usage of Reshape limits the speed of LeViT on mobile devices (Fig. 2). On the other hand, pooling naturally suits the 4D tensor when the network primarily consists of CONV-based implementations, e.g., CONV 1 × 1 as MLP implementation and CONV stem for downsampling. As a result, PoolFormer exhibits faster inference speed. (Reshape is the bottleneck for CONV-heavy ViT variants)

②Second, by comparing DeiT-Small [3] and LeViT-256 [23], we find that MHSA does not bring significant overhead on mobiles if the feature dimensions are consistent and Reshape is not required. Though much more computation intensive, DeiT-Small with a consistent 3D feature can achieve comparable speed to the new ViT variant, i.e., LeViT-256. (when Reshape is not needed, MHSA does not add much latency)

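For illustration, a small sketch (shapes are arbitrary) of the 4D-to-3D Reshape traffic that CONV-heavy designs pay around every attention block, versus a pooling mixer that stays in 4D:

```python
import torch
import torch.nn.functional as F

x4d = torch.randn(1, 64, 28, 28)          # B, C, H, W feature map from CONV layers

# MHSA on a CONV-based backbone: flatten 4D -> 3D before attention, then reshape
# back to 4D for the following CONV layers; repeating this for every block is the
# Reshape overhead observed for LeViT-256.
tokens = x4d.flatten(2).transpose(1, 2)   # [1, 784, 64] patchified 3D tensor
# ... MHSA would run on `tokens` here ...
x4d_back = tokens.transpose(1, 2).reshape(1, 64, 28, 28)

# Pooling token mixer: operates directly on the 4D tensor, no Reshape needed.
mixed = F.avg_pool2d(x4d, kernel_size=3, stride=1, padding=1)
```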

In this work, we propose a dimension-consistent network (Sec. 4.1) with both 4D feature implementation and 3D MHSA, but the inefficient frequent Reshape operations are eliminated. (this paper's block design, based on the findings above)

(3) CONV-BN is more latency-favorable than LN (GN)-Linear and the accuracy drawback is generally acceptable.

ps: CONV-BN: the combination of convolution (CONV) and batch normalization (BN).
LN (GN)-Linear: the combination of layer normalization (LN) or group normalization (GN) and a linear projection.

Choosing the MLP implementation is another essential design choice. Usually, one of the two options is selected: layer normalization (LN) with 3D linear projection (proj.) and CONV 1 × 1 with batch normalization (BN). CONV-BN is more latency favorable because BN can be folded into the preceding convolution for inference speedup, while dynamic normalizations, such as LN and GN, still collect running statistics at the inference phase, thus contributing to latency. From the analysis of DeiT-Small and PoolFormer-S24 in Fig. 2 and previous work [17], the latency introduced by LN constitutes around 10%-20% of the whole network latency. (BN+CONV is faster than LN+proj.)

ps: "BN can be folded into the preceding convolution for inference speedup" means that, at inference time, the batch normalization (BN) layer can be merged into the convolution (CONV) layer before it, which improves inference speed (see the sketch below).
Why do dynamic normalizations still need to collect statistics at inference time? LN and GN compute the mean and variance of each feature during inference, which requires gathering running statistics on the fly, whereas BN already computed and stored the per-channel mean and variance during training.
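As an illustration of BN folding, here is a minimal sketch (assuming a plain Conv2d + BatchNorm2d pair with groups=1; not the paper's code):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to conv followed by bn (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    # scale = gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
conv.eval()
bn.eval()  # use running statistics, as at inference time
x = torch.randn(1, 16, 8, 8)
fused = fold_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # True
```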

Based on our ablation study in Appendix Tab. 3, CONV-BN only slightly downgrades performance compared to GN and achieves comparable results to channel-wise LN. In this work, we apply CONV-BN as much as possible (in all latent 4D features) for the latency gain with a negligible performance drop, while using LN for the 3D features, which aligns with the original MHSA design in ViT and yields better accuracy. (this paper's block design, based on the findings above)

(4) The latency of nonlinearity is hardware and compiler dependent.
Lastly, we study nonlinearity, including GeLU, ReLU, and HardSwish. Previous work [17] suggests GeLU is not efficient on hardware and slows down inference. However, we observe GeLU is well supported by iPhone 12 and hardly slower than its counterpart, ReLU. On the contrary, HardSwish is surprisingly slow in our experiments and may not be well supported by the compiler (LeViT-256 latency with HardSwish is 44.5 ms while with GeLU 11.9 ms). We conclude that nonlinearity should be determined on a case-by-case basis given specific hardware and compiler at hand. We believe that most of the activations will be supported in the future. In this work, we employ GeLU activations.

Summary of the latency analysis:
(1) On mobile devices, patch embedding with a large-kernel convolution is slow; replacing it with small 3×3 convolutions is much faster.
(2) Frequent Reshape operations (4D↔3D) add latency, while MHSA itself has limited impact, so keeping feature dimensions consistent matters. This work eliminates the inefficient frequent Reshape operations.
(3) BN+CONV has lower latency than LN (GN)+proj., because layer-style normalizations still collect statistics at inference time. This work therefore uses BN with convolutions wherever possible and keeps LN inside the attention blocks.
(4) Different hardware and compilers accelerate different activation functions differently.

4 Design of EfficientFormer

[Figure 3: overview of the EfficientFormer architecture]

Based on the latency analysis, we propose the design of EfficientFormer, demonstrated in Fig. 3.
The network consists of a patch embedding (PatchEmbed) and a stack of meta transformer blocks, denoted as MB:

ps: The "meta transformer block" (MB) here follows the MetaFormer abstraction [6]: a block with an unspecified token mixer followed by an MLP. It is unrelated to the multimodal "Meta-Transformer" framework.

$$\mathcal{Y} = \prod_i^m \mathrm{MB}_i\big(\mathrm{PatchEmbed}(\mathcal{X}_0^{B,3,H,W})\big) \qquad (1)$$
where X0 is the input image with batch size as B and spatial size as [H,W], Y is the desired output, and m is the total number of blocks (depth).
MB consists of unspecified token mixer (TokenMixer) followed by a MLP block and can be expressed as follows:

$$\mathcal{X}_{i+1} = \mathrm{MB}_i(\mathcal{X}_i) = \mathrm{MLP}\big(\mathrm{TokenMixer}(\mathcal{X}_i)\big) \qquad (2)$$
where Xi (i > 0) is the intermediate feature forwarded into the i-th MB. We further define a Stage (or S) as the stack of several MetaBlocks that process features with the same spatial size, e.g., N1× in Fig. 3 denotes that S1 has N1 MetaBlocks.
The network includes 4 Stages. Among each Stage, there is an embedding operation to project embedding dimension and downsample token length, denoted as Embedding in Fig. 3.

With the above architecture, EfficientFormer is a fully transformer-based model without integrating MobileNet structures.
Next, we dive into the details of the network design, specifically, the architecture details and the search algorithm.

4.1 Dimension-Consistent Design

With the observations in Sec. 3, we propose a dimension consistent design which splits the network into a 4D partition where operators are implemented in CONV-net style (MB4D), and a 3D partition where linear projections and attentions are performed over 3D tensor to enjoy the global modeling power of MHSA without sacrificing efficiency (MB3D), as shown in Fig. 3. Specifically, the network starts with 4D partition, while 3D partition is applied in the last stages. Note that Fig. 3 is just an instance, the actual length of 4D and 3D partition is specified later through architecture search.

First, input images are processed by a CONV stem with two 3 × 3 convolutions with stride 2 as patch embedding,

$$\mathcal{X}_1^{B,C_1,\frac{H}{4},\frac{W}{4}} = \mathrm{PatchEmbed}\big(\mathcal{X}_0^{B,3,H,W}\big) \qquad (3)$$
where Cj is the channel number (width) of the j-th stage. Then the network starts with MB4D with a simple Pool mixer to extract low-level features,

$$\mathcal{I}_i = \mathrm{Pool}\big(\mathcal{X}_i\big) + \mathcal{X}_i, \qquad \mathcal{X}_{i+1} = \mathrm{Conv}_{B}\big(\mathrm{Conv}_{B,G}(\mathcal{I}_i)\big) + \mathcal{I}_i \qquad (4)$$
where Conv_B and Conv_{B,G} indicate whether the convolution is followed by BN and GeLU, respectively.
Note that we do not employ Group or Layer Normalization (LN) before the Pool mixer as in [6], since the 4D partition is a CONV-BN based design and there is thus a BN in front of each Pool mixer.

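A minimal PyTorch sketch of an MB4D block under this formulation (a simplified reading of Eq. 4, not the authors' released implementation; the kernel size and MLP expansion ratio are assumptions):

```python
import torch
import torch.nn as nn

class ConvBN(nn.Sequential):
    """1x1 convolution followed by BatchNorm (foldable at inference)."""
    def __init__(self, in_ch, out_ch):
        super().__init__(nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.BatchNorm2d(out_ch))

class MB4D(nn.Module):
    """4D MetaBlock: pooling token mixer + CONV-BN MLP, both with residuals."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1, count_include_pad=False)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(ConvBN(dim, hidden), nn.GELU(), ConvBN(hidden, dim))

    def forward(self, x):                 # x: [B, C_j, H', W'], stays 4D throughout
        x = x + self.pool(x)              # token mixing (Eq. 4, first line)
        x = x + self.mlp(x)               # CONV 1x1 MLP (Eq. 4, second line)
        return x

print(MB4D(48)(torch.randn(1, 48, 56, 56)).shape)  # [1, 48, 56, 56]
```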

After processing all the MB4D blocks, we perform a one-time reshaping to transform the features size and enter 3D partition.
MB3D follows conventional ViT structure, as in Fig. 3. Formally,

$$\mathcal{I}_i = \mathrm{Linear}\big(\mathrm{MHSA}(\mathrm{Linear}(\mathrm{LN}(\mathcal{X}_i)))\big) + \mathcal{X}_i, \qquad \mathcal{X}_{i+1} = \mathrm{Linear}\big(\mathrm{Linear}_G(\mathrm{LN}(\mathcal{I}_i))\big) + \mathcal{I}_i \qquad (5)$$
where Linear_G denotes a Linear layer followed by GeLU, and

$$\mathrm{MHSA}(Q,K,V) = \mathrm{Softmax}\Big(\frac{Q\cdot K^{T}}{\sqrt{d_k}} + b\Big)\cdot V \qquad (6)$$

where Q, K, and V are the query, key, and value matrices learned via the linear projections, b is a parameterized attention bias acting as position encoding, and d_k is the key dimension used for scaling.

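A minimal sketch of an MB3D block following Eqs. 5-6 (simplified: it reuses nn.MultiheadAttention, and the learned attention bias b is parameterized as a full per-head table over token pairs, which is an assumption about its exact form):

```python
import torch
import torch.nn as nn

class MB3D(nn.Module):
    """3D MetaBlock: LN + MHSA (with learned attention bias) + LN + Linear MLP."""
    def __init__(self, dim, num_heads=8, num_tokens=49, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned attention bias acting as position encoding (one table per head).
        self.attn_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                    # x: [B, N, C], 3D tokens
        B = x.shape[0]
        bias = self.attn_bias.repeat(B, 1, 1)                # [B*heads, N, N] additive bias
        y = self.norm1(x)
        x = x + self.attn(y, y, y, attn_mask=bias, need_weights=False)[0]  # Eq. 5, first line
        x = x + self.mlp(self.norm2(x))                      # Eq. 5, second line
        return x

print(MB3D(224, num_tokens=49)(torch.randn(1, 49, 224)).shape)  # [1, 49, 224]
```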

4.2 Latency Driven Slimming

First: Design of Supernet.
Based on the dimension-consistent design, we build a supernet for searching efficient models of the network architecture shown in Fig. 3 (Fig. 3 shows an example of searched final network). In order to represent such a supernet, we define the MetaPath (MP), which is the collection of possible blocks:

$$\mathrm{MP}_{i,\,j\in\{1,2\}} \in \{\mathrm{MB}^{4D},\ I\}, \qquad \mathrm{MP}_{i,\,j\in\{3,4\}} \in \{\mathrm{MB}^{4D},\ \mathrm{MB}^{3D},\ I\} \qquad (7)$$
where I represents the identity path, j denotes the j-th Stage, and i denotes the i-th block. The supernet can be illustrated by replacing MB in Fig. 3 with MP.

As in Eqn. 7, in S1 and S2 of the supernet, each block can select from MB4D or I, while in S3 and S4, the block can be MB3D, MB4D, or I.

We only enable MB3D in the last two Stages for two reasons.
(1)First, since the computation of MHSA grows quadratically with respect to token length, integrating it in early Stages would largely increase the computation cost.
(2)Second, applying the global MHSA to the last Stages aligns with the intuition that early stages in the networks capture low-level features, while late layers learn long-term dependencies.

Second: Searching Space.
Our searching space includes Cj(the width of each Stage), Nj (the number of blocks in each Stage, i.e., depth), and last N blocks to apply MB3D.

Third: Searching Algorithm.
Previous hardware-aware network searching methods generally rely on hardware deployment of each candidate in search space to obtain the latency, which is time consuming [71]. In this work, we propose a simple, fast yet effective gradient-based search algorithm to obtain a candidate network that just needs to train the supernet for once.
The algorithm has three major steps.

(1) First, we train the supernet with Gumbel Softmax sampling [72] to get the importance score for the blocks within each MP, which can be expressed as

ps: Gumbel Softmax is a continuous distribution that can be smoothly annealed into a categorical distribution, and its parameter gradients can be computed with the reparameterization trick. It makes sampling from a discrete (categorical) distribution differentiable, so gradients can be backpropagated through the sampling step. This is why it appears widely in deep learning, e.g., classification and segmentation, generative models, reinforcement learning, speech recognition, and NAS.

$$\mathcal{X}_{i+1} = \sum_{n} \frac{e^{(\alpha_i^{n}+\epsilon_i^{n})/\tau}}{\sum_{n} e^{(\alpha_i^{n}+\epsilon_i^{n})/\tau}} \cdot \mathrm{MP}_{i,n}(\mathcal{X}_i) \qquad (8)$$
where α evaluates the importance of each block in MP as it represents the probability to select a block, e.g., MB4D or MB3D for the i-th block. ε ∼ U(0, 1) ensures exploration, τ is the temperature, and n represents the type of blocks in MP, i.e., n ∈ {4D, I} for S1 and S2, and n ∈ {4D, 3D, I} for S3 and S4. By using Eqn. 8, the derivatives with respect to network weights and α can be computed easily. The training follows the standard recipe (see Sec. 5.1) to obtain the trained weights and architecture parameter α.

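A minimal sketch of how a MetaPath can mix its candidate blocks with Gumbel Softmax sampling during supernet training (Eq. 8); the candidate blocks and temperature handling here are illustrative stand-ins, not the released search code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaPath(nn.Module):
    """Weighted mixture of candidate blocks (e.g., MB4D, MB3D, Identity) with
    learnable architecture parameters alpha, sampled via Gumbel Softmax (Eq. 8)."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.alpha = nn.Parameter(torch.zeros(len(blocks)))        # importance scores

    def forward(self, x, tau=1.0):
        # Gumbel(0, 1) noise derived from uniform samples ensures exploration.
        eps = -torch.log(-torch.log(torch.rand_like(self.alpha)))
        weights = F.softmax((self.alpha + eps) / tau, dim=0)        # differentiable selection
        return sum(w * blk(x) for w, blk in zip(weights, self.blocks))

# A real S3/S4 MetaPath would hold {MB4D, MB3D (with a reshape wrapper), Identity};
# here two placeholder candidates are enough to show the mechanism.
mp = MetaPath([nn.Conv2d(48, 48, 1), nn.Identity()])
print(mp(torch.randn(1, 48, 28, 28), tau=5.0).shape)  # [1, 48, 28, 28]
```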

(2) Second, we build a latency lookup table by collecting the on-device latency of MB4D and MB3D with different widths (multiples of 16).

(3) Finally, we perform network slimming on the supernet obtained from the first step through latency evaluation using the lookup table. Note that a typical gradient-based searching algorithm simply selects the block with the largest α [72], which does not fit our scope as it cannot search the width Cj. In fact, constructing a multiple-width supernet is memory-consuming and even unrealistic given that each MP has several branches in our design. Instead of directly searching on the complex searching space, we perform a gradual slimming on the single-width supernet as follows.

We first define the importance score for MPi as α_i^{4D} / α_i^{I} for S1,2 and (α_i^{3D} + α_i^{4D}) / α_i^{I} for S3,4, respectively.
Similarly, the importance score for each Stage can be obtained by summing up the scores for all MP within the Stage. With the importance score, we define the action space that includes three options: 1) select I for the least important MP, 2) remove the first MB3D, and 3) reduce the width of the least important Stage (by multiples of 16). Then, we calculate the resulting latency of each action through the lookup table, and evaluate the accuracy drop of each action. Lastly, we choose the action based on per-latency accuracy drop (-%/ms). This process is performed iteratively until the target latency is achieved. We show more details of the algorithm in Appendix.

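A toy sketch of the slimming loop (only the width-reduction action is shown; the latency lookup table and accuracy-drop estimates are made-up numbers, so this is a schematic of the procedure rather than the authors' algorithm):

```python
# The "network" is reduced to per-stage widths and block counts; the lookup table
# maps a width to a per-block latency in milliseconds.
latency_table = {w: 0.01 * w for w in range(16, 513, 16)}   # ms per block at width w
widths = [48, 96, 224, 448]                                  # stage widths (multiples of 16)
blocks = [3, 2, 6, 4]                                        # blocks per stage
acc_drop_per_action = [0.30, 0.20, 0.10, 0.05]               # est. top-1 drop for shrinking stage s

def total_latency(ws):
    return sum(n * latency_table[w] for n, w in zip(blocks, ws))

target_ms = 14.0
while total_latency(widths) > target_ms:
    # score each "shrink stage s by 16 channels" action by accuracy drop per ms saved
    def score(s):
        if widths[s] <= 16:
            return float("inf")                              # cannot shrink below minimum width
        trial = widths[:s] + [widths[s] - 16] + widths[s + 1:]
        saved = total_latency(widths) - total_latency(trial)
        return acc_drop_per_action[s] / max(saved, 1e-6)
    s = min(range(4), key=score)
    widths[s] -= 16

print(widths, round(total_latency(widths), 2))
```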

5 Experiments and Discussion

We implement EfficientFormer through PyTorch 1.11 [73] and Timm library [74], which is the common practice in recent arts [18, 6]. Our models are trained on a cluster with NVIDIA A100 and V100 GPUs. The inference speed on iPhone 12 (A14 bionic chip) is measured with iOS version 15 and averaged over 1,000 runs, with all available computing resources (NPU), or CPU only. CoreMLTools is used to deploy the run-time model. In addition, we provide latency analysis on Nvidia A100 GPU with batch size 64 to exploit hardware roofline. The trained PyTorch models are deployed in ONNX format and are compiled with TensorRT. We report GPU runtime that excludes preprocessing. We provide the detailed network architecture and more ablation studies in Appendix 6.

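For reference, a minimal sketch of exporting a PyTorch model to Core ML with coremltools for on-device latency measurement (the stand-in model, conversion options, and file name are illustrative; exact APIs may vary across coremltools versions):

```python
import torch
import torchvision
import coremltools as ct

# Any torchvision classifier stands in for EfficientFormer here.
model = torchvision.models.mobilenet_v2(weights=None).eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# compute_units=ALL targets NPU+GPU+CPU, CPU_ONLY restricts to CPU,
# mirroring the two latency settings reported in the paper.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_units=ct.ComputeUnit.ALL,
    convert_to="mlprogram",
)
mlmodel.save("efficientformer_like.mlpackage")
```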

5.1 Image Classification

All EfficientFormer models are trained from scratch on ImageNet-1K dataset [34] to perform the image classification task. We employ standard image size (224 × 224) for both training and testing. We follow the training recipe from DeiT [3] but mainly report results with 300 training epochs to have the comparison with other ViT-based models. We use AdamW optimizer [75, 76], warm-up training with 5 epochs, and a cosine annealing learning rate schedule. The initial learning rate is set as 10^-3 × (batch size / 1024) and the minimum learning rate is 10^-5.

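A minimal sketch of the optimizer and learning-rate schedule described above in plain PyTorch (the warm-up handling, weight decay, and steps-per-epoch value are simplified assumptions; the paper follows the DeiT recipe via timm):

```python
import torch

batch_size, epochs, warmup_epochs, steps_per_epoch = 1024, 300, 5, 1251
base_lr = 1e-3 * (batch_size / 1024)          # learning rate scales with batch size
min_lr = 1e-5

model = torch.nn.Linear(196 * 3, 1000)        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

# 5-epoch linear warm-up followed by cosine annealing down to the minimum lr.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-2, total_iters=warmup_epochs * steps_per_epoch)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=(epochs - warmup_epochs) * steps_per_epoch, eta_min=min_lr)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, [warmup, cosine], milestones=[warmup_epochs * steps_per_epoch])
```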

The teacher model for distillation is RegNetY-16GF [77] pretrained on ImageNet with 82.9% top-1 accuracy. Results are demonstrated in Tab. 1 and Fig. 1.

[Table 1 and Figure 1: ImageNet-1K classification results and latency comparison]

Comparison to CNNs.

Compared with widely used CNN-based models, EfficientFormer achieves a better trade-off between accuracy and latency.
On the iPhone Neural Engine, EfficientFormer-L1 runs at the speed of MobileNetV2×1.4 while achieving 4.5% higher top-1 accuracy.
In addition, EfficientFormer-L3 runs at a speed similar to EfficientNet-B0 while achieving 5.3% higher top-1 accuracy. For high-performance models (>83% top-1 accuracy), EfficientFormer-L7 runs more than 3× faster than EfficientNet-B5, demonstrating the superiority of our models.
Furthermore, on a desktop GPU (A100), EfficientFormer-L1 runs 38% faster than EfficientNet-B0 while achieving 2.1% higher top-1 accuracy, and EfficientFormer-L7 runs 4.6× faster than EfficientNet-B5.

These results allow us to answer the central question raised earlier: ViTs do not need to sacrifice latency to achieve good performance, and an accurate ViT can still have ultra-fast inference speed like a lightweight CNN.

Comparison to ViTs.

Conventional ViTs still underperform CNNs in terms of latency.
For example, DeiT-Tiny achieves accuracy similar to EfficientNet-B0 but runs 3.4× slower.
EfficientFormer, however, behaves like other transformer models while running many times faster.
EfficientFormer-L3 achieves higher accuracy than DeiT-Small (82.4% vs. 81.2%) while running 4× faster. Notably, although the recent transformer variant PoolFormer naturally has a consistent 4D architecture and runs faster than typical ViTs, the absence of global MHSA greatly limits its performance upper bound.
EfficientFormer-L3 achieves 1% higher top-1 accuracy than PoolFormer-S36 while running 3× faster on an Nvidia A100 GPU, 2.2× faster on the iPhone NPU, and 6.8× faster on the iPhone CPU.

Comparison to Hybrid Designs.

Existing hybrid designs, e.g., LeViT-256 and MobileViT, still struggle with the latency bottleneck of ViTs and can hardly outperform lightweight CNNs. For example, LeViT-256 runs slower than DeiT-Small while having 1% lower top-1 accuracy.
For MobileViT, a hybrid model containing both MHSA and MobileNet blocks, we observe that it is much slower than its CNN counterparts, e.g., MobileNetV2 and EfficientNet-B0, with unsatisfactory accuracy (2.3% lower than EfficientNet-B0).
Therefore, simply trading MHSA for MobileNet blocks can hardly push the Pareto curve, as shown in Fig. 1. In contrast, EfficientFormer, as a pure transformer-based model, maintains high performance while achieving ultra-fast inference speed.
EfficientFormer-L1 achieves 4.4% higher top-1 accuracy than MobileViT-XS while running faster on different hardware and compilers (1.9× faster on Nvidia A100 GPU, 2.3× faster on iPhone CPU, and 4.5× faster on iPhone NPU).
At a similar inference time, EfficientFormer-L7 achieves 8.5% higher top-1 accuracy on ImageNet than MobileViT-XS, demonstrating the superiority of our design.

5.2 EfficientFormer as Backbone

Object Detection and Instance Segmentation.

We follow the Mask-RCNN implementation, integrating EfficientFormer as the backbone, and verify its performance. We experiment on COCO-2017, which contains 118K training images and 5K validation images.
The EfficientFormer backbone is initialized with ImageNet-1K pretrained weights.
Similar to prior work, we use the AdamW optimizer with an initial learning rate of 2×10^-4 and train the models for 12 epochs with an input size of 1333×800. EfficientFormers consistently outperform the CNN (ResNet) and transformer (PoolFormer) backbones in all settings.
With a similar computation cost, EfficientFormer-L3 outperforms the ResNet50 backbone by 3.4 box AP and 3.7 mask AP, and outperforms the PoolFormer-S24 backbone by 1.3 box AP and 1.1 mask AP, demonstrating that EfficientFormer generalizes well as a strong backbone for vision tasks.

Semantic Segmentation.

We further validate the performance of EfficientFormer on the semantic segmentation task.
We use the challenging scene-parsing dataset ADE20K, which contains 20K training images and 2K validation images covering 150 categories.
Like existing work, we use EfficientFormer as the backbone and Semantic FPN as the segmentation decoder for a fair comparison. The backbone is initialized with ImageNet-1K pretrained weights, and the models are trained for 40K iterations with a total batch size of 32 on 8 GPUs.
Following common practice in semantic segmentation, we use the AdamW optimizer with a poly learning rate schedule (power 0.9) and an initial learning rate of 2×10^-4. We resize and crop the input images to 512×512 for training and set the shorter side to 512 for testing (on the validation set).

As shown in Tab. 2, EfficientFormer consistently outperforms CNN- and transformer-based backbones by a large margin under a similar computation budget. For example, EfficientFormer-L3 outperforms PoolFormer-S24 by 3.2 mIoU. We find that, with global attention, EfficientFormer learns better long-term dependencies, which is beneficial in high-resolution dense prediction tasks.

[Table 2: object detection, instance segmentation, and semantic segmentation results]

5.3 Discussion

Relations to MetaFormer.
The design of EfficientFormer is partly inspired by the MetaFormer concept.
Compared with PoolFormer, EfficientFormer resolves the dimension mismatch problem, which is a root cause of inefficient edge inference, and can therefore exploit global MHSA without sacrificing speed.
As a result, EfficientFormer outperforms PoolFormer in accuracy. Although PoolFormer adopts a fully 4D design, it employs inefficient patch embedding and group normalization (see Fig. 2), which increase latency. In contrast, our redesigned 4D partition of EfficientFormer (see Fig. 3) is more hardware friendly and shows better performance across several tasks.

Limitations: (i) Although most designs in EfficientFormer are general, e.g., the dimension-consistent design and the 4D blocks with CONV-BN fusion, the actual speed of EfficientFormer may vary on other platforms. For instance, if GeLU is not well supported while HardSwish is efficiently implemented on specific hardware and compilers, the operators may need to be modified accordingly.
(ii) The proposed latency-driven slimming is simple and fast. However, if search cost is not a concern and an enumeration-based brute-force search is performed, better results may be obtained.

6 Conclusion

In this work, we show that Vision Transformer can operate at MobileNet speed on mobile devices. Starting from a comprehensive latency analysis, we identify inefficient operators in a series of ViT-based architectures, whereby we draw important observations that guide our new design paradigm. The proposed EfficientFormer complies with a dimension consistent design that smoothly leverages hardware-friendly 4D MetaBlocks and powerful 3D MHSA blocks. We further propose a fast latency-driven slimming method to derive optimized configurations based on our design space. Extensive experiments on image classification, object detection, and segmentation tasks show that EfficientFormer models outperform existing transformer models while being faster than most competitive CNNs. The latency-driven analysis of ViT architecture and the experimental results validate our claim: powerful vision transformers can achieve ultra-fast inference speed on the edge. Future research will further explore the potential of EfficientFormer on several resource-constrained devices.
