SBCFormer: Lightweight Network Capable of Full-size ImageNet Classification at 1 FPS on Single Board Computers

Abstract

Background
Computer vision has become increasingly prevalent in solving real-world problems across diverse domains, including smart agriculture, fishery, and livestock management.
These applications may not require processing many image frames per second, leading practitioners to use single board computers (SBCs).
Although many lightweight networks have been developed for “mobile/edge” devices, they primarily target smartphones, whose processors are considerably more powerful than the low-end CPUs found in SBCs.


Aim of this paper
This paper introduces a CNN-ViT hybrid network called SBCFormer, which achieves high accuracy and fast computation on such low-end CPUs.
The hardware constraints of these CPUs make the Transformer’s attention mechanism preferable to convolution.
The challenge of computing attention on low-end CPUs
However, using attention on low-end CPUs presents a challenge:
high-resolution internal feature maps demand excessive computational resources, but reducing their resolution results in the loss of local image details.


Solution
SBCFormer introduces an architectural design to address this issue.
As a result, SBCFormer achieves the best trade-off between accuracy and speed on a Raspberry Pi 4 Model B with an ARM Cortex-A72 CPU.
For the first time, it achieves an ImageNet-1K top-1 accuracy of around 80% at a speed of 1.0 frame/sec on the SBC.

1 Introduction

Background: CNNs for mobile deployment
Deep neural networks have been used in various computer vision tasks across different settings, which require running them for inference on diverse hardware. To meet this demand, numerous designs of deep neural networks have been proposed for mobile and edge devices. Since the introduction of MobileNet [27], many researchers have proposed various architectural designs of convolutional neural networks (CNNs) for mobile devices [46, 49, 68].

Enter ViT: previously considered hard to deploy on mobile, but now viable when combined with CNNs
Moreover, following the introduction of the vision transformer (ViT) [12], several attempts have been made to adapt ViT for mobile devices [4, 8, 42, 65].

[4] EfficientViT: Enhanced linear attention for high-resolution low-computation visual recognition.
[8] MobileFormer: Bridging MobileNet and Transformer.
[42] MobileViT: Lightweight, general-purpose, and mobile-friendly vision transformer.
[65] Rethinking mobile block for efficient neural models.

The current trend involves developing CNN-ViT hybrid models [20, 21, 35, 50]. Thanks to these studies, whereas ViTs were previously considered slow and lightweight CNNs were the only viable option for mobile devices, recent hybrid models for mobile devices now surpass CNNs in the trade-off between computational efficiency and inference accuracy [14, 31, 32, 44].

Hardware context: prior work has paid little attention to low-end devices
Previous studies have mainly focused on smartphones as “mobile/edge” devices.
Although smartphone processors are less powerful than the GPUs/TPUs found in servers, they are still quite capable and sit in the mid-range of the processor spectrum.

There are also “low-end” processors, such as CPUs/MPUs for embedded systems, which usually have far more limited computational power.
Nonetheless, these processors have been utilized in various real-world applications such as smart agriculture [41, 69] and AI applications for fishery [60] and livestock management [2,30], where limited computational resources are sufficient.

For example, in object detection to prevent damage by wild animals, processing dozens of frames per second may not be necessary [1].
In many cases, processing at around one frame per second is practical.
In fact, lightweight models, such as MobileNet and YOLO, have been quite popular in such applications, often implemented using a camera-equipped single board computer (SBC).

Focus and approach of this study
This study focuses on low-end processors, which have been underexplored in the development of lightweight networks.
Given their constraints, we introduce an architectural design named SBCFormer.
Central question: a central question guiding our research is whether convolution or the Transformer’s attention mechanism is better suited to SBCs.
As outlined in [14], convolution requires complex memory access patterns, necessitating high IO throughput for efficient processing, whereas attention is comparatively simpler.
Additionally, both operations are ultimately translated into matrix multiplications, and attention usually deals with smaller matrix dimensions than the traditional im2col approach to convolution.

"im2col"是“image to column”的缩写,翻译过来就是“从图像到矩阵”的意思1。在卷积神经网络(CNN)中,im2col是一个重要的函数,它将4维的图像数据转换为2维的numpy数组2。具体来说,im2col函数将图像块重新排列成列3。这是一种实现卷积操作的技术,它使用GEMM(通用矩阵乘法)操作4。使用im2col展开输入数据后,之后就只需将卷积层的滤波器(权重)纵向展开为1列,并计算2个矩阵的乘积即可。

Considering that SBCs are inferior to GPUs in parallel computation resources and memory bandwidth, attention emerges as the preferred foundational building block for SBCs.

Nonetheless, the attention computation has a computational complexity that is quadratic in the number of tokens.
Thus, it is crucial to keep the spatial resolution of the feature maps low to ensure computational efficiency and reduced latency. (Note that a feature map with a spatial resolution of H × W corresponds to HW tokens.)

Approach
Using the plain ViT architecture, which keeps the feature-map resolution constant across all layers, leads to a loss of local details from the input image because of the coarse feature maps.

In response, recent models aiming for computational efficiency, especially CNN-ViT hybrids [32, 40, 42, 54], adopt a foundation more like CNNs: the feature maps are progressively downsampled from input to output.

Given that applying attention in all layers can greatly increase computational cost, especially in layers with high spatial resolution, these models use attention only in the upper layers.

This design takes advantage of the Transformer’s attention mechanism, known for its strength in global interaction of image features, while retaining local details in the feature maps.

However, on SBCs, the convolutions in the lower layers can become problematic, leading to longer computation times.

Approach: block structure

To tackle the challenge of preserving local information while optimizing attention computation, our SBCFormer employs a two-stream block structure.

(1) The first stream shrinks the input feature map, applies attention to the reduced number of tokens, and then restores the map to its initial size, keeping the attention computation efficient.


(2) Recognizing the potential loss of local information from downsizing, the second stream acts as a 'pass-through' that retains the local information in the input feature map.


These streams converge, generating a feature map enriched with both local and global information, primed for the subsequent layer.

Furthermore, we refine the Transformer’s attention mechanism to offset the diminished representational capacity that comes from operating on such small feature maps.


Overview of experiments

Our experiments demonstrate the effectiveness of SBCFormer; see Fig. 1. As a result of the advancements mentioned above, SBCFormer achieves the best accuracy-speed trade-off on a widely used single board computer (SBC), namely a Raspberry Pi 4 Model B with an ARM Cortex-A72 CPU.
In fact, SBCFormer attains an ImageNet-1K top-1 accuracy close to 80.0% at a speed of 1.0 frame per second on the SBC, marking the first time this level of performance has been achieved.

2 Related Work

2.1. Convolutional Networks for Mobile Devices

Prior research on lightweight models for deployment
In recent years, there has been a growing demand for deep neural networks in vision applications across various fields, prompting researchers to turn their attention to efficient neural network design.

One approach involves making convolutions computationally more efficient, as demonstrated by works such as SqueezeNet [28].
MobileNet [27] introduces depth-wise separable convolutions to alleviate the high computational cost of a standard convolutional layer and meet the resource constraints of edge devices. MobileNetV2 [46] improves on this design by introducing the inverted residual block.
Our proposed SBCFormer employs this block as its primary building block for convolution operations.

Note: in MobileNetV2, the inverted residual block first expands the channels with a 1×1 convolution, then applies a 3×3 depth-wise convolution, and finally reduces the channels with another 1×1 convolution, the reverse of the residual block in ResNet, hence the name "inverted residual block". This structure reduces the information lost while extracting features and keeps the number of parameters and the amount of computation low, which is why SBCFormer adopts it as the main building block for its convolutional operations.

Another approach aims at efficient overall designs of convolutional neural network (CNN) architectures, as demonstrated in works such as Inception [47] and MnasNet [48]. Other studies have proposed lightweight models, including ShuffleNet v1 [68], ESPNetv2 [43], GhostNet [17], MobileNeXt [71], EfficientNet [49], and TinyNet [18], among others.

Remaining problems with CNNs and these approaches
(1) It is worth noting that CNNs, including those mentioned above, can only capture local spatial correlations in images at each layer and do not account for global interactions.

(2) Another important point is that convolution on standard-sized images can be computationally expensive on CPUs, since it requires large matrix multiplications. These are areas where the Vision Transformer (ViT) [12] has an advantage.

2.2 ViTs and CNN-ViT Hybrids for Mobile Devices

Thanks to the self-attention mechanism [56] and large-scale image datasets, Vision Transformer (ViT) [12] and related ViT-based architectures [3, 6, 29, 52, 72] have attained state-of-the-art inference accuracy in various visual recognition tasks [16, 67].

Nevertheless, to fully leverage their potential, ViT-based models typically require significant computational and memory resources, which has limited their deployment on resource-constrained edge devices.


Progress on making ViTs more efficient
Subsequently, a series of studies have focused on enhancing the efficiency of ViTs from various perspectives.
Inspired by hierarchical designs in convolutional architectures, some works have developed new architectures for ViTs [24, 58, 62, 66].

(1) For example, LeViT [14] reintroduces a convolutional stem at the beginning of the network to learn low-resolution features, rather than using the patchy stem of ViT [12].

“Patchy stem"是指在ViT(Vision Transformer)模型中,将输入图像划分为多个固定大小的块(或称为"patch”),然后将这些块线性化(即将每个块展平为一维向量),并作为Transformer的输入

(2) EdgeViT [44] introduces Local-Global-Local blocks to better integrate self-attention and convolution, allowing the model to capture spatial tokens with varying ranges and exchange information between them.

(3) MobileFormer [8] parallelizes MobileNet and a Transformer to encode both local and global features and fuses the two branches through a bidirectional bridge.
(4) MobileViT [42] treats Transformer blocks as convolutions and develops a MobileViT block to effectively learn both local and global information.
(5) Finally, EfficientFormer [32] employs a hybrid approach that combines convolutional layers and self-attention layers to achieve a balance between accuracy and efficiency.

Although the work above is promising, many issues remain unresolved
Despite active research in developing hybrid models for mobile devices, there are still several issues that need to be addressed.
(1) Firstly, many of these studies do not use latency (i.e., inference time) as the primary metric for evaluating efficiency, as discussed later.
(2) Secondly, low-end CPUs are often excluded from these studies, with the targets limited to smartphone CPUs/NPUs and, at best, Intel CPUs. LeViT, for example, was evaluated on an ARM CPU, specifically the ARM Graviton 2, which is designed for cloud servers.

2.3 Evaluation Metrics

There are multiple metrics for assessing computational efficiency, including the number of model parameters, operations (i.e., FLOPs), inference time or latency, and memory usage.
While all of these metrics are relevant, latency is of particular interest in the context of this study.
It is noteworthy that Dehghani et al. [10] and Vasu et al. [55] show that efficiency in terms of latency does not correlate well with the number of FLOPs and parameters.

As mentioned earlier, several studies have focused on developing lightweight and efficient CNNs. However, only a handful of them, such as MnasNet [48], MobileNetV3 [26], and ShuffleNet v2 [39], have directly optimized for latency.

The same holds true for studies on CNN-ViT hybrids, some of which are designed primarily for mobile devices [8, 42]; most of these studies did not prioritize latency but instead focused on metrics such as FLOPs [8].

Why latency, and the target hardware
Latency is often avoided in such studies for a good reason: the instruction set of each processor and the compilers used with it heavily influence latency.


Therefore, obtaining practical evaluation results necessitates choosing specific processors at the expense of general discussion. In this paper, we select the CPU used in a single board computer, the Raspberry Pi, as our primary target; such boards are widely employed in edge applications across various fields. The Raspberry Pi is equipped with an ARM Cortex-A72 microprocessor, part of the ARM Cortex-A series designed for mobile platforms.


Note on "general discussion": because each processor has different characteristics and performance, evaluation results obtained on one processor may not generalize to other processors. In other words, the results are useful for that specific processor but may not support a general comparison; this is what "sacrificing general discussion" means.

3 CNN-ViT Hybrid for Low-end CPUs

Goal
We aim to develop a CPU-friendly ViT-CNN hybrid network that achieves a better trade-off between test-time latency and inference accuracy.

3.1 Principle of Design

Model design: pyramid structure

We adopt the fundamental architecture that is commonly used in recent CNN-ViT hybrids.
The network’s initial stage comprises a set of standard convolution layers, which excel at converting the input image into a feature map [14, 32, 63], rather than the linear mapping from image patches to tokens used in ViT [12, 51].

The main section of the network is divided into multiple stages, and the feature maps are reduced in size between consecutive stages. This results in a pyramid structure of feature maps with dimensions of H/8 × W/8, H/16 × W/16, H/32 × W/32, and so on.

Prior work: the trade-off between computational cost and where to apply attention remains unresolved
The computational complexity of the Transformer attention mechanism increases quadratically with the number of tokens, i.e., the size h × w of the input feature map.
Hence, the lower stages, which have larger feature maps, incur higher computational cost.
Some studies address this issue by applying attention only to sub-regions/tokens of the feature maps [19, 36, 45, 53]. Studies targeting mobile devices typically adopt attention only in the higher layers (those closer to the output) [32, 40, 54]. While this avoids the increased computational cost, it leads to suboptimal inference accuracy, as it gives up one of the most important properties of ViT, namely aggregating global information across the image.

In short: some works place attention in the lower (input-side) layers, which drives up computational cost; others restrict it to the higher (output-side) layers, which lowers the cost but sacrifices global information.

Our solution: an hourglass design

Taking these considerations into account, we propose to downsize the input feature map, apply attention to the downsized map, and then upsize the resulting feature map.

In our experiments, we downsized the feature map to 7 × 7 regardless of the stage, for an input image of size 224 × 224. This hourglass design allows us to aggregate global information from the entire image while minimizing computational costs.
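To get a rough sense of the savings (our own back-of-the-envelope arithmetic, not figures from the paper): for a 224 × 224 input, the first-stage feature map of size H/8 × W/8 = 28 × 28 corresponds to 784 tokens, so a full attention map over it would have $784^2 = 614{,}656$ entries, whereas the pooled 7 × 7 map has 49 tokens and only $49^2 = 2{,}401$ entries:

$$\frac{(28 \cdot 28)^2}{(7 \cdot 7)^2} = 16^2 = 256,$$

i.e., roughly a 256× reduction in the size of the attention map at that stage.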

However, downsizing the feature map to this small size can lead to a loss of local information.

To address this issue, we design a block with two parallel streams: one for local features and the other for global features.
(1) Specifically, the local stream keeps the original feature map size and does not perform any attention operation.
(2) The global stream uses the hourglass design of attention described above, which first downsizes the feature map, applies attention, and then upsizes it back to the original size.
The outputs of the two streams are merged and passed to the next block.
More details are given in Sec. 3.3. Additionally, to compensate for the loss of representational power due to the hourglass design, we propose a modified attention mechanism; see Sec. 3.4.

3.2 Overall Design of SBCFormer

Figure 2 shows the overall architecture of the proposed SBCFormer.

The network begins with an initial section (labeled 'Stem' in the diagram) that consists of three convolution layers with 3 × 3 kernels and stride 2, which converts the input image into a feature map.

The main section comprises three stages, each connected to the next by a single convolution layer (labeled 'Embedding' in the diagram). This layer uses a stride-2 3 × 3 convolution to halve the size of the input feature map.
For the output section, we employ global average pooling followed by a fully-connected linear layer as the final layer of the network, specifically for image classification.
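A minimal PyTorch sketch of the Stem and Embedding layers as described above is given below; the channel widths and the Conv-BN-GeLU ordering are our own assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, stride):
    # 3x3 convolution followed by BatchNorm and GeLU (ordering assumed)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )

class Stem(nn.Module):
    """Three 3x3, stride-2 convolutions: 224x224 input -> 28x28 feature map."""
    def __init__(self, c_out=48):  # c_out is a placeholder width
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_act(3, c_out // 2, stride=2),       # 224 -> 112
            conv_bn_act(c_out // 2, c_out, stride=2),   # 112 -> 56
            conv_bn_act(c_out, c_out, stride=2),        # 56  -> 28
        )
    def forward(self, x):
        return self.layers(x)

class Embedding(nn.Module):
    """Single stride-2 3x3 convolution that halves the map between stages."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = conv_bn_act(c_in, c_out, stride=2)
    def forward(self, x):
        return self.proj(x)

x = torch.randn(1, 3, 224, 224)
print(Stem()(x).shape)  # torch.Size([1, 48, 28, 28])
```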

3.3 SBCFormer Block

We denote the input feature map to the block at the i-th stage by $X_i$.
(1) InvRes
To start the block, we place $m_i$ consecutive inverted residual blocks [46], first used in MobileNetV2. We use a variant with a GeLU activation function, consisting of a point-wise convolution, a GeLU activation, and a depth-wise convolution with a 3 × 3 filter. We call this InvRes in what follows. These blocks convert the input map $X_i$ into $X_i^l$ as

$$X_i^l = \mathcal{F}^{m_i}(X_i),$$

where $\mathcal{F}^{m_i}(\cdot)$ indicates the application of $m_i$ consecutive InvRes blocks to the input.
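Below is a minimal sketch of such a GeLU-variant inverted residual block (InvRes); the expansion ratio and the normalization placement are our assumptions, following the MobileNetV2-style structure (1×1 expand, 3×3 depth-wise, 1×1 project) described in the note in Sec. 2.1.

```python
import torch.nn as nn

class InvRes(nn.Module):
    """Inverted residual block with GeLU (sketch; expansion ratio assumed)."""
    def __init__(self, channels, expand=4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),       # 1x1 point-wise expand
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),             # 3x3 depth-wise conv
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1, bias=False),       # 1x1 point-wise project
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return x + self.block(x)  # residual connection (input/output sizes match)
```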

(2) Feeding the two streams
As shown in Fig. 2, the updated feature $X_i^l$ is transferred to two branches, the local and global streams.
(2.1) Local stream
For the local stream, $X_i^l$ is passed directly to the final section of the block.
(3) Global stream
For the global stream, $X_i^l$ is first downsized to $h \times w$ by an average pooling layer, denoted 'Pool' in Fig. 2. We set it to 7 × 7 regardless of the stage in our experiments.
(3.1) Mixer and MAttn
The downsized map is then passed to a block consisting of two consecutive InvRes blocks, denoted 'Mixer' in the diagram, and next to a stack of attention blocks named 'MAttn'.
(3.2) Global-stream output
The output feature map is then upsized and passed through a convolution, an operation denoted 'ConvT'.
These operations provide a feature map $X_i^g$, computed as

$$X_i^g = \mathrm{ConvT}\Big(\mathcal{F}^{L_i}_{\mathrm{MAttn}}\big(\mathrm{Mixer}(\mathrm{Pool}(X_i^l))\big)\Big),$$

where $\mathcal{F}^{L_i}_{\mathrm{MAttn}}(\cdot)$ indicates the application of $L_i$ consecutive MAttn blocks to the input.

(4) Fusing the two streams
In the last section of the block, the local stream feature $X_i^l$ and the global stream feature $X_i^g$ are fused to obtain a new feature map, as shown in Fig. 2.

To fuse the two, we first modulate $X_i^l$ with a weight map created from $X_i^g$.
Specifically, we compute a weight map $W_i$ as

$$W_i = \mathrm{Sigmoid}\big(\mathrm{Proj}(X_i^g)\big),$$

where Proj indicates a point-wise convolution followed by batch normalization.

We then multiply it with $X_i^l$ and concatenate the resulting map with $X_i^g$ in the channel dimension as

$$X_i^u = \mathrm{Concat}\big[\,W_i \odot X_i^l,\; X_i^g\,\big],$$

where $\odot$ is the Hadamard product.

Note: the Hadamard product of two matrices $A = (a_{ij})$ and $B = (b_{ij})$ of the same size is the matrix $C = (c_{ij})$ with $c_{ij} = a_{ij} b_{ij}$, i.e., the element-wise product.

Finally, the fused feature $X_i^u$ is passed through another projection block, $\mathrm{Proj}(X_i^u)$, which halves the number of channels and yields the output of this block.
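Putting the pieces together, the following is a rough sketch of one SBCFormer block's forward pass as we read the description above. It reuses the InvRes sketch from earlier, takes the MAttn stack as a parameter (see the sketch in Sec. 3.4 below), and assumes a sigmoid gate for the weight-map step and nearest-neighbour interpolation inside 'ConvT'; none of these details are confirmed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SBCFormerBlockSketch(nn.Module):
    """Two-stream SBCFormer block (rough sketch; layer details are assumptions)."""
    def __init__(self, channels, m=2, pooled=7, mattn=None):
        super().__init__()
        self.invres = nn.Sequential(*[InvRes(channels) for _ in range(m)])   # m consecutive InvRes
        self.pool = nn.AdaptiveAvgPool2d(pooled)                             # 'Pool': downsize to 7x7
        self.mixer = nn.Sequential(InvRes(channels), InvRes(channels))       # 'Mixer': two InvRes blocks
        self.mattn = mattn if mattn is not None else nn.Identity()           # stack of L MAttn blocks
        self.convt = nn.Sequential(                                          # conv part of 'ConvT'
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        self.proj_w = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                    nn.BatchNorm2d(channels))                # 'Proj' for the weight map
        self.proj_out = nn.Sequential(nn.Conv2d(2 * channels, channels, 1, bias=False),
                                      nn.BatchNorm2d(channels))              # halve channels after concat

    def forward(self, x):
        x_l = self.invres(x)                                   # shared/local features X^l
        h, w = x_l.shape[-2:]
        g = self.mattn(self.mixer(self.pool(x_l)))             # global stream on the pooled 7x7 map
        x_g = self.convt(F.interpolate(g, size=(h, w)))        # upsize and convolve ('ConvT') -> X^g
        w_map = torch.sigmoid(self.proj_w(x_g))                # weight map from the global stream (gate assumed)
        fused = torch.cat([w_map * x_l, x_g], dim=1)           # modulate local stream, then concatenate
        return self.proj_out(fused)                            # halve channels; output for the next block
```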

3.4 Modified Attention Mechanism

The two-stream design above compensates for the loss of local information caused by the proposed hourglass attention computation.
However, because the attention operates on a very low-resolution (i.e., small) feature map, the attention computation itself loses some representational power.

Modifying self-attention
To compensate for the loss, we make a few modifications to the Transformer attention mechanism; see 'MAttn' in Fig. 2.

The main concept is to utilize the standard computation tuple of a CNN for the input to attention, specifically a 3 × 3 (depth-wise) convolution, a GeLU activation function, and batch normalization.
The input to attention is composed of query, key, and value, and we apply the tuple to the value since it forms the basis of the attention's output.
Our objective is to enhance the representational power by facilitating the aggregation of spatial information in the input feature map, while simultaneously reducing the training difficulty.
To offset the increase in computational cost, we eliminate the independent linear transformations applied to the query and key and instead apply the identical point-wise convolution to all three components.

Note: the "standard computation tuple of a CNN" refers to a set of operations commonly used in convolutional networks (a 3×3 depth-wise convolution, a GeLU activation, and batch normalization) that together are effective at extracting features from the input.

The details of the modified attention computation are as follows.

Let $X \in \mathbb{R}^{h \times w \times C_i}$ denote the input to the attention mechanism. The output $X'' \in \mathbb{R}^{h \times w \times C_i}$ is computed as

$$X'' = \mathrm{FFN}(X') + X',$$
where FFN stands for a feed-forward network as in ViTs [12, 51], and $X'$ is defined to be

$$X' = \mathrm{Linear}\big(\mathrm{MHSA}(\mathrm{PW\text{-}Conv}(X))\big) + X,$$
where Linear is a linear layer with learnable weights and PW-Conv indicates a point-wise convolution; MHSA is defined by

$$\mathrm{MHSA}(Z) = \mathrm{Softmax}\!\left(\frac{Z Z^\top}{\sqrt{d}} + \mathbf{1}\, b^\top\right) Y',$$

with $Z$ denoting its (flattened) input,
where $d$ is the channel number of each head in the query and key; $b \in \mathbb{R}^{hw}$ is a learnable bias acting as positional encoding [14, 32]; $\mathbf{1} \in \mathbb{R}^{hw}$ is an all-one vector; and $Y'$ is defined by

$$Y' = \mathrm{BN}\big(\mathrm{DW\text{-}ConvG}(Z)\big),$$
where DW-ConvG indicates a depth-wise convolution followed by GeLU, and BN is batch normalization applied in the same way as in CNNs.
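A minimal, single-head sketch of the modified attention as reconstructed above follows; the multi-head split, the FFN width, and the residual placements are our assumptions and may differ from the paper.

```python
import torch
import torch.nn as nn

class MAttnSketch(nn.Module):
    """Modified attention (sketch): single head for brevity; details assumed."""
    def __init__(self, channels, hw=7 * 7):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, 1)                   # shared point-wise conv (query = key)
        self.dwg = nn.Sequential(                                    # CNN tuple applied to the value
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.GELU(),
            nn.BatchNorm2d(channels),
        )
        self.bias = nn.Parameter(torch.zeros(hw))                    # learnable positional bias b
        self.linear = nn.Linear(channels, channels)
        self.ffn = nn.Sequential(nn.Linear(channels, 2 * channels),  # feed-forward network (width assumed)
                                 nn.GELU(),
                                 nn.Linear(2 * channels, channels))

    def forward(self, x):                                            # x: (B, C, h, w), e.g. the 7x7 map
        n, c, h, w = x.shape
        z = self.pw(x)                                               # identical projection for Q, K, V
        qk = z.flatten(2).transpose(1, 2)                            # (B, hw, C): used as both query and key
        v = self.dwg(z).flatten(2).transpose(1, 2)                   # (B, hw, C): value Y' (DW-ConvG + BN)
        attn = qk @ qk.transpose(1, 2) / c ** 0.5 + self.bias        # (B, hw, hw) scores + positional bias
        out = attn.softmax(dim=-1) @ v                               # aggregate values over all tokens
        out = self.linear(out) + x.flatten(2).transpose(1, 2)        # linear projection, residual with input
        out = out + self.ffn(out)                                    # FFN with residual
        return out.transpose(1, 2).reshape(n, c, h, w)
```

A stack of such blocks (e.g., nn.Sequential of several MAttnSketch instances) can be passed as the mattn argument of the block sketch in Sec. 3.3.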

4 Experimental Results

We conduct experiments to evaluate SBCFormer and compare it with existing networks on two tasks, image classification using ImageNet1K [11] and object detection using the COCO dataset [34].

4.1 Experimental Settings

SBCFormer primarily targets low-end CPUs that are commonly used in single board computers.


Additionally, we evaluate its performance on an Intel CPU commonly found in edge devices, as well as on a GPU used in desktop PCs.
We use the following three processors and platforms for our experiments.
(1) An ARM Cortex-A72 processor running at 1.5 GHz on a single board computer, the Raspberry Pi 4 Model B. While it is classified as low-end, the ARM Cortex-A72 is a quad-core 64-bit processor that supports the ARM Neon instruction set. We used the 32-bit Raspberry Pi OS and PyTorch 1.6.0 to run the networks.
(2) An Intel Core i7-3520M processor running at 2.9 GHz. It is a dual-core processor commonly used in mobile devices such as laptops and tablets. It supports a variety of instruction sets, including Intel Advanced Vector Extensions (AVX) and AVX2, which provide improved performance for vector and matrix operations. We used Ubuntu 18.04.5 LTS and PyTorch 1.10.1 to run the networks.
(3) A GeForce RTX 2080Ti in a desktop PC equipped with an Intel Xeon E5-1650 v3 CPU. We used Ubuntu 18.04.6 LTS and PyTorch 1.10.1.


We implemented and tested all the networks using the PyTorch framework (version 1.10) and the Timm library [61]. For each of the existing networks we compare against, we employ the authors' official code, except for a few networks. We follow previous studies [32, 44] to measure the inference time (i.e., latency) required to process a single input image. Specifically, setting the batch size to 1, we recorded the wall-clock time on each platform. To ensure accuracy, we performed 300 inferences and report the average latency in seconds. During measurement, we terminated any irrelevant applications that could interfere with the results. All computations used 32-bit floating point numbers. Since our focus is on inference speed rather than training, we trained all networks on a GPU server with eight Nvidia 2080Ti GPUs and then evaluated their inference time on each platform.

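A minimal sketch of this measurement protocol (batch size 1, 300 timed inferences, average reported); the warm-up iterations are our own addition and are not mentioned in the text.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_size=(1, 3, 224, 224), runs=300, warmup=10):
    """Average wall-clock time per single-image inference, in seconds."""
    model.eval()
    x = torch.randn(*input_size)            # batch size 1, 224x224 input
    for _ in range(warmup):                 # warm-up (assumption, not stated in the text)
        model(x)
    start = time.perf_counter()
    for _ in range(runs):                   # 300 timed inferences
        model(x)
    return (time.perf_counter() - start) / runs

# Example usage (hypothetical model):
# lat = measure_latency(my_model)
# print(f"latency: {lat:.3f} s / image ({1.0 / lat:.2f} FPS)")
```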

4.2. ImageNet-1K

We first evaluate the networks on the most standard task, image classification of ImageNet-1K.

4.2.1 Training

We train SBCFormer and existing networks from scratch for 300 epochs on the training split of the ImageNet-1K dataset, which consists of 1.28 million images across 1,000 classes. We consider four variants with different model sizes, SBCFormer-XS, -S, -B, and -L, as shown in Table 1. All models are trained and tested at the standard resolution of 224 × 224.

We followed the original authors' code to train the existing networks. For training SBCFormer, we used the recipe from DeiT [51], summarized as follows. We employed the AdamW [38] optimizer with cosine learning rate scheduling [37] and applied a linear warm-up for the first five epochs. The initial learning rate was set to $2.5 \times 10^{-4}$ and the minimum value to $10^{-5}$. The weight decay and momentum were set to $5 \times 10^{-2}$ and 0.9, respectively, and a batch size of 200 was used. Data augmentation techniques, including random cropping, random horizontal flipping, mixup, random erasing, and label smoothing, were applied during training, following [44, 51]. Random cropping was applied to the input image during training to obtain an image size of 224 × 224 pixels, while a single center crop of the same size was used during testing.
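The optimizer and schedule described above can be sketched roughly as follows; interpreting "momentum = 0.9" as AdamW's first beta and combining the warm-up and cosine decay in a single LambdaLR are our own choices, not details taken from the paper.

```python
import math
import torch

def build_optimizer_and_scheduler(model, epochs=300, warmup_epochs=5,
                                  base_lr=2.5e-4, min_lr=1e-5,
                                  weight_decay=5e-2, momentum=0.9):
    # AdamW with the quoted hyperparameters; betas=(momentum, 0.999) is our reading
    # of "momentum = 0.9" for an Adam-family optimizer.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  betas=(momentum, 0.999), weight_decay=weight_decay)

    def lr_lambda(epoch):
        # Linear warm-up for the first 5 epochs, then cosine decay down to min_lr.
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        t = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        cosine = 0.5 * (1 + math.cos(math.pi * t))
        return (min_lr + (base_lr - min_lr) * cosine) / base_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Usage (per-epoch stepping):
# optimizer, scheduler = build_optimizer_and_scheduler(model)
# for epoch in range(300):
#     train_one_epoch(...)   # hypothetical training helper
#     scheduler.step()
```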



4.2.2 Results

It is observed that the SBCFormer variants with different model sizes achieve a better trade-off between accuracy and latency on CPUs; see also Fig. 1. The performance gap between SBCFormer and the other models is more pronounced on the ARM CPU than on the Intel CPU. Notably, SBCFormer achieves only mediocre or inferior trade-offs on the GPU. These results are consistent with our design goal, as SBCFormer is optimized for running fast on CPUs with limited computational resources.


5. Conclusions

We have proposed a new deep network design, called SBCFormer, that achieves a favorable balance between inference accuracy and computational speed when used with the low-end CPUs commonly found in single-board computers (SBCs). These CPUs are not efficient at performing large matrix multiplications, making the Transformer's attention mechanism more attractive than CNNs. However, attention is computationally expensive when applied to large feature maps. SBCFormer mitigates this cost by first reducing the input feature map size, applying attention to the smaller map, and then restoring it to its original size. However, this approach has side effects, such as the loss of local image information and the limited representational ability of attention on small maps. To address these issues, we introduced two novel designs. First, we add a parallel stream to the attention computation, which passes through the input feature map, allowing it to retain local image information. Second, we enhance the attention mechanism by incorporating standard CNN components. Our experiments have shown that SBCFormer achieves a good trade-off between accuracy and speed on a popular SBC, the Raspberry Pi 4 Model B with an ARM Cortex-A72 CPU.

