SBCFormer: Lightweight Network Capable of Full-size ImageNet Classification at 1 FPS on Single Board Computers


Computer vision has become increasingly prevalent in solving real-world problems across diverse domains, including smart agriculture, fishery, and livestock management.
These applications may not require processing many image frames per second, leading practitioners to use single board computers (SBCs).
Although many lightweight networks have been developed for “mobile/edge” devices, they primarily target smartphones, which have more powerful processors, rather than SBCs with low-end CPUs.


This paper introduces a CNN-ViT hybrid network called SBCFormer, which achieves high accuracy and fast computation on such low-end CPUs.
The hardware constraints of these CPUs make the Transformer’s attention mechanism preferable to convolution.
However, using attention on low-end CPUs presents a challenge:
high-resolution internal feature maps demand excessive computational resources, but reducing their resolution results in the loss of local image details.


SBCFormer introduces an architectural design to address this issue.
As a result, SBCFormer achieves the best trade-off between accuracy and speed on a Raspberry Pi 4 Model B with an ARM Cortex-A72 CPU.
For the first time, it achieves an ImageNet-1K top-1 accuracy of around 80% at a speed of 1.0 frame/sec on the SBC.

1 Introduction

Deep neural networks have been used in various computer vision tasks across different settings, which require running them for inference on diverse hardware. To meet this demand, numerous designs of deep neural networks have been proposed for mobile and edge devices. Since the introduction of MobileNet [27], many researchers have proposed various architectural designs of convolutional neural networks (CNNs) for mobile devices [46, 49, 68].

Moreover, following the introduction of the vision transformer (ViT) [12], several attempts have been made to adapt ViT for mobile devices [4, 8, 42, 65].

[4] EfficientViT: Enhanced linear attention for high-resolution low-computation visual recognition.
[8] Mobile-Former: Bridging MobileNet and Transformer.
[42] MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer.
[65] Rethinking mobile block for efficient neural models.

The current trend involves developing CNN-ViT hybrid models [20, 21, 35, 50]. Whereas ViTs were previously considered slow and lightweight CNNs were the only viable option for mobile devices, these studies have produced hybrid models for mobile devices that surpass CNNs in the trade-off between computational efficiency and inference accuracy [14, 31, 32, 44].

Previous studies have mainly focused on smartphones as “mobile/edge” devices.
Although processors in smartphones are less powerful than GPUs/TPUs found in servers, they are still quite powerful and would be considered in the mid-range on the spectrum of processors.

There are “low-end” processors such as CPUs/MPUs for embedded systems, which usually have far more limited computational power.
Nonetheless, these processors have been utilized in various real-world applications such as smart agriculture [41, 69] and AI applications for fishery [60] and livestock management [2,30], where limited computational resources are sufficient.

For example, in object detection to prevent damage by wild animals, processing dozens of frames per second may not be necessary [1].
In many cases, processing at around one frame per second is practical.
In fact, lightweight models, such as MobileNet and YOLO, have been quite popular in such applications, often implemented using a camera-equipped single board computer (SBC).

This study focuses on low-end processors, which have been underexplored in the development of lightweight networks.
Given their constraints, we introduce an architectural design named SBCFormer.
A central question guiding our research is whether convolution or the Transformer’s attention mechanism is better suited to SBCs.
As outlined in [14], convolution requires complex memory access patterns, necessitating high IO throughput for efficient processing, whereas attention is comparatively simpler.
Additionally, both are translated to matrix multiplications, and attention usually deals with smaller matrix dimensions compared to the traditional im2col convolution approach.

Considering that SBCs are inferior to GPUs in parallel computation resources and memory bandwidth, attention emerges as the preferred foundational building block for SBCs.
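
As a rough illustration of the matrix-size argument, the sketch below prints the shapes of the matrix products behind an im2col convolution and behind one attention head. The layer sizes (a 3 × 3 convolution on a 28 × 28 map with 128 channels, and a 32-channel head over 49 tokens) are hypothetical values chosen for illustration, not figures from the paper.

```python
def im2col_matmul_shape(h_out, w_out, c_in, c_out, k):
    """Shapes of the two matrices that im2col turns a convolution into."""
    patches = (h_out * w_out, k * k * c_in)   # one row per output position
    weights = (k * k * c_in, c_out)           # one column per output channel
    return patches, weights


def attention_matmul_shapes(num_tokens, head_dim):
    """Shapes of the two matrix products in one attention head."""
    scores = ((num_tokens, head_dim), (head_dim, num_tokens))    # Q @ K^T
    output = ((num_tokens, num_tokens), (num_tokens, head_dim))  # softmax(.) @ V
    return scores, output


# Hypothetical sizes, chosen only to make the comparison concrete.
conv_shapes = im2col_matmul_shape(h_out=28, w_out=28, c_in=128, c_out=128, k=3)
attn_shapes = attention_matmul_shapes(num_tokens=7 * 7, head_dim=32)
print(conv_shapes)  # ((784, 1152), (1152, 128))
print(attn_shapes)
```

With these assumed sizes, the convolution multiplies a 784 × 1152 matrix by a 1152 × 128 matrix, while the attention head only multiplies matrices whose sides are 49 or 32.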

Nonetheless, the computational complexity of attention is quadratic in the number of tokens.
Thus, it is crucial to maintain a low spatial resolution in feature maps to ensure computational efficiency and reduced latency. (Note that a feature map with a spatial resolution of H × W corresponds to HW tokens.)
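
To make the quadratic dependence concrete, the cost of forming the attention score matrix can be written as follows. The symbols N (number of tokens) and d (channels per head) are introduced here for illustration only.

$$
N = HW, \qquad \mathrm{cost}\left(QK^{\top}\right) \propto N^{2} d = (HW)^{2} d
$$

Under this count, halving both H and W reduces N by a factor of 4 and the cost of the score matrix by a factor of 16.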

The plain ViT architecture keeps feature maps at the same resolution across all layers, so keeping that resolution low means the coarse feature maps lose local details of the input image.

In response, recent models aiming for computational efficiency, especially CNN-ViT hybrids [32, 40, 42, 54], adopt a foundation more like CNNs. In these models, feature maps reduce their spatial resolutions via downsampling from input to output.

Given that applying attention to all layers can greatly increase computational costs, especially in layers with high spatial resolutions, these models use attention mechanisms only in the upper layers.

This design takes advantage of the Transformer’s attention mechanism, known for its strength in global interaction of image features, while retaining local details in the feature maps.

However, for SBCs, convolutions in the lower layers might become problematic, causing longer computational times.


To tackle the challenge of preserving local information while optimizing attention computation, our SBCFormer employs a two-stream block structure.

(1) The first stream shrinks the input feature map, applies attention to the reduced number of tokens, and then reverts the map to its initial size, ensuring efficient attention computation.

(2) Recognizing the potential loss of local information from downsizing, the second stream acts as a ‘pass-through’ to retain local information in the input feature map.

These streams converge, generating a feature map enriched with both local and global information, primed for the subsequent layer.

Furthermore, we have refined the Transformer’s attention mechanism to offset any diminished representational capacity from concentrating on smaller feature maps.



Our experiments demonstrate the effectiveness of SBCFormer; see Fig. 1. As a result of the advancements mentioned above, SBCFormer achieves the best accuracy-speed trade-off on a widely used single board computer (SBC), namely a Raspberry Pi 4 Model B with an ARM Cortex-A72 CPU.
In fact, SBCFormer attains an ImageNet-1K top-1 accuracy close to 80.0% at a speed of 1.0 frame per second on the SBC, marking the first time this level of performance has been achieved.

2 Related Work

2.1 Convolutional Networks for Mobile Devices

In recent years, there has been a growing demand for deep neural networks in vision applications across various fields, prompting researchers to focus on efficient neural network design.

One approach involves making convolutions computationally more efficient, as demonstrated by SqueezeNet [28] and related works.
MobileNet [27] introduces depth-wise separable convolutions to alleviate the expensive computational cost of a standard convolutional layer and to meet the resource constraints of edge devices. MobileNetV2 [46] improves the design, introducing the inverted residual block.
Our proposed SBCFormer employs the block as a primary building block for convolutional operation.

Another approach aims at efficient designs of convolutional neural network (CNN) architectures, as demonstrated in works such as Inception [47] and MnasNet [48]. Other studies have proposed lightweight models, including ShuffleNetv1 [68], ESPNetv2 [43], GhostNet [17], MobileNeXt [71], EfficientNet [49], and TinyNet [18], among others.

(1) It is worth noting that CNNs, including those previously mentioned, can only capture local spatial correlations in images at each layer and do not account for global interactions.

(2) Another important point to consider is that convolution with standard-sized images can be computationally expensive for CPUs since it requires large matrix multiplications. These are areas where the Vision Transformer (ViT) [12] may have an advantage.

2.2 ViTs and CNN-ViT Hybrids for Mobile Devices

Thanks to the self-attention mechanism [56] and large-scale image datasets, Vision Transformer (ViT) [12] and related ViT-based architectures [3, 6, 29, 52, 72] have attained state-of-the-art inference accuracy in various visual recognition tasks [16, 67].

Nevertheless, to fully leverage their potential, ViT-based models typically require significant computational and memory resources, which have limited their deployment on edge devices that have resource constraints.


Subsequently, a series of studies have focused on enhancing the efficiency of ViTs from various perspectives.
Inspired by hierarchical designs in convolutional architectures, some works have developed new architectures for ViTs [24, 58, 62, 66].

For example, LeViT [14] reintroduces a convolutional stem at the beginning of the network to learn low-resolution features, rather than using the patchify stem in ViT [12].

EdgeViT [44] introduces Local-Global-Local blocks to better integrate self-attention and convolution, allowing the model to capture spatial tokens with varying ranges and exchange information between them.

MobileFormer [8] parallelizes MobileNet and Transformer to encode both local and global features and fuses the two branches through a bidirectional bridge.
MobileViT [42] treats Transformer blocks as convolutions and develops a MobileViT block to effectively learn both local and global information.
Finally, EfficientFormer [32] employs a hybrid approach that combines convolutional layers and self-attention layers to achieve a balance between accuracy and efficiency.

Despite active research in developing hybrid models for mobile devices, there are still several issues that need to be addressed.
(1) Firstly, many of these studies do not use latency (i.e., inference time) as the primary metric for evaluating efficiency, which will be discussed later.
(2) Secondly, low-end CPUs are often excluded from these studies, with the targets limited to smartphones’ CPUs/NPUs and Intel CPUs at best. LeViT, for example, was evaluated on an ARM CPU, specifically the ARM Graviton 2, which is designed for cloud servers.

2.3 Evaluation Metrics

There are multiple metrics for assessing computational efficiency, including the number of model parameters, operations (i.e., FLOPs), inference time or latency, and memory usage.
While all of these metrics are relevant, latency is of particular interest in the context of this study.
It is noteworthy that Dehghani et al. [10] and Vasu et al. [55] show that efficiency in terms of latency does not correlate well with the number of FLOPs and parameters.

As mentioned earlier, several studies have focused on developing lightweight and efficient CNNs. However, only a handful of them, such as MnasNet [48], MobileNetv3 [26], and ShuffleNetv2 [39], have directly optimized for latency.

The same holds true for studies on CNN-ViT hybrids, where some are primarily designed for mobile devices [8,42]; most of these studies did not prioritize latency as a target but instead focused on metrics like FLOPs [8].

Latency is often avoided in such studies for a good reason: the instruction set of each processor and the compilers used with it heavily influence latency.


Therefore, obtaining practical evaluation results necessitates choosing specific processors at the expense of general discussion. In this paper, we select the CPU of a single board computer, the Raspberry Pi, as our primary target; this board is widely employed in various fields for edge applications. It is equipped with an ARM Cortex-A72 microprocessor, which is part of the ARM Cortex-A series designed for mobile platforms.

3 CNN-ViT Hybrid for Low-end CPUs

We aim to develop a CPU-friendly CNN-ViT hybrid network that achieves a better trade-off between test-time latency and inference accuracy.

3.1 Principle of Design


We adopt the fundamental architecture that is commonly used in recent CNN-ViT hybrids.
The network’s initial stage comprises a set of standard convolution layers, which excel at converting the input image into a feature map [14,32,63], rather than a linear mapping from image patches to tokens in ViT [12, 51].

The main section of the network is divided into multiple stages, and the feature maps are reduced in size between consecutive stages. This results in a pyramid structure of feature maps with dimensions of H/8 × W/8, H/16 × W/16, H/32 × W/32, and so on.

The computational complexity of the Transformer attention mechanism increases quadratically with the number of tokens, i.e., the size h × w of the input feature map.
Consequently, the lower stages, which have larger feature maps, incur a higher computational cost.
Some studies have addressed this issue by applying attention only to sub-regions/tokens of the feature maps [19, 36, 45, 53]. Studies targeting mobile devices typically adopt attention only in the high layers, i.e., those close to the output [32, 40, 54]. While avoiding the increased computational cost, this leads to suboptimal inference accuracy as it gives up on one of the most important properties of ViT, i.e., aggregating global information in images.



Taking these considerations into account, we propose a method of downsizing the input feature map, applying attention to the downsized feature map, and then upsizing the resulting feature map.

In our experiments, we downsized the feature map to 7 × 7 regardless of the stage, for an input image of size 224 × 224. This hourglass design allows us to aggregate global information from the entire image while minimizing computational costs.
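
The saving from the hourglass design can be estimated with a short calculation. The sketch below counts the entries of the token-by-token attention score matrix at each pyramid resolution for a 224 × 224 input and compares it with the 7 × 7 case; counting score-matrix entries is a simplified proxy for cost, and the function name is hypothetical.

```python
def attention_score_entries(height, width):
    """Number of entries in the (tokens x tokens) attention score matrix."""
    tokens = height * width
    return tokens * tokens


input_size = 224
# Stage resolutions H/8, H/16, H/32 of the feature pyramid.
stage_sizes = [input_size // s for s in (8, 16, 32)]   # [28, 14, 7]

for size in stage_sizes:
    full = attention_score_entries(size, size)     # attention at native resolution
    hourglass = attention_score_entries(7, 7)      # attention after downsizing to 7 x 7
    print(f"{size}x{size}: {full} vs {hourglass} entries (ratio {full // hourglass})")
```

Under this proxy, attention at the 28 × 28 stage would need 256 times as many score entries as attention on the 7 × 7 map, and the 14 × 14 stage 16 times as many.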

However, downsizing the feature map to this small size can lead to a loss of local information.

To address this issue, we design a block with two parallel streams: one for local features and the other for global features.
(1) Specifically, we maintain the original feature map size for the local stream and do not perform an attention operation.
(2) For the global stream, we employ the above hourglass design of attention, which first downsizes the feature map, applies attention, and then upsizes it to the original size.
The outputs from the two streams are merged and transferred to the next block.
More details are given in Sec. 3.3. Additionally, to compensate for the loss of representational power due to the hourglass design, we propose a modified attention mechanism. See Sec. 3.4.

3.2 Overall Design of SBCFormer

Figure 2 shows the overall architecture of the proposed SBCFormer.

The network begins with an initial section (labeled as ‘Stem’ in the diagram) that consists of three convolution layers with 3 × 3 kernels and stride = 2, which converts an input image into a feature map.

The main section comprises three stages, each of which is connected to the next stage by a single convolution layer (labeled as ‘Embedding’ in the diagram). This layer uses a stride-two 3 × 3 convolution to halve the size of the input feature map.
Regarding the output section, we employ global average pooling followed by a fully-connected linear layer as the final layer of the network, specifically for image classification tasks.
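
The resolutions seen by each stage follow from the strides described above. The sketch below traces the side length of the feature map through the Stem and the Embedding layers; the padding of 1 is an assumption (the text does not state it), and both function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output side length of a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_map_sizes(input_size=224, stem_layers=3, num_stages=3):
    """Return the side length of the feature map seen by each stage."""
    size = input_size
    for _ in range(stem_layers):        # Stem: three stride-2 3x3 convolutions
        size = conv_out_size(size)
    stage_sizes = [size]
    for _ in range(num_stages - 1):     # one Embedding layer between consecutive stages
        size = conv_out_size(size)
        stage_sizes.append(size)
    return stage_sizes


print(trace_feature_map_sizes())  # [28, 14, 7]
```

For a 224 × 224 input this gives 28, 14, and 7, matching the H/8, H/16, H/32 pyramid.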

3.3 SBCFormer Block

We denote the input feature map to the block at the i-th stage by

$$
X_i \in \mathbb{R}^{(H/2^{i+2}) \times (W/2^{i+2}) \times C_i}.
$$

To start the block, we place $m_i$ consecutive inverted residual blocks, which were first used in MobileNetV2 [46]. We use a variant with a GeLU activation function, which consists of a point-wise convolution, a GeLU activation function, and a depth-wise convolution with a 3 × 3 filter. We call this InvRes in what follows. They convert the input map $X_i$ into $X^{l}_{i}$ as

$$
X^{l}_{i} = \mathcal{F}^{m_i}_{\mathrm{InvRes}}(X_i),
$$

where $\mathcal{F}^{m_i}_{\mathrm{InvRes}}(\cdot)$ indicates the application of $m_i$ consecutive InvRes blocks to the input.

As shown in Fig. 2, the updated feature $X^{l}_{i}$ is transferred to two different branches, the local and global streams.
For the local stream, $X^{l}_{i}$ is passed through to the end section of the block.
For the global stream, $X^{l}_{i}$ is first downsized to h × w by an average pooling layer, denoted as ‘Pool’ in Fig. 2. We set it to 7 × 7 regardless of stages in our experiments.
The downsized map is then passed to a block consisting of two consecutive InvRes blocks, denoted as ‘Mixer’ in the diagram, and next to a stack of attention blocks named ‘MAttn.’
The output feature map is then upsized followed by convolution, which is denoted by ‘ConvT.’
These operations provide a feature map

$$
X^{g}_{i} = \mathrm{ConvT}\left[\mathcal{F}^{L_i}_{\mathrm{MAttn}}\left(\mathrm{Mixer}\left[\mathrm{Pool}(X^{l}_{i})\right]\right)\right],
$$

where $\mathcal{F}^{L_i}_{\mathrm{MAttn}}(\cdot)$ indicates the application of $L_i$ consecutive MAttn blocks to the input.
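
The shape bookkeeping of the global stream can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors’ implementation: nearest-neighbour upsampling stands in for the learned ‘ConvT’ layer, the ‘Mixer’ and ‘MAttn’ blocks are replaced by a caller-supplied placeholder `global_op`, the map is assumed square, and all function names are hypothetical.

```python
import numpy as np


def avg_pool_to(x, out_hw):
    """Average-pool an (H, W, C) map to (out_hw, out_hw, C); H and W must be multiples of out_hw."""
    h, w, c = x.shape
    fh, fw = h // out_hw, w // out_hw
    return x.reshape(out_hw, fh, out_hw, fw, c).mean(axis=(1, 3))


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling; a stand-in for the learned 'ConvT' layer."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def hourglass(x_local, global_op, pooled_hw=7):
    """Downsize, apply a global operation on the small map, then upsize."""
    factor = x_local.shape[0] // pooled_hw
    pooled = avg_pool_to(x_local, pooled_hw)    # 'Pool'
    mixed = global_op(pooled)                   # placeholder for 'Mixer' + 'MAttn'
    return upsample_nearest(mixed, factor)      # stand-in for 'ConvT'


x_l = np.random.default_rng(0).standard_normal((28, 28, 16))
x_g = hourglass(x_l, global_op=lambda t: t)     # identity placeholder
print(x_g.shape)  # (28, 28, 16)
```

Whatever operation runs in the middle only ever sees 49 tokens, while the output has the same spatial size as the local stream and can be fused with it.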

In the last section of the block, the local stream feature $X^{l}_{i}$ and global stream feature $X^{g}_{i}$ are fused to obtain a new feature map, as shown in Fig. 2.

To fuse the two, we first modulate $X^{l}_{i}$ with a weight map created from $X^{g}_{i}$.
Specifically, we compute $W^{g}_{i}$ as

$$
W^{g}_{i} = \mathrm{Sigmoid}\left[\mathrm{Proj}(X^{g}_{i})\right],
$$

where Proj indicates a point-wise convolution followed by batch normalization.

We then multiply it to $X^{l}_{i}$ and concatenate the resulting map with $X^{g}_{i}$ in the channel dimension as

$$
X^{u}_{i} = \left[X^{l}_{i} \odot W^{g}_{i},\; X^{g}_{i}\right],
$$

where $\odot$ is the Hadamard product.

Finally, the fused feature $X^{u}_{i}$ is passed through another projection block to halve the channels, yielding the output of this block.
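
The fusion step can be illustrated with the following NumPy sketch. It assumes a sigmoid gate for the weight map, omits batch normalization, and uses random matrices in place of learned weights; the function and parameter names are hypothetical.

```python
import numpy as np


def pointwise_proj(x, weight):
    """Point-wise (1x1) convolution as a per-pixel matrix product; batch norm is omitted."""
    return x @ weight


def fuse_streams(x_local, x_global, w_gate, w_out):
    """Gate the local stream with the global stream, concatenate, then halve the channels."""
    gate = 1.0 / (1.0 + np.exp(-pointwise_proj(x_global, w_gate)))  # sigmoid weight map (assumed)
    modulated = x_local * gate                                     # Hadamard product
    fused = np.concatenate([modulated, x_global], axis=-1)         # 2C channels
    return pointwise_proj(fused, w_out)                            # back to C channels


rng = np.random.default_rng(0)
c = 16
x_l = rng.standard_normal((28, 28, c))
x_g = rng.standard_normal((28, 28, c))
out = fuse_streams(
    x_l,
    x_g,
    w_gate=rng.standard_normal((c, c)) * 0.1,
    w_out=rng.standard_normal((2 * c, c)) * 0.1,
)
print(out.shape)  # (28, 28, 16)
```

Because the gate lies between 0 and 1, the global stream can only attenuate local features channel by channel; the concatenation then keeps the global features themselves available to the final projection.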

3.4 Modified Attention Mechanism

The above two-stream design will compensate for the loss of local information caused by the proposed hourglass attention computation.
However, as the attention operates on a very low-resolution (or equivalently, small-sized) feature map, the attention computation itself inevitably loses some of its representational power.

To compensate for the loss, we make a few modifications to the Transformer attention mechanism; see ‘MAttn’ in Fig. 2.

The main concept is to utilize the standard computation tuple of a CNN for an input to attention, specifically a 3 × 3 (depth-wise) convolution, a GeLU activation function, and batch normalization.
The input to attention is composed of query, key, and value, and we apply the tuple to value since it forms the basis for the attention’s output.
Our objective is to enhance the representational power by facilitating the aggregation of spatial information in the input feature map, while simultaneously reducing the training difficulty.
To offset the increase in computational cost, we eliminate the independent linear transformations applied to query and key and instead apply the identical point-wise convolution to all three components.


The details of the modified attention computation are as follows.

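Let $X \in \mathbb{R}^{h \times w \times C_i}$ denote the input to the attention mechanism. The output $X'' \in \mathbb{R}^{h \times w \times C_i}$ is computed as

$$
X'' = \mathrm{FFN}(X') + X',
$$

where FFN stands for a feed-forward network as in ViTs [12, 51] and $X'$ is defined to be

$$
X' = \mathrm{Linear}\left(\mathrm{MHSA}\left(\text{PW-Conv}(X)\right)\right) + X,
$$

where Linear is a linear layer with learnable weights and PW-Conv indicates a point-wise convolution; MHSA is defined by

$$
\mathrm{MHSA}(Y) = \mathrm{Softmax}\left(\frac{Y Y^{\top}}{\sqrt{d}} + b\,\mathbf{1}^{\top}\right) Y',
$$

where d is the channel number of each head in query and key; $b \in \mathbb{R}^{hw}$ is a learnable bias acting as positional encoding [14, 32]; $\mathbf{1} \in \mathbb{R}^{hw}$ is an all-one vector; $Y'$ is defined by

$$
Y' = \text{DW-Conv}_{G}\left(\mathrm{BN}(Y)\right) + Y,
$$

where $\text{DW-Conv}_{G}$ indicates a depth-wise convolution followed by GeLU and BN is batch normalization applied in the same way as that in CNNs.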
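
The following NumPy sketch illustrates the idea of the modified attention for a single head. It is a simplified reading under several assumptions, not the authors’ implementation: batch normalization is treated as identity, the value branch adds a residual connection, the positional bias is broadcast so that it shifts the score of each key position, the FFN is omitted, and random matrices stand in for learned weights. All function and parameter names are hypothetical.

```python
import math
import numpy as np


def gelu(x):
    """Exact GeLU using the Gaussian error function."""
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def depthwise_conv3x3(x, kernels):
    """3x3 depth-wise convolution with zero padding; x is (h, w, c), kernels is (3, 3, c)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out


def modified_attention(x, w_pw, dw_kernels, bias, w_linear):
    """Single-head sketch: shared projection for Q/K/V, conv-enhanced value, residual output."""
    h, w, c = x.shape
    y_map = x @ w_pw                                   # one PW-Conv shared by query, key, value
    # Batch normalization is treated as identity here (assumption for brevity).
    v_map = gelu(depthwise_conv3x3(y_map, dw_kernels)) + y_map   # value branch (residual assumed)
    y = y_map.reshape(h * w, c)                        # tokens used as both query and key
    v = v_map.reshape(h * w, c)
    scores = y @ y.T / math.sqrt(c) + bias[None, :]    # bias per key position (assumed broadcast)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)            # row-wise softmax
    out = (attn @ v) @ w_linear                        # 'Linear'
    return out.reshape(h, w, c) + x                    # residual connection


rng = np.random.default_rng(0)
h = w = 7
c = 8
x = rng.standard_normal((h, w, c))
out = modified_attention(
    x,
    w_pw=rng.standard_normal((c, c)) * 0.1,
    dw_kernels=rng.standard_normal((3, 3, c)) * 0.1,
    bias=np.zeros(h * w),
    w_linear=rng.standard_normal((c, c)) * 0.1,
)
print(out.shape)  # (7, 7, 8)
```

Only the value path receives the extra depth-wise convolution, so spatial aggregation is added where it directly shapes the output, while sharing one projection across query, key, and value removes two of the usual three linear maps.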

4 Experimental Results

We conduct experiments to evaluate SBCFormer and compare it with existing networks on two tasks, image classification using ImageNet-1K [11] and object detection using the COCO dataset [34].

4.1 Experimental Settings

SBCFormer primarily targets low-end CPUs that are commonly used in single board computers.


Additionally, we evaluate its performance on an Intel CPU commonly found in edge devices, as well as on a GPU used in desktop PCs.
We use the following three processors and platforms for our experiments.
(1) An ARM Cortex-A72 processor running at 1.5 GHz on a single board computer, Raspberry Pi 4 Model B. While it is classified as low-end, ARM Cortex-A72 is a quad-core 64-bit processor that supports the ARM Neon instruction set. We used the 32-bit Raspberry Pi OS and PyTorch ver 1.6.0 to run networks.
(2) An Intel Core i7-3520M processor running at 2.9 GHz. It is a dual-core processor that is commonly used in mobile devices such as laptops and tablets. It supports a variety of instruction sets including Intel Advanced Vector Extensions (AVX) and AVX2, which provide improved performance for vector and matrix operations. We used Ubuntu ver 18.04.5 LTS and PyTorch ver 1.10.1 to run networks.
(3) A GeForce RTX 2080Ti on a desktop PC equipped with an Intel Xeon CPU E5-1650 v3. We used Ubuntu 18.04.6 LTS and PyTorch ver 1.10.1.

We implemented and tested all the networks using the PyTorch framework (version 1.10) and the Timm library [61]. For each of the existing networks we compare, we employ the authors’ official code, except for a few networks. We follow previous studies [32, 44] to measure the inference time (i.e., latency) required to process a single input image. Specifically, setting the batch size to 1, we recorded the clock time on each platform. To ensure accuracy, we performed 300 inferences and reported the average latency in seconds. During measurement, we terminated any irrelevant applications that could interfere with the results. All computations used 32-bit floating point numbers. Since our focus is on inference speed rather than training, we trained all networks on a GPU server with eight Nvidia 2080Ti GPUs, and then evaluated their inference time on each platform.
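
A measurement loop of this kind could look like the sketch below, which averages the wall-clock time of 300 calls. The warm-up runs are an added assumption not mentioned in the text, and `dummy_model` is a hypothetical stand-in for a single forward pass on one image.

```python
import time


def measure_latency(run_inference, num_runs=300, warmup_runs=10):
    """Average wall-clock seconds per call of `run_inference` (batch size 1 assumed)."""
    for _ in range(warmup_runs):        # warm-up is an added assumption, not from the text
        run_inference()
    start = time.perf_counter()
    for _ in range(num_runs):
        run_inference()
    return (time.perf_counter() - start) / num_runs


def dummy_model():
    """Hypothetical stand-in for a single forward pass on one image."""
    return sum(i * i for i in range(10_000))


latency = measure_latency(dummy_model)
print(f"average latency: {latency:.6f} s")
```

In practice `run_inference` would wrap the network call on a fixed input, so that data loading and preprocessing are excluded from the timed region.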

4.2 ImageNet-1K

We first evaluate the networks on the most standard task, image classification of ImageNet-1K.

4.2.1 Training

We train SBCFormer and existing networks from scratch for 300 epochs on the training split of the ImageNet-1K dataset, which consists of 1.28 million images across 1,000 classes. We consider four variants with different model sizes, SBCFormer-XS, -S, -B, and -L, as shown in Table 1. All models are trained and tested at the standard resolution of 224 × 224.

We followed the original authors’ code to train the existing networks. For training SBCFormer, we used the recipe from DeiT [51], which is summarized as follows. We employed the AdamW [38] optimizer with cosine learning rate scheduling [37], and applied a linear warm-up for the first five epochs. The initial learning rate was set to 2.5 × 10⁻⁴, and the minimum value was set to 10⁻⁵. The weight decay and momentum were set to 5 × 10⁻² and 0.9, respectively, and a batch size of 200 was used. Data augmentation techniques, including random cropping, random horizontal flipping, mixup, random erasing, and label-smoothing, were applied during training, following [44, 51]. Random cropping was applied to the input image during training to obtain an image size of 224 × 224 pixels, while a single center crop of the same size was used during testing.
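
The learning rate schedule described above can be sketched as a plain function of the epoch. The warm-up starting value and the choice to run the cosine decay over the epochs remaining after warm-up are assumptions; actual training libraries may implement these details differently.

```python
import math


def learning_rate(epoch, total_epochs=300, warmup_epochs=5,
                  base_lr=2.5e-4, min_lr=1e-5, warmup_start_lr=1e-6):
    """Learning rate at a given epoch: linear warm-up, then cosine decay to min_lr."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start_lr (assumed value) up to base_lr.
        return warmup_start_lr + (base_lr - warmup_start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 5, 150, 300):
    print(epoch, f"{learning_rate(epoch):.2e}")
```

The rate reaches its peak of 2.5 × 10⁻⁴ when warm-up ends at epoch 5 and decays to the minimum of 10⁻⁵ at epoch 300.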

4.2.2 Results

It is observed that SBCFormer variants with different model sizes achieve a better trade-off between accuracy and latency on CPUs; see also Fig. 1. The performance gap between SBCFormer and the other models is more pronounced on ARM CPUs than on Intel CPUs. Notably, SBCFormer only achieves mediocre or inferior trade-offs on the GPU. These results are consistent with our design goal, as SBCFormer is optimized for running faster on CPUs with limited computational resources.


5 Conclusions

We have proposed a new deep network design, called SBCFormer, that achieves a favorable balance between inference accuracy and computational speed when used with low-end CPUs, commonly found in single-board computers (SBCs). These CPUs are not efficient at performing large matrix multiplications, making the Transformer’s attention mechanism more attractive than CNNs. However, attention is computationally expensive when applied to large feature maps. SBCFormer mitigates this cost by first reducing the input feature map size, applying attention to the smaller map, and then restoring it to its original size. However, this approach has side effects, such as the loss of local image information and the limited representation ability of small-size attention. To address these issues, we introduced two novel designs. First, we add a parallel stream to the attention computation, which passes through the input feature map, allowing it to retain local image information. Second, we enhance the attention mechanism by incorporating standard CNN components. Our experiments have shown that SBCFormer achieves a good trade-off between accuracy and speed on a popular SBC, the Raspberry Pi 4 Model B with an ARM Cortex-A72 CPU.

