YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection

Yongqiang Cheng

YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection

Alexander Wong, Mahmoud Famuori, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl, Jonathan Chung

Waterloo Artificial Intelligence Institute, University of Waterloo, Waterloo, ON, Canada
DarwinAI Corp., Waterloo, ON, Canada

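nano [ˈnænəʊ]: n. nano-, one billionth
Ontario, ON: the Canadian province of Ontario
Waterloo [ˌwɔːtəˈluː]: n. Waterloo (place name)
University of Waterloo, Waterloo, UW or UWaterloo: a university in Waterloo, Ontario
Waterloo Artificial Intelligence Institute, Waterloo.ai
Darwin [ˈdɑːwɪn]: n. Darwin (surname)
Computer Science, CS: the field of computer science
Computer Vision, CV: the field of computer vision
preprint ['priːprɪnt]: v. to print in advance n. a preprint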

arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

Abstract

Object detection remains an active area of research in the field of computer vision, and considerable advances and successes have been achieved in this area through the design of deep convolutional neural networks for tackling object detection. Despite these successes, one of the biggest challenges to widespread deployment of such object detection networks on edge and mobile scenarios is the high computational and memory requirements. As such, there has been growing research interest in the design of efficient deep neural network architectures catered for edge and mobile usage. In this study, we introduce YOLO Nano, a highly compact deep convolutional neural network for the task of object detection. A human-machine collaborative design strategy is leveraged to create YOLO Nano, where principled network design prototyping, based on design principles from the YOLO family of single-shot object detection network architectures, is coupled with machine-driven design exploration to create a compact network with highly customized module-level macroarchitecture and microarchitecture designs tailored for the task of embedded object detection. The proposed YOLO Nano possesses a model size of ∼4.0MB (>15.1 $\times$ and >8.3 $\times$ smaller than Tiny YOLOv2 and Tiny YOLOv3, respectively) and requires 4.57B operations for inference (>34% and ∼17% lower than Tiny YOLOv2 and Tiny YOLOv3, respectively) while still achieving an mAP of ∼69.1% on the VOC 2007 dataset (∼12% and ∼10.7% higher than Tiny YOLOv2 and Tiny YOLOv3, respectively). Experiments on inference speed and power efficiency on a Jetson AGX Xavier embedded module at different power budgets further demonstrate the efficacy of YOLO Nano for embedded scenarios.

Performance is improved by having humans and machines design the model architecture together.

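tackle [ˈtækl]: v. to deal with or handle (a difficult problem or situation); to confront someone about an issue; (football, hockey) to try to take the ball from an opponent; (rugby, American football) to bring an opponent to the ground; to seize or take on (especially a criminal) n. (football) a tackle; (rugby, American football) a tackle that brings a player down; (American football) a tackle (lineman); sports equipment (especially fishing gear); (slang) male genitals
considerable [kənˈsɪdərəbl]: adj. fairly large, important, worth considering
scenario [səˈnɑːriəʊ]: n. plan, plot, script, envisaged situation
cater [ˈkeɪtə(r)]: vt. to cater to, to satisfy a need, to provide food and service
collaborative [kəˈlæbəreɪtɪv]: adj. cooperative, collaborative
leverage [ˈliːvərɪdʒ]: n. means, influence, lever action, leverage ratio v. to make use of, to operate with borrowed capital
prototype [ˈprəʊtətaɪp]: n. prototype, standard, model
budget [ˈbʌdʒɪt]: n. budget, budgeted funds vt. to arrange, to schedule, to include in a budget vi. to draw up a budget adj. inexpensive
possess [pəˈzes]: vt. to control, to master, to hold, to captivate, to own, to have
principle [ˈprɪnsəpl]: n. principle, rule, doctrine, moral integrity, essence, original meaning, root, source
couple [ˈkʌpl]: n. pair, married couple, a few vi. to join, to marry vt. to join, to connect, to link
exploration [ˌekspləˈreɪʃn]: n. probing, investigation, survey
microarchitecture: n. microarchitecture
macro [ˈmækrəʊ]: n. (computing) macro instruction, macro; macro lens adj. large-scale, macroscopic, relating to macro photography, huge, in great quantity
tailor [ˈteɪlə(r)]: n. tailor v. to make specially, to make to order, to adjust, to cater to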

1 Introduction

An active area in the field of computer vision is object detection, where the goal is to not only localize objects of interest within a scene, but also assign a class label to each of these objects of interest. Considerable recent success in the area of object detection stems from modern advances in deep learning [8, 7], particularly leveraging deep convolutional neural networks. Much of the initial focus was on improving accuracy, leading to increasingly complex object detection networks such as SSD [11], R-CNN [2], Mask R-CNN [3], and other extended variants of these networks [6, 9, 18]. While such networks demonstrated state-of-the-art object detection performance, they were very challenging, if not impossible, to deploy on edge and mobile devices due to computational and memory constraints. In fact, even faster variants such as Faster R-CNN [15] have inference speeds at low single-digit frame rates when running on embedded processors. This greatly limits the widespread adoption of such networks for a wide range of applications such as unmanned aerial vehicles, video surveillance, and autonomous driving, where local embedded processing is required.

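Two approaches to object detection are currently in common use: two-stage detection and single-stage detection. A two-stage detector first has a network localize candidate objects (for example, by drawing bounding boxes around them) and then classifies each localized object. A single-stage detector uses one network to detect objects directly. The two-stage approach is easier to implement, but its downstream classification depends on how well the upstream localization performs. The single-stage approach needs no separate localization step, but end-to-end detection is harder to get right.

In general, two-stage methods are accurate but slow, while single-stage detectors are fast but do not reach the highest accuracy. As keypoint-based methods become more popular, however, single-stage detection is becoming both fast and accurate.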

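stem [stem]: n. trunk, stalk, bow (of a ship), lineage vt. to stop, to remove the stem from, to fit a handle to vi. to stop, to originate from something, to go against
autonomous [ɔːˈtɒnəməs]: adj. self-governing, independent, spontaneous
single-digit: adj. having one digit
unman [ʌn'mæn]: vt. to deprive of manly qualities, to make timid, to castrate (the text uses the adjective unmanned: operating without a crew)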

To address this challenge of achieving embedded object detection, there has been a growing interest in the exploration and design of highly efficient deep neural network architectures for object detection that are more well-suited for edge and mobile devices [12, 13, 14, 23, 4, 17]. A particularly interesting family of object detection networks designed around efficiency is the YOLO family of neural network architectures [12, 13, 14], which leverage a number of design principles to create single-shot architectures which can achieve embedded object detection performance on high-end desktop GPUs. However, these network architectures remain too large for many edge and mobile scenarios (e.g., ∼240MB in the case of the YOLOv3 architecture), and their inference speeds drop considerably when running on edge and mobile processors due to computational complexity (e.g., >65B operations in the case of YOLOv3). To address this issue, Redmon et al. introduced the Tiny YOLO family of network architectures, which has greatly reduced model sizes at a cost of object detection performance.

In this study, we are motivated to explore a human-machine collaborative design strategy for designing highly compact deep convolutional neural networks for the task of object detection, where principled network design prototyping is coupled with machine-driven design exploration. More specifically, we leverage the design principles from the YOLO family of single-shot object detection network architectures within this human-machine collaborative design strategy to create YOLO Nano, a highly compact network with highly customized module-level macroarchitecture and microarchitecture designs tailored for the task of embedded object detection.

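YOLO Nano is built with a human-machine collaborative design strategy. First, a main network prototype is designed, based on the single-stage object detection architectures of the YOLO family. The prototype is then combined with a machine-driven design exploration strategy to create a compact network. The resulting network is highly customized, with module-level macroarchitecture and microarchitecture designs, and is intended for embedded object detection.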

specifically [spəˈsɪfɪkli]: adv. particularly, explicitly

2 Methods

In this study, we introduce YOLO Nano, a highly compact deep convolutional neural network for embedded object detection designed using a human-machine collaborative design strategy [21]. The human-machine collaborative design strategy for designing YOLO Nano comprises two main design stages: i) principled network design prototyping, and ii) machine-driven design exploration.

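The architecture design of YOLO Nano goes through two stages: first a prototype network is designed to establish the main architecture, and then a machine-driven method is used for design exploration.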

2.1 Principled network design prototyping

The first design stage in creating YOLO Nano is a principled network design prototyping stage, where we create an initial network design prototype (denoted as $\varphi$ ), based on human-driven design principles to guide the machine-driven design exploration stage. More specifically, we construct an initial network design prototype based on the design principles of the YOLO family of single-shot architecture [12, 13, 14]. A standout characteristic of the YOLO family of network architectures is that, unlike region proposal-based networks which rely on the construction of a regional proposal network to generate proposals for where objects lie in the scene followed by classification on the generated proposals, they instead leverage a single network architecture to process the input image and generate the output results. As such, all object detection predictions for a single image are made in a single forward pass, compared to hundreds to thousands of passes that need to be performed to get the final results for region proposal-based networks. This makes the YOLO family of network architectures significantly faster to run, and thus better suited for embedded object detection.

The initial design prototype used in this study draws inspiration from the YOLO family of network architectures and is comprised of a stack of feature representation modules, with shortcut connections between the modules as with [14]. Also, as with [14], the feature representation modules are configured in a way, similar to feature pyramid networks [10], such that they are capable of representing features at three different scales. These feature representation modules are followed by several convolutional layers, with the output being a three-dimensional tensor that encodes bounding box, objectness, and class predictions for three different scales. As a result, this initial design prototype allows for efficient multi-scale object detection.
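
To make the shape of that output concrete, here is a minimal sketch of the per-scale tensor dimensions. The 416 input size and the 20 classes come from the experiments reported later; the strides of 32, 16, and 8 and the three anchors per scale are assumptions borrowed from the usual YOLOv3 convention and are not stated in this text. The function name `yolo_output_shapes` is hypothetical.

```python
def yolo_output_shapes(input_size=416, num_classes=20,
                       strides=(32, 16, 8), anchors_per_scale=3):
    """Return (grid_h, grid_w, channels) for each detection scale.

    strides and anchors_per_scale are assumed YOLOv3-style defaults.
    """
    # Each anchor predicts 4 box offsets, 1 objectness score, and the class scores.
    channels = anchors_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


for shape in yolo_output_shapes():
    print(shape)
```

Under these assumptions each scale yields one three-dimensional tensor whose last dimension packs the bounding box, objectness, and class predictions together.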

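The first step is to design the main network prototype: the authors create an initial architecture (denoted as $\varphi$ ) that guides the machine's subsequent design exploration.

Unlike region proposal-based networks, the YOLO architecture does not need to build a region proposal network (RPN), which generates a set of candidate bounding boxes locating objects that are then classified.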

inspiration [ˌɪnspəˈreɪʃn]: n. inspiration, encouragement, inhalation, clever idea
comprise [kəmˈpraɪz]: vt. to include, to consist of

The actual macroarchitecture and microarchitecture designs of the individual modules and layers in the final YOLO Nano network architecture, as well as the number of network modules, are left for the machine-driven design exploration stage to determine automatically given data as well as human-specified design requirements and constraints designed specifically around edge and mobile scenarios with limited computational and memory capabilities.

individual [ˌɪndɪˈvɪdʒʊəl]: adj. personal, separate, distinctive n. person, individual

2.2 Machine-driven design exploration

Using the initial network design prototype ( $\varphi$ ), data, as well as human-specified design requirements catered to edge and mobile usage as a guide, a machine-driven design exploration stage is then leveraged to determine the module-level macroarchitecture and microarchitecture designs for the proposed YOLO Nano network architecture. More specifically, machine-driven design exploration is achieved in this study in the form of generative synthesis [22], which is capable of determining the optimal macroarchitecture and microarchitecture designs of the final network architecture within the human-specified requirements and constraints. The overall goal of generative synthesis is to learn generative machines that can generate deep neural networks that meet design requirements and constraints, and can be described as follows. This is formulated within the concept of generative synthesis as a constrained optimization problem for determining a generator $\mathcal{G}$ that, given a set of seeds $S$ , can generate networks $\{N_{s} | s \in S \}$ maximizing a universal performance function $\mathcal{U}$ (e.g., [20]) while satisfying requirements and constraints defined via an indicator function $\text{l}_{r}(\cdot)$ :

$$\mathcal{G} = \max\limits_{\mathcal{G}} \ \mathcal{U}(\mathcal{G}(s)) \ \ \text{subject to} \ \ \text{l}_{r}(\mathcal{G}(s)) = 1, \ \forall s \in S. \tag{1}$$

synthesis [ˈsɪnθəsɪs]: n. synthesis, combination, composite

Since it is computationally intractable to solve for the globally optimal solution in the constrained optimization problem posed in Eq. 1 given the enormity of the feasible region, we instead solve for an approximate solution $\hat{\mathcal{G}}$ via iterative optimization, where the initial solution $\hat{\mathcal{G}}_{0}$ is guided by $\varphi$ , $\mathcal{U}$ , and $\text{l}_{r}(\cdot)$ , and progressively updated such that each successive approximate solution $\hat{\mathcal{G}}_{k}$ achieves a higher $\mathcal{U}$ than previous approximate solutions (i.e., $\hat{\mathcal{G}}_{1}$ , …, $\hat{\mathcal{G}}_{k-1}$ ) while still constrained by $\text{l}_{r}(\cdot)$ . The final approximate solution $\hat{\mathcal{G}}$ is then used to create the proposed YOLO Nano network.

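The machine takes the initial prototype network, the data, and the human-specified design requirements as guidance, and the machine-driven design exploration then determines the module-level macroarchitecture and microarchitecture. Given a set of seeds $S$ , networks $\{N_{s} | s \in S \}$ are generated to maximize the universal performance function $\mathcal{U}$ . The maximization must satisfy the indicator function $\text{l}_{r}(\cdot)$ , which is defined to express the human-specified requirements and constraints.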
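
The iterative scheme can be illustrated with a deliberately tiny sketch. It is not the generative synthesis algorithm of [22]: instead of learning a generator, it hill-climbs over a single hypothetical `width` parameter, and `make_network`, `performance`, and `indicator` are invented stand-ins for a candidate network, $\mathcal{U}$ , and $\text{l}_{r}(\cdot)$ . What it does share with the description above is the acceptance rule: each new solution must satisfy the constraints and strictly improve $\mathcal{U}$ .

```python
def make_network(width):
    """Hypothetical candidate: accuracy (%) and cost (B ops) as functions of width."""
    return {"width": width,
            "accuracy": 72.0 - 56.0 / width,
            "cost": 0.25 * width}


def performance(net):
    """Toy stand-in for U: reward accuracy, penalise computational cost."""
    return net["accuracy"] - net["cost"]


def indicator(net):
    """Toy stand-in for the indicator function: 1 if every requirement holds, else 0."""
    return int(net["accuracy"] >= 65.0 and net["cost"] <= 5.0)


def iterative_search(initial_width):
    """Accept a neighbour only if it is feasible and strictly improves U."""
    current = make_network(initial_width)
    assert indicator(current) == 1, "the starting prototype must be feasible"
    history = [current]
    while True:
        widths = (current["width"] - 1, current["width"] + 1)
        neighbours = [make_network(w) for w in widths if w >= 1]
        feasible = [n for n in neighbours if indicator(n) == 1]
        best = max(feasible, key=performance, default=None)
        if best is None or performance(best) <= performance(current):
            return history
        current = best
        history.append(current)


history = iterative_search(initial_width=8)
print([net["width"] for net in history])
```

Every entry in `history` is feasible and scores higher than the one before it, which mirrors the requirement that each successive approximate solution achieves a higher $\mathcal{U}$ while remaining constrained.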

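intractable [ɪnˈtræktəbl]: adj. hard to deal with, hard to cure, stubborn, disobedient
enormity [ɪˈnɔːməti]: n. vastness, atrocity, great wickedness
iterative [ˈɪtərətɪv]: adj. iterative, repeated, recurring n. (grammar) frequentative
progressively [prəˈgresɪvli]: adv. gradually, increasingly
successive [səkˈsesɪv]: adj. consecutive, succeeding, following in order, taking over in turn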

To guide the generative synthesis process towards learning generative machines that generate object detection networks for edge and mobile scenarios that are not only highly efficient and compact but also provide strong object detection performance, one of the key steps is to configure the indicator function $\text{l}_{r}(\cdot)$ to enforce the appropriate design requirements and constraints. In this study, the indicator function $\text{l}_{r}(\cdot)$ was set up such that: i) mean average precision (mAP) $\geq$ 65% on VOC 2007, ii) computational cost $\leq$ 5B operations, and iii) 8-bit weight precision. The computational cost constraint is set such that the computational cost of the resulting YOLO Nano network is below that of Tiny YOLOv3 [14], one of the most popular compact networks for embedded object detection.
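
The three requirements translate directly into a predicate. The sketch below is an illustration only: the `Candidate` class and its field names are hypothetical, and reading "8-bit weight precision" as an equality check is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    map_voc2007: float   # mean average precision on VOC 2007, in percent
    operations_b: float  # computational cost, in billions of operations
    weight_bits: int     # weight precision, in bits


def indicator(c: Candidate) -> int:
    """Return 1 if the candidate meets all three stated requirements, else 0."""
    return int(c.map_voc2007 >= 65.0
               and c.operations_b <= 5.0
               and c.weight_bits == 8)  # assumption: exactly 8-bit weights


# The figures reported for YOLO Nano in this text satisfy the constraints.
yolo_nano = Candidate(map_voc2007=69.1, operations_b=4.57, weight_bits=8)
print(indicator(yolo_nano))  # prints 1
```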

Figure 1: YOLO Nano network architecture. Note that PEP( $x$ ) indicates $x$ channels in the first projection layer of a residual PEP module, and FCA( $x$ ) indicates reduction ratio of $x$

reduction [rɪˈdʌkʃn]: n. decrease, fall, shrinkage, (chemistry) reduction reaction
enforce [ɪnˈfɔːs]: vt. to implement, to carry out, to compel, to force

3 YOLO Nano Architectural Design

The network architecture of the proposed YOLO Nano network for embedded object detection is shown in Figure 1, with several interesting observations worth discussing below.

3.1 Residual Projection-Expansion-Projection Macroarchitecture

The first notable observation about the YOLO Nano network architecture that differs significantly from the YOLO family of networks is that it is comprised of modules with unique residual projection-expansion-projection (PEP) macroarchitectures, in addition to expansion-projection (EP) macroarchitectures like those found in [16, 19, 1]. The residual PEP macroarchitecture consists of: i) a projection layer with 1 $\times$ 1 convolutions that projects output channels into an output tensor with lower dimensionality, ii) an expansion layer with 1 $\times$ 1 convolutions that expands the number of channels to a higher dimensionality, iii) a depth-wise convolution layer that performs spatial convolutions with a different filter on each of the individual output channels from the expansion layer, and iv) a projection layer with 1 $\times$ 1 convolutions that projects output channels into an output tensor with lower dimensionality. The use of residual PEP macroarchitectures enables significant reductions in the architectural and computational complexity while preserving model expressiveness.

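The residual PEP macroarchitecture consists of four main parts:
a projection layer with 1 $\times$ 1 convolutions, which maps the input feature map to a lower-dimensional tensor;
an expansion layer with 1 $\times$ 1 convolutions, which expands the channels of the feature map back to a somewhat higher dimensionality;
a depth-wise convolution layer, which performs spatial convolutions with a different filter on each output channel of the expansion layer;
a projection layer with 1 $\times$ 1 convolutions, which maps the output channels of the previous layer to a lower dimensionality.
Using the residual PEP macroarchitecture significantly reduces architectural and computational complexity while preserving the model's representational power.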
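
A rough parameter count shows where the savings could come from. This is a sketch under stated assumptions: the channel counts, the expansion factor of 2, and the 3 $\times$ 3 depth-wise kernel are illustrative choices that do not come from the paper, biases and normalization layers are ignored, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_params(channels, k):
    """Weights in a depth-wise k x k convolution: one filter per channel."""
    return k * k * channels


def pep_params(c_in, proj, c_out, expansion=2, k=3):
    """Weights in a PEP module; expansion and k are assumed values."""
    expanded = proj * expansion
    return (conv_params(c_in, proj, 1)          # i) 1x1 projection
            + conv_params(proj, expanded, 1)    # ii) 1x1 expansion
            + depthwise_params(expanded, k)     # iii) depth-wise convolution
            + conv_params(expanded, c_out, 1))  # iv) 1x1 projection


# Hypothetical example: 128 channels in and out, projected down to 32.
print(pep_params(128, proj=32, c_out=128))  # 14912
print(conv_params(128, 128, 3))             # 147456
```

With these example numbers the PEP module needs roughly a tenth of the weights of a plain 3 $\times$ 3 convolution over the same channels; the actual ratio depends on the per-module settings that the design exploration selects.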

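projection [prəˈdʒekʃn]: n. projection, planning, protrusion, launch, estimate
expansion [ɪkˈspænʃn]: n. swelling, elaboration, something expanded
residual [rɪˈzɪdjuəl]: adj. (of a quantity) remaining; (of a physical state) persisting after its cause has gone; (of experimental error) left over, residual; (of soil) residual n. remainder, residue, residual, residual error; repeat fee (paid to a performer); (geology) residual hill, monadnock; resale value (of a new car after a period of ownership)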

3.2 Fully-connected Attention Macroarchitecture

The second notable observation about the YOLO Nano network architecture is the strategic introduction of light-weight fully-connected attention (FCA) within the network by the machine-driven design exploration process, which is in contrast to fixed module-level introduction in other design exploration methods [19]. As with [5], the FCA macroarchitecture consists of two fully-connected layers that learn the dynamic, non-linear inter-dependencies between channels and produce modulation weights to re-weight the channels via channel-wise multiplication. The use of FCA facilitates dynamic feature recalibration based on global information to pay more attention to informative features, thus enabling better utilization of available network capacity. This in turn allows for a strong balance between reduced architectural and computational complexity and model expressiveness.

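A light-weight fully-connected attention (FCA) module is introduced into the network. The FCA macroarchitecture consists of two fully-connected layers that learn the dynamic, non-linear inter-dependencies between channels and re-weight the importance of each channel via channel-wise multiplication.

Because FCA dynamically recalibrates features based on global information, it helps the network attend to more informative features. This makes better use of the network's capacity, that is, it expresses as much important information as possible within a limited parameter budget. The module therefore offers a better trade-off between pruning the architecture, reducing model complexity, and increasing representational power.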
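
The mechanism can be sketched in plain Python. The text specifies two fully-connected layers, global information, and channel-wise multiplication; the global average pooling, the ReLU after the first layer, and the sigmoid after the second are assumptions following the squeeze-and-excitation style of channel attention. The function name `fca` and all weight values are hypothetical.

```python
import math


def fca(feature_map, w1, w2):
    """feature_map: list of C channels, each a 2-D list of floats.
    w1: (C / r) rows of C weights; w2: C rows of (C / r) weights,
    where r is the reduction ratio.
    """
    # Global information: one average value per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # First fully-connected layer with ReLU (assumed non-linearity).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Second fully-connected layer with sigmoid gives modulation weights in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Channel-wise multiplication re-weights every value in each channel.
    recalibrated = [[[v * a for v in row] for row in ch]
                    for ch, a in zip(feature_map, weights)]
    return recalibrated, weights


features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[0.5, 0.5]]     # 2 channels -> 1 hidden unit (reduction ratio 2)
w2 = [[1.0], [-1.0]]  # 1 hidden unit -> 2 channels
recalibrated, weights = fca(features, w1, w2)
print(weights)
```

With these toy weights the first channel is emphasised and the second is suppressed, which is the recalibration effect described above.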

strategic [strəˈtiːdʒɪk]: adj. strategic
attention [əˈtenʃn]: n. attention, care, Attention! (military command)
facilitate [fəˈsɪlɪteɪt]: vt. to promote, to help, to make easier
recalibration [ri'kæli'breʃən]: n. recalibration
modulation [ˌmɒdjəˈleɪʃn]: n. modulation, adjustment

3.3 Macroarchitecture and Microarchitecture Heterogeneity

The third notable observation about the YOLO Nano network architecture is that there is high heterogeneity in terms of not only macroarchitectures (a diverse mix of PEP modules, EP modules, FCA, as well as individual 3 $\times$ 3 and 1 $\times$ 1 convolution layers), but also in terms of the microarchitectures of the individual feature representation modules and layers, with each module or layer in the network having unique microarchitectures. The benefit of having high microarchitecture heterogeneity in the YOLO Nano network architecture is that it enables each component of the network architecture to be uniquely tailored to achieve a very strong balance between architectural and computational complexity and model expressiveness. This architectural diversity in YOLO Nano also demonstrates the advantage of leveraging a machine-driven design exploration strategy as flexible as generative synthesis as it would be impossible for a human designer, or other design exploration methods such as [19, 1] to customize a network architecture to this level of architectural granularity.

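The advantage of high heterogeneity in the YOLO Nano architecture is that every module of the network can be specifically designed, achieving a better trade-off between architectural complexity, computational complexity, and representational power. This architectural diversity also shows the advantage of a machine-driven design exploration strategy as flexible as generative synthesis, because a human designer or other design exploration methods cannot customize an architecture at such a fine level of granularity.

For EP modules, see the inverted residuals and related coefficients of MobileNetV2 [16].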

heterogeneity [ˌhetərədʒəˈniːəti]: n. heterogeneity, non-uniformity, multiphase nature
diverse [daɪˈvɜːs]: adj. different, dissimilar, varied, of many kinds
granularity [grænjʊ'lærɪtɪ]: n. spacing size, granularity

4 Experimental Results and Discussion

To study the efficacy of YOLO Nano for embedded object detection, we examine its model size, object detection accuracy, and computational cost on the PASCAL VOC datasets. For comparison purposes, the Tiny YOLOv2 network [13] and the Tiny YOLOv3 network [14] were used as baseline references, given that they are amongst the most popular compact deep neural networks for embedded object detection owing to their small model sizes and low computational complexities. The VOC2007/2012 datasets consist of natural images that have been annotated with 20 different types of objects. The deep neural networks were trained using the VOC2007/2012 training datasets, and the mean average precision (mAP) was computed on the VOC2007 test dataset to evaluate the object detection accuracy of the deep neural networks, as is standard practice in research literature.

Table 1 shows the model sizes and the object detection accuracies of the proposed YOLO Nano network as well as Tiny YOLOv2 and Tiny YOLOv3. First, it was observed that the model size of YOLO Nano was 4.0MB, which is > 15.1 $\times$ and > 8.3 $\times$ smaller than Tiny YOLOv2 and Tiny YOLOv3, respectively, which is very important for edge and mobile scenarios given the memory constraints. Second, YOLO Nano, despite being much smaller in model size, achieved an mAP of 69.1% on the VOC 2007 test dataset, which is ∼12% and ∼10.7% higher than that of Tiny YOLOv2 and Tiny YOLOv3, respectively. Third, YOLO Nano requires just 4.57 billion operations to perform inference, which is >34% lower than Tiny YOLOv2 and ∼17% lower than Tiny YOLOv3.
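
The quoted ratios can be turned around to estimate the baseline figures they imply. This is a back-of-the-envelope sketch and not a substitute for Table 1: it assumes the mAP gaps are absolute percentage points, treats the ">" bounds as approximate equalities, and uses a hypothetical `implied` dictionary for the results.

```python
nano_size_mb, nano_ops_b, nano_map = 4.0, 4.57, 69.1

# Invert each quoted comparison to recover the implied baseline value.
implied = {
    "Tiny YOLOv2": {"size_mb": nano_size_mb * 15.1,
                    "ops_b": nano_ops_b / (1 - 0.34),
                    "map": nano_map - 12.0},
    "Tiny YOLOv3": {"size_mb": nano_size_mb * 8.3,
                    "ops_b": nano_ops_b / (1 - 0.17),
                    "map": nano_map - 10.7},
}

for name, v in implied.items():
    print(f"{name}: ~{v['size_mb']:.1f} MB, ~{v['ops_b']:.2f}B ops, ~{v['map']:.1f}% mAP")
```

Because the size and operation ratios are stated as lower bounds, the implied sizes and operation counts are lower bounds as well.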

Table 1: Object detection accuracy results of tested compact networks on VOC 2007 test set. Input size is 416 $\times$ 416 for all tested networks. Best results are highlighted in bold.

Finally, to investigate the real-world performance of YOLO Nano within an embedded scenario, we evaluated the inference speed and power efficiency of YOLO Nano running on a Jetson AGX Xavier embedded module at different power budgets. At 15W and 30W power budgets, YOLO Nano achieved inference speeds of ∼26.9 FPS and ∼48.2 FPS, respectively, resulting in power efficiencies of ∼1.97 images/sec/watt and ∼1.61 images/sec/watt, respectively. These experimental results show that the proposed YOLO Nano network, created through a human-machine collaborative design strategy, provides a strong balance between accuracy, size, and computational complexity that makes it well suited for embedded object detection for edge and mobile scenarios.
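
As a quick sanity check, power efficiency can be computed by dividing the reported frame rate by the nominal power budget. The helper name `images_per_sec_per_watt` is hypothetical, and using the budget as the denominator is an assumption, since the text does not say how power was measured.

```python
def images_per_sec_per_watt(fps, watts):
    """Frames per second divided by the power budget in watts."""
    return fps / watts


# Power budget (W) -> reported inference speed (FPS).
reported = {15: 26.9, 30: 48.2}
for watts, fps in reported.items():
    print(f"{watts} W: {images_per_sec_per_watt(fps, watts):.2f} images/sec/watt")
```

The 30 W quotient agrees with the quoted ∼1.61 images/sec/watt. The 15 W quotient comes out near 1.79 rather than the quoted ∼1.97, so the quoted figure may rest on something other than the nominal budget, such as measured power draw; the text does not say.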

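ideal [aɪ'dɪəl; aɪ'diːəl]: adj. ideal, perfect, imaginary, unrealistic n. ideal, model
billion [ˈbɪljən], B: n. one thousand million, a very large number num. one thousand million adj. billion
bold [bəʊld]: adj. daring, brave, boldface (type), shameless, steep
investigate [ɪnˈvestɪɡeɪt]: v. to inquire into, to study
watt [wɒt]: n. watt (unit of power)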

References

[10] Feature Pyramid Networks for Object Detection
[14] YOLOv3: An Incremental Improvement
[16] MobileNetV2: Inverted Residuals and Linear Bottlenecks
[21] AttoNets: Compact and Efficient Deep Neural Networks for the Edge via Human-Machine Collaborative Design
