YOLOv6翻译

YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications

Chuyi Li Lulu Li Hongliang Jiang Kaiheng Weng Yifei Geng Liang Li Zaidan Ke Qingyuan Li Meng Cheng Weiqiang Nie Yiduo Li Bo Zhang Yufei Liang Linyuan Zhou Xiaoming Xu Xiangxiang Chu Xiaoming Wei Xiaolin Wei Meituan Inc.

{lichuyi, lilulu05, jianghongliang02, wengkaiheng, gengyifei, liliang58, kezaidan, liqingyuan02, chengmeng05, nieweiqiang, liyiduo, zhangbo97, liangyufei, zhoulinyuan, xuxiaoming04, chuxiangxiang, weixiaoming, weixiaolin02}@meituan.com

[图 1:YOLOv6 与其他主流检测器在速度与精度上的对比]
Abstract

For years, the YOLO series has been the de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios. In this technical report, we strive to push its limits to the next level, stepping forward with an unwavering mindset for industry application. Considering the diverse requirements for speed and accuracy in the real environment, we extensively examine the up-to-date object detection advancements either from industry or academia. Specifically, we heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization and optimization methods. On top of this, we integrate our thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases. With the generous permission of YOLO authors, we name it YOLOv6. We also express our warm welcome to users and contributors for further enhancement. For a glimpse of performance, our YOLOv6-N hits 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S and PPYOLOE-S). Our quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L also achieves better accuracy performance (i.e., 49.5%/52.3%) than other detectors at a similar inference speed. We carefully conducted experiments to validate the effectiveness of each component. Our code is made available at https://github.com/meituan/YOLOv6.

多年来,YOLO 系列一直是高效目标检测领域事实上的行业标准。YOLO 社区蓬勃发展,其应用遍及众多硬件平台和丰富场景。在本技术报告中,我们努力将其极限推向更高水平,以坚定不移的工业应用导向向前迈进。考虑到现实环境中对速度和精度的多样化要求,我们广泛研究了来自工业界和学术界的最新目标检测进展。具体来说,我们大量吸收了最新的网络设计、训练策略、测试技术、量化和优化方法的思想。在此基础上,我们结合自己的思考和实践,构建了一套不同规模、可直接部署的网络,以适应多样化的使用场景。在征得 YOLO 作者的慷慨同意后,我们将其命名为 YOLOv6。我们也热烈欢迎用户和贡献者对其进行进一步改进。从性能上看,我们的 YOLOv6-N 在 COCO 数据集上达到 35.9% AP,在 NVIDIA Tesla T4 GPU 上吞吐量可达 1234 FPS。YOLOv6-S 以 495 FPS 的速度达到 43.5% AP,超越了同等规模的其他主流检测器(YOLOv5-S、YOLOX-S 和 PPYOLOE-S)。我们的量化版 YOLOv6-S 甚至以 869 FPS 的速度实现了 43.3% AP 的最新纪录。此外,YOLOv6-M/L 在推理速度相近的情况下也取得了优于其他检测器的精度(分别为 49.5%/52.3%)。我们仔细地进行了实验,以验证每个组件的有效性。我们的代码可在 https://github.com/meituan/YOLOv6 上获取。

1 Introduction

YOLO series have been the most popular detection frameworks in industrial applications, for its excellent balance between speed and accuracy. Pioneering works of YOLO series are YOLOv1-3 [32–34], which blaze a new trail of one-stage detectors along with the later substantial improvements. YOLOv4 [1] reorganized the detection framework into several separate parts (backbone, neck and head), and verified bag-of-freebies and bag-of-specials at the time to design a framework suitable for training on a single GPU. At present, YOLOv5 [10], YOLOX [7], PPYOLOE [44] and YOLOv7 [42] are all the competing candidates for efficient detectors to deploy. Models at different sizes are commonly obtained through scaling techniques.

YOLO 系列因其在速度和精度之间的出色平衡而成为工业应用中最受欢迎的检测框架。YOLO 系列的开创性工作是 YOLOv1-3 [32-34],它们开辟了单阶段检测器的新道路,随后又经历了大幅改进。YOLOv4 [1] 将检测框架重组为几个独立的部分(backbone、neck 和 head),并验证了当时的 "bag of freebies" 和 "bag of specials" 技巧,从而设计出适合在单个 GPU 上训练的框架。目前,YOLOv5 [10]、YOLOX [7]、PPYOLOE [44] 和 YOLOv7 [42] 都是部署高效检测器时相互竞争的候选方案。不同规模的模型通常通过模型缩放技术获得。

In this report, we empirically observed several important factors that motivate us to refurnish the YOLO framework: (1) Reparameterization from RepVGG [3] is a superior technique that is not yet well exploited in detection. We also notice that simple model scaling for RepVGG blocks becomes impractical, for which we consider that the elegant consistency of the network design between small and large networks is unnecessary. The plain single-path architecture is a better choice for small networks, but for larger models, the exponential growth of the parameters and the computation cost of the single-path architecture makes it infeasible; (2) Quantization of reparameterization-based detectors also requires meticulous treatment, otherwise it would be intractable to deal with performance degradation due to its heterogeneous configuration during training and inference; (3) Previous works [7, 10, 42, 44] tend to pay less attention to deployment, whose latencies are commonly compared on high-cost machines like V100. There is a hardware gap when it comes to the real serving environment. Typically, low-power GPUs like the Tesla T4 are less costly and provide rather good inference performance; (4) Advanced domain-specific strategies like label assignment and loss function design need further verifications considering the architectural variance; (5) For deployment, we can tolerate adjustments of the training strategy that improve the accuracy performance but do not increase inference costs, such as knowledge distillation.

在本报告中,我们根据经验观察到几个重要因素,促使我们重新构建 YOLO 框架:(1) RepVGG [3] 中的重参数化(reparameterization)是一项卓越的技术,但在检测任务中尚未得到充分利用。我们还注意到,对 RepVGG 块做简单的模型缩放并不现实,因此我们认为,小型网络和大型网络之间保持优雅一致的网络设计是不必要的。对于小型网络,普通的单路径架构是更好的选择,但对于大型模型,单路径架构的参数量和计算成本呈指数增长,使其变得不可行;(2) 基于重参数化的检测器的量化也需要细致处理,否则由于训练和推理过程中的异构配置,性能下降问题将难以解决;(3) 以前的工作 [7, 10, 42, 44] 往往不太关注部署,其延迟通常是在 V100 等高成本机器上比较的,而实际服务环境存在硬件差距。通常情况下,像 Tesla T4 这样的低功耗 GPU 成本较低,推理性能也相当不错;(4) 考虑到架构差异,标签分配(label assignment)和损失函数(loss function)设计等针对特定领域的高级策略需要进一步验证;(5) 在部署时,我们可以容忍那些能提高精度但不增加推理成本的训练策略调整,例如知识蒸馏。

With the aforementioned observations in mind, we bring the birth of YOLOv6, which accomplishes so far the best trade-off in terms of accuracy and speed. We show the comparison of YOLOv6 with other peers at a similar scale in Fig. 1. To boost inference speed without much performance degradation, we examined the cutting-edge quantization methods, including post-training quantization (PTQ) and quantization-aware training (QAT), and accommodate them in YOLOv6 to achieve the goal of deployment-ready networks.

考虑到上述观察,我们推出了 YOLOv6,它实现了迄今为止在精度和速度方面的最佳权衡。我们在图 1 中展示了 YOLOv6 与其他同等规模模型的比较。为了在性能不大幅下降的前提下提高推理速度,我们研究了最前沿的量化方法,包括训练后量化(PTQ)和量化感知训练(QAT),并将它们融入 YOLOv6,以实现部署就绪网络的目标。

We summarize the main aspects of YOLOv6 as follows:

我们将 YOLOv6 的主要方面总结如下:

  • We refashion a line of networks of different sizes tailored for industrial applications in diverse scenarios. The architectures at different scales vary to achieve the best speed and accuracy trade-off, where small models feature a plain single-path backbone and large models are built on efficient multi-branch blocks.

    我们针对不同场景下的工业应用,重新设计了一系列不同规模的网络。不同规模的架构各不相同,以实现速度和精度的最佳权衡,其中小型模型采用普通的单路径骨干网,大型模型则建立在高效的多分支区块上。

  • We imbue YOLOv6 with a self-distillation strategy, performed both on the classification task and the regression task. Meanwhile, we dynamically adjust the knowledge from the teacher and labels to help the student model learn knowledge more efficiently during all training phases.

    我们为 YOLOv6 注入了自蒸馏策略,在分类任务和回归任务中同时执行。同时,我们动态调整来自教师模型的知识和标签,帮助学生模型在所有训练阶段更高效地学习知识。

  • We broadly verify the advanced detection techniques for label assignment, loss function and data augmentation, and adopt them selectively to further boost the performance.

    我们广泛验证了标签分配、损失函数和数据增强方面的先进检测技术,并有选择地采用这些技术来进一步提升性能。

  • We reform the quantization scheme for detection with the help of RepOptimizer [2] and channel-wise distillation [36], which leads to an ever-fast and accurate detector with 43.3% COCO AP and a throughput of 869 FPS at a batch size of 32.

    我们借助 RepOptimizer [2] 和通道蒸馏(channel-wise distillation) [36],改革了用于检测的量化方案,从而实现了快速、准确的检测器,在批量为 32 时,COCO AP 为 43.3%,吞吐量为 869 FPS。

2 Method

The renovated design of YOLOv6 consists of the following components: network design, label assignment, loss function, data augmentation, industry-handy improvements, and quantization and deployment.

YOLOv6 的改造设计由以下部分组成:网络设计、标签分配、损失函数、数据增强、行业实用改进以及量化和部署。

  • Network Design: Backbone: Compared with other mainstream architectures, we find that RepVGG [3] backbones are equipped with more feature representation power in small networks at a similar inference speed, whereas it can hardly be scaled to obtain larger models due to the explosive growth of the parameters and computational costs. In this regard, we take RepBlock [3] as the building block of our small networks. For large models, we revise a more efficient CSP [43] block, named CSPStackRep Block. Neck: The neck of YOLOv6 adopts PAN topology [24] following YOLOv4 and YOLOv5. We enhance the neck with RepBlocks or CSPStackRep Blocks to have Rep-PAN. Head: We simplify the decoupled head to make it more efficient, called Efficient Decoupled Head.

    网络设计: 骨干网: 与其他主流架构相比,我们发现在推理速度相近的情况下,RepVGG [3] 骨干网在小型网络中具有更强的特征表示能力,但由于参数量和计算成本的爆炸式增长,它很难扩展到更大的模型。为此,我们将 RepBlock [3] 作为小型网络的构建模块。对于大型模型,我们修改出一个更高效的 CSP [43] 模块,命名为 CSPStackRep Block。颈部: 继 YOLOv4 和 YOLOv5 之后,YOLOv6 的颈部采用了 PAN 拓扑 [24]。我们用 RepBlock 或 CSPStackRep Block 对颈部进行增强,得到 Rep-PAN。头部: 我们简化了解耦头,使其更高效,称为高效解耦头(Efficient Decoupled Head)。

  • Label Assignment: We evaluate the recent progress of label assignment strategies [5, 7, 18, 48, 51] on YOLOv6 through numerous experiments, and the results indicate that TAL [5] is more effective and training-friendly.

    标签分配: 我们在 YOLOv6 上通过大量实验评估了近期标签分配策略的进展[5, 7, 18, 48, 51],结果表明 TAL [5] 更有效、更便于训练。

  • Loss Function: The loss functions of the mainstream anchor-free object detectors contain classification loss, box regression loss and object loss. For each loss, we systematically experiment with all available techniques and finally select VariFocal Loss [50] as our classification loss and SIoU [8]/GIoU [35] Loss as our regression loss.

    损失函数: 主流无锚(anchor-free)目标检测器的损失函数包括分类损失、框回归损失(box regression loss)和目标损失(object loss)。对于每种损失,我们都用所有可用技术进行了系统实验,最终选择 VariFocal Loss [50] 作为分类损失,SIoU [8]/GIoU [35] Loss 作为回归损失。

  • Industry-handy improvements: We introduce additional common practice and tricks to improve the performance including self-distillation and more training epochs. For self-distillation, both classification and box regression are respectively supervised by the teacher model. The distillation of box regression is made possible thanks to DFL [20]. In addition, the proportion of information from the soft and hard labels is dynamically declined via cosine decay, which helps the student selectively acquire knowledge at different phases during the training process. In addition, we encounter the problem of the impaired performance without adding extra gray borders at evaluation, for which we provide some remedies.

    行业实用改进: 我们引入了更多常用做法和技巧来提高性能,包括自蒸馏和更多的训练轮数。对于自蒸馏,分类和框回归分别由教师模型监督。得益于 DFL [20],框回归的蒸馏成为可能。此外,来自软标签和硬标签的信息比例通过余弦衰减动态调整,这有助于学生模型在训练过程的不同阶段有选择地获取知识。另外,我们还遇到了在评估时不添加额外灰色边框会导致性能受损的问题,对此我们提供了一些补救措施。

  • Quantization and deployment: To cure the performance degradation in quantizing reparameterization-based models, we train YOLOv6 with RepOptimizer [2] to obtain PTQ-friendly weights. We further adopt QAT with channel-wise distillation [36] and graph optimization to pursue extreme performance. Our quantized YOLOv6-S hits a new state of the art with 43.3% AP and a throughput of 869 FPS (batch size=32).

    量化和部署: 为了解决量化基于重参数化的模型时性能下降的问题,我们使用 RepOptimizer [2] 训练 YOLOv6,以获得对 PTQ 友好的权重。为了追求极致性能,我们还进一步采用了结合通道蒸馏(channel-wise distillation)[36] 和图优化的 QAT。我们的量化版 YOLOv6-S 达到了新的最先进水平,AP 为 43.3%,吞吐量为 869 FPS(批量大小=32)。

2.1 Network Design

A one-stage object detector is generally composed of the following parts: a backbone, a neck and a head. The backbone mainly determines the feature representation ability; meanwhile, its design has a critical influence on the inference efficiency since it carries a large portion of the computation cost. The neck is used to aggregate the low-level physical features with high-level semantic features, and then build up pyramid feature maps at all levels. The head consists of several convolutional layers, and it predicts final detection results according to multi-level features assembled by the neck. It can be categorized as anchor-based and anchor-free, or rather parameter-coupled head and parameter-decoupled head from the structure's perspective.

单级目标检测器一般由以下部分组成:主干、颈部和头部。主干主要决定了特征表示能力,同时,由于它承担了很大一部分计算成本,因此其设计对推理效率有着至关重要的影响。颈部用于将低层次的物理特征与高层次的语义特征聚合在一起,然后建立各个层次的金字塔特征图。头部由多个卷积层组成,根据颈部汇集的多层次特征预测最终检测结果。从结构上看,它可以分为基于锚和无锚两类,或者说参数耦合头部和参数解耦头部。

In YOLOv6, based on the principle of hardware friendly network design [3], we propose two scaled reparameterizable backbones and necks to accommodate models at different sizes, as well as an efficient decoupled head with the hybrid-channel strategy. The overall architecture of YOLOv6 is shown in Fig. 2.

在 YOLOv6 中,根据硬件友好型网络设计原则[3],我们提出了两种可缩放的重新参数化骨干和颈部,以适应不同尺寸的模型,以及采用混合通道策略的高效解耦头部。YOLOv6 的整体架构如图 2 所示。

[图 2:YOLOv6 整体架构示意图]
2.1.1 Backbone

As mentioned above, the design of the backbone network has a great impact on the effectiveness and efficiency of the detection model. Previously, it has been shown that multi-branch networks [13, 14, 38, 39] can often achieve better classification performance than single-path ones [15, 37], but often it comes with the reduction of the parallelism and results in an increase of inference latency. On the contrary, plain single-path networks like VGG [37] take the advantages of high parallelism and less memory footprint, leading to higher inference efficiency. Lately in RepVGG [3], a structural re-parameterization method is proposed to decouple the training-time multi-branch topology with an inference-time plain architecture to achieve a better speed-accuracy trade-off.

如上所述,主干网络的设计对检测模型的效果和效率有很大影响。以前的研究表明,多分支网络 [13, 14, 38, 39] 通常能比单路径网络 [15, 37] 获得更好的分类性能,但往往会降低并行性,导致推理延迟增加。相反,像 VGG [37] 这样的普通单路径网络则具有并行性高、内存占用少的优点,从而提高了推理效率。最近,RepVGG [3]提出了一种结构重参数化方法,将训练时的多分支拓扑与推理时的朴素架构解耦,以实现更好的速度与精度权衡。
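下面用一个极简的一维单通道示例演示上述结构重参数化的核心思想(笔者补充的示意代码,并非 RepVGG 的官方实现,且省略了 BN 融合等细节):训练时的 "3×3 卷积 + 1×1 卷积 + 恒等映射" 三分支,在推理时可以合并为一个等价的单个卷积。

```python
# 结构重参数化(structural re-parameterization)的一维示意:
# 训练时 y = conv3(x) + conv1(x) + x 三分支,推理时合并为单个 3 抽头卷积。
# 仅为原理演示(单通道、无 BN),并非 RepVGG 官方实现。

def conv1d(x, kernel):
    """对序列 x 做 padding=1 的一维卷积(kernel 长度为 3)。"""
    pad = [0.0] + list(x) + [0.0]
    return [sum(kernel[j] * pad[i + j] for j in range(3)) for i in range(len(x))]

def fuse_branches(k3, k1, has_identity=True):
    """把 3 抽头分支、1 抽头分支和恒等分支合并为一个 3 抽头卷积核。"""
    fused = list(k3)
    fused[1] += k1            # 1x1 卷积等价于只有中心位置非零的 3x3 卷积
    if has_identity:
        fused[1] += 1.0       # 恒等映射等价于中心为 1 的卷积核
    return fused

x = [1.0, 2.0, 3.0, 4.0]
k3, k1 = [0.5, -1.0, 0.25], 2.0

multi_branch = [a + b + c for a, b, c in
                zip(conv1d(x, k3), conv1d(x, [0.0, k1, 0.0]), x)]
single_path = conv1d(x, fuse_branches(k3, k1))
assert all(abs(a - b) < 1e-9 for a, b in zip(multi_branch, single_path))
```

二维多通道情形下的做法完全类似:把 1×1 核零填充到 3×3、把恒等映射写成中心为 1 的卷积核,再与 3×3 核逐元素相加即可。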

Inspired by the above works, we design an efficient reparameterizable backbone denoted as EfficientRep. For small models, the main component of the backbone is RepBlock during the training phase, as shown in Fig. 3 (a). And each RepBlock is converted to stacks of 3×3 convolutional layers (denoted as RepConv) with ReLU activation functions during the inference phase, as shown in Fig. 3 (b). Typically a 3×3 convolution is highly optimized on mainstream GPUs and CPUs and it enjoys higher computational density. Consequently, EfficientRep Backbone sufficiently utilizes the computing power of the hardware, resulting in a significant decrease in inference latency while enhancing the representation ability in the meantime.

受上述工作的启发,我们设计了一种高效的可重参数化骨干网,称为 EfficientRep。对于小型模型,训练阶段骨干网的主要组成部分是 RepBlock,如图 3(a)所示。在推理阶段,每个 RepBlock 会被转换为带 ReLU 激活函数的 3×3 卷积层堆叠(记为 RepConv),如图 3(b)所示。通常情况下,3×3 卷积在主流 GPU 和 CPU 上经过高度优化,具有更高的计算密度。因此,EfficientRep 骨干网充分利用了硬件的计算能力,在显著降低推理延迟的同时增强了表示能力。

However, we notice that with the model capacity further expanded, the computation cost and the number of parameters in the single-path plain network grow exponentially. To achieve a better trade-off between the computation burden and accuracy, we revise a CSPStackRep Block to build the backbone of medium and large networks. As shown in Fig. 3 (c), CSPStackRep Block is composed of three 1×1 convolution layers and a stack of sub-blocks consisting of two RepVGG blocks [3] or RepConv (at training or inference respectively) with a residual connection. Besides, a cross stage partial (CSP) connection is adopted to boost performance without excessive computation cost. Compared with CSPRepResStage [45], it comes with a more succinct outlook and considers the balance between accuracy and speed.

然而,我们注意到,随着模型容量的进一步扩大,单路径普通网络的计算成本和参数数量呈指数增长。为了更好地权衡计算负担和准确性,我们修改出 CSPStackRep Block 来构建中型和大型网络的骨干网。如图 3 (c) 所示,CSPStackRep Block 由三个 1×1 卷积层和一个子块堆叠组成,子块由两个 RepVGG 块 [3] 或 RepConv(分别用于训练或推理)加一个残差连接构成。此外,为了在不增加过多计算成本的情况下提高性能,还采用了跨阶段部分(CSP)连接。与 CSPRepResStage [45] 相比,它的结构更简洁,并兼顾了准确性和速度之间的平衡。

[图 3:(a) 训练阶段的 RepBlock;(b) 推理阶段的 RepConv 堆叠;(c) CSPStackRep Block]
2.1.2 Neck

In practice, the feature integration at multiple scales has been proved to be a critical and effective part of object detection [9, 21, 24, 40]. We adopt the modified PAN topology [24] from YOLOv4 [1] and YOLOv5 [10] as the base of our detection neck. In addition, we replace the CSPBlock used in YOLOv5 with RepBlock (for small models) or CSPStackRep Block (for large models) and adjust the width and depth accordingly. The neck of YOLOv6 is denoted as Rep-PAN.

在实践中,多个尺度的特征整合已被证明是物体检测的关键和有效部分[9, 21, 24, 40]。我们采用 YOLOv4 [1] 和 YOLOv5 [10] 中修改过的 PAN topology [24] 作为检测颈的基础。此外,我们将 YOLOv5 中使用的 CSPBlock 替换为 RepBlock(用于小型模型)或 CSPStackRep Block(用于大型模型),并相应调整了宽度和深度。YOLOv6 的颈部称为 Rep-PAN。

2.1.3 Head

Efficient decoupled head: The detection head of YOLOv5 is a coupled head with parameters shared between the classification and localization branches, while its counterparts in FCOS [41] and YOLOX [7] decouple the two branches, and two additional 3×3 convolutional layers are introduced in each branch to boost the performance.

高效解耦头:YOLOv5 的检测头是一个耦合头,分类分支和定位分支共享参数;而 FCOS [41] 和 YOLOX [7] 中对应的检测头则将这两个分支解耦,并在每个分支中额外引入两个 3×3 卷积层以提高性能。

In YOLOv6, we adopt a hybrid-channel strategy to build a more efficient decoupled head. Specifically, we reduce the number of the middle 3×3 convolutional layers to only one. The width of the head is jointly scaled by the width multiplier for the backbone and the neck. These modifications further reduce computation costs to achieve a lower inference latency.

在 YOLOv6 中,我们采用了混合通道策略来构建更高效的解耦头。具体来说,我们将中间的 3×3 卷积层数量减少到只有一个。头部的宽度由骨干和颈部的宽度乘数共同缩放。这些修改进一步降低了计算成本,从而实现了更低的推理延迟。

Anchor-free Anchor-free detectors stand out because of their better generalization ability and simplicity in decoding prediction results. The time cost of their post-processing is substantially reduced. There are two types of anchor-free detectors: anchor point-based [7, 41] and keypoint-based [16, 46, 53]. In YOLOv6, we adopt the anchor point-based paradigm, whose box regression branch actually predicts the distance from the anchor point to the four sides of the bounding boxes.

无锚检测器:无锚(anchor-free)检测器因其更好的泛化能力和解码预测结果的简便性而脱颖而出,其后处理的时间成本也大大降低。无锚检测器有两种类型:基于锚点的 [7, 41] 和基于关键点的 [16, 46, 53]。在 YOLOv6 中,我们采用了基于锚点的范式,其框回归分支实际预测的是从锚点到边界框四边的距离。
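基于锚点范式的框解码可以示意如下(笔者补充的示意代码,函数名与 stride 的处理方式均为假设,具体实现以官方代码为准):

```python
# anchor point-based 范式的框解码示意:回归分支预测锚点到边框四边的距离
# (l, t, r, b),解码即把距离换算回 (x1, y1, x2, y2) 角点坐标。

def decode_ltrb(anchor_x, anchor_y, l, t, r, b, stride=1.0):
    """把锚点坐标与四边距离(以 stride 为单位)解码为边框角点坐标。
    stride 的处理方式是本示例的假设,不同实现各有差异。"""
    return (anchor_x - l * stride, anchor_y - t * stride,
            anchor_x + r * stride, anchor_y + b * stride)

# 位于 (100, 80) 的锚点,在 stride=8 的特征层上预测 (2, 1, 3, 4)
box = decode_ltrb(100.0, 80.0, l=2.0, t=1.0, r=3.0, b=4.0, stride=8.0)
assert box == (84.0, 72.0, 124.0, 112.0)
```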

2.2 Label Assignment

Label assignment is responsible for assigning labels to predefined anchors during the training stage. Previous work has proposed various label assignment strategies ranging from simple IoU-based strategy and inside ground-truth method [41] to other more complex schemes [5, 7, 18, 48, 51].

标签分配负责在训练阶段为预定义的锚点分配标签。以往的工作提出了各种标签分配策略,从简单的基于 IoU 的策略和 inside ground-truth[41] 到其他更复杂的方案 [5, 7, 18, 48, 51]。

SimOTA OTA [6] considers the label assignment in object detection as an optimal transmission problem. It defines positive/negative training samples for each ground-truth object from a global perspective. SimOTA [7] is a simplified version of OTA [6], which reduces additional hyperparameters and maintains the performance. SimOTA was utilized as the label assignment method in the early version of YOLOv6. However, in practice, we find that introducing SimOTA will slow down the training process. And it is not rare to fall into unstable training. Therefore, we desire a replacement for SimOTA.

SimOTA:OTA [6] 将目标检测中的标签分配视为一个最优传输(optimal transport)问题,从全局角度为每个真实标注(ground-truth)目标定义正/负训练样本。SimOTA [7] 是 OTA [6] 的简化版本,它减少了额外的超参数并保持了性能。在 YOLOv6 的早期版本中,SimOTA 被用作标签分配方法。但在实践中,我们发现引入 SimOTA 会减慢训练过程,而且陷入不稳定训练的情况并不罕见。因此,我们希望找到 SimOTA 的替代方案。

Task alignment learning Task Alignment Learning (TAL) was first proposed in TOOD [5], in which a unified metric of classification score and predicted box quality is designed. The IoU is replaced by this metric to assign object labels. To a certain extent, the problem of the misalignment of tasks (classification and box regression) is alleviated.

任务对齐学习:任务对齐学习(Task Alignment Learning, TAL)最早是在 TOOD [5] 中提出的,其中设计了一个结合分类得分和预测框质量的统一度量,用该度量取代 IoU 来分配目标标签。这在一定程度上缓解了分类和框回归两个任务不对齐的问题。
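TAL 的统一度量可以概括为 $t = s^\alpha \cdot u^\beta$,其中 $s$ 为分类得分,$u$ 为预测框与真值框的 IoU。下面的示意代码为笔者补充,alpha=1.0、beta=6.0 取自 TOOD 论文的默认设置,YOLOv6 的实际取值请以官方代码为准:

```python
# TAL 统一度量 t = s^alpha * u^beta 的示意:s 为分类得分,u 为预测框与
# 真值框的 IoU。alpha、beta 的默认值取自 TOOD 论文,仅为假设。

def task_alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    return (cls_score ** alpha) * (iou ** beta)

# 分类得分高但定位差的候选,其对齐度量会被显著压低,
# 从而缓解分类与回归两个任务的不对齐
well_aligned = task_alignment_metric(0.8, 0.9)
misaligned = task_alignment_metric(0.9, 0.5)
assert well_aligned > misaligned
```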

The other main contribution of TOOD is about the task-aligned head (T-head). T-head stacks convolutional layers to build interactive features, on top of which the Task-Aligned Predictor (TAP) is used. PP-YOLOE [45] improved T-head by replacing the layer attention in T-head with the lightweight ESE attention, forming ET-head. However, we find that the ET-head will deteriorate the inference speed in our models and it comes with no accuracy gain. Therefore, we retain the design of our Efficient decoupled head. Furthermore, we observed that TAL could bring more performance improvement than SimOTA and stabilize the training. Therefore, we adopt TAL as our default label assignment strategy in YOLOv6.

TOOD 的另一个主要贡献是任务对齐头(T-head)。T-head 堆叠卷积层来构建交互式特征,在此基础上使用任务对齐预测器(TAP)。PP-YOLOE [45] 改进了 Thead,用轻量级 ESE 注意力取代了 T-head 中的层注意力,形成了 ET-head。然而,我们发现 ET 头会降低我们模型的推理速度,而且不会带来准确性的提高。因此,我们保留了高效解耦头的设计。此外,我们观察到 TAL 比 SimOTA 能带来更多的性能改进,并能稳定训练。因此,我们将 TAL 作为 YOLOv6 的默认标签分配策略。

2.3 Loss Functions

Object detection contains two sub-tasks: classification and localization, corresponding to two loss functions: classification loss and box regression loss. For each sub-task, there are various loss functions presented in recent years. In this section, we will introduce these loss functions and describe how we select the best ones for YOLOv6.

目标检测包含两个子任务:分类和定位,对应两个损失函数:分类损失和框回归损失。对于每个子任务,近年来都出现了各种损失函数。在本节中,我们将介绍这些损失函数,并说明如何为 YOLOv6 选择最佳的损失函数。

2.3.1 Classification Loss

Improving the performance of the classifier is a crucial part of optimizing detectors. Focal Loss [22] modified the traditional cross-entropy loss to solve the problems of class imbalance either between positive and negative examples, or hard and easy samples. To tackle the inconsistent usage of the quality estimation and classification between training and inference, Quality Focal Loss (QFL) [20] further extended Focal Loss with a joint representation of the classification score and the localization quality for the supervision in classification. Whereas VariFocal Loss (VFL) [50] is rooted from Focal Loss [22], but it treats the positive and negative samples asymmetrically. By considering positive and negative samples at different degrees of importance, it balances learning signals from both samples. Poly Loss [17] decomposes the commonly used classification loss into a series of weighted polynomial bases. It tunes polynomial coefficients on different tasks and datasets, which is proved better than Cross-entropy Loss and Focal Loss through experiments.

提高分类器的性能是优化检测器的关键部分。Focal Loss [22] 对传统的交叉熵损失进行了改进,以解决正负样本之间或难易样本之间的类别不平衡问题。为了解决质量估计与分类在训练和推理之间使用不一致的问题,Quality Focal Loss(QFL)[20] 进一步扩展了 Focal Loss,将分类得分和定位质量联合表示,用于分类监督。VariFocal Loss(VFL)[50] 同样源自 Focal Loss [22],但它以非对称的方式处理正负样本。通过赋予正负样本不同的重要程度,它平衡了来自两类样本的学习信号。Poly Loss [17] 则将常用的分类损失分解为一系列加权多项式基。它在不同的任务和数据集上调整多项式系数,实验证明其效果优于交叉熵损失和 Focal Loss。
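按 VarifocalNet 论文的公式,VFL 的单点计算可以示意如下(笔者补充的示意代码,alpha=0.75、gamma=2.0 为该论文的默认值,并非 YOLOv6 给出的超参):

```python
import math

# VariFocal Loss 的单点示意:正样本以目标质量分数 q(如 IoU 感知的软标签)
# 加权普通 BCE;负样本以 alpha * p^gamma 降权,抑制大量易分负样本,
# 从而非对称地对待正负样本。

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    eps = 1e-9
    if q > 0:   # 正样本:软标签 q 同时作为监督目标和权重
        return -q * (q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps))
    else:       # 负样本:预测置信度越高,惩罚越重
        return -alpha * (p ** gamma) * math.log(1 - p + eps)

assert varifocal_loss(0.9, 0.9) < varifocal_loss(0.1, 0.9)  # 正样本:预测越准损失越小
assert varifocal_loss(0.9, 0.0) > varifocal_loss(0.1, 0.0)  # 负样本:过度自信损失更大
```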

We assess all these advanced classification losses on YOLOv6 to finally adopt VFL [50].

我们在 YOLOv6 上对所有这些高级分类损失进行了评估,最终采用了 VFL [50]。

2.3.2 Box Regression Loss

Box regression loss provides significant learning signals for localizing bounding boxes precisely. L1 loss is the original box regression loss in early works. Progressively, a variety of well-designed box regression losses have sprung up, such as IoU-series loss [8, 11, 35, 47, 52] and probability loss [20].

框回归损失为精确定位边界框提供了重要的学习信号。L1 损失是早期工作中最初使用的框回归损失。随着研究的推进,各种精心设计的框回归损失相继出现,如 IoU 系列损失 [8, 11, 35, 47, 52] 和概率损失 [20]。

IoU-series Loss IoU loss [47] regresses the four bounds of a predicted box as a whole unit. It has been proved to be effective because of its consistency with the evaluation metric. There are many variants of IoU, such as GIoU [35], DIoU [52], CIoU [52], α-IoU [11] and SIoU [8], etc., forming relevant loss functions. We experiment with GIoU, CIoU and SIoU in this work. And SIoU is applied to YOLOv6-N and YOLOv6-T, while others use GIoU.

IoU 系列损失:IoU 损失 [47] 将预测框的四条边界作为一个整体进行回归。由于其与评估指标的一致性,它被证明是有效的。IoU 有许多变体,如 GIoU [35]、DIoU [52]、CIoU [52]、α-IoU [11] 和 SIoU [8] 等,形成了相应的损失函数。在本工作中,我们对 GIoU、CIoU 和 SIoU 进行了实验,其中 SIoU 应用于 YOLOv6-N 和 YOLOv6-T,其余模型使用 GIoU。
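IoU 与 GIoU 的计算可以直接写出(笔者补充的示意代码,轴对齐框、(x1, y1, x2, y2) 格式):GIoU 在 IoU 的基础上,用最小闭包框中未被两框覆盖的区域施加惩罚,使两框不重叠时也能提供梯度。

```python
# IoU 与 GIoU 的计算示意。GIoU = IoU - |C \ (A ∪ B)| / |C|,
# 其中 C 为同时包住两框的最小闭包框。

def iou_giou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # 最小闭包框 C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (c_area - union) / c_area

iou, giou = iou_giou((0, 0, 2, 2), (1, 1, 3, 3))
assert abs(iou - 1 / 7) < 1e-9   # inter=1, union=7
assert giou < iou                # 闭包框中未被覆盖的区域带来额外惩罚
```

对应的损失通常取 1 - IoU 或 1 - GIoU。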

Probability Loss Distribution Focal Loss (DFL) [20] simplifies the underlying continuous distribution of box locations as a discretized probability distribution. It considers ambiguity and uncertainty in data without introducing any other strong priors, which is helpful to improve the box localization accuracy especially when the boundaries of the ground-truth boxes are blurred. Upon DFL, DFLv2 [19] develops a lightweight sub-network to leverage the close correlation between distribution statistics and the real localization quality, which further boosts the detection performance. However, DFL usually outputs 17× more regression values than general box regression, leading to a substantial overhead. The extra computation cost significantly hinders the training of small models. Whilst DFLv2 further increases the computation burden because of the extra sub-network. In our experiments, DFLv2 brings similar performance gain to DFL on our models. Consequently, we only adopt DFL in YOLOv6-M/L. Experimental details can be found in Section 3.3.3.

概率损失:Distribution Focal Loss(DFL)[20] 将框位置的底层连续分布简化为离散化的概率分布。它在不引入任何其他强先验的情况下考虑了数据中的模糊性和不确定性,这有助于提高框定位精度,尤其是在真实标注框边界模糊的情况下。在 DFL 的基础上,DFLv2 [19] 开发了一个轻量级子网络,利用分布统计量与实际定位质量之间的密切关联,进一步提升了检测性能。然而,DFL 输出的回归值通常是一般框回归的 17 倍,导致大量开销,额外的计算成本极大地阻碍了小型模型的训练;而 DFLv2 因为额外的子网络进一步增加了计算负担。在我们的实验中,DFLv2 在我们的模型上带来的性能增益与 DFL 相近。因此,我们只在 YOLOv6-M/L 中采用 DFL。实验详情请参见第 3.3.3 节。
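DFL 的解码过程可以示意如下(笔者补充):每条边的回归输出是 0..reg_max 共 reg_max+1 个离散 bin 上的分布,预测值取该分布的 softmax 期望。常见设置 reg_max=16 意味着每条边输出 17 个值,相对单值回归约有 17 倍的输出开销。

```python
import math

# DFL 解码示意:对一条边的 17 个 logits(reg_max=16)做 softmax 后取期望,
# 得到连续的距离预测。reg_max=16 为常见设置,并非论文给出的确切数值。

def dfl_decode(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    probs = [e / s for e in exps]
    return sum(i * p for i, p in enumerate(probs))

# 概率质量集中在 bin 4 与 bin 5 附近时,期望落在 4~5 之间,
# 体现了 "离散分布表达连续位置" 的思想
logits = [0.0] * 17
logits[4] = 6.0
logits[5] = 6.0
d = dfl_decode(logits)
assert 4.0 < d < 5.0
```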

2.3.3 Object Loss

Object loss was first proposed in FCOS [41] to reduce the score of low-quality bounding boxes so that they can be filtered out in post-processing. It was also used in YOLOX [7] to accelerate convergence and improve network accuracy. As an anchor-free framework like FCOS and YOLOX, we tried introducing object loss into YOLOv6. Unfortunately, it doesn't bring many positive effects. Details are given in Section 3.

Object loss(目标损失)最早是在 FCOS [41] 中提出的,用于降低低质量边界框的得分,以便在后处理中将其过滤掉。YOLOX [7] 中也使用了这种方法,以加快收敛速度并提高网络精度。作为一个像 FCOS 和 YOLOX 一样的无锚点框架,我们在 YOLOv6 中尝试了目标损失。遗憾的是,它并没有带来很多积极的效果。详情见第 3 节。

2.4 Industry-handy Improvements

The following tricks come ready to use in real practice. They are not intended for a fair comparison but steadily produce performance gain without much tedious effort.

以下技巧可在实际操作中直接使用。它们并非用于公平比较,而是能在不费太多功夫的情况下稳定地带来性能提升。

2.4.1 More training epochs

Empirical results have shown that detectors have a progressing performance with more training time. We extended the training duration from 300 epochs to 400 epochs to reach a better convergence.

经验结果表明,训练时间越长,检测器的性能越好。我们将训练时长从 300 个 epoch 延长到 400 个 epoch,以达到更好的收敛效果。

2.4.2 Self-distillation

To further improve the model accuracy while not introducing much additional computation cost, we apply the classical knowledge distillation technique minimizing the KL divergence between the prediction of the teacher and the student. We limit the teacher to be the student itself but pretrained, hence we call it self-distillation. Note that the KL-divergence is generally utilized to measure the difference between data distributions. However, there are two sub-tasks in object detection, in which only the classification task can directly utilize knowledge distillation based on KL-divergence. Thanks to DFL loss [20], we can perform it on box regression as well. The knowledge distillation loss can then be formulated as:

为了在不引入太多额外计算成本的前提下进一步提高模型精度,我们采用经典的知识蒸馏技术,最小化教师预测与学生预测之间的 KL 散度。我们将教师限定为经过预训练的学生模型本身,因此称之为自蒸馏。请注意,KL 散度一般用于衡量数据分布之间的差异。然而,目标检测中有两个子任务,其中只有分类任务可以直接利用基于 KL 散度的知识蒸馏。得益于 DFL 损失 [20],我们也可以在框回归上执行蒸馏。知识蒸馏损失可以表述为:
$$L_{KD} = KL(p_t^{cls} || p_s^{cls}) + KL(p_t^{reg} || p_s^{reg})$$
where $p_t^{cls}$ and $p_s^{cls}$ are the class predictions of the teacher model and the student model respectively, and accordingly $p_t^{reg}$ and $p_s^{reg}$ are the box regression predictions. The overall loss function is now formulated as:

其中,$p_t^{cls}$ 和 $p_s^{cls}$ 分别为教师模型和学生模型的类别预测,相应地,$p_t^{reg}$ 和 $p_s^{reg}$ 为框回归预测。总体损失函数现在表述为:
$$L_{total} = L_{det} + \alpha L_{KD}$$
where $L_{det}$ is the detection loss computed with predictions and labels. The hyperparameter $\alpha$ is introduced to balance the two losses. In the early stage of training, the soft labels from the teacher are easier to learn. As the training continues, the performance of the student will match the teacher, so that the hard labels will help students more. Upon this, we apply cosine weight decay to $\alpha$ to dynamically adjust the information from hard labels and soft ones from the teacher. We conducted detailed experiments to verify the effect of self-distillation on YOLOv6, which will be discussed in Section 3.

其中,$L_{det}$ 是使用预测和标签计算的检测损失。引入超参数 $\alpha$ 是为了平衡两种损失。在训练初期,来自教师的软标签更容易学习。随着训练的继续,学生的表现将与教师相匹配,因此硬标签对学生的帮助会更大。基于此,我们对 $\alpha$ 应用余弦权重衰减,以动态调整来自硬标签和来自教师的软标签的信息。我们进行了详细的实验来验证自蒸馏对 YOLOv6 的影响,这将在第 3 节中讨论。
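对 $\alpha$ 的余弦权重衰减可以示意如下(笔者补充的示意代码;论文未给出端点取值,这里的 1.0 到 0.0 仅为假设):

```python
import math

# 自蒸馏权重 alpha 的余弦衰减示意:训练前期软标签(教师知识)权重较大,
# 后期逐步让位于硬标签。端点取值(start=1.0, end=0.0)为本示例的假设。

def cosine_decay(step, total_steps, start=1.0, end=0.0):
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return end + (start - end) * cos

T = 100
assert abs(cosine_decay(0, T) - 1.0) < 1e-9       # 训练开始:软标签权重最大
assert abs(cosine_decay(T, T) - 0.0) < 1e-9       # 训练结束:几乎只剩硬标签
assert cosine_decay(25, T) > cosine_decay(75, T)  # 全程单调递减
```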

2.4.3 Gray border of images

We notice that a half-stride gray border is put around each image when evaluating the model performance in the implementations of YOLOv5 [10] and YOLOv7 [42]. Although no useful information is added, it helps in detecting the objects near the edge of the image. This trick also applies in YOLOv6.

我们注意到,在 YOLOv5 [10] 和 YOLOv7 [42] 的实现中,评估模型性能时会在每幅图像周围加上半个步长(half-stride)的灰色边框。虽然没有添加任何有用的信息,但这有助于检测图像边缘附近的目标。这一技巧同样适用于 YOLOv6。

However, the extra gray pixels evidently reduce the inference speed. Without the gray border, the performance of YOLOv6 deteriorates, which is also the case in [10, 42]. We postulate that the problem is related to the gray borders padding in Mosaic augmentation [1, 10]. Experiments on turning mosaic augmentations off during last epochs [7] (aka. fade strategy) are conducted for verification. In this regard, we change the area of gray border and resize the image with gray borders directly to the target image size. Combining these two strategies, our models can maintain or even boost the performance without the degradation of inference speed.

然而,额外的灰色像素明显降低了推理速度。如果去掉灰色边框,YOLOv6 的性能就会下降,[10, 42] 也是如此。我们推测这个问题与 Mosaic 增强 [1, 10] 中的灰色边框填充有关。为了验证这一点,我们进行了在最后几个 epoch 关闭 Mosaic 增强的实验 [7](即淡化策略,fade strategy)。在此基础上,我们改变灰色边框的面积,并将带灰色边框的图像直接缩放到目标图像尺寸。结合这两种策略,我们的模型可以在不降低推理速度的情况下保持甚至提升性能。
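评估时常见的 letterbox 式预处理(按比例缩放后补灰边到目标尺寸)的几何参数计算可以示意如下(笔者补充;填充值 114 是 YOLOv5 的惯例而非论文给出,灰边面积的具体调整策略这里也只演示常见做法):

```python
# letterbox 预处理示意:按 min 比例缩放图像,再用灰色像素(常取 114)
# 补边到 target x target。这里只计算几何参数,不做实际图像操作。

def letterbox_params(img_w, img_h, target=640):
    scale = min(target / img_w, target / img_h)
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    pad_w, pad_h = target - new_w, target - new_h
    # 左右 / 上下各分一半灰边
    return new_w, new_h, pad_w // 2, pad_h // 2

new_w, new_h, pad_x, pad_y = letterbox_params(1280, 720)
assert (new_w, new_h) == (640, 360)   # 按 0.5 等比缩放
assert (pad_x, pad_y) == (0, 140)     # 上下各补 140 像素灰边
```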

2.5 Quantization and Deployment

For industrial deployment, it has been common practice to adopt quantization to further speed up runtime without much performance compromise. Post-training quantization (PTQ) directly quantizes the model with only a small calibration set. Whereas quantization-aware training (QAT) further improves the performance with the access to the training set, which is typically used jointly with distillation. However, due to the heavy use of re-parameterization blocks in YOLOv6, previous PTQ techniques fail to produce high performance, while it is hard to incorporate QAT when it comes to matching fake quantizers during training and inference. We here demonstrate the pitfalls and our cures during deployment.

在工业部署中,通常的做法是在不影响性能的情况下采用量化技术来进一步加快运行速度。训练后量化(PTQ)只需少量校准集就能直接量化模型。而量化感知训练(QAT)通过访问训练集进一步提高性能,通常与蒸馏联合使用。然而,由于在 YOLOv6 中大量使用了重参数化块,以前的 PTQ 技术无法产生高性能,而在训练和推理过程中匹配假量化器时,又很难结合 QAT。我们在此展示了部署过程中的陷阱和解决方法。

2.5.1 Reparameterizing Optimizer

RepOptimizer [2] proposes gradient re-parameterization at each optimization step. This technique also well solves the quantization problem of reparameterization-based models. We hence reconstruct the re-parameterization blocks of YOLOv6 in this fashion and train it with RepOptimizer to obtain PTQ-friendly weights. The distribution of feature map is largely narrowed (e.g. Fig. 4, more in B.1), which greatly benefits the quantization process, see Sec 3.5.1 for results.

RepOptimizer [2] 建议在每个优化步骤中进行梯度重参数化。这项技术也很好地解决了基于重参数化模型的量化问题。因此,我们用这种方法重构了 YOLOv6 的重参数化块,并用 RepOptimizer 对其进行训练,以获得 PTQ 友好的权重。特征图的分布在很大程度上缩小了(如图 4,更多信息请参见 B.1),这对量化过程大有裨益,结果请参见第 3.5.1 节。

[Fig. 4: feature-map distributions, narrowed by RepOptimizer training]
2.5.2 Sensitivity Analysis

We further improve the PTQ performance by partially converting quantization-sensitive operations into float computation. To obtain the sensitivity distribution, several metrics are commonly used: mean-square error (MSE), signal-to-noise ratio (SNR) and cosine similarity. Typically for comparison, one can pick the output feature map (after the activation of a certain layer) and calculate these metrics with and without quantization. As an alternative, it is also viable to compute validation AP by switching quantization on and off for a certain layer [29]. We compute all these metrics on the YOLOv6-S model trained with RepOptimizer and pick the top-6 sensitive layers to run in float. The full chart of the sensitivity analysis can be found in B.2.

我们通过将量化敏感运算部分转换为浮点运算,进一步提高了 PTQ 的性能。要获得灵敏度分布,通常需要使用几种指标:均方误差(MSE)、信噪比(SNR)和余弦相似度。通常情况下,为了进行比较,我们可以选取输出特征图(某一层激活后)来计算量化和未量化时的这些指标。作为一种替代方法,也可以通过开关某一层的量化来计算验证 AP [29]。我们在使用 RepOptimizer 训练的 YOLOv6-S 模型上计算了所有这些指标,并挑选了前 6 个敏感层进行浮动运行。敏感性分析的完整图表见 B.2。
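The layer-sensitivity metrics named above can be sketched as follows: compare a float feature map with its quantized counterpart under MSE, SNR and cosine similarity, then rank layers and keep the most sensitive ones in float. Helper names are illustrative, not taken from the YOLOv6 repository.

```python
# Per-layer quantization sensitivity metrics and top-k ranking (sketch).
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def snr_db(a, b):
    signal = sum(x * x for x in a)
    noise = sum((x - y) ** 2 for x, y in zip(a, b))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def topk_sensitive(layer_outputs, k):
    """layer_outputs: {layer_name: (float_fm, quantized_fm)}.
    Returns the k layers with the largest quantization MSE."""
    scored = {name: mse(f, q) for name, (f, q) in layer_outputs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

fm = [1.0, 2.0, 3.0]
layers = {"conv1": (fm, [1.1, 2.0, 3.0]), "conv2": (fm, [1.0, 2.5, 3.5])}
most_sensitive = topk_sensitive(layers, 1)   # layers to keep in float
```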

2.5.3 Quantization-aware Training with Channel-wise Distillation

In case PTQ is insufficient, we propose to involve quantization-aware training (QAT) to boost quantization performance. To resolve the problem of the inconsistency of fake quantizers during training and inference, it is necessary to build QAT upon the RepOptimizer. Besides, channel-wise distillation [36] (later as CW Distill) is adapted within the YOLOv6 framework, as shown in Fig. 5. This is also a self-distillation approach where the teacher network is the student itself in FP32 precision. See experiments in Sec 3.5.1.

如果 PTQ 不足,我们建议采用量化感知训练(QAT)来提高量化性能。为了解决 Fake-Quantizers 在训练和推理过程中的不一致性问题,有必要在 RepOptimizer 的基础上建立 QAT。此外,在 YOLOv6 框架内还采用了信道蒸馏法 [36](后称 CW Distill),如图 5 所示。这也是一种自蒸馏方法,教师网络就是 FP32 精确度的学生本身。参见第 3.5.1 节中的实验。

[Fig. 5: quantization-aware training with channel-wise distillation]
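The channel-wise distillation idea can be sketched like this: each channel's spatial activations are softmax-normalized into a distribution, and a KL divergence pulls the student's distribution toward the FP32 teacher's. A toy pure-Python illustration assuming flattened per-channel feature maps, not the authors' implementation.

```python
# Channel-wise distillation loss, sketched on flat lists.
import math

def channel_softmax(channel, temperature=1.0):
    exps = [math.exp(v / temperature) for v in channel]
    s = sum(exps)
    return [e / s for e in exps]

def cw_distill_loss(teacher_fm, student_fm, temperature=1.0):
    """Feature maps given as lists of channels; each channel is a flat
    list of spatial activations. Returns the mean per-channel KL."""
    loss = 0.0
    for t_ch, s_ch in zip(teacher_fm, student_fm):
        p = channel_softmax(t_ch, temperature)   # teacher distribution
        q = channel_softmax(s_ch, temperature)   # student distribution
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(teacher_fm)
```

In the QAT setting above, the teacher forward pass runs in FP32 while the student runs with fake quantizers inserted.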
3 Experiments
3.1. Implementation Details

We use the same optimizer and the learning schedule as YOLOv5 [10], i.e. stochastic gradient descent (SGD) with momentum and cosine decay on learning rate. Warm-up, grouped weight decay strategy and the exponential moving average (EMA) are also utilized. We adopt two strong data augmentations (Mosaic [1,10] and Mixup [49]) following [1,7,10]. A complete list of hyperparameter settings can be found in our released code. We train our models on the COCO 2017 [23] training set, and the accuracy is evaluated on the COCO 2017 validation set. All our models are trained on 8 NVIDIA A100 GPUs, and the speed performance is measured on an NVIDIA Tesla T4 GPU with TensorRT version 7.2 unless otherwise stated. And the speed performance measured with other TensorRT versions or on other devices is demonstrated in Appendix A.

我们使用的优化器和学习计划与 YOLOv5 [10]相同,即随机梯度下降(SGD),学习率采用动量和余弦衰减。我们还采用了热身、分组权重衰减策略和指数移动平均法(EMA)。我们沿用了[1,7,10]中的两种强数据增强方法(Mosaic [1,10] 和 Mixup [49])。超参数设置的完整列表可在我们发布的代码中找到。我们在 COCO 2017 [23] 训练集上训练模型,并在 COCO 2017 验证集上评估准确性。我们的所有模型都是在 8 个英伟达 A100 GPU 上训练的,除非另有说明,否则速度性能都是在英伟达 Tesla T4 GPU 和 TensorRT 7.2 版本上测量的。使用其他 TensorRT 版本或在其他设备上测量的速度性能见附录 A。
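The learning-rate schedule described above (warm-up plus cosine decay) can be written in a few lines; the warm-up length and rates below are placeholders, the real values live in the released config.

```python
# Linear warm-up followed by cosine decay, as used with SGD + momentum.
import math

def lr_at(step, total_steps, warmup_steps, base_lr, final_lr=0.0):
    if step < warmup_steps:                       # linear warm-up
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

schedule = [lr_at(s, 100, 10, 0.01) for s in range(101)]
```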

3.2. Comparisons

Considering that the goal of this work is to build networks for industrial applications, we primarily focus on the speed performance of all models after deployment, including throughput (FPS at a batch size of 1 or 32) and the GPU latency, rather than FLOPs or the number of parameters. We compare YOLOv6 with other state-of-the-art detectors of YOLO series, including YOLOv5 [10], YOLOX [7], PPYOLOE [45] and YOLOv7 [42]. Note that we test the speed performance of all official models with FP16-precision on the same Tesla T4 GPU with TensorRT [28]. The performance of YOLOv7-Tiny is re-evaluated according to their open-sourced code and weights at the input size of 416 and 640. Results are shown in Table 1 and Fig. 1. Compared with YOLOv5-N/YOLOv7-Tiny (input size=416), our YOLOv6-N has significantly advanced by 7.9%/2.6% respectively. It also comes with the best speed performance in terms of both throughput and latency. Compared with YOLOX-S/PPYOLOE-S, YOLOv6-S can improve AP by 3.0%/0.4% with higher speed. We compare YOLOv5-S and YOLOv7-Tiny (input size=640) with YOLOv6-T, our method is 2.9% more accurate and 73/25 FPS faster with a batch size of 1. YOLOv6-M outperforms YOLOv5-M by 4.2% higher AP with a similar speed, and it achieves 2.7%/0.6% higher AP than YOLOX-M/PPYOLOE-M at a higher speed. Besides, it is more accurate and faster than YOLOv5-L. YOLOv6-L is 2.8%/1.1% more accurate than YOLOX-L/PPYOLOE-L under the same latency constraint. We additionally provide a faster version of YOLOv6-L by replacing SiLU with ReLU (denoted as YOLOv6-L-ReLU). It achieves 51.7% AP with a latency of 8.8 ms, outperforming YOLOX-L/PPYOLOE-L/YOLOv7 in both accuracy and speed.

考虑到这项工作的目标是为工业应用构建网络,我们主要关注所有模型部署后的速度性能,包括吞吐量(批量大小为 1 或 32 时的 FPS)和 GPU 延迟,而不是 FLOPs 或参数数量。我们将 YOLOv6 与 YOLO 系列的其他先进探测器进行了比较,包括 YOLOv5 [10]、YOLOX [7]、PPYOLOE [45] 和 YOLOv7 [42]。请注意,我们使用 TensorRT [28],在同一 Tesla T4 GPU 上以 FP16 精确度测试了所有官方模型的速度性能。YOLOv7-Tiny 的性能是根据其开源代码和权重在输入大小为 416 和 640 时重新评估的。结果如表 1 和图 1 所示。与YOLOv5-N/YOLOv7-Tiny(输入大小=416)相比,我们的YOLOv6-N大幅提高了7.9%/2.6%。在吞吐量和延迟方面,它的速度表现也是最好的。与 YOLOX-S/PPYOLOE-S 相比,YOLOv6-S 的 AP 性能提高了 3.0%/0.4%,速度更高。我们将 YOLOv5-S 和 YOLOv7-Tiny(输入大小=640)与 YOLOv6-T 进行了比较,在批量大小为 1 的情况下,我们的方法准确率提高了 2.9%,速度提高了 73/25 FPS。在速度相近的情况下,YOLOv6-M 比 YOLOv5-M 的 AP 提高了 4.2%,在速度更高的情况下,它比 YOLOX-M/PPYOLOE-M 的 AP 提高了 2.7%/0.6%。此外,它比 YOLOv5-L 更准确、更快速。在相同的延迟限制下,YOLOv6-L 比 YOLOX-L/PPYOLOE-L 精确度高 2.8%/1.1%。我们还提供了一个更快的 YOLOv6-L 版本,用 ReLU 取代了 SiLU(记为 YOLOv6-L-ReLU)。它以 8.8 毫秒的延迟实现了 51.7% 的 AP,在准确性和速度上都优于 YOLOX-L/PPYOLOE-L/YOLOv7。

[Table 1: comparison of YOLOv6 with other YOLO-series detectors]
3.3. Ablation Study
3.3.1 Network

Backbone and neck We explore the influence of the single-path structure and the multi-branch structure on backbones and necks, as well as the channel coefficient (denoted as CC) of the CSPStackRep Block. All models described in this part adopt TAL as the label assignment strategy, VFL as the classification loss, and GIoU with DFL as the regression loss. Results are shown in Table 2. We find that the optimal network structure differs for models of different sizes.

骨干和颈部:我们探讨了单路径结构和多分支结构对骨干和颈部的影响,以及 CSPStackRep Block 的信道系数(表示为 CC)。本部分描述的所有模型都采用 TAL 作为标签分配策略,VFL 作为分类损失,GIoU 和 DFL 作为回归损失。结果如表 2 所示。我们发现,不同规模模型的最优网络结构应该有不同的解决方案。

[Table 2: ablation on backbone/neck block style and channel coefficient]

For YOLOv6-N, the single-path structure outperforms the multi-branch structure in terms of both accuracy and speed. Although the single-path structure has more FLOPs and parameters than the multi-branch structure, it could run faster due to a relatively lower memory footprint and a higher degree of parallelism. For YOLOv6-S, the two block styles bring similar performance. When it comes to larger models, multi-branch structure achieves better performance in accuracy and speed. And we finally select multi-branch with a channel coefficient of 2/3 for YOLOv6-M and 1/2 for YOLOv6-L.

对于 YOLOv6-N,单路径结构在准确性和速度方面都优于多分支结构。虽然单路径结构比多分支结构有更多的 FLOPs 和参数,但由于内存占用相对较少,并行程度较高,因此运行速度更快。对于 YOLOv6-S,两种分块方式带来了相似的性能。对于较大的模型,多分支结构在精度和速度上都有更好的表现。最后,我们为 YOLOv6-M 选择了通道系数为 2/3 的多分支结构,为 YOLOv6-L 选择了通道系数为 1/2 的多分支结构。

Furthermore, we study the influence of width and depth of the neck on YOLOv6-L. Results in Table 3 show that the slender neck performs 0.2% better than the wide-shallow neck with the similar speed.

此外,我们还研究了颈部宽度和深度对 YOLOv6-L 的影响。表 3 中的结果显示,在速度相近的情况下,细颈比宽浅颈的性能高 0.2%。

[Table 3: ablation on neck width and depth for YOLOv6-L]

Combinations of convolutional layers and activation functions The YOLO series has adopted a wide range of activation functions: ReLU [27], LReLU [25], Swish [31], SiLU [4], Mish [26] and so on. Among these, SiLU is the most used. Generally speaking, SiLU yields better accuracy without too much extra computation cost. However, when it comes to industrial applications, especially for deploying models with TensorRT [28] acceleration, ReLU has a greater speed advantage because it can be fused into the preceding convolution. Moreover, we further verify the effectiveness of combinations of RepConv/ordinary convolution (denoted as Conv) and ReLU/SiLU/LReLU in networks of different sizes to achieve a better trade-off. As shown in Table 4, Conv with SiLU performs best in accuracy while the combination of RepConv and ReLU achieves a better trade-off. We suggest users adopt RepConv with ReLU in latency-sensitive applications. We choose the RepConv/ReLU combination in YOLOv6-N/T/S/M for higher inference speed, and the Conv/SiLU combination in the large model YOLOv6-L to speed up training and improve performance.

卷积层与激活函数的组合:YOLO 系列采用了多种激活函数,如 ReLU [27]、LReLU [25]、Swish [31]、SiLU [4]、Mish [26] 等。在这些激活函数中,SiLU 使用最多。一般来说,SiLU 的精度更高,而且不会产生太多额外的计算成本。此外,我们还进一步验证了 RepConv/ordinary convolution(简称 Conv)和 ReLU/SiLU/LReLU 在不同规模网络中的组合效果,以实现更好的权衡。如表 4 所示,Conv 与 SiLU 的准确率最高,而 RepConv 与 ReLU 的组合则实现了更好的权衡。我们建议用户在对延迟敏感的应用中采用 RepConv 和 ReLU。我们选择在 YOLOv6-N/T/S/M 中使用 RepConv/ReLU 组合以提高推理速度,并在大型模型 YOLOv6-L 中使用 Conv/SiLU 组合以加快训练并提高性能。

[Table 4: combinations of convolution types and activation functions]

Miscellaneous design We also conduct a series of ablation on other network parts mentioned in Section 2.1 based on YOLOv6-N. We choose YOLOv5-N as the baseline and add other components incrementally. Results are shown in Table 5. Firstly, with decoupled head (denoted as DH), our model is 1.4% more accurate with 5% increase in time cost. Secondly, we verify that the anchor-free paradigm is 51% faster than the anchor-based one for its 3 less predefined anchors, which results in less dimensionality of the output. Further, the unified modification of the backbone (EfficientRep Backbone) and the neck (Rep-PAN neck), denoted as EB+RN, brings 3.6% AP improvements, and runs 21% faster. Finally, the optimized decoupled head (hybrid channels, HC) brings 0.2% AP and 6.8% FPS improvements in accuracy and speed respectively.

其他设计:我们还基于 YOLOv6-N 对第 2.1 节中提到的其他网络部分进行了一系列消融。我们选择 YOLOv5-N 作为基线,并逐步添加其他部分。结果如表 5 所示。首先,在去耦头部(用 DH 表示)的情况下,我们的模型精确度提高了 1.4%,但时间成本增加了 5%。其次,我们验证了无锚范式比基于锚的范式快 51%,因为它减少了 3 个预定义锚,从而降低了输出维度。此外,对骨干(EfficientRep Backbone)和颈部(Rep-PAN neck)的统一修改(表示为 EB+RN)带来了 3.6% 的 AP 改进,运行速度提高了 21%。最后,经过优化的去耦头部(混合通道,HC)在精度和速度上分别带来了 0.2% 的 AP 和 6.8% 的 FPS 改进。

[Table 5: incremental ablation of miscellaneous designs on YOLOv6-N]
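A back-of-the-envelope check of the anchor-free claim above: assuming a YOLOv5-style head that predicts (4 box + 1 objectness + num_classes) values per anchor, dropping the 3 predefined anchors cuts the head output threefold. Illustrative arithmetic only, not the actual head code.

```python
# Output size of a detection head, per the anchor-based vs anchor-free
# comparison in the ablation above.

def head_outputs(grid_cells, num_classes, num_anchors):
    per_cell = num_anchors * (5 + num_classes)   # 4 box + 1 obj + classes
    return grid_cells * per_cell

# three FPN levels at a 640x640 input with strides 8/16/32
grid_cells = 80 * 80 + 40 * 40 + 20 * 20
anchor_based = head_outputs(grid_cells, 80, num_anchors=3)
anchor_free = head_outputs(grid_cells, 80, num_anchors=1)
```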
3.3.2 Label Assignment

In Table 6, we analyze the effectiveness of mainstream label assignment strategies. Experiments are conducted on YOLOv6-N. As expected, we observe that SimOTA and TAL are the two best strategies. Compared with ATSS, SimOTA increases AP by 2.0%, and TAL brings 0.5% higher AP than SimOTA. Considering the stable training and better accuracy of TAL, we adopt it as our label assignment strategy.

表 6 分析了主流标签分配策略的有效性。实验是在 YOLOv6-N 上进行的。不出所料,我们发现 SimOTA 和 TAL 是最好的两种策略。与 ATSS 相比,SimOTA 可以将 AP 提高 2.0%,而 TAL 则比 SimOTA 提高了 0.5%。考虑到 TAL 具有稳定的训练效果和更好的准确率,我们采用 TAL 作为标签分配策略。

[Table 6: comparison of label assignment strategies]
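The task-alignment metric behind TAL (as defined in the TOOD paper) scores each candidate as t = s^α · u^β, where s is the predicted classification score and u the IoU with the ground-truth box, and then takes the top-k candidates as positives. A sketch with TOOD's default α/β, not the training code itself.

```python
# TAL's task-alignment metric and top-k candidate selection (sketch).

def tal_metric(score, iou, alpha=1.0, beta=6.0):
    return (score ** alpha) * (iou ** beta)

def select_topk(candidates, k):
    """candidates: list of (score, iou); returns indices of the top-k."""
    t = [tal_metric(s, u) for s, u in candidates]
    return sorted(range(len(t)), key=lambda i: t[i], reverse=True)[:k]

# a high score with poor IoU loses to a well-localized candidate
cands = [(0.9, 0.2), (0.7, 0.8), (0.3, 0.9)]
```

The β exponent strongly favors localization quality, which is exactly what makes an extra objectness branch redundant (see the object-loss discussion in Sec 3.3.3).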

In addition, the implementation of TOOD [5] adopts ATSS [51] as the warm-up label assignment strategy during the early training epochs. We also retain the warm-up strategy and further explore it. Details are shown in Table 7, and we find that training without warm-up, or warmed up by another strategy (i.e., SimOTA), achieves similar performance.

此外,TOOD[5]的实现采用了 ATSS[51] 作为早期训练期的热身标签分配策略。我们也保留了热身策略,并对其进行了进一步的探索。具体情况如表 7 所示,我们可以发现,不采用预热策略或采用其他预热策略(如 SimOTA)也能达到类似的性能。

[Table 7: warm-up label assignment strategies]
3.3.3 Loss functions

In the object detection framework, the loss function is composed of a classification loss, a box regression loss and an optional object loss, which can be formulated as follows:

在目标检测框架中,损失函数由分类损失、box 回归损失和可选目标损失组成,可表述如下:
$$
L_{det} = L_{cls} + \lambda L_{reg} + \mu L_{obj} \qquad (3)
$$
where $L_{cls}$, $L_{reg}$ and $L_{obj}$ are the classification, regression and object losses, and $\lambda$ and $\mu$ are hyperparameters that balance them.

其中,$L_{cls}$、$L_{reg}$ 和 $L_{obj}$ 分别为分类损失、回归损失和目标损失;$\lambda$ 和 $\mu$ 是用于平衡各项的超参数。

In this subsection, we evaluate each loss function on YOLOv6. Unless otherwise specified, the baselines for YOLOv6-N, YOLOv6-S and YOLOv6-M are 35.0%, 42.9% and 48.0% AP, trained with TAL, Focal Loss and GIoU Loss.

在本小节中,我们将在 YOLOv6 上对每个损失函数进行评估。除非另有说明,YOLOv6-N、YOLOv6-S 和 YOLOv6-M 的基线分别为使用 TAL、Focal Loss 和 GIoU Loss 训练的 35.0%、42.9% 和 48.0%。

Classification Loss We experiment with Focal Loss [22], PolyLoss [17], QFL [20] and VFL [50] on YOLOv6-N/S/M. As can be seen in Table 8, VFL brings 0.2%/0.3%/0.1% AP improvements on YOLOv6-N/S/M respectively compared with Focal Loss. We choose VFL as the classification loss function.

分类损失 我们在 YOLOv6-N/S/M 上试验了 Focal Loss [22]、Polyloss [17]、QFL [20] 和 VFL [50]。从表 8 中可以看出,与 Focal Loss 相比,VFL 在 YOLOv6-N/S/M 上带来的 AP 改进分别为 0.2%/0.3%/0.1%。我们选择 VFL 作为分类损失函数。

[Table 8: classification loss comparison]
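For reference, a single-element sketch of Varifocal Loss as published: positives (q > 0, where q is the IoU-aware target score) use a q-weighted binary cross-entropy, while negatives are down-weighted focally by α·p^γ. Ours for illustration; the YOLOv6 implementation may differ in details.

```python
# Varifocal Loss (VFL) for one prediction p and one target q (sketch).
import math

def vfl(p, q, alpha=0.75, gamma=2.0):
    eps = 1e-12
    if q > 0:   # positive sample: IoU-weighted binary cross-entropy
        return -q * (q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps))
    # negative sample: focally down-weighted
    return -alpha * (p ** gamma) * math.log(1 - p + eps)
```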

Regression Loss IoU-series and probability loss functions are both experimented with on YOLOv6-N/S/M.

我们在 YOLOv6-N/S/M 上试验了回归损失 IoU 系列和概率损失函数。

The latest IoU-series losses are utilized in YOLOv6-N/S/M. Experiment results in Table 9 show that SIoU Loss outperforms others for YOLOv6-N and YOLOv6-T, while CIoU Loss performs better on YOLOv6-M.

YOLOv6-N/S/M 采用了最新的 IoU 系列损失。表 9 中的实验结果显示,在 YOLOv6-N 和 YOLOv6-T 中,SIoU Loss 的性能优于其他函数,而在 YOLOv6-M 中,CIoU Loss 的性能更好。

[Table 9: IoU-series regression loss comparison]
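The IoU-series losses all share one skeleton; as a reference point, here is the GIoU loss used as our regression baseline, for boxes given as (x1, y1, x2, y2). CIoU adds a center-distance and aspect-ratio penalty on top of this, and SIoU further introduces an angle term; this sketch covers only the common base.

```python
# GIoU loss: 1 - (IoU - (enclosing_area - union) / enclosing_area).

def giou_loss(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area
    return 1 - giou
```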

For probability losses, as listed in Table 10, introducing DFL can obtain 0.2%/0.1%/0.2% performance gain for YOLOv6-N/S/M respectively. However, the inference speed is greatly affected for small models. Therefore, DFL is only introduced in YOLOv6-M/L.

对于概率损失,如表 10 所列,引入 DFL 可使 YOLOv6-N/S/M 的性能分别提高 0.2%/0.1%/0.2%。但是,对于小模型,推理速度会受到很大影响。因此,只在 YOLOv6-M/L 中引入了 DFL。

[Table 10: effect of DFL]
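DFL's idea, sketched: the network emits logits over integer bins for each box offset, the loss is a cross-entropy on the two bins surrounding the continuous target, and the offset is decoded as the expectation of the bin distribution. The bin count below is ours for illustration.

```python
# Distribution Focal Loss (DFL) target and decoding (sketch).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dfl_expectation(logits):
    """Decode the predicted offset as the mean of the bin distribution."""
    return sum(i * p for i, p in enumerate(softmax(logits)))

def dfl_loss(logits, target):
    """Cross-entropy on the two integer bins around `target`;
    target must lie strictly inside [0, len(logits) - 1)."""
    probs = softmax(logits)
    lo = int(target)
    w_hi = target - lo               # weight of the upper bin
    return -((1 - w_hi) * math.log(probs[lo]) + w_hi * math.log(probs[lo + 1]))
```

The extra softmax-and-expectation per box hints at why DFL measurably slows the small models, as noted above.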

Object Loss Object loss is also experimented with in YOLOv6, as shown in Table 11. From Table 11, we can see that the object loss has negative effects on the YOLOv6-N/S/M networks, with a maximum decrease of 1.1% AP on YOLOv6-N. The negative gain may come from the conflict between the object branch and the other two branches in TAL. Specifically, in the training stage, the IoU between predicted boxes and ground-truth boxes, as well as the classification scores, are jointly used to build a metric as the criterion to assign labels. However, the introduced object branch extends the number of tasks to be aligned from two to three, which obviously increases the difficulty. Based on the experimental results and this analysis, the object loss is discarded in YOLOv6.

目标损失 YOLOv6 还对目标损失进行了实验,如表 11 所示。从表 11 中可以看出,目标丢失对 YOLOv6-N/S/M 网络有负面影响,其中在 YOLOv6-N 上的最大降幅为 1.1% AP。负增益可能来自 TAL 中目标分支与其他两个分支之间的冲突。具体来说,在训练阶段,预测 box 和ground truth box 之间的 IoU 以及分类分数被用来共同建立一个度量标准,作为分配标签的标准。然而,引入的目标分支将需要对齐的任务数量从两个扩展到三个,这显然增加了难度。基于实验结果和上述分析,YOLOv6 中便舍弃了目标损失。

[Table 11: effect of the object loss]
3.4. Industry-handy Improvements

More training epochs In practice, training for more epochs is a simple and effective way to further increase accuracy. Results of our small models trained for 300 and 400 epochs are shown in Table 12. We observe that longer training substantially boosts AP by 0.4%, 0.6% and 0.5% for YOLOv6-N/T/S respectively. Considering the acceptable cost and the produced gain, training for 400 epochs is a better convergence scheme for YOLOv6.

更多训练历元:在实践中,增加训练历元是进一步提高准确率的简单而有效的方法。表 12 列出了我们的小型模型在 300 和 400 个训练历元下的训练结果。我们观察到,对于 YOLOv6-N/T/S 而言,更长的训练分别将 AP 大幅提高了 0.4%、0.6% 和 0.5%。考虑到可接受的成本和产生的增益,这表明对 YOLOv6 而言,训练 400 个历元是更好的收敛方案。

[Table 12: training for 300 vs. 400 epochs]

Self-distillation We conducted detailed experiments to verify the proposed self-distillation method on YOLOv6-L. As can be seen in Table 13, applying self-distillation only on the classification branch brings a 0.4% AP improvement. Furthermore, additionally performing self-distillation on the box regression task gives a 0.3% AP increase. The introduction of weight decay boosts the model by 0.6% AP.

自蒸馏:我们在 YOLOv6-L 上进行了详细的实验来验证所提出的自蒸馏方法。从表 13 中可以看出,仅在分类分支上应用自蒸馏可以带来 0.4% 的 AP 改进。此外,我们只需在 box 回归任务中执行自蒸馏,就能获得 0.3% 的 AP 提升。引入权重衰减后,模型的 AP 提高了 0.6%。

[Table 13: self-distillation ablation on YOLOv6-L]

Gray border of images In Section 2.4.3, we introduce a strategy to solve the problem of performance degradation without extra gray borders. Experimental results are shown in Table 14. In these experiments, YOLOv6-N and YOLOv6-S are trained for 400 epochs and YOLOv6-M for 300 epochs. It can be observed that the accuracy of YOLOv6-N/S/M is lowered by 0.4%/0.5%/0.7% without Mosaic fading when the gray border is removed. However, the degradation shrinks to 0.2%/0.5%/0.5% when adopting Mosaic fading, from which we find that, on the one hand, the problem of performance degradation is mitigated; on the other hand, the accuracy of small models (YOLOv6-N/S) is improved whether we pad gray borders or not. Moreover, we limit the input images to 634×634 and add a 3-pixel-wide gray border around the edges (more results can be found in Appendix C). With this strategy, the final images reach the expected size of 640×640. The results in Table 14 indicate that the final performance of YOLOv6-N/S/M is even 0.2%/0.3%/0.1% more accurate with the final image size reduced from 672 to 640.

图像灰边界 在第 2.4.3 节中,我们引入了一种策略来解决没有额外灰边界的性能下降问题。实验结果如表 14 所示。在这些实验中,YOLOv6-N 和 YOLOv6-S 训练了 400 个 epochs,YOLOv6-M 训练了 300 个 epochs。可以看出,在没有马赛克衰减的情况下,去除灰色边界后,YOLOv6-N/S/M 的准确率降低了 0.4%/0.5%/0.7%。然而,采用马赛克衰减后,性能下降幅度变为 0.2%/0.5%/0.5%,由此我们发现,一方面,性能下降的问题得到了缓解;另一方面,无论是否采用灰边界,小模型(YOLOv6-N/S)的精度都会得到提高。此外,我们将输入图像限制为 634×634,并在边缘添加 3 像素宽的灰色边框(更多结果见附录 C)。通过这种策略,最终图像的大小达到了预期的 640×640。表 14 中的结果表明,当最终图像大小从 672 缩小到 640 时,YOLOv6-N/S/M 的最终性能甚至提高了 0.2%/0.3%/0.1%。

[Table 14: gray border and Mosaic fading]
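The size arithmetic behind the gray-border strategy above is simply: resize the image content so it fits inside target minus twice the border, then pad the border back on every edge. A sketch of the sizes only, not the augmentation code.

```python
# Gray-border size bookkeeping: content size + 2 * border = network input.

def bordered_sizes(target=640, border=3):
    resized = target - 2 * border     # image content, e.g. 634
    final = resized + 2 * border      # network input, e.g. 640
    return resized, final
```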
3.5. Quantization Results

We take YOLOv6-S as an example to validate our quantization method. The following experiments are conducted on both releases (v1.0 and v2.0). The baseline model is trained for 300 epochs.

我们以 YOLOv6-S 为例来验证我们的量化方法。下面的实验在这两个版本上进行。基线模型训练了 300 个 epoch。

3.5.1 PTQ

The average performance is substantially improved when the model is trained with RepOptimizer, see Table 15. Models trained with RepOptimizer run at nearly identical speed, while their float accuracy stays essentially the same.

使用 RepOptimizer 对模型进行训练后,平均性能大幅提高,见表 15。RepOptimizer 一般速度更快,而且几乎完全相同。

[Table 15: PTQ with and without RepOptimizer]
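Why the narrowed feature distribution from RepOptimizer training (Fig. 4) helps PTQ: with a symmetric int8 quantizer, the scale, and hence the worst-case rounding error, grows linearly with the activation range. A toy calculation of ours, not from the paper.

```python
# Worst-case int8 rounding error as a function of the activation range.

def int8_scale(max_abs):
    # one int8 step covers this much of the float range
    return max_abs / 127.0

def worst_case_rounding_error(max_abs):
    # a value can land at most half a step away from the int8 grid
    return int8_scale(max_abs) / 2.0

wide, narrow = 8.0, 2.0     # hypothetical activation ranges
```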
3.5.2 QAT

For v1.0, we apply fake quantizers to the non-sensitive layers identified in Section 2.5.2 to perform quantization-aware training, and call this partial QAT. We compare the result with full QAT in Table 16. Partial QAT leads to better accuracy with a slightly reduced throughput.

对于 v1.0,我们将第 2.5.2 节中获得的非敏感层应用 Fake-Quantizers 来执行量化感知训练,并称之为部分 QAT。我们在表 16 中将结果与完全 QAT 进行了比较。部分 QAT 提高了准确性,但吞吐量略有降低。

[Table 16: partial vs. full QAT (v1.0)]

Due to the removal of quantization-sensitive layers in v2.0 release, we directly use full QAT on YOLOv6-S trained with RepOptimizer. We eliminate inserted quantizers through graph optimization to obtain higher accuracy and faster speed. We compare the distillation-based quantization results from PaddleSlim [30] in Table 17. Note our quantized version of YOLOv6-S is the fastest and the most accurate, also see Fig. 1.

由于在 v2.0 版本中取消了量化敏感层,我们直接在使用 RepOptimizer 训练的 YOLOv6-S 上使用完全 QAT。我们通过图优化消除了插入的量化器,以获得更高的精度和更快的速度。表 17 比较了 PaddleSlim [30] 基于蒸馏的量化结果。请注意,我们的 YOLOv6-S 量化版本速度最快、准确度最高,也请参见图 1。

[Table 17: quantized YOLOv6-S comparison]
4 Conclusion

In a nutshell, with the persistent industrial requirements in mind, we present the current form of YOLOv6, carefully examining all the advancements of components of object detectors up to date, meantime instilling our thoughts and practices. The result surpasses other available real-time detectors in both accuracy and speed. For the convenience of the industrial deployment, we also supply a customized quantization method for YOLOv6, rendering an ever-fast detector out-of-box. We sincerely thank the academic and industrial community for their brilliant ideas and endeavors. In the future, we will continue expanding this project to meet higher standards and more demanding scenarios.

总之,考虑到持续的工业要求,我们提出了当前形式的 YOLOv6,仔细研究了迄今为止目标检测器组件的所有进展,同时灌输了我们的想法和做法。其结果在精度和速度上都超越了其他现有的实时探测器。为了方便工业应用,我们还为 YOLOv6 提供了定制的量化方法,使其成为一个开箱即用的快速检测器。我们衷心感谢学术界和工业界的杰出创意和努力。未来,我们将继续拓展这一项目,以满足更高标准和更苛刻的应用场景。
