YOLOv9 Translation

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Chien-Yao Wang1,2, I-Hau Yeh2, and Hong-Yuan Mark Liao1,2,3 1Institute of Information Science, Academia Sinica, Taiwan 2National Taipei University of Technology, Taiwan 3Department of Information and Computer Engineering, Chung Yuan Christian University, Taiwan kinyiu@iis.sinica.edu.tw, ihyeh@emc.com.tw, and liao@iis.sinica.edu.tw

[Figure 1]
Abstract

Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information will be lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning is designed. GELAN's architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO dataset-based object detection. The results show that GELAN, using only conventional convolution operators, achieves better parameter utilization than state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1. The source code is at: https://github.com/WongKinYiu/yolov9.

1 Introduction

Deep learning-based models have demonstrated far better performance than past artificial intelligence systems in various fields, such as computer vision, language processing, and speech recognition. In recent years, researchers in the field of deep learning have mainly focused on how to develop more powerful system architectures and learning methods, such as CNNs [21–23, 42, 55, 71, 72], Transformers [8, 9, 40, 41, 60, 69, 70], Perceivers [26, 32, 52, 56, 81], and Mambas [17, 38, 80]. In addition, some researchers have tried to develop more general objective functions, such as loss functions [5, 45, 46, 50, 77, 78], label assignment [10, 12, 33, 67, 79], and auxiliary supervision [18, 20, 24, 28, 29, 51, 54, 68, 76]. The above studies all try to precisely find the mapping between input and target tasks. However, most past approaches have ignored the fact that input data may suffer a non-negligible amount of information loss during the feedforward process. This loss of information can lead to biased gradient flows, which are subsequently used to update the model. These problems can cause deep networks to establish incorrect associations between targets and inputs, causing the trained model to produce incorrect predictions.

In deep networks, the phenomenon of input data losing information during the feedforward process is commonly known as the information bottleneck [59], and its schematic diagram is shown in Figure 2. At present, the main methods that can alleviate this phenomenon are as follows: (1) the use of reversible architectures [3, 16, 19]: this method mainly uses repeated input data and maintains the information of the input data in an explicit way; (2) the use of masked modeling [1, 6, 9, 27, 71, 73]: it mainly uses reconstruction loss and adopts an implicit way to maximize the extracted features and retain the input information; and (3) the introduction of the deep supervision concept [28, 51, 54, 68]: it uses shallow features that have not lost too much important information to pre-establish a mapping from features to targets to ensure that important information can be transferred to deeper layers. However, the above methods have different drawbacks in the training and inference processes. For example, a reversible architecture requires additional layers to combine repeatedly fed input data, which significantly increases the inference cost. In addition, since the path from the input layer to the output layer cannot be too deep, this limitation makes it difficult to model high-order semantic information during the training process. As for masked modeling, its reconstruction loss sometimes conflicts with the target loss. In addition, most mask mechanisms also produce incorrect associations with data. For the deep supervision mechanism, it produces error accumulation, and if the shallow supervision loses information during the training process, the subsequent layers will not be able to retrieve the required information. The above phenomena are more significant on difficult tasks and small models.

[Figure 2]

Figure 2. Visualization results of random initial weight output feature maps for different network architectures: (a) input image, (b) PlainNet, (c) ResNet, (d) CSPNet, and (e) proposed GELAN. From the figure, we can see that in different architectures, the information provided to the objective function to calculate the loss is lost to varying degrees, and our architecture can retain the most complete information and provide the most reliable gradient information for calculating the objective function.

To address the above-mentioned issues, we propose a new concept, programmable gradient information (PGI). The concept is to generate reliable gradients through an auxiliary reversible branch, so that the deep features can still maintain the key characteristics needed for executing the target task. The design of the auxiliary reversible branch can avoid the semantic loss that may be caused by a traditional deep supervision process that integrates multi-path features. In other words, we are programming gradient information propagation at different semantic levels, and thereby achieving the best training results. The reversible architecture of PGI is built on the auxiliary branch, so there is no additional cost. Since PGI can freely select a loss function suitable for the target task, it also overcomes the problems encountered by masked modeling. The proposed PGI mechanism can be applied to deep neural networks of various sizes and is more general than the deep supervision mechanism, which is only suitable for very deep neural networks.

In this paper, we also design generalized ELAN (GELAN) based on ELAN [65]; the design of GELAN simultaneously takes into account the number of parameters, computational complexity, accuracy, and inference speed. This design allows users to arbitrarily choose appropriate computational blocks for different inference devices. We combined the proposed PGI and GELAN, and then designed a new generation of the YOLO series object detection system, which we call YOLOv9. We used the MS COCO dataset to conduct experiments, and the experimental results verified that our proposed YOLOv9 achieves top performance in all comparisons.

We summarize the contributions of this paper as follows:

  1. We theoretically analyzed the existing deep neural network architectures from the perspective of reversible functions, and through this process we successfully explained many phenomena that were difficult to explain in the past. We also designed PGI and the auxiliary reversible branch based on this analysis and achieved excellent results.

  2. The PGI we designed solves the problem that deep supervision can only be used for extremely deep neural network architectures, and therefore allows new lightweight architectures to be truly applied in daily life.

  3. The GELAN we designed uses only conventional convolution to achieve higher parameter utilization than state-of-the-art designs based on depth-wise convolution, while showing the great advantages of being light, fast, and accurate.

  4. Combining the proposed PGI and GELAN, the object detection performance of YOLOv9 on the MS COCO dataset greatly surpasses existing real-time object detectors in all aspects.

2 Related work
2.1. Real-time Object Detectors

The current mainstream real-time object detectors are the YOLO series [2, 7, 13–15, 25, 30, 31, 47–49, 61–63, 74, 75], and most of these models use CSPNet [64] or ELAN [65] and their variants as the main computing units. In terms of feature integration, an improved PAN [37] or FPN [35] is often used as a tool, and then an improved YOLOv3 head [49] or FCOS head [57, 58] is used as the prediction head. Recently some real-time object detectors, such as RT DETR [43], which builds its foundation on DETR [4], have also been proposed. However, since it is extremely difficult for DETR series object detectors to be applied to new domains without a corresponding domain pre-trained model, the most widely used real-time object detector at present is still the YOLO series. This paper chooses YOLOv7 [63], which has been proven effective in a variety of computer vision tasks and scenarios, as a base to develop the proposed method. We use GELAN to improve the architecture and use the proposed PGI to improve the training process. The above novel approach makes the proposed YOLOv9 the top real-time object detector of the new generation.

2.2. Reversible Architectures

The operation units of reversible architectures [3, 16, 19] must maintain the characteristic of reversible conversion, which ensures that the output feature map of each layer of operation units can retain complete original information. RevCol [3] generalizes the traditional reversible unit to multiple levels, and in doing so expands the semantic levels expressed by different layer units. Through a literature review of various neural network architectures, we found that there are many high-performing architectures with varying degrees of reversible properties. For example, the Res2Net module [11] combines different input partitions with the next partition in a hierarchical manner, and concatenates all converted partitions before passing them backwards. CBNet [34, 39] re-introduces the original input data through a composite backbone to obtain complete original information, and obtains multi-level reversible information of different levels through various composition methods. These network architectures generally have excellent parameter utilization, but the extra composite layers cause slow inference speeds. DynamicDet [36] combines CBNet [34] and the high-efficiency real-time object detector YOLOv7 [63] to achieve a very good trade-off among speed, number of parameters, and accuracy. This paper introduces the DynamicDet architecture as the basis for designing reversible branches. In addition, reversible information is further introduced into the proposed PGI. The proposed new architecture does not require additional connections during the inference process, so it can fully retain the advantages in speed, parameter count, and accuracy.

2.3. Auxiliary Supervision

Deep supervision [28, 54, 68] is the most common auxiliary supervision method, which performs training by inserting additional prediction layers in the middle layers. The multi-layer decoders introduced in transformer-based methods are an especially common example. Another common auxiliary supervision method is to utilize relevant meta information to guide the feature maps produced by the intermediate layers and make them have the properties required by the target tasks [18, 20, 24, 29, 76]. Examples of this type include using segmentation loss or depth loss to enhance the accuracy of object detectors. Recently, there have been many reports in the literature [53, 67, 82] that use different label assignment methods to generate different auxiliary supervision mechanisms to speed up the convergence of the model and improve its robustness at the same time. However, the auxiliary supervision mechanism is usually only applicable to large models, so when it is applied to lightweight models, it easily causes an under-parameterization phenomenon, which makes the performance worse. Our proposed PGI designs a way to reprogram multi-level semantic information, and this design allows lightweight models to also benefit from the auxiliary supervision mechanism.

3 Problem Statement

Usually, people attribute the difficulty of deep neural network convergence to factors such as vanishing gradients or gradient saturation, and these phenomena do exist in traditional deep neural networks. However, modern deep neural networks have already fundamentally solved the above problem by designing various normalization and activation functions. Nevertheless, deep neural networks still have the problem of slow convergence or poor convergence results.

In this paper, we explore the nature of the above issue further. Through in-depth analysis of the information bottleneck, we deduced that the root cause of this problem is that the initial gradient, originally coming from a very deep network, loses a lot of the information needed to achieve the goal soon after it is transmitted. To confirm this inference, we perform feedforward passes through deep networks of different architectures using initial weights, and then visualize and illustrate them in Figure 2. Obviously, PlainNet has lost a lot of the important information required for object detection in its deep layers. As for the proportion of important information that ResNet, CSPNet, and GELAN can retain, it is indeed positively related to the accuracy that can be obtained after training. We further design reversible network-based methods to address the causes of the above problems. In this section we elaborate our analysis of the information bottleneck principle and reversible functions.

3.1 Information Bottleneck Principle

According to the information bottleneck principle, we know that data $X$ may suffer information loss when going through transformation, as shown in Eq. 1 below:

$$I(X,X) \geq I(X, f_\theta(X)) \geq I(X, g_\phi(f_\theta(X))) \tag{1}$$
where $I$ indicates mutual information, $f$ and $g$ are transformation functions, and $\theta$ and $\phi$ are the parameters of $f$ and $g$, respectively.

In deep neural networks, $f_\theta(\cdot)$ and $g_\phi(\cdot)$ respectively represent the operations of two consecutive layers. From Eq. 1, we can predict that as the number of network layers grows, the original data will be more likely to be lost. However, the parameters of a deep neural network are updated based on the output of the network as well as the given target: the loss function is calculated to generate new gradients, which are then used to update the network. As one can imagine, the output of a deeper neural network is less able to retain complete information about the prediction target. This makes it possible for incomplete information to be used during network training, resulting in unreliable gradients and poor convergence.

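To make the consequence of Eq. 1 concrete, the following is a minimal PyTorch sketch (our own illustration, not code from the paper) that pushes an input through a stack of randomly initialized convolutional layers and tracks the correlation between the input and a channel-mean summary of the features as a crude stand-in for the mutual information $I(X, f_\theta(X))$; the depth and channel widths are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 3, 64, 64)                  # stand-in input data X

# a plain stack of random conv layers playing the role of f_theta, g_phi, ...
layers = nn.ModuleList(
    nn.Conv2d(3 if i == 0 else 16, 16, 3, padding=1) for i in range(21)
)

ref = x.mean(dim=1).flatten()                  # channel-mean summary of the input
feat = x
for i, layer in enumerate(layers):
    feat = torch.relu(layer(feat))
    cur = feat.mean(dim=1).flatten()           # same-size summary of current features
    proxy = torch.corrcoef(torch.stack([ref, cur]))[0, 1].abs()
    if i % 5 == 0:
        print(f"layer {i:2d}: |corr(input, features)| = {proxy:.3f}")
```

The printed correlation is only a rough proxy, but it typically decays with depth, mirroring the monotone chain of inequalities in Eq. 1.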

One way to solve the above problem is to directly increase the size of the model. When we use a large number of parameters to construct a model, it is more capable of performing a complete transformation of the data. With this approach, even if information is lost during the feedforward process, there is still a chance to retain enough information to perform the mapping to the target. This phenomenon explains why width is more important than depth in most modern models. However, this conclusion cannot fundamentally solve the problem of unreliable gradients in very deep neural networks. Below, we introduce how reversible functions can be used to solve the problem and present the relevant analysis.

3.2 Reversible Functions

When a function $r$ has an inverse transformation function $v$, we call it a reversible function, as shown in Eq. 2.

$$X = v_\zeta(r_\psi(X)) \tag{2}$$
where $\psi$ and $\zeta$ are the parameters of $r$ and $v$, respectively. Data $X$ is converted by a reversible function without losing information, as shown in Eq. 3.

$$I(X,X) = I(X, r_\psi(X)) = I(X, v_\zeta(r_\psi(X))) \tag{3}$$
When the network’s transformation function is composed of reversible functions, more reliable gradients can be obtained to update the model. Almost all of today’s popular deep learning methods are architectures that conform to the reversible property, such as Eq. 4.

$$X^{l+1} = X^l + f_\theta^{l+1}(X^l) \tag{4}$$
where $l$ indicates the $l$-th layer of a PreAct ResNet and $f$ is the transformation function of the $l$-th layer. PreAct ResNet [22] repeatedly passes the original data $X$ to subsequent layers in an explicit way. Although such a design can make a deep neural network with more than a thousand layers converge very well, it destroys an important reason why we need deep neural networks. That is, for difficult problems, it is difficult for us to directly find simple mapping functions to map data to targets. This also explains why PreAct ResNet performs worse than ResNet [21] when the number of layers is small.

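As a concrete counterpart to Eqs. 2 and 3, the additive coupling unit below (a sketch under our own assumptions, in the style of RevNet-like reversible blocks rather than the authors' code) is exactly invertible, so the transformation loses no information about $X$:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """y1 = x1 + F(x2), y2 = x2 + G(y1); exactly invertible (Eq. 2)."""
    def __init__(self, channels: int):
        super().__init__()
        self.F = nn.Conv2d(channels, channels, 3, padding=1)   # part of r_psi
        self.G = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):                                 # v_zeta in Eq. 2
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = AdditiveCoupling(8)
x1, x2 = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
rx1, rx2 = block.inverse(*block(x1, x2))
print(torch.allclose(x1, rx1, atol=1e-5), torch.allclose(x2, rx2, atol=1e-5))
```

By contrast, the PreAct-style residual of Eq. 4 passes $X$ forward explicitly but is not exactly invertible, which is the trade-off discussed above.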

In addition, we tried to use masked modeling, which has allowed transformer models to achieve significant breakthroughs. We use approximation methods, such as Eq. 5, to try to find the inverse transformation $v$ of $r$, so that the transformed features can retain enough information using sparse features. The form of Eq. 5 is as follows:

$$X = v_\zeta(r_\psi(X) \cdot M) \tag{5}$$
where $M$ is a dynamic binary mask. Other methods commonly used to perform the above task are the diffusion model and the variational autoencoder, and both have the function of finding the inverse function. However, when we apply the above approach to a lightweight model, there will be defects, because the lightweight model will be under-parameterized relative to a large amount of raw data. For the same reason, the important information $I(Y,X)$ that maps data $X$ to target $Y$ will face the same problem. For this issue, we will explore it using the concept of the information bottleneck [59]. The formula for the information bottleneck is as follows:

$$I(X,X) \geq I(Y,X) \geq I(Y, f_\theta(X)) \geq \dots \geq I(Y, \hat{Y})$$
Generally speaking, $I(Y,X)$ occupies only a very small part of $I(X,X)$. However, it is critical to the target task. Therefore, even if the amount of information lost in the feedforward stage is not significant, as long as $I(Y,X)$ is covered, the training effect will be greatly affected. A lightweight model is itself in an under-parameterized state, so it easily loses a lot of important information in the feedforward stage. Therefore, our goal for the lightweight model is how to accurately filter $I(Y,X)$ from $I(X,X)$. As for fully preserving the information of $X$, that is difficult to achieve. Based on the above analysis, we hope to propose a new deep neural network training method that can not only generate reliable gradients to update the model, but also be suitable for shallow and lightweight neural networks.

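Returning to the masked-modeling formulation of Eq. 5, here is a toy training sketch (our own construction; real masked-modeling methods use much larger encoders and decoders): the encoder output standing in for $r_\psi(X)$ is multiplied by a dynamic binary mask $M$, and a decoder playing the role of $v_\zeta$ is trained with a reconstruction loss to recover $X$ from the surviving sparse features.

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 32, 3, padding=1)           # stands in for r_psi
decoder = nn.Conv2d(32, 3, 3, padding=1)           # approximates v_zeta
opt = torch.optim.SGD(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.1
)

x = torch.randn(4, 3, 32, 32)                      # toy batch of data X
for step in range(3):                              # a few illustrative steps
    feats = encoder(x)
    m = (torch.rand_like(feats) > 0.5).float()     # dynamic binary mask M
    recon = decoder(feats * m)                     # v_zeta(r_psi(X) * M)
    loss = nn.functional.mse_loss(recon, x)        # reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: reconstruction loss = {loss.item():.4f}")
```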

4 Methodology
[Figure 3]
4.1. Programmable Gradient Information

In order to solve the aforementioned problems, we propose a new auxiliary supervision framework called Programmable Gradient Information (PGI), as shown in Figure 3 (d). PGI mainly includes three components, namely (1) the main branch, (2) the auxiliary reversible branch, and (3) multi-level auxiliary information. From Figure 3 (d) we see that the inference process of PGI only uses the main branch and therefore does not require any additional inference cost. As for the other two components, they are used to solve or mitigate several important issues in deep learning methods. Among them, the auxiliary reversible branch is designed to deal with the problems caused by the deepening of neural networks. Network deepening causes an information bottleneck, which makes the loss function unable to generate reliable gradients. As for multi-level auxiliary information, it is designed to handle the error accumulation problem caused by deep supervision, especially for architectures with multiple prediction branches and for lightweight models. Next, we will introduce these two components step by step.

4.1.1 Auxiliary Reversible Branch

In PGI, we propose the auxiliary reversible branch to generate reliable gradients and update network parameters. By providing information that maps from data to targets, the loss function can provide guidance and avoid the possibility of finding false correlations from incomplete feedforward features that are less relevant to the target. We propose maintaining complete information by introducing a reversible architecture, but adding the main branch to a reversible architecture consumes a lot of inference cost. We analyzed the architecture of Figure 3 (b) and found that when additional connections from deep to shallow layers are added, the inference time increases by 20%. When we repeatedly add the input data to the high-resolution computing layer of the network (yellow box), the inference time even more than doubles.

Since our goal is to use a reversible architecture to obtain reliable gradients, being "reversible" is not the only necessary condition in the inference stage. In view of this, we regard the reversible branch as an expansion of the deep supervision branch, and then design the auxiliary reversible branch, as shown in Figure 3 (d). As for the main-branch deep features that would have lost important information due to the information bottleneck, they will be able to receive reliable gradient information from the auxiliary reversible branch. This gradient information drives parameter learning to assist in extracting correct and important information, and the above actions enable the main branch to obtain features that are more effective for the target task. Moreover, the reversible architecture performs worse on shallow networks than general networks do, because complex tasks require conversion in deeper networks. Our proposed method does not force the main branch to retain complete original information, but updates it with useful gradients generated through the auxiliary supervision mechanism. The advantage of this design is that the proposed method can also be applied to shallower networks.

Finally, since the auxiliary reversible branch can be removed during the inference phase, the inference capability of the original network is retained. We can also choose any reversible architecture in PGI to play the role of the auxiliary reversible branch.

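The sketch below illustrates this structural property under our own drastic simplifications (tiny convolutional stand-ins rather than the actual YOLOv9 topology): the auxiliary branch contributes a loss, and therefore gradients to the shared trunk, only in training mode, and `model.eval()` drops it entirely at inference.

```python
import torch
import torch.nn as nn

class PGIStyleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(16, 16, 3, padding=1))
        self.main_head = nn.Conv2d(16, 4, 1)
        # auxiliary branch used only to shape training gradients
        self.aux_branch = nn.Conv2d(16, 16, 3, padding=1)
        self.aux_head = nn.Conv2d(16, 4, 1)

    def forward(self, x):
        feats = self.trunk(x)
        out = self.main_head(feats)
        if self.training:                      # auxiliary path exists only in training
            return out, self.aux_head(self.aux_branch(feats))
        return out                             # inference: main branch only, no extra cost

model = PGIStyleNet()
x, target = torch.randn(1, 3, 32, 32), torch.randn(1, 4, 32, 32)
out, aux_out = model(x)
loss = nn.functional.mse_loss(out, target) + nn.functional.mse_loss(aux_out, target)
loss.backward()                                # auxiliary gradients also reach the trunk
model.eval()
print(model(x).shape)                          # a single output at inference time
```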

4.1.2 Multi-level Auxiliary Information

In this section we will discuss how multi-level auxiliary information works. The deep supervision architecture including multiple prediction branches is shown in Figure 3 (c). For object detection, different feature pyramids can be used to perform different tasks; for example, together they can detect objects of different sizes. Therefore, after connecting to the deep supervision branch, the shallow features will be guided to learn the features required for small object detection, and at this time the system will regard the positions of objects of other sizes as background. However, this will cause the deep feature pyramids to lose a lot of information needed to predict the target object. Regarding this issue, we believe that each feature pyramid needs to receive information about all target objects so that the subsequent main branch can retain complete information for learning predictions for various targets.

The concept of multi-level auxiliary information is to insert an integration network between the feature pyramid hierarchy layers of auxiliary supervision and the main branch, and then use it to combine the gradients returned from different prediction heads, as shown in Figure 3 (d). Multi-level auxiliary information then aggregates the gradient information containing all target objects and passes it to the main branch to update parameters. In this way, the characteristics of the main branch's feature pyramid hierarchy will not be dominated by information about some specific object. As a result, our method can alleviate the broken information problem in deep supervision. In addition, any integration network can be used for multi-level auxiliary information. Therefore, we can plan the required semantic levels to guide the learning of network architectures of different sizes.

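As a toy illustration of this integration step (our own simplification: shared channel widths and summation fusion are assumptions, not the paper's design), the function below fuses all pyramid levels before an auxiliary head, so the gradient returned to every level reflects objects of all sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def integrate(pyramid):
    """Resize every pyramid level to the largest resolution and fuse by summation."""
    size = pyramid[0].shape[-2:]
    return sum(F.interpolate(p, size=size, mode="nearest") for p in pyramid)

# three pyramid levels with a shared channel width of 16 (illustrative)
p3 = torch.randn(1, 16, 32, 32, requires_grad=True)
p4 = torch.randn(1, 16, 16, 16, requires_grad=True)
p5 = torch.randn(1, 16, 8, 8, requires_grad=True)

aux_head = nn.Conv2d(16, 4, 1)
loss = aux_head(integrate([p3, p4, p5])).mean()
loss.backward()
# every level receives gradient from the fused, all-object supervision signal
print(p3.grad.abs().sum() > 0, p5.grad.abs().sum() > 0)
```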

4.2. Generalized ELAN

In this section we describe the proposed new network architecture, GELAN. By combining two neural network architectures, CSPNet [64] and ELAN [65], both of which are designed with gradient path planning, we designed the generalized efficient layer aggregation network (GELAN), which takes into account light weight, inference speed, and accuracy. Its overall architecture is shown in Figure 4. We generalized the capability of ELAN [65], which originally only used stacking of convolutional layers, into a new architecture that can use any computational block.

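The sketch below shows this split-transform-merge pattern under our own simplifying assumptions (channel counts, depth, and the 1x1 fusion layer are illustrative, not the published GELAN configuration); the `block_factory` argument stands for the "any computational block" generalization, so plain convolutions, Res blocks, or CSP blocks can be plugged in interchangeably.

```python
import torch
import torch.nn as nn

class GELANStyleBlock(nn.Module):
    def __init__(self, channels: int, block_factory, depth: int = 2):
        super().__init__()
        half = channels // 2
        # ELAN-style chain of arbitrary computational blocks on one partition
        self.blocks = nn.ModuleList(block_factory(half) for _ in range(depth))
        # 1x1 transition fusing the untouched split plus every intermediate output
        self.fuse = nn.Conv2d(half * (depth + 2), channels, 1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)        # CSPNet-style split
        outs = [a, b]
        for blk in self.blocks:         # each stage feeds the next; all are kept
            outs.append(blk(outs[-1]))
        return self.fuse(torch.cat(outs, dim=1))

conv_block = lambda c: nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.SiLU())
gelan = GELANStyleBlock(64, conv_block, depth=2)
print(gelan(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```

Swapping `conv_block` for a different factory is exactly the kind of computational-block replacement explored in the ablation studies of Section 5.4.1.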

[Figure 4]
5 Experiments
5.1. Experimental Setup

We verify the proposed method with the MS COCO dataset. All experimental setups follow YOLOv7 AF [63], and the dataset uses the MS COCO 2017 split. All models we mention are trained using the train-from-scratch strategy, and training lasts a total of 500 epochs. For the learning rate, we use a linear warm-up in the first three epochs, and subsequent epochs use a decay schedule set according to the model scale. For the last 15 epochs, we turn mosaic data augmentation off. For more settings, please refer to the Appendix.

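A minimal sketch of this learning-rate recipe follows (the three-epoch linear warm-up matches the text; the cosine decay and the base learning rate are placeholder assumptions, since the paper sets the decay manner per model scale).

```python
import math
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.01)  # toy optimizer
warmup_epochs, total_epochs = 3, 500

def lr_lambda(epoch):
    if epoch < warmup_epochs:                        # linear warm-up, epochs 0-2
        return (epoch + 1) / warmup_epochs
    # placeholder decay; the actual decay manner depends on model scale
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
for epoch in range(total_epochs):
    use_mosaic = epoch < total_epochs - 15           # mosaic off for the last 15 epochs
    # train_one_epoch(..., mosaic=use_mosaic)        # hypothetical training step
    sched.step()
```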

5.2. Implementation Details

We built the general and extended versions of YOLOv9 based on YOLOv7 [63] and Dynamic YOLOv7 [36], respectively. In the design of the network architecture, we replaced ELAN [65] with GELAN, using CSPNet blocks [64] with planned RepConv [63] as computational blocks. We also simplified the downsampling module and optimized the anchor-free prediction head. As for the auxiliary loss part of PGI, we completely follow YOLOv7's auxiliary head setting. Please see the Appendix for more details.

5.3. Comparison with state-of-the-arts

Table 1 lists the comparison of our proposed YOLOv9 with other train-from-scratch real-time object detectors. Overall, the best performing methods among existing methods are YOLO MS-S [7] for lightweight models, YOLO MS [7] for medium models, YOLOv7 AF [63] for general models, and YOLOv8-X [15] for large models. Compared with the lightweight and medium model YOLO MS [7], YOLOv9 has about 10% fewer parameters and 5∼15% less computation, but still has a 0.4∼0.6% improvement in AP. Compared with YOLOv7 AF, YOLOv9-C has 42% fewer parameters and 22% less computation, but achieves the same AP (53%). Compared with YOLOv8-X, YOLOv9-E has 16% fewer parameters, 27% less computation, and a significant improvement of 1.7% AP. The above comparison results show that our proposed YOLOv9 has improved significantly in all aspects compared with existing methods.

[Table 1]

On the other hand, we also include ImageNet-pretrained models in the comparison, and the results are shown in Figure 5. We compare them based on the number of parameters and the amount of computation, respectively. In terms of the number of parameters, the best performing large model is RT DETR [43]. From Figure 5, we can see that YOLOv9, using conventional convolution, is even better in parameter utilization than YOLO MS, which uses depth-wise convolution. As for the parameter utilization of large models, it also greatly surpasses RT DETR, which uses an ImageNet-pretrained model. Even better, in the deep model, YOLOv9 shows the huge advantage of using PGI. By accurately retaining and extracting the information needed to map the data to the target, our method requires only 66% of the parameters of RT DETR-X while maintaining the same accuracy.

[Figure 5]

Figure 5. Comparison of state-of-the-art real-time object detectors. The methods participating in the comparison all use ImageNet pre-trained weights, including RT DETR [43], RTMDet [44], and PP-YOLOE [74], etc. YOLOv9, which uses the train-from-scratch method, clearly surpasses the performance of the other methods.

As for the amount of computation, the best existing models from the smallest to the largest are YOLO MS [7], PP-YOLOE [74], and RT DETR [43]. From Figure 5, we can see that YOLOv9 is far superior to the other train-from-scratch methods in terms of computational complexity. In addition, compared with methods based on depth-wise convolution and ImageNet-based pretrained models, YOLOv9 is also very competitive.

5.4. Ablation Studies
5.4.1 Generalized ELAN

For GELAN, we first conduct ablation studies on computational blocks. We used Res blocks [21], Dark blocks [49], and CSP blocks [64] to conduct experiments, respectively. Table 2 shows that after replacing the convolutional layers in ELAN with different computational blocks, the system can maintain good performance. Users are indeed free to replace computational blocks and use them on their respective inference devices. Among the different computational block replacements, CSP blocks perform particularly well. They not only reduce the amount of parameters and computation, but also improve AP by 0.7%. Therefore, we choose CSP-ELAN as the component unit of GELAN in YOLOv9.

[Table 2]

Next, we conduct ELAN block-depth and CSP block-depth experiments on GELAN of different sizes, and display the results in Table 3. We can see that when the depth of ELAN is increased from 1 to 2, the accuracy is significantly improved. But when the depth is greater than or equal to 2, no matter whether the ELAN depth or the CSP depth is increased, the number of parameters, the amount of computation, and the accuracy always show a linear relationship. This means GELAN is not sensitive to depth. In other words, users can arbitrarily combine the components in GELAN to design the network architecture and obtain a model with stable performance without special design. In Table 3, for YOLOv9-{S,M,C}, we set the pairing of the ELAN depth and the CSP depth to {{2, 3}, {2, 1}, {2, 1}}.

[Table 3]
5.4.2 Programmable Gradient Information

In terms of PGI, we performed ablation studies on the auxiliary reversible branch and multi-level auxiliary information, on the backbone and the neck respectively. We designed the auxiliary reversible branch ICN to use DHLC [34] linkage to obtain multi-level reversible information. As for multi-level auxiliary information, we use FPN and PAN for ablation studies, and the role of PFH is equivalent to traditional deep supervision. The results of all experiments are listed in Table 4. From Table 4, we can see that PFH is only effective in deep models, while our proposed PGI can improve accuracy under different combinations. Especially when using ICN, we get stable and better results. We also tried applying the lead-head guided assignment proposed in YOLOv7 [63] to PGI's auxiliary supervision, and achieved much better performance.

[Table 4]

We further implemented the concepts of PGI and deep supervision on models of various sizes and compared the results; these results are shown in Table 5. As analyzed at the beginning, the introduction of deep supervision causes a loss of accuracy for shallow models. As for general models, introducing deep supervision causes unstable performance, and the design concept of deep supervision can only bring gains in extremely deep models. The proposed PGI can effectively handle problems such as information bottleneck and broken information, and can comprehensively improve the accuracy of models of different sizes. The concept of PGI brings two valuable contributions. The first is to make the auxiliary supervision method applicable to shallow models, while the second is to make the deep model training process obtain more reliable gradients. These gradients enable deep models to use more accurate information to establish correct correlations between data and targets.

[Table 5]

Finally, we show in the table the results of gradually increasing components from the baseline YOLOv7 to YOLOv9-E. The GELAN and PGI we proposed have brought all-round improvements to the model.

5.5. Visualization

This section will explore the information bottleneck issue and visualize it. In addition, we will also visualize how the proposed PGI uses reliable gradients to find the correct correlations between data and targets. In Figure 6 we show the visualization results of feature maps obtained by feedforward passes with random initial weights under different architectures. We can see that as the number of layers increases, the original information of all architectures gradually decreases. For example, at the 50th layer of PlainNet, it is difficult to see the location of objects, and all distinguishable features are lost at the 100th layer. As for ResNet, although the position of the object can still be seen at the 50th layer, the boundary information has been lost. When the depth reaches the 100th layer, the whole image becomes blurry. Both CSPNet and the proposed GELAN perform very well, and both can maintain features that support clear identification of objects up to the 200th layer. Among the comparisons, GELAN has more stable results and clearer boundary information.

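The following sketch reproduces the spirit of this experiment under our own assumptions (a randomly initialized plain convolutional stack as the PlainNet stand-in, a hypothetical `sample.jpg`, and channel-mean feature maps as the visualization):

```python
import torch
import torch.nn as nn
from torchvision.io import read_image
from torchvision.utils import save_image

# a 200-layer randomly initialized plain network (no training)
net = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(3 if i == 0 else 32, 32, 3, padding=1), nn.ReLU())
    for i in range(200)
])

x = read_image("sample.jpg").float().unsqueeze(0) / 255.0   # hypothetical input image
feat = x
with torch.no_grad():
    for depth, stage in enumerate(net, start=1):
        feat = stage(feat)
        if depth in (50, 100, 150, 200):
            fmap = feat.mean(dim=1, keepdim=True)           # channel-mean feature map
            fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)
            save_image(fmap, f"plainnet_depth{depth}.png")
```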

[Figure 6]

Figure 6. Feature maps (visualization results) output by random initial weights of PlainNet, ResNet, CSPNet, and GELAN at different depths. After 100 layers, ResNet begins to produce feedforward output that is enough to obfuscate object information. Our proposed GELAN can still retain quite complete information up to the 150th layer, and is still sufficiently discriminative up to the 200th layer.

Figure 7 is used to show whether PGI can provide more reliable gradients during the training process, so that the parameters used for updating can effectively capture the relationship between the input data and the target. Figure 7 shows the visualization results of the feature maps of GELAN and YOLOv9 (GELAN + PGI) during PAN bias warm-up. From the comparison of Figure 7 (b) and (c), we can clearly see that PGI accurately and concisely captures the area containing objects. As for GELAN, which does not use PGI, we found that it showed divergence when detecting object boundaries, and it also produced unexpected responses in some background areas. This experiment confirms that PGI can indeed provide better gradients to update parameters and enable the feedforward stage of the main branch to retain more important features.

Figure 7. PAN feature maps (visualization results) of GELAN and YOLOv9 (GELAN + PGI) after one epoch of bias warm-up. GELAN originally had some divergence, but after adding PGI’s reversible branch, it is more capable of focusing on the target object.

6 Conclusions

In this paper, we propose using PGI to solve the information bottleneck problem and the problem that the deep supervision mechanism is not suitable for lightweight neural networks. We designed GELAN, a highly efficient and lightweight neural network. In terms of object detection, GELAN has strong and stable performance at different computational block and depth settings. It can indeed be widely extended into a model suitable for various inference devices. For the above two issues, the introduction of PGI allows both lightweight models and deep models to achieve significant improvements in accuracy. YOLOv9, designed by combining PGI and GELAN, has shown strong competitiveness. Its excellent design allows the deep model to reduce the number of parameters by 49% and the amount of computation by 43% compared with YOLOv8, while still achieving a 0.6% AP improvement on the MS COCO dataset.

7 Acknowledgements

The authors wish to thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources.
