VAD: Vectorized Scene Representation for Efficient Autonomous Driving

Latest recommended article published on 2024-08-02 19:54:36

真诚的灰灰

Article tags: autonomous driving, artificial intelligence, machine learning

Article link: https://blog.csdn.net/jch924583667/article/details/140613742

https://github.com/hustvl/VAD

Abstract

Autonomous driving requires a comprehensive understanding of the surrounding environment for reliable trajectory planning. Previous works rely on dense rasterized scene representation (e.g., agent occupancy and semantic map) to perform planning, which is computationally intensive and misses the instance-level structure information. In this paper, we propose VAD, an end-to-end vectorized paradigm for autonomous driving, which models the driving scene as a fully vectorized representation. The proposed vectorized paradigm has two significant advantages. On one hand, VAD exploits the vectorized agent motion and map elements as explicit instance-level planning constraints, which effectively improves planning safety. On the other hand, VAD runs much faster than previous end-to-end planning methods by getting rid of computation-intensive rasterized representation and hand-designed post-processing steps. VAD achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, outperforming the previous best method by a large margin. Our base model, VAD-Base, greatly reduces the average collision rate by 29.0% and runs 2.5× faster. Besides, a lightweight variant, VAD-Tiny, greatly improves the inference speed (up to 9.3×) while achieving comparable planning performance. We believe the excellent performance and the high efficiency of VAD are critical for the real-world deployment of an autonomous driving system. Code and models are available at https://github.com/hustvl/VAD for facilitating future research.
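Figure 1. Previous paradigms mainly rely on rasterized representation (a) for planning (e.g., semantic map, occupancy map, flow map, and cost map), which is computationally intensive. The proposed VAD performs end-to-end planning entirely on a vectorized scene representation (b). VAD uses instance-level structure information as planning constraints and guidance, achieving promising performance and efficiency.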

1. Introduction

Autonomous driving requires both comprehensive scene understanding for ensuring safety and high efficiency for real-world deployment. An autonomous vehicle needs to efficiently perceive the driving scene and perform reasonable planning based on the scene information.
Traditional autonomous driving methods [7, 14, 23, 48] adopt a modular paradigm, where perception and planning are decoupled into standalone modules. The disadvantage is that the planning module cannot access the original sensor data, which contains rich semantic information. And since planning is fully based on preceding perception results, errors in perception may severely influence planning and cannot be recognized and corrected in the planning stage, which leads to safety problems.
Recently, end-to-end autonomous driving methods [2, 11, 19, 21] take sensor data as input for perception and output planning results with one holistic model. Some works [9, 40, 41] directly output planning results based on the sensor data without learning scene representation, which lacks interpretability and is difficult to optimize. Most works [2, 19, 21] transform the sensor data into rasterized scene representation (e.g., semantic map, occupancy map, flow map, and cost map) for planning. Though straightforward, rasterized representation is computationally intensive and misses critical instance-level structure information.
In this work, we propose VAD (Vectorized Autonomous Driving), an end-to-end vectorized paradigm for autonomous driving. VAD models the scene in a fully vectorized way (i.e., vectorized agent motion and map), getting rid of computationally intensive rasterized representation.
We argue that vectorized scene representation is superior to the rasterized one. The vectorized map (represented as boundary vectors and lane vectors) provides road structure information (e.g., traffic flow, drivable boundary, and lane direction), and helps the autonomous vehicle narrow down the trajectory search space and plan a reasonable future trajectory. The motion of traffic participants (represented as agent motion vectors) provides instance-level restriction for collision avoidance. What's more, vectorized scene representation is efficient in terms of computation, which is important for real-world applications.

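A question: where do these vectors (drivable boundaries, lanes, agents, and so on) come from? They are part of what the network is trained to produce: the model predicts them itself from the sensor input.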

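Some advantages of the vectorized scene representation:

Clear structural information:
- The vectorized map exposes the road structure explicitly, so the autonomous vehicle can better understand road layout and traffic rules.
Smaller search space:
- A vectorized representation narrows the search space for trajectory planning, which helps both planning efficiency and accuracy.
Instance-level collision avoidance:
- Agent motion vectors describe the motion of each individual traffic participant, which supports precise collision avoidance and safe planning.
Computational efficiency:
- Vectorized representation is cheaper to compute, reducing processing time and resource usage, which matters for real-time systems and real deployment.
Easy to optimize and integrate:
- Vectorized data is generally convenient for mathematical operations and optimization, which simplifies integrating it into planning algorithms.
Adaptability:
- Vectorized representation can flexibly describe different driving scenes and conditions, which may help generalization.
Interpretability:
- Compared with rasterized representation, vectorized representation is usually easier to understand and explain, which helps transparency and trust.
Real-time capability:
- Efficient computation lets the system respond to environment changes in real time, as autonomous driving requires.
By adopting a vectorized scene representation, VAD aims to provide an efficient, accurate, and interpretable autonomous driving solution, which matters for bringing the technology into practical use.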

VAD takes full advantage of the vectorized information to guide planning both implicitly and explicitly. On one hand, VAD adopts map queries and agent queries to implicitly learn instance-level map features and agent motion features from sensor data, and extracts guidance information for planning via query interaction. On the other hand, VAD proposes three instance-level planning constraints based on the explicit vectorized scene representation: the ego-agent collision constraint for maintaining a safe distance between the ego vehicle and other dynamic agents both laterally and longitudinally; the ego-boundary overstepping constraint for pushing the planning trajectory away from the road boundary; and the ego-lane direction constraint for regularizing the future motion direction of the autonomous vehicle with vectorized lane direction. Our proposed framework and the vectorized planning constraints effectively improve the planning performance, without incurring large computational overhead.
How VAD uses the vectorized information to improve planning, point by point:

Implicit learning:
- Through map queries and agent queries, VAD implicitly extracts instance-level features from raw sensor data. These features are essential for understanding the scene and the motion of traffic participants.
Query interaction:
- Through query interaction, VAD extracts information that guides planning, such as map structure and the motion trends of surrounding agents.
Explicit planning constraints:
- VAD proposes explicit planning constraints built on the vectorized scene representation. They act directly on the planned trajectory to keep it safe and reasonable.
Ego-agent collision constraint:
- Keeps the ego vehicle at a safe distance from surrounding dynamic agents to avoid potential collisions.
Ego-boundary overstepping constraint:
- Keeps the planned trajectory from crossing the road boundary, so the vehicle stays inside the drivable area.
Ego-lane direction constraint:
- Uses the lane direction vector to guide the ego vehicle along the correct direction, making the plan more natural and smooth.
Computational efficiency:
- Although VAD introduces several planning constraints, the vectorized formulation keeps computation cheap and avoids the cost of rasterized representation.
Performance gain:
- By combining implicit learning with explicit constraints, VAD improves planning performance without sacrificing computational efficiency.
  This design couples perception and planning tightly, so the system can understand and respond to complex traffic environments more precisely and efficiently.

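Combining implicit learning with explicit constraints is indeed a nice idea: end-to-end learning, traffic lights, road topology, and object recognition are all folded into one model, interpretability improves, and the cost function becomes easier to construct efficiently.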

Without fancy tricks or hand-designed post-processing steps, VAD achieves state-of-the-art (SOTA) end-to-end planning performance and the best efficiency on the challenging nuScenes [1] dataset. Compared with the previous SOTA method UniAD [21], our base model, VAD-Base, greatly reduces the average planning displacement error by 30.1% (1.03m vs. 0.72m) and the average collision rate by 29.0% (0.31% vs. 0.22%), while running 2.5× faster (1.8 FPS vs. 4.5 FPS). The lightweight variant, VAD-Tiny, runs 9.3× faster (1.8 FPS vs. 16.8 FPS) while achieving comparable planning performance: the average planning displacement error is 0.78m and the average collision rate is 0.38%. We also demonstrate the effectiveness of our design choices through thorough ablations.
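The percentages and speed-ups quoted above follow directly from the raw numbers in the same paragraph. A small sketch that recomputes them; the helper names are my own and only the quoted figures are used:

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


def speedup(baseline_fps: float, new_fps: float) -> float:
    """How many times faster `new_fps` is than `baseline_fps`."""
    return new_fps / baseline_fps


# UniAD vs. VAD-Base, figures quoted in the text above.
de_drop = relative_reduction(1.03, 0.72)   # average displacement error, metres
cr_drop = relative_reduction(0.31, 0.22)   # average collision rate, percent
base_speedup = speedup(1.8, 4.5)           # VAD-Base
tiny_speedup = speedup(1.8, 16.8)          # VAD-Tiny

print(f"{de_drop:.1%} {cr_drop:.1%} {base_speedup:.1f}x {tiny_speedup:.1f}x")
```

The reductions are relative to the UniAD baseline, not absolute differences in metres or percentage points.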
Our key contributions are summarized as follows:
1. We propose VAD, an end-to-end vectorized paradigm for autonomous driving. VAD models the driving scene as a fully vectorized representation, getting rid of computationally intensive dense rasterized representation and hand-designed post-processing steps.
2. VAD implicitly and explicitly utilizes the vectorized scene information to improve planning safety, via query interaction and vectorized planning constraints.
3. VAD achieves SOTA end-to-end planning performance, outperforming previous methods by a large margin. Not only that, because of the vectorized scene representation and our concise model design, VAD greatly improves the inference speed, which is critical for the real-world deployment of an autonomous driving system.
It's our belief that autonomous driving can be performed in a fully vectorized manner with high efficiency. We hope the impressive performance of VAD can reveal the potential of the vectorized paradigm to the community.

2. Related Work

Perception. Accurate perception of the driving scene is the basis for autonomous driving. We mainly introduce some camera-based 3D object detection and online mapping methods which are most relevant to this paper. DETR3D [47] uses 3D queries to sample corresponding image features and accomplish detection without non-maximum suppression. PETR [31] introduces 3D positional encoding to image features and uses detection queries to learn object features via attention [46] mechanism. Recently, bird's-eye view (BEV) representation has become popular and has greatly contributed to the field of perception [8, 17, 26, 28, 29, 34, 51]. LSS [39] is a pioneering work that introduces depth prediction to project features from perspective view to BEV. BEVFormer [26] proposes spatial and temporal attention for better encoding the BEV feature map and achieves remarkable detection performance with camera input only. FIERY [17] and BEVerse [51] use the BEV feature map to predict dense map segmentation. HDMapNet [25] converts lane segmentation to vectorized map with post-processing steps. VectorMapNet [32] predicts map elements in an autoregressive way. MapTR [29] recognizes the permutation invariance of the map instance points and can predict all map elements simultaneously. LaneGAP [28] models the lane graph in a novel path-wise manner, which well preserves the continuity of the lane and encodes traffic information for planning. We leverage a group of BEV queries, agent queries, and map queries to accomplish scene perception following BEVFormer [26] and MapTR [29], and further use these query features and perception results in the motion prediction and planning stage. Details are shown in Sec. 3.
These perception methods show how computer vision and machine learning are used to understand the surroundings: detecting other vehicles and pedestrians and recovering the road structure, which the later decision and planning stages depend on. The BEV representation is especially helpful for higher-level scene understanding, because its top-down global view makes spatial reasoning easier.

Motion Prediction. Traditional motion prediction takes perception ground truth (e.g., agent history trajectories and HD map) as input. Some works [3, 38] render the driving scene as BEV images and adopt CNN-based networks to predict future motion. Some other works [13, 33, 37] use vectorized representation, and adopt GNN [27] or Transformer [33, 37, 46] to accomplish learning and prediction. Recent end-to-end works [15, 17, 22, 51] jointly perform perception and motion prediction. Some works [17, 20, 51] see future motion as dense occupancy and flow instead of agent-level future waypoints. ViP3D [15] predicts future motion based on the tracking results and HD map. PIP [22] proposes an interaction scheme between dynamic agents and static vectorized map, and achieves SOTA performance without relying on HD map. VAD learns vectorized agent motion by interacting between dynamic agents and static map elements, inspired by [22].
Planning. Recently, learning-based planning methods prevail. Some works [9, 40, 41] omit intermediate stages such as perception and motion prediction, and directly predict planning trajectories or control signals. Although this idea is straightforward and simple, these methods lack interpretability and are difficult to optimize. Reinforcement learning is well suited to the planning task and has become a promising research direction [4, 5, 45]. Explicit dense cost maps have great interpretability and are widely used [2, 11, 19, 44]. The cost maps are constructed from the perception or motion prediction results or come from a learning-based module, and hand-crafted rules are often adopted to select the best planning trajectory with minimum cost. The construction of a dense cost map is computationally intensive and the use of hand-crafted rules brings robustness and generalization problems. UniAD [21] effectively incorporates the information provided by various preceding tasks to assist planning in a goal-oriented spirit, and achieves remarkable performance in perception, prediction, and planning. PlanT [43] takes perception ground truth as input and encodes the scene in object-level representation for planning. In this paper, we explore the potential of vectorized scene representation for planning and get rid of dense maps or hand-designed post-processing steps.

3. Method

Overview. The overall framework of VAD is depicted in Fig. 2. Given multi-frame and multi-view image input, VAD first encodes the image features with a backbone network and utilizes a group of BEV queries to project the image features to the BEV features [26, 39, 52]. Second, VAD utilizes a group of agent queries and map queries to learn the vectorized scene representation, including vectorized map and vectorized agent motion (Sec. 3.1). Third, planning is performed based on the scene information (Sec. 3.2). Specifically, VAD uses an ego query to learn the implicit scene information through interaction with agent queries and map queries. Based on the ego query, ego status features, and high-level driving command, the Planning Head outputs the planning trajectory. Besides, VAD introduces three vectorized planning constraints to restrict the planning trajectory at the instance level (Sec. 3.3). VAD is fully differentiable and trained in an end-to-end manner (Sec. 3.4).
Figure 2. Overall architecture of VAD. The full pipeline is divided into four phases. The backbone includes an image feature extractor and a BEV encoder that projects the image features to BEV features. Vectorized scene learning encodes the scene information into agent queries and map queries, and represents the scene with motion vectors and map vectors. In the inference phase of planning, VAD uses an ego query to extract map and agent information via query interaction and outputs the planning trajectory (represented as an ego vector). In the training phase, the proposed vectorized planning constraints regularize the planning trajectory. *: optional.

3.1. Vectorized Scene Learning

Perceiving traffic agents and map elements is important in driving scene understanding. VAD encodes the scene information into query features and represents the scene by map vectors and agent motion vectors.
Vectorized Map. Previous works [19, 21] use rasterized semantic maps to guide planning, which misses critical instance-level structure information of the map. VAD utilizes a group of map queries [29] $Q_m$ to extract map information from the BEV feature map, and predicts map vectors $\hat V_m \in \mathbb{R}^{N_m \times N_p \times 2}$ together with a class score for each map vector, where $N_m$ and $N_p$ denote the number of predicted map vectors and the number of points contained in each map vector. Three kinds of map elements are considered: lane divider, road boundary, and pedestrian crossing. The lane divider provides direction information, and the road boundary indicates the drivable area. Map queries and map vectors are both leveraged to improve the planning performance (Sec. 3.2 and Sec. 3.3).
Vectorized Agent Motion. VAD first adopts a group of agent queries $Q_a$ to learn agent-level features from the shared BEV feature map via deformable attention [53]. The agent's attributes (location, class score, orientation, etc.) are decoded from the agent queries by an MLP-based decoder head. To enrich the agent features for motion prediction, VAD performs agent-agent and agent-map interaction [22, 37] via attention mechanism. Then VAD predicts future trajectories of each agent, represented as multi-modality motion vectors $\hat V_a \in \mathbb{R}^{N_a \times N_k \times T_f \times 2}$, where $N_a$, $N_k$, and $T_f$ denote the number of predicted agents, the number of modalities, and the number of future timestamps. Each modality of the motion vector indicates a kind of driving intention. VAD outputs a probability score for each modality. The agent motion vectors are used to restrict the ego planning trajectory and avoid collision (Sec. 3.3). Meanwhile, the agent queries are sent into the planning module as scene information (Sec. 3.2).
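To make the tensor shapes concrete, here is a minimal NumPy sketch of the multi-modal motion vectors $\hat V_a$. It drops low-confidence agents and keeps the most probable modality for each remaining one, which is how Sec. 3.3 says the predictions are used. The function names, the sizes, and the threshold value are illustrative assumptions, not taken from the VAD code.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def select_agent_trajectories(motion, mode_logits, agent_score, eps_a):
    """Keep confident agents and, for each, its most probable motion modality.

    motion:      (N_a, N_k, T_f, 2) multi-modal motion vectors
    mode_logits: (N_a, N_k) unnormalised modality scores
    agent_score: (N_a,) detection confidence per agent
    eps_a:       confidence threshold (hypothetical value in the demo below)
    """
    keep = agent_score > eps_a
    best_mode = softmax(mode_logits).argmax(axis=-1)            # (N_a,)
    best_traj = motion[np.arange(motion.shape[0]), best_mode]   # (N_a, T_f, 2)
    return best_traj[keep], keep


rng = np.random.default_rng(0)
N_a, N_k, T_f = 4, 6, 6                       # illustrative sizes, not the paper's
motion = rng.normal(size=(N_a, N_k, T_f, 2))
mode_logits = rng.normal(size=(N_a, N_k))
agent_score = np.array([0.9, 0.2, 0.7, 0.55])
trajs, keep = select_agent_trajectories(motion, mode_logits, agent_score, eps_a=0.5)
print(trajs.shape)   # one (T_f, 2) trajectory per kept agent
```

The map vectors $\hat V_m$ of shape $(N_m, N_p, 2)$ can be filtered by class score in the same way.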

3.2. Planning via Interaction

Ego-Agent Interaction. VAD utilizes a randomly initialized ego query $Q_{ego}$ to learn the implicit scene features which are valuable for planning. In order to learn the location and motion information of other dynamic traffic participants, the ego query first interacts with the agent queries through a Transformer decoder [46], in which the ego query serves as the attention query $q$, and the agent queries serve as key $k$ and value $v$. The ego position $p_{ego}$ and the agent positions $p_a$ predicted by the perception module are encoded by a single-layer MLP $PE_1$, and serve as query position embedding $q_{pos}$ and key position embedding $k_{pos}$. The positional embeddings provide information on the relative position relationship between agents and the ego vehicle. The above process can be formulated as:
$Q'_{ego} = \mathrm{TransformerDecoder}(q, k, v, q_{pos}, k_{pos}), \quad q = Q_{ego},\ k = v = Q_a,\ q_{pos} = PE_1(p_{ego}),\ k_{pos} = PE_1(p_a)$

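The positional encoder here is fairly easy to understand: the positions of the ego vehicle and the agents go through an MLP to produce the positional embeddings. But isn't the ego position $p_{ego}$ just the fixed origin of the ego coordinate frame?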

Ego-Map Interaction. After interacting with agent queries, the updated ego query $Q'_{ego}$ further interacts with the map queries $Q_m$ in a similar way. The only difference is that we use a different MLP $PE_2$ to encode the positions of the ego vehicle and the map elements. The output ego query $Q''_{ego}$ contains both dynamic and static information of the driving scene. The process is formulated as:
$Q''_{ego} = \mathrm{TransformerDecoder}(q, k, v, q_{pos}, k_{pos}), \quad q = Q'_{ego},\ k = v = Q_m,\ q_{pos} = PE_2(p_{ego}),\ k_{pos} = PE_2(p_m)$
In this step the ego query attends to the map queries and absorbs the map-element information. This lets the ego query capture the static road structure, such as lane dividers and road boundaries, which the planning stage needs to decide the ego vehicle's motion.
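The two interaction steps can be sketched as two chained cross-attention calls. This is a heavily simplified NumPy illustration rather than VAD's actual Transformer decoder: it uses a single head, omits the learned projections, layer norm and feed-forward network, and assumes a ReLU in the single-layer positional MLP. It also uses one 2D reference point per map element for `p_m`, which is my own simplification.

```python
import numpy as np


def pos_embed(pos, W, b):
    """Single-layer MLP: (N, 2) positions -> (N, D) embeddings (ReLU is an assumption)."""
    return np.maximum(pos @ W + b, 0.0)


def cross_attention(q, k, v, q_pos, k_pos):
    """Single-head scaled dot-product cross-attention with a residual update."""
    d = q.shape[-1]
    scores = (q + q_pos) @ (k + k_pos).T / np.sqrt(d)      # (1, N)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)                  # attention weights
    return q + w @ v, w


rng = np.random.default_rng(0)
D, N_a, N_m = 8, 5, 7                                      # illustrative sizes
Q_ego = rng.normal(size=(1, D))                            # randomly initialised ego query
Q_a, Q_m = rng.normal(size=(N_a, D)), rng.normal(size=(N_m, D))
p_ego = np.zeros((1, 2))                                   # ego at the origin of its own frame
p_a = rng.uniform(-30, 30, size=(N_a, 2))
p_m = rng.uniform(-30, 30, size=(N_m, 2))
PE1 = (0.1 * rng.normal(size=(2, D)), np.zeros(D))         # stands in for PE_1
PE2 = (0.1 * rng.normal(size=(2, D)), np.zeros(D))         # stands in for PE_2

# Ego-agent interaction, then ego-map interaction on the updated query.
Q1, w_agent = cross_attention(Q_ego, Q_a, Q_a, pos_embed(p_ego, *PE1), pos_embed(p_a, *PE1))
Q2, w_map = cross_attention(Q1, Q_m, Q_m, pos_embed(p_ego, *PE2), pos_embed(p_m, *PE2))
print(Q2.shape, w_agent.shape, w_map.shape)
```

Here `Q1` and `Q2` play the roles of $Q'_{ego}$ and $Q''_{ego}$. The attention weights show which agents or map elements the ego query draws on most.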
Planning Head. Because VAD performs HD-map-free planning, a high-level driving command $c$ is required for navigation. Following the common practice [19, 21], VAD uses three kinds of driving commands: turn left, turn right, and go straight. So the planning head takes the updated ego queries ($Q'_{ego}$, $Q''_{ego}$) and the current status of the ego vehicle $s_{ego}$ (optional) as ego features $f_{ego}$, as well as the driving command $c$ as inputs, and outputs the planning trajectory $\hat V_{ego} \in \mathbb{R}^{T_f \times 2}$. VAD adopts a simple MLP-based planning head. The decoding process is formulated as follows:
$\hat V_{ego} = \mathrm{PlanHead}(f_{ego}, c), \quad f_{ego} = [Q'_{ego}, Q''_{ego}, s_{ego}],\ c = cmd$
where $[\ldots]$ denotes the concatenation operation, $f_{ego}$ denotes the features used for decoding, and $cmd$ denotes the navigation driving command.
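A minimal sketch of an MLP planning head with the inputs and output described above. How the driving command enters the head is an assumption here: I concatenate a one-hot command to the ego features. The layer sizes, parameter names, and the `COMMANDS` tuple are hypothetical too; the released code may condition on the command differently.

```python
import numpy as np

COMMANDS = ("turn_left", "turn_right", "go_straight")


def init_plan_head(in_dim, hidden, T_f, rng):
    """Random parameters for a two-layer MLP (illustrative only)."""
    return {
        "W1": 0.1 * rng.normal(size=(in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": 0.1 * rng.normal(size=(hidden, T_f * 2)), "b2": np.zeros(T_f * 2),
    }


def plan_head(q1, q2, s_ego, command, params):
    """Decode a (T_f, 2) trajectory from [Q'_ego, Q''_ego, s_ego] and a command."""
    f_ego = np.concatenate([q1, q2, s_ego])                 # ego features
    cmd = np.eye(len(COMMANDS))[COMMANDS.index(command)]    # one-hot command (assumption)
    x = np.concatenate([f_ego, cmd])
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)    # hidden layer with ReLU
    return (h @ params["W2"] + params["b2"]).reshape(-1, 2)


rng = np.random.default_rng(0)
D, S, T_f = 8, 4, 6                                          # illustrative sizes
params = init_plan_head(2 * D + S + len(COMMANDS), 16, T_f, rng)
traj = plan_head(rng.normal(size=D), rng.normal(size=D), rng.normal(size=S),
                 "go_straight", params)
print(traj.shape)   # T_f future waypoints in the BEV plane
```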

3.3. Vectorized Planning Constraint

Based on the learned map vectors and motion vectors, VAD regularizes the planning trajectory $\hat V_{ego}$ with instance-level vectorized constraints during the training phase, as shown in Fig. 3.

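Are the map vectors, the motion vectors, and the planning trajectory learned at the same time, or are the map and motion vectors learned first and planning afterwards? My guess is that they are learned jointly: Sec. 3.4 defines one weighted loss over all of them, which is consistent with joint training. I have not seen it stated explicitly, though, so the code is the place to check.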

Figure 3. Illustration of Vectorized Planning Constraints. Ego-agent collision constraint aims to keep longitudinal safety and lateral safety between the ego vehicle and other agents. Ego-boundary overstepping constraint punishes the predictions when the planning trajectory gets too close to the lane boundary. Ego-lane directional constraint leverages the direction of the closest lane vector from the ego car (the pink lane in the right sub-figure) as a prior to regularize the motion direction of planning.
Ego-Agent Collision Constraint. Ego-agent collision constraint explicitly considers the compatibility of the ego planning trajectory and the future trajectory of other vehicles, in order to improve planning safety and avoid collision. Unlike previous works [19, 21] that adopt dense occupancy maps, we utilize vectorized motion trajectories, which both keep great interpretability and require less computation. Specifically, we first filter out low-confidence agent predictions by a threshold $\epsilon_a$. For multi-modality motion prediction, we use the trajectory with the highest confidence score as the final prediction. We consider the collision constraint as a safe boundary for the ego vehicle both laterally and longitudinally. Multiple cars may be close to each other (e.g., driving side by side) in the lateral direction, but a longer safety distance is required in the longitudinal direction. So we adopt different agent distance thresholds $\delta_X$ and $\delta_Y$ for different directions. For each future timestamp, we find the closest agent within a certain range $\delta_a$ in both directions. Then for each direction $i \in \{X, Y\}$, if the distance $d_a^i$ to the closest agent is less than the threshold $\delta_i$, the loss item of this constraint is $L_{col}^i = \delta_i - d_a^i$; otherwise it is 0. The loss for the ego-agent collision constraint can be formulated as:
$L_{col} = \frac{1}{T_f} \sum_{t=1}^{T_f} \sum_{i} L_{col}^{it}, \quad i \in \{X, Y\}$
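The rule above can be sketched in a few lines of NumPy. This is one possible reading of the text, not the released implementation. At each timestamp I take the closest agent by Euclidean distance and skip it if it lies beyond $\delta_a$. I then apply the hinge $\max(0, \delta_i - d_a^i)$ separately on each axis. The function name and all numeric values are illustrative assumptions.

```python
import numpy as np


def collision_loss(ego_traj, agent_trajs, delta_x, delta_y, delta_a):
    """Hinge-style collision penalty averaged over future timestamps.

    ego_traj:    (T_f, 2) planned ego waypoints
    agent_trajs: (N, T_f, 2) highest-confidence trajectory of each kept agent
    """
    T_f = ego_traj.shape[0]
    if agent_trajs.shape[0] == 0:
        return 0.0
    total = 0.0
    for t in range(T_f):
        diff = np.abs(agent_trajs[:, t] - ego_traj[t])   # per-axis distance, (N, 2)
        dist = np.linalg.norm(diff, axis=-1)
        nearest = dist.argmin()
        if dist[nearest] > delta_a:                      # no agent in range at this step
            continue
        for axis, delta in enumerate((delta_x, delta_y)):
            total += max(0.0, delta - diff[nearest, axis])
    return total / T_f


ego = np.stack([np.zeros(4), np.arange(1.0, 5.0)], axis=-1)   # driving straight along y
near = ego[None] + np.array([1.0, 2.0])                       # agent 1 m beside, 2 m ahead
far = ego[None] + np.array([10.0, 10.0])

loss_near = collision_loss(ego, near, delta_x=1.5, delta_y=3.0, delta_a=3.0)
loss_far = collision_loss(ego, far, delta_x=1.5, delta_y=3.0, delta_a=3.0)
print(loss_near, loss_far)
```

Under this reading, an agent that passes the $\delta_a$ gate is penalised on each axis independently. An agent outside the gate contributes nothing, however small its offset along one axis.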

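Is the loss applied as soon as either the X or the Y distance is small, or only when both are? If the longitudinal distance is large, a small lateral distance hardly matters and carries no collision risk; the risk only arises when both are small. The text does say that the closest agent is searched within a range $\delta_a$ in both directions first, so distant agents should already be excluded, but the exact gating needs to be checked in the code.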

Ego-Boundary Overstepping Constraint. This constraint aims to push the planning trajectory away from the road boundary so that the trajectory can be kept in the drivable area. We first filter out low-confidence map predictions according to a threshold $\epsilon_m$. Then for each future timestamp, we calculate the distance $d_{bd}^t$ between the planning waypoint and its closest map boundary line. Then the loss for this constraint is formulated as:
$L_{bd} = \frac{1}{T_f} \sum_{t=1}^{T_f} \max(0, \delta_{bd} - d_{bd}^t)$
where $\delta_{bd}$ is a positive safety distance threshold that keeps the planned trajectory a certain distance away from the road boundary. If $d_{bd}^t$ is smaller than $\delta_{bd}$, the loss term is positive and pushes the trajectory away from the boundary; if $d_{bd}^t$ is greater than or equal to $\delta_{bd}$, the term is 0, meaning the waypoint is already at a safe distance.
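A small NumPy sketch of this hinge loss. Measuring $d_{bd}^t$ as the point-to-segment distance to the nearest boundary polyline is my own assumption; the actual code may measure distance to the predicted boundary points instead. The function names and the toy geometry are illustrative.

```python
import numpy as np


def point_to_polyline(p, line):
    """Shortest distance from point p (2,) to a polyline given as (N_p, 2) vertices."""
    a, b = line[:-1], line[1:]
    ab = b - a
    # Projection parameter of p onto each segment, clamped to the segment.
    t = np.clip(((p - a) * ab).sum(-1) / np.maximum((ab * ab).sum(-1), 1e-12), 0.0, 1.0)
    proj = a + t[:, None] * ab
    return float(np.linalg.norm(p - proj, axis=-1).min())


def boundary_loss(ego_traj, boundaries, delta_bd):
    """Mean over timestamps of max(0, delta_bd - distance to nearest boundary)."""
    if len(boundaries) == 0:
        return 0.0
    terms = []
    for p in ego_traj:
        d = min(point_to_polyline(p, line) for line in boundaries)
        terms.append(max(0.0, delta_bd - d))
    return float(np.mean(terms))


boundary = np.array([[2.0, 0.0], [2.0, 5.0], [2.0, 10.0]])   # straight boundary at x = 2
ego = np.array([[0.0, 1.0], [1.0, 2.0], [1.5, 3.0]])         # drifting toward it
loss_bd = boundary_loss(ego, [boundary], delta_bd=1.0)
print(loss_bd)   # only the last waypoint (0.5 m away) is penalised
```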
Ego-Lane Directional Constraint. Ego-lane directional constraint comes from a prior that the vehicle's motion direction should be consistent with the direction of the lane in which the vehicle is located. The directional constraint leverages the vectorized lane direction to regularize the motion direction of our planning trajectory. Specifically, first, we filter out low-confidence map predictions according to $\epsilon_m$. Then we find the closest lane divider vector $\hat v_m \in \mathbb{R}^{T_f \times 2 \times 2}$ (within a certain range $\delta_{dir}$) from our planning waypoint at each future timestamp. Finally, the loss for this constraint is the angular difference averaged over time between the lane vector and the ego vector:
$L_{dir} = \frac{1}{T_f} \sum_{t=1}^{T_f} F_{ang}(\hat v_m^t, \hat v_{ego}^t)$
in which $\hat v_{ego} \in \mathbb{R}^{T_f \times 2 \times 2}$ denotes the planning ego vectors. $\hat v_{ego}^t$ denotes the ego vector starting from the planning waypoint at the previous timestamp $t - 1$ and pointing to the planning waypoint at the current timestamp $t$. $F_{ang}(v_1, v_2)$ denotes the angular difference between vector $v_1$ and vector $v_2$.
One way to compute the angle between two vectors is $\arccos\left(\frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}\right)$. The loss $L_{dir}$ thus measures the average angular difference between the planned motion vectors and the lane direction vectors, and minimizing it aligns the planned trajectory with the lane direction.
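A sketch of the directional loss in NumPy. It takes $F_{ang}$ to be the arccos of the normalised dot product and assumes the matched lane vector for each timestamp is already given as a (start, end) point pair. Both choices are assumptions for illustration, as is using the current ego position as the waypoint at $t = 0$.

```python
import numpy as np


def angle_between(v1, v2, eps=1e-8):
    """Angle in radians between corresponding rows of v1 and v2."""
    cos = (v1 * v2).sum(-1) / (np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))


def direction_loss(ego_traj, lane_vecs, start=None):
    """Mean angular difference between ego motion vectors and matched lane vectors.

    ego_traj:  (T_f, 2) planned waypoints
    lane_vecs: (T_f, 2, 2) closest lane vector per timestamp as (start point, end point)
    start:     ego position at t = 0; defaults to the origin (assumption)
    """
    start = np.zeros(2) if start is None else start
    pts = np.vstack([start, ego_traj])
    ego_vecs = pts[1:] - pts[:-1]                     # waypoint t-1 -> waypoint t
    lane_dir = lane_vecs[:, 1] - lane_vecs[:, 0]      # lane direction per timestamp
    return float(angle_between(lane_dir, ego_vecs).mean())


ego = np.array([[1.0, 1.0], [2.0, 2.0]])              # heading 45 degrees off the lane
lanes = np.array([[[0.0, 0.0], [0.0, 1.0]],
                  [[0.0, 1.0], [0.0, 2.0]]])          # lane pointing along +y
loss_dir = direction_loss(ego, lanes)
print(loss_dir)   # close to pi / 4
```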

3.4. End-to-End Learning

Vectorized Scene Learning Loss. Vectorized scene learning includes vectorized map learning and vectorized motion prediction. For vectorized map learning, Manhattan distance is adopted to calculate the regression loss between the predicted map points and the ground truth map points. Besides, focal loss [30] is used as the map classification loss. The overall map loss is denoted as $L_{\text{map}}$.
Vectorized Constraint Loss. The vectorized constraint loss is composed of the three constraints proposed in Sec. 3.3, i.e., ego-agent collision constraint $L_{\text{col}}$, ego-boundary overstepping constraint $L_{\text{bd}}$, and ego-lane directional constraint $L_{\text{dir}}$, which regularize the planning trajectory $\hat V_{ego}$ with vectorized scene representation.
Imitation Learning Loss. The imitation learning loss $L_{\text{imi}}$ is an $l_1$ loss between the planning trajectory $\hat V_{ego}$ and the ground truth ego trajectory $V_{ego}$, aiming at guiding the planning trajectory with expert driving behavior. $L_{\text{imi}}$ is formulated as follows:
$L_{\text{imi}} = \frac{1}{T_f} \sum_{t=1}^{T_f} \| V_{ego}^t - \hat V_{ego}^t \|_1$
The $l_1$ loss, also called absolute-error loss, is a common way to measure the gap between predicted and ground-truth values. By minimizing $L_{\text{imi}}$, the model learns to produce planning trajectories similar to the expert's driving, so it learns from expert trajectories and not only from rules or constraints.
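VAD is end-to-end trainable based on the proposed vectorized planning constraint. The overall loss for end-to-end learning is the weighted sum of vectorized scene learning loss, vectorized planning constraint loss, and imitation learning loss:
$L = \omega_1 L_{\text{map}} + \omega_2 L_{\text{mot}} + \omega_3 L_{\text{col}} + \omega_4 L_{\text{bd}} + \omega_5 L_{\text{dir}} + \omega_6 L_{\text{imi}}$
Here $L_{\text{mot}}$ stands for the motion prediction part of the scene learning loss. Note that VAD stands for Vectorized Autonomous Driving. Adjusting the weights controls how much the model emphasizes each aspect during training (map understanding, motion prediction, planning constraints, imitation learning), which gives some flexibility in tuning the model for different driving scenarios and requirements.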
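As a sketch of how the pieces combine, the snippet below computes the $l_1$ imitation term and a weighted sum of named loss terms. All loss values and weights are made-up placeholders for illustration; they are not the values used in the paper or in the released code.

```python
import numpy as np


def imitation_loss(pred, gt):
    """Mean over timestamps of the L1 distance between (T_f, 2) waypoint arrays."""
    return float(np.abs(gt - pred).sum(axis=-1).mean())


def total_loss(losses, weights):
    """Weighted sum of loss terms; both dicts must share the same keys."""
    if set(losses) != set(weights):
        raise KeyError("losses and weights must have identical keys")
    return sum(weights[name] * losses[name] for name in weights)


pred = np.array([[0.0, 0.0], [1.0, 1.0]])
gt = np.array([[0.0, 1.0], [1.0, 3.0]])

# Placeholder values for the other terms (hypothetical, for illustration only).
losses = {"map": 0.8, "mot": 0.5, "col": 0.25, "bd": 0.0, "dir": 0.125,
          "imi": imitation_loss(pred, gt)}
weights = {name: 1.0 for name in losses}
weights["imi"] = 2.0                      # emphasise imitation in this toy example

loss = total_loss(losses, weights)
print(losses["imi"], loss)
```

Because every term is differentiable with respect to the predicted vectors, a single backward pass through this sum can update perception, prediction, and planning together.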

4. Experiments

We conduct experiments on the challenging public nuScenes [1] dataset, which contains 1000 driving scenes, and each scene roughly lasts for 20 seconds. nuScenes provides 1.4M 3D bounding boxes of 23 categories in total. The scene images are captured by 6 cameras covering 360° FOV horizontally, and the keyframes are annotated at 2Hz. Following previous works [19, 21], Displacement Error (DE) and Collision Rate (CR) are adopted to comprehensively evaluate the planning performance. For the closed-loop setting, we adopt CARLA simulator [12] and the Town05 [42] benchmark for simulation. Following previous works [19, 42], Route Completion (RC) and Driving Score (DS) are used to evaluate the planning performance.
The key points of this paragraph:

nuScenes dataset: a public autonomous driving dataset with 1000 driving scenes of roughly 20 seconds each, providing 1.4M 3D bounding boxes across 23 categories.
Data capture: scene images come from 6 cameras that together cover a 360° horizontal field of view; keyframes are annotated at 2Hz.
Open-loop metrics (nuScenes):
- Displacement Error (DE): the distance between the planned trajectory and the ground-truth ego trajectory.
- Collision Rate (CR): how often the planned trajectory collides with other agents in the recorded scene.
Closed-loop setting: simulation in the CARLA simulator on the Town05 benchmark.
Closed-loop metrics (CARLA):
- Route Completion (RC): how much of the prescribed route the agent completes.
- Driving Score (DS): in the CARLA benchmarks, route completion weighted by a penalty for infractions such as collisions and traffic-rule violations.

Together, the open-loop and closed-loop metrics give a fuller picture of planning performance than either setting alone.
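For intuition, Displacement Error can be sketched as the L2 distance between planned and ground-truth waypoints, read off at fixed horizons. The 1 s / 2 s / 3 s horizons and the six-step trajectory are assumptions on my part, and evaluation protocols differ between papers (for example, error at a horizon versus error averaged up to it). Treat this as an illustration of the idea, not the benchmark's exact definition.

```python
import numpy as np


def displacement_error(pred, gt):
    """Per-timestamp L2 distance between (T_f, 2) planned and ground-truth waypoints."""
    return np.linalg.norm(pred - gt, axis=-1)


def de_at_horizons(pred, gt, hz=2.0, horizons_s=(1.0, 2.0, 3.0)):
    """Displacement error at selected horizons, given waypoints sampled at `hz`."""
    de = displacement_error(pred, gt)
    return {h: float(de[int(round(h * hz)) - 1]) for h in horizons_s}


# Ground truth: straight ahead, 1 m per step at 2 Hz (toy data).
gt = np.stack([np.zeros(6), np.arange(1.0, 7.0)], axis=-1)
pred = gt + np.array([0.3, 0.4])          # constant offset of 0.5 m
errors = de_at_horizons(pred, gt)
print(errors)
```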
Table 1. Open-loop planning performance. Key points of the table caption:

Open-loop planning performance: the planned trajectory is evaluated against recorded data, without the plan feeding back into the environment.
VAD: achieves the best end-to-end planning performance and the fastest inference speed on the nuScenes val dataset.
LiDAR-based methods: methods that use LiDAR as the main sensor input are marked with † in the table.
Ego status information: the state of the ego vehicle, such as speed and acceleration. It is deactivated in the open-loop evaluation for a fair comparison.
FPS (frames per second): the inference speed. The FPS of ST-P3 and VAD is measured on an NVIDIA Geforce RTX 3090 GPU, while the FPS of UniAD is measured on an NVIDIA Tesla A100 GPU.
nuScenes val dataset: the validation split of nuScenes, used here for performance evaluation.

The table shows the advantage of VAD in planning, especially in inference speed, relative to the other methods.
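Table 2 (image not reproduced): ablation study on the design choices of the planning module. The column headers mean:

Ablation study: a method that removes or changes one part of a system at a time to measure that part's contribution to overall performance.
Agent Inter. and Map Inter.:
- Agent Inter. is the query interaction between the ego vehicle and agents (other traffic participants) in the planning module.
- Map Inter. is the query interaction between the ego vehicle and the map in the planning module.
OverStep. Const., Dir. Const. and Col. Const.:
- OverStep. Const. is the Ego-Boundary Overstepping Constraint, which keeps the planned trajectory from crossing the road boundary.
- Dir. Const. is the Ego-Lane Directional Constraint, which keeps the direction of the planned trajectory consistent with the direction of the lane.
- Col. Const. is the Ego-Agent Collision Constraint, which keeps the planned trajectory from colliding with other agents.

The ablation shows:

How each interaction (ego with agents, ego with map) affects planning performance.
How each constraint (boundary overstepping, lane direction, collision) affects planning performance.

This identifies which components are essential and which could be simplified or improved.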
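Table 3 (image not reproduced): ablation study on the map representation. The key points are:

Map representation: the study compares a rasterized map with a vectorized map.
Ablation study: swapping the map representation while keeping the rest of the model fixed isolates its effect on planning.
Rasterized map: the map is divided into a grid of small cells, and each cell stores information about that location, such as its semantic class.
Vectorized map: map elements such as lane dividers, road boundaries and pedestrian crossings are represented as ordered sets of points.
OverStep. Const.: the Ego-Boundary Overstepping Constraint, which keeps the planned trajectory from crossing the road boundary. It relies on instance-level boundary geometry, which a rasterized map does not provide directly.
Dir. Const.: the Ego-Lane Directional Constraint, which keeps the direction of the planned trajectory consistent with the direction of the lane. It likewise depends on the map representation.
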
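Table 4 (image not reproduced): closed-loop simulation results. The key points are:

Closed-loop simulation: the system's planned trajectory is executed in the simulator, and the next observation depends on what the vehicle actually did. This is closer to real driving than open-loop evaluation.
VAD: achieves the best closed-loop performance among vision-only end-to-end planning methods on CARLA.
CARLA [12]: an open-source simulator widely used in autonomous driving research. It provides virtual towns with configurable driving scenarios and traffic.
Vision-only end-to-end planning: VAD plans from camera images and does not use LiDAR or other sensor data.
†: marks LiDAR-based methods.
Metrics:
- Route Completion (RC): the share of the prescribed route that the agent completes.
- Driving Score (DS): route completion weighted by a penalty for infractions such as collisions and traffic-rule violations.

Closed-loop evaluation in CARLA tests whether the planner remains safe and makes progress when its own decisions shape the traffic situation that follows.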
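The relationship between the two closed-loop metrics can be sketched as follows, assuming the CARLA-leaderboard-style definition in which the driving score is route completion multiplied by an infraction penalty. The infraction names and coefficient values in the example are illustrative assumptions; they are not quoted from the paper or from the Town05 benchmark configuration.

```python
from typing import Dict


def infraction_penalty(counts: Dict[str, int],
                       coefficients: Dict[str, float]) -> float:
    """Multiply one coefficient (< 1) into the penalty per recorded infraction."""
    penalty = 1.0
    for kind, count in counts.items():
        penalty *= coefficients[kind] ** count
    return penalty


def driving_score(route_completion_pct: float, penalty: float) -> float:
    """Route completion (in percent) weighted by the infraction penalty."""
    return route_completion_pct * penalty


# Illustrative coefficients only; a real benchmark defines its own values.
coefficients = {"collision_vehicle": 0.6, "red_light": 0.7}
counts = {"collision_vehicle": 1, "red_light": 2}

penalty = infraction_penalty(counts, coefficients)
print(driving_score(80.0, penalty))
```

Under this definition a method can finish more of the route and still score lower if it commits more infractions, which is how a higher RC can coexist with a lower DS.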

Table 5 (image not reproduced): module runtime. The key points are:

Module runtime: the time each component of the system takes to process one input.
Inference speed: how fast the model runs on given hardware, usually reported in frames per second (FPS).
VAD-Tiny: the lightweight variant of VAD, designed for higher inference speed and lower compute cost.
NVIDIA GeForce RTX 3090 GPU: the hardware on which the VAD-Tiny inference speed is measured.
Measurement: the table lists the runtime of each VAD-Tiny module as measured on that GPU.

The breakdown shows where the compute goes. Inference speed matters for deployment because it bounds the system's response time, so the per-module numbers indicate which parts to optimize first.

4.1. Implementation Details

VAD uses 2-second history information and plans a 3-second future trajectory. ResNet50 [16] is adopted as the default backbone network for encoding image features. VAD performs vectorized mapping and motion prediction for a 60m × 30m perception range longitudinally and laterally. We have two variants of VAD, which are VAD-Tiny and VAD-Base. VAD-Base is the default model for the experiments. The default number for BEV query, map query, and agent query is 200 × 200, 100 × 20, and 300, respectively. There is a total of 100 map vector queries, each containing 20 map points. The feature dimension and the default hidden size are 256. Compared with VAD-Base, VAD-Tiny has fewer BEV queries, which is 100 × 100. The number of BEV encoder layers and decoder layers of the motion and map modules is reduced from 6 to 3, and the input image size is reduced from 1280 × 720 to 640 × 360.
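This passage describes VAD and its two variants. The key points are:

History and horizon: VAD uses 2 seconds of history to plan a 3-second future trajectory.
Backbone: ResNet50 [16] is the default network for encoding image features.
Perception range: vectorized mapping and motion prediction cover 60 m longitudinally and 30 m laterally.
VAD variants:
- VAD-Tiny: the lightweight version with lower compute requirements.
- VAD-Base: the default model used in the experiments.
Query counts:
- BEV (bird's-eye view) queries: 200×200 for VAD-Base, 100×100 for VAD-Tiny.
- Map queries: 100×20.
- Agent queries: 300.
Map vector queries: 100 map vector queries, each containing 20 map points.
Feature dimension and hidden size: 256 by default.
Changes in VAD-Tiny:
- BEV queries reduced to 100×100.
- BEV encoder layers and the decoder layers of the motion and map modules reduced from 6 to 3.
- Input image size reduced from 1280×720 to 640×360.

With these choices VAD-Tiny trades some planning quality for much lower compute, which suits hardware-constrained settings, while VAD-Base targets the best planning performance.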
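The two variants can be summarized as a small configuration sketch. The class and field names below are hypothetical and do not correspond to the configuration keys in the VAD repository; only the numeric values come from the implementation details quoted above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VADConfig:
    # Field names are illustrative; values follow the quoted implementation details.
    bev_h: int
    bev_w: int
    num_layers: int = 6                       # BEV encoder and motion/map decoder layers
    image_size: Tuple[int, int] = (1280, 720)
    num_map_vectors: int = 100
    points_per_map_vector: int = 20
    num_agent_queries: int = 300
    hidden_dim: int = 256

    @property
    def num_bev_queries(self) -> int:
        return self.bev_h * self.bev_w

    @property
    def num_map_points(self) -> int:
        return self.num_map_vectors * self.points_per_map_vector


VAD_BASE = VADConfig(bev_h=200, bev_w=200)
VAD_TINY = VADConfig(bev_h=100, bev_w=100, num_layers=3, image_size=(640, 360))

for name, cfg in (("VAD-Base", VAD_BASE), ("VAD-Tiny", VAD_TINY)):
    print(name, cfg.num_bev_queries, cfg.num_map_points, cfg.num_layers, cfg.image_size)
```

Halving each BEV side cuts the number of BEV queries by a factor of four, which is one reason the Tiny variant is so much cheaper.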
As for training, the confidence thresholds ϵa and ϵm are set to 0.5, the distance thresholds δa, δbd and δdir are 3.0m, 1.0m, and 2.0m, respectively. The agent safety thresholds δX and δY are set to 1.5m and 3.0m. We use AdamW [36] optimizer and Cosine Annealing [35] scheduler to train VAD with weight decay 0.01 and initial learning rate 2 × 10⁻⁴. VAD is trained for 60 epochs on 8 NVIDIA GeForce RTX 3090 GPUs with batch size 1 per GPU.
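For readers unfamiliar with cosine annealing, the schedule can be written as a short function. This is the standard cosine formula without warm-up or restarts; the minimum learning rate of 0 and the per-epoch stepping are assumptions, since the excerpt only states the initial learning rate and the number of epochs.

```python
import math


def cosine_annealing_lr(step: int,
                        total_steps: int,
                        base_lr: float = 2e-4,
                        min_lr: float = 0.0) -> float:
    """Learning rate at `step`, decaying from base_lr to min_lr along a half cosine."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle and end of a 60-epoch run.
for epoch in (0, 30, 60):
    print(epoch, cosine_annealing_lr(epoch, total_steps=60))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.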
VAD-Base is adopted for the closed-loop evaluation. The input image size is 640 × 320. Following previous works [19, 42], the navigation information includes a sparse goal location and a corresponding discrete navigational command. This navigation information is encoded by an MLP and sent to the planning head as one of the input features. Besides, we add a traffic light classification branch to recognize the traffic signal. Specifically, it consists of a ResNet50 network and an MLP-based classification head. The input of this branch is the cropped front-view image, corresponding to the upper middle part of the image. The image feature map is flattened and also sent to the planning head to help the model realize the traffic light information.
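This passage describes how VAD-Base is configured for the closed-loop evaluation. The key points are:

VAD-Base: the model used for the closed-loop evaluation.
Input image size: 640×320 pixels.
Navigation information:
- A sparse goal location and a corresponding discrete navigational command.
- Both are encoded by a multilayer perceptron (MLP) and passed to the planning head as one of its input features.
MLP: turns the navigation information into a feature vector the planning head can consume.
Traffic light classification branch:
- Added so that the system can recognize the traffic signal.
- Consists of a ResNet50 network and an MLP-based classification head.
Branch input:
- A crop of the front-view image, taken from its upper middle part.
- The image feature map is flattened and also sent to the planning head, so the planner has access to traffic light information.
Planning head: receives the encoded navigation information and the traffic light features in addition to its usual inputs, and outputs the planned trajectory.

With these additions VAD-Base can:

Follow a route specified by goal locations and navigational commands.
Recognize and respond to traffic lights.

Both are required in closed-loop driving, where the vehicle must reach a destination and obey signals rather than only imitate a logged trajectory.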

4.2. Main Results

Open-loop planning results. As shown in Tab. 1, VAD shows great advantages in both performance and speed compared with the previous SOTA method [21]. On one hand, VAD-Tiny and VAD-Base greatly reduce the average planning displacement error by 0.25m and 0.31m. Meanwhile, VAD-Base greatly reduces the average collision rates by 29.0%. On the other hand, because VAD does not need many auxiliary tasks (e.g., tracking and occupancy prediction) and tedious post-processing steps, it achieves the fastest inference speed based on the vectorized scene representation. VAD-Tiny runs 9.3× faster while keeping a comparable planning performance. VAD-Base achieves the best planning performance and still runs 2.5× faster. It is worth noticing that in the main results, VAD omits ego status features to avoid shortcut learning in the open-loop planning [50], but the results of VAD using ego status features are still preserved in Tab. 1 for reference.
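This passage summarizes the open-loop planning results and compares them with the previous state-of-the-art (SOTA) method [21]. The key points are:

Performance and speed: VAD is ahead of the previous SOTA method on both.
Lower displacement error:
- VAD-Tiny reduces the average planning displacement error by 0.25 m.
- VAD-Base reduces the average planning displacement error by 0.31 m.
Lower collision rate:
- VAD-Base reduces the average collision rate by 29.0% relative to the previous SOTA method.
Inference speed:
- VAD needs few auxiliary tasks (such as tracking and occupancy prediction) and no tedious post-processing, so its vectorized scene representation yields the fastest inference speed.
- VAD-Tiny runs 9.3× faster while keeping comparable planning performance.
- VAD-Base achieves the best planning performance and still runs 2.5× faster.
Avoiding shortcut learning:
- In the main results VAD omits ego status features, because they allow shortcut learning in open-loop planning [50].
- Results for VAD with ego status features are still listed in Tab. 1 for reference.
VAD variants:
- VAD-Tiny and VAD-Base make different trade-offs between compute cost and planning quality.

These results show that in open-loop planning VAD delivers both better planning quality and faster inference, which matters for applications that must react in real time.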
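The two kinds of relative figures quoted above can be made concrete with a short sketch. It assumes that "reduces by 29.0%" is a relative reduction against the baseline value and that "N× faster" is the ratio of frames per second; the numbers in the example are hypothetical and are not taken from Tab. 1.

```python
def relative_reduction_pct(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


def speedup(baseline_fps: float, new_fps: float) -> float:
    """How many times faster the new method runs, as a ratio of FPS."""
    return new_fps / baseline_fps


# Hypothetical numbers, for illustration only.
print(relative_reduction_pct(baseline=0.40, new=0.30))
print(speedup(baseline_fps=2.0, new_fps=10.0))
```

Note that a relative reduction in collision rate is not the same as a reduction in percentage points, which is easy to misread when the metric itself is a percentage.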
Closed-loop planning results. VAD outperforms previous SOTA vision-only end-to-end planning methods [19, 42] on the Town05 Short benchmark. Compared to ST-P3 [19], VAD greatly improves DS by 9.15 and has a better RC. On the Town05 Long benchmark, VAD achieves 30.31 DS, which is close to the LiDAR-based method [42], while significantly improving RC from 56.36 to 75.20. ST-P3 [19] obtains better RC but has a much worse DS.
This passage summarizes the closed-loop planning results. The key points are:

Town05 Short benchmark: VAD outperforms the previous SOTA vision-only end-to-end planning methods [19, 42].
Comparison with ST-P3 [19] on Town05 Short:
- VAD improves Driving Score (DS) by 9.15.
- VAD also has a better Route Completion (RC).
Town05 Long benchmark:
- VAD reaches a DS of 30.31, close to the LiDAR-based method [42].
- VAD raises RC from 56.36 to 75.20 relative to that method.
- ST-P3 [19] obtains a better RC but a much worse DS, meaning it completes more of the route while incurring more infractions.
Vision-only end-to-end planning: VAD plans directly from camera images and does not rely on other sensor data.
LiDAR versus vision: a vision-only method coming close to a LiDAR-based method on DS suggests that camera-based planning is a viable option, with advantages in sensor cost and availability.

Overall, the closed-loop results show that VAD remains effective when its own decisions drive the simulation, and that it does so using camera input only.

4.3. Ablation Study

Effectiveness of designs. Tab. 2 shows the effectiveness of our design choices. First, because map can provide critical guidance for planning, the planning distance error is much larger without ego-map interaction (ID 1). Second, the ego-agent interaction and ego-map interaction provide implicit scene features for the ego query so that the ego car can realize others’ driving intentions and plan safely. The collision rate becomes much higher without interaction (ID 1-2). Finally, the collision rate can be reduced with any of the vectorized planning constraints (ID 4-6). When utilizing the three constraints together, VAD achieves the lowest collision rate and the best planning accuracy (ID 7).
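This passage summarizes what Tab. 2 shows about the design choices. The key points are:

Importance of map interaction:
- The map provides critical guidance for planning. Without ego-map interaction (ID 1), the planning distance error is much larger.
Role of the ego interactions:
- Ego-agent and ego-map interaction give the ego query implicit scene features.
- With these features the ego car can take other agents' driving intentions into account and plan safely.
Effect on collision rate:
- Without the interactions (ID 1-2), the collision rate is much higher.
Effect of the vectorized planning constraints:
- Adding any single constraint (ID 4-6) reduces the collision rate.
Combining the constraints:
- Using all three constraints together gives the lowest collision rate and the best planning accuracy (ID 7).
Configuration IDs:
- The IDs referenced in the text (1 through 7) each denote one combination of interactions and constraints.
Vectorized planning constraints:
- The Ego-Boundary Overstepping Constraint (OverStep. Const.), the Ego-Lane Directional Constraint (Dir. Const.) and the Ego-Agent Collision Constraint (Col. Const.).

The ablation shows that both the interactions and the constraints contribute: the interactions mainly supply scene context to the planner, and the constraints mainly reduce collisions.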
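One simple way to turn a safety threshold into a training penalty is a hinge: the loss is zero while the gap to another agent is at least the threshold and grows linearly once the gap falls below it. The sketch below uses the thresholds δX = 1.5 m and δY = 3.0 m from the implementation details, but the hinge form, the reading of X as lateral and Y as longitudinal, and the function names are assumptions made for illustration, not the paper's exact formulation.

```python
from typing import Sequence


def hinge_penalty(distance: float, threshold: float) -> float:
    """Zero when distance >= threshold, otherwise the shortfall in metres."""
    return max(0.0, threshold - distance)


def collision_penalty(lateral_gaps: Sequence[float],
                      longitudinal_gaps: Sequence[float],
                      delta_x: float = 1.5,
                      delta_y: float = 3.0) -> float:
    """Average per-timestep penalty over a planned trajectory.

    Each gap is the distance (metres) from the planned ego position to the
    closest predicted agent position at that future timestep.
    """
    if len(lateral_gaps) != len(longitudinal_gaps):
        raise ValueError("gap sequences must have the same length")
    total = 0.0
    for dx, dy in zip(lateral_gaps, longitudinal_gaps):
        total += hinge_penalty(dx, delta_x) + hinge_penalty(dy, delta_y)
    return total / len(lateral_gaps)


# Step 1 is too close laterally; step 2 is too close longitudinally.
print(collision_penalty(lateral_gaps=[1.0, 2.0], longitudinal_gaps=[4.0, 2.0]))
```

Because the penalty is zero whenever the plan keeps its distance, such a term only influences training in the samples where the planned trajectory is unsafe.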
Rasterized map representation. We show the results of a VAD variant with rasterized map representation in Tab. 3. Specifically, this VAD variant utilizes map queries to perform BEV map segmentation instead of vectorized map detection, and the updated map queries are used in the planning transformer the same as VAD. As shown in Tab. 3, VAD with rasterized map representation suffers from a much higher collision rate.
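This passage introduces a VAD variant that uses a rasterized map representation, with results in Tab. 3. The key points are:

Rasterized map representation:
- The variant replaces vectorized map detection with a rasterized map.
Use of map queries:
- The map queries perform BEV map segmentation instead of vectorized map detection.
Planning transformer:
- The updated map queries are fed to the planning transformer in the same way as in VAD.
Effect on collision rate:
- According to Tab. 3, the rasterized variant has a much higher collision rate.
Comparison:
- The higher collision rate suggests that the vectorized map is the more effective representation for safe planning.
Possible explanation:
- A rasterized map does not expose instance-level structure such as individual boundaries, so the planner may receive less precise guidance about where the drivable area ends.
Vectorized versus rasterized:
- A vectorized map represents elements such as lane dividers, road boundaries and pedestrian crossings as point sets, which keeps instance-level structure and precise geometry available to the planner.

The comparison supports the paper's central claim that the choice of map representation directly affects planning safety.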
Runtime of each module. We evaluate the runtime of each module of VAD-Tiny, and the results are shown in Tab. 5. Backbone and BEV Encoder take most of the runtime for feature extraction and transformation. Then motion module and map module take 34.6% of the total runtime to accomplish multi-agent vectorized motion prediction and vectorized map prediction. The runtime of the planning module is only 3.4ms, thanks to the sparse vectorized representation and concise model design.
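This passage reports the runtime of each VAD-Tiny module, with results in Tab. 5. The key points are:

Runtime evaluation: each module of VAD-Tiny is timed separately.
Most expensive modules:
- Backbone and BEV Encoder take most of the runtime; they extract and transform the image features.
Motion and map modules:
- Together they take 34.6% of the total runtime for multi-agent vectorized motion prediction and vectorized map prediction.
Planning module:
- It takes only 3.4 ms, thanks to the sparse vectorized representation and the concise model design.
Sparse vectorized representation:
- The planner attends to a small set of agent and map instances rather than a dense grid, so it processes far less data.
Model design:
- A concise planning head keeps the amount of computation small.
Efficiency:
- The short planning runtime means planning itself adds little latency to the pipeline.

The breakdown identifies where the time goes. Since the planning module is already cheap, further speed gains would have to come mainly from the backbone and BEV encoder, followed by the motion and map modules.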

4.4. Qualitative Results

We show three vectorized scene learning and planning results of VAD in Fig. 4. For a better understanding of the scene, we also provide raw surrounding camera images and project the planning trajectories to the front camera image. VAD can predict multi-modality agent motions and map elements accurately, as well as plan the ego future movements reasonably according to the vectorized scene representation.
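This passage describes the qualitative results in Fig. 4. The key points are:

Vectorized scene learning and planning results: Fig. 4 shows three examples covering map prediction, motion prediction and planning.
Scene context: the raw surround-view camera images are shown alongside the results.
Trajectory visualization: the planned trajectory is projected onto the front camera image, which shows directly where the vehicle intends to go.
Multi-modality agent motion prediction: VAD predicts several possible future trajectories for other agents.
Map element prediction: VAD predicts map elements such as lane dividers, road boundaries and pedestrian crossings accurately.
Ego planning: VAD plans reasonable future ego motion based on the vectorized scene representation.
Showing the predictions next to the raw images makes it easier to judge how the system perceives the scene and why it plans the way it does.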

5. Conclusion

In this paper, we explore the fully vectorized representation of the driving scene, and how to effectively incorporate the vectorized scene information for better planning performance. The resulting end-to-end autonomous driving paradigm is termed VAD. VAD achieves both high performance and high efficiency, which are vital for the safety and deployment of an autonomous driving system. We hope the impressive performance of VAD can reveal the potential of vectorized paradigm to the community.
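The paper explores a fully vectorized representation of the driving scene and how to use the vectorized scene information for better planning. The key points are:

Vectorized representation: the driving scene is represented entirely with vectors, which preserves the geometry and instance-level structure of agents and map elements.
Using the vectorized information: the paper studies how to incorporate this information into planning effectively.
End-to-end paradigm: the resulting end-to-end autonomous driving paradigm is called VAD.
High performance and high efficiency:
- VAD achieves strong planning performance, in both accuracy and safety.
- VAD is also efficient, particularly in inference speed.
Safety and deployment: both properties are vital for the safety and the real-world deployment of an autonomous driving system.
Potential of the vectorized paradigm: the authors hope that VAD's results demonstrate this potential to the community and encourage further work.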
VAD predicts multi-modality motion trajectories for other dynamic agents. We use the most confident prediction in our collision constraint to improve planning safety. How to utilize the multi-modality motion predictions for planning is worthy of future discussion. Besides, how to incorporate other traffic information (e.g., lane graph, road sign, traffic light, and speed limit) into this autonomous driving system also deserves further exploration.
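This passage discusses limitations and future work. The key points are:
Multi-modality motion prediction:
- VAD predicts several possible future trajectories for other dynamic agents, such as vehicles and pedestrians.
Use of the most confident prediction:
- The collision constraint uses only the most confident (highest-confidence) prediction for each agent.
Using all modes in planning:
- How to exploit the full set of multi-modality predictions for planning is left for future discussion.
Other traffic information:
- Lane graphs, road signs, traffic lights and speed limits are not yet part of the system.
Future direction:
- Incorporating this additional traffic information into the system deserves further exploration.
Technical challenges:
- Doing so would require fusing heterogeneous information sources and adapting the prediction and planning modules accordingly.

Addressing these two points would bring VAD closer to practical deployment.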
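Selecting the most confident mode, as described above, amounts to an argmax over the per-mode confidence scores. The sketch below assumes each agent's prediction is a list of (confidence, trajectory) pairs; this data layout and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]


def most_confident_trajectory(modes: Sequence[Tuple[float, Trajectory]]) -> Trajectory:
    """Return the trajectory whose confidence score is highest."""
    if not modes:
        raise ValueError("at least one predicted mode is required")
    _, trajectory = max(modes, key=lambda mode: mode[0])
    return trajectory


# Three hypothetical modes for one agent: go straight, turn left, stop.
modes = [
    (0.6, [(0.0, 1.0), (0.0, 2.0)]),
    (0.3, [(-0.5, 1.0), (-1.5, 1.5)]),
    (0.1, [(0.0, 0.2), (0.0, 0.2)]),
]
print(most_confident_trajectory(modes))
```

The lower-confidence modes are discarded by this selection, which is exactly the information the authors suggest future work could use.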