VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

真诚的灰灰

Tags: autonomous driving, artificial intelligence, machine learning

Article link: https://blog.csdn.net/jch924583667/article/details/140739291

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Abstract

Learning a human-like driving policy from large-scale driving demonstrations is promising, but the uncertainty and non-deterministic nature of planning make it challenging. In this work, to cope with the uncertainty problem, we propose VADv2, an end-to-end driving model based on probabilistic planning. VADv2 takes multi-view image sequences as input in a streaming manner, transforms sensor data into environmental token embeddings, outputs the probabilistic distribution of action, and samples one action to control the vehicle. Using only camera sensors, VADv2 achieves state-of-the-art closed-loop performance on the CARLA Town05 benchmark, significantly outperforming all existing methods. It runs stably in a fully end-to-end manner, even without the rule-based wrapper. Closed-loop demos are presented at https://hgao-cv.github.io/VADv2.

1. Introduction

End-to-end autonomous driving has recently become an important and popular field. Large numbers of human driving demonstrations are easily available, so it seems promising to learn a human-like driving policy from large-scale demonstrations.
However, the uncertainty and non-deterministic nature of planning make it challenging to extract driving knowledge from driving demonstrations. To demonstrate such uncertainty, two scenarios are presented in Fig. 1:

1) Following another vehicle. The human driver has diverse reasonable driving maneuvers: keep following, or change lanes to overtake.
2) Interaction with an oncoming vehicle. The human driver has two possible driving maneuvers: yield or overtake.

From the perspective of statistics, the action (including its timing and speed) is highly stochastic, affected by many latent factors that cannot be modeled.

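Figure 1. Uncertainty exists in planning. There is no deterministic relation between environment and action. Deterministic planning cannot model such uncertainty, especially when the feasible solution space is non-convex. VADv2 is based on probabilistic planning and learns the environment-conditioned probabilistic distribution of action from large-scale driving demonstrations.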
Existing learning-based planning methods [23, 19, 21, 40, 16, 54] follow a deterministic paradigm to directly regress the action. The regression target $\hat{a}$ is the future trajectory in [23, 19, 21, 40] and the control signal (acceleration and steering) in [16, 54]. Such a paradigm assumes there exists a deterministic relation between environment and action, which is not the case. The variance of human driving behavior makes the regression target ambiguous. Especially when the feasible solution space is non-convex (see Fig. 1), deterministic modeling cannot cope and may output an in-between action, causing safety problems. Besides, such a deterministic regression-based planner tends to output the dominant trajectory, the one that appears most often in the training data (like stop or go straight), which results in undesirable planning performance.
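The "in-between action" failure can be made concrete with a toy example. The sketch below is not from the paper: the lateral offsets, the three-entry vocabulary, and the 50/50 split of demonstrations are all made-up values chosen only to show why regressing a bimodal target lands between the modes.

```python
# Toy scene: an obstacle sits straight ahead (lateral offset 0.0 m).
# Half of the (hypothetical) demonstrations pass on the left, half on the right.
demos = [-3.5, 3.5] * 50

# A deterministic regressor trained with an L2 loss converges to the mean target.
l2_optimum = sum(demos) / len(demos)

# A probabilistic planner over a discrete vocabulary keeps both modes separate.
vocabulary = [-3.5, 0.0, 3.5]
probs = {a: demos.count(a) / len(demos) for a in vocabulary}
best = max(probs, key=probs.get)

print("regressed action:", l2_optimum)    # the in-between action, through the obstacle
print("action probabilities:", probs)     # mass only on the two feasible maneuvers
print("selected action:", best)
```

The regressed action is the one maneuver that no demonstration ever performed, while the discrete distribution assigns it zero probability.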
In this work, we propose probabilistic planning to cope with the uncertainty of planning. As far as we know, VADv2 is the first work to use probabilistic modeling to fit the continuous planning action space, which is different from previous practices that use deterministic modeling for planning. We model the planning policy as an environment-conditioned non-stationary stochastic process, formulated as $p(a \mid o)$, where $o$ is the historical and current observations of the driving environment, and $a$ is a candidate planning action. Compared with deterministic modeling, probabilistic modeling can effectively capture the uncertainty in planning and achieve more accurate and safe planning performance.
The planning action space is a high-dimensional continuous spatiotemporal space. We resort to a probabilistic field function to model the mapping from the action space to the probabilistic distribution. Since directly fitting the continuous planning action space is not feasible, we discretize the planning action space to a large planning vocabulary and use large-scale driving demonstrations to learn the probability distribution of planning actions based on the planning vocabulary. For discretization, we collect all the trajectories in driving demonstrations and adopt the furthest trajectory sampling to select $N$ representative trajectories which serve as the planning vocabulary.
In effect, VADv2 learns from real driving data and trades the continuous action space for a manageable discrete vocabulary, so that it can produce diverse, statistically grounded planning behaviors in complex traffic.
Probabilistic planning has two other advantages. First, probabilistic planning models the correlation between each action and environment. Unlike deterministic modeling which only provides sparse supervision for the target planning action, probabilistic planning can provide supervision not only for the positive sample but also for all candidates in the planning vocabulary, which brings richer supervision information. Second, probabilistic planning is flexible in the inference stage. It outputs multi-mode planning results and is easy to combine with rule-based and optimization-based planning methods. We can also flexibly add other candidate planning actions to the planning vocabulary and evaluate them because we model the distribution over the whole action space.
This flexibility means that probabilistic planning can adapt to different planning requirements and constraints while still considering the full range of possible actions, because the distribution is modeled over the whole action space.
Based on the probabilistic planning, we present VADv2, an end-to-end driving model, which takes surround-view image sequence as input in a streaming manner, transforms sensor data into token embeddings, outputs the probabilistic distribution of action, and samples one action to control the vehicle. Only with camera sensors, VADv2 achieves state-of-the-art closed-loop performance on the CARLA Town05 benchmark, significantly outperforming all existing methods. Abundant closed-loop demos are presented at https://hgao-cv.github.io/VADv2. VADv2 runs stably in a fully end-to-end manner, even without the rule-based wrapper.
With this design, VADv2 can make more natural and flexible driving decisions through probabilistic planning, without relying on additional rule-based constraints.
Our contributions are summarized as follows:
• We propose probabilistic planning to cope with the uncertainty of planning. We design a probabilistic field to map from the action space to the probabilistic distribution, and learn the distribution of action from large-scale driving demonstrations.
• Based on the probabilistic planning, we present VADv2, an end-to-end driving model, which transforms sensor data into environmental token embeddings, outputs the probabilistic distribution of action, and samples one action to control the vehicle.
• In the CARLA simulator, VADv2 achieves state-of-the-art closed-loop performance on the Town05 benchmark. Closed-loop demos show it runs stably in an end-to-end manner.

2. Related Work

Perception. Perception is the first step in achieving autonomous driving, and a unified representation of driving scenes is beneficial for easy integration into downstream tasks. Bird’s Eye View (BEV) representation has become a common strategy in recent years, enabling effective scene feature encoding and multimodal data fusion. LSS [38] is a pioneering work that achieves the perspective view to BEV transformation by explicitly predicting depth for image pixels. BEVFormer [26, 52], on the other hand, avoids explicit depth prediction by designing spatial and temporal attention mechanisms, and achieves impressive detection performance. Subsequent works [25, 48] continuously improve performance in downstream tasks by optimizing temporal modeling and BEV transformation strategies. In terms of vectorized mapping, HDMapNet [24] converts lane segmentation into vector maps through post-processing. VectorMapNet [32] predicts vector map elements in an autoregressive manner. MapTR [29, 30] introduces permutation equivalence and hierarchical matching strategies, significantly improving mapping performance. LaneGAP [28] introduces path-wise modeling for lane graphs.
Together, these methods show how different techniques improve scene understanding and data fusion in autonomous driving, which in turn benefits the performance of the overall system.
Motion Prediction. Motion prediction aims to forecast the future trajectories of other traffic participants in driving scenes, assisting the ego vehicle in making informed planning decisions. Traditional motion prediction task utilizes input such as historical trajectories and high-definition maps to predict future trajectories. However, recent developments in end-to-end motion prediction methods [17, 53, 14, 22] perform perception and motion prediction jointly. In terms of scene representation, some works adopt rasterized image representations and employ CNN networks for prediction [3, 37]. Other approaches utilize vectorized representations and employ Graph Neural Networks [27] or Transformer models [13, 33, 36] for feature extraction and motion prediction. Some works [17, 53] see future motion as dense occupancy and flow instead of agent-level future waypoints. Some motion prediction methods [14, 22] adopt Gaussian Mixture Model (GMM) to regress multi-mode trajectories. It can be applied in planning to model uncertainty. But the number of modes is limited.
These methods improve the accuracy of future-motion prediction, which is essential for planning and decision-making. Joint perception and prediction lets the system understand its surroundings better, and multi-mode trajectory prediction with Gaussian mixture models offers one way to handle uncertainty, although further work may be needed in practice to cover more complex situations.
Planning. Learning-based planning has shown great potential recently due to its data-driven nature and impressive performance with increasing amounts of data. Early attempts [39, 8, 41] use a completely black-box spirit, where sensor data is directly used to predict control signals. However, this strategy lacks interpretability and is difficult to optimize. In addition, there are numerous studies combining reinforcement learning and planning [46, 5, 4]. By autonomously exploring driving behavior in closed-loop simulation environments, these approaches achieve or even surpass human-level driving performance. However, bridging the gap between simulation and reality, as well as addressing safety concerns, poses challenges in applying reinforcement learning strategies to real driving scenarios. Imitation learning [2, 18, 19, 23] is another research direction, where models learn expert driving behavior to achieve good planning performance and develop a driving style close to that of humans. In recent years, end-to-end autonomous driving has emerged, integrating perception, motion prediction, and planning into a single model, resulting in a fully data-driven approach that demonstrates promising performance. UniAD [19] cleverly integrates multiple perception and prediction tasks to enhance planning performance. VAD [23] explores the potential of vectorized scene representation for planning and gets rid of dense maps.
These developments suggest that integrating different perception and prediction techniques can markedly improve planning. End-to-end methods are particularly promising because a single framework handles the whole pipeline from perception to decision, while imitation learning and reinforcement learning provide ways to learn and optimize the decision policy.
Large Language Model in Autonomous Driving. The interpretability and logical reasoning abilities demonstrated by large language models (LLMs) can greatly assist in the field of autonomous driving. Recent research has explored the combination of LLMs and autonomous driving [7, 10, 12, 44, 51, 34, 31, 50, 49]. One line of work utilizes LLMs for driving scene understanding and evaluation through question-answering (QA) tasks. Another approach goes a step further by incorporating planning on top of LLM-based scene understanding. For instance, DriveGPT4 [51] takes inputs such as historical video and text (including questions and additional information like historical control signals). After encoding, these inputs are fed into an LLM, which predicts answers to the questions and control signals. LanguageMPC [44], on the other hand, takes in historical ground truth perception results and HD maps in the form of language descriptions. It then utilizes a Chain of Thought analysis approach to understand the scene, and the LLM finally predicts planning actions from a predefined set as output. Each action corresponds to a specific control signal for execution.
VADv2 draws inspiration from GPT [42, 43, 1, 47] to cope with the uncertainty problem. Uncertainty also exists in language modeling. Given a specific context, the next word is non-deterministic and probabilistic. An LLM learns the context-conditioned probabilistic distribution of the next word from a large-scale corpus, and samples one word from the distribution. Inspired by LLMs, VADv2 models the planning policy as an environment-conditioned non-stationary stochastic process. VADv2 discretizes the action space to generate a planning vocabulary, approximates the probabilistic distribution based on large-scale driving demonstrations, and samples one action from the distribution at each time step to control the vehicle. This lets VADv2 handle complex planning tasks in a way that resembles human driving decisions while accounting for the uncertainty and variability of the driving environment.

3. Method

The overall framework of VADv2 is depicted in Fig. 2. VADv2 takes multi-view image sequences as input in a streaming manner, transforms sensor data into environmental token embeddings, outputs the probabilistic distribution of action, and samples one action to control the vehicle. Large-scale driving demonstrations and scene constraints are used to supervise the predicted distribution.
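[Figure 2: overall framework of VADv2]

Concretely, the VADv2 pipeline consists of the following steps:

Multi-view image input: the system receives a stream of images from multiple cameras.
Sensor data transformation: the multi-view images are converted into environmental token embeddings that capture the key information in the scene.
Probability distribution output: based on the environmental token embeddings, the system outputs a probability distribution over candidate actions.
Action sampling: one action is sampled from this distribution and used to control the vehicle.
Supervision: large-scale driving demonstrations and scene constraints are used to train the model.

In this way, VADv2 performs end-to-end driving control while accounting for the uncertainty in planning, which is intended to make the system more robust and adaptable.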

3.1. Scene Encoder

Information in the image is sparse and low-level. We use an encoder to transform the sensor data into instance-level token embeddings $E_{\text{env}}$, to explicitly extract high-level information. $E_{\text{env}}$ includes four kinds of token: map token, agent token, traffic element token, and image token. VADv2 utilizes a group of map tokens [30, 29, 28] to predict the vectorized representation of the map (including lane centerline, lane divider, road boundary, and pedestrian crossing). Besides, VADv2 uses a group of agent tokens [22, 26] to predict other traffic participants' motion information (including location, orientation, size, speed, and multi-mode future trajectories). Traffic elements also play a vital role in planning. VADv2 transforms sensor data into traffic element tokens to predict the states of traffic elements. In CARLA, we consider two types of traffic signals: traffic light signals and stop signs. Map tokens, agent tokens, and traffic element tokens are supervised with corresponding supervision signals to make sure they explicitly encode corresponding high-level information. We also take image tokens as scene representations for planning, which contain rich information and are complementary to the instance-level tokens above. Besides, navigation information and ego state are also encoded into embeddings $\{E_{\text{navi}}, E_{\text{state}}\}$ with an MLP.

With this design, VADv2 converts raw sensor data into a form that is more useful for planning, so the system can better understand and predict its environment before making decisions.

3.2. Probabilistic Planning

We propose probabilistic planning to cope with the uncertainty of planning. We model the planning policy as an environment-conditioned non-stationary stochastic process, formulated as $p(a \mid o)$, where $o$ is the historical and current observations of the driving environment and $a$ is a candidate planning action. We approximate the planning action space as a probabilistic distribution based on large-scale driving demonstrations, and sample one action from the distribution at each time step to control the vehicle.
Sampling from a learned distribution lets the system account for the uncertainty in planning and produce action strategies that adapt to different environmental conditions, which is what allows it to choose a suitable action in complex and changing traffic.
The planning action space is a high-dimensional continuous spatiotemporal space $A = \{a \mid a \in \mathbb{R}^{2T}\}$. Here $\mathbb{R}^{2T}$ is the space of $T$ two-dimensional waypoints, i.e. a planar trajectory unrolled over time. Since directly fitting the continuous planning action space is not feasible, we discretize the planning action space to a large planning vocabulary $V = \{a_i\}^N$. Specifically, we collect all the planning actions in driving demonstrations and adopt the furthest trajectory sampling to select $N$ representative actions to serve as the planning vocabulary. Each trajectory in $V$ is sampled from driving demonstrations and thus naturally satisfies the kinematic constraints of the ego vehicle, which means that when the trajectory is converted into control signals (steer, throttle, and brake), the control signal values do not exceed the feasible range. By default, $N$ is set to 4096; a vocabulary of this size is meant to be rich and diverse enough to cover a wide range of driving situations.
This discretization turns a complex continuous planning problem into a discrete selection problem that is easier to manage and optimize: at each time step, the model picks one action from the vocabulary to control the vehicle.
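The vocabulary construction described above can be sketched as a greedy farthest-point selection over trajectories. This is a minimal illustration rather than the authors' implementation: the Euclidean distance between flattened trajectories, the arbitrary starting trajectory, the helper names, and the random stand-in demonstrations are all assumptions, and the vocabulary size is kept tiny instead of 4096.

```python
import math
import random

def trajectory_distance(a, b):
    """Euclidean distance between two flattened trajectories (x1, y1, ..., xT, yT)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def furthest_trajectory_sampling(trajectories, n):
    """Greedily pick n trajectories, each as far as possible from those already picked."""
    selected = [0]  # assumed: start from an arbitrary trajectory
    # min_dist[i] is the distance from trajectory i to its nearest selected trajectory.
    min_dist = [trajectory_distance(t, trajectories[0]) for t in trajectories]
    while len(selected) < min(n, len(trajectories)):
        nxt = max(range(len(trajectories)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, t in enumerate(trajectories):
            min_dist[i] = min(min_dist[i], trajectory_distance(t, trajectories[nxt]))
    return [trajectories[i] for i in selected]

def random_demo(T):
    """Stand-in for a recorded demonstration: T random (x, y) waypoints, flattened."""
    traj = []
    for _ in range(T):
        traj.extend([random.uniform(-5.0, 5.0), random.uniform(0.0, 30.0)])
    return traj

random.seed(0)
demos = [random_demo(T=6) for _ in range(200)]
vocabulary = furthest_trajectory_sampling(demos, n=16)
print(len(vocabulary), "representative trajectories selected")
```

Because every vocabulary entry is copied from the demonstrations rather than synthesized, it inherits whatever kinematic feasibility the recorded trajectories have.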
We represent each action in the planning vocabulary as a waypoint sequence $a = (x_1, y_1, x_2, y_2, \ldots, x_T, y_T)$. Each waypoint corresponds to a future timestamp. The probability $p(a)$ is assumed to be continuous with respect to $a$ and insensitive to small deviations of $a$, i.e., $\lim_{\Delta a \to 0} [p(a) - p(a + \Delta a)] = 0$. Inspired by NeRF [35], which models the continuous radiance field over the 5D space $(x, y, z, \theta, \phi)$, we resort to a probabilistic field to model the continuous mapping from the action space $A$ to the probabilistic distribution $\{p(a) \mid a \in A\}$. We encode each action (trajectory) into a high-dimensional planning token embedding $E(a)$, use a cascaded Transformer decoder for interaction with environmental information $E_{\text{env}}$, and combine with navigation information $E_{\text{navi}}$ and ego state $E_{\text{state}}$ to output the probability, i.e.,
$p(a) = \text{MLP}(\text{Transformer}(E(a), E_{\text{env}}) + E_{\text{navi}} + E_{\text{state}})$
Here the MLP is the final output layer: it takes the Transformer decoder output together with the navigation and ego-state embeddings and outputs the probability of each action, so that every candidate action can be scored given the environment and the ego state.
[Equation image not recovered: definitions of the planning token embedding $E(a)$ and the encoding function $\Gamma$]
$\Gamma$ is an encoding function that maps each coordinate from $\mathbb{R}$ into a high-dimensional embedding space $\mathbb{R}^{2L}$, and is applied separately to each of the coordinate values of trajectory $a$. $pos$ denotes the position. We use these functions to map continuous input coordinates into a higher-dimensional space to better approximate a higher-frequency field function.
In this context, the encoding function plausibly serves the following purposes:

Expressiveness: mapping the data from the original space into a higher-dimensional one increases the model's capacity to capture fine-grained features and patterns.
Fit: a higher-frequency field function allows a finer fit, which matters when modeling the probability distribution over the action space.
Continuity: for continuous spatiotemporal trajectories, mapping coordinates into a high-dimensional space helps the model handle continuity and dynamic changes in the data.

The exact form of $\Gamma$ depends on the model design. In the driving setting, such a mapping may help the planner distinguish nearby trajectories by capturing more complex patterns in the high-dimensional space.
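A NeRF-style sinusoidal encoding gives one concrete reading of $\Gamma$. The sketch below is an assumption-laden illustration: the source only states that $\Gamma$ maps $\mathbb{R}$ to $\mathbb{R}^{2L}$ and is applied to each coordinate separately, so the frequency schedule `10000 ** (j / L)`, the concatenation into $E(a)$, and the function names are choices made here, not details confirmed by the text.

```python
import math

def gamma_band(pos, j, L):
    """One frequency band: a (cos, sin) pair for a scalar coordinate."""
    freq = 1.0 / (10000.0 ** (j / L))  # assumed frequency schedule
    return [math.cos(pos * freq), math.sin(pos * freq)]

def Gamma(pos, L):
    """Map one coordinate from R to R^{2L} by concatenating L frequency bands."""
    out = []
    for j in range(L):
        out.extend(gamma_band(pos, j, L))
    return out

def encode_action(a, L):
    """Assumed form of E(a): concatenate Gamma of every coordinate in (x1, y1, ..., xT, yT)."""
    emb = []
    for coord in a:
        emb.extend(Gamma(coord, L))
    return emb

a = [0.0, 2.0, 0.1, 4.0, 0.3, 6.0]  # T = 3 waypoints, flattened
E_a = encode_action(a, L=8)
print(len(E_a))  # 2T coordinates, each expanded to 2L values
```

With $T = 3$ and $L = 8$, the six coordinates expand to $6 \times 16 = 96$ values; two trajectories that differ slightly in one coordinate now differ across several frequency bands, which is what lets a downstream network resolve fine differences.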

3.3. Training

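We train VADv2 with three kinds of supervision: distribution loss, conflict loss, and scene token loss.

Distribution loss: makes the predicted probability distribution match the distribution observed in large-scale driving demonstrations. It uses KL divergence to measure the difference between the predicted distribution and the data distribution.
Conflict loss: penalizes planning actions that conflict with environmental constraints, such as collisions and other dangerous behavior. Minimizing it teaches the model to produce actions that are both safe and reasonable.
Scene token loss: supervises the encoding of key scene elements such as the map, agents, and traffic elements, so that sensor data is converted into high-level features that are useful for planning.

Combined, the three losses give VADv2 a learning objective that covers both imitation of driving behavior and compliance with scene constraints.
[Equation image not recovered: overall training loss combining the three terms]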
Distribution Loss. We learn the probabilistic distribution from large-scale driving demonstrations. KL divergence is used to minimize the difference between the predicted distribution and the distribution of the data.
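The distribution loss trains the model to reproduce the action distribution observed in large-scale driving demonstrations. KL divergence is an asymmetric measure of how one probability distribution differs from another: in information-theoretic terms, it is the expected extra information needed to encode samples from one distribution with a code optimized for the other. In VADv2 it serves two purposes:

Minimizing the gap: by minimizing the KL divergence between the predicted and the data distribution, the model is trained to predict more accurately the probability of each action under the given environment.
Guiding optimization: as a loss function, it drives the parameter updates that fit the model to the distribution in the training data.

Formally, if $p_{\text{data}}(a)$ is the distribution of actions in the driving demonstrations and $p_{\text{model}}(a)$ is the predicted distribution, then
$\text{KL}(p_{\text{data}} \,\|\, p_{\text{model}}) = \sum_a p_{\text{data}}(a) \log \frac{p_{\text{data}}(a)}{p_{\text{model}}(a)}$
where the sum runs over all candidate actions $a$. The expression measures the information lost when $p_{\text{model}}$ is used to approximate $p_{\text{data}}$. Training minimizes this loss so that the predictions approach the distribution of actual driving behavior.
[Equation image not recovered: definition of the distribution loss]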
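The KL term can be checked numerically on a small vocabulary. The following is a minimal sketch with made-up logits and a four-entry vocabulary; the `eps` clamp and the helper names are assumptions added for numerical safety and readability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p_data, p_model, eps=1e-12):
    """KL(p_data || p_model) over a discrete vocabulary; terms with p_data = 0 contribute 0."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(p_data, p_model) if p > 0.0)

# Hypothetical data distribution: two maneuvers are equally common, two never occur.
p_data = [0.0, 0.5, 0.5, 0.0]
good = softmax([-4.0, 2.0, 2.0, -4.0])  # mass on the demonstrated actions
bad = softmax([2.0, -4.0, -4.0, 2.0])   # mass on the undemonstrated actions

print(kl_divergence(p_data, good))
print(kl_divergence(p_data, bad))
```

The divergence is close to zero for the prediction that matches the data and large for the one that puts its mass on actions no driver took, which is the gradient signal the distribution loss relies on.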
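During training, the ground truth trajectory is added to the planning vocabulary as the positive sample, and all other trajectories are treated as negative samples. Negative samples receive different loss weights: trajectories close to the ground truth trajectory are penalized less.
The key points are:

Role of the positive sample: the ground truth trajectory is what the model learns the desired planning behavior from.
Differentiated negatives: negative samples that are similar to the ground truth trajectory receive a small loss weight and are therefore penalized lightly, while negatives far from it receive a large weight so that the model learns to avoid them.
Loss weight assignment: graded weights let the model learn which behaviors are acceptable in a given environment and which are not, at a finer granularity than a binary positive/negative split.
Measuring trajectory similarity: closeness to the ground truth trajectory may be measured with a distance metric (such as Euclidean distance), a path similarity, or another related measure.

In this way, VADv2 learns the distribution contained in the demonstrations and also learns to rank trajectories by quality.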
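One way to picture distance-dependent penalties is a per-candidate cross-entropy in which each negative is scaled by a weight that grows with its distance from the ground truth. The text does not give the weighting function, so `negative_weight`, its `scale` parameter, and the use of independent per-candidate probabilities are all hypothetical choices for illustration.

```python
import math

def negative_weight(dist, scale=2.0):
    """Hypothetical weighting: rises from 0 toward 1 as a candidate moves away from the ground truth."""
    return 1.0 - math.exp(-dist / scale)

def weighted_distribution_loss(probs, dists, positive_index):
    """Reward the positive sample; penalize each negative in proportion to its weight.

    probs are treated as independent per-candidate probabilities (an assumption).
    """
    loss = -math.log(probs[positive_index])
    for i, (p, d) in enumerate(zip(probs, dists)):
        if i != positive_index:
            loss += -negative_weight(d) * math.log(1.0 - p)
    return loss

# Candidate 0 is the ground truth; candidate 1 is 0.3 m away, candidate 2 is 6 m away.
dists = [0.0, 0.3, 6.0]
near_miss = weighted_distribution_loss([0.7, 0.2, 0.1], dists, positive_index=0)
far_miss = weighted_distribution_loss([0.7, 0.1, 0.2], dists, positive_index=0)
print(near_miss, far_miss)
```

Both predictions give the ground truth the same probability, but the one that leaks probability to the distant trajectory incurs the larger loss.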
Conflict Loss. We use the driving scene constraints to help the model learn important prior knowledge about driving and further regularize the predicted distribution. Specifically, if one action in the planning vocabulary conflicts with other agents’ future motion or road boundary, the action is regarded as a negative sample, and we impose a significant loss weight to reduce the probability of this action.
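The conflict loss is the component that injects safety awareness into VADv2. It is built on driving scene constraints, helps the model learn important prior knowledge about driving, and further regularizes the predicted distribution. It works as follows:

Using scene constraints: when scoring actions, the model takes into account constraints in the scene, such as the future motion of other traffic participants and the road boundary.
Identifying conflicting actions: an action in the planning vocabulary that conflicts with these constraints, for example by colliding with another participant or crossing the road boundary, is regarded as a negative sample.
Applying loss weights: these negative samples receive a large loss weight, which strongly reduces their share of the probability distribution and hence the chance that they are sampled.
Lowering probabilities: with a higher loss on conflicting actions, the model learns to assign them low probability and so avoids behavior that could lead to an accident.
Regularizing the distribution: the predicted action distribution is pushed toward the rules and common sense of safe driving.
Generalization: beyond imitating human driving, the model learns to make safe decisions in complex traffic without explicit instructions.

The conflict loss thus keeps the flexibility of probabilistic planning while steering the planner toward safe and reasonable actions.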
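The conflict check can be sketched as a per-timestep distance test between each vocabulary trajectory and the predicted futures of other agents. This is a simplified illustration: the `safe_radius` threshold, the point-agent geometry, the loss weight of 10, and the per-candidate log penalty are assumptions, and the road-boundary check is omitted.

```python
import math

def conflicts_with_agents(trajectory, agent_futures, safe_radius=2.0):
    """True if any waypoint comes within safe_radius of an agent at the same timestep."""
    for t, (x, y) in enumerate(trajectory):
        for future in agent_futures:
            ax, ay = future[t]
            if math.hypot(x - ax, y - ay) < safe_radius:
                return True
    return False

def conflict_loss(probs, conflict_mask, weight=10.0):
    """Penalize probability assigned to conflicting actions with a large (assumed) weight."""
    return sum(-weight * math.log(1.0 - p) for p, bad in zip(probs, conflict_mask) if bad)

# Ego frame assumed: y forward, x to the right, metres.
straight = [(0.0, 5.0), (0.0, 10.0), (0.0, 15.0)]
left = [(-3.5, 5.0), (-3.5, 10.0), (-3.5, 15.0)]
oncoming = [(0.0, 20.0), (0.0, 16.0), (0.0, 14.0)]  # predicted future of another agent

vocabulary = [straight, left]
mask = [conflicts_with_agents(traj, [oncoming]) for traj in vocabulary]
print(mask)
print(conflict_loss([0.6, 0.4], mask), conflict_loss([0.1, 0.9], mask))
```

Going straight meets the oncoming agent at the third timestep and is flagged, so a prediction that favors it is penalized far more than one that shifts probability to the lane change.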
Scene Token Loss. Map tokens, agent tokens, and traffic element tokens are supervised with corresponding supervision signals to make sure they explicitly encode corresponding high-level information.
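The scene token loss ensures that the model encodes the key elements of the scene accurately. It covers the following:

Map tokens: represent map features of the driving environment, such as lane centerlines, lane dividers, road boundaries, and pedestrian crossings. The loss ensures that they encode the vectorized map accurately.
Agent tokens: represent other traffic participants, including their location, orientation, size, and speed. The loss supervises these tokens so that they correctly encode the motion of surrounding vehicles and pedestrians.
Traffic element tokens: relate to traffic control elements such as traffic lights and stop signs. The loss ensures that these tokens predict the state of the traffic elements, which planning depends on.
Supervision signals: each token type is trained against the corresponding ground truth for the scene, which tells the model what to extract from raw sensor data.
Explicit encoding: the aim is that tokens are not merely low-level representations of the data but explicitly carry high-level information useful for planning.
Effect of the loss: minimizing it encourages the model to distill the key information in the scene and feed it into planning.

Through this supervision the model learns to interpret its surroundings, which supports safe and effective driving decisions.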
The loss of map tokens is the same as in MapTRv2 [30]. $\ell_1$ loss is adopted to calculate the regression loss between the predicted map points and the ground truth map points. Focal loss is used as the map classification loss.

L1 loss: a regression loss on the absolute error between predicted and ground truth map points:
$\text{L1 Loss} = \sum_{i} |p_i - g_i|$
where $p_i$ is a predicted map point and $g_i$ is the corresponding ground truth map point.
Focal loss: a classification loss designed for class imbalance, in particular when negatives greatly outnumber positives. It down-weights easy examples so that training focuses on hard ones:
$\text{Focal Loss} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$
where $p_t$ is the predicted probability of the true class, $\alpha_t$ is a weight that balances positive and negative samples, and $\gamma$ is the focusing parameter that controls how strongly easy examples are down-weighted.
Focal loss helps the model learn map elements that are hard to classify.

Combining the two losses lets VADv2 predict both the position and the class of map elements, so that the map tokens carry the high-level map information that planning needs.
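Both formulas above are short enough to evaluate by hand. The sketch below implements them for a single prediction; `alpha=0.25` and `gamma=2.0` are the values commonly used with focal loss and are assumptions here, since the text does not state which values VADv2 or MapTRv2 use.

```python
import math

def l1_loss(pred_points, gt_points):
    """Sum of absolute coordinate errors between predicted and ground truth points."""
    return sum(abs(px - gx) + abs(py - gy)
               for (px, py), (gx, gy) in zip(pred_points, gt_points))

def focal_loss(p, is_positive, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction; p is the predicted probability of the positive class."""
    p_t = p if is_positive else 1.0 - p
    alpha_t = alpha if is_positive else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

regression = l1_loss([(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 3.0)])
easy = focal_loss(0.9, is_positive=True)   # confident and correct
hard = focal_loss(0.1, is_positive=True)   # confident and wrong
ce_easy = -math.log(0.9)                   # plain cross-entropy for comparison

print(regression, easy, hard, ce_easy)
```

The easy example contributes far less than it would under plain cross-entropy, while the hard example keeps most of its weight, which is the down-weighting effect described above.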
The loss of agent tokens is composed of the detection loss and the motion prediction loss, which is the same as in VAD [23]. $\ell_1$ loss is used as the regression loss to predict agent attributes (location, orientation, size, etc.), and focal loss to predict agent classes. For each agent that has been matched with a ground truth agent, we predict $K$ future trajectories and use the trajectory that has the minimum final displacement error (minFDE) as a representative prediction. Then we calculate the $\ell_1$ loss between this representative trajectory and the ground truth trajectory as the motion regression loss. Besides, focal loss is adopted as the multi-modal motion classification loss.

L1 detection loss: the absolute error between predicted and ground truth agent attributes:
$\text{L1 Detection Loss} = \sum_{i} |p_i - g_i|$
where $p_i$ is a predicted agent attribute and $g_i$ is the corresponding ground truth attribute.
Focal loss for classes: handles class imbalance and improves recognition of rare agent classes:
$\text{Focal Loss} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$
where $p_t$ is the predicted probability of the agent's true class, $\alpha_t$ balances the classes, and $\gamma$ down-weights easy examples.
Future trajectory prediction: for every matched agent, the model predicts $K$ future trajectories and selects the one with the smallest final displacement error (FDE) as the representative prediction.
minFDE: the representative trajectory is the one whose final position is closest to that of the ground truth trajectory.
Motion regression loss: the L1 loss between the representative trajectory and the ground truth trajectory:
$\text{Motion Regression Loss} = | \text{Predicted Trajectory} - \text{Ground Truth Trajectory} |$
Multi-modal motion classification loss: focal loss is used to help the model distinguish between motion modes.

Together these losses train VADv2 to predict the state and future trajectories of other traffic participants, which planning in dense traffic depends on.
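The minFDE selection step can be written in a few lines. This sketch uses made-up trajectories with $K = 2$ modes; the function names are hypothetical, and the multi-modal classification term is left out.

```python
import math

def final_displacement_error(pred, gt):
    """Distance between the last predicted waypoint and the last ground truth waypoint."""
    (px, py), (gx, gy) = pred[-1], gt[-1]
    return math.hypot(px - gx, py - gy)

def motion_regression_loss(modes, gt):
    """Pick the mode with the minimum FDE, then take its L1 loss against the ground truth."""
    best = min(modes, key=lambda m: final_displacement_error(m, gt))
    loss = sum(abs(px - gx) + abs(py - gy) for (px, py), (gx, gy) in zip(best, gt))
    return best, loss

gt = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
modes = [
    [(0.0, 1.0), (0.0, 2.0), (0.0, 3.5)],     # goes straight, slightly too far
    [(-1.0, 1.0), (-2.0, 2.0), (-3.0, 3.0)],  # turns left
]
best, loss = motion_regression_loss(modes, gt)
print(best, loss)
```

Only the selected mode receives regression gradient, so the other modes remain free to cover alternative futures.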
Traffic element tokens consist of two parts: the traffic light token and the stop sign token. On one hand, we send the traffic light token to an MLP to predict the state of the traffic light (yellow, red, and green) and whether the traffic light affects the ego vehicle. On the other hand, the stop sign token is also sent to an MLP to predict the overlap between the stop sign area and the ego vehicle. Focal loss is used to supervise these predictions.
The two parts are handled as follows:

Traffic light token:
- The token is passed to an MLP that predicts the state of the traffic light (yellow, red, or green).
- The MLP also predicts whether the traffic light affects the ego vehicle, i.e. whether the signal is relevant to the ego vehicle's driving decision.
Stop sign token:
- The token is passed to an MLP that predicts the overlap between the stop sign area and the ego vehicle.
- This prediction helps the system decide whether it must stop at the sign.
Focal loss:
- These predictions are supervised with focal loss, which addresses class imbalance by giving hard examples relatively more weight:
  $\text{Focal Loss} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$

In this way VADv2 learns to read traffic lights and stop signs, which is necessary for obeying traffic rules and driving safely.

3.4. Inference

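In closed-loop inference, it’s flexible to get the driving policy $\pi_{\text{model}}$ from the distribution. Intuitively, we sample the action with the highest probability at each time step, and use the PID controller to convert the selected trajectory to control signals (steer, throttle, and brake).
In closed-loop inference, the driving policy $\pi_{\text{model}}$ can be obtained from the probability distribution in several ways. The basic procedure is as follows:

Picking the most probable action: at each time step the model takes the action with the highest probability under the predicted action distribution. In effect this is an argmax over the distribution rather than a random draw.
Using a PID controller: the selected trajectory is passed to a PID (proportional-integral-derivative) controller, a common feedback controller that computes control signals so that the vehicle tracks the desired trajectory.
Converting to control signals: the PID controller turns the selected trajectory into concrete control signals:
- Steer: the signal that controls the steering of the vehicle.
- Throttle: the signal that controls acceleration.
- Brake: the signal that controls deceleration or stopping.
Real-time adjustment: the PID controller continuously adjusts these signals so that the vehicle follows the selected trajectory smoothly and accurately. Its parameters (proportional, integral, and derivative gains) can be tuned to the vehicle dynamics and the driving environment.
Flexibility and robustness: the approach is flexible because the model does not follow one fixed trajectory but decides anew from the current distribution at every step, and the feedback loop of the PID controller adds robustness to tracking errors.

This closed-loop scheme keeps decisions flexible while keeping vehicle control stable.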
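The inference loop described above (take the most probable trajectory, then track it with PID controllers) can be sketched as follows. Everything here is an assumption made for illustration: the `PID` class, the gains, the 0.5 s waypoint spacing, the ego frame (x forward, y to the right), and the way steering and speed errors are derived from the first waypoints. The paper does not describe its controller at this level of detail.

```python
import numpy as np

class PID:
    """Textbook PID controller; the gains used below are made-up values."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def select_trajectory(probs, vocabulary):
    # Take the action (trajectory) with the highest predicted probability.
    return vocabulary[int(np.argmax(probs))]

def trajectory_to_control(traj, speed, steer_pid, speed_pid, dt=0.5):
    """traj: (T, 2) waypoints in the ego frame, spaced dt seconds apart."""
    heading_error = np.arctan2(traj[0, 1], traj[0, 0])      # angle to first waypoint
    target_speed = np.linalg.norm(traj[1] - traj[0]) / dt   # speed implied by spacing
    steer = float(np.clip(steer_pid.step(heading_error, dt), -1.0, 1.0))
    accel = speed_pid.step(target_speed - speed, dt)
    throttle = float(np.clip(accel, 0.0, 1.0))   # positive demand -> throttle
    brake = float(np.clip(-accel, 0.0, 1.0))     # negative demand -> brake
    return steer, throttle, brake

vocabulary = np.array([
    [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],   # slow, straight
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],   # faster, straight
    [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]],   # drifting to the right
])
probs = np.array([0.1, 0.7, 0.2])
traj = select_trajectory(probs, vocabulary)
steer, throttle, brake = trajectory_to_control(
    traj, speed=0.0, steer_pid=PID(1.0, 0.0, 0.0), speed_pid=PID(0.5, 0.0, 0.0))
print(steer, throttle, brake)
```

Splitting the longitudinal output into non-negative throttle and brake values mirrors the three control signals named in the text.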
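In real-world applications, there are more robust strategies to make full use of the probabilistic distribution. A good practice is, sampling top-K actions as proposals, and adopting a rule-based wrapper for filtering proposals and an optimization-based post-solver for refinement. Besides, the probability of the action reflects how confident the end-to-end model is, and can be regarded as the judgment condition to switch between conventional PnC and learning-based PnC.
In real-world applications, more robust strategies can make fuller use of the probability distribution:

Sampling top-K actions:
- The K most probable actions are taken from the distribution as proposals. This considers other high-probability actions in addition to the single most likely one, which gives later stages alternatives to choose from.
Rule-based wrapper:
- A rule-based system filters the top-K proposals so that the chosen action complies with traffic rules and safety constraints. The wrapper can encode expert knowledge or pre-defined rules that reject unsafe or illegal actions.
Optimization-based post-solver:
- After rule-based filtering, an optimization-based solver refines the remaining proposals. The paper does not name a specific solver; the objective could, for example, trade off risk, comfort, and efficiency.
Interpreting the action probability:
- The probability of an action reflects how confident the end-to-end model is. It can serve as the condition for switching between conventional planning and control (PnC) and learning-based PnC: when the model is confident, the learning-based output is used, and when uncertainty is high, the system falls back on the conventional stack.
Balancing flexibility and robustness:
- Combining probabilistic proposals, rule-based filtering, and optimization-based refinement lets the system balance flexibility against robustness in complex, dynamic driving environments.
Real-time adaptation:
- The system can adjust its strategy on the fly according to changes in the environment and the confidence of the model.

With these strategies the learned policy can be deployed with additional safeguards for safety and rule compliance, which matters for real-world use.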
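A minimal sketch of the top-K strategy with a rule-based filter and a confidence-based fallback is shown below. The function name, the `is_allowed` rule, and the `min_conf` threshold are hypothetical, and the optimization-based post-solver is left out entirely.

```python
import numpy as np

def plan_with_fallback(probs, vocabulary, is_allowed, k=3, min_conf=0.2):
    """Return the index of the chosen action, or None to defer to conventional PnC."""
    # Top-K proposals, most probable first.
    top_k = np.argsort(probs)[::-1][:k]
    # Rule-based wrapper: drop proposals that violate the (hypothetical) rule.
    proposals = [int(i) for i in top_k if is_allowed(vocabulary[i])]
    # Low confidence, or nothing left after filtering: hand over to conventional PnC.
    if not proposals or probs[proposals[0]] < min_conf:
        return None
    return proposals[0]

# Four candidate trajectories that differ only in their final lateral offset (metres).
vocabulary = np.array([[[5.0, y]] for y in (3.0, 0.0, 1.0, 0.5)])
probs = np.array([0.5, 0.3, 0.15, 0.05])
stay_in_lane = lambda traj: abs(traj[-1, 1]) < 2.0   # made-up safety rule

choice = plan_with_fallback(probs, vocabulary, stay_in_lane)
print(choice)
```

Here the most probable action is rejected by the rule, so the second proposal is used. Raising `min_conf` above its probability would hand control to the conventional planner instead.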

4. Experiments

4.1. Experimental Settings

The widely used CARLA [11] simulator is adopted to evaluate the performance of VADv2. Following common practice, we use Town05 Long and Town05 Short benchmarks for closed-loop evaluation. Specifically, each benchmark contains several pre-defined driving routes. Town05 Long consists of 10 routes, each route is about 1km in length. Town05 Short consists of 32 routes, each route is 70m in length. Town05 Long validates the comprehensive capabilities of the model, while Town05 Short focuses on evaluating the model’s performance in specific scenarios, such as lane changing before intersections.
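CARLA [11] is a widely used autonomous driving simulator that provides a virtual environment for testing and validating driving algorithms, and VADv2 is evaluated in it. The setup is as follows:

Town05 benchmarks: the Town05 map of the CARLA simulator serves as the evaluation benchmark.
Town05 Long: 10 pre-defined driving routes, each about 1 km long. It validates the comprehensive capabilities of the model, including stability and decision making over longer drives.
Town05 Short: 32 pre-defined driving routes, each 70 m long. It focuses on performance in specific scenarios, such as lane changing before intersections.
Closed-loop evaluation: in both benchmarks VADv2 is evaluated in closed loop, meaning that the model output (control signals) directly drives the simulated vehicle, and the resulting vehicle state and sensor data are fed back to the model.
Comprehensive validation: the long routes test overall performance across varied traffic situations and longer time spans, including navigation, obstacle avoidance, and rule compliance.
Scenario-specific evaluation: the short routes concentrate on the reaction and decisions of the model in particular driving scenarios.
Route and scenario design: all routes and scenarios are pre-defined.

These evaluations show how VADv2 handles traffic signals, interacts with other vehicles, and navigates an urban environment in simulation.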
We use the official autonomous agent of CARLA to collect training data by randomly generating driving routes in Town03, Town04, Town06, Town07, and Town10. The data is sampled at a frequency of 2Hz, and we collect approximately 3 million frames for training. For each frame, we save 6-camera surround-view images, traffic signals, information about other traffic participants, and the state information of the ego vehicle. Additionally, we obtain the vectorized maps for training the online mapping module by preprocessing the OpenStreetMap [15] format maps provided by CARLA. It is important to note that the map information was only provided as ground truth during training, and VADv2 does not utilize any high-definition map in closed-loop evaluation.
Training data for VADv2 is collected with the official autonomous agent of CARLA. The main points are:

Randomly generated routes: driving routes are generated at random in the Town03, Town04, Town06, Town07, and Town10 maps.
Sampling frequency: data is sampled at 2 Hz, that is, two frames per second.
Data volume: approximately 3 million frames are collected for training.
Data content:
- 6-camera surround-view images: images from six cameras around the ego vehicle, giving the model a full surround view.
- Traffic signals: the traffic signal information saved with each frame.
- Other traffic participants: information about surrounding vehicles, pedestrians, and other participants.
- Ego state: the state information of the ego vehicle.
Vectorized maps: the OpenStreetMap-format maps provided by CARLA are preprocessed into vectorized maps for training the online mapping module.
Use of map information: map information is provided only as ground truth during training. In closed-loop evaluation VADv2 uses no high-definition map and relies on its own perception and prediction modules to understand the environment.

Because Town05 is not among the training towns, the closed-loop benchmarks also test how well the model generalizes to an unseen map without HD-map support.
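As a back-of-the-envelope check on the scale of the dataset, the stated numbers (about 3 million frames, 2 Hz sampling, 6 cameras per frame) translate into driving time and image count as follows. The frame count is approximate in the source, so the results are rough estimates.

```python
frames = 3_000_000   # approximate number of training frames
sample_hz = 2        # frames per second of driving
cameras = 6          # surround-view cameras per frame

seconds = frames / sample_hz
hours = seconds / 3600
images = frames * cameras

print(f"{hours:.1f} hours of driving, {images:,} camera images")
```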

4.2. Metrics

For closed-loop evaluation, we use the official metrics of CARLA. Route Completion indicates the percentage of the route distance completed by an agent. Infraction Score indicates the degree of infractions happening along the route. Typical infractions include running red lights, collisions with pedestrians, etc. Each type of infraction has a corresponding penalty coefficient; with more infractions happening, Infraction Score becomes lower. Driving Score serves as the product between the Route Completion and the Infraction Score, which is the main metric for evaluation. In benchmark evaluation, most works adopt a rule-based wrapper to reduce the infraction. For fair comparisons with other methods, we follow the common practice of adopting a rule-based wrapper over the learning-based policy.
For closed-loop evaluation, VADv2 uses the official CARLA metrics:

Route Completion: the percentage of the route distance that the agent completes. It reflects the ability of the agent to follow the prescribed route.
Infraction Score: the degree of infractions along the route. Typical infractions include running red lights and colliding with pedestrians. Each infraction type has its own penalty coefficient, and the score drops as infractions accumulate.
Driving Score: the product of Route Completion and Infraction Score, and the main evaluation metric. It combines how much of the route is completed with how well traffic rules are obeyed along the way.
Benchmark practice: most works add a rule-based wrapper to reduce infractions in benchmark evaluation. Such a wrapper typically consists of hard-coded rules that override unsafe or illegal behavior.
Fair comparison: to compare fairly with other methods, VADv2 also places a rule-based wrapper on top of the learning-based policy in the benchmark comparisons, so that the reported numbers are obtained under the same conditions as those of prior work.

Together these metrics measure both route-following ability and rule compliance in closed loop.
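The relationship between the three closed-loop metrics can be sketched as below. The penalty coefficients and route data are illustrative assumptions, not values taken from the paper, and the per-route averaging follows the CARLA leaderboard convention as far as I know.

```python
def infraction_score(infractions, penalties):
    """Multiply the penalty coefficient of every infraction that occurred."""
    score = 1.0
    for kind, count in infractions.items():
        score *= penalties[kind] ** count
    return score

def driving_score(routes, penalties):
    """Average over routes of (route completion x infraction score)."""
    per_route = [r["completion"] * infraction_score(r["infractions"], penalties)
                 for r in routes]
    return sum(per_route) / len(per_route)

# Illustrative penalty coefficients (assumed values).
penalties = {"red_light": 0.7, "pedestrian_collision": 0.5}
routes = [
    {"completion": 100.0, "infractions": {}},
    {"completion": 80.0, "infractions": {"red_light": 1}},
]
print(driving_score(routes, penalties))
```

Because the product is taken per route before averaging, the averaged Driving Score generally differs from the product of the averaged Route Completion and Infraction Score (here 90 × 0.85 = 76.5 versus about 78). This is one plausible reason why a reported Driving Score need not equal the product of the two reported averages exactly.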
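For open-loop evaluation, L2 distance and collision rate are adopted to show to what degree the learned policy drives similarly to the expert demonstrations. In ablation experiments, we adopt open-loop metrics for evaluation, considering open-loop metrics are fast to calculate and more stable. We use the official autonomous agent of CARLA to generate the validation set on the Town05 Long benchmark for open-loop evaluation, and the results are averaged over all validation samples.
In open-loop evaluation, the following metrics measure how closely the learned policy matches the expert demonstrations:

L2 distance: the difference between the planned trajectory and the expert trajectory. The L2 (Euclidean) distance is the straight-line distance between two points; here it measures the error between predicted and expert waypoints.
Collision rate: how often the planned trajectory collides with other traffic participants. It reflects how well the plan avoids collisions and is an important safety indicator.
Open-loop metrics are chosen for the ablation experiments for two reasons:

Fast to compute: open-loop metrics are cheaper than closed-loop ones because they need no simulator rollout with feedback.
Stable: they are computed on a fixed validation set, so they do not fluctuate with how the decisions of the model change the course of a rollout.

The validation set for open-loop evaluation is generated with the official autonomous agent of CARLA on the Town05 Long benchmark, and results are averaged over all validation samples, which keeps the evaluation consistent and repeatable. Open-loop results indicate how closely the policy agrees with expert driving when its actions do not feed back into the environment, which is useful for analyzing individual design choices.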
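The two open-loop metrics can be sketched with NumPy as follows. The array layout and the point-distance collision test with a fixed `radius` are simplifying assumptions; an actual evaluation would typically check overlap between vehicle bounding boxes.

```python
import numpy as np

def open_loop_metrics(pred, expert, obstacles, radius=1.0):
    """pred, expert: (N, T, 2) waypoints; obstacles: (N, T, M, 2) positions."""
    # Mean Euclidean distance between planned and expert waypoints.
    l2 = np.linalg.norm(pred - expert, axis=-1).mean()
    # Distance from each planned waypoint to each obstacle at the same time step.
    dists = np.linalg.norm(pred[:, :, None, :] - obstacles, axis=-1)
    # A sample counts as a collision if any waypoint comes within `radius`.
    collided = (dists < radius).any(axis=(1, 2))
    return float(l2), float(collided.mean())

pred = np.zeros((2, 2, 2))
expert = pred.copy()
expert[1] += np.array([3.0, 4.0])          # second sample is 5 m off everywhere
obstacles = np.empty((2, 2, 1, 2))
obstacles[0] = [0.5, 0.0]                  # within 1 m of the first plan
obstacles[1] = [10.0, 10.0]                # far from the second plan
l2, collision_rate = open_loop_metrics(pred, expert, obstacles)
print(l2, collision_rate)
```

Both values are averaged over all validation samples, matching the averaging described in the text.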

4.3. Comparisons with State-of-the-Art Methods

On the Town05 Long benchmark, VADv2 achieved a Drive Score of 85.1, a Route Completion of 98.4, and an Infraction Score of 0.87, as shown in Tab. 1. Compared to the previous state-of-the-art method [49], VADv2 achieves a higher Route Completion while significantly improving Drive Score by 9.0. It is worth noting that VADv2 only utilizes cameras as perception input, whereas [49] utilizes both cameras and LiDAR. Furthermore, compared to the previous best method [45] which only relies on cameras, VADv2 demonstrates even greater advantages, with a remarkable increase in Drive Score of up to 16.8.
On the Town05 Long benchmark, VADv2 achieves the following results, as shown in Tab. 1:

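Drive Score: 85.1
Route Completion: 98.4%
Infraction Score: 0.87

Compared with the previous state-of-the-art method [49], VADv2 does better on:

Route Completion: VADv2 reaches a higher Route Completion.
Drive Score: VADv2 improves the Drive Score by 9.0, a significant gain.

Notably, VADv2 uses only cameras as perception input, whereas [49] uses both cameras and LiDAR. Compared with the previous best camera-only method [45], the advantage is even larger, with a Drive Score gain of 16.8.
These results point to the following strengths:

Simpler perception input: with camera input alone, VADv2 matches or exceeds methods that use a richer sensor suite.
Performance gain: the higher Drive Score indicates that its decisions and plans are effective both at completing routes and at obeying traffic rules.
Comparison with prior work: VADv2 leads both the camera-only methods and the methods that fuse multiple sensors.
Effectiveness of probabilistic planning: the results support probabilistic planning as a way to handle the uncertainty of driving.

The results suggest that strong closed-loop driving performance is achievable in simulation without expensive or complex sensor configurations.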
We present the results for all publicly available works on the Town05 Short benchmark in Tab. 2. Compared to the Town05 Long benchmark, the Town05 Short benchmark focuses more on evaluating the ability of models to perform specific driving behaviors, such as lane changing in congested traffic flow and lane changing before intersections. In comparison to the previous result [23], VADv2 significantly improves Drive Score and Route Completion by 25.3 and 5.7 respectively, demonstrating the comprehensive driving ability of VADv2 in complex driving scenarios.
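In short, Tab. 2 compares VADv2 with all publicly available results on Town05 Short, a benchmark that stresses specific maneuvers such as lane changing in congested traffic and before intersections. Relative to the previous result [23], VADv2 improves Drive Score by 25.3 and Route Completion by 5.7.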

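These results highlight several strengths of VADv2:

Executing specific maneuvers: VADv2 performs well on targeted behaviors such as lane changing in congested traffic and before intersections.
Gains across metrics: the improvement is not limited to a single metric; both Drive Score and Route Completion increase.
Adaptation to complex scenes: the results validate its driving ability in complex traffic, particularly in scenarios that demand precise maneuvers and quick decisions.
Comparison with prior work: the margin over the previous method [23] is large.
Effectiveness of probabilistic planning: the probabilistic planning approach holds up in closed-loop simulated driving, especially where uncertainty must be handled and several decisions are plausible.
Contribution to research: the results offer a reference point for improving driving performance in complex urban environments.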

4.4. Ablation Study

Tab. 3 shows the ablation experiments of the key modules in VADv2. The model performs poorly in terms of planning accuracy without the supervision of expert driving behavior provided by the Distribution Loss (ID 1). The Conflict Loss provides critical prior information about driving, so without the Conflict Loss (ID 2), the model’s planning accuracy is also affected. Scene tokens encode important scene elements into high-dimensional features, and the planning tokens interact with the scene tokens to learn both dynamic and static information about the driving scene. When any type of scene token is missing, the model’s planning performance will be affected (ID 3-ID 6). The best planning performance is achieved when the model incorporates all of the aforementioned designs (ID 7).
Tab. 3 shows the ablation results for the key modules of VADv2, which can be read as follows:

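Importance of the Distribution Loss (ID 1):
- Without the supervision of expert driving behavior that the Distribution Loss provides, planning accuracy is poor. The Distribution Loss makes the model learn the action distribution from large-scale driving demonstrations, which improves planning accuracy.
Role of the Conflict Loss (ID 2):
- The Conflict Loss supplies critical prior knowledge about driving, so planning accuracy also suffers without it. It helps the model recognize potential conflicts in the driving scene and learn to avoid them.
Effect of scene tokens (ID 3-ID 6):
- Scene tokens encode the important elements of the driving scene as high-dimensional features, and the planning tokens interact with them to learn both dynamic and static information about the scene. Removing any one type of scene token (map tokens, agent tokens, traffic element tokens, or image tokens) degrades planning performance.
Best planning performance (ID 7):
- The best planning performance is achieved when the model includes all of the above designs, which indicates that the components work together.

The ablation results show what each loss and each token type contributes to the final performance, and which components the model cannot do without.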

4.5. Visualization

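Fig. 3 presents some qualitative results of VADv2. The first image showcases multi-modal planning trajectories predicted by VADv2 at different driving speeds. The second image showcases VADv2’s predictions of both forward creeping and multi-modal left-turn trajectories in a lane-changing scenario. The third image depicts a right lane-changing scenario at an intersection, where VADv2 predicts multiple trajectories for both going straight and changing lanes to the right. The final image demonstrates a lane-changing scenario where there is a vehicle in the target lane, and VADv2 predicts multiple reasonable lane-changing trajectories.
Fig. 3 shows qualitative results of VADv2, described below:

Multi-modal planning trajectories (first image):
- VADv2 predicts multi-modal planning trajectories at different driving speeds, showing that it produces several candidate trajectories for the same scene.
Predictions in a lane-changing scenario (second image):
- In a lane-changing scenario, VADv2 predicts both forward creeping and multi-modal left-turn trajectories, giving the vehicle several options in a complex situation.
Lane change at an intersection (third image):
- In a right lane-changing scenario at an intersection, VADv2 predicts multiple trajectories both for going straight and for changing lanes to the right, so different courses of action are represented.
Lane change with a vehicle in the target lane (last image):
- When another vehicle occupies the target lane, VADv2 predicts multiple reasonable lane-changing trajectories, which suggests that it takes the state of surrounding traffic participants into account.

These qualitative results illustrate the capabilities of VADv2 across traffic scenarios:

Adaptability: VADv2 adapts to different driving speeds and traffic conditions.
Flexibility: the model offers several solutions for complex driving situations.
Safety: the predicted trajectories appear to avoid the vehicle in the target lane.
Multi-modal prediction: VADv2 generates multiple plausible trajectories, which widens the set of decisions available.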


5. Conclusion

In this work, we present VADv2, an end-to-end driving model based on probabilistic planning. In the CARLA simulator, VADv2 runs stably and achieves state-of-the-art closed-loop performance. The feasibility of this probabilistic paradigm is primarily validated. However, its effectiveness in more complicated real-world scenarios remains unexplored, which is the future work.
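The work and its future directions can be summarized as follows:

End-to-end model: VADv2 maps sensor input directly to driving decisions.
Probabilistic planning: by modeling a distribution over actions, VADv2 handles the uncertainty in driving decisions and yields a more flexible policy.
Validation in CARLA: VADv2 runs stably and performs strongly in the CARLA simulator, which shows its effectiveness in a virtual environment.
Closed-loop performance: the strong closed-loop results show that the model responds to changes in the environment as they happen.
Future work (the paper itself names only real-world validation; the other items are possible directions):
- Real-world testing: deploy VADv2 on real vehicles and test it in complex real-world traffic to validate its effectiveness in practice.
- Sensor fusion: explore fusing data from other sensors (such as LiDAR and radar) to improve robustness and accuracy.
- Algorithm optimization: further optimize the probabilistic planning algorithm to improve decision quality and response speed in complex scenes.
- Safety evaluation: evaluate VADv2 in depth for safety across a wide range of driving situations.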