PlanT: Explainable Planning Transformers via Object-Level Representations

paper
code


Abstract

Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3× faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.
Keywords: Autonomous Driving, Transformers, Explainability

1 Introduction

The ability to plan is an important aspect of human intelligence, allowing us to solve complex navigation tasks. For example, to change lanes on a busy highway, a driver must wait for sufficient space in the new lane and adjust the speed based on the expected behavior of the other vehicles. Humans quickly learn this and can generalize to new scenarios, a trait we would also like autonomous agents to have. Due to the difficulty of the planning task, the field of autonomous driving is shifting away from traditional rule-based algorithms [1, 2, 3, 4, 5, 6, 7, 8] towards learning-based solutions [9, 10, 11, 12, 13, 14]. Learning-based planners directly map the environmental state representation (e.g., HD maps and object bounding boxes) to waypoints or vehicle controls. They emerged as a scalable alternative to rule-based planners which require significant manual effort to design.
Note: learning-based planners typically rely on deep learning or reinforcement learning techniques that learn decision-making from data. By analyzing large numbers of driving scenarios together with the corresponding successful or failed decisions, such systems learn to predict the best course of action in a given situation.
Interestingly, while humans reason about the world in terms of objects [15, 16, 17], most existing learned planners [9, 12, 18] choose a high-dimensional pixel-level input representation by rendering bird’s eye view (BEV) images of detailed HD maps (Fig. 1 left). It is widely believed that this kind of accurate scene understanding is key for robust self-driving vehicles, leading to significant interest in recovering pixel-level BEV information from sensor inputs [19, 20, 21, 22, 23, 24]. In this paper, we investigate whether such detailed representations are actually necessary to achieve convincing planning performance. We propose PlanT, a learning-based planner that leverages an object-level representation (Fig. 1 right) as an input to a transformer encoder [25]. We represent a scene as a set of features corresponding to (1) nearby vehicles and (2) the route the planner must follow. We show that despite the low feature dimensionality, our model achieves state-of-the-art results. We then propose a novel evaluation scheme and metric to analyze explainability which is generally applicable to any learning-based planner. Specifically, we test the ability of a planner to identify the objects that are the most relevant to account for to plan a collision-free route.
Figure 1: Scene representations for planning. As an alternative to the dominant paradigm of pixel-level planners (left), we demonstrate the effectiveness of a compact object-level representation (right).
We perform a detailed empirical analysis of learning-based planning on the Longest6 benchmark [26] of the CARLA simulator [27]. We first identify the key missing elements in the design of existing learned planners such as their incomplete field of view and sub-optimal dataset and model sizes. We then show the advantages of our proposed transformer architecture, including improvements in performance and significantly faster inference times. Finally, we show that the attention weights of the transformer, which are readily accessible, can be used to represent object relevance. Our qualitative and quantitative results on explainability confirm that PlanT attends to the objects that match our intuition for the relevance of objects for safe driving.
Contributions. (1) Using a simple object-level representation, we significantly improve upon the previous state of the art for planning on CARLA via PlanT, our novel transformer-based approach. (2) Through a comprehensive experimental study, we identify that the ego vehicle’s route, a full 360° field of view, and information about vehicle speeds are critical elements of a planner’s input representation. (3) We propose a protocol and metric for evaluating a planner’s prioritization of obstacles in a scene and show that PlanT is more explainable than CNN-based methods, i.e., the attention weights of the transformer identify the most relevant objects more reliably.

2 Related Work

Intermediate Representations for Driving. Early work on decoupling end-to-end driving into two stages predicts a set of low-dimensional affordances from sensor inputs with CNNs which are then input to a rule-based planner [28]. These affordances are scene-descriptive attributes (e.g. emergency brake, red light, center-line distance, angle) that are compact, yet comprehensive enough to enable simple driving tasks, such as urban driving on the initial version of CARLA [27]. Unfortunately, methods based on affordances perform poorly on subsequent benchmarks in CARLA which involve higher task complexity [29]. Most state-of-the-art driving models instead rely heavily on annotated 2D data either as intermediate representations or auxiliary training objectives [26, 30]. Several subsequent studies show that using semantic segmentation as an intermediate representation helps for navigational tasks [31, 32, 33, 34]. More recently, there has been a rapid growth in interest on using BEV semantic segmentation maps as the input representation to planners [9, 12, 30, 18]. To reduce the immense labeling cost of such segmentation methods, Behl et al. [35] propose visual abstractions, which are label-efficient alternatives to dense 2D semantic segmentation maps. They show that reduced class counts and the use of bounding boxes instead of pixel-accurate masks for certain classes is sufficient. Wang et al. [36] explore the use of object-centric representations for planning by explicitly extracting objects and rendering them into a BEV input for a planner. However, so far, the literature lacks a systematic analysis of whether object-centric representations are better or worse than BEV context techniques for planning in dense traffic, which we address in this work. We keep our representation simple and compact by directly considering the set of objects as inputs to our models. In addition to baselines using CNNs to process the object-centric representation, we show that using a transformer leads to improved performance, efficiency, and explainability.
Transformers for Forecasting. Transformers obtain impressive results in several research areas [25, 37, 38, 39], including simple interactive environments such as Atari games [40, 41, 42, 43, 44]. While the end objective differs, one application domain that involves similar challenges to planning is motion forecasting. Most existing motion forecasting methods use a rasterized input in combination with a CNN-based network architecture [45, 46, 47, 48, 49, 50]. Gao et al. [51] show the advantages of object-level representations for motion forecasting via Graph Neural Networks (GNN). Several follow-ups to this work use object-level representations in combination with Transformer-based architectures [52, 53, 54]. Our key distinctions when compared to these methods are the architectural simplicity of PlanT (our use of simple self-attention transformer blocks and the proposed route representation) as well as our closed-loop evaluation protocol (we evaluate the driving performance in simulation and report online driving metrics).
Explainability. Explaining the decisions of neural networks is a rapidly evolving research field [55, 56, 57, 58, 59, 60, 61]. In the context of self-driving cars, existing work uses text [62] or heatmaps [63] to explain decisions. In our work, we can directly obtain post hoc explanations for decisions of our learning-based PlanT architecture by considering its learned attention. While the concurrent work CAPO [64] uses a similar strategy, it only considers pedestrian-ego interactions on an empty route, while we consider the full planning task in an urban environment with dense traffic. Furthermore, we introduce a simple metric to measure the quality of explanations for a planner.

3 Planning Transformers

In this section, we provide details about our task setup, novel scene representation, simple but effective architecture, and training strategy resulting in state-of-the-art performance. A PyTorch-style pseudo-code snippet outlining PlanT and its training is provided in the supplementary material.
Task. We consider the task of point-to-point navigation in an urban setting where the goal is to drive from a start to a goal location while reacting to other dynamic agents and following traffic rules. We use Imitation Learning (IL) to train the driving agent. The goal of IL is to learn a policy π that imitates the behavior of an expert π* (the expert implementation is described in Section 4). In our setup, the policy is a mapping $\pi : X \rightarrow W$ from our novel object-level input representation $X$ to the future trajectory $W$ of an expert driver. For following traffic rules, we assume access to the state of the next traffic light relevant to the ego vehicle, $l \in \{\text{green}, \text{red}\}$.
Tokenization. To encode the task-specific information required from the scene, we represent it using a set of objects, with vehicles and segments of the route each being assigned an oriented bounding box in BEV space (Fig. 1 right). Let $X_t = V_t \cup S_t$, where $V_t \in \mathbb{R}^{V_t \times A}$ and $S_t \in \mathbb{R}^{S_t \times A}$ represent the set of vehicles and the set of route segments at time-step $t$ with $A = 6$ attributes each. Specifically, if $o_{i,t} \in X_t$ represents a particular object, the attributes of $o_{i,t}$ include an object type-specific attribute $z_{i,t}$ (described below), the position of the bounding box $(x_{i,t}, y_{i,t})$ relative to the ego vehicle, the orientation $\varphi_{i,t} \in [0, 2\pi]$, and the extent $(w_{i,t}, h_{i,t})$. Thus, each object $o_{i,t}$ can be described as a vector $o_{i,t} = \{z_{i,t}, x_{i,t}, y_{i,t}, \varphi_{i,t}, w_{i,t}, h_{i,t}\}$, or concisely as $\{o_{i,t,a}\}_{a=1}^{6}$.
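To make the token layout concrete, the following minimal sketch assembles such 6-attribute vectors. The function name and the example values are hypothetical and only illustrate the layout; they are not taken from the official code. The index $a$ in $\{o_{i,t,a}\}_{a=1}^{6}$ simply runs over these six attributes.

```python
import numpy as np

def make_token(z, x, y, phi, w, h):
    """One object token o_{i,t} = (z, x, y, phi, w, h).

    z is the type-specific attribute (speed for vehicles, ordering index for
    route segments), (x, y) the box position relative to the ego vehicle,
    phi the orientation in [0, 2*pi], and (w, h) the box extent.
    """
    return np.array([z, x, y, phi, w, h], dtype=np.float32)

# hypothetical scene: one vehicle driving at 8 m/s, 12.5 m ahead, plus the first route segment
vehicle = make_token(z=8.0, x=12.5, y=-1.8, phi=0.1, w=2.0, h=4.5)
segment = make_token(z=0.0, x=5.0, y=0.0, phi=0.0, w=3.5, h=10.0)

# only vehicles within D_max = 30 m of the ego vehicle are kept
if np.hypot(vehicle[1], vehicle[2]) <= 30.0:
    scene = np.stack([vehicle, segment])   # shape (V_t + S_t, 6)
```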

What is a? It does not appear to be introduced explicitly; from the notation above, a = 1, …, 6 simply indexes the six attributes of each object token.

For the vehicles $V_t$, we extract the attributes directly from the simulator in our main experiments and use an off-the-shelf perception module based on CenterNet [65] (described in the supplementary material) for experiments involving a full driving system. We consider only vehicles up to a distance $D_{\max}$ from the ego vehicle, and use $o_{i,t,1}$ (i.e., $z_{i,t}$) to represent the speed.
To obtain the route segments $S_t$, we first sample a dense set of $N_t$ points $U_t \in \mathbb{R}^{N_t \times 2}$ along the route ahead of the ego vehicle at time-step $t$. We directly use the ground-truth points from CARLA as $U_t$ in our main experiments and predict them with a perception module for the PlanT with perception experiments in Section 4.2. The points are subsampled using the Ramer-Douglas-Peucker algorithm [66, 67] to select a subset $\hat{U}_t$. One segment spans the area between two points subsampled from the route, $u_{i,t}, u_{i+1,t} \in \hat{U}_t$. Specifically, $o_{i,t,1}$ (i.e., $z_{i,t}$) denotes the ordering for the current time-step $t$, starting from 0 for the segment closest to the ego vehicle. We set the segment length $o_{i,t,6} = \lVert u_{i,t} - u_{i+1,t} \rVert_2$, and the width, $o_{i,t,5}$, equal to the lane width. In addition, we clip $o_{i,t,6} \leq L_{\max}$ for all $i, t$, and always input a fixed number of segments $N_s$ to our policy. More details and visualizations of the route representation are provided in the supplementary material.
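The sketch below shows one plausible way to construct the segment tokens: a small Ramer-Douglas-Peucker routine subsamples the dense route points, and each consecutive pair of subsampled points becomes one segment. The ε threshold, the lane width, and the convention of placing each segment at its midpoint are assumptions, not values from the paper.

```python
import numpy as np

def rdp(points, eps):
    """Minimal Ramer-Douglas-Peucker subsampling of a 2D polyline."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    dx, dy = end - start
    rel = points - start
    dist = np.abs(dx * rel[:, 1] - dy * rel[:, 0]) / (np.hypot(dx, dy) + 1e-9)
    k = int(np.argmax(dist))
    if dist[k] > eps:
        left, right = rdp(points[: k + 1], eps), rdp(points[k:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

def route_segment_tokens(route_points, n_segments=2, l_max=10.0, lane_width=3.5, eps=0.5):
    """Turn the dense route points U_t into N_s segment tokens (z, x, y, phi, w, h)."""
    pts = rdp(np.asarray(route_points, dtype=np.float32), eps)
    tokens = []
    for i in range(min(n_segments, len(pts) - 1)):
        a, b = pts[i], pts[i + 1]
        mid, diff = (a + b) / 2.0, b - a
        length = min(float(np.linalg.norm(diff)), l_max)          # clip segment length to L_max
        phi = float(np.arctan2(diff[1], diff[0])) % (2 * np.pi)
        tokens.append([float(i), mid[0], mid[1], phi, lane_width, length])
    return np.array(tokens, dtype=np.float32)
```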
Token Embeddings. Our model is illustrated in Fig. 2. As a first step, applying a transformer backbone requires the generation of embeddings for each input token, for which we define a linear projection $\rho : \mathbb{R}^6 \rightarrow \mathbb{R}^H$ (where $H$ is the desired hidden dimensionality). To obtain token embeddings $e_{i,t}$, we add the projected input tokens $o_{i,t}$ to a learnable object type embedding vector $e_v \in \mathbb{R}^H$ or $e_s \in \mathbb{R}^H$, indicating to which type the token belongs (vehicle or route segment).
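A minimal PyTorch sketch of this embedding step, assuming $H = 256$ and two object types; the class and variable names are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Linear projection rho: R^6 -> R^H plus a learnable per-type embedding."""
    def __init__(self, hidden_dim: int = 256, num_attrs: int = 6, num_types: int = 2):
        super().__init__()
        self.proj = nn.Linear(num_attrs, hidden_dim)
        self.type_emb = nn.Embedding(num_types, hidden_dim)   # 0 = vehicle, 1 = route segment

    def forward(self, tokens: torch.Tensor, obj_type: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, 6) attribute vectors, obj_type: (B, N) integer type ids
        return self.proj(tokens) + self.type_emb(obj_type)
```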
Figure 2: Planning Transformer (PlanT). We represent the scene (bottom left) as a set of objects comprising the vehicles and the route to follow (green arrows). We embed these objects via a linear projection (bottom right) and process them with a transformer encoder. PlanT outputs future waypoints through a GRU decoder. We use a self-supervised auxiliary task of forecasting the future of the other vehicles. In addition, extracting and visualizing the attention weights yields explainable decisions (top left).

Is the "linear projection" mentioned here an MLP? Strictly speaking, no: a linear projection is a single fully-connected layer without a non-linearity, whereas an MLP stacks several such layers with non-linear activations.

Main Task: Waypoint Prediction. The main building block for the IL policy $\pi$ is a standard transformer encoder, $\tau : \mathbb{R}^{(V_t+S_t+1) \times H} \rightarrow \mathbb{R}^{(V_t+S_t+1) \times H}$, based on the BERT architecture [37]. Specifically, we define a learnable [CLS] token, $c \in \mathbb{R}^H$ (based on [37, 39]), and stack this with the other token embeddings to obtain the transformer input $[c, e_{1,t}, \ldots, e_{V_t+S_t,t}]$. The [CLS] token's processing through $\tau$ involves an attention-based aggregation of the features from all other tokens, after which it is used for generating the waypoint predictions via an auto-regressive waypoint decoder, $\gamma : \mathbb{R}^{H+1} \rightarrow \mathbb{R}^{W \times 2}$. For a detailed description of the waypoint decoder architecture, see [18, 26]. We concatenate the binary traffic light flag $l_t$ to the transformer output as the initial hidden state to the decoder, which makes use of GRUs [68] to predict the future trajectory $\mathcal{W}_t$ of the ego vehicle, centered at the coordinate frame of the current time-step $t$. The trajectory is represented by a sequence of 2D waypoints in BEV space, $\{\mathbf{w}_w = (x_w, y_w)\}_{w=t+1}^{t+W}$, for $W = 4$ future time-steps:
$$\mathcal{W}_t = \{\mathbf{w}_w\}_{w=t+1}^{t+W} = \gamma\big(\big[\,\tau([c, e_{1,t}, \ldots, e_{V_t+S_t,t}])_{[\mathrm{CLS}]}\,;\; l_t\,\big]\big)$$
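As a rough, self-contained illustration of this pipeline, the sketch below combines the embedding, a generic transformer encoder in place of the BERT backbone, and a simplified auto-regressive GRU decoder. All hyper-parameters and the exact decoder wiring are assumptions; see [18, 26] for the actual decoder.

```python
import torch
import torch.nn as nn

class PlanTSketch(nn.Module):
    """Simplified PlanT-style policy: object tokens + [CLS] -> encoder -> GRU waypoint decoder."""
    def __init__(self, hidden_dim=256, num_layers=4, num_heads=8, num_waypoints=4):
        super().__init__()
        self.proj = nn.Linear(6, hidden_dim)                      # rho: R^6 -> R^H
        self.type_emb = nn.Embedding(2, hidden_dim)               # vehicle / route segment
        self.cls = nn.Parameter(torch.zeros(1, 1, hidden_dim))    # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.gru = nn.GRU(input_size=2, hidden_size=hidden_dim + 1, batch_first=True)
        self.to_xy = nn.Linear(hidden_dim + 1, 2)
        self.num_waypoints = num_waypoints

    def forward(self, tokens, obj_type, light):
        # tokens: (B, N, 6), obj_type: (B, N), light: (B,) binary traffic-light flag
        emb = self.proj(tokens) + self.type_emb(obj_type)
        x = torch.cat([self.cls.expand(emb.size(0), -1, -1), emb], dim=1)
        h_cls = self.encoder(x)[:, 0]                              # aggregated [CLS] feature
        hidden = torch.cat([h_cls, light.float().unsqueeze(-1)], dim=-1).unsqueeze(0)
        wp, pos = [], torch.zeros(emb.size(0), 1, 2, device=emb.device)
        for _ in range(self.num_waypoints):                        # auto-regressive decoding
            out, hidden = self.gru(pos, hidden)
            pos = pos + self.to_xy(out)                            # predict offset to the next waypoint
            wp.append(pos)
        return torch.cat(wp, dim=1)                                # (B, W, 2) future waypoints

# smoke test on random inputs
model = PlanTSketch()
waypoints = model(torch.randn(2, 10, 6), torch.zeros(2, 10, dtype=torch.long), torch.ones(2))
print(waypoints.shape)   # torch.Size([2, 4, 2])
```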
Auxiliary Task: Vehicle Future Prediction. In addition to the primary waypoint prediction task, we propose the auxiliary task of predicting the future attributes of other vehicles. This is aligned with the overall driving goal in two ways. (1) The ability to reason about the future of other vehicles is important in an urban environment as it heavily influences the ego vehicle's own future. (2) Our main task is to predict the ego vehicle's future trajectory, which means the output feature of the transformer needs to encode all the information necessary to predict the future. Supervising the outputs of all vehicles on a similar task (i.e., predicting vehicle poses at a future time-step) exploits synergies between the task of the ego vehicle and the other vehicles [69, 30]. Specifically, using the output embeddings $\{h_{i,t}\}_{i=1}^{V_t}$ corresponding to all vehicle tokens $o_{i,t} \in V_t$, we predict class probabilities $\{\{p_{i,t+1,a}\}_{a=1}^{6}\}_{i=1}^{V_t}$ for the speed, position, orientation, and extent attributes of the next time-step $\{o_{i,t+1,a}\}_{a=1}^{6}$ using a linear layer per attribute type, $\{\psi_a : \mathbb{R}^H \rightarrow \mathbb{R}^{Z_a}\}_{a=1}^{6}$:
$$\{p_{i,t+1,a}\}_{a=1}^{6} = \{\psi_a(h_{i,t})\}_{a=1}^{6}, \qquad i = 1, \ldots, V_t$$
We choose to discretize each attribute into $Z_a$ bins to allow for uncertainty in the predictions since the future is multi-modal. This is also better aligned with how humans drive without predicting exact locations and velocities, where a rough estimate is sufficient to make a safe decision.
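A sketch of what these per-attribute heads and the binning could look like. The bin counts for position, speed, and orientation follow the implementation details in Section 4; the 16 bins for the extent, the uniform binning, and the value ranges are assumptions.

```python
import torch
import torch.nn as nn

# bins per attribute: 128 for positions, 4 for speed, 32 for orientation (per Section 4);
# 16 bins for the extent attributes are an assumption
NUM_BINS = {"speed": 4, "x": 128, "y": 128, "phi": 32, "w": 16, "h": 16}

class AuxiliaryHeads(nn.Module):
    """One linear classifier psi_a per attribute, predicting the binned value at t+1."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.heads = nn.ModuleDict({a: nn.Linear(hidden_dim, z) for a, z in NUM_BINS.items()})

    def forward(self, vehicle_feats):
        # vehicle_feats: (B, V_t, H) transformer outputs of the vehicle tokens
        return {a: head(vehicle_feats) for a, head in self.heads.items()}   # per-attribute logits

def quantize(values: torch.Tensor, lo: float, hi: float, num_bins: int) -> torch.Tensor:
    """Uniformly bin a continuous attribute into class indices (uniform binning is an assumption)."""
    idx = (values - lo) / (hi - lo) * num_bins
    return idx.long().clamp(0, num_bins - 1)
```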
Loss Functions. Following recent driving models [9, 11, 30, 26], we leverage the $L_1$ loss to the ground-truth future waypoints $\mathbf{w}^{gt}$ as our main training objective. For the auxiliary task, we calculate the cross-entropy loss $\mathcal{L}_{CE}$ using a one-hot encoded representation $p^{gt}_{i,t+1}$ of the ground-truth future vehicle attributes $o^{gt}_{i,t+1}$. We train the model in a multi-task setting using a weighted combination of these losses with a weighting factor $\lambda$:
$$\mathcal{L} \;=\; \sum_{w=t+1}^{t+W} \big\lVert \mathbf{w}_w - \mathbf{w}_w^{gt} \big\rVert_1 \;+\; \lambda \sum_{i=1}^{V_t} \sum_{a=1}^{6} \mathcal{L}_{CE}\big(p_{i,t+1,a},\, p^{gt}_{i,t+1,a}\big)$$
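A compact sketch of this objective; averaging the cross-entropy terms over the six attributes is an assumption about the exact reduction.

```python
import torch.nn.functional as F

def plant_loss(pred_wp, gt_wp, aux_logits, aux_targets, lam=0.2):
    """L1 waypoint loss plus lambda-weighted auxiliary cross-entropy (sketch).

    pred_wp, gt_wp : (B, W, 2) predicted / ground-truth future waypoints
    aux_logits     : dict of (B, V, Z_a) per-attribute logits for time-step t+1
    aux_targets    : dict of (B, V) ground-truth bin indices
    """
    loss_wp = F.l1_loss(pred_wp, gt_wp)
    loss_aux = sum(
        F.cross_entropy(logits.flatten(0, 1), aux_targets[a].flatten())
        for a, logits in aux_logits.items()
    ) / len(aux_logits)
    return loss_wp + lam * loss_aux
```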

4 Experiments

In this section, we describe our experimental setup, evaluate the driving performance of our approach, analyze the explainability of its driving decisions, and finally discuss limitations.
Dataset and Benchmark. We use the expert, dataset, and evaluation benchmark Longest6 proposed by [26]. The expert policy is a rule-based algorithm with access to ground-truth locations of the vehicles as well as privileged information that is not available to PlanT, such as their actions and dynamics. Using this information, the expert determines the future position of all vehicles and estimates intersections between its own future position and those of the other vehicles to prevent most collisions. The dataset collected with this expert contains 228k frames. We use this as our reference point, denoted by 1×. For our analysis, we also generate additional data following [26] but with different initializations of the traffic. The data quantities we use are always relative to the original dataset (i.e., 2× contains double the data, 3× contains triple). We refer the reader to [26] for a detailed description of the expert algorithm and dataset collection.
Metrics. We report the established metrics of the CARLA leaderboard [70]: Route Completion (RC), Infraction Score (IS), and Driving Score (DS), which is the weighted average of the RC and IS. In addition, we show Collisions with Vehicles per kilometer (CV) and Inference Time (IT) for one forward pass of the model, measured in milliseconds on a single RTX 3080 GPU.
Baselines. To highlight the advantages of learning-based planning, we include a rule-based planning baseline that uses the same inputs as PlanT. It follows the same high-level algorithm as the expert but estimates the future of other vehicles using a constant speed assumption since it does not have access to their actions. AIM-BEV [18] is a recent privileged agent trained using IL. It uses a BEV semantic map input with channels for the road, lane markings, vehicles, and pedestrians, and a GRU identical to PlanT to predict a trajectory for the ego vehicle which is executed using lateral and longitudinal PID controllers. Roach [12] is a Reinforcement Learning (RL) based agent with a similar input representation as AIM-BEV that directly outputs driving actions. Roach and AIM-BEV are the closest existing methods to PlanT. However, they use a different input field of view in their representation leading to sub-optimal performance. We additionally build PlanCNN, a more competitive CNN-based approach for planning with the same training data and input information as PlanT, which is adapted from AIM-BEV to input a rasterized version of our object-level representation. We render the oriented vehicle bounding boxes in one channel, represent the speed of each pixel in a second channel, and render the oriented bounding boxes of the route in the third channel. We provide detailed descriptions of the baselines in the supplementary material.
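For reference, a sketch of how such a three-channel object raster could be produced from the token representation defined in Section 3; the grid size, the metres-per-pixel resolution, and the ego-centred pixel mapping are assumptions.

```python
import numpy as np
import cv2

def rasterize(vehicles, route, size=192, m_per_px=0.3125):
    """Render object tokens (z, x, y, phi, w, h) into a 3-channel BEV image.

    Channel 0: vehicle boxes, channel 1: per-pixel vehicle speed, channel 2: route boxes.
    """
    bev = np.zeros((3, size, size), dtype=np.float32)

    def box_pixels(x, y, phi, w, h):
        corners = np.array([[-h / 2, -w / 2], [h / 2, -w / 2], [h / 2, w / 2], [-h / 2, w / 2]])
        rot = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
        pts = corners @ rot.T + np.array([x, y])
        return ((pts / m_per_px) + size / 2).astype(np.int32)     # ego vehicle at the image centre

    for z, x, y, phi, w, h in vehicles:
        poly = box_pixels(x, y, phi, w, h)
        cv2.fillPoly(bev[0], [poly], 1.0)
        cv2.fillPoly(bev[1], [poly], float(z))                    # speed written into channel 1
    for _, x, y, phi, w, h in route:
        cv2.fillPoly(bev[2], [box_pixels(x, y, phi, w, h)], 1.0)
    return bev
```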
Implementation. Our analysis includes three BERT encoder variants taken from [71]: MINI, SMALL, and MEDIUM with 11.2M, 28.8M, and 41.4M parameters respectively. For PlanCNN, we experiment with two backbones: ResNet-18 and ResNet-34. We choose these architectures to maintain an IT which enables real-time execution. We train the models from scratch on 4 RTX 2080Ti GPUs with a total batch size of 128. Optimization is done with AdamW [72] for 47 epochs with an initial learning rate of $10^{-4}$, which we decay by 0.1 after 45 epochs. Training takes approximately 3.2 hours for the PlanT MEDIUM variant on the 3× dataset. We set the weight decay to 0.1 and clip the gradient norm at 1.0. For the auxiliary objective, we use quantization precisions $Z_a$ of 128 bins for the position, 4 bins for the speed, and 32 bins for the orientation of the vehicles. We use $T_{in} = 0$ and $\delta_t = 1$ for auxiliary supervision. The loss weight $\lambda$ is set to 0.2. By default, we use $D_{\max} = 30$ m, $N_s = 2$, and $L_{\max} = 10$ m. For our experiment with a full driving stack, we use a perception module based on TransFuser [26] to obtain the object-level input representation for PlanT. Additional details regarding this perception module as well as detailed ablation studies on the multi-task training and input representation hyperparameters are provided in the supplementary material.
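These settings translate roughly into the following PyTorch training loop; `model`, `train_loader`, and `plant_loss` refer to the hypothetical sketches above (with `model` assumed to return both waypoints and auxiliary logits), and the batch layout is an assumption.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=45, gamma=0.1)   # decay by 0.1 after epoch 45

for epoch in range(47):
    for batch in train_loader:
        optimizer.zero_grad()
        pred_wp, aux_logits = model(batch["tokens"], batch["types"], batch["light"])
        loss = plant_loss(pred_wp, batch["gt_wp"], aux_logits, batch["aux_targets"], lam=0.2)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)           # clip gradient norm at 1.0
        optimizer.step()
    scheduler.step()
```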

4.1 Obtaining Expert-Level Driving Performance

In the following, we discuss the key findings of our study which enable expert-level driving with learned planners. We begin with a discussion of the privileged methods and analyze the sensor-based methods in Section 4.2. Unless otherwise specified, the experiments consider the largest version of our dataset (3×) and models (MEDIUM for PlanT, ResNet-34 for PlanCNN).
Input Representation. Table 1 compares the performance on the Longest6 benchmark. The rule-based system acts cautiously and gets blocked often. Among the learning-based methods, both PlanCNN and PlanT significantly outperform AIM-BEV [18] and Roach [12]. We systematically break down the factors leading to this in Table 2a by studying the following: (1) the representation used for the road layout, (2) the horizontal field of view, (3) whether objects behind the ego vehicle are part of the representation, and (4) whether the input representation incorporates speed.
输入表示表1 比较了在Longest6基准测试上的性能。基于规则的系统表现得相当谨慎,经常受阻。在基于学习的规划方法中,PlanCNN 和 PlanT 都显著优于AIM-BEV[18]和Roach[12]。我们在 表2a 中系统地分析了导致这一结果的因素,研究了以下方面:(1)用于道路布局的表示,(2)水平视野范围,(3)自我车辆背后的对象是否是表示的一部分,以及(4)输入表示是否包含速度信息
Table 1: Longest6 results. We show the mean ± standard deviation over 3 evaluations. PlanT reaches expert-level performance while requiring far less inference time than the baselines. *We evaluate the pre-trained models provided by the authors of LAV and Roach.
Roach uses the same view to the sides as AIM-BEV but additionally includes 8 m to the back and multiple input frames to reason about speed. We see in Table 2a that training PlanCNN in a configuration close to Roach (with the key differences being the removal of details from the map and a 0 m back view) results in a higher DS (59.97 vs. 55.27), demonstrating the importance of the route representation for urban driving. While additional information might be important when moving to more complex environments, our results suggest that the route is particularly important. Increasing the side view from 19.2 to 30 m improves PlanCNN from 59.97 to 70.72. Including vehicles to the rear further boosts PlanCNN's DS to 77.47 and improves PlanT's DS from 72.86 to 81.36. These results show that a full 360° field of view is helpful to handle certain situations encountered during our evaluation (e.g. changing lanes). Finally, removing the vehicle speed input significantly reduces the DS for both PlanCNN and PlanT (Table 2a), showing the importance of cues regarding motion.

Table 2: We study the choice of input representation (Table 2a) and architecture (Table 2b) for learning-based planners. Including vehicles behind the ego vehicle, encoding vehicle speeds, and scaling to large models/datasets are critical for the performance of both PlanCNN and PlanT.

Scaling. In Table 2b, we show the impact of scaling the dataset size and the model size for PlanT and PlanCNN. The circle size indicates the inference time (IT). First, we observe that PlanT demonstrates better data efficiency than PlanCNN, e.g., using the 1× data setting is sufficient to reach the same performance as PlanCNN with 2×. Interestingly, scaling the data from 1× to 3× leads to expert-level performance, showing the effectiveness of scaling. In fact, PlanT MEDIUM (81.36) outperforms the expert (76.91) in some evaluation runs. We visualize one consistent failure mode of the expert that leads to this discrepancy in Fig. 3a. We observe that the expert sometimes stops once it has already entered an intersection if it anticipates a collision, which then leads to collisions or blocked traffic. On the other hand, PlanT learns to wait further outside an intersection before entering, which is a smoother function than the discrete rule-based expert, and subsequently avoids these infractions. Importantly, in our final setting, PlanT MEDIUM is around 3× as fast as PlanCNN while being 4 points better in terms of the DS, and PlanT MINI is 5.3× as fast (IT = 5.46 ms) while reaching the same DS as PlanCNN. This shows that PlanT is suitable for systems where fast inference time is a requirement. We report results with multiple training seeds in the supplementary material.
(a) PlanT vs. the expert. PlanT waits further outside the intersection to avoid a collision.
(b) RFDS. The vehicle with the highest relevance score is marked with a red bounding rectangle. We show examples where the relevance score matches intuition (green frame) and failure cases (red frame).
Figure 3: We contrast a failure case of the expert with PlanT (Fig. 3a) and illustrate the quality of the relevance scores (Fig. 3b). The ego vehicle is marked with a yellow triangle; vehicles that lead to a collision or are intuitively most relevant in the scene are marked with blue boxes.

Loss. A detailed study of the training strategy for PlanT can be found in the supplementary material, where we show that the auxiliary loss proposed in Eq. (4) is crucial to its performance. However, since this is a self-supervised objective, it can be incorporated without additional annotation costs. This is in line with recent findings on training transformers that show the effectiveness of supervising multiple output tokens instead of just a single [CLS] token [73].

4.2 Combining an Off-the-Shelf Perception Module with PlanT

Next, we discuss the results of the sensor-based methods in Table 1. We compare the proposed approach to LAV [30] and TransFuser [26], which are recent state-of-the-art sensor-based methods. Our perception module is based on TransFuser, enabling a fair comparison to this approach. Therefore, PlanT with perception only detects vehicles to its front and has a limited view to the sides (16 m instead of 30 m). Our approach outperforms TransFuser [26] by 10.36 points and LAV by 24.92 points. While TransFuser uses an ensemble and manually designed heuristics to creep forward if stuck [26], these are unnecessary for PlanT with perception. Since we do not use an ensemble, we observe a 2.7× speedup (101.24 ms vs. 37.61 ms) in IT compared to TransFuser. We refer to the supplementary material for a more detailed analysis.

4.3 Explainability: Identification of Most Relevant Objects

Finally, we investigate the explainability of PlanT and PlanCNN by analyzing the objects in the scene that are relevant and crucial for the agent’s decision. In particular, we measure the relevance of an object in terms of the learned attention for PlanT and by considering the impact that the removal of each object has on the output predictions for PlanCNN. To quantify the ability to reason about the most relevant objects, we propose a novel evaluation scheme together with the Relative Filtered Driving Score (RFDS). For the rule-based expert algorithm, collision avoidance depends on a single vehicle which it identifies as the reason for braking. To measure the RFDS of a learned planner, we run one forward pass of the planner (without executing the actions) to obtain a scalar relevance score for each vehicle in the scene. We then execute the expert algorithm while restricting its observations to the (single) vehicle with the highest relevance score. The RFDS is defined as the relative DS of this restricted version of the expert compared to the default version which checks for collisions against all vehicles. We describe the extraction of the relevance score for PlanT and PlanCNN in the following. Our protocol leads to a fair comparison of different agents as the RFDS does not depend on the ability to drive itself but only on the obtained ranking of object relevance.
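Read as a formula, the metric relates the driving score of the restricted expert to that of the unrestricted expert; reporting it as a percentage is our reading of the numbers in Table 3 rather than an exact definition from the paper:

$$\text{RFDS} = 100 \cdot \frac{\mathrm{DS}\big(\text{expert restricted to the most relevant vehicle}\big)}{\mathrm{DS}\big(\text{expert observing all vehicles}\big)}$$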
Baselines. As a naïve baseline, we consider the inverse distance to the ego vehicle as a vehicle’s relevance score, such that the expert only sees the closest vehicle. For PlanT, we extract the relevance score by adding the attention weights of all layers and heads for the [CLS] token. This only requires a single forward pass of PlanT. Since PlanCNN does not use attention, we choose a masking method to find the most salient region in the image, using the same principle as [59, 58, 57]. We remove one object at a time from the input image and compute the L1 distance to the predicted waypoints for the full image. The objects are then ranked based on how much their absence affects the L1 distance.
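A sketch of the attention-based relevance extraction for PlanT. It assumes the per-layer attention maps have been collected (e.g., via forward hooks, since a stock encoder does not return them) and that token 0 is the [CLS] token followed by the vehicle tokens.

```python
import torch

def vehicle_relevance(attn_maps, num_vehicles):
    """Sum [CLS] attention over layers and heads to rank vehicle tokens.

    attn_maps: list of (B, heads, N + 1, N + 1) attention matrices, one per layer,
               where token 0 is [CLS] and tokens 1..V_t are the vehicles.
    Returns a (B, V_t) relevance score per vehicle.
    """
    cls_rows = torch.stack([a[:, :, 0, :] for a in attn_maps])   # (L, B, heads, N + 1)
    scores = cls_rows.sum(dim=(0, 2))                            # aggregate over layers and heads
    return scores[:, 1 : 1 + num_vehicles]
```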
Results. We provide results for the reasoning about relevant objects in Table 3. Both planners significantly outperform the distance-based baseline, with PlanT obtaining a mean RFDS of 96.82 compared to 82.83 for PlanCNN. We show qualitative examples in Figure 3b where we highlight the vehicle with the highest relevance score using a red bounding rectangle. Both planners correctly identify the most important object in simple scenarios. However, PlanT is also able to correctly identify the most important object in complex scenes. When merging into a lane (examples 1 & 2 from the left) it correctly looks at the moving vehicles coming from the rear to avoid collisions. Example 3 shows advanced reasoning about dynamics. The two vehicles closer to the ego vehicle are moving away at a high speed and are therefore not as relevant. PlanT already pays attention to the more distant vehicle behind them as this is the one that it would collide with if it does not brake. One of the failures of PlanT we observe is that it sometimes allocates the highest attention to a very close vehicle behind itself (example 4) and misses the relevant object. PlanCNN has more prominent errors when there are a large number of vehicles in the scene or when merging lanes (examples 2 & 3). To better assess the driving performance and relevance scores we provide additional results in the supplementary video.
Table 3: RFDS (Relative Filtered Driving Score). The relative score of the expert when it only observes the single most relevant vehicle according to the respective planner.

5 Conclusion

In this work, we take a step towards efficient, high-performance, explainable planning for autonomous driving with a novel object-level representation and transformer-based architecture called PlanT. Our experiments highlight the importance of correctly encoding the ego vehicle's route for planning. We show that incorporating a 360° field of view, information about vehicle speeds, and scaling up both the architecture and dataset size of a learned planner are essential to achieve state-of-the-art results. Additionally, PlanT significantly outperforms state-of-the-art end-to-end sensor-based models even with a noisy and incomplete input representation obtained via a perception module. Finally, we demonstrate that PlanT can reliably identify the most relevant object in the scene via a new metric and evaluation protocol that measure explainability.
Limitations. Firstly, the expert driver used in our IL-based training strategy does not achieve a perfect score (Table 1) and has certain consistent failure modes (Fig. 3a, more examples in the supplementary material). Human data collection to address this would be time-consuming (the 3× dataset used in our experiments contains around 95 hours of driving). Second, all our experiments are conducted in simulation. Real-world scenarios are more diverse and challenging. However, CARLA is a high-fidelity simulator actively used by many researchers for autonomous driving, and previous findings demonstrate that systems developed in simulators like CARLA can be transferred to the real world [31, 74, 75]. Finally, our experiment with perception (Section 4.2) uses a single off-the-shelf perception module that was not specifically optimized for PlanT, leading to sub-optimal performance. It is a well-known limitation of modular systems that downstream modules cannot recover easily from errors made by earlier modules. A thorough analysis of perception robustness and uncertainty encoding are important research directions beyond the scope of this work.
