[Paper translation] nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

洌泉_就这样吧

Last modified: 2022-05-07 19:05:01

First published: 2022-05-07 19:01:17

Article link: https://blog.csdn.net/baidu_35231778/article/details/124630660

Paper link: https://arxiv.org/pdf/2106.11810.pdf

Title

nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

1 Abstract

In this work, we propose the world’s first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, a lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.

Introduction

Large-scale human-labeled datasets in combination with deep Convolutional Neural Networks have led to an impressive performance increase in autonomous vehicle (AV) perception over the last few years [9, 4]. In contrast, existing solutions for AV planning are still primarily based on carefully engineered expert systems that require significant amounts of engineering to adapt to new geographies and do not scale with more training data. We believe that providing suitable data and metrics will enable ML-based planning and pave the way towards a full “Software 2.0” stack.

Existing real-world benchmarks are focused on short-term motion forecasting, also known as prediction [6, 4, 11, 8], rather than planning. This is evident in the lack of high-level goals, the choice of metrics, and the open-loop evaluation. Prediction focuses on the behavior of other agents, while planning relates to the ego vehicle behavior. Prediction is typically multi-modal, which means that for each agent we predict the N most likely trajectories. In contrast, planning is typically uni-modal (except for contingency planning) and we predict a single trajectory. As an example, in Fig. 1a, turning left or right at an intersection are equally likely options. Prediction datasets lack a baseline navigation route to indicate the high-level goals of the agents. In Fig. 1b, the options of merging immediately or later are both equally valid, but the commonly used L2 distance-based metrics (minADE, minFDE, and miss rate) penalize the option that was not observed in the data. Intuitively, the distance between the predicted trajectory and the observed trajectory is not a suitable indicator in a multi-modal scenario. In Fig. 1c, the decision whether to continue to overtake or get back into the lane should be based on the consecutive actions of all agent vehicles, which is not possible in open-loop evaluation. Lack of closed-loop evaluation leads to systematic drift, making it difficult to evaluate beyond a short time horizon (3-8s).

Figure 1. We show different driving scenarios to emphasize the limitations of existing benchmarks. The observed driving route of the ego vehicle is shown in white and the hypothetical planner route in red. (a) The absence of a goal leads to ambiguity at intersections. (b) Displacement metrics do not take into account the multi-modal nature of driving. (c) Open-loop evaluation does not take into account agent interaction.
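
The point made about Fig. 1b can be illustrated with a small numeric sketch. The waypoints, the function names and the 2 m miss threshold below are assumptions chosen for illustration; they do not come from the paper. A plan that merges later ends in the same lane as the logged drive, yet its average displacement error is non-zero.

```python
import math

def min_ade(candidates, observed):
    """Smallest average point-wise L2 error over all candidate trajectories."""
    return min(
        sum(math.dist(p, o) for p, o in zip(c, observed)) / len(observed)
        for c in candidates
    )

def min_fde(candidates, observed):
    """Smallest L2 error at the final timestep over all candidates."""
    return min(math.dist(c[-1], observed[-1]) for c in candidates)

def miss_rate(cases, threshold_m=2.0):
    """Fraction of (candidates, observed) cases whose minFDE exceeds the threshold.

    The 2 m default is an assumed value, not one taken from the paper.
    """
    misses = sum(min_fde(c, o) > threshold_m for c, o in cases)
    return misses / len(cases)

# Hypothetical (x, y) waypoints in metres, sampled at equal time steps.
observed_merge_now = [(0.0, 0.0), (10.0, 3.5), (20.0, 3.5), (30.0, 3.5)]
planned_merge_later = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 3.5)]

ade = min_ade([planned_merge_later], observed_merge_now)  # 1.75
fde = min_fde([planned_merge_later], observed_merge_now)  # 0.0
```

Both maneuvers are valid, but the later merge is charged 1.75 m of average error simply because the human driver happened to merge earlier.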

We instead provide a planning benchmark to address these shortcomings. Our main contributions are:

The largest existing public real-world dataset for autonomous driving with high quality autolabeled tracks from 4 cities.
Planning metrics related to traffic rule violation, human driving similarity, vehicle dynamics and goal achievement, as well as scenario-based metrics.
The first public benchmark for real-world data with a closed-loop planner evaluation protocol.

2 Related work

We review the relevant literature for prediction and planning datasets, simulation, and ML-based planning.

Prediction datasets. Table 1 shows a comparison between our dataset and relevant prediction datasets. Argoverse Motion Forecasting [6] was the first large-scale prediction dataset. With 320h of driving data, it was unprecedented in size and provides simple semantic maps with centerlines and driveable area annotations. However, the autolabeled trajectories in the dataset are of lower quality due to the state of the object detection field at the time and the insufficient amount of human-labeled training data (113 scenes).

Table 1. A comparison of leading datasets for motion prediction (Pred) and planning (Plan). We show the dataset size, number of cities, availability of sensor data, dataset type, and whether it uses open-loop (OL) or closed-loop (CL) evaluation. nuPredict refers to the prediction challenge of the nuScenes [4] dataset.

The nuScenes prediction challenge [4] consists of 850 human-labeled scenes from the nuScenes dataset. While the annotations are high quality and sensor data is provided, the small scale limits the number of driving variations. The Lyft Level 5 Prediction Dataset [11] contains 1118h of data from a single route of 6.8 miles. It features detailed semantic maps, aerial maps, and dynamic traffic light status. While the scale is unprecedented, the autolabeled tracks are often noisy and geographic diversity is limited. The Waymo Open Motion Dataset [8] focuses specifically on the interactions between agents, but does so using open-loop evaluation. While the dataset size is smaller than existing datasets at 570h, the autolabeled tracks are of high quality [17]. They provide semantic maps and dynamic traffic light status.

These datasets focus on prediction, rather than planning. In this work we aim to overcome this limitation by using planning metrics and closed-loop evaluation. We are the first large-scale dataset to provide sensor data.

2.1 Planning datasets

CommonRoad [1] provides a first-of-its-kind planning benchmark that is composed of different vehicle models, cost functions and scenarios (including goals and constraints). There are both pre-recorded and interactive scenarios. With 5700 scenarios in total, the scale of the dataset does not support training modern deep learning based methods. All scenarios lack sensor data.

2.2 Simulation

Simulators have enabled breakthroughs in planning and reinforcement learning with their ability to simulate physics, agents, and environmental conditions in a closed-loop environment. AirSim [19] is a high-fidelity simulator for AVs, such as drones and cars. It includes a physics engine that can operate at a high frequency for real-time hardware-in-the-loop simulation. CARLA [7] supports the training and validation of autonomous urban driving systems. It allows for flexible specification of sensor suites and environmental conditions. In the CARLA Autonomous Driving Challenge (carlachallenge.org) the goal is to navigate a set of waypoints using different combinations of sensor data and HD maps. Alternatively, users can use scene abstraction to omit the perception task and focus on planning and control aspects. This challenge is conceptually similar to what we propose, but does not use real world data and provides less detailed planning metrics.

Sim-to-real transfer is an active research area for diverse tasks such as localization, perception, prediction, planning and control. [21] show that the domain gap between simulated and real-world data remains an issue, by transferring a synthetically trained tracking model to the KITTI [9] dataset. To overcome the domain gap, they jointly train their model using real-world data for visible and simulation data for occluded objects. [3] learn how to drive by transferring a vision-based lane following driving policy from simulation to the real world without any real-world labels. [14] use reinforcement learning in simulation to obtain a driving system controlling a full-size real-world vehicle. They use mostly synthetic data, with labelled real-world data appearing only in the training of the segmentation network.

However, all simulations have fundamental limits since they introduce systematic biases. More work is required to plausibly emulate real-world sensors, e.g. to generate photo-realistic camera images.

2.3 ML-based planning

A new emerging research field is ML-based planning for AVs using real-world data. However, the field has yet to converge on a common input/output space, dataset, or metrics. A jointly learnable behavior and trajectory planner is proposed in [18]. An interpretable cost function is learned on top of models for perception, prediction and vehicle dynamics, and evaluated in open-loop on two unpublished datasets. An end-to-end interpretable neural motion planner [24] takes raw lidar point clouds and dynamic map data as inputs and predicts a cost map for planning. They evaluate in open-loop on an unpublished dataset, with a planning horizon of only 3s. ChauffeurNet [2] finds that standard behavior cloning is insufficient for handling complex driving scenarios, even when using as many as 30 million examples. They propose exposing the learner to synthesized data in the form of perturbations to the expert’s driving and augment the imitation loss with additional losses that penalize undesirable events and encourage progress. Their unpublished dataset contains 26 million examples which correspond to 60 days of continuous driving. The method is evaluated in a closed-loop and an open-loop setup, as well as in the real world. They also show that open-loop evaluation can be misleading compared to closed-loop. MP3 [5] proposes an end-to-end approach to mapless driving, where the input is raw lidar data and a high-level navigation goal. They evaluate on an unpublished dataset in open and closed-loop. Multi-modal methods have also been explored in recent works [16, 20, 13]. These approaches explore different strategies for fusing various modality representations in order to predict future waypoints or control commands. Neural planners were also used in [15, 10] to evaluate an object detector using the KL divergence of the planned trajectory and the observed route.

Existing works evaluate on different metrics which are inconsistent across the literature. TransFuser [16] evaluates its method on the number of infractions, the percentage of the route distance completed, and the route completion weighted by an infraction multiplier. Infractions include collisions with other agents and running red lights. [20] evaluates its planner using off-road time, off-lane time and number of crashes, while [13, 22] report the success rate of reaching a given destination within a fixed time window. [13] also introduces another metric which measures the average percentage of distance travelled to the goal.

While ML-based planning has been studied in great detail, the lack of published datasets and a standard set of metrics that provide a common framework for closed-loop evaluation has limited the progress in this area. We aim to fill this gap by providing an ML-based planning dataset and metrics.

3 Dataset

3.1 Overview

We plan to release 1500 hours of data from Las Vegas, Boston, Pittsburgh, and Singapore. Each city provides its unique driving challenges. For example, Las Vegas includes bustling casino pick-up and drop-off points (PUDOs) with complex interactions and busy intersections with up to 8 parallel driving lanes per direction, Boston routes include drivers who love to double park, Pittsburgh has its own custom precedence pattern for left turns at intersections, and Singapore features left-hand traffic. For each city we provide semantic maps and an API for efficient map queries. The dataset includes lidar point clouds, camera images, localization information and steering inputs. While we release autolabeled agent trajectories on the entire dataset, we make only a subset of the sensor data available due to the vast scale of the dataset (200+ TB).

3.2 Autolabeling

We use an offline perception system to label the large-scale dataset at high accuracy, without the real-time constraints imposed on the online perception system of an AV. We use PointPillars [12] with CenterPoint [23], a modified version of multi-view fusion (MVF++) [17], and non-causal tracking to achieve near-human labeling performance.

3.3 Scenarios

To enable scenario-based metrics, we automatically annotate intervals with tags for complex scenarios. These scenarios include merges, lane changes, protected or unprotected left or right turns, interaction with cyclists, interaction with pedestrians at crosswalks or elsewhere, interactions with close proximity or high acceleration, double parked vehicles, stop-controlled intersections and driving in construction zones.
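
As a rough illustration of interval tagging, the sketch below turns a per-timestep condition into tagged time intervals. The paper does not describe its tagging rules, so the function name, the tag name and the 3.0 m/s² threshold are all hypothetical.

```python
def tag_intervals(timestamps, flags):
    """Collapse per-timestep boolean flags into (start, end) time intervals."""
    intervals = []
    start = None
    previous = None
    for t, active in zip(timestamps, flags):
        if active and start is None:
            start = t                        # condition just became true
        elif not active and start is not None:
            intervals.append((start, previous))  # condition just ended
            start = None
        previous = t
    if start is not None:                    # condition still true at the end
        intervals.append((start, previous))
    return intervals

HIGH_ACCEL_THRESHOLD = 3.0  # m/s^2, assumed value for illustration only

timestamps = [0.0, 1.0, 2.0, 3.0, 4.0]       # seconds
accelerations = [0.5, 3.2, 3.5, 1.0, 4.0]    # m/s^2, made-up samples
flags = [abs(a) > HIGH_ACCEL_THRESHOLD for a in accelerations]
tags = {"high_acceleration": tag_intervals(timestamps, flags)}
# tags == {"high_acceleration": [(1.0, 2.0), (4.0, 4.0)]}
```

Scenario-specific metrics can then be evaluated only inside the intervals that carry the matching tag.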

4 Benchmarks

To further the state of the art in ML-based planning, we organize benchmark challenges with the tasks and metrics described below.

4.1. Overview

To evaluate a proposed method against the benchmark dataset, users submit ML-based planning code to our evaluation server. The code must follow a provided template. Contrary to most benchmarks, the code is containerized for portability in order to enable closed-loop evaluation on a secret test set. The planner operates either on the autolabeled trajectories or, for end-to-end open-loop approaches, directly on the raw sensor data. When queried for a particular timestep, the planner returns the planned position and heading of the ego vehicle. A provided controller will then drive a vehicle while closely tracking the planned trajectory. We use a predefined motion model to simulate the ego vehicle motion in order to approximate a real system. The final driven trajectory is then scored against the metrics defined in Sec 4.2.
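
The query, track and simulate cycle described above might look roughly like the following sketch. It is not the benchmark's template or API: `GoalSeekingPlanner`, `track_and_step`, the proportional steering rule and the kinematic bicycle update are all stand-ins chosen for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float
    y: float
    heading: float  # radians
    speed: float    # m/s

class GoalSeekingPlanner:
    """Toy planner that heads straight for a goal (stand-in for an ML model)."""
    def __init__(self, goal):
        self.goal = goal

    def plan(self, state, horizon=4, dt=0.5):
        """Return planned (x, y, heading) poses for the next `horizon` steps."""
        bearing = math.atan2(self.goal[1] - state.y, self.goal[0] - state.x)
        step = state.speed * dt
        return [
            (state.x + k * step * math.cos(bearing),
             state.y + k * step * math.sin(bearing),
             bearing)
            for k in range(1, horizon + 1)
        ]

def track_and_step(state, target, dt=0.5, wheelbase=3.0, max_steer=0.5):
    """Proportional steering towards the target, then a kinematic bicycle update."""
    desired = math.atan2(target[1] - state.y, target[0] - state.x)
    error = math.atan2(math.sin(desired - state.heading),
                       math.cos(desired - state.heading))  # wrapped to [-pi, pi]
    steer = max(-max_steer, min(max_steer, error))
    return EgoState(
        x=state.x + state.speed * math.cos(state.heading) * dt,
        y=state.y + state.speed * math.sin(state.heading) * dt,
        heading=state.heading + state.speed / wheelbase * math.tan(steer) * dt,
        speed=state.speed,
    )

def run_closed_loop(planner, state, steps, dt=0.5):
    """Query the planner at every timestep and simulate the ego vehicle."""
    driven = [state]
    for _ in range(steps):
        trajectory = planner.plan(state, dt=dt)  # re-plan from the simulated state
        state = track_and_step(state, trajectory[0], dt=dt)
        driven.append(state)
    return driven

driven = run_closed_loop(GoalSeekingPlanner(goal=(50.0, 0.0)),
                         EgoState(0.0, 0.0, 0.0, 5.0), steps=4)
```

The list `driven` plays the role of the final driven trajectory, which is what gets scored, rather than any single planned trajectory.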

4.2. Tasks

We present the three different tasks for our dataset with increasing difficulty.

Open-loop

In the first challenge, we task the planning system to mimic a human driver. For every timestep, the trajectory is scored based on predefined metrics. It is not used to control the vehicle. In this case, no interactions are considered.

Closed-loop

In the closed-loop setup the planner outputs a planned trajectory using the information available at each timestep, similar to the previous case. However, the proposed trajectory is used as a reference for a controller, and thus, the planning system is gradually corrected at each timestep with the new state of the vehicle. While the new state of the vehicle may not coincide with that of the recorded state, leading to different camera views or lidar point clouds, we will not perform any sensor data warping or novel view synthesis. In this set, we distinguish between two tasks. In the non-reactive closed-loop task we do not make any assumptions on other agents' behavior and simply use the observed agent trajectories. As shown in [11], the vast majority of interventions in closed-loop simulation are due to the non-reactive nature, e.g. vehicles naively colliding with the ego vehicle. In the reactive closed-loop task we provide a planning model for all other agents that are tracked like the ego vehicle.
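
The difference between the two closed-loop tasks can be sketched with a one-dimensional toy case. The paper does not say which planning model drives the reactive agents; the Intelligent Driver Model (IDM) below is a swapped-in, commonly used car-following model, and every parameter value is an assumption. The ego vehicle stops, which it did not do in the log, while a logged vehicle follows behind it.

```python
import math

def idm_acceleration(speed, gap, closing_speed,
                     desired_speed=15.0, headway=1.5, min_gap=2.0,
                     max_accel=1.5, comfort_decel=2.0):
    """Intelligent Driver Model acceleration of a follower (assumed parameters)."""
    desired_gap = min_gap + speed * headway + speed * closing_speed / (
        2.0 * math.sqrt(max_accel * comfort_decel))
    return max_accel * (1.0 - (speed / desired_speed) ** 4
                        - (desired_gap / gap) ** 2)

def follower_hits_stopped_ego(reactive, initial_gap=45.0, logged_speed=10.0,
                              dt=0.1, duration=20.0):
    """Ego stands still ahead; return True if the following agent reaches it."""
    gap, speed = initial_gap, logged_speed
    for _ in range(int(duration / dt)):
        if reactive:
            # Reactive agent: adapts its speed to the simulated ego vehicle.
            accel = idm_acceleration(speed, gap, closing_speed=speed)
            speed = max(0.0, speed + accel * dt)
        # Non-reactive agent: keeps replaying its logged speed unchanged.
        gap -= speed * dt
        if gap <= 0.0:
            return True
    return False

collides_when_replayed = follower_hits_stopped_ego(reactive=False)
collides_when_reactive = follower_hits_stopped_ego(reactive=True)
```

Under these assumptions the replayed agent drives into the stopped ego vehicle, while the reactive agent brakes and settles near the minimum gap. This is the kind of naive collision the non-reactive setting produces.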

4.3. Metrics

We split the metrics into two categories: common metrics, which are computed for every scenario, and scenario-based metrics, which are tailored to predefined scenarios.

Common metrics

Traffic rule violation is used to measure compliance with common traffic rules. We compute the rate of collisions with other agents, rate of off-road trajectories, the time gap to lead agents, time to collision and the relative velocity while passing an agent as a function of the passing distance.
Human driving similarity is used to quantify maneuver satisfaction in comparison to a human, e.g. longitudinal velocity error, longitudinal stop position error and lateral position error. In addition, the resulting jerk/acceleration is compared to the human-level jerk/acceleration.
Vehicle dynamics quantify rider comfort and feasibility of a trajectory. Rider comfort is measured by jerk, acceleration, steering rate and vehicle oscillation. Feasibility is measured by violation of predefined limits of the same criteria.
Goal achievement measures the route progress towards a goal waypoint on the map using L2 distance.
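
A few of these quantities can be written down directly. The sketch below uses textbook definitions of time gap, time to collision and finite-difference jerk; the exact formulas used by the benchmark are not given in the paper, and the normalisation in `goal_progress` is an assumption.

```python
import math

def time_gap(gap_m, ego_speed):
    """Seconds until the ego reaches the lead agent's current position."""
    return gap_m / ego_speed if ego_speed > 0 else math.inf

def time_to_collision(gap_m, ego_speed, lead_speed):
    """Seconds until contact if both vehicles keep their current speeds."""
    closing = ego_speed - lead_speed
    return gap_m / closing if closing > 0 else math.inf

def finite_difference(values, dt):
    """First-order derivative estimate of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def goal_progress(start, end, goal):
    """Assumed normalisation: 1.0 at the goal, 0.0 if no closer than at the start."""
    return 1.0 - math.dist(end, goal) / math.dist(start, goal)

speeds = [10.0, 10.5, 11.5, 13.0]                  # m/s, made-up, 0.5 s apart
accelerations = finite_difference(speeds, dt=0.5)  # [1.0, 2.0, 3.0] m/s^2
jerks = finite_difference(accelerations, dt=0.5)   # [2.0, 2.0] m/s^3

gap_s = time_gap(20.0, ego_speed=10.0)                           # 2.0
ttc_s = time_to_collision(20.0, ego_speed=10.0, lead_speed=5.0)  # 4.0
progress = goal_progress((0.0, 0.0), (30.0, 0.0), (40.0, 0.0))   # 0.75
```

Comfort and feasibility checks then amount to comparing values such as `jerks` against predefined limits.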

Scenario-based metrics

Based on the scenario tags from Sec. 3, we use additional metrics for challenging maneuvers. For lane change, time to collision and time gap to lead/rear agent on the target lane are measured and scored. For pedestrian/cyclist interaction, we quantify the passing relative velocity while differentiating their location. Furthermore, we compare the agreement between decisions made by a planner and human for crosswalks and unprotected turns (right of way).

Community feedback

Note that the metrics shown here are an initial proposal and do not form an exhaustive list. We will work closely with the community to add novel scenarios and metrics to achieve consensus across the community. Likewise, for the main challenge metric we see multiple options, such as a weighted sum of metrics, a weighted sum of metric violations above a predefined threshold or a hierarchy of metrics. We invite the community to collaborate with us to define the metrics that will drive this field forward.
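
The three aggregation options mentioned above could take roughly the following form. The paper fixes no weights, thresholds or metric ordering, so every name and number here is hypothetical.

```python
def weighted_sum(scores, weights):
    """Option 1: normalised weighted sum of per-metric scores."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * value for name, value in scores.items()) / total

def weighted_violations(values, thresholds, weights):
    """Option 2: weighted sum of the amounts by which metrics exceed thresholds."""
    return sum(weights[name] * max(0.0, value - thresholds[name])
               for name, value in values.items())

def hierarchy_key(result, order=("collisions", "off_road_rate", "comfort_violations")):
    """Option 3: lexicographic ordering; earlier metrics dominate later ones."""
    return tuple(result[name] for name in order)

score = weighted_sum({"progress": 0.5, "comfort": 1.0},
                     {"progress": 1.0, "comfort": 3.0})       # 0.875
penalty = weighted_violations({"jerk": 5.0, "accel": 2.0},
                              {"jerk": 4.0, "accel": 3.0},
                              {"jerk": 2.0, "accel": 1.0})    # 2.0

planners = {
    "planner_a": {"collisions": 0, "off_road_rate": 0.10, "comfort_violations": 3},
    "planner_b": {"collisions": 1, "off_road_rate": 0.00, "comfort_violations": 0},
}
ranking = sorted(planners, key=lambda name: hierarchy_key(planners[name]))
# ranking == ["planner_a", "planner_b"]
```

In the hierarchy variant `planner_a` ranks first despite worse off-road and comfort numbers, because it has fewer collisions. A weighted sum could rank the two the other way round, which shows how much the choice of aggregation matters.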

5. Conclusion

In this work we proposed the first ML-based planning benchmark for AVs. Contrary to existing forecasting benchmarks, we focus on goal-based planning, planning metrics and closed-loop evaluation. We hope that by providing a common benchmark, we will pave a path towards progress in ML-based planning, which is one of the final frontiers in autonomous driving.

References

[1] Matthias Althoff, Markus Koschi, and Stefanie Manzinger. CommonRoad: Composable benchmarks for motion planning on roads. In Proc. of the IEEE Intelligent Vehicles Symposium, 2017. 2
[2] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. In RSS, 2019. 2
[3] Alex Bewley, Jessica Rigley, Yuxuan Liu, Jeffrey Hawke, Richard Shen, Vinh-Dieu Lam, and Alex Kendall. Learning to drive from simulation without real world labels. In ICRA, 2019. 2
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 1, 2
[5] Sergio Casas, Abbas Sadat, and Raquel Urtasun. MP3: A unified model to map, perceive, predict and plan. In CVPR, 2021. 3
[6] Ming-Fang Chang, John W Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019. 1, 2
[7] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. CoRR, 2017. 2
[8] Scott Ettinger, Shuyang Cheng, and Benjamin Caine et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. arXiv preprint arXiv:2104.10133, 2021. 1, 2
[9] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 32(11):1231–1237, 2013. 1, 2
[10] Yiluan Guo, Holger Caesar, Oscar Beijbom, Jonah Philion, and Sanja Fidler. The efficacy of neural planning metrics: A meta-analysis of PKL on nuscenes. In IROS Workshop on Benchmarking Progress in Autonomous Driving, 2020. 3
[11] John Houston, Guido Zuidhof, and Luca Bergamini et al. One thousand and one hours: Self-driving motion prediction dataset. arXiv preprint arXiv:2006.14480, 2020. 1, 2, 4
[12] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019. 3
[13] Hesham M. Eraqi, Mohamed N. Moustafa, and Jens Honer. Conditional imitation learning driving considering camera and lidar fusion. In NeurIPS, 2020. 3
[14] Blazej Osinski, Adam Jakubowski, Pawel Ziecina, Piotr Milos, Christopher Galias, Silviu Homoceanu, and Henryk Michalewski. Simulation-based reinforcement learning for real-world autonomous driving. In ICRA, 2020. 2
[15] Jonah Philion, Amlan Kar, and Sanja Fidler. Learning to evaluate perception models using planner-centric metrics. In CVPR, 2020. 3
[16] Aditya Prakash, Kashyap Chitta, and Andreas Geiger. Multimodal fusion transformer for end-to-end autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 3
[17] Charles R. Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3d object detection from point cloud sequences. arXiv preprint arXiv:2103.05073, 2021. 2, 3
[18] Abbas Sadat, Mengye Ren, Andrei Pokrovsky, Yen-Chen Lin, Ersin Yumer, and Raquel Urtasun. Jointly learnable behavior and trajectory planning for self-driving vehicles. In IROS, 2019. 2
[19] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017. 2
[20] Ibrahim Sobh, Loay Amin, Sherif Abdelkarim, Khaled Elmadawy, Mahmoud Saeed, Omar Abdeltawab, Mostafa Gamal, and Ahmad El Sallab. End-to-end multi-modal sensors fusion system for urban automated driving. In NeurIPS, 2018. 3
[21] Pavel Tokmakov, Jie Li, Wolfram Burgard, and Adrien Gaidon. Learning to track with object permanence. arXiv preprint arXiv:2103.14258, 2021. 2
[22] Yi Xiao, Felipe Codevilla, Akhil Gurram, Onay Urfalioglu, and Antonio M. López. Multimodal end-to-end autonomous driving. arXiv preprint arXiv:1906.03199, 2019. 3
[23] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. arXiv preprint arXiv:2006.11275, 2020. 3
[24] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, 2021. 2