nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Abstract

In this work, we propose the world’s first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, a lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.

1. Introduction

Large-scale human-labeled datasets in combination with deep Convolutional Neural Networks have led to an impressive performance increase in autonomous vehicle (AV) perception over the last few years [9, 4]. In contrast, existing solutions for AV planning are still primarily based on carefully engineered expert systems that require significant amounts of engineering to adapt to new geographies and do not scale with more training data. We believe that providing suitable data and metrics will enable ML-based planning and pave the way towards a full “Software 2.0” stack.
Such a stack emphasizes using machine learning models to design and implement software functionality, rather than traditional rule-based programming.
Existing real-world benchmarks are focused on short-term motion forecasting, also known as prediction [6, 4, 11, 8], rather than planning. This is evident in the lack of high-level goals, the choice of metrics, and the open-loop evaluation. Prediction focuses on the behavior of other agents, while planning relates to the ego vehicle behavior.
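In autonomous driving, motion prediction usually means forecasting the trajectories of other vehicles, pedestrians, or cyclists over a short future horizon. Planning builds on these predictions, together with the ego vehicle's current state and high-level goals (such as a destination), to decide the ego vehicle's best path and behavior. Planning must therefore account for longer-term factors such as obeying traffic rules and optimizing travel time or comfort.
Existing benchmarks may not adequately support long-term planning, possibly because short-term prediction is technically easier to implement, or because suitable data and evaluation methods are lacking. More advanced autonomous driving capabilities, however, require systems that can plan over long horizons, along with benchmarks and metrics to evaluate them.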
Prediction is typically multi-modal, which means that for each agent we predict the N most likely trajectories. In contrast, planning is typically uni-modal (except for contingency planning) and we predict a single trajectory. As an example, in Fig. 1a, turning left or right at an intersection are equally likely options. Prediction datasets lack a baseline navigation route to indicate the high-level goals of the agents. In Fig. 1b, the options of merging immediately or later are both equally valid, but the commonly used L2 distance-based metrics (minADE, minFDE, and miss rate) penalize the option that was not observed in the data. Intuitively, the distance between the predicted trajectory and the observed trajectory is not a suitable indicator in a multi-modal scenario. In Fig. 1c, the decision whether to continue to overtake or get back into the lane should be based on the consecutive actions of all agent vehicles, which is not possible in open-loop evaluation. Lack of closed-loop evaluation leads to systematic drift, making it difficult to evaluate beyond a short time horizon (3-8s).
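Figure 1. Different driving scenarios that highlight the shortcomings of existing benchmarks. The observed driving route of the ego vehicle is shown in white and the hypothetical planner route in red:
(a) The absence of a goal leads to ambiguity at intersections.
(b) Displacement metrics do not take into account the multi-modal nature of driving.
(c) Open-loop evaluation does not take into account the interaction between agents.
These scenarios illustrate that AV planning needs more advanced evaluation methods to accurately reflect a vehicle's behavior and decisions in complex traffic.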
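To make the point about Fig. 1b concrete, the following is a minimal sketch of the L2 displacement metrics named above (minADE, minFDE, and a per-sample miss flag). The trajectories, the function names, and the 2 m miss threshold are illustrative assumptions, not values from the paper.

```python
import math

def displacement_errors(pred, observed):
    """Per-timestep L2 distance between two equal-length (x, y) trajectories."""
    return [math.dist(p, o) for p, o in zip(pred, observed)]

def min_ade_fde(modes, observed, miss_threshold=2.0):
    """Return (minADE, minFDE, missed) over a set of predicted modes.

    miss_threshold is in metres and is an illustrative value only.
    """
    ades, fdes = [], []
    for mode in modes:
        errs = displacement_errors(mode, observed)
        ades.append(sum(errs) / len(errs))  # average displacement error
        fdes.append(errs[-1])               # final displacement error
    min_ade, min_fde = min(ades), min(fdes)
    return min_ade, min_fde, min_fde > miss_threshold

# Observed log: the human merges late (lateral offset 3.5 m at the last step).
observed = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0), (20.0, 3.5)]
# Hypothetical planner: merges immediately, which is an equally valid maneuver.
early_merge = [(0.0, 0.0), (5.0, 3.5), (10.0, 3.5), (15.0, 3.5), (20.0, 3.5)]

min_ade, min_fde, missed = min_ade_fde([early_merge], observed)
print(min_ade, min_fde, missed)
```

Both trajectories end in the same place, so the final displacement error is zero, yet the average displacement error is non-zero purely because the planner chose the other valid merge timing.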
We instead provide a planning benchmark to address these shortcomings. Our main contributions are:
• The largest existing public real-world dataset for autonomous driving with high-quality autolabeled tracks from 4 cities.
• Planning metrics related to traffic rule violation, human driving similarity, vehicle dynamics, and goal achievement, as well as scenario-based metrics.
• The first public benchmark for real-world data with a closed-loop planner evaluation protocol.

2. Related Work

We review the relevant literature for prediction and planning datasets, simulation, and ML-based planning.
Prediction datasets. Table 1 shows a comparison between our dataset and relevant prediction datasets. Argoverse Motion Forecasting [6] was the first large-scale prediction dataset. With 320h of driving data, it was unprecedented in size and provides simple semantic maps with centerlines and driveable area annotations. However, the autolabeled trajectories in the dataset are of lower quality due to the state of the object detection field at the time and the insufficient amount of human-labeled training data (113 scenes).
The nuScenes prediction challenge [4] consists of 850 human-labeled scenes from the nuScenes dataset. While the annotations are high quality and sensor data is provided, the small scale limits the number of driving variations. The Lyft Level 5 Prediction Dataset [11] contains 1118h of data from a single route of 6.8 miles. It features detailed semantic maps, aerial maps, and dynamic traffic light status. While the scale is unprecedented, the autolabeled tracks are often noisy and geographic diversity is limited. The Waymo Open Motion Dataset [8] focuses specifically on the interactions between agents, but does so using open-loop evaluation. While the dataset size is smaller than existing datasets at 570h, the autolabeled tracks are of high quality [17]. They provide semantic maps and dynamic traffic light status.
These datasets focus on prediction, rather than planning. In this work we aim to overcome this limitation by using planning metrics and closed-loop evaluation. We are the first large-scale dataset to provide sensor data.
Sensor data gives richer information for AV planning and decision-making. Combined with planning metrics, it allows a more complete evaluation of an autonomous driving system.
Planning datasets. CommonRoad [1] provides a first-of-its-kind planning benchmark that is composed of different vehicle models, cost functions and scenarios (including goals and constraints). There are both pre-recorded and interactive scenarios. With 5700 scenarios in total, the scale of the dataset does not support training modern deep learning based methods. All scenarios lack sensor data, which limits their use for AV planning.
Simulation. Simulators have enabled breakthroughs in planning and reinforcement learning with their ability to simulate physics, agents, and environmental conditions in a closed-loop environment.
Simulators thus offer a controlled test platform in which researchers can try out and optimize driving algorithms safely, without real-world risk and cost. They can also generate large amounts of diverse data, which is essential for training and evaluating ML models.
AirSim [19] is a high-fidelity simulator for AVs, such as drones and cars. It includes a physics engine that can operate at a high frequency for real-time hardware-in-the-loop simulation. CARLA [7] supports the training and validation of autonomous urban driving systems. It allows for flexible specification of sensor suites and environmental conditions. In the CARLA Autonomous Driving Challenge the goal is to navigate a set of waypoints using different combinations of sensor data and HD maps. Alternatively, users can use scene abstraction to omit the perception task and focus on planning and control aspects. This challenge is conceptually similar to what we propose, but does not use real-world data and provides less detailed planning metrics.
Sim-to-real transfer is an active research area for diverse tasks such as localization, perception, prediction, planning and control. [21] show that the domain gap between simulated and real-world data remains an issue, by transferring a synthetically trained tracking model to the KITTI [9] dataset. To overcome the domain gap, they jointly train their model using real-world data for visible and simulation data for occluded objects. [3] learn how to drive by transferring a vision-based lane following driving policy from simulation to the real world without any real-world labels. [14] use reinforcement learning in simulation to obtain a driving system controlling a full-size real-world vehicle. They use mostly synthetic data, with labelled real-world data appearing only in the training of the segmentation network.
These studies suggest that, despite the domain gap, combining real-world and simulated data can achieve sim-to-real transfer to some extent.
However, all simulations have fundamental limits since they introduce systematic biases. More work is required to plausibly emulate real-world sensors, e.g. to generate photo-realistic camera images.
In other words, although simulation is very useful in AV research and development, it cannot fully replace real-world data and experience.
ML-based planning. An emerging research field is ML-based planning for AVs using real-world data. However, the field has yet to converge on a common input/output space, dataset, or metrics. A jointly learnable behavior and trajectory planner is proposed in [18]. An interpretable cost function is learned on top of models for perception, prediction and vehicle dynamics, and evaluated in open-loop on two unpublished datasets. An end-to-end interpretable neural motion planner [24] takes raw lidar point clouds and dynamic map data as inputs and predicts a cost map for planning. They evaluate in open-loop on an unpublished dataset, with a planning horizon of only 3s. ChauffeurNet [2] finds that standard behavior cloning is insufficient for handling complex driving scenarios, even when using as many as 30 million examples. They propose exposing the learner to synthesized data in the form of perturbations to the expert’s driving and augment the imitation loss with additional losses that penalize undesirable events and encourage progress. Their unpublished dataset contains 26 million examples which correspond to 60 days of continuous driving. The method is evaluated in a closed-loop and an open-loop setup, as well as in the real world. They also show that open-loop evaluation can be misleading compared to closed-loop. MP3 [5] proposes an end-to-end approach to mapless driving, where the input is raw lidar data and a high-level navigation goal. They evaluate on an unpublished dataset in open and closed-loop. Multi-modal methods have also been explored in recent works [16, 20, 13]. These approaches explore different strategies for fusing various modality representations in order to predict future waypoints or control commands. Neural planners were also used in [15, 10] to evaluate an object detector using the KL divergence of the planned trajectory and the observed route.

Existing works evaluate on different metrics which are inconsistent across the literature. TransFuser [16] evaluates its method on the number of infractions, the percentage of the route distance completed, and the route completion weighted by an infraction multiplier. Infractions include collisions with other agents, and running red lights. [20] evaluates its planner using off-road time, off-lane time and number of crashes, while [13, 22] report the success rate of reaching a given destination within a fixed time window. [13] also introduces another metric which measures the average percentage of distance travelled to the goal.
While ML-based planning has been studied in great detail, the lack of published datasets and a standard set of metrics that provide a common framework for closed-loop evaluation has limited the progress in this area. We aim to fill this gap by providing an ML-based planning dataset and metrics.
Researchers and developers can then evaluate and compare different planning algorithms under a unified standard.

3. Dataset

Overview. We plan to release 1500 hours of data from Las Vegas, Boston, Pittsburgh, and Singapore. Each city provides its unique driving challenges. For example, Las Vegas includes bustling casino pick-up and drop-off points (PUDOs) with complex interactions and busy intersections with up to 8 parallel driving lanes per direction, Boston routes include drivers who love to double park, Pittsburgh has its own custom precedence pattern for left turns at intersections, and Singapore features left-hand traffic. For each city we provide semantic maps and an API for efficient map queries. The dataset includes lidar point clouds, camera images, localization information and steering inputs. While we release autolabeled agent trajectories on the entire dataset, we make only a subset of the sensor data available due to the vast scale of the dataset (200+ TB).
Such a dataset offers rich real-world scenarios and challenges for AV perception, prediction, and planning algorithms.
Autolabeling. We use an offline perception system to label the large-scale dataset at high accuracy, without the real-time constraints imposed on the online perception system of an AV. We use PointPillars [12] with CenterPoint [23], a modified version of multi-view fusion (MVF++) [17], and non-causal tracking to achieve near-human labeling performance.
Scenarios. To enable scenario-based metrics, we automatically annotate intervals with tags for complex scenarios. These scenarios include:
• merges
• lane changes
• protected or unprotected left or right turns
• interaction with cyclists
• interaction with pedestrians at crosswalks or elsewhere
• interactions with close proximity or high acceleration
• double parked vehicles
• stop-controlled intersections
• driving in construction zones
By automatically identifying and tagging these complex traffic scenarios, the dataset supplies rich context for developing and evaluating AV systems. The tags make it possible to evaluate a planner in specific scenarios and provide data for tackling particular driving challenges.
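As a rough illustration of annotating intervals with tags, the sketch below merges consecutive flagged timesteps into tagged intervals. The paper does not describe its tagging pipeline, so the function name, the tag string, and the 3 m/s² acceleration threshold are all hypothetical.

```python
def tag_intervals(timestamps, flags, tag):
    """Merge consecutive flagged timesteps into (start, end, tag) intervals."""
    intervals, start, prev_t = [], None, None
    for t, flagged in zip(timestamps, flags):
        if flagged and start is None:
            start = t                                # an interval opens here
        elif not flagged and start is not None:
            intervals.append((start, prev_t, tag))   # interval closed at the previous step
            start = None
        prev_t = t
    if start is not None:                            # interval still open at the end of the log
        intervals.append((start, prev_t, tag))
    return intervals

# Hypothetical log: timestamps in seconds, longitudinal acceleration in m/s^2.
timestamps = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
accelerations = [0.5, 3.4, 3.9, 1.0, -4.2, -3.6]
HIGH_ACCEL = 3.0  # illustrative threshold, not from the paper

flags = [abs(a) > HIGH_ACCEL for a in accelerations]
intervals = tag_intervals(timestamps, flags, "high_acceleration")
print(intervals)
```

Each scenario type would need its own flagging rule (for example, map lookups for crosswalks); only the interval-merging step is shown here.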

4. Benchmarks

To further the state of the art in ML-based planning, we organize benchmark challenges with the tasks and metrics described below.

4.1. Overview

To evaluate a proposed method against the benchmark dataset, users submit ML-based planning code to our evaluation server. The code must follow a provided template. Contrary to most benchmarks, the code is containerized for portability in order to enable closed-loop evaluation on a secret test set. The planner operates either on the autolabeled trajectories or, for end-to-end open-loop approaches, directly on the raw sensor data. When queried for a particular timestep, the planner returns the planned position and heading of the ego vehicle. A provided controller will then drive a vehicle while closely tracking the planned trajectory. We use a predefined motion model to simulate the ego vehicle motion in order to approximate a real system. The final driven trajectory is then scored against the metrics defined in Sec. 4.3.
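The evaluation pipeline described above (planner, controller, motion model) can be sketched as follows. This is not the benchmark's actual template: the class and function names are hypothetical, the controller is a pure-pursuit-style rule, and the motion model is a kinematic bicycle model, both chosen here only as plausible stand-ins.

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float
    y: float
    heading: float  # radians
    speed: float    # m/s

class LaneCenterPlanner:
    """Hypothetical planner: plans poses on the lane center line y = 0."""

    def plan(self, state, horizon=10, dt=0.1):
        # One (x, y, heading) pose per future timestep, re-planned from the current state.
        return [(state.x + state.speed * dt * (k + 1), 0.0, 0.0) for k in range(horizon)]

def steering_command(state, target, wheelbase=3.0, max_steer=0.5):
    """Pure-pursuit-style controller (an assumption) steering toward a target pose."""
    tx, ty, _ = target
    lookahead = math.hypot(tx - state.x, ty - state.y)
    alpha = math.atan2(ty - state.y, tx - state.x) - state.heading
    steer = math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)
    return max(-max_steer, min(max_steer, steer))

def bicycle_step(state, steer, dt=0.1, wheelbase=3.0):
    """Kinematic bicycle model (an assumption) advancing the ego state by dt."""
    return EgoState(
        x=state.x + state.speed * math.cos(state.heading) * dt,
        y=state.y + state.speed * math.sin(state.heading) * dt,
        heading=state.heading + state.speed / wheelbase * math.tan(steer) * dt,
        speed=state.speed,
    )

def run_closed_loop(planner, state, steps=100):
    """Query the planner each timestep, track its plan, and record the driven states."""
    driven = [state]
    for _ in range(steps):
        plan = planner.plan(state)
        steer = steering_command(state, plan[-1])  # track the farthest planned pose
        state = bicycle_step(state, steer)
        driven.append(state)
    return driven

# Start 1 m off the lane center; the driven trajectory is what would be scored.
driven = run_closed_loop(LaneCenterPlanner(), EgoState(x=0.0, y=1.0, heading=0.0, speed=5.0))
final_state = driven[-1]
print(round(final_state.y, 4))
```

The point of the split is that the planner only proposes poses; the driven trajectory comes from the controller and motion model, so tracking error is part of what gets evaluated.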

4.2. Tasks

We present three different tasks for our dataset, in order of increasing difficulty.
Open-loop. In the first challenge, we task the planning system to mimic a human driver. For every timestep, the trajectory is scored based on predefined metrics. It is not used to control the vehicle. In this case, no interactions are considered.
Closed-loop. In the closed-loop setup the planner outputs a planned trajectory using the information available at each timestep, similar to the previous case. However, the proposed trajectory is used as a reference for a controller, and thus, the planning system is gradually corrected at each timestep with the new state of the vehicle. While the new state of the vehicle may not coincide with that of the recorded state, leading to different camera views or lidar point clouds, we will not perform any sensor data warping or novel view synthesis. In this set, we distinguish between two tasks. In the non-reactive closed-loop task we do not make any assumptions on other agents' behavior and simply use the observed agent trajectories. As shown in [11], the vast majority of interventions in closed-loop simulation are due to the non-reactive nature, e.g. vehicles naively colliding with the ego vehicle. In the reactive closed-loop task we provide a planning model for all other agents that are tracked like the ego vehicle.
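The difference between the non-reactive and reactive tasks can be illustrated with a one-dimensional toy simulation. The scenario, the numbers, and the follower's headway rule are assumptions made for this sketch; the paper does not specify the agent model it will provide.

```python
def simulate(reactive, steps=40, dt=0.25):
    """1-D lane with the ego ahead of one follower. Returns the minimum gap in metres."""
    ego_x, ego_v = 20.0, 10.0   # the logged human kept driving at 10 m/s
    fol_x, fol_v = 0.0, 10.0    # the logged follower also drove at 10 m/s
    min_gap = ego_x - fol_x
    for _ in range(steps):
        # The planner under test brakes at 2 m/s^2, deviating from the log.
        ego_v = max(0.0, ego_v - 2.0 * dt)
        ego_x += ego_v * dt
        if reactive:
            # Simplified headway rule (an assumption): never drive faster than the
            # speed that keeps a 2 m standstill gap plus 1.5 s of headway.
            gap = ego_x - fol_x
            fol_v = min(fol_v, max(0.0, (gap - 2.0) / 1.5))
        # Non-reactive: the follower replays its logged 10 m/s regardless of the ego.
        fol_x += fol_v * dt
        min_gap = min(min_gap, ego_x - fol_x)
    return min_gap

non_reactive_gap = simulate(reactive=False)
reactive_gap = simulate(reactive=True)
print(non_reactive_gap, reactive_gap)
```

With log replay the follower drives straight through the braking ego (the gap becomes negative), which is the kind of naive collision mentioned above. The reactive follower slows down and keeps a positive gap.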

4.3. Metrics

We split the metrics into two categories: common metrics, which are computed for every scenario, and scenario-based metrics, which are tailored to predefined scenarios.

Common metrics.

• Traffic rule violation is used to measure compliance with common traffic rules. We compute the rate of collisions with other agents, rate of off-road trajectories, the time gap to lead agents, time to collision and the relative velocity while passing an agent as a function of the passing distance.
• Human driving similarity is used to quantify maneuver satisfaction in comparison to a human, e.g. longitudinal velocity error, longitudinal stop position error and lateral position error. In addition, the resulting jerk/acceleration is compared to the human-level jerk/acceleration.
• Vehicle dynamics quantify rider comfort and feasibility of a trajectory. Rider comfort is measured by jerk, acceleration, steering rate and vehicle oscillation. Feasibility is measured by violation of predefined limits of the same criteria.
• Goal achievement measures the route progress towards a goal waypoint on the map using L2 distance.
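A few of the quantities listed above can be computed with simple finite differences and ratios. The sketch below is one possible reading of these metrics, not the benchmark's definition; the function names and the sample values are illustrative.

```python
import math

def finite_diff(values, dt):
    """First-order finite difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def comfort_metrics(speeds, dt):
    """Peak absolute acceleration (m/s^2) and jerk (m/s^3) from sampled speeds."""
    accel = finite_diff(speeds, dt)
    jerk = finite_diff(accel, dt)
    return max(abs(a) for a in accel), max(abs(j) for j in jerk)

def time_gap(gap_m, ego_speed):
    """Time for the ego to cover the gap to the lead agent at its current speed."""
    return math.inf if ego_speed <= 0 else gap_m / ego_speed

def time_to_collision(gap_m, ego_speed, lead_speed):
    """Time until the gap closes, assuming both speeds stay constant."""
    closing_speed = ego_speed - lead_speed
    return math.inf if closing_speed <= 0 else gap_m / closing_speed

def goal_distance(position, goal):
    """L2 distance from the current (x, y) position to the goal waypoint."""
    return math.dist(position, goal)

speeds = [10.0, 10.5, 11.5, 12.0]  # m/s, sampled every 0.5 s
peak_accel, peak_jerk = comfort_metrics(speeds, dt=0.5)
gap = time_gap(20.0, ego_speed=10.0)
ttc = time_to_collision(20.0, ego_speed=10.0, lead_speed=6.0)
remaining = goal_distance((3.0, 4.0), (0.0, 0.0))
print(peak_accel, peak_jerk, gap, ttc, remaining)
```

Feasibility, as described above, would then amount to checking these same quantities against predefined limits.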

Scenario-based metrics. Based on the scenario tags from Sec. 3, we use additional metrics for challenging maneuvers. For lane change, time to collision and time gap to lead/rear agent on the target lane are measured and scored. For pedestrian/cyclist interaction, we quantify the passing relative velocity while differentiating their location. Furthermore, we compare the agreement between decisions made by a planner and human for crosswalks and unprotected turns (right of way).
Community feedback. Note that the metrics shown here are an initial proposal and do not form an exhaustive list. We will work closely with the community to add novel scenarios and metrics to achieve consensus across the community. Likewise, for the main challenge metric we see multiple options, such as a weighted sum of metrics, a weighted sum of metric violations above a predefined threshold or a hierarchy of metrics. We invite the community to collaborate with us to define the metrics that will drive this field forward.
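The three candidate forms of the main challenge metric can be compared on a toy example. The metric names, weights, thresholds, and scores below are invented for illustration; the hierarchy is modelled here as a lexicographic comparison, which is one possible interpretation.

```python
def weighted_sum(scores, weights):
    """Option 1: weighted sum of per-metric scores (higher is better)."""
    return sum(weights[name] * scores[name] for name in weights)

def weighted_violations(values, thresholds, weights):
    """Option 2: weighted sum of the amount by which each value exceeds its threshold."""
    return sum(weights[name] * max(0.0, values[name] - thresholds[name]) for name in weights)

def hierarchy_key(scores, priority):
    """Option 3: tuple for lexicographic comparison, where earlier metrics dominate."""
    return tuple(scores[name] for name in priority)

# Hypothetical per-metric scores in [0, 1], higher is better.
planner_a = {"no_collision": 1.0, "comfort": 0.2, "progress": 0.5}
planner_b = {"no_collision": 0.0, "comfort": 1.0, "progress": 1.0}

equal_weights = {"no_collision": 1.0, "comfort": 1.0, "progress": 1.0}
priority = ["no_collision", "progress", "comfort"]

sum_a, sum_b = weighted_sum(planner_a, equal_weights), weighted_sum(planner_b, equal_weights)
key_a, key_b = hierarchy_key(planner_a, priority), hierarchy_key(planner_b, priority)

# Hypothetical raw values (jerk in m/s^3, acceleration in m/s^2) with limits.
penalty = weighted_violations(
    {"jerk": 6.0, "accel": 2.0},
    thresholds={"jerk": 4.0, "accel": 3.0},
    weights={"jerk": 0.5, "accel": 1.0},
)
print(sum_a, sum_b, key_a > key_b, penalty)
```

With equal weights the sum ranks the colliding planner B higher, while the hierarchy ranks the collision-free planner A first. This shows why the choice of aggregation matters.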

5. Conclusion

In this work we proposed the first ML-based planning benchmark for AVs. Contrary to existing forecasting benchmarks, we focus on goal-based planning, planning metrics and closed-loop evaluation. We hope that by providing a common benchmark, we will pave a path towards progress in ML-based planning, which is one of the final frontiers in autonomous driving.