
ADAPT: Action-aware Driving Caption Transformer


Abstract

End-to-end autonomous driving has great potential in the transportation industry. However, the lack of transparency and interpretability of the automatic decision-making process hinders its industrial adoption in practice. There have been some early attempts to use attention maps or cost volumes for better model explainability, but these are difficult for ordinary passengers to understand. To bridge the gap, we propose an end-to-end transformer-based architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action. ADAPT jointly trains both the driving caption task and the vehicular control prediction task through a shared video representation. Experiments on the BDD-X (Berkeley DeepDrive eXplanation) dataset demonstrate state-of-the-art performance of the ADAPT framework on both automatic metrics and human evaluation. To illustrate the feasibility of the proposed framework in real-world applications, we build a novel deployable system that takes raw car videos as input and outputs the action narrations and reasoning in real time. The code, models and data are available at https://github.com/jxbbb/ADAPT.

I. INTRODUCTION

The goal of an autonomous system is to gain precise perception of the environment, make safe real-time decisions, take reliable actions without human involvement, and provide a safe and comfortable ride experience for passengers. There are generally two paradigms for autopilot controller design: mediation-aware methods [1], [2] and end-to-end learning approaches [3]–[23]. Mediation-aware approaches rely on recognizing human-specified features such as vehicles, lane markings, etc., which require rigorous parameter tuning to achieve satisfactory performance. In contrast, end-to-end methods directly take raw data from sensors as input to generate planning routes or control signals.
One of the key challenges in deploying such autonomous control systems to real vehicles is that the intelligent decision-making policies in autonomous cars are often too complicated for common passengers to understand, for whom the safety and controllability of such vehicles are the top priority.
Some previous work has explored the interpretation of autonomous navigation [13], [14], [24]–[30]. Cost map, for example, is employed in [13] to interpret the actions of a self-driving system by visualizing the difficulty of traversing through different areas of the map. Visual attention is utilized in [24] to filter out non-salient image regions, and [31] constructs BEV (Bird’s eye view) to visualize the motion information of the vehicle. However, these interfaces can easily lead to misinterpretation if the user is unfamiliar with the system.
An ideal solution is to include natural language narrations to guide the user throughout the decision-making and action-taking process of the autonomous control module, which is comprehensible and user-friendly. Furthermore, an additional reasoning explanation for each control/action decision can help users understand the current state of the vehicle and the surrounding environment, as supporting evidence for the actions taken by the autonomous vehicle. For example, "[Action narration:] the car pulls over to the right side of the road, [Reasoning:] because the car is parking", as shown in Fig. 1. Explaining vehicle behaviors via natural language narrations and reasoning thus makes the whole autonomous system more transparent and easier to understand.
Fig. 1. Different ways of explaining the behavior of an autonomous vehicle, including attention maps [24], cost volumes [13], and natural language. Although attention maps and cost volumes are effective, language-based explanations are more friendly to ordinary passengers.
To this end, we propose ADAPT, the first action-aware transformer-based driving action captioning architecture that provides passengers with user-friendly natural language narrations and reasoning for autonomous driving vehicles. To eliminate the discrepancy between the captioning task and the vehicular control signal prediction task, we jointly train these two tasks with a shared video representation. This multi-task framework can be built upon various end-to-end autonomous systems by incorporating a text generation head.
We demonstrate the effectiveness of the ADAPT approach on a large-scale dataset that consists of control signals and videos along with action narration and reasoning. Based on ADAPT, we build a novel deployable system that takes raw vehicular navigation videos as input and generates the action narrations and reasoning explanations in real time.
Our contributions can be summarized as:
• We propose ADAPT, a new end-to-end transformer-based action narration and reasoning framework for self-driving vehicles.
• We propose a multi-task joint training framework that aligns both the driving action captioning task and the control signal prediction task.
• We develop a deployable pipeline for the application of ADAPT in both the simulator environment and the real world.

II. RELATED WORK

A. Video Captioning

The main goal of the video captioning task is to describe the objects and their relationships in a given video in natural language. Early works [32]–[35] generate sentences with specific syntactic structures by filling recognized elements into fixed templates, which are inflexible and lack richness. [36]–[45] exploit sequence learning approaches to generate natural sentences with flexible syntactic structures. Specifically, these methods employ a video encoder to extract frame features and a language decoder to learn visual-textual alignment for caption generation. To enrich captions with fine-grained objects and actions, [46]–[48] exploit object-level representations that capture detailed object-aware interaction features in videos. [49] further develops a novel dual-branch convolutional encoder to jointly learn the content and semantic information of videos. Moreover, [50] adapts the uni-modal transformer to video captioning and employs a sparse boundary-aware pooling to reduce the redundancy in video frames. The development of scene understanding [51]–[62] also contributes a lot to the captioning task. Most recently, [63] proposes an end-to-end transformer-based model, SWINBERT, which utilizes a sparse attention mask to lessen the redundant and irrelevant information in consecutive video frames.
While existing architectures achieve promising results for general video captioning, they cannot be directly applied to driving action representation: simply transferring video captioning to self-driving action representation would miss key information such as the speed of the vehicle, which is essential in an autonomous system. How to effectively use this multimodal information to generate sentences remains an open question, and it is the focus of our work.

B. End-to-End Autonomous Driving

Learning-based autonomous driving is an active research area [64], [65]. Learning-based driving methods such as affordances [3], [4] and reinforcement learning [5]–[7] have been employed with promising performance. Imitation methods [8]–[13] are also utilized to regress the control commands from human demonstrations. For example, [14]–[16] model the future behavior of driving agents such as vehicles, cyclists or pedestrians to predict the vehicular waypoints, while [17]–[23] predict vehicular control signals directly from the sensor input, which is similar to our control signal prediction sub-task.

C. Interpretability of Autonomous Driving

Interpretability, or the ability to provide a comprehensive explanation, plays a significant role in the social acceptance of artificial intelligence [66], [67], and autonomous driving is no exception. Most interpretable approaches for autonomous vehicles are vision-based [24]–[27], [31] or LiDAR-based [13], [14], [28]–[30]. [24] first utilizes the visualization of an attention map that filters out non-salient image regions to make autonomous vehicles reasonable and interpretable. Nevertheless, the attention map may easily include some less important areas, which can cause misunderstanding for passengers. [25]–[27], [31] construct BEV (bird's eye view) representations from a vehicle camera to visualize the motion information and environmental status of the vehicle. [13] takes LiDAR and HD maps as input to forecast the bounding boxes of driving agents and exploits a cost volume to explain the reason for the planner's decision. Furthermore, [14] constructs an online map from segmentation as well as the states of driving agents to avoid heavy dependence on HD maps.
Although vision-based or LiDAR-based approaches provide promising results, the lack of linguistic interpretation makes them too complicated for passengers such as the elderly to understand. [68] first explores the possibility of textual explanations for self-driving vehicles; it extracts video features offline from the control signal prediction task and conducts video captioning afterwards. Unfortunately, the discrepancy between these two tasks makes the offline-extracted features sub-optimal for the downstream captioning task; addressing this discrepancy is the focus of our work.

D. Multi-task Learning in Autonomous Driving

Our end-to-end framework adopts multi-task learning, where we train the model on a joint objective of text generation and control signal prediction. Multi-task learning helps extract more useful information by exploiting inductive biases between different tasks [69] and has shown promising prospects in autonomous driving. [70], [71] show that detection and tracking can be trained together. [72] further combines a detector and a trajectory predictor into a single model and gains promising results. This idea is extended by [73] to simultaneously predict the intention of actors. More recently, [13] further includes a cost-map-based control signal planner in the joint model. These works show that joint training of different tasks improves the performance of individual tasks thanks to better data utilization and shared features, which inspires our joint training strategy for the control signal prediction and text generation tasks.

III. METHOD

A. Overview

The ADAPT architecture is illustrated in Fig. 2, which addresses two tasks: Driving Caption Generation (DCG) and Control Signal Prediction (CSP). DCG takes a sequence of raw video frames as inputs, and outputs two natural language sentences: one describes the vehicle’s action (e.g., ”the car is accelerating”), and the other explains the reasoning for taking this action (e.g., ”because the traffic lights turn green”). CSP takes the same video frames as inputs, and outputs a sequence of control signals, such as speed, course or acceleration.
Fig. 2. Overview of the ADAPT framework. (a) The input is a video from the vehicle's front-view camera, and the outputs are the predicted vehicle control signals together with the narration and reasoning for the current action. We first densely and uniformly sample T frames from the video, feed them into a learnable Video Swin Transformer encoder, and tokenize them into video tokens. Different prediction heads then produce the final motion outputs and text outputs. (b) and (c) show the two prediction heads, respectively.
In this framework, the sampled video frames capture the key visual information in the vehicle's front view, which is essential for understanding the vehicle's dynamics and its surroundings. By feeding these frames into the Video Swin Transformer, the model converts raw pixel information into higher-level feature representations, i.e., video tokens. These video tokens are then used for the different prediction tasks, namely control signal prediction and text generation.
The control signal prediction head focuses on extracting information directly related to the vehicle's motion, such as speed and course, to generate control signals. The text generation head uses the video tokens (and, where available, other contextual information) to generate natural language sentences describing the vehicle's behavior, including the action narration and the reasoning explanation.
This multi-task setup lets the model share representations across the output tasks while still allowing each task to focus on the information most useful to it. In this way, the ADAPT framework offers a comprehensive account of the autonomous vehicle's behavior, predicting control signals and generating descriptive text, thereby improving the transparency and interpretability of the system.

Generally, DCG and CSP tasks share the same Video Encoder, while employing different prediction heads to produce the final prediction results. For DCG task, we employ a vision-language transformer encoder to generate two natural language sentences via sequence-to-sequence generation. For CSP task, we use a motion transformer encoder to predict the control signal sequence.

B. Video Encoder

Following SWINBERT [63], we employ the Video Swin Transformer (video swin) [74] as the visual encoder to encode video frames into video feature tokens. Given a car video captured from the first-person view, we first perform uniform sampling to get T frames of size H × W × 3. These frames are passed as inputs to video swin, resulting in a video feature F_V of size T/2 × H/32 × W/32 × 8C, where C is the channel dimension defined in video swin. The video features are then fed into different prediction heads for the individual tasks.
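To make the tensor shapes concrete, here is a minimal PyTorch-style sketch of the sampling and encoding step, assuming a `video_swin` backbone whose output follows the T/2 × H/32 × W/32 × 8C layout described above; the wrapper and its interface are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn


def uniform_sample_frames(video: torch.Tensor, t: int = 32) -> torch.Tensor:
    """Uniformly sample t frames from a decoded video of shape (N, 3, H, W)."""
    n = video.shape[0]
    idx = torch.linspace(0, n - 1, steps=t).long()
    return video[idx]                                # (t, 3, H, W)


class VideoEncoder(nn.Module):
    """Encodes sampled frames with a Video Swin backbone (hypothetical wrapper)."""

    def __init__(self, video_swin: nn.Module):
        super().__init__()
        self.video_swin = video_swin                 # pre-trained on Kinetics-600, not frozen

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) with H = W = 224 after resize/crop
        feats = self.video_swin(frames)              # assumed output: (B, T/2, H/32, W/32, 8C)
        return feats                                 # e.g. T=32, 224x224, C=96 -> (B, 16, 7, 7, 768)
```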

C. Prediction Heads

Text Generation Head The purpose of the text generation head is to generate two sentences that describe the action of the vehicle and the reason behind it. As mentioned in Sec. III-B, the video frames are encoded into video features F_V of size T/2 × H/32 × W/32 × 8C. We then tokenize the video features along the channel dimension, resulting in (T/2) × (H/32) × (W/32) tokens, each of dimension 8C. As for the text inputs (action narration and reasoning), we first tokenize each sentence and pad it to a fixed length. We then concatenate the two sentences and embed them with an embedding layer. To identify the difference between action narration and reasoning, we exploit a segment embedding method (widely used in BERT [75]) to distinguish them. We also use a learnable MLP that transforms the dimension of the video tokens to ensure dimensional consistency between video tokens and text tokens. Finally, the text tokens and video tokens are fed into the vision-language transformer encoder, which generates a new sequence that includes both the action narration and the reasoning.
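The input preparation for the text generation head can be sketched as follows; the hidden size, sentence length, and module names are illustrative assumptions, and position embeddings and attention masks are omitted for brevity.

```python
import torch
import torch.nn as nn


class CaptionInputBuilder(nn.Module):
    """Builds the joint token sequence fed to the vision-language transformer:
    [narration tokens | reasoning tokens | video tokens]."""

    def __init__(self, vocab_size: int, hidden: int = 768, video_feat_dim: int = 768,
                 max_len: int = 15):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.seg_emb = nn.Embedding(2, hidden)         # 0 = narration, 1 = reasoning (BERT-style)
        self.video_proj = nn.Sequential(               # learnable MLP matching the text-token dim
            nn.Linear(video_feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.max_len = max_len

    def forward(self, narration_ids, reasoning_ids, video_feats):
        # narration_ids / reasoning_ids: (B, max_len) padded WordPiece ids
        # video_feats: (B, T/2, H/32, W/32, 8C) from the video encoder
        text_ids = torch.cat([narration_ids, reasoning_ids], dim=1)      # (B, 2*max_len)
        seg_ids = torch.cat([torch.zeros_like(narration_ids),
                             torch.ones_like(reasoning_ids)], dim=1)
        text_tokens = self.word_emb(text_ids) + self.seg_emb(seg_ids)    # (B, 2*max_len, hidden)
        video_tokens = self.video_proj(video_feats.flatten(1, 3))        # (B, N_video, hidden)
        return torch.cat([text_tokens, video_tokens], dim=1)             # joint input sequence
```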
Control Signal Prediction Head The goal of the CSP head is to predict the control signals (e.g., acceleration) of the vehicle based on video frames. Given the video features of T frames, along with the corresponding control signal recordings S = {s_1, s_2, ..., s_T}, the output of the CSP head is a sequence of control signals Ŝ = {ŝ_2, ..., ŝ_T}. Each control signal s_i or ŝ_i is an n-tuple, where n is the number of sensor types we exploit. We first tokenize the video features, then utilize another transformer (the motion transformer) to generate the prediction of these control signals. The loss function L_CSP is defined as the mean squared error between S and Ŝ:
控制信号预测头CSP头)的目标是基于视频帧预测车辆的控制信号(例如加速度)。给定T帧的视频特征,连同相应的控制信号记录 S = s 1 , s 2 , . . . , s T S = {s_1, s_2, ..., s_T} S=s1,s2,...,sT,CSP头的输出是一系列控制信号的预测 S ^ = s ^ 2 , . . . , s ^ T \hat S = {\hat s_2, ..., \hat s_T} S^=s^2,...,s^T。每个控制信号 s i 或 s ^ i s_i或\hat s_i sis^i 是一个n元组,其中n指的是我们使用的传感器类型数量。我们首先对视频特征进行标记化,然后利用另一个变换器(运动变换器)来生成这些控制信号的预测。损失函数LCSP定义为S和S^的均方误差:
L_CSP = (1 / (T − 1)) Σ_{i=2}^{T} ||s_i − ŝ_i||²
In other words, the CSP head is trained by minimizing the difference between the predicted control signals Ŝ and the recorded control signals S.
Note that we do not predict the control signal corresponding to the first frame, since the dynamic information of the first frame is limited, while the other signals can be easily inferred from previous frames.
This means that control signal prediction starts from the second frame of the video sequence, using the available preceding frames to predict the subsequent control signals rather than starting from the very first frame. This design takes into account the continuity and inferability of the dynamic information in the video sequence, improving both the efficiency and the accuracy of the prediction.
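Below is a minimal sketch of a CSP head and its loss under the description above; the motion transformer is treated as a generic encoder over video tokens, and the way tokens are mapped to per-frame outputs is a simplifying assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class ControlSignalHead(nn.Module):
    """Predicts control signals (e.g., speed and course) from video tokens."""

    def __init__(self, motion_transformer: nn.Module, hidden: int = 768,
                 n_signals: int = 2, n_frames: int = 32):
        super().__init__()
        self.motion_transformer = motion_transformer   # assumed encoder over video tokens
        self.regressor = nn.Linear(hidden, n_signals)  # one n-tuple per predicted frame
        self.n_out = n_frames - 1                      # no prediction for the first frame

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, hidden)
        h = self.motion_transformer(video_tokens)      # (B, N_video, hidden)
        h = h[:, : self.n_out]                         # simplification: one token per predicted frame
        return self.regressor(h)                       # (B, T-1, n_signals) = ŝ_2..ŝ_T


def csp_loss(pred: torch.Tensor, signals: torch.Tensor) -> torch.Tensor:
    """Mean squared error between predictions and ground truth, skipping frame 1."""
    # signals: (B, T, n_signals) recorded control signals s_1..s_T
    return nn.functional.mse_loss(pred, signals[:, 1:])
```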

D. Joint Training

In our framework, we assume that CSP and DCG tasks are aligned on the semantic level of the video representation. Intuitively, action narration and the control signal data are different expression forms of the action of self-driving vehicles, while reasoning explanation concentrates on the elements of the environment that influence the action of the vehicles. We believe that jointly training these tasks in a single network can improve performance by leveraging the inductive biases between different tasks.
During training, CSP and DCG are performed jointly. We simply add L_CSP and L_DCG to get the final loss function:
L = L_CSP + L_DCG
Despite the joint training of both tasks, inference on each task can be carried out independently. For the DCG task, ADAPT takes a video sequence as input, and outputs the driving caption with two segments. Text generation is performed in an auto-regressive manner. Specifically, our model starts with a ”[CLS]” token and generates one word token at a time, consuming previously generated tokens as the inputs of the vision-language transformer encoder. Generation continues until the model outputs the ending token ”[SEP]” or reaches the maximum length threshold of a single sentence. After padding the first sentence to the maximum length, we concatenate another ”[CLS]” to the inputs and repeat the aforementioned process.
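The two-segment decoding described above can be sketched as a greedy loop; `model.step` is a hypothetical wrapper that returns next-token logits given the video tokens and the text generated so far, and the special-token ids are illustrative.

```python
import torch

CLS, SEP, PAD = 101, 102, 0        # BERT-style special token ids (illustrative)
MAX_SENT_LEN = 15


@torch.no_grad()
def generate_caption(model, video_tokens):
    """Greedy, auto-regressive generation of narration followed by reasoning."""
    sentences = []
    prefix = []                                    # all tokens generated so far
    for _ in range(2):                             # segment 0: narration, segment 1: reasoning
        sent = [CLS]
        while len(sent) < MAX_SENT_LEN:
            logits = model.step(video_tokens, torch.tensor([prefix + sent]))
            next_id = int(logits[0, -1].argmax())  # greedy choice of the next word piece
            sent.append(next_id)
            if next_id == SEP:
                break
        sentences.append(sent)
        # pad the finished sentence to the maximum length before starting the next one
        prefix += sent + [PAD] * (MAX_SENT_LEN - len(sent))
    return sentences                               # [narration ids, reasoning ids]
```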

IV. EXPERIMENT

In this section, we evaluate ADAPT over metrics of the standard captioning task, including BLEU4 [76], METEOR [77], ROUGE-L [78] and CIDEr [79] (abbreviated as B4, M, R and C in later tables). As quantitative evaluation of captioning is still an open question, we also provide detailed human evaluation results for the subjective correctness of the generated text. Ablation studies further demonstrate the effectiveness of the proposed joint-training framework.

  • BLEU4 (B4): a metric for machine translation and text generation that works by comparing the n-gram overlap between the generated text and the reference text.
  • METEOR (M): another text-generation quality metric that also accounts for synonyms and sentence structure, giving a more comprehensive assessment of the generated text.
  • ROUGE-L (R): focuses on the longest common subsequence between the generated text and the reference text, commonly used for automatic summarization and machine translation.
  • CIDEr (C): a metric designed specifically for captioning tasks that evaluates generated captions by weighting n-grams according to their consensus across human annotations.

These metrics allow a quantitative assessment of the captions generated by ADAPT, while human evaluation provides insight into the subjective quality of the model outputs in practical use. The ablation studies help identify which components of the model contribute most to the performance gains.
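For reference, these metrics can be computed with the widely used pycocoevalcap package; this is an assumption about tooling rather than the authors' evaluation code, and the inputs are dictionaries mapping a sample id to lists of reference and hypothesis strings.

```python
# pip install pycocoevalcap  (METEOR additionally needs a Java runtime)
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider


def caption_scores(gts: dict, res: dict) -> dict:
    """gts: {id: [reference sentences]}, res: {id: [one generated sentence]}."""
    out = {}
    bleu, _ = Bleu(4).compute_score(gts, res)
    out["B4"] = bleu[3]                     # BLEU-4 is the last of the four BLEU-n scores
    out["M"], _ = Meteor().compute_score(gts, res)
    out["R"], _ = Rouge().compute_score(gts, res)
    out["C"], _ = Cider().compute_score(gts, res)
    return out


# Example with illustrative sentences
gts = {"0": ["the car pulls over to the right side of the road"]}
res = {"0": ["the car pulls over to the right"]}
print(caption_scores(gts, res))
```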

A. Dataset

BDD-X [68] is a driving-domain caption dataset, consisting of nearly 7000 videos paired with control signals. The videos and control signals are collected from BDD100K dataset [80]. Each video has a duration of 40 seconds on average, with 1280×720 resolution and 30 FPS. Each video contains 1 to 5 vehicle behaviors, such as accelerating, turning right or merging lanes. All these behaviors are accompanied by text annotation, including action narration (e.g., ”the car stops”) and reasoning (e.g., ”because the traffic light is red”). There are around 29000 behavior-annotation pairs in total. To the best of our knowledge, BDD-X is the only driving-domain caption dataset accompanied by car videos and control signals.
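To make the data format concrete, a behavior-annotation pair can be represented roughly as follows; the field names and types are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BDDXSample:
    """One behavior-annotation pair from a BDD-X-style driving caption dataset."""
    video_path: str                               # front-view clip covering one vehicle behavior
    start_time: float                             # behavior start within the ~40 s source video (s)
    end_time: float                               # behavior end (s)
    control_signals: List[Tuple[float, float]]    # per-frame (speed, course) recordings
    narration: str                                # e.g., "the car stops"
    reasoning: str                                # e.g., "because the traffic light is red"


sample = BDDXSample("videos/0001.mov", 12.0, 18.5,
                    [(8.2, 0.0), (5.1, 0.0), (0.0, 0.0)],
                    "the car stops", "because the traffic light is red")
```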

B. Implementation Details

The Video Swin Transformer is pre-trained on Kinetics-600 [81], while the vision-language transformer and the motion transformer are randomly initialized. Note that in our implementation we do not freeze the parameters of video swin, so ADAPT is trained in a completely end-to-end manner. The input video frames are resized and cropped to a spatial size of 224. For narration and reasoning, we use WordPiece embeddings [75] instead of whole words (e.g., "stops" is split into "stop" and "##s"), and the maximum length of each sentence is 15 tokens. During training, we randomly mask 50% of the text tokens for masked language modeling. A masked token has an 80% chance of being replaced by the "[MASK]" token, a 10% chance of being replaced by a random word, and a 10% chance of remaining unchanged. We employ the AdamW optimizer and use a learning-rate warm-up during the first 10% of training steps, followed by linear decay. The whole training process of 40 epochs takes about 13 hours on 4 NVIDIA V100 GPUs with a batch size of 4 per GPU.
This end-to-end training allows the model to learn representations at all levels from the data, while the pre-trained Video Swin Transformer provides a strong starting point that is fine-tuned on a related but different task. WordPiece embeddings help the model handle sub-word units, which is especially useful for length-limited sentences. Masked language modeling (MLM) is an effective training strategy that teaches the model to predict missing words in a sentence, improving its language understanding. AdamW is a variant of Adam with decoupled weight decay that is widely used for training deep models, as it usually offers more stable performance and faster convergence.
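A minimal PyTorch sketch of the masking scheme and learning-rate schedule described above; the mask ratio, the 80/10/10 replacement rule, and the 10% warm-up follow the text, while the token ids, vocabulary size, and learning rate are placeholder assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MASK_ID, VOCAB_SIZE = 103, 30522          # BERT-style WordPiece values (illustrative)


def mask_tokens(ids: torch.Tensor, mask_prob: float = 0.5):
    """Masked language modeling: mask 50% of text tokens; of those,
    80% become [MASK], 10% a random word, 10% stay unchanged."""
    labels = ids.clone()
    masked = torch.rand_like(ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100                               # ignore unmasked positions in the loss
    roll = torch.rand_like(ids, dtype=torch.float)
    ids = torch.where(masked & (roll < 0.8), torch.full_like(ids, MASK_ID), ids)
    ids = torch.where(masked & (roll >= 0.8) & (roll < 0.9),
                      torch.randint_like(ids, VOCAB_SIZE), ids)
    return ids, labels


def warmup_linear_decay(optimizer, total_steps: int, warmup_frac: float = 0.1):
    """Linear warm-up over the first 10% of steps, then linear decay to zero."""
    warmup = int(total_steps * warmup_frac)

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))

    return LambdaLR(optimizer, lr_lambda)


# optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.05)   # lr is an assumption
# scheduler = warmup_linear_decay(optimizer, total_steps=num_epochs * steps_per_epoch)
```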

C. Main Results

We compare ADAPT with state-of-the-art methods on BDD-X dataset. Table I shows the comparison results on standard captioning metrics. We observe that ADAPT achieves significant performance gain over existing methods. Specifically, ADAPT outperforms prior state-of-the-art work [68] by 31.7 for action narration and 33.1 for reasoning on CIDEr metric.
[Table I. Comparison with state-of-the-art methods on the BDD-X dataset over standard captioning metrics (B4, M, R, C).]

In addition to automatic evaluation measures, we also conduct human evaluation to measure the subjective correctness of output narration and reasoning. The whole evaluation process is divided into three sections: (1) narration, (2) reasoning, and (3) full sentence. During the first section, a human evaluator judges whether the predicted narrations conform to the vehicle’s action. In the second section, we display both ground-truth narration and predicted reasoning, and require human evaluators to judge whether the reasoning is correct. Then in the last section, both predicted narrations and predicted reasoning are displayed. Table II shows that ADAPT outperforms previous work in reasoning accuracy while maintaining high accuracy on narration evaluation, demonstrating the effectiveness of ADAPT.
[Table II. Human evaluation of narration, reasoning, and full-sentence correctness.]

D. Ablation Study

We conduct a comprehensive ablation study to analyze various aspects of ADAPT design.
Effect of Action-aware Joint Training To investigate the effect of action-awareness in joint training on ADAPT, we train a single captioning model by removing the CSP (control signal prediction) head of ADAPT, referred to as ”Single”. As shown in Table III, ADAPT outperforms single training with an improvement of 15.9 for narration and 7.2 for reasoning on CIDEr metric. This suggests that cues from the other task help regularize the shared video representation and improve the performance of the text generation task.
[Table III. Effect of action-aware joint training ("Single", "Single+", and ADAPT).]
Additionally, we can see from Fig. 2(a) that the caption and control signal data are employed in two streams in ADAPT. An interesting question is: can we simply pass the control signals to the multi-modal transformer to get the final caption prediction? We therefore create an architecture that takes video tokens, control signal tokens (generated by a learnable embedding layer) and masked text tokens as input and generates predictions of the masked tokens, referred to as "Single+". Results are shown in the second row of Table III. We can see that the proposed ADAPT still achieves the best results, especially for the reasoning segment, which demonstrates the superiority of multi-task learning over using both videos and control signals as inputs, even though the latter is an intuitive setting.
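The "Single+" baseline can be read as adding a third input stream: a learnable embedding layer maps each recorded control-signal n-tuple to a token, which is concatenated with the video and text tokens. The module below is an illustrative sketch of that setting, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ControlSignalEmbedding(nn.Module):
    """Maps recorded control signals (e.g., speed, course) to input tokens
    for the multi-modal transformer, as in the 'Single+' baseline."""

    def __init__(self, n_signals: int = 2, hidden: int = 768):
        super().__init__()
        self.embed = nn.Linear(n_signals, hidden)    # learnable embedding layer

    def forward(self, signals: torch.Tensor) -> torch.Tensor:
        # signals: (B, T, n_signals) -> (B, T, hidden) control signal tokens
        return self.embed(signals)


# joint_input = torch.cat([text_tokens, video_tokens, ControlSignalEmbedding()(signals)], dim=1)
```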

Impact of Different Control Signal Types In our implementation, we leverage control signals (e.g., course) as supervision for the CSP task. In this analysis, we investigate the impact of different supervision signal types on ADAPT. The base signals in our experiments are speed and course. We first conduct experiments by removing one of them; the results are shown in the first two rows of Table IV. In the third row, both speed and course are utilized, which is the same as in the previous experiments. We observe that the removal of either signal leads to a decrease in performance. For example, the CIDEr metric decreases by 29.3 for narration and by 14.0 for reasoning without the speed inputs. This is understandable because being aware of speed and course helps the network learn representations that are informative for narration and reasoning, and the lack of either can bias the video representations.
[Table IV. Impact of different control signal types used as supervision.]

Interaction between Narration and Reasoning Compared with the general caption task, the driving caption task generates two sentences: action narration and reasoning. In this section, we explore how these two segments interact with each other by controlling the attention mask or the order of two sentences.
In the ADAPT framework, the action narration typically describes the action the vehicle performs (e.g., "the car is accelerating"), while the reasoning explains why the action is taken (e.g., "because the light turns green"). Although the two sentences serve different functions, they are semantically related, and a complete driving caption requires generating both.
To study how narration and reasoning interact, the following approaches can be taken:

  1. Vary the attention mask: use different attention-mask strategies during text generation, e.g., allow or disallow the narration tokens to influence the generation of the reasoning tokens, or vice versa.
  2. Vary the generation order: generate the narration before the reasoning or the reasoning before the narration, and observe which order yields more fluent and logically consistent captions.
  3. Cross-attention mechanisms: introduce cross attention in the text-generation transformer so that the narration and reasoning segments can provide information to each other during generation.
  4. Independent vs. joint training: train the narration and reasoning generators separately as well as jointly, and compare the effect of the different training strategies on the final performance.
  5. Human evaluation: besides automatic metrics, use human evaluators to assess the naturalness, accuracy and consistency of the narration and reasoning segments.

Specifically, as shown in the right part of Fig. 2(c), we use a causal self-attention mask for each sentence, where a word token can only attend to the already existing output tokens, and employ sparse attention [63] for video tokens. The reasoning segment has full attention to the narration segment, referred to as cross attention, which defines the dependence of reasoning on narration. In this section, we first conduct experiments without cross attention or with swapped cross attention (by swapping the order of narration and reasoning). Results are reported in Table V. Compared with the default setting (denoted as "Ours"), the results without cross attention are lower for both sentences, which indicates that conditioning the reasoning segment on the narration segment is beneficial for training. The performance with swapped cross attention also decreases, especially for the narration part, which further demonstrates this dependence of reasoning on narration, rather than the other way around.
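The attention pattern can be sketched as a boolean mask over the joint sequence [narration | reasoning | video tokens]: causal within each sentence, full attention from the reasoning segment to the narration segment, and attention from text to video tokens (shown dense here; [63] learns a sparse mask). The layout below is a simplified reading of the actual implementation.

```python
import torch


def build_attention_mask(n_sent: int = 15, n_video: int = 784) -> torch.Tensor:
    """Returns an (L, L) boolean mask; True means 'may attend to'.
    Sequence layout: [narration (n_sent) | reasoning (n_sent) | video (n_video)]."""
    L = 2 * n_sent + n_video
    mask = torch.zeros(L, L, dtype=torch.bool)
    causal = torch.tril(torch.ones(n_sent, n_sent, dtype=torch.bool))

    nar = slice(0, n_sent)
    rea = slice(n_sent, 2 * n_sent)
    vid = slice(2 * n_sent, L)

    mask[nar, nar] = causal                 # causal self-attention within the narration
    mask[rea, rea] = causal                 # causal self-attention within the reasoning
    mask[rea, nar] = True                   # cross attention: reasoning attends to narration
    mask[:2 * n_sent, vid] = True           # text attends to video tokens (dense here;
                                            # a learnable sparse mask is used in [63])
    mask[vid, vid] = True                   # video tokens attend to each other
    return mask
```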
Additionally, we conduct experiments with only one sentence, referred to as ”Narration only” and ”Reasoning only”. Table V shows that training with both sentences yields improvement on the performance, especially for the reasoning segment, indicating that the interaction between narration and reasoning promotes each component of the full caption task.
This experimental setup helps us understand how the narration and reasoning segments influence and reinforce each other in the captioning task. When the model is trained to generate only one of the two segments, it cannot fully exploit the complementary information between them; when it considers both, it captures the semantic content of the video more completely and generates more accurate, context-aware captions.
The "Narration only" setting highlights the model's ability to describe the action without additional reasoning information, while the "Reasoning only" setting highlights its ability to explain why the action is taken. Their combination shows that the narration provides contextual grounding for the reasoning, and the reasoning adds depth and explanatory power to the narration. This mutual reinforcement is a key factor in improving caption quality and helps produce richer captions that better match human understanding.
[Table V. Ablation on the interaction between narration and reasoning (cross attention, swapped order, and single-sentence training).]

Impact of Different Sampling Rates In previous experiments, we uniformly sample T = 32 frames from a given video, along with the control signal data of the same timestamps. In this study, we investigate the impact of the sampling rate by varying the number of sampled frames. Specifically, we uniformly sample T = 2, 4, 8, 16, 32 frames from a variable-length video, as shown in Table VI. The performance of ADAPT improves steadily as the number of sampled frames increases, since more frames lead to less missing visual content. This suggests that caption results can be enhanced by densely sampled frames and control signals. The training time costs are also provided in Table VI. We hope this ablation provides robotics practitioners with insights into the accuracy-efficiency trade-off of driving captioning.
[Table VI. Impact of different frame sampling rates and the corresponding training cost.]

E. Analysis on Control Signal Prediction

Although the main goal of the driving caption task is to generate sentences, we also investigate the performance of the control signal prediction task. We employ the root mean squared error (RMSE) and a tolerant accuracy (Aσ) to measure the final performance. Tolerant accuracy means that we first use two thresholds to determine the range of the control signal deviation and truncate the prediction to that range. For example, we define the truncated value of the predicted course ĉ as:
cσ = min(max(ĉ, c − σ), c + σ)
where c is the ground-truth course and σ is the tolerance threshold. Aσ for course then represents the accuracy of cσ, recorded as a percentage, and Aσ for speed is defined similarly. Results are provided in Table VII. We observe that our joint training framework can further improve the performance of control signal prediction, indicating the benefit of joint training.
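Under this reading of the truncation rule (clipping the prediction to [c − σ, c + σ]), the two metrics can be computed as below; treat this as one plausible formulation rather than the paper's exact evaluation code.

```python
import numpy as np


def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - gt) ** 2)))


def tolerant_accuracy(pred: np.ndarray, gt: np.ndarray, sigma: float) -> float:
    """A_sigma: the percentage of predictions left unchanged by truncation to
    [gt - sigma, gt + sigma], i.e., predictions within the tolerance band."""
    truncated = np.clip(pred, gt - sigma, gt + sigma)
    return float(np.mean(truncated == pred) * 100.0)


# Example with course in degrees and a 5-degree tolerance (values are illustrative)
course_gt = np.array([0.0, 10.0, -3.0])
course_pred = np.array([1.5, 18.0, -2.0])
print(rmse(course_pred, course_gt), tolerant_accuracy(course_pred, course_gt, sigma=5.0))
```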
[Table VII. Control signal prediction results (RMSE and tolerant accuracy Aσ).]

F. Deployment in Autonomous Systems

We further develop a pipeline for the deployment of ADAPT in both a simulator environment (e.g., CARLA [82]) and the real world. The system takes raw vehicular videos as input and generates action narrations and reasoning explanations in real time. Specifically, we first record the frames captured by the front-view camera. The frames of the last several seconds are then passed as input to ADAPT to generate the action narration and reasoning for the current step. Moreover, we further utilize text-to-speech technology to convert the generated sentences into spoken narration, making the system more convenient and more interactive for common passengers (and especially helpful for vision-impaired passengers).

V. CONCLUSION

Language-based interpretability is essential for the social acceptance of self-driving vehicles. We present ADAPT (Action-aware Driving cAPtion Transformer), a new end-to-end transformer-based framework for generating action narration and reasoning for self-driving vehicles. ADAPT utilizes multi-task joint training to reduce the discrepancy between the driving action captioning task and the control signal prediction task. Experiments on the BDD-X dataset over standard captioning metrics as well as human evaluation demonstrate the effectiveness of ADAPT over state-of-the-art methods. We further develop a deployable pipeline for the application of ADAPT in both the simulator environment and the real world.
