
Interaction-Aware Trajectory Prediction and Planning for Autonomous Vehicles in Forced Merge Scenarios


Abstract—Merging is, in general, a challenging task for both human drivers and autonomous vehicles, especially in dense traffic, because the merging vehicle typically needs to interact with other vehicles to identify or create a gap that it can safely merge into. In this paper, we consider the problem of autonomous vehicle control for forced merge scenarios. We propose a novel game-theoretic controller, called the Leader-Follower Game Controller (LFGC), in which the interactions between the autonomous ego vehicle and other vehicles with a priori uncertain driving intentions are modeled as a partially observable leader-follower game. The LFGC estimates the other vehicles’ intentions online based on observed trajectories, and then predicts their future trajectories and plans the ego vehicle’s own trajectory using Model Predictive Control (MPC) to simultaneously achieve probabilistically guaranteed safety and merging objectives. To verify the performance of the LFGC, we test it in simulations and with the NGSIM data, where the LFGC demonstrates a high merge success rate of 97.5%.

I. INTRODUCTION

Advances in autonomous vehicle technologies are projected to reduce vehicle crashes and fatalities, improve mobility, especially for elderly and disabled people, improve fuel economy and emission control, and promote more efficient land use [1], [2], [3]. Despite these benefits, there are still many challenges that need to be addressed to deliver a highly (level 4 or level 5) autonomous vehicle [4]. One challenging scenario for both human drivers and autonomous vehicles is highway forced merge, where the merging vehicle needs to choose a proper gap in the highway traffic and potentially force the upstream traffic to slow down so that it can safely merge into that gap. Forced merge typically occurs in mandatory merge scenarios where the current lane is ending, such as at highway on-ramps. When the traffic is dense, interactions and/or cooperation between the merging vehicle and vehicles driving in the target lane are often needed. In particular, a vehicle in the target lane may choose to ignore the merging vehicle (i.e., proceed), in which case the merging vehicle can only merge behind it. Alternatively, the vehicle in the target lane may choose to yield to the merging vehicle (i.e., let the merging vehicle merge in front of it). In order to successfully merge into busy traffic, an autonomous vehicle controller needs to respond appropriately to other vehicles’ intentions to proceed or yield. An overly conservative controller may yield to all other vehicles (including those that intend to yield to the autonomous ego vehicle) and eventually fail to merge, while an overly aggressive controller may come into conflict with vehicles that intend to proceed and lead to crashes. Meanwhile, the decision whether to proceed or to yield to another vehicle depends not only on the traffic situation (e.g., the relative position and velocity between the two vehicles) but also on the driver’s general driving style, personality, mood, etc. For instance, in a similar situation, an aggressive driver may be inclined to proceed while a cautious/conservative driver may tend to yield. This poses a significant challenge to autonomous vehicle planning and control.
There exists an extensive literature on modeling human driver interactions and autonomous vehicle decision-making during lane change or merging. To handle interaction uncertainties (e.g., due to varied cooperation intentions of other vehicles), the Partially Observable Markov Decision Process (POMDP) framework has been exploited, where the uncertainties are modeled as latent variables and estimated online based on observed trajectories [5], [6], [7], [8], [9], [10]. However, solving a POMDP problem with a large state and/or action space is computationally very demanding [11]. Consequently, conventional POMDP-based approaches typically only consider the interaction of the ego vehicle with one interacting vehicle at a time to minimize the state space dimension. However, in reality a merging scenario can involve simultaneous interactions with multiple vehicles.
Reinforcement Learning (RL) is another popular approach to developing control policies for lane change or merge scenarios [12], [13]. An RL-based policy can account for the vehicle interactions in certain scenarios through training in an environment capable of representing such interactions [14], [15]. In order to obtain, through RL, driving policies that behave like human drivers, several researchers chose to use inverse RL to estimate humans’ reward functions for driving [16], [17], [18], [19]. To be able to model different human driver styles and/or interaction intentions, [20] incorporates cooperativeness into the intelligent driver model and [21] formulates different reward functions for different drivers and performs RL based on these models. Although RL-based approaches are appealing in terms of their potential to handle complex traffic scenarios with multi-vehicle interactions, potential drawbacks that hinder their practical application include their lack of interpretability and of explicit safety guarantees, because safety is typically only promoted through certain terms in the reward function rather than enforced through hard constraints.
To achieve more interpretable control, other researchers proposed to explicitly incorporate a prediction model for vehicle interactions in the control algorithm. For instance, [22] uses a “Social Generative Adversarial Network (Social GAN)” to generate predictions of other vehicles’ future trajectories in response to the ego vehicle’s actions. However, the Social GAN does not account for variations of drivers’ styles and intentions and needs to be trained with sufficient traffic data [23]. For the latter, it has been reported that multi-vehicle interaction scenarios in released traffic datasets are insufficient [24]. Game-theoretic methods have also been investigated for modeling vehicle interactions in lane change or merge scenarios [9], [25], [26], [27], [28], [29]. It is possible to account for varied driving styles and/or intentions with these game-theoretic methods, for instance, through modeling and online estimation of drivers’ cognitive levels [26] or aggressiveness [30], [31].
In this paper, we propose a novel high-level control algorithm, called the Leader-Follower Game Controller (LFGC), for autonomous vehicle planning and control in forced merge scenarios. In the LFGC, drivers’ interaction intentions (to proceed or yield) and their resulting vehicle behaviors are represented by an explicit game-theoretic model with multiple concurrent leader-follower pairs, called a leader-follower game [32]. To account for interaction uncertainties, the pairwise leader-follower relationships among the vehicles are assumed to be a priori uncertain and modeled as latent variables. The LFGC estimates the leader-follower relationships online based on observed trajectories and makes optimal decisions for the autonomous ego vehicle using a Model Predictive Control (MPC)-based strategy. The proposed approach thus adapts to the inferred leader-follower relationship estimates to simultaneously achieve probabilistically guaranteed safety and the merging objectives. Note that a similar idea has been investigated in our previous conference paper [33]. The LFGC presented in this paper differs from the one in [33] in several aspects: 1) Instead of relying on discretization of the state space and the POMDP framework in [33], the LFGC in this paper is designed assuming a continuous state space, which results in smoother trajectories for lower-level controllers to track and which alleviates the computational difficulty associated with discrete spaces. 2) Unlike our previous work [33] in which we use a small number of actions (or motion primitives) to represent vehicle behavior, the LFGC of this paper predicts and plans vehicle motion using two much larger sets of trajectories (162 trajectories for the merging ego vehicle and 81 trajectories for each of the highway interacting vehicles), which leads to finer-resolution controls and the potential for higher performance. 3) The LFGC of this paper is validated based on a comprehensive set of simulation-based test cases including cases where other vehicles are controlled by various types of driver models and cases where their motion follows real traffic data, which is not done in [33].
Fig. 1: The autonomous vehicle (blue) needs to merge onto the highway before its on-ramp ends. In dense traffic, there may not be a sufficient gap for the autonomous vehicle to merge into. In this case, the autonomous vehicle needs to force other vehicles to cooperate and let it cut in. However, an interacting vehicle (red) that is aware of the autonomous vehicle’s merging attempt may choose to proceed or to yield according to its own intention.

The contributions and novelties of the proposed LFGC over aforementioned previous approaches are as follows:
1) The LFGC uses a game-theoretic model for vehicle trajectory prediction that accounts for interactions and cooperation while leading to interpretable control solutions (because the control solutions are based on model predictive control with an interpretable game-theoretic prediction model).
2) The LFGC handles interaction uncertainties due to varied cooperation intentions of other vehicles by modeling these uncertainties as latent variables and estimating them online based on observed trajectories and Bayesian inference.
3) The LFGC represents vehicle safety requirements (e.g., collision avoidance) as constraints and pursues optimization subject to satisfying an explicit probabilistic safety characterization (i.e., a user-specified probability bound of safety) in the presence of interaction uncertainties.
4) The LFGC is designed in a continuous state space setting, which avoids the computational difficulty resulting from space discretization in previous POMDP-based approaches. This also enables the LFGC to handle more complex scenarios that involve interactions with multiple vehicles.
5) The LFGC is validated based on a comprehensive set of simulation-based case studies that include cases where other vehicles are controlled by various types of driver models and cases where their motion follows actual vehicle trajectories in the NGSIM US Highway 101 dataset [34]. For the latter, the LFGC demonstrates a high success rate (in terms of safely completing merges) of 97.5%.
This paper is organized as follows: In Section II, we introduce the models that represent vehicle/traffic dynamics, driver objectives, vehicle actions, and the MPC-based control strategy for the autonomous ego vehicle. In Section III, we introduce the leader-follower game model that is used to represent drivers’ interaction intentions and their resulting vehicle behaviors in multi-vehicle traffic scenarios. In Section IV, we integrate the MPC-based control strategy with the leader-follower game model and with online estimation of pairwise leader-follower relationships among interacting vehicles based on Bayesian inference, to enable the ego vehicle’s actions to adapt in real-time to interacting drivers’/vehicles’ intentions. In Section V, we validate the proposed LFGC through multiple simulation-based case studies, including validations against vehicles following our leader-follower game model, the Intelligent Driver Model (IDM), and trajectories from NGSIM US Highway 101 data. Finally, conclusions are given in Section VI.

II. MODELS AND CONTROL STRATEGY DESCRIPTIONS

In this section, we introduce models to represent the vehicle and traffic dynamics and the MPC-based strategy for the ego vehicle’s trajectory planning.

A. Vehicle dynamics

We use the kinematic bicycle model [35] to represent the motion of each vehicle. The kinematic bicycle model is defined by the following set of continuous-time equations,
$$
\begin{aligned}
\dot{x} &= v\cos(\psi + \beta), \\
\dot{y} &= v\sin(\psi + \beta), \\
\dot{v} &= a, \\
\dot{\psi} &= \frac{v}{l_r}\sin\beta, \\
\beta &= \arctan\!\left(\frac{l_r}{l_f + l_r}\tan\delta_f\right),
\end{aligned}
\tag{1}
$$
where we have assumed only front-wheel steering $\delta_f$ and no rear-wheel steering (i.e., $\delta_r = 0$); $x$ and $y$ are the longitudinal and lateral positions of the vehicle; $v$ is the speed of the vehicle; $\psi$ and $\beta$ are the yaw angle and the slip angle of the vehicle; $l_f$ and $l_r$ represent the distances from the CG of the vehicle to the front-wheel and rear-wheel axles; $a$ is the acceleration along the direction of the speed $v$. The control inputs are the acceleration and the front-wheel steering, $u = [a, \delta_f]^T$.
While vehicle models other than (1) could be used, the above kinematic bicycle model (1) is suitable for our purpose of trajectory prediction and planning in forced merge scenarios – it can produce sufficiently accurate predictions of vehicle trajectories under given acceleration and front-wheel steering profiles [36] while it is simple and thus computationally efficient.
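To make (1) concrete, the following is a minimal Python sketch of the model’s right-hand side; the function name and the axle distances `lf`, `lr` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def bicycle_derivatives(state, u, lf=1.2, lr=1.6):
    """Continuous-time kinematic bicycle model (1).

    state = [x, y, v, psi]; u = [a, delta_f]; rear-wheel steering delta_r = 0.
    lf, lr (CG-to-axle distances, in meters) are illustrative values.
    Returns d/dt [x, y, v, psi].
    """
    x, y, v, psi = state
    a, delta_f = u
    beta = np.arctan(lr / (lf + lr) * np.tan(delta_f))   # slip angle
    return np.array([
        v * np.cos(psi + beta),        # x_dot
        v * np.sin(psi + beta),        # y_dot
        a,                             # v_dot
        (v / lr) * np.sin(beta),       # psi_dot
    ])
```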

B. Traffic dynamics

We consider a traffic scenario involving $n+1$ vehicles, including the ego vehicle, denoted by $0$, and $n$ other interacting vehicles $k$, $k \in \{1, \dots, n\}$, which correspond to vehicles that are aware of the ego vehicle’s merging attempt. Therefore, the traffic state and its dynamics are characterized by the aggregation of all $n+1$ vehicles’ states and dynamics. Specifically, we describe the traffic dynamics using the following discrete-time model,
$$\bar{s}_{t+1} = f(\bar{s}_t, \bar{u}_t), \tag{2}$$
where $\bar{s}_t = (s^0_t, s^1_t, s^2_t, \dots, s^n_t)$ denotes the traffic state at the discrete time instant $t$, with $s^0_t$ denoting the ego vehicle’s state and $s^k_t$, $k \in \{1, \dots, n\}$, denoting the $k$th interacting vehicle’s state; and similarly, $\bar{u}_t = (u^0_t, u^1_t, u^2_t, \dots, u^n_t)$ denotes the aggregation of all $n+1$ vehicles’ control inputs at the time instant $t$. In particular, each vehicle’s state $s^k_t$, $k \in \{0, 1, \dots, n\}$, consists of its $x$ and $y$ positions, speed, and yaw angle, i.e., $s^k_t = [x^k_t, y^k_t, v^k_t, \psi^k_t]^T$, and each vehicle’s control inputs are $u^k_t = [a^k_t, \delta^k_{f,t}]^T$. Accordingly, the function $f$ in (2), which represents the transition of the traffic state from $\bar{s}_t$ to $\bar{s}_{t+1}$ as a result of all vehicles’ control inputs $\bar{u}_t$, is an aggregation of $(n+1)$ copies of the kinematic bicycle model (1) converted to discrete time with a specified sampling period $\Delta T$ using the Euler method.
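A minimal sketch of how (2) can be assembled from per-vehicle copies of (1) under Euler discretization; the sampling period value is an assumption, and `bicycle_derivatives` refers to the sketch in Section II-A.

```python
import numpy as np

def traffic_step(s_bar, u_bar, dT=0.5):
    """Discrete-time traffic dynamics (2): one Euler step for every vehicle.

    s_bar: (n+1, 4) array of per-vehicle states [x, y, v, psi].
    u_bar: (n+1, 2) array of per-vehicle inputs [a, delta_f].
    dT is an assumed sampling period; `bicycle_derivatives` is the sketch
    of model (1) given in Section II-A above.
    """
    return np.array([s + dT * bicycle_derivatives(s, u)
                     for s, u in zip(s_bar, u_bar)])
```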

C. Reward function

The reward function $R(\bar{s}_t, \bar{u}_t)$ is a mathematical representation of the driving goals of the driver. Here, we start by considering the interactions between the ego vehicle and one other vehicle, i.e., $\bar{s}_t = (s^0_t, s^1_t)$ and $\bar{u}_t = (u^0_t, u^1_t)$. In this case, the traffic state is composed of the states of these two vehicles, and the reward received by the ego vehicle depends on the states and control inputs of both vehicles. Following [33], we consider
$$R(\bar{s}_t, \bar{u}_t) = w^T r(\bar{s}_t, \bar{u}_t), \tag{3}$$
where $r = [r_1, r_2, r_3, r_4, r_5]^T$ and $w \in \mathbb{R}^5_+$ is a vector of weights. The reward terms $r_1, \dots, r_5$ are defined to represent the following common considerations during driving: 1) safety ($r_1$, $r_2$), i.e., not colliding with other vehicles and not getting off the road; 2) liveness ($r_3$, $r_4$), i.e., approaching the destination; and 3) comfort ($r_5$), i.e., maintaining a reasonable separation from other vehicles. The reader is referred to [33] for more detailed definitions of $r_1, \dots, r_5$.
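The structure of (3) can be sketched as follows; the two example terms are illustrative stand-ins only, since the exact definitions of $r_1, \dots, r_5$ are given in [33] and are not reproduced here.

```python
import numpy as np

def reward(s_bar, u_bar, w, r_terms):
    """Reward (3): R(s_bar, u_bar) = w^T r(s_bar, u_bar), with w in R^5_+."""
    r = np.array([r_i(s_bar, u_bar) for r_i in r_terms])
    return float(w @ r)

# Illustrative stand-ins for two of the five terms (not the paper's definitions):
def r_safety(s_bar, u_bar, min_gap=5.0):
    """Penalty if the ego and the interacting vehicle are too close."""
    ego, other = s_bar[0], s_bar[1]
    return -1.0 if np.hypot(ego[0] - other[0], ego[1] - other[1]) < min_gap else 0.0

def r_liveness(s_bar, u_bar, v_des=30.0):
    """Penalize deviation of the ego speed from a desired speed."""
    return -abs(s_bar[0][2] - v_des)
```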

D. Selecting trajectories as vehicle actions

Instead of considering a discrete set of acceleration $a$ and steering $\delta_f$ levels as in [33], we consider a sampled set of vehicle motion trajectories over a planning horizon of $T = N\Delta T$ [s] as the action space for each vehicle. Specifically, each trajectory is a time history of the vehicle state $s_t = [x_t, y_t, v_t, \psi_t]^T$ starting from the vehicle’s current state $s_0$. Note that the time history of control inputs $u_t = [a_t, \delta_{f,t}]^T$ corresponding to each trajectory can be calculated according to the vehicle dynamics model (1). Compared to representing vehicle motion using discrete acceleration and steering levels as in [33], the method here can lead to smoother trajectories and finer-resolution controls.
For interacting vehicles driving in the target lane, we only consider their longitudinal motion, which corresponds to the assumption that these vehicles do not change lanes. Assuming $\psi = 0$ and $\delta_f = 0$, the kinematic bicycle model (1) for these vehicles reduces to
$$\dot{x} = v, \qquad \dot{y} = 0, \qquad \dot{v} = a, \qquad \dot{\psi} = 0, \tag{4}$$
In this case, a trajectory starting with a given initial condition depends only on the profile of the acceleration $a$ over $[0, T]$. In particular, at each sample time instant, we consider 81 acceleration profiles, which translate into 81 trajectories through (4), for each interacting vehicle $k$ driving in the target lane, and we treat these trajectories as its admissible actions. Note that we also enforce the speed limits $v^k_t \in [v_{\min}, v_{\max}]$ when we generate these trajectories. We denote each of such trajectories as $\gamma^k_m(s^k_0)$, with $m = 1, 2, \dots, 81$, and the collection of such trajectories as $\Gamma^k(s^k_0) := \{\gamma^k_m(s^k_0)\}^{81}_{m=1}$.
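As an illustration of how the 81 admissible trajectories in $\Gamma^k(s^k_0)$ could be generated, the sketch below enumerates piecewise-constant acceleration profiles under model (4); the specific acceleration levels and segmentation are assumptions (chosen so that $3^4 = 81$), since the paper does not list them in this excerpt.

```python
import itertools
import numpy as np

def longitudinal_trajectories(s0, accel_levels=(-2.0, 0.0, 2.0),
                              n_segments=4, dT=0.5, v_min=0.0, v_max=35.0):
    """Candidate trajectories of an interacting vehicle under model (4).

    Each trajectory follows a piecewise-constant acceleration profile with
    `n_segments` segments; 3 levels over 4 segments give 3**4 = 81 profiles,
    matching the count in the text (the levels and segmentation themselves
    are assumptions). s0 = [x, y, v, psi] with psi = 0; the speed is kept
    within [v_min, v_max].
    """
    trajectories = []
    for profile in itertools.product(accel_levels, repeat=n_segments):
        x, y, v, psi = s0
        traj = [np.array([x, y, v, psi])]
        for a in profile:
            x += dT * v                                   # Euler step of (4): x_dot = v
            v = float(np.clip(v + dT * a, v_min, v_max))  # v_dot = a, with speed limits
            traj.append(np.array([x, y, v, psi]))
        trajectories.append(np.array(traj))
    return trajectories   # 81 admissible trajectories gamma^k_m(s^k_0)
```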
The merging vehicle’s maneuvers include both lane keeping and lane change. Trajectories or pieces of trajectories that represent lane keeping are generated using (4) in a similar way as above. For a lane change, we use 5th-order polynomials to represent lane change trajectories [37]. Specifically, a lane change trajectory is produced by the solution to the following boundary value problem: Find the coefficients $a_0, \dots, a_5$ and $b_0, \dots, b_5$ such that the 5th-order polynomials
$$x(\zeta) = \sum_{i=0}^{5} a_i\,\zeta^{i}, \qquad y(\zeta) = \sum_{i=0}^{5} b_i\,\zeta^{i}, \tag{5}$$
satisfy specified initial and terminal conditions $(x_{\mathrm{ini}}, \dot{x}_{\mathrm{ini}}, \ddot{x}_{\mathrm{ini}}, y_{\mathrm{ini}}, \dot{y}_{\mathrm{ini}}, \ddot{y}_{\mathrm{ini}})$ and $(x_{\mathrm{term}}, \dot{x}_{\mathrm{term}}, \ddot{x}_{\mathrm{term}}, y_{\mathrm{term}}, \dot{y}_{\mathrm{term}}, \ddot{y}_{\mathrm{term}})$, where the initial condition corresponds to either the vehicle’s current state or its state at the start of a lane change, and the terminal condition corresponds to the vehicle’s state after the completion of the lane change. The variable $\zeta$ in (5) denotes continuous time. We let $\zeta = 0$ correspond to the current sample time instant and assume that 1) the vehicle can start a lane change at any sample time instant $\zeta = t\,\Delta T$, with $t = 0, \dots, N-1$, over the planning horizon, and 2) a complete lane change takes a constant time duration of $T_{\mathrm{lc}} = 3$ [s] [37]. Then, for the case where at the current sample time instant the vehicle is in the middle of a lane change (i.e., the vehicle started the lane change $\Delta T_{\mathrm{lc}}$ [s] ago), the initial condition corresponds to the vehicle’s current state and is satisfied by (5) at $\zeta = 0$, while the terminal condition is satisfied by (5) at $\zeta = T_{\mathrm{lc}} - \Delta T_{\mathrm{lc}}$. For the case where the vehicle starts a lane change at a future sample time instant $t\,\Delta T$, the initial condition corresponds to the vehicle’s state at the start of the lane change and is satisfied by (5) at $\zeta = t\,\Delta T$, while the terminal condition is satisfied by (5) at $\zeta = t\,\Delta T + T_{\mathrm{lc}}$. Furthermore, we allow the vehicle, when it is in the middle of a lane change, to abort the lane change at any sample time instant $\zeta = t\,\Delta T$ over the planning horizon. This represents a “change of mind” of the driver when a previously planned lane change becomes no longer feasible/safe. A trajectory for aborting a lane change is generated in a similar way as a lane change trajectory, but the terminal condition now corresponds to the vehicle’s state after it returns to its original lane. Finally, we glue together pieces of trajectories for lane keeping, lane change, and aborting a lane change to construct complete trajectories over the planning horizon. This way, we obtain a total of 162 trajectories for the merging vehicle that we treat as admissible actions. Each of these trajectories is characterized by 1) whether and when to start a lane change and 2) whether and when to abort an improper lane change. Fig. 3 illustrates a sampled set of such trajectories when the vehicle has not started a lane change and when the vehicle is in the middle of a lane change. We denote each of such trajectories as $\gamma^0_m(s^0_0)$, with $m = 1, 2, \dots, 162$, and the collection of such trajectories as $\Gamma^0(s^0_0) := \{\gamma^0_m(s^0_0)\}^{162}_{m=1}$.
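Since each axis of (5) must satisfy six boundary conditions (position, velocity, and acceleration at both ends), its coefficients follow from a small linear system. A sketch with illustrative boundary values:

```python
import numpy as np

def quintic_coeffs(p_ini, v_ini, a_ini, p_term, v_term, a_term, T):
    """Coefficients c_0..c_5 of p(z) = sum_i c_i z^i such that position,
    velocity, and acceleration match the boundary values at z = 0 and z = T
    (one axis of (5))."""
    A = np.array([
        [1, 0, 0,    0,      0,       0],
        [0, 1, 0,    0,      0,       0],
        [0, 0, 2,    0,      0,       0],
        [1, T, T**2, T**3,   T**4,    T**5],
        [0, 1, 2*T,  3*T**2, 4*T**3,  5*T**4],
        [0, 0, 2,    6*T,    12*T**2, 20*T**3],
    ], dtype=float)
    b = np.array([p_ini, v_ini, a_ini, p_term, v_term, a_term], dtype=float)
    return np.linalg.solve(A, b)

# Example: a lane change completed in T_lc = 3 s with a 3.5 m lateral offset
# at a constant 20 m/s longitudinal speed (illustrative numbers only).
ax = quintic_coeffs(0.0, 20.0, 0.0, 60.0, 20.0, 0.0, 3.0)   # coefficients of x(z)
ay = quintic_coeffs(0.0,  0.0, 0.0,  3.5,  0.0, 0.0, 3.0)   # coefficients of y(z)
```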

(Note: the rest of Section II, which contains many formulas, is not reproduced here; it includes the MPC-based control strategy referred to as (6) below.)

III. GAME-THEORETIC MODEL FOR VEHICLE COOPERATION BEHAVIORS AND EXPLICIT REPRESENTATION USING IMITATION LEARNING

In this section, we introduce the leader-follower game employed in this paper for modeling the interaction/cooperation between the merging vehicle and vehicles driving in the target lane. In order to simplify the online computations associated with this game-theoretic model, imitation learning is utilized to derive a neural network-based explicit representation of the model, which is used online for predicting the interacting vehicles’ trajectories in response to the merging ego vehicle’s actions in our MPC-based trajectory planning strategy.

A. Leader-follower game-theoretic model

During a highway forced merge process, the merging vehicle (ego vehicle) interacts with other vehicles driving in the target lane, which may choose to proceed or to yield to the merging vehicle depending on the traffic situation and the individual driver’s preference. In this paper, we consider a game-theoretic model based on pairwise leader-follower interactions, called a leader-follower game, to represent drivers’ cooperation intentions and their resulting vehicle behaviors. In this model, a vehicle (or, a driver) who decides to proceed before another vehicle is a leader in this vehicle pair, and the one who decides to yield to another vehicle is a follower in the pair. The leader and the follower use different decision strategies. This leader-follower game-theoretic model was originally proposed in [32], where it demonstrated the ability to effectively model drivers’ intentions to proceed or yield (e.g., caused by common traffic rules and etiquette) in driving through intersection scenarios. Here, we briefly review this game-theoretic model and introduce its application to our highway forced merge scenarios.
Denote the trajectories of the leader and the follower as $\gamma_{l,t} \in \Gamma_l(\bar{s}_t)$ and $\gamma_{f,t} \in \Gamma_f(\bar{s}_t)$, respectively, where $\Gamma_l(\bar{s}_t)$ and $\Gamma_f(\bar{s}_t)$ are the sets of admissible trajectories of the leader and the follower. We assume that both vehicles make decisions to maximize their cumulative rewards, denoted as $R_l(\bar{s}_t, \gamma_{l,t}, \gamma_{f,t})$ and $R_f(\bar{s}_t, \gamma_{l,t}, \gamma_{f,t})$, respectively, and defined according to
$$R_\sigma(\bar{s}_t, \gamma_{l,t}, \gamma_{f,t}) = \sum_{\tau=0}^{N-1} R_\sigma\big(\bar{s}_{t+\tau},\, u_{l,t+\tau},\, u_{f,t+\tau}\big), \qquad \sigma \in L,$$
where $\sigma \in L = \{\text{leader}, \text{follower}\}$ represents the leader or follower role in the game, $R_\sigma(\bar{s}_{t+\tau}, u_{l,t+\tau}, u_{f,t+\tau})$ is the reward function for the leader or the follower defined as in Section II-C, and $u_{l,t+\tau}$ and $u_{f,t+\tau}$, $\tau = 0, \dots, N-1$, are the control inputs corresponding to $\gamma_{l,t}$ and $\gamma_{f,t}$ as described in Section II-D.
Specifically, we model the leader’s and the follower’s interactive decision processes as follows:
$$\gamma_l^*(\bar{s}_t) \in \arg\max_{\gamma_{l,t} \in \Gamma_l(\bar{s}_t)} Q_l(\bar{s}_t, \gamma_{l,t}), \tag{8}$$
$$\gamma_f^*(\bar{s}_t) \in \arg\max_{\gamma_{f,t} \in \Gamma_f(\bar{s}_t)} Q_f(\bar{s}_t, \gamma_{f,t}), \tag{9}$$
where $\gamma_l^*(\bar{s}_t)$ (resp. $\gamma_f^*(\bar{s}_t)$) is an optimal trajectory of the leader (resp. follower) given the current traffic state $\bar{s}_t$, and $Q_l$ and $Q_f$ are defined as
$$Q_l(\bar{s}_t, \gamma_{l,t}) = R_l\big(\bar{s}_t,\, \gamma_{l,t},\, \gamma_f^*(\bar{s}_t)\big), \tag{10}$$
$$Q_f(\bar{s}_t, \gamma_{f,t}) = \min_{\gamma_{l,t} \in \Gamma_l(\bar{s}_t)} R_f\big(\bar{s}_t,\, \gamma_{l,t},\, \gamma_{f,t}\big). \tag{11}$$
The decision model (8)-(11) can be explained as follows: A follower represents a driver who intends to yield. Due to uncertainty about the other driver’s action, the follower decides to take an action that maximizes her worst-case reward through (9) and (11). Such a “max-min” decision strategy of the follower models the yielding behavior because it assumes the other driver can take actions freely. Similarly, a leader represents a driver who intends to proceed and assumes the other driver will yield. Therefore, the leader uses the follower model to predict the other driver’s action and takes an action that maximizes the leader’s own reward under the predicted follower’s action through (8) and (10). This leader-follower game model is partly inspired by the Stackelberg game model [38], but relaxes several assumptions of the Stackelberg model that generally do not hold for driver interactions in traffic. The reader is referred to [32] for more discussion of this leader-follower game model and of its effectiveness for modeling driver interactions in multi-vehicle scenarios.
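Under the reading of (8)-(11) given above (the follower maximizes its worst-case cumulative reward, while the leader maximizes its own reward under the follower’s predicted trajectory), the two decision rules can be sketched as follows; `R_l` and `R_f` stand for the cumulative rewards, and the trajectory sets are assumed to be finite lists as in Section II-D.

```python
def follower_decision(s_bar, Gamma_l, Gamma_f, R_f):
    """Follower, (9)/(11): maximize the worst-case cumulative reward (max-min)."""
    def Q_f(g_f):
        return min(R_f(s_bar, g_l, g_f) for g_l in Gamma_l)
    return max(Gamma_f, key=Q_f)

def leader_decision(s_bar, Gamma_l, Gamma_f, R_l, R_f):
    """Leader, (8)/(10): maximize its own reward under the predicted follower action."""
    g_f_pred = follower_decision(s_bar, Gamma_l, Gamma_f, R_f)   # follower model
    return max(Gamma_l, key=lambda g_l: R_l(s_bar, g_l, g_f_pred))
```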
Note that although the asymmetric leader-follower roles in the decision model (8)-(11) are used to represent drivers’ intentions to proceed and yield, respectively, the model does not imply that a leader interacting vehicle will always force a merging vehicle to merge behind it, or that a follower interacting vehicle will always let a merging vehicle merge in front of it. For instance, a merging vehicle may merge in front of a leader interacting vehicle in the following two situations: 1) The merging vehicle is ahead of the interacting vehicle with a sufficiently large distance to allow safe merging. 2) The merging vehicle is about to reach the end of its lane. Because getting off the road yields a large penalty (see Section II-C), the merging vehicle may choose to merge ahead of the interacting vehicle to avoid the large penalty as long as its merging will not lead to a collision between the two vehicles (it will not merge if the merging would cause a collision, because the penalty for a collision is even larger than for getting off the road). The above observations clarify that the leader-follower roles in our decision model (8)-(11) are not assigned by vehicle spatial positions (i.e., a leader is not necessarily a vehicle in front). Moreover, this model allows a merging vehicle to force the traffic in the target lane to let it merge in: As the merging vehicle approaches the end of its lane, it is increasingly inclined to merge to avoid the penalty for getting off the road, even if all of the interacting vehicles in the target lane are leaders (i.e., their drivers all originally intend to proceed) and the current gaps are not large enough in terms of comfort. The model (8)-(11) enables these leader interacting vehicles to predict the merging vehicle’s upcoming merging maneuver. Then, for their own safety and comfort, they will slow down to enlarge the gap and, consequently, accommodate the merge. Therefore, our leader-follower model (8)-(11) is suitable for trajectory prediction and planning in forced merge scenarios.

B. Explicit representation of leader-follower game policy through imitation learning

Based on (8)-(11), we are able to predict other vehicles’ decisions and trajectories given knowledge of the drivers’ intentions and the current traffic state information. Hence, we can denote the leader’s optimal action policy as $\gamma_l^*(\bar{s}_t)$ and the follower’s optimal action policy as $\gamma_f^*(\bar{s}_t)$. Obtaining $\gamma_l^*(\bar{s}_t)$ and $\gamma_f^*(\bar{s}_t)$ requires going through (8)-(11), and the repeated online computations involving (8)-(11) can be time consuming. As a result, we want to represent $\gamma_l^*$ and $\gamma_f^*$ explicitly.
Here, $\gamma_\sigma^*(\bar{s}_t)$, $\sigma \in L$, are maps from the current traffic state to a predicted trajectory that the other vehicle will follow. These maps are determined according to (8)-(11). Instead of algorithmically determining $\gamma_l^*(\bar{s}_t)$ and $\gamma_f^*(\bar{s}_t)$, we follow [39] and exploit the use of supervised learning, more specifically, imitation learning, to represent $\gamma_\sigma^*(\bar{s}_t)$.
Imitation learning can be considered as a supervised learning problem, where an autonomous agent tries to learn a policy by observing an expert’s demonstrations. The expert demonstrations can be generated either by a human operator or by an artificial intelligence agent. In this work, we treat $\gamma_\sigma^*(\bar{s}_t)$ obtained from (8)-(11) as the expert policy.
In particular, the “Dataset Aggregation” algorithm [40] has been utilized to obtain an imitated policy $\hat{\gamma}_\sigma$. The overall learning objective of the Dataset Aggregation algorithm can be described by,
$$\hat{\gamma}_\sigma = \arg\min_{\gamma_\theta}\; \mathbb{E}_{\bar{s} \sim d_{\gamma_\theta}}\!\left[\mathcal{L}\big(\gamma_\sigma^*(\bar{s}),\, \gamma_\theta(\bar{s})\big)\right],$$
where $\gamma_\theta$ represents a policy with respect to which the optimization is performed and which is parameterized by $\theta$ (e.g., neural network weights), $d_{\gamma_\theta}$ denotes the distribution of traffic states visited when the policy $\gamma_\theta$ is executed, and $\mathcal{L}$ represents a loss function. More detailed discussions of imitation learning and of the “Dataset Aggregation” algorithm can be found in [39] and [40].
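A minimal sketch of a Dataset-Aggregation-style loop for obtaining $\hat{\gamma}_\sigma$; `expert_policy`, `fit_policy`, and `rollout` are placeholders for the leader-follower game policy, the supervised fitting step, and the simulation rollout, none of which are specified at this level of detail in the paper.

```python
def dagger(expert_policy, fit_policy, rollout, n_iters=10):
    """Minimal DAgger-style loop for learning an explicit policy.

    expert_policy(s): trajectory selected by the leader-follower game (8)-(11).
    fit_policy(dataset): supervised fit of (state, expert trajectory) pairs,
        returning a policy s -> trajectory (e.g., a neural network).
    rollout(policy): traffic states visited when the policy is run in simulation.
    All three callables are placeholders (assumptions for illustration).
    """
    dataset, policy = [], expert_policy                      # bootstrap with the expert
    for _ in range(n_iters):
        states = rollout(policy)                             # states induced by current policy
        dataset += [(s, expert_policy(s)) for s in states]   # relabel with the expert
        policy = fit_policy(dataset)                         # aggregate and refit
    return policy
```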
The model (8)-(11) and the imitation-learned policies can be used to predict the other vehicles’ decisions and future trajectories given knowledge of their drivers’ cooperation intentions. However, in a given traffic scenario, we may not know the other drivers’ cooperation intentions a priori, because a driver’s intention depends not only on the traffic situation (e.g., the relative position and velocity between two vehicles) but also on the driver’s style/type (e.g., aggressive versus conservative). To deal with prior uncertainties about other vehicles’ cooperation intentions, in what follows we describe an approach in which such uncertainties are modeled as latent variables, and the autonomous vehicle planning and control problem combines online estimation of the other vehicles’ cooperation intentions with a predictive control method to obtain the optimal trajectory.

IV. DECISION MAKING UNDER COOPERATION INTENTION UNCERTAINTY


In this section, we describe the decision-making algorithm, called the Leader-Follower Game Controller (LFGC), for the highway forced merge scenario under cooperation intention uncertainty. During the forced merge process, we generate an estimate of each interacting driver’s cooperation intention, as described in this section. Based on these intention estimates, we apply the control strategy presented in (6) in multi-vehicle interaction settings by considering pairwise interactions.

A. Estimation of interacting vehicle’s cooperation intention


According to Section III, we can model other drivers’ behavior based on their cooperation intentions using the leader-follower game. A yielding vehicle may behave similarly to a follower in the game, while a proceeding (not yielding) vehicle may be modeled as a leader in the game. In this sense, we can estimate an interacting vehicle’s cooperation intention by estimating its leader or follower role in the leader-follower game.
To achieve that, we consider the traffic dynamics model (2) and the leader or follower’s optimal actions (8) and (9). From the perspective of the ego vehicle, the interacting vehicle is playing a leader-follower game with it, and the traffic dynamics model can be written as
$$\bar{s}_{t+1} = f\big(\bar{s}_t,\, u^0_t,\, (u^1_{\sigma,t})^*(\bar{s}_t)\big), \qquad \sigma \in L, \tag{14}$$
where $u^0_t$ is the control of the ego vehicle, $u^1_t$ is the control of the interacting vehicle and is determined by the leader-follower game, $\sigma \in L = \{\text{leader}, \text{follower}\}$ represents either the leader or the follower role, and $(u^1_{\sigma,t})^*(\bar{s}_t)$ is the first control input corresponding to the optimal trajectory $\gamma_\sigma^*(\bar{s}_t)$ in (8) and (9). Now the only input to (14) is the control of the ego vehicle, $u^0_t$.
However, in reality, the interacting vehicle’s decision does not necessarily follow the optimal policy computed from (8) and (9). In order to account for the difference between the leader-follower policy and the actual policy of the interacting vehicle, we assume the system is propagated by (14) with additive Gaussian noise, i.e.,
$$\bar{s}_{t+1} = f\big(\bar{s}_t,\, u^0_t,\, (u^1_{\sigma,t})^*(\bar{s}_t)\big) + w, \tag{15}$$
where $w$ is additive Gaussian noise with zero mean and covariance $W$.
The ego vehicle is assumed to have a prior belief on $\sigma$, denoted as $P(\sigma = l)$, with $l \in L = \{\text{leader}, \text{follower}\}$. Then, based on all previous traffic states and all actions taken by the ego vehicle (collected in the information history $\xi_t$), the ego vehicle needs to compute or maintain a posterior belief of the interacting vehicle’s leader or follower role, $P(\sigma = l \mid \xi_t)$.
The conditional posterior belief of the interacting vehicle’s leader or follower role is computed using the hybrid estimation algorithm proposed in [41].
Specifically, identification of the interacting vehicle’s leader or follower role can be achieved by,
$$P(\sigma = l \mid \xi_t) = \frac{1}{c_t}\,\Lambda_{l,t} \sum_{k \in L} \pi_{lk}\, P(\sigma = k \mid \xi_{t-1}),$$
where $P(\cdot \mid \cdot)$ denotes conditional probability; $\pi_{lk}$ denotes the transition probability of the interacting vehicle’s role from $k$ to $l$; and $\Lambda_{l,t}$ is the likelihood function of the interacting vehicle acting in role $l$, defined by,
$$\Lambda_{l,t} = \mathcal{N}\big(r_{l,t};\, 0,\, W\big), \qquad r_{l,t} = \bar{s}_t - f\big(\bar{s}_{t-1},\, u^0_{t-1},\, (u^1_{l,t-1})^*(\bar{s}_{t-1})\big), \qquad c_t = \sum_{l' \in L} \Lambda_{l',t} \sum_{k \in L} \pi_{l'k}\, P(\sigma = k \mid \xi_{t-1}),$$
where $\mathcal{N}(r_{l,t}; 0, W)$ denotes the probability density function of the normal distribution with mean $0$ and covariance $W$ evaluated at $r_{l,t}$, and $c_t$ is the normalization constant.
Assuming the interacting vehicle’s role remains unchanged over the merge period, i.e., $\pi_{lk} = 1$ when $l = k$ and $\pi_{lk} = 0$ when $l \neq k$, the posterior belief of the interacting vehicle’s leader or follower role can be updated using the following equation,
$$P(\sigma = l \mid \xi_t) = \frac{1}{c_t}\,\Lambda_{l,t}\, P(\sigma = l \mid \xi_{t-1}),$$
where $P(\sigma = l \mid \xi_{t-1})$ is the previous belief of the interacting vehicle’s leader or follower role.
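Putting the above together, a sketch of one belief-update step; `traffic_step` refers to the earlier sketch of the dynamics, `predict_u1` stands for the game-based prediction $(u^1_{\sigma,t})^*$, and the use of `scipy.stats.multivariate_normal` is an implementation choice rather than something prescribed by the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_role_belief(belief, s_prev, s_now, u0_prev, predict_u1, W, dT=0.5):
    """One step of the leader/follower belief update described above.

    belief: dict {'leader': p, 'follower': 1 - p}, i.e., P(sigma = l | xi_{t-1}).
    s_prev, s_now: previous and current traffic states (2 x 4 arrays in the
        pairwise ego/interacting-vehicle setting).
    u0_prev: ego control input applied at t-1.
    predict_u1(role, s): game-based prediction of the interacting vehicle's input.
    W: covariance of the additive Gaussian model error in (15).
    The interacting vehicle's role is assumed constant (pi_lk = identity), and
    `traffic_step` is the earlier sketch of the dynamics (2)/(14).
    """
    likelihoods = {}
    for role in belief:
        u1 = predict_u1(role, s_prev)
        s_pred = traffic_step(s_prev, np.array([u0_prev, u1]), dT)   # model (14)
        r = (s_now - s_pred).ravel()                                  # residual r_{l,t}
        likelihoods[role] = multivariate_normal.pdf(r, mean=np.zeros(r.size), cov=W)
    c_t = sum(likelihoods[role] * belief[role] for role in belief)    # normalization
    return {role: likelihoods[role] * belief[role] / c_t for role in belief}
```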
