Model-Based Reinforcement Learning for Eco-Driving Control of Electric Vehicles

ABSTRACT With the development of autonomous vehicles, research on energy-efficient eco-driving is becoming increasingly important. The optimal control problem of determining the speed profile of the vehicle for minimizing energy consumption is challenging, as it requires the consideration of various aspects, such as the vehicle energy consumption, the slope of the road, and the driving environment, e.g., the traffic and other vehicles on the road. In this study, a reinforcement learning approach was applied to the eco-driving problem for electric vehicles considering road slopes. A novel model-based reinforcement learning algorithm for eco-driving was developed, which separates the vehicle's energy consumption approximation model and the driving environment model. Thus, the domain knowledge of vehicle dynamics and the powertrain system is utilized in the reinforcement learning process, while model-free characteristics are maintained by updating the approximation model using experience replay. The proposed algorithm was tested via a vehicle simulation and compared with a solution obtained using dynamic programming (DP), as well as conventional cruise control driving at constant speed. The simulation results indicated that the speed profile optimized using model-based reinforcement learning behaved similarly to the global solution obtained via DP and saved energy compared with cruise control.
INDEX TERMS Eco-driving control, electric vehicles, model-based reinforcement learning, optimal control, Q-learning, reinforcement learning.
I. INTRODUCTION
Recently, diverse technologies for autonomous vehicles have been developing rapidly, which has led to advancements in autonomous driving. In the future, vehicles will be operated with less intervention by human drivers. Without manipulation by the human driver, the vehicle becomes safer; additionally, vehicles will be able to move quickly with the aid of computational intelligence based on autonomous vehicle technologies in the near future. Another issue concerning future vehicles is the environmental aspect; diverse vehicles, such as hybrid electric vehicles (HEVs), electric vehicles (EVs), and fuel-cell EVs (FCEVs), are being developed to reduce emissions and increase vehicular efficiency. The use of autonomous vehicles can also contribute to increasing vehicular fuel efficiency. In an autonomous vehicle, as the level of driving automation increases, the intervention of the human driver can be minimized, and the efficiency of the vehicle can be maximized by optimizing its speed profile while satisfying the desired travel time. This optimization of the vehicle speed profile can be very useful, as the vehicle efficiency can be increased without changes to the vehicle hardware, and the technology can be used in any type of vehicle. Additionally, considering that in the near future many vehicles will be operated without a human driver, optimization of the vehicle speed profile, which is called an eco-driving strategy, is a very important problem.
Various studies have been conducted on eco-driving strategies. First, approaches based on an analytical solution derived from the optimal control problem have been proposed. In [1], a closed-form solution of the optimal control problem was found for eco-driving of EVs. Here, the optimization problem was defined to minimize the fuel consumption, and the target traveling time for a given distance was given as the constraint of the problem. Then, the optimization problem was solved to obtain an explicit solution. In [2], an analytical state-constrained solution was derived considering vehicle safety constraints for EVs. Here, the minimum inter-vehicle distance and maximum road speed limit were defined as state constraints, and an analytical state-constrained solution was derived for connected and automated vehicles.
Additionally, in many studies, approaches based on dynamic programming (DP) or Pontryagin's minimum principle (PMP) were utilized. In [3], look-ahead control was used to optimize the speed profile. Here, based on a global positioning system, the road geometry ahead of the vehicle was extracted, and DP was used in a predictive scheme to optimize the velocity trajectory of a heavy diesel truck. In [4], stochastic approaches based on DP were employed. Here, a time-independent fuel-efficient control strategy based on stochastic DP was developed, which does not require preview information of the route or the road slope. Additionally, constraints on the vehicle-following distance were applied to develop a fuel-efficient vehicle-following control policy. In [5], PMP was applied to a passenger car with an internal combustion engine. Here, optimal periodic control was derived for cruise control, which is a hybrid system that includes gear shifting and idle operation of the engine. In [6], minimum-fuel driving control was studied based on PMP. Here, the vehicle model was expressed as a point-mass vehicle with a quasi-static polynomial fuel-consumption model, and gear shifting, clutch disengagement, and brake control were modeled as simple on–off switches. More recently, in [7], PMP and DP were used together for the eco-driving of all-electric vehicles. Here, PMP was first utilized to find the possible operating modes satisfying the necessary condition, and then DP was used to solve the optimal control problem again in the distance domain, which reduced the computational burden of the DP calculation.
In [8], the traffic signal was included in the eco-driving control framework. Here, with the assumption of vehicle-to-infrastructure communication capabilities, an optimal speed profile was obtained to minimize the total fuel consumption while safely crossing an intersection. Additionally, combined with the energy management of HEVs, the speed profile control problem was defined in an all-inclusive manner in [9]. Here, a bi-level methodology was used for the predictive energy management of parallel HEVs, where the optimal velocity was first calculated in the outer loop using a Krylov subspace method, and in the inner loop, the optimal torque split and gear shift were determined using PMP based on the model predictive control (MPC) framework.
However, applying these eco-driving strategies to real-world driving situations is not easy and has many limitations. First, the environment changes frequently and has many disturbances. Thus, a deterministic algorithm is limited in that it must predict future driving conditions precisely, and there are driving environments that are difficult to model, such as the driving behavior of the car ahead or a traffic jam. Additionally, implementing DP or PMP for an online eco-driving strategy is challenging because of the computational burden of DP or the co-state-sensitive characteristics of PMP. The more practical method of MPC was used in [10] and [11]. Here, adaptive nonlinear MPC was utilized and implemented in a vehicle with a standard production powertrain control module. To increase the prediction accuracy, a recursive least-squares algorithm was used for parameter adaptation, which was combined with MPC to obtain more reliable results under real-world driving conditions. In [12], the vehicle-following scenario was studied. In the automated car-following scenario, a pulse-and-glide strategy was implemented based on the switching logic in a servo-loop controller to minimize the fuel consumption. However, these approaches also have limitations in that MPC and periodic control focus on finding the local optimum for the near future, rather than the global optimal solution over the entire travel distance. Thus, the fuel-economy improvement is limited, and consideration of complex driving environments is challenging, requiring an additional parameter calibration process.
Therefore, in this study, we developed an eco-driving strategy based on reinforcement learning. Reinforcement learning is an algorithm that can learn the optimal control policy through the interaction between the agent and the environment [13]. Reinforcement learning is very similar to the DP-based approach in that both optimize the cost-to-go value function based on the Bellman equation, and it is possible to replace the DP-based approach with a reinforcement learning-based approach. On the other hand, unlike DP, reinforcement learning can be used as a real-time controller by learning in a stochastic manner, and it has a model-free feature, learning the optimal control policy through the interaction between the agent and the environment with adaptation. Accordingly, the reinforcement learning approach is well suited to the eco-driving control problem, in which an optimization solution must be found from a probabilistic point of view in various and complex road driving environments. Reinforcement learning has been used for eco-driving in several studies. In [14], multi-objective deep Q-learning was utilized for the eco-routing problem to identify the best route for minimizing the traveling time and fuel consumption. In [15] and [16], a reinforcement learning algorithm was studied for minimizing the fuel consumption in the vicinity of an isolated signalized intersection. In [17], eco-driving control considering the car-following scenario using an actor–gear–critic network architecture was studied for a conventional vehicle equipped with an internal combustion engine and automated manual transmission. Here, fuel economy with safe inter-vehicle distance constraints was considered as the objective function, but the road slope was not considered as a state variable and was set to zero.
In the present study, general speed profile optimization for an eco-driving strategy for longitudinal driving considering the road slope was investigated using model-based reinforcement learning (MBRL). MBRL is a methodology that approximates the environment, including the system dynamics; thus, learning can be conducted with guaranteed stability [18] or with few interactions [19]. In the case of vehicle control, MBRL was successfully applied to the optimal control problem of energy management of HEVs in our previous studies [20], [21]. The contribution of the present study is as follows: we developed a new algorithm for the eco-driving control problem using the reinforcement learning approach, and through this, we confirmed that the reinforcement learning method can be applied well to the eco-driving problem. In particular, for the eco-driving problem of optimizing the vehicle's speed profile while reflecting the road slope, the reinforcement learning method was compared with the optimal solution obtained using the existing DP method and with the cruise control case with constant vehicle speed, demonstrating the effectiveness and feasibility of the reinforcement learning-based approach. Specifically, we developed an eco-driving strategy with model-based Q-learning and confirmed its effectiveness via a vehicle simulation. To the best of our knowledge, this was the first study in which the MBRL approach was applied to the eco-driving control problem. Even though only the road slope is considered among the diverse driving environment factors for the eco-driving strategy, the proposed approach can be extended to diverse driving environment conditions, e.g., traffic signals and other vehicles on the road, thanks to the model-free characteristic of the algorithm; thus, approaches using the reinforcement learning technique can be powerful. Additionally, the trained optimal control policy can be used for real-time vehicle controllers. The remainder of this article is organized as follows. In Section II, the EV model used in this study is presented. In Section III, the optimization problem for the eco-driving strategy is presented, and the MBRL algorithm for eco-driving is explained. In Section IV, the vehicle simulation is presented, and in Section V, the conclusions are presented.

II. VEHICLE MODELING

In this study, a vehicle simulation was performed for training and testing the proposed algorithm. For the simulation, an EV was used. Compared with conventional internal combustion engine-based vehicles, EVs can recover energy from regenerative braking, making them more suitable for energy-efficient driving. However, the algorithm proposed in this article is not limited to EVs but is applicable to all vehicles.
For the EV modeling, a backward-looking vehicle simulation was performed via a quasi-static modeling technique, and only the longitudinal vehicle dynamics were considered. The vehicle configuration is shown in Fig. 1, and the vehicle parameters used in the simulation are presented in Table 1. The efficiency of the motor, including the efficiency of the converter, $\eta_{elec}(T_{mot}, \omega_{mot})$, was calculated using a predetermined map, as shown in Fig. 2, and the electric power consumed by the motor, $p_{bat}$, was calculated using the following equation:
[Fig. 1: vehicle configuration. Table 1: vehicle parameters used in the simulation. Fig. 2: motor efficiency map.]
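A standard quasi-static form consistent with the variable definitions below is sketched here; the handling of the efficiency for motoring versus regeneration is an assumption.

$$
p_{bat} = \begin{cases} \dfrac{T_{mot}\,\omega_{mot}}{\eta_{elec}(T_{mot},\omega_{mot})}, & T_{mot}\,\omega_{mot} \ge 0 \\ T_{mot}\,\omega_{mot}\,\eta_{elec}(T_{mot},\omega_{mot}), & T_{mot}\,\omega_{mot} < 0 \end{cases}
$$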
where $T_{mot}$ represents the motor torque, and $\omega_{mot}$ represents the motor speed. The battery state of charge (SOC) dynamics can be expressed as follows:
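A common equivalent-circuit formulation consistent with the variable definitions below, assumed here as a sketch, is:

$$
\dot{SOC} = -\frac{V_{oc}(SOC) - \sqrt{V_{oc}(SOC)^2 - 4\,R_{bat}(SOC)\,p_{bat}}}{2\,R_{bat}(SOC)\,Q_{bat}}
$$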
where $V_{oc}(SOC)$ represents the open-circuit voltage, $R_{bat}(SOC)$ represents the internal resistance of the battery, and $Q_{bat}$ represents the battery capacity. The powertrain dynamics are given as follows:
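One plausible form is sketched below, with the final-drive loss applied on the input side (an assumption; for a single-reduction EV driveline, $T_{fd}$ corresponds to the motor torque):

$$
T_{wh} = \gamma_{fd}\left(T_{fd} - T_{fd,loss}(T_{fd},\omega_{fd})\right)
$$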
where $T_{wh}$ represents the wheel torque, $\gamma_{fd}$ represents the gear ratio of the final drive, $T_{fd,loss}(T_{fd}, \omega_{fd})$ represents the torque loss in the final drive, $T_{fd}$ represents the input torque in the final drive, and $\omega_{fd}$ represents the input speed in the final drive, which can be expressed as follows:
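For a rigid driveline, this reduces to the kinematic relation below (a sketch consistent with the definitions that follow):

$$
\omega_{fd} = \gamma_{fd}\,\frac{v}{R_{tire}}
$$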
where $R_{tire}$ represents the tire radius, and $v$ represents the vehicle's longitudinal speed. The vehicle dynamics are given as follows:
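A standard longitudinal force balance consistent with the definitions below (a sketch; the sign conventions are assumed) is:

$$
(M_{veh} + M_{eq})\,\frac{dv}{dt} = \frac{T_{wh}}{R_{tire}} - F_{brake} - F_{load}
$$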
where $F_{brake}$ represents the brake force, $M_{veh}$ represents the vehicle mass, $M_{eq}$ represents the equivalent mass of the rotating inertia of the vehicle components, and $F_{load}$ represents the road load force, including the grading resistance, which can be expressed as follows:
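A standard road-load model with grading resistance, consistent with the coefficient definitions below, is:

$$
F_{load} = f_0 + f_1 v + f_2 v^2 + M_{veh}\,g\sin\theta
$$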
where $f_0$, $f_1$, and $f_2$ are the road load coefficients, which have the units of N, N/(km/h), and N/(km/h)$^2$, respectively, and $\theta$ represents the road slope. Using this vehicle model, a vehicle simulation was conducted to train and test the proposed algorithm, as explained in the following section.
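To make the backward-looking evaluation concrete, the following Python sketch chains the relations above from a requested speed, acceleration, and slope to the battery electric power. All parameter values, the efficiency map, and the loss map are placeholders for illustration, not the values of Table 1 or the maps of Fig. 2.

```python
import math

# Placeholder vehicle parameters (illustrative only; not the values of Table 1)
M_VEH, M_EQ = 1500.0, 60.0          # vehicle mass and equivalent rotating mass [kg]
R_TIRE, GAMMA_FD = 0.31, 8.0        # tire radius [m], final-drive ratio [-]
F0, F1, F2 = 120.0, 1.0, 0.45       # road-load coefficients (here with v in m/s, for simplicity)
G = 9.81

def eta_elec(torque, speed):
    """Placeholder motor + converter efficiency map (a constant stands in for Fig. 2)."""
    return 0.9

def t_fd_loss(torque, speed):
    """Placeholder final-drive torque loss map."""
    return 0.02 * abs(torque)

def backward_step(v, accel, theta, brake_force=0.0):
    """Quasi-static backward calculation: (v, dv/dt, slope) -> battery power [W]."""
    f_load = F0 + F1 * v + F2 * v**2 + M_VEH * G * math.sin(theta)   # road load, incl. grade
    t_wh = ((M_VEH + M_EQ) * accel + f_load + brake_force) * R_TIRE  # wheel torque from the force balance
    w_fd = GAMMA_FD * v / R_TIRE                                     # final-drive input speed
    t_fd = t_wh / GAMMA_FD                                           # ideal final-drive input torque
    t_mot = t_fd + t_fd_loss(t_fd, w_fd)                             # add final-drive loss (assumed form)
    p_mech = t_mot * w_fd
    # Battery electric power: divide by efficiency when motoring, multiply when regenerating.
    return p_mech / eta_elec(t_mot, w_fd) if p_mech >= 0 else p_mech * eta_elec(t_mot, w_fd)

if __name__ == "__main__":
    # Example: 60 km/h, mild acceleration, 2% uphill grade
    print(f"p_bat = {backward_step(60 / 3.6, 0.2, math.atan(0.02)):.0f} W")
```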

III. MODEL-BASED REINFORCEMENT LEARNING FOR ECO-DRIVING STRATEGY

A. OPTIMAL CONTROL PROBLEM FORMULATION

The optimal control problem for an eco-driving strategy can be defined as minimizing the battery electric energy consumption for the state vector consisting of the vehicle speed and the traveling distance, $x = \{v, d\}$, while driving a given distance $D$ in a given time $T$, as follows:
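One plausible statement of this problem, consistent with the description above and the variable definitions that follow, is:

$$
\min_{u(t)} \int_{0}^{T} p_{bat}\big(v(t),u(t)\big)\,dt
\quad \text{s.t.} \quad
\dot{x} = f(x,u),\ \ T_{mot,min} \le u \le T_{mot,max},\ \ v_{min} \le v \le v_{max},\ \ v(0)=v_0,\ \ v(T)=v_f,\ \ d(T)=D
$$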
where $f$ represents the nonlinear vehicle dynamics explained in the previous section, and $u$ represents the control variable, i.e., the motor torque bounded by the maximum torque $T_{mot,max}$ and the minimum torque $T_{mot,min}$. $v_0$ and $v_f$ represent the initial vehicle speed and the final vehicle speed at $D$, respectively, and the vehicle speed is bounded by the minimum and maximum speeds $v_{min}$ and $v_{max}$.
To simplify the optimal control problem, it can be expressed as a one-state formulation using $dt = dd/v$, as in [22], in which the cost is the weighted sum of the battery SOC usage and the traveling time over the distance, as follows:
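One plausible distance-domain form of this cost, with the exact weighting of the two terms taken as an assumption, is:

$$
\min_{u(d)} \int_{0}^{D} \left( -\frac{dSOC}{dd} + \frac{w}{v(d)} \right) dd
$$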
where $w$ represents the weighting factor to be tuned for satisfying the total driving time. By transforming the optimal control problem from (7) to (8), the problem is defined according to the traveling distance $d$, while the vehicle's initial and final speed constraints remain the same. As mentioned in Section I, traffic-signal and car-following situations are not considered.

B. DETERMINISTIC DP

First, deterministic DP is used to solve the optimal control problem. The foregoing optimal control problem can be presented in discrete form as follows:
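That is, as a summation of stage costs over the distance steps (a sketch of the intended form):

$$
J = \sum_{k=0}^{N-1} L\big(v(k), u(k)\big)
$$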
where the index $k$ represents the discretization step for the $N$ segments, which are equally divided with unit distance $\Delta s$, and $L(v(k), u(k))$ represents the instantaneous cost incurred, which can be expressed as follows:
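Consistent with the distance-domain cost above, one plausible form of the stage cost is:

$$
L\big(v(k),u(k)\big) = \Delta SOC(k) + w\,\frac{\Delta s}{v(k)}
$$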
Then, the optimal solution can be obtained using the Bellman equation [23], as follows.
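The backward recursion has the standard Bellman form; the terminal condition shown here is an assumption:

$$
J_{N,N} = 0, \qquad
J_{k,N}\big(v(k)\big) = \min_{u(k)} \Big[ L\big(v(k),u(k)\big) + J_{k+1,N}\big(v(k+1)\big) \Big], \quad k = N-1,\ldots,0
$$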
Here, $J_{k,N}$ represents the cost function for traveling from step $k$ to $N$, which can be expressed in recursive form using the instantaneous cost $L(v(k), u(k))$ and the cost function for traveling from step $k+1$ to $N$, $J_{k+1,N}$. Using (13), an optimal speed profile based on the travel distance can be obtained, but deterministic DP is computationally inefficient and difficult to adapt to driving-condition changes. Thus, it has many limitations when used in real-time vehicle controllers.
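A minimal sketch of this backward recursion over a discretized speed grid is given below; the stage-cost callback and the grids are placeholders standing in for the stage cost and the discretization described in Section IV, not the paper's implementation.

```python
import numpy as np

def solve_dp(stage_cost, v_grid, n_segments, v0_idx, vf_idx):
    """Backward DP over distance segments for a speed-grid eco-driving problem.

    stage_cost(k, i, j) returns the stage cost for moving from speed v_grid[i]
    to v_grid[j] on segment k (placeholder); infeasible transitions return np.inf.
    """
    n_v = len(v_grid)
    J = np.full((n_segments + 1, n_v), np.inf)
    policy = np.zeros((n_segments, n_v), dtype=int)
    J[n_segments, vf_idx] = 0.0                      # terminal speed constraint
    for k in range(n_segments - 1, -1, -1):          # backward sweep (Bellman recursion)
        for i in range(n_v):
            costs = [stage_cost(k, i, j) + J[k + 1, j] for j in range(n_v)]
            policy[k, i] = int(np.argmin(costs))
            J[k, i] = costs[policy[k, i]]
    # Forward pass to recover the optimal speed profile from the initial speed
    idx, profile = v0_idx, [v_grid[v0_idx]]
    for k in range(n_segments):
        idx = policy[k, idx]
        profile.append(v_grid[idx])
    return np.array(profile), J[0, v0_idx]
```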

C. MODEL-BASED REINFORCEMENT LEARNING

First, the optimal control problem can be expressed as minimizing the expected total cost over an infinite horizon, as follows:
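In the standard discounted form assumed here:

$$
J_{\pi}(x_0) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, g\big(x_k, \pi(x_k)\big)\right]
$$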
where $J_{\pi}(x_0)$ represents the cost for the initial condition $x_0$ and control policy $\pi$, $\gamma$ represents the discount factor, and $g$ represents the instantaneous cost, which can be expressed as follows:
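One plausible form, combining the SOC usage, the time term, and the speed penalty defined below, is:

$$
g(x_k, u_k) = \Delta SOC_k + w\,\Delta t_k + \eta(v_k)
$$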
Here, $\eta(v_k)$ represents the penalty cost that is applied when the vehicle speed is higher than $v_{max}$ or lower than $v_{min}$, as follows:
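A piecewise-constant form consistent with this description is:

$$
\eta(v_k) = \begin{cases} cost_{penalty}, & v_k > v_{max} \ \text{or}\ v_k < v_{min} \\ 0, & \text{otherwise} \end{cases}
$$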
Here, $cost_{penalty}$ is a positive constant. The optimal control problem is defined over an infinite horizon; thus, the generated control policy is time-invariant, which can be easily implemented on a real-time vehicle controller. The state variable $x$ is defined as follows:
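As stated later in this section, the state collects the vehicle speed, the height, and the road slope:

$$
x_k = [\,v_k,\ h_k,\ \theta_k\,]
$$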
where $h$ represents the height, and $\theta$ represents the road slope. $x$ is discretized as follows:
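That is, the state is restricted to a finite grid (a sketch of the intended notation):

$$
x \in \{v^1,\ldots,v^{N_v}\} \times \{h^1,\ldots,h^{N_h}\} \times \{\theta^1,\ldots,\theta^{N_\theta}\}
$$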
where $N_v$, $N_h$, and $N_{\theta}$ represent the numbers of discretized speed, height, and road slope values, respectively, and the control variable is also discretized as follows:
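With the control input likewise restricted to a finite set:

$$
u \in \{u^1, u^2, \ldots, u^{N_u}\}
$$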
where $N_u$ represents the number of discretized control inputs.
Therefore, in this study, the eco-driving control policy was determined according to the current vehicle speed, height, and road slope. Among them, the vehicle speed and road slope directly affect the cost values. The height, in contrast, does not directly affect the instantaneous cost value, but it reflects the future cost associated with the road driving environment. Combined with the road slope, the height can be expressed as a state of the Q function indicating the expected total cost of future energy use. That is, even in the same uphill situation, the optimal driving speed of the vehicle may vary according to the current height, which is obvious when considering the relationship among the kinetic energy of the vehicle, the potential energy, and the corresponding energy consumption of driving the vehicle. Therefore, by considering the height as a state variable along with the vehicle speed and the road slope, it is possible to better represent the value of the cost-to-go in the Q function and a probabilistic driving situation than when only the road slope is considered.
Based on Q-learning [24], the optimal cost $J^{*}(x_k)$ and the optimal control policy $\pi^{*}(x_k)$ can be expressed as follows:
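In the standard Q-learning notation for cost minimization, these are:

$$
J^{*}(x_k) = \min_{u} Q(x_k, u), \qquad \pi^{*}(x_k) = \arg\min_{u} Q(x_k, u)
$$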
Then, the Q function can be updated as follows:
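Using the standard temporal-difference update for a cost-minimizing Q function, with a learning rate $\alpha$ introduced here for notation:

$$
Q(x_k,u_k) \leftarrow (1-\alpha)\,Q(x_k,u_k) + \alpha\Big[\,g_k + \gamma \min_{u'} Q(x_{k+1},u')\,\Big]
$$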
In this study, to solve the optimal control problem, a novel eco-driving strategy utilizing MBRL was developed on the basis of a previous study on MBRL for the HEV control case in [21]. In (17), the state variable $x_k = [v_k, h_k, \theta_k]$ is partially stochastic (the driving environment $h_k$ and $\theta_k$ can be considered stochastic without preview terrain information), but it is possible to predict $v_k$ and $g_k$ deterministically based on the vehicle powertrain dynamics equations (1)–(6) and the cost equation (15) for the given driving conditions $h_k$ and $\theta_k$ and the given control input $u$. Thus, the domain knowledge of the known vehicle dynamic model and powertrain system can be used in reinforcement learning, while the remaining model uncertainty due to modeling errors or other driving environments that are difficult to model can still be learned in a model-free manner.
On the basis of this observation, we developed a new MBRL algorithm for the eco-driving strategy. The overall algorithm is presented in Fig. 3 and Algorithm 1. In the new algorithm, the agent's learning takes place based on an approximation model using the deterministic variables, while the stochastic variables are reflected in the agent's learning through experience replay. Usually, in Q-learning, the agent derives the action $u_k$ using methods such as $\epsilon$-greedy (exploitation and exploration) according to the Q function value and the current state $x_k$, and conducts learning by updating the Q function value using the observation of the reward $g_k$ and the next state $x_{k+1}$. Alternatively, in the new algorithm, the agent derives the greedy control input using the Q function and observes the reward and the next variable (exploitation), but the observed information is used to build an approximation model, and by using it, various control inputs are tested for a given stochastic state transition (exploration). In other words, according to the experience of the driving environment $h_k$ and $\theta_k$, learning with experience replay is conducted to optimize the Q function value, as shown in the "for" loop in Algorithm 1. With the estimated next speed $\hat{v}_{k+1}$ and the estimated reward $\hat{g}_k$, the Q function value can be updated for different vehicle speeds $v \in \{v^1, v^2, v^3, \ldots, v^{N_v}\}$ and control inputs $u \in \{u^1, u^2, u^3, \ldots, u^{N_u}\}$, as follows:
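A plausible form of this replay update, applying the temporal-difference rule to every grid speed $v^i$ and control $u^j$ under the experienced environment transition $(h_k,\theta_k) \rightarrow (h_{k+1},\theta_{k+1})$, is:

$$
Q\big([v^i, h_k, \theta_k], u^j\big) \leftarrow (1-\alpha)\,Q\big([v^i, h_k, \theta_k], u^j\big) + \alpha\Big[\,\hat{g}_k + \gamma \min_{u'} Q\big([\hat{v}_{k+1}, h_{k+1}, \theta_{k+1}], u'\big)\Big]
$$

where $\hat{g}_k$ and $\hat{v}_{k+1}$ are evaluated for the pair $(v^i, u^j)$.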
where $\hat{g}_k$ and $\hat{v}_{k+1}$ can be determined based on the vehicle powertrain model. Alternatively, $\hat{g}_k$ and $\hat{v}_{k+1}$ can be determined based on an approximation model to keep the model-free characteristic of reinforcement learning (see [21]); this approximation can be performed based on experience, as shown in the following equations:
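Following the experience-based approximation of [21], an exponential-smoothing update of the stored estimates is one plausible choice:

$$
\hat{g}(x_k,u_k) \leftarrow \hat{g}(x_k,u_k) + \alpha_g\big(g_k - \hat{g}(x_k,u_k)\big), \qquad
\hat{v}(x_k,u_k) \leftarrow \hat{v}(x_k,u_k) + \alpha_v\big(v_{k+1} - \hat{v}(x_k,u_k)\big)
$$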
where $\alpha_g$ and $\alpha_v$ represent the learning rates. According to the experience of the stochastic state transition from $(h_k, \theta_k)$ to $(h_{k+1}, \theta_{k+1})$, the deterministic state transition from $v_k$ to $\hat{v}_{k+1}$ and the reward $\hat{g}_k$ can be estimated to optimize the Q function value. In this study, to simplify the problem, the control input $u$ was defined as a relative offset of the vehicle speed instead of the motor torque, as follows:
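That is, the control becomes a speed increment in multiples of the grid step $\delta$; the number of steps shown here is illustrative:

$$
v_{k+1} = v_k + u_k, \qquad u_k \in \{-2\delta,\ -\delta,\ 0,\ +\delta,\ +2\delta\}
$$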
Here, $\delta$ represents the unit speed for the discretization in (18). Then, the estimation of $\hat{v}_{k+1}$ is not required, as $\hat{v}_{k+1}$ is determined directly by $u$, and the control policy can be expressed in a more intuitive form as the direction of reducing or increasing the vehicle speed in several steps.
The key feature of this algorithm is that it separates the model-based insight into the vehicle powertrain, which can be estimated relatively well, from the various driving situations that are difficult to estimate, so that the control policy is extracted more effectively using reinforcement learning. Additionally, the proposed algorithm has the advantage of experience replay, which accelerates the convergence and enhances the stability. Furthermore, the control $u$ can be tested in the nested "for" loop of the experience replay process before it is used in the real greedy action; thus, irrelevant control inputs can be excluded to prevent fatal system errors or to optimize the controller performance. Additionally, to reduce the computational burden, the "for" loop in the algorithm can be replaced by prioritized sweeping, without searching all the $v$ and $u$ values.
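The following Python sketch illustrates the structure of this training loop under the assumptions above: greedy action selection, an experience- or model-based reward estimate, and replay over all grid speeds and controls for each observed environment transition. Class, function, and variable names are illustrative and not taken from the paper; the `reward_model` callback stands in for either the powertrain-based estimate or the experience-updated approximation described above.

```python
import numpy as np

class MBRLEcoDriver:
    """Sketch of model-based Q-learning with experience replay, in the spirit of Algorithm 1."""

    def __init__(self, n_v, n_h, n_theta, u_offsets, reward_model, gamma=0.95, alpha=0.1):
        self.n_v, self.u_offsets = n_v, u_offsets          # u is a speed-offset index (assumed form)
        self.reward_model = reward_model                   # reward_model(iv, iu, ih, itheta) -> cost estimate
        self.gamma, self.alpha = gamma, alpha
        self.Q = np.zeros((n_v, n_h, n_theta, len(u_offsets)))

    def act(self, iv, ih, itheta):
        """Greedy (exploitation) control index for the current discretized state."""
        return int(np.argmin(self.Q[iv, ih, itheta]))

    def replay(self, ih, itheta, ih_next, itheta_next):
        """Experience replay: combine the observed environment transition
        (h, theta) -> (h', theta') with the deterministic speed/reward model
        to update Q for every grid speed and every control input."""
        for iv in range(self.n_v):
            for iu, off in enumerate(self.u_offsets):
                iv_next = int(np.clip(iv + off, 0, self.n_v - 1))   # deterministic speed transition
                target = self.reward_model(iv, iu, ih, itheta) + \
                         self.gamma * np.min(self.Q[iv_next, ih_next, itheta_next])
                self.Q[iv, ih, itheta, iu] += self.alpha * (target - self.Q[iv, ih, itheta, iu])
```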

IV. VEHICLE SIMULATION

A simulation based on the vehicle model described in Section II was performed. For the vehicle simulation and the development of the reinforcement learning algorithm, MATLAB was used. Information regarding the driving environment, i.e., $h_k$ and $\theta_k$, was recorded during real-world driving and used in the simulation. The height and road slope profiles are shown in Fig. 4. The distance of the driving cycle was 10 km, and it was divided into equal intervals of 10 m.
[Fig. 4: height and road slope profiles of the driving cycles]
The slope was assumed to be piecewise constant. The parameters used in the reinforcement learning algorithm are presented in Table 2. For discretization, the nearest-neighbor method was used. The vehicle speed was discretized by 1 km/h from 0 to 100 km/h, and the height was discretized by 5 m. The road slope was discretized by 1%. The weighting factor $w$, which should be defined to satisfy the desired traveling time for a given driving environment, was assumed to be 0.004. For the battery SOC, the initial value was defined as 70%, and the initial vehicle speed was defined as 60 km/h. The initial value of the approximation model was defined roughly using (1)–(6) and was updated during the learning process.
[Table 2: parameters used in the reinforcement learning algorithm]

A. LEARNING CURVE AND CONTROL POLICY

Using the proposed algorithm, a learning process was conducted to determine the energy-efficient speed trajectory using driving cycles A, B, and C separately. With an initial speed of 60 km/h, the vehicle speed was generated for each driving cycle, and learning was performed 500 times. The learning curve resulting from the learning process utilizing driving cycle A is presented in Fig. 5. As shown, the sum of the instantaneous cost in (15) rapidly decreased and converged as the learning was repeated, indicating that the learning process was successful. The resulting speed trajectories and heights with respect to the traveling distance, as well as the motor torque profiles for all the driving cycles, are presented in Fig. 6. Generally, the vehicle speed decreased when the vehicle traveled uphill and increased when the vehicle traveled downhill. Using the proposed algorithm, the optimal velocity profile utilizing the slope and height information of the terrain was learned. As an example, the control policy extracted from the Q function using driving cycle A is presented in Fig. 7, where the optimal speed command for either increasing or decreasing the speed is shown according to the vehicle's current speed and slope. Here, the trend of the control according to the current speed and slope value was confirmed: for a low speed, the control policy increased the speed actively, and higher slope values tended to reduce the speed. In all three cycles, the vehicle's speed was maintained between approximately 60 and 80 km/h. This is because the same weighting coefficient $w$ was used, and correspondingly, the tendency to control toward this target speed range can be confirmed in the extracted control policy.
[Fig. 5: learning curve. Fig. 6: speed, height, and motor torque profiles. Fig. 7: extracted control policy.]

B. COMPARISON WITH DETERMINISTIC DP RESULT AND CRUISE CONTROL RESULT

The energy saving performance of the given speed trajectory was evaluated by comparing it with the result of DP, as well as with general cruise control, in which the vehicle is driven at a constant speed. As mentioned in Section III, DP can provide the global optimal solution; thus, the solution of DP can be used as a benchmark. However, in contrast to DP, in which the initial and final conditions of the state can be defined, in the proposed MBRL algorithm, the final speed cannot be determined in advance. Thus, according to the MBRL result, the initial and final speed results ($v_0$ and $v_f$, respectively) were set as constraints in DP, to compare the energy saving performance fairly by making the remaining kinetic energy of the vehicle identical between the two cases. For cruise control, the vehicle is driven at an average speed equal to the MBRL result $v_{ave}$, and the initial and final speed constraints are applied.
The simulation results are presented in Table 3 and Fig. 8. Table 3 presents the traveling time, SOC usage, and percent energy saving of SOC usage with respect to the cruise control result. The traveling time was an important factor, as driving the same distance more slowly tended to use less battery SOC. In the simulation, the traveling time results for DP, MBRL, and cruise control were close, with a maximum difference of 0.6%, which is negligible. With regard to the battery SOC use, DP was the most efficient of the three approaches, followed by MBRL. Compared with cruise control, DP and MBRL exhibited average energy savings of 3.4% and 2.0%, respectively. Fig. 8 shows the speed profiles and battery SOC trajectories. The speed profiles obtained from MBRL were similar to those obtained from DP, as expected from the energy saving performance results. However, there was a difference between the DP and MBRL results, even though training was successfully conducted using the entire driving cycle information in MBRL. This can be explained by the problem definition: in MBRL, the optimal control problem is defined as minimizing the expected total cost over an infinite horizon, whereas in DP, it is defined over a finite horizon. This results in a higher energy saving for DP, while the control policy of MBRL can be used as an offline real-time controller in a stochastic manner. However, for various driving environment scenarios, MBRL shows good performance, because MBRL has learned the optimal behavior well using the transition probability of the vehicle's driving environment from $\{h_k, \theta_k\}$ to $\{h_{k+1}, \theta_{k+1}\}$ through the learning process. Unlike deterministic DP, in which the entire driving environment information should be given in advance, or stochastic DP, in which a model for the transition probability matrix of the driving environment should be given in advance [25], in reinforcement learning, the agent brings the transition probability distribution of the driving environment into the optimization problem through the interaction with the driving environment. Therefore, it is possible to derive the optimal control through learning based on model-free characteristics.
[Table 3: simulation results. Fig. 8: speed profiles and battery SOC trajectories.]

C. PERFORMANCE FOR LEARNING WITH COMBINED DRIVING CYCLES

The performance of the algorithm when the learning process was conducted with various driving cycles was evaluated. Here, all the driving cycles (A, B, and C) were utilized for the training process, and the resulting Q function was employed as an offline control policy for simulation using each driving cycle. Thus, by testing the control policy obtained from the learning process using combined driving cycles, we checked whether the algorithm could maintain its performance when it revisited an area previously learned. The simulation results are presented in Fig. 9 and Table 4. In Fig. 9, the speed trajectories obtained from learning with a specific driving cycle only and from learning with all driving cycles are compared, both using MBRL. As shown, the vehicle speed profiles were similar for all the driving cycles, indicating that the driving-cycle information (once learned in MBRL) could be stored in the Q function value and that the improved performance could be repeated when the vehicle revisited a driving environment; however, there was a small reduction in the energy saving performance. In Table 4, the energy saving performance of the speed profile based on learning with all driving cycles was compared with that of cruise control. Similar to the previous comparison, the final speed and traveling time were applied as constraints for generating the cruise control speed profile. The results indicated that there was still a meaningful energy saving, even though the percentage improvement was reduced compared with learning using a single driving cycle in Table 3. Therefore, the offline control policy (represented as the optimized Q function) can be utilized as a real-time controller for the eco-driving strategy.

V. CONCLUSION

A reinforcement learning algorithm was developed for the eco-driving of an EV, and through this, we confirmed that the reinforcement learning method can be applied well to the eco-driving problem. In particular, we showed that using MBRL, an energy-saving optimal speed trajectory utilizing the road slope of the driving cycle can be acquired, and it was compared with the optimal solution using the existing DP method and with the cruise control case, demonstrating the effectiveness and feasibility of the reinforcement learning-based approach. The proposed MBRL algorithm separates the vehicle energy-consumption model from the driving environment; thus, learning can be conducted efficiently with the domain knowledge of vehicle dynamics and the powertrain model, while model-free characteristics are maintained by updating the approximation model with experience replay. The proposed algorithm exhibited an energy saving of 1.2%–3.0% compared with cruise control and behavior similar to that of DP. Additionally, we showed that by training the control policy with combined driving cycles and testing it on separate specific driving cycles, the control policy can be used as an offline real-time controller. The limitation of this study is that we only used road slope information to generate an optimal speed profile among diverse driving environments; other constraints associated with traffic signals or other vehicles on the road were not included in the learning process. However, approaches based on reinforcement learning have advantages for dealing with model uncertainty using the interaction between the agent and the environment; thus, we expect that these constraints and disturbances can be modeled and that an optimal control policy can be learned successfully in a stochastic manner. This should be investigated in a future study.