【论文笔记】交通信号控制强化学习的研究进展:模型与评价综述

博客声明:本文仅为个人论文阅读笔记,大部分原文对照的中文为翻译而来,只对其中错误明显的部分作了修改。其他一些个人理解不到位或有误的地方也敬请见谅。

  • 标题原文:Recent Advances in Reinforcement Learning for Traffic Signal Control: A Survey of Models and Evaluation(2021)
  • 论文来源:宾州州立大学的团队主页 https://traffic-signal-control.github.io/
  • 论文DOI:未公开发表?
  • 关键词:Traffic Signal Control、Reinforcement Learning

0 摘要

概括:

  • 调查了交通信号控制问题下,强化学习方向的最近进展(截止2021年)
  • 根据RL技术对已知方法进行分类,回顾了现有模型,分析了它们的优缺点
  • 概述了评估现有模型的模拟环境和实验设置
  • 探讨了该问题下强化学习方向的未来展望

Traffic signal control is an important and challenging real-world problem that has recently received a large amount of interest from both transportation and computer science communities. In this survey, we focus on investigating the recent advances in using reinforcement learning (RL) techniques to solve the traffic signal control problem. We classify the known approaches based on the RL techniques they use and provide a review of existing models with analysis on their advantages and disadvantages. Moreover, we give an overview of the simulation environments and experimental settings that have been developed to evaluate the traffic signal control methods. Finally, we explore future directions in the area of RL-based traffic signal control methods. We hope this survey could provide insights to researchers dealing with real-world applications in intelligent transportation systems.
·
交通信号控制是一个重要且具有挑战性的现实问题,近来受到了交通与计算机科学两个领域的大量关注。在本综述中,我们重点调查了使用强化学习(RL)技术解决交通信号控制问题的最近进展。我们根据所使用的RL技术对已知方法进行了分类,回顾了现有模型,并分析了它们的优缺点。此外,我们概述了为评估交通信号控制方法而开发的模拟环境和实验设置。最后,我们探讨了基于RL的交通信号控制方法的未来发展方向。我们希望这项综述能够为从事智能交通系统中现实世界应用的研究人员提供见解。


1 介绍

概括:

  • 介绍交通拥堵,引出其中一个解决思路-交通信号灯控制
  • 简单地对交通信号灯控制的问题进行了描述,以及基于RL的方法和基于Deep RL的解决思路
  • 概述本文其他章节的主题

Traffic congestion is a growing problem that continues to plague urban areas with negative outcomes to both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $166 billion in lost productivity and wasted over 3.1 billion gallons of fuel [15]. Traffic congestion was also attributed to over 56 billion pounds of harmful CO2 emissions in 2011 [54]. Mitigating congestion would have significant economic, environmental, and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control plays a vital role in urban traffic management.
·
交通拥堵是一个日益严重的问题,持续困扰着城市地区,对出行的公众和整个社会都产生了负面影响。随着越来越多的人涌向城市地区,这些负面后果只会随着时间的推移而加剧。2014年,交通拥堵使美国人损失了超过1660亿美元的生产力,并浪费了超过31亿加仑的燃料[15]。2011年,交通拥堵还造成了超过560亿磅的有害二氧化碳排放[54]。缓解交通拥堵将带来显著的经济、环境和社会效益。信号交叉口是城市环境中最常见的瓶颈类型之一,因此交通信号控制在城市交通管理中起着至关重要的作用。

The typical approach that conventional transportation methods take is to cast traffic signal control as an optimization problem under certain assumptions about the traffic model, e.g., vehicles come in a uniform and constant rate [52]. Various assumptions have to be made in order to make the optimization problem tractable. These assumptions, however, usually deviate from the real world, where the traffic condition is affected by many factors such as driver’s preference, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), weather and road conditions. These factors can hardly be fully described in a traffic model. For a more comprehensive survey of the methods in transportation, we refer the interested readers to [47; 52; 40; 31; 17; 50; 64].
`
传统交通领域方法的典型思路是,在对交通模型作出一定假设的前提下,将交通信号控制转化为一个优化问题,例如假设车辆以均匀且恒定的速率到达路口[52]。为了使优化问题易于处理,必须做出各种假设。然而,这些假设通常会偏离现实世界:现实中的交通状况会受到许多因素的影响,如司机的偏好、与弱势道路使用者(如行人、骑自行车的人等)的互动、天气和道路状况。这些因素很难在交通模型中完全描述。若想更全面地了解交通领域的相关方法,我们建议感兴趣的读者参考[47;52;40;31;17;50;64]。

On the other hand, reinforcement learning methods can directly learn from the observed data without making unrealistic assumptions about the traffic model, by first taking actions to change the signal plans and then learning from the outcomes. In essence, an RL-based traffic signal control system observes the traffic condition first, then generates and executes different actions (i.e., traffic signal plans). It will then learn and adjust the strategies based on the feedback from the environment. However, in traditional RL-based methods, the states in an environment are required to be discretized and low-dimensional, which is one of the major limitations of the traditional approaches.
`
另一方面,强化学习方法可以直接从观察到的数据中学习,而无需对交通模型做出不切实际的假设,首先采取行动改变信号计划,然后从结果中学习。本质上,基于RL的交通信号控制系统首先观察交通状况,然后生成并执行不同的动作(即交通信号计划)。然后,它会根据环境的反馈来学习和调整策略。然而,在传统的基于RL的方法中,需要对环境中的状态进行离散化和低维化,这是传统方法的主要局限性之一。

Recent advances in RL, especially deep RL, offer the opportunity to efficiently work with high dimensional input data (like images), where the agent can learn a state abstraction and a policy approximation directly from its input states. A series of related studies using deep RL for traffic signal control have appeared in the past few years. This survey is to provide an overview on the recent RL-based traffic signal control approaches, including the state-of-the-art methods and their experimental settings for evaluation. In this survey, we first introduce the formulation of traffic light control problems under RL, and then classify and discuss the current RL control methods from different aspects: agent formulation, policy learning approach, and coordination strategy when facing multiple intersections. In the third section, we review how current methods are evaluated, including simulators and experimental settings that affect the performance of these methods. We then discuss some future research directions. While [39; 71] provide surveys mainly on earlier studies before the popularity of deep RL, in this survey, we will mainly cover the recent deep RL methods. With the increasing interest on RL-based control mechanisms in intelligent transportation systems [24], such as autonomous driving [67] and road control [46; 68], we hope this survey could also provide insights on dealing with real-world challenges for other applications in intelligent transportation systems.
`
最近RL方面的进展,特别是深度RL,为高效处理高维输入数据(如图像)提供了机会,智能体可以直接从输入状态中学习状态抽象和策略近似。过去几年中,出现了一系列使用深度RL进行交通信号控制的相关研究。本综述旨在概述最近基于RL的交通信号控制方法,包括最先进的方法及其用于评估的实验设置。在本综述中,我们首先介绍RL下交通信号控制问题的形式化建模,然后从智能体设计、策略学习方法以及面对多交叉口时的协调策略等不同方面,对当前的RL控制方法进行分类和讨论。在第三部分中,我们回顾了当前方法是如何被评估的,包括影响这些方法性能的模拟器和实验设置,随后对今后的研究方向进行了探讨。文献[39;71]的综述主要涵盖深度RL流行之前的早期研究,而本综述将主要覆盖最近的深度RL方法。随着人们对智能交通系统[24]中基于RL的控制机制(如自动驾驶[67]和道路控制[46;68])的兴趣日益增加,我们希望这项综述也能为智能交通系统中的其他应用应对现实世界的挑战提供见解。


2 背景

概括:

  • 描述了RL框架:状态、动作、转移概率、奖励、折扣因子,其解决问题的基本思路
  • 基于RL解决交通信号控制问题的常规设置,其中包括单交叉口和多交叉口

In this section, we first describe the reinforcement learning framework which constitutes the foundation of all the methods presented in this paper. We then provide background on conventional RL-based traffic signal control, including the problem of controlling a single intersection and multiple intersections.
`
在这一节中,我们首先描述了强化学习框架,它构成了本文中提出的所有方法的基础。然后介绍了传统基于RL的交通信号控制的背景,包括单个交叉口和多个交叉口的控制问题。

2.1 强化学习

Usually a single agent RL problem is modeled as a Markov Decision Process represented by <S, A, P, R, γ>, where their definitions are given as follows:

通常将单个智能体RL问题建模为用 <S, A, P, R, γ>表示的马尔科夫决策过程,其定义如下:

  • Set of state representations S: At time step t, the agent observes state $s_t \in S$.
    状态集合S:在时间步t,智能体观测到状态 $s_t \in S$。
  • Set of actions A and state transition function P: At time step t, the agent takes an action $a_t \in A$, which induces a transition in the environment according to the state transition function $P(s_{t+1}|s_t, a_t): S \times A \to S$.
    动作集合A和状态转移函数P:在时间步t,智能体采取动作 $a_t \in A$,并根据状态转移函数 $P(s_{t+1}|s_t, a_t): S \times A \to S$ 使环境发生状态转移。
  • Reward function R: At time step t, the agent obtains a reward $r_t$ by a reward function $R(s_t, a_t): S \times A \to \mathbb{R}$.
    奖励函数R:在时间步t,智能体通过奖励函数 $R(s_t, a_t): S \times A \to \mathbb{R}$ 获得奖励 $r_t$。
  • Discount factor γ: The goal of an agent is to find a policy that maximizes the expected return, which is the discounted sum of rewards: $G_t := \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, where the discount factor $\gamma \in [0, 1]$ controls the importance of immediate rewards versus future rewards. Here, we only consider continuing agent-environment interactions which do not end with terminal states but go on continually without limit.
    折扣因子γ:智能体的目标是找到一个最大化期望回报的策略,期望回报是奖励的折扣和:$G_t := \sum_{i=0}^{\infty} \gamma^i r_{t+i}$,其中折扣因子 $\gamma \in [0, 1]$ 控制了即时奖励相对于未来奖励的重要性。在这里,我们只考虑持续进行的智能体与环境的交互,这类交互不会以终止状态结束,而是无限地持续下去。

Solving a reinforcement learning task means, roughly, finding an optimal policy $\pi^*$ that maximizes expected return. While the agent only receives reward about its immediate, one-step performance, one way to find the optimal policy $\pi^*$ is by following an optimal action-value function or state-value function. The action-value function (Q-function) of a policy $\pi$, $Q_\pi: S \times A \to \mathbb{R}$, is the expected return of a state-action pair, $Q_\pi(s, a) = \mathbb{E}_\pi[G_t|s_t = s, a_t = a]$. The state-value function of a policy $\pi$, $V_\pi: S \to \mathbb{R}$, is the expected return of a state, $V_\pi(s) = \mathbb{E}_\pi[G_t|s_t = s]$.
`
大致而言,解决一个强化学习任务意味着找到一个最大化期望回报的最优策略 $\pi^*$。虽然智能体只能获得关于其即时、单步表现的奖励,但找到最优策略 $\pi^*$ 的一种方法是遵循最优的动作价值函数或状态价值函数。策略π的动作价值函数(Q函数)$Q_\pi: S \times A \to \mathbb{R}$,是状态-动作对的期望回报:$Q_\pi(s, a) = \mathbb{E}_\pi[G_t|s_t = s, a_t = a]$。策略π的状态价值函数 $V_\pi: S \to \mathbb{R}$,是状态的期望回报:$V_\pi(s) = \mathbb{E}_\pi[G_t|s_t = s]$。
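个人补充:根据上面的定义,回报、Q函数与V函数之间满足如下标准关系(即贝尔曼期望方程,来自一般RL教材,非原文内容):

$$
G_t = r_t + \gamma G_{t+1}, \qquad
Q_\pi(s,a) = \mathbb{E}_\pi\big[r_t + \gamma V_\pi(s_{t+1}) \,\big|\, s_t=s, a_t=a\big], \qquad
V_\pi(s) = \sum_{a} \pi(a|s)\, Q_\pi(s,a).
$$

后文的基于价值的方法(如DQN)正是通过近似 $Q_\pi$ 来隐式地得到策略。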

2.2 问题设置

We now introduce the general setting of RL-based traffic signal control problem, in which the traffic signals are controlled by an RL agent or several RL agents. In single traffic signal control problem, the environment is the traffic conditions on the roads, and the agent controls the traffic signal. At each time step t, a description of the environment (e.g., signal phase, waiting time of cars, queue length of cars, and positions of cars) will be generated as the state st. The agent will predict the next action at to take that maximizes the expected return, where the action could be changing to a certain phase in the single intersection scenario. The action at will be executed in the environment, and a reward rt will be generated, where the reward could be defined on traffic conditions of the intersection. Usually, in the decision process, an agent combines the exploitation of learned policy and exploration of a new policy.
`
我们现在介绍基于RL的交通信号控制问题的常规设置,其中交通信号由一个或多个RL智能体控制。在单交叉口信号控制问题中,环境是道路上的交通状况,智能体控制交通信号。在每个时间步t,环境的描述(例如信号相位、车辆等待时间、车辆排队长度和车辆位置)会被生成为状态 $s_t$。智能体将预测下一步要采取的、能使期望回报最大化的动作 $a_t$;在单交叉口场景中,动作可以是切换到某个特定相位。动作 $a_t$ 在环境中被执行后会生成奖励 $r_t$,奖励可以基于该交叉口的交通状况来定义。通常,在决策过程中,智能体会把对已学习策略的利用(exploitation)和对新策略的探索(exploration)结合起来。
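个人补充:下面用一个极简的伪环境来示意上述"观测状态-执行相位动作-获得奖励"的交互循环。类名、维度和奖励形式均为个人假设,仅用于说明问题设置,并非论文中的代码。

```python
# 单交叉口RL控制交互循环的最小示意(伪环境,非论文实现)
import random

class DummyIntersectionEnv:
    """假想的单交叉口环境:状态为各车道排队长度+当前相位,动作为选择下一个相位。"""
    def __init__(self, num_lanes=8, num_phases=4):
        self.num_lanes = num_lanes
        self.num_phases = num_phases
        self.phase = 0

    def reset(self):
        self.phase = 0
        return [0] * self.num_lanes + [self.phase]

    def step(self, action):
        self.phase = action                                               # 执行动作:切换到所选相位
        queues = [random.randint(0, 10) for _ in range(self.num_lanes)]   # 模拟新的排队情况
        state = queues + [self.phase]                                     # 状态 s_{t+1}
        reward = -sum(queues)                                             # 奖励:负的总排队长度
        return state, reward

env = DummyIntersectionEnv()
state = env.reset()
for t in range(100):
    action = random.randrange(env.num_phases)   # 这里用随机策略占位,实际由RL智能体给出
    state, reward = env.step(action)
```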

In multi-intersection traffic signal control problem, there are N traffic signals in the environment, controlled by one or several agents. The goal of the agent(s) is to learn the optimal policies to optimize the traffic condition of the whole environment. At each timestep t, each agent i observes part of the environment as the observation $o_i^t$ and makes predictions on the next actions $a^t = (a_1^t, \ldots, a_N^t)$ to take. The actions will be executed in the environment, and the reward $r_i^t$ will be generated, where the reward could be defined on the level of individual intersections or a group of intersections within the environment. We refer readers interested in more detailed problem settings to [8].
`
在多交叉口交通信号控制问题中,环境中有N个交通信号,由一个或多个智能体控制。智能体的目标是学习最优策略,以优化整个环境的交通状况。在每个时间步t,每个智能体i观测到环境的一部分作为其观测 $o_i^t$,并对接下来要采取的动作 $a^t = (a_1^t, \ldots, a_N^t)$ 做出预测。这些动作将在环境中被执行,并生成奖励 $r_i^t$,奖励可以定义在单个交叉口的层面,也可以定义在环境中一组交叉口的层面。我们建议对更详细的问题设定感兴趣的读者参考文献[8]。


3 基于RL的交通信号灯控制

概括:

  • 机遇:RL和Deep RL解决交通信号控制问题的优点
  • 智能体设计
    • 奖励:
      • 平均队列长度、平均等待时间、平均速度或吞吐量;
      • 其他包括:信号变化频率和紧急停车次数
      • 单交叉口场景用队列长度、多交叉口场景用压力(队列长度的变体)作为奖励,被证明等价于优化全局行驶时间
    • 状态:建议使用车道级队列长度和相位等简单的状态表示
    • 动作方案
      • 固定循环序列
        • 设置当前相位持续时间
        • 设定相位持续时间与预定义总周期持续时间的比值
        • 在预定义循环相位序列中切换到下一相位
      • 灵活相位序列
        • 在一组相位中选择要变换的相位
    • 策略学习
      • 基于模型的方法:对状态之间的转移概率进行显式建模
      • 无模型的方法:估计state-action对的回报,以此为基础选择动作
        • 基于值的方法:通过近似state-value函数和state-action函数来隐式获得策略,大多采用DQN
        • 基于策略的方法:将策略参数向能够最大化预期收益的方向更新。其中被广泛采用的actor-critic框架,结合了基于价值和基于策略的方法的优点
        • 深度强化学习:上述方法中通过使用深度神经网络来逼近价值函数
    • 协调:从单交叉口扩展到多交叉口需要面临的问题
      • 联合动作学习:使用一个全局智能体来控制所有的交叉点,其中有全局Q函数和max-plus算法
      • 独立动作学习:使用独立的智能体,各自控制一个交叉路口
        • 有通信:获取相邻智能体的交通状况或过往动作,不只是利用自我的局部交通状况
        • 无通信:智能体观察各自的本地环境,不使用通信来解决冲突

In this section, we introduce three major aspects investigated in recent RL-based traffic signal control literature: agent formulation, policy learning approach and coordination strategy.
`
在本节中,我们介绍最近基于RL的交通信号控制文献中研究的三个主要方面:智能体设计、策略学习方法和协调策略。

3.1 机遇

In this subsection, we point out some high-level discussions about why RL and deep RL are appropriate for the traffic signal control problem.
`
在这个小节中,我们指出了一些关于为什么RL和深度RL适用于交通信号控制问题的高层讨论。

Reinforcement learning methods learn from trial-and-error without making unrealistic assumptions on the traffic model. The typical approach that conventional transportation methods take is to cast traffic signal control as an optimization problem under certain assumptions about the traffic model. For example, Webster’s Formula method [28] is one of the widely-used methods in the field for a single intersection. Assuming the traffic flow is uniform during a certain period (i.e., past 5 or 10 minutes), it has a closed-form solution after optimization [52]. Other methods like Maxband [35] or SCATS [38] also make similar assumptions to make the optimization problem tractable. The key issue here is that these assumptions often deviate from the real world. The real-world traffic condition evolves in a complicated way, affected by many factors such as driver’s preference, interactions with vulnerable road users (e.g., pedestrians, cyclists, etc.), weather and road conditions. These factors can hardly be fully described in a traffic model. On the other hand, reinforcement learning techniques can directly learn from the observed data without making unrealistic assumptions about the model. In essence, an RL system generates and executes different strategies (e.g., for traffic signal control) based on the current environment. It will then learn and adjust the strategy based on the feedback from the environment. This reveals the most significant difference between transportation approaches and our RL approaches: in traditional transportation research, the control model is static; in reinforcement learning, the control model is dynamically learned through trial-and-error in the real environment.
`
强化学习方法通过不断试错来学习,而不对交通模型做出不切实际的假设。传统交通方法的典型思路是在对交通模型作出一定假设的前提下,将交通信号控制转化为一个优化问题。例如,Webster公式法[28]是该领域中针对单交叉口应用最广泛的方法之一。它假设某段时间内(如过去5或10分钟)的车流量是均匀的,优化后具有闭式解[52]。其他方法如Maxband[35]或SCATS[38]也做了类似的假设,以使优化问题易于处理。这里的关键问题是,这些假设经常偏离现实世界。现实世界的交通状况以复杂的方式演变着,受许多因素的影响,如司机的偏好、与弱势道路使用者(如行人、骑自行车的人等)的互动、天气和道路状况。这些因素很难在交通模型中完全描述。另一方面,强化学习技术可以直接从观察到的数据中学习,而不对模型做出不切实际的假设。本质上,RL系统根据当前环境生成并执行不同的策略(如交通信号控制方案),然后根据环境的反馈来学习和调整策略。这揭示了交通领域方法和RL方法之间最显著的区别:在传统的交通研究中,控制模型是静态的;在强化学习中,控制模型是在真实环境中通过试错动态学习的。

The combination of deep learning with reinforcement learning helps alleviate the “curse of dimensionality” problem. Traditionally, RL is concerned with the issue of the curse of dimensionality as the number of state-action pairs can grow exponentially with the dimension of states and actions. Recent advances in deep learning help the approximation of functions in RL like Q(s, a) or V(s) by learning efficiently on a significantly smaller number of features instead of a large number of state-action pairs. This helps to improve scalability with reduced requirements on memory or storage capacity, as well as reduced learning time.
`
深度学习与强化学习的结合有助于缓解“维度灾难”问题。传统上,RL面临维度灾难的问题,因为状态-动作对的数量会随着状态和动作的维数呈指数增长。深度学习的最新进展有助于近似RL中的函数(如Q(s, a)或V(s)),方法是在数量明显更少的特征上高效学习,而不是直接处理大量的状态-动作对。这有助于提高可扩展性,降低对内存或存储容量的需求,并减少学习时间。

3.2 智能体设计

A key question for RL is how to formulate the RL agent, i.e., the reward, state, and action definition. In this subsection, we focus on the advances in the reward, state, and action design in recent deep RL-based methods, and refer readers interested in more detailed definitions to [16; 39; 71].
`
RL的一个关键问题是如何设计RL智能体,即奖励、状态和动作的定义。在本小节中,我们将重点关注最近基于深度RL的方法在奖励、状态和动作设计方面的进展,对更详细的定义感兴趣的读者可以参考文献[16;39;71]。

3.2.1 奖励

The choice of reward reflects the learning objective of an RL agent. In the traffic signal control problem, although the ultimate objective is to minimize the travel time of all vehicles, travel time is hard to serve as a valid reward in RL. Because the travel time of a vehicle is affected by multiple actions from traffic signals and vehicle movements, the travel time as reward would be delayed and ineffective in indicating the goodness of the signals’ action. Therefore, the existing literature often uses a surrogate reward that can be effectively measured after an action, considering factors like average queue length, average waiting time, average speed or throughput [55; 65]. The authors in [58] also take the frequency of signal changing and the number of emergency stops into the reward. With different reward functions being proposed, researchers in [78; 62] find out that the weight on each factor in reward is tricky to set, and a minor difference in weight setting could lead to dramatically different results. Thus they set out to find a minimal set of factors, proving that using queue length as the reward for a single intersection scenario and using pressure, a variant of queue length, in the multi-intersection scenario are equivalent to optimizing the global travel time.
`
奖励的选择反映了RL智能体的学习目标。在交通信号控制问题中,虽然最终目标是最小化所有车辆的行驶时间,但行驶时间很难在RL中作为有效的奖励:由于一辆车的行驶时间受到交通信号的多个动作以及车辆自身运动的影响,以行驶时间作为奖励是延迟的,难以有效反映信号动作的好坏。因此,现有文献通常使用可以在动作执行后被有效测量的替代奖励,考虑的因素包括平均队列长度、平均等待时间、平均速度或吞吐量[55;65]。[58]的作者还将信号切换频率和紧急停车次数纳入奖励。随着各种奖励函数被提出,[78;62]的研究者发现奖励中各因素的权重很难设置,权重设置的微小差异可能导致截然不同的结果。因此他们着手寻找一个最小的因素集合,并证明:在单交叉口场景中使用队列长度作为奖励,在多交叉口场景中使用压力(队列长度的一种变体)作为奖励,等价于优化全局行驶时间。
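个人补充:下面给出队列长度奖励与压力奖励的一种简化计算示意。其中车道命名、数据结构均为假设,压力的定义按"进口道排队之和减去出口道排队之和"的简化形式写出,具体定义请以原始文献(如MaxPressure、PressLight)为准。

```python
# 两种常见替代奖励的简化计算示意(个人理解,非论文代码)
def queue_length_reward(incoming_queues):
    """单交叉口常用奖励:负的进口道排队长度之和。"""
    return -sum(incoming_queues.values())

def pressure_reward(incoming_queues, outgoing_queues):
    """多交叉口常用奖励:压力(pressure)的简化形式,
    即进口道排队之和减去出口道排队之和,取其绝对值的负数作为奖励。"""
    pressure = sum(incoming_queues.values()) - sum(outgoing_queues.values())
    return -abs(pressure)

incoming = {"N_in": 5, "S_in": 3, "E_in": 7, "W_in": 2}
outgoing = {"N_out": 1, "S_out": 4, "E_out": 2, "W_out": 3}
print(queue_length_reward(incoming))        # -17
print(pressure_reward(incoming, outgoing))  # -(17 - 10) = -7
```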

3.2.2 状态

At each time step, the agent receives some quantitative descriptions of the environment as state to decide its action. Various kinds of elements have been proposed to describe the environment state, such as queue length, waiting time, speed and phase, etc. These elements can be defined on the lane level or road segment level, and then concatenated as a vector. In earlier work using RL for traffic signal control, people need to discretize the state space and use a simple tabular or linear model to approximate the state functions for efficiency [1; 7; 49]. However, the real-world state space is usually huge, which confines the traditional RL methods in terms of memory or performance. With advances in deep learning, deep RL methods are proposed to handle large state space as an effective function approximator. Recent studies propose to use images [9; 14; 18; 20; 22; 23; 32; 33; 42; 58; 65] to represent the state, where the position of vehicles are extracted as an image representation. With varying information used in state representation in different studies, [62; 78] shows that complex state definition and large state space do not necessarily lead to significant performance gain, and proposes to use simple state like lane-level queue length and phase to represent the environment state.
`
在每个时间步,智能体接收一些对环境的定量描述作为状态,以决定其动作。人们提出了各种用来描述环境状态的元素,如队列长度、等待时间、速度和相位等。这些元素可以在车道级别或路段级别上定义,然后拼接成一个向量。在早期使用RL进行交通信号控制的工作中,为了效率,人们需要将状态空间离散化,并使用简单的表格或线性模型来近似状态函数[1;7;49]。然而,现实世界的状态空间通常是巨大的,这在内存或性能方面限制了传统的RL方法。随着深度学习的进展,深度RL方法被提出,作为一种有效的函数逼近器来处理巨大的状态空间。最近的研究提出使用图像[9;14;18;20;22;23;32;33;42;58;65]来表示状态,即把车辆的位置提取为图像表示。针对不同研究在状态表示中使用的信息各不相同这一情况,[62;78]表明复杂的状态定义和大的状态空间不一定会带来显著的性能提升,并提出使用车道级队列长度和相位等简单的状态来表示环境状态。
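个人补充:按照"车道级队列长度+当前相位"这种简单状态表示,状态向量的构造大致如下。函数名与维度均为个人假设。

```python
# 简单状态表示的构造示意:车道级排队长度 + 当前相位的one-hot编码(个人补充)
import numpy as np

def build_state(lane_queues, current_phase, num_phases):
    """lane_queues: 各进口车道的排队长度列表; current_phase: 当前相位编号。"""
    phase_onehot = np.zeros(num_phases)
    phase_onehot[current_phase] = 1.0
    return np.concatenate([np.asarray(lane_queues, dtype=np.float32), phase_onehot])

state = build_state(lane_queues=[3, 0, 5, 2, 1, 0, 4, 2], current_phase=1, num_phases=4)
print(state.shape)  # (12,)
```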

3.2.3 动作方案

Now there are different types of action definitions for an RL agent in traffic signal control: (1) set current phase duration [4; 5], (2) set the ratio of the phase duration over pre-defined total cycle duration [1; 10], (3) change to the next phase in pre-defined cyclic phase sequence [39; 48; 58; 65], and (4) choose the phase to change to among a set of phases [2; 9; 12; 44; 42; 78]. The choice of action scheme is closely related to specific settings of traffic signals. For example, if the phase sequence is required to be cyclic, then the first three action schemes should be considered, while “choosing the phase to change to among a set of phases” can generate flexible phase sequences.
`
对于交通信号控制中的RL智能体,目前有以下几类动作定义:(1)设置当前相位持续时间[4;5];(2)设定各相位持续时间占预定义总周期时长的比例[1;10];(3)在预定义循环相位序列中切换到下一相位[39;48;58;65];(4)在一组相位中选择要切换到的相位[2;9;12;44;42;78]。动作方案的选择与交通信号的具体设置密切相关。例如,如果要求相位序列是循环的,则应考虑前三种动作方案;而第(4)种方案“在一组相位中选择要切换到的相位”则可以生成灵活的相位序列。
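个人补充:方案(3)与方案(4)的差别可以用如下两个小函数来示意(相位数量与编号为假设)。方案(3)的动作空间只有"保持/切换"两个选项,相位顺序固定;方案(4)的动作空间是整个相位集合,因此相位序列可以是任意的。

```python
# 动作方案(3)与(4)的区别示意(个人补充)
NUM_PHASES = 4

def next_phase_cyclic(current_phase, switch):
    """方案(3):动作只有两种——保持当前相位(switch=0)或按预定义循环切到下一相位(switch=1)。"""
    return (current_phase + 1) % NUM_PHASES if switch == 1 else current_phase

def next_phase_flexible(chosen_phase):
    """方案(4):动作空间为全部相位集合,可直接选择任意相位,序列不必循环。"""
    assert 0 <= chosen_phase < NUM_PHASES
    return chosen_phase

print(next_phase_cyclic(current_phase=2, switch=1))   # 3
print(next_phase_flexible(chosen_phase=0))            # 0
```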

3.3 策略学习

RL methods can be categorized in different ways. [3; 26] divide current RL methods into model-based methods and model-free methods. Model-based methods try to model the transition probability among states explicitly, while model-free methods directly estimate the reward for state-action pairs and choose the action based on this. In the context of traffic signal control, the state transition between states is primarily influenced by people’s driving behaviors, which are diverse and hard to predict. Therefore, currently, most RL-based methods for traffic signal control are model-free methods. In this subsection, we take the categorization in [43]: value-based methods and policy-based methods.
`
RL方法可以用不同的方式进行分类。[3;26]将当前的RL方法分为基于模型的方法和无模型的方法。基于模型的方法试图对状态之间的转移概率进行显式建模,而无模型的方法直接估计状态-动作对的回报,并以此为基础选择动作。在交通信号控制的背景下,状态之间的转移主要受人的驾驶行为的影响,而驾驶行为具有多样性且难以预测。因此,目前基于RL的交通信号控制方法大多是无模型方法。在本小节中,我们采用[43]中的分类方式:基于价值的方法和基于策略的方法。

3.3.1 基于价值的方法

Value-based methods approximate the state-value function or state-action value function (i.e., how rewarding each state is or state-action pair is), and the policy is implicitly obtained from the learned value function. Most of the RL-based traffic signal control methods use DQN [41], where the model is parameterized by neural networks and takes the state representation as input [58; 30]. In DQN, discrete actions are required as the model directly outputs the action’s value given a state, which is especially suitable for action schemes (3) and (4) mentioned in Section 3.2.3.
`
基于价值的方法近似状态价值函数或状态-动作价值函数(即每个状态或状态-动作对能带来多少回报),策略则从学习到的价值函数中隐式获得。大多数基于RL的交通信号控制方法采用DQN[41],其模型由神经网络参数化,以状态表示作为输入[58;30]。在DQN中,动作必须是离散的,因为模型在给定状态下直接输出各动作的价值,这特别适用于3.2.3节中提到的动作方案(3)和(4)。
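个人补充:下面是一个基于PyTorch的Q网络最小示意,说明"输入状态向量、输出各相位动作的Q值"这一结构。输入输出维度与隐层大小均为假设,并非任何具体论文中的网络。

```python
# DQN价值网络的最小示意(个人补充,输入输出维度为假设)
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """输入:状态向量(如车道级排队长度+相位one-hot);输出:每个候选相位动作的Q值。"""
    def __init__(self, state_dim=12, num_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, 12)
q_values = q_net(state)           # 形状 (1, 4)
action = q_values.argmax(dim=1)   # 贪心地选择Q值最大的相位
```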

3.3.2 基于策略的方法

Policy-based methods directly update the policy parameters (e.g., a vector of probabilities to conduct actions under specific state) towards the direction to maximizing a predefined objective (e.g., average expected return). The advantage of policy-based methods is that it does not require the action to be discrete like DQN. Also, it can learn a stochastic policy and keep exploring potentially more rewarding actions. To stabilize the training process, the actor-critic framework is widely adopted. It utilizes the strengths of both value-based and policy-based methods, with an actor controlling how the agent behaves (policy-based), and the critic measuring how good the conducted action is (value-based). In the traffic signal control problem, [10] uses DDPG [34] to learn a deterministic policy which directly maps states to actions, while [4; 42; 69] learn a stochastic policy that maps states to action probability distribution, all of which have shown excellent performance in traffic signal control problems. To further improve convergence speed for RL agents, [51] proposed a time-dependent baseline to reduce the variance of policy gradient updates to specifically avoid traffic jams.
`
基于策略的方法直接将策略参数(例如在特定状态下执行各动作的概率向量)朝着最大化预定义目标(如平均期望回报)的方向更新。基于策略的方法的优点是它不像DQN那样要求动作是离散的;此外,它可以学习随机策略,并不断探索潜在回报更高的动作。为了稳定训练过程,actor-critic框架被广泛采用。它结合了基于价值和基于策略方法的优点:actor控制智能体如何行动(基于策略),critic衡量所执行动作的好坏(基于价值)。在交通信号控制问题中,[10]使用DDPG[34]学习一个直接将状态映射到动作的确定性策略,而[4;42;69]学习将状态映射到动作概率分布的随机策略,这些方法在交通信号控制问题中都表现出了出色的性能。为了进一步提高RL智能体的收敛速度,[51]提出了一个与时间相关的基线来减小策略梯度更新的方差,专门用于避免交通拥堵。
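个人补充:actor-critic结构的要点是"策略头输出动作分布、价值头输出状态价值",可用如下PyTorch最小示意表达。网络规模与维度为假设,并非文中任何具体方法(DDPG、A2C等)的实现。

```python
# actor-critic结构的最小示意(个人补充)
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=12, num_actions=4, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, num_actions)  # 策略头:输出各相位动作的logits
        self.critic = nn.Linear(hidden, 1)           # 价值头:输出状态价值V(s)

    def forward(self, state):
        h = self.shared(state)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

model = ActorCritic()
dist, value = model(torch.randn(1, 12))
action = dist.sample()             # actor给出随机策略下的动作
log_prob = dist.log_prob(action)   # 用于策略梯度更新
# 训练时的典型损失形式大致为:-(优势函数.detach() * log_prob) + 价值函数的回归误差项
```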

In the above-mentioned methods, including both value-based and policy-based methods, deep neural networks are used to approximate the value functions. Most of the literature use vanilla neural networks with their corresponding strengths. For example, Convolutional Neural Networks (CNN) are used since the state representation contains image representation [9; 20; 22; 23; 32; 33; 42; 58]; Recurrent Neural Networks (RNN) are used to capture the temporal dependency of historical states [60]. Special neural network structures are also proposed to incorporate prior knowledge about the states into the learning process [65; 77].
`
在上述方法中,无论是基于价值的方法还是基于策略的方法,都使用深度神经网络来近似价值函数。大多数文献使用标准(vanilla)神经网络并利用其各自的优势。例如,由于状态表示中包含图像形式的表示,一些工作使用卷积神经网络(CNN)[9;20;22;23;32;33;42;58];另一些工作使用循环神经网络(RNN)来捕获历史状态的时间依赖性[60]。还有工作提出特殊的神经网络结构,将关于状态的先验知识纳入学习过程[65;77]。

3.4 协调

Coordination could benefit signal control for multi-intersection scenarios. Since recent advances in RL improve the performance on isolated traffic signal control, efforts have been performed to design strategies that cooperate with multi-agent reinforcement learning (MARL) agents. Literature [13] categorizes MARL into two classes: Joint action learners and independent learners. Here we extend this categorization for the traffic signal control problem.
`
在多交叉口的情况下,协调有利于信号控制。由于RL的最新进展提高了单个(孤立)交叉口信号控制的性能,人们开始着力设计多智能体强化学习(MARL)智能体之间相互配合的策略。文献[13]将MARL分为两类:联合动作学习者和独立学习者。在这里,我们将这种分类扩展到交通信号控制问题。

3.4.1 联合动作学习

A straightforward solution is to use a single global agent to control all the intersections [49]. It directly takes the state as input and learns to set the joint actions of all intersections at the same time. However, these methods can result in the curse of dimensionality, which encompasses the exponential growth of the state-action space in the number of state and action dimensions. Joint action modeling methods explicitly learn to model the joint action value of multiple agents $Q(o_1, \ldots, o_N, a)$. The joint action space grows with the increase in the number of agents to model. To alleviate this challenge, [58] factorizes the global Q-function as a linear combination of local subproblems, extending [66] using the max-plus [27] algorithm: $\hat{Q}(o_1, \ldots, o_N, a) = \sum_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$, where i and j correspond to the indices of neighboring agents. In other works, [74; 12; 57] regard the joint Q-value as a weighted sum of local Q-values, $\hat{Q}(o_1, \ldots, o_N, a) = \sum_{i,j} w_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$, where $w_{i,j}$ are the pre-defined weights. They attempt to ensure individual agents consider other agents’ learning process by adding a shaping term in the loss function of the individual agent’s learning process and minimizing the difference between the weighted sum of individual Q-values and the global Q-value.
`
一个直接的解决方案是使用单个全局智能体来控制所有交叉口[49]。它直接以状态为输入,学习同时设置所有交叉口的联合动作。然而,这类方法可能导致维度灾难,即状态-动作空间随状态和动作维数呈指数增长。联合动作建模方法显式学习多个智能体的联合动作价值 $Q(o_1, \ldots, o_N, a)$,而联合动作空间会随着需要建模的智能体数量的增加而增大。为了缓解这一挑战,[58]将全局Q函数分解为局部子问题的线性组合,利用max-plus算法[27]对[66]进行了扩展:$\hat{Q}(o_1, \ldots, o_N, a) = \sum_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$,其中i和j对应相邻智能体的索引。在其他工作中,[74;12;57]将联合Q值视为局部Q值的加权和:$\hat{Q}(o_1, \ldots, o_N, a) = \sum_{i,j} w_{i,j} Q_{i,j}(o_i, o_j, a_i, a_j)$,其中 $w_{i,j}$ 为预定义的权重。这些方法试图通过在单个智能体学习过程的损失函数中加入一个塑形(shaping)项,并最小化各智能体Q值的加权和与全局Q值之间的差值,来保证单个智能体考虑其他智能体的学习过程。
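个人补充:"将全局Q值近似为相邻交叉口对局部Q值的加权和"这一计算本身很简单,可以写成如下示意。其中交叉口编号、权重与Q值均为假设的数值,仅用于说明公式含义。

```python
# 全局Q值 ≈ 邻居对局部Q值的加权和(个人补充的简化示意)
def joint_q_value(local_q, weights):
    """local_q[(i, j)]: 相邻交叉口对(i, j)在给定观测与动作下的局部Q值;
    weights[(i, j)]: 预定义的权重。返回加权求和得到的全局Q值近似。"""
    return sum(weights[pair] * q for pair, q in local_q.items())

local_q = {(0, 1): 1.5, (1, 2): -0.3, (2, 3): 0.8}   # 假设的三对相邻交叉口
weights = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 1.0}
print(joint_q_value(local_q, weights))  # 1.5 - 0.15 + 0.8 = 2.15
```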

3.4.2 独立学习

There is also a line of studies that use independent RL (IRL) agents to control the traffic signals, where each RL agent controls an intersection. Unlike joint action learning methods, each agent learns its control policy without knowing the reward signal of other agents. IRL without communication methods treat each intersection individually, with each agent observing its own local environment and not using explicit communication to resolve conflicts [39; 10; 78; 48; 36; 9; 23]. In some simple scenarios like arterial networks, this approach has performed well with the formation of several mini green waves. However, when the environment becomes complicated, the non-stationary impacts from neighboring agents will be brought into the environment, and the learning process usually cannot converge to stationary policies if there are no communication or coordination mechanisms among agents [45]. To deal with this challenge, the authors in [62] propose a specified reward that describes the demand for coordination between neighbors to achieve coordination.
`
还有一系列研究使用独立的RL(IRL)智能体来控制交通信号,其中每个RL智能体控制一个交叉口。与联合动作学习方法不同,每个智能体在不知道其他智能体奖励信号的情况下学习自己的控制策略。无通信的IRL方法将每个交叉口单独对待,每个智能体只观察自己的本地环境,不使用显式通信来解决冲突[39;10;78;48;36;9;23]。在一些简单的场景(如干线路网)中,这类方法表现良好,能形成若干段小的绿波。然而,当环境变得复杂时,邻近智能体带来的非平稳影响会进入环境,如果智能体之间没有通信或协调机制,学习过程通常无法收敛到平稳策略[45]。为了应对这一挑战,[62]的作者提出了一种特定的奖励,用于描述邻居之间的协调需求,以实现协调。

IRL with communication methods enable agents to communicate with each other about their observations and behave as a group, rather than a collection of individuals, in complex tasks where the environment is dynamic and each agent has limited capabilities and visibility of the world [56]. Typical methods directly add the neighbors’ traffic conditions [70] or past actions [21] into the observation of the ego agent, rather than just using the local traffic condition of the ego agent. In this method, all the agents for different intersections share one learning model, which requires the consistent indexing of neighboring intersections. [44] attempts to remove this requirement by utilizing the road network structure with Graph Convolutional Network [53] to cooperate with multi-hop nearby intersections. [44] models the influence of neighboring agents by the fixed adjacency matrix defined in the Graph Convolutional Network, which indicates their assumption that the influences between neighbors are static. In other work, [63; 60] propose to use Graph Attentional Networks [59] to learn the dynamic interactions between the hidden states of neighboring agents and the ego agent. It should be pointed out that there is a strong connection between methods employing max-plus [27] to learn joint action-learners and methods using Graph Convolutional Network to learn the communication, as both of them can be seen to learn the message passing on the graph, where the former kind of methods pass the reward and the latter pass the state observations.
`
带有通信的IRL方法使智能体之间能够就各自的观测进行通信,并在环境动态变化、每个智能体能力和视野都有限的复杂任务中作为一个整体而非个体的集合来行动[56]。典型的做法是直接将邻居的交通状况[70]或过往动作[21]加入到自身智能体的观测中,而不仅仅使用自身的局部交通状况。在这种方法中,不同交叉口的所有智能体共享一个学习模型,这要求相邻交叉口的索引保持一致。[44]尝试利用路网结构和图卷积网络(GCN)[53]来与多跳范围内的邻近交叉口协作,从而消除这一要求。[44]通过图卷积网络中定义的固定邻接矩阵来建模相邻智能体的影响,这表明他们假设邻居之间的影响是静态的。在其他工作中,[63;60]提出使用图注意力网络(GAT)[59]来学习相邻智能体与自身智能体隐状态之间的动态交互。需要指出的是,使用max-plus[27]学习联合动作的方法与使用图卷积网络学习通信的方法之间有很强的联系:两者都可以看作在学习图上的消息传递,前者传递的是奖励,后者传递的是状态观测。
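个人补充:上文提到的"把邻居的交通状况加入自身观测"是带通信方法中最直接的一种做法,其核心只是观测向量的拼接,示意如下(车道数、邻居数量与数值均为假设)。

```python
# "带通信的独立学习"中最直接的做法:把邻居的观测拼接进自身观测(个人补充)
import numpy as np

def augment_observation(ego_obs, neighbor_obs_list):
    """ego_obs: 本交叉口的观测向量; neighbor_obs_list: 按固定顺序排列的邻居观测。
    返回拼接后的增广观测,供本交叉口的智能体使用。"""
    return np.concatenate([ego_obs] + list(neighbor_obs_list))

ego = np.array([3.0, 0.0, 5.0, 1.0])                  # 假设:本交叉口4条车道的排队长度
neighbors = [np.array([2.0, 1.0, 0.0, 4.0]),           # 假设:两个邻居交叉口的排队长度
             np.array([0.0, 0.0, 6.0, 2.0])]
obs = augment_observation(ego, neighbors)
print(obs.shape)  # (12,)
```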


4 评估

概括:

  • 评估指标(交叉口的效率量化):
    • 所有车辆的平均行驶时间、平均停留次数、平均排队长度、路网吞吐量
    • 其中对于waiting的定义各不相同,有的是车速为0,有的是车速过低
  • 仿真环境:
    • 代表性:GLD、AIM、SUMO、CityFlow
    • 其他专有:Paramics、Aimsun
  • 路网:
    • 道路属性:车道数、车道速度限制、交叉路口结构、相位设置
    • 其中交通信号的数量直接影响着路网规模
  • 交通流:
    • 交通流量的动态性和复杂性越大,模型学习最优策略的难度越大
    • 现有文献通常在学习过程中将车辆变道、变速、路径选择等行为模型保持固定,以简化实验设计

In this section, we will introduce some experimental settings that will influence the evaluation of traffic signal control strategies: evaluation metrics, simulation environment, road network setting, and traffic flow setting. A comparison of the settings that influence the evaluation are summarized in Table 1.
`
在本节中,我们将介绍一些影响交通信号控制策略评估的实验设置:评估指标、模拟环境、路网设置和交通流设置。表1总结了影响评估的设置的比较。

4.1 评估指标

The objective of traffic signal control is to facilitate safe and efficient movement of vehicles at the intersection. Safety is achieved by separating conflicting movements in time and is not considered in most related literature. Various measures have been proposed to quantify efficiency of the intersection from different perspectives, including the average travel time of all vehicles, the average number of stops that vehicles experience in the network, the average queue length in the road network, and the throughput of the road network. While the performance of the same method on queue length might differ with different definitions of a “waiting” state of a vehicle, travel time and throughput are widely adopted as evaluation metrics by recent literature.
`
交通信号控制的目标是使车辆在交叉口安全、高效地通行。安全性通过在时间上分离相互冲突的车流来实现,大多数相关文献并未考虑这一点。针对交叉口效率,人们从不同角度提出了多种量化指标,包括所有车辆的平均行驶时间、车辆在路网中的平均停车次数、路网中的平均排队长度以及路网的吞吐量。虽然在车辆“等待”状态的定义不同时,同一方法在排队长度这一指标上的表现可能有所不同,但行驶时间和吞吐量已被近年来的文献广泛用作评价指标。
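个人补充:平均行驶时间与吞吐量的计算本身很直接,可按如下方式从每辆车的进入/离开时刻计算。数据结构与数值为假设,仅作示意。

```python
# 平均行驶时间与吞吐量的计算示意(个人补充)
def average_travel_time(trips, horizon):
    """trips: [(enter_time, exit_time 或 None), ...];
    未离开路网的车辆按仿真结束时刻horizon截断计算。"""
    times = [(exit_t if exit_t is not None else horizon) - enter_t
             for enter_t, exit_t in trips]
    return sum(times) / len(times)

def throughput(trips):
    """吞吐量:在仿真期间完成行程(已离开路网)的车辆数。"""
    return sum(1 for _, exit_t in trips if exit_t is not None)

trips = [(0, 120), (30, 200), (60, None), (90, 310)]
print(average_travel_time(trips, horizon=360))  # (120+170+300+220)/4 = 202.5
print(throughput(trips))                        # 3
```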

4.2 仿真环境

Since deploying and testing traffic signal control strategies in the real world involves high cost and intensive labor, simulation is a useful alternative before actual implementation. Simulations of traffic signal control often involve large, heterogeneous scenarios and vehicle-level information, thus most literature relies on microscopic simulation, in which movements of individual vehicles are represented through microscopic properties such as the position and velocity of each vehicle. Some representative open-source microscopic simulators are: The Green Light District (GLD), The Autonomous Intersection Management (AIM), Simulation of Urban MObility (SUMO), and CityFlow [73]. Other proprietary simulators like Paramics and Aimsun are also adopted in [30; 10; 5]. For a detailed comparison of the open-source simulators, please refer to [40].
`
由于在现实世界中部署和测试交通信号控制策略需要较高的成本和密集的劳动,因此在实际实施之前,模拟是一个有用的替代方案。交通信号控制的仿真通常涉及大规模、异构的场景和车辆级的信息,因此大多数文献依赖微观仿真,其中单个车辆的运动通过每辆车的位置和速度等微观属性来表示。代表性的开源微观仿真器有:Green Light District(GLD)、Autonomous Intersection Management(AIM)、Simulation of Urban MObility(SUMO)和CityFlow[73]。其他专有仿真器如Paramics和Aimsun也在[30;10;5]中被采用。有关开源仿真器的详细比较,请参阅文献[40]。
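个人补充:以CityFlow为例,在仿真中驱动"观测-选相位-步进"循环的典型流程大致如下。以下接口名称、配置文件与交叉口ID均凭个人印象写出,属于假设,具体请以CityFlow官方文档为准(例如需在配置中开启RL对信号的控制)。

```python
# CityFlow仿真器典型使用流程的示意(接口细节为个人印象,具体以官方文档为准)
import cityflow

eng = cityflow.Engine("config.json", thread_num=1)        # 假设config.json中已指定路网与车流文件
for step in range(3600):                                   # 仿真3600个时间步
    lane_waiting = eng.get_lane_waiting_vehicle_count()    # 各车道等待车辆数,可用于构造状态/奖励
    # 此处本应由RL智能体根据lane_waiting等信息选择相位,下面仅示意按固定节奏切换
    eng.set_tl_phase("intersection_1_1", step // 30 % 4)   # 交叉口ID与相位编号均为假设
    eng.next_step()
print(eng.get_average_travel_time())                       # 仿真结束后输出平均行驶时间
```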

4.3 路网

Different road networks are explored in the current literature, including synthetic and real-world road network. At a coarse scale, a road network is a directed graph with nodes and edges representing intersections and roads, respectively. Specifically, a real-world road network can be more complicated than the synthetic network in the road properties (e.g., the number of lanes, speed limit of every lane), intersection structures and signal phases settings. Among all the road network properties, the number of traffic signals in the network largely influences the experiment results because the scale of explorations for RL agents to take increases with the scale of road network. Currently, most of the work still conducts experiments on relatively small road networks compared to the scale of a city, which could include thousands of traffic signals. Aslani et al. [5; 4] test their method in a real-world road network with 50 signals. In [63], a district with 196 signals is investigated. One of the most recent work [11] tests their methods on the real road network of Manhattan, New York, with 2510 traffic signals.
`
目前的文献中探索了不同的道路网络,包括合成路网和真实世界路网。从粗粒度上看,路网是一个有向图,节点和边分别表示交叉口和道路。具体来说,真实世界的路网在道路属性(如车道数、每条车道的限速)、交叉口结构和信号相位设置方面可能比合成路网更复杂。在所有路网属性中,路网中交通信号的数量对实验结果影响很大,因为RL智能体需要探索的规模会随着路网规模的增大而增大。与可能包含数千个交通信号的城市规模相比,目前大多数工作仍在相对较小的路网上进行实验。Aslani等[5;4]在一个包含50个信号的真实路网中测试了他们的方法;[63]研究了一个包含196个信号的区域;最近的一项工作[11]在纽约曼哈顿的真实路网上测试了他们的方法,该路网包含2510个交通信号。

4.4 交通流

Traffic flow demand in the simulation can influence the evaluation of control strategies. The simulator takes traffic demand data as input, with each vehicle described as (o, t, d), where o is the origin location, t is time, and d is the destination location. Locations o and d are both locations on the road network. Usually, the more dynamic and heavier the traffic demand is, the harder for an RL method to learn an optimal policy. This is because the dynamic traffic would require the RL agents learn in a non-stationary environment, and heavier traffic would require fast adaptation for RL policies. The vehicle behavior models, such as lane changing, speed changing and routing models, could also influence the traffic flow and further influence the evaluation of traffic signal control policies. But in existing literature, they are usually kept fixed during the learning process of traffic signal control methods.
`
仿真中的交通流需求会影响对控制策略的评价。仿真器以交通需求数据为输入,每辆车被描述为(o, t, d),其中o为起点位置,t为出发时间,d为终点位置,o和d都是路网上的位置。通常情况下,交通需求越动态、越大,RL方法学习最优策略的难度就越大:动态的交通要求RL智能体在非平稳环境中学习,而更大的交通流量则要求RL策略能够快速适应。车辆的变道、变速、路径选择等行为模型也会影响交通流,进而影响对交通信号控制策略的评价;但在现有文献中,这些行为模型在交通信号控制方法的学习过程中通常保持固定。
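个人补充:交通需求中"每辆车描述为(o, t, d)"可以用一个简单的数据结构来表达,字段名为个人假设,仅用于说明数据形式。

```python
# 交通流需求数据中单辆车描述(o, t, d)的一种表示方式(个人补充,字段名为假设)
from dataclasses import dataclass

@dataclass
class VehicleDemand:
    origin: str         # o:起点所在道路/位置
    depart_time: float  # t:出发时刻(秒)
    destination: str    # d:终点所在道路/位置

flow = [
    VehicleDemand(origin="road_west_in", depart_time=0.0, destination="road_east_out"),
    VehicleDemand(origin="road_north_in", depart_time=5.0, destination="road_south_out"),
]
```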


5 结论和未来工作

概括:

  • 基准数据集和基线:
    • 通过创建统一的路网和交通流来确保提出方法的公平比较和可重现性
    • 强化学习方法缺少与传统典型方法的比较:Webster’s Formula 和 MaxPressure
  • 学习效率:
    • 利用有限的数据样本学习和高效探索
  • 安全问题:
    • 缺少现实世界的物理约束和风险管理
  • 仿真迁移到现实:
    • 在应用于真实世界前先学习一种可解释的策略
    • 构建一个更真实的仿真模拟软件

In this survey, we present an overview of recent advances in reinforcement learning methods for traffic signal control, and provide an organization considering both the learning approach and evaluations of the research in this field. Here, we briefly discuss some directions for future research.
`
在本综述中,我们概述了用于交通信号控制的强化学习方法的最新进展,并从学习方法和评估两个角度对该领域的研究进行了梳理。在此,我们简要讨论一些未来的研究方向。

5.1 基准数据集和基线

As discussed in Section 4.4, researchers use different road networks and traffic flow datasets, which could introduce large variances in final performance. Therefore, evaluating control policies using a standard setting could save the effort and assure a fair comparison and reproducibility of RL methods [25]. An effort that could greatly facilitate research in this field is to create publicly available benchmarks. Another concern for RL-based traffic signal control is that for this interdisciplinary research problem, the existing literature of RL-based methods often lacks comparison with typical methods from the transportation area, like Webster’s Formula [28] and MaxPressure [29].
`
正如4.4节所讨论的,研究人员使用不同的路网和交通流数据集,这可能会给最终性能带来很大的差异。因此,使用标准化的设置来评估控制策略可以节省精力,并确保RL方法之间公平的比较和可复现性[25]。创建公开可用的基准数据集是能够极大促进这一领域研究的一项工作。基于RL的交通信号控制的另一个问题是:对于这一跨学科的研究问题,现有的基于RL的方法往往缺乏与交通领域典型方法的比较,如Webster's Formula[28]和MaxPressure[29]。
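个人补充:作为对照,Webster公式给出单交叉口近似最优周期时长的经典形式如下(交通工程教材中的常见写法,非原文内容):

$$
C_0 = \frac{1.5L + 5}{1 - Y}, \qquad Y = \sum_{i} y_i,
$$

其中 $L$ 为一个周期内的总损失时间(秒),$y_i$ 为第 $i$ 个关键相位的流量比(到达流率与饱和流率之比)。这类解析式正是强化学习方法在实验中应当对比的传统基线之一。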

5.2 学习效率

Existing RL methods for games usually require a massive number of update iterations and trial-and-errors for RL models to yield impressive results in simulated environments. These trial-and-error attempts will lead to real traffic jams in the traffic signal control problem. Therefore, how to learn efficiently is a critical question for the application of RL in traffic signal control. While there is some previous work using Meta-Learning [72] or imitation learning [69], there is still much to investigate on learning with limited data samples and efficient exploration in traffic signal control problem.
`
现有应用于游戏的RL方法通常需要大量的更新迭代和反复试错,才能让RL模型在模拟环境中取得令人印象深刻的结果。而在交通信号控制问题中,这些试错尝试会导致真实的交通拥堵。因此,如何高效地学习是RL应用于交通信号控制的关键问题。虽然已有一些工作使用元学习[72]或模仿学习[69],但在利用有限数据样本学习以及在交通信号控制问题中进行高效探索方面,仍有很多值得研究的地方。

5.3 安全问题

While RL methods learn from trial-and-error, the learning cost of RL could be critical or even fatal in the real world as the malfunction of traffic signals might lead to accidents. An open problem for RL-based traffic signal control problem is to find ways to adapt risk management to make RL agents acceptably safe in physical environments [19]. [37] directly integrates real-world constraints into the action selection process. If pedestrians are crossing the intersection, their method will not change the control actions, which can protect crossing pedestrians. However, more safety problems like handling collisions are still to be explored.
`
虽然RL方法通过反复试错来学习,但在现实世界中,RL的学习代价可能是严重的,甚至是致命的,因为交通信号的故障可能导致事故。基于RL的交通信号控制的一个开放问题是,如何引入风险管理,使RL智能体在物理环境中达到可接受的安全程度[19]。[37]直接将现实世界的约束集成到动作选择过程中:如果有行人正在过十字路口,他们的方法不会改变控制动作,从而保护过街行人。然而,诸如碰撞处理等更多安全问题仍有待探索。

5.4 从仿真迁移到真实场景

Most RL-based traffic signal control methods mainly conduct experiments in the simulator since the simulator can generate data in a cheaper and faster way than real experimentation. Discrepancies between simulation and reality confine the application of learned policies in the real world. While some work considers to learn an interpretable policy before applying to the real world [6] or to build a more realistic simulator [61; 75; 76] for direct transferring, there is still a challenge to transfer the control policies learned in simulation to reality.
`
大多数基于RL的交通信号控制方法主要在仿真器中进行实验,因为仿真器能够以比真实实验更低的成本、更快的速度生成数据。模拟与现实之间的差异限制了所学策略在现实世界中的应用。虽然有些工作考虑在应用于现实世界之前先学习一种可解释的策略[6],或者构建一个更真实的仿真器[61;75;76]以便直接迁移,但将仿真中学到的控制策略迁移到现实仍然是一个挑战。
