The 4th International Workshop on Agent-based Mobility, Traffic and Transportation Models, Methodologies and Applications (ABMTRANS)

Parallel Reinforcement Learning for Traffic Signal Control

Abstract

Developing Adaptive Traffic Signal Control strategies for efficient urban traffic management is a challenging problem, which is not easily solved. Reinforcement Learning (RL) has been shown to be a promising approach when applied to traffic signal control (TSC) problems. When using RL agents for TSC, difficulties may arise with respect to convergence times and performance. This is especially pronounced on complex intersections with many different phases, due to the increased size of the state action space. Parallel Learning is an emerging technique in RL literature, which allows several learning agents to pool their experiences while learning concurrently on the same problem. Here we present an extension to a leading published work on RL for TSC, which leverages the benefits of Parallel Learning to increase exploration and reduce delay times and queue lengths.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of the Conference Program Chairs

Keywords:  Reinforcement Learning, Parallel Learning, Multi Agent Systems, Intelligent Transportation Systems, Adaptive Traffic Signal Control, Smart Cities

1.  Introduction

Traffic congestion is a major issue in modern cities that results in many negative environmental, social and economic consequences. High vehicle usage rates, coupled with the lack of space and public funds available to construct new transport infrastructure, serve to further complicate the issue. Against this backdrop, it is necessary to develop intelligent and economical solutions to improve the quality of service for road users. A relatively inexpensive way to alleviate the problem is to ensure optimal use of the existing road network, e.g. by using Adaptive Traffic Signal Control (ATSC). Improvements in ATSC have a pivotal role to play in the future development of Smart Cities, especially considering the current EU-wide emphasis on the theme of Smart, Green and Integrated Transport in Horizon 2020. Developing ATSC strategies for efficient urban traffic management is a challenging problem, which is not easily solved.

In recent years, Artificial Intelligence methods such as Fuzzy Logic, Neural Networks, Genetic Algorithms and Reinforcement Learning have all been applied successfully to the traffic control problem. This coincides with an increasing interest among researchers in the broader field of Intelligent Transportation Systems (ITS). The approach that we present in this paper is based on Reinforcement Learning (RL), a technique that has many potential applications in the ITS area. Here we present an extension to a leading published work on Reinforcement Learning for Traffic Signal Control (RL-TSC), which leverages the benefits of Parallel Learning to increase exploration and reduce delay times and queue lengths.

This paper will make a contribution in the following ways: 1) a new method of Parallel Reinforcement Learning for Traffic Signal Control applications is presented; 2) we demonstrate experimentally that the proposed method improves the quality of the solution reached compared to a standard single agent approach; 3) we identify and discuss specific issues that need to be taken into account when applying Parallel Learning to this difficult problem domain; 4) we discuss the future direction of this research topic.

The remainder of this paper is structured as follows: the second section discusses related research, while the third section describes our Parallel Learning approach. The following section details the design of our experimental setup, after which we present our experimental results. Finally, we conclude by discussing our findings and our plans for future work.

2.  Related Research

2.1.  Reinforcement Learning

The term Reinforcement Learning describes a class of algorithms that have the capability to learn through experience. An RL agent is deployed into an environment, usually without any prior knowledge of how to behave. The agent interacts with its environment, and receives a scalar reward signal r based on the outcomes of previously selected actions. This reward can be either negative or positive, and a properly designed reward function allows the agent to iteratively learn an optimal or near optimal control policy. The agent must strike a balance between exploiting known good actions and exploring the consequences of new actions in order to maximise the reward received during its lifetime. Q values represent the expected reward for each state action pair, and aid the agent in deciding which action is most desirable to select when in a certain state. The Q values are typically stored in a matrix, which represents the knowledge learned by an RL agent.

RL problems are generally modelled as a Markov Decision Process (MDP), which is considered the de facto standard when formalising problems involving learning sequential decision making 1 . An MDP may be represented using a reward function R, a set of states S, a set of actions A, and a transition function T 1 , i.e. a tuple ⟨S, A, T, R⟩. When in any state s ∈ S, selecting an action a ∈ A will result in the environment entering a new state s' ∈ S with probability T(s, a, s') ∈ [0, 1], and give a reward r = R(s, a, s').

RL algorithms may be categorised as either model-based (e.g. Dyna, Prioritised Sweeping), or model-free (e.g. Q-Learning, SARSA). To successfully implement model-based approaches, it is necessary to know the transition function T 1 . This can sometimes be problematic, given that T may be difficult or even impossible to determine in complex problem domains. By contrast, model-free approaches do not have this requirement. Instead, model-free approaches sample the underlying MDP in order to gain knowledge about the unknown model. Exploration is necessary for a model-free learner in order to gain the required knowledge about its environment.

Q-Learning is an off-policy, model-free learning algorithm that is commonly used in RL-TSC literature, e.g. 2 . It has been proven that Q-Learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely 3 . In Q-Learning, the Q values are updated according to the following equation:

Qt+1(st, at) = Qt(st, at) + α [ rt + γ max_a Qt(st+1, a) − Qt(st, at) ]                    (1)

The learning rate α ∈ [0, 1] is an input parameter required for many RL algorithms, and determines by how much Q values are updated at each time step t. The discount factor γ ∈ [0, 1] controls how the agent regards future rewards.
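As a concrete illustration, a minimal tabular implementation of the update in Equation 1 is sketched below in Java (the language used for our agent framework); the class and variable names are illustrative only and do not correspond to our framework's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal tabular Q-Learning update (Equation 1); names are illustrative. */
public class QLearner {
    private final Map<String, double[]> qTable = new HashMap<>(); // state key -> Q value per action
    private final int numActions;
    private final double alpha;   // learning rate, e.g. 0.08
    private final double gamma;   // discount factor, e.g. 0.8

    public QLearner(int numActions, double alpha, double gamma) {
        this.numActions = numActions;
        this.alpha = alpha;
        this.gamma = gamma;
    }

    /** Q values default to zero for states not yet visited. */
    public double[] qValues(String state) {
        return qTable.computeIfAbsent(state, s -> new double[numActions]);
    }

    /** Q_{t+1}(s,a) = Q_t(s,a) + alpha * (r + gamma * max_a' Q_t(s',a') - Q_t(s,a)) */
    public void update(String state, int action, double reward, String nextState) {
        double[] q = qValues(state);
        double maxNext = Double.NEGATIVE_INFINITY;
        for (double v : qValues(nextState)) maxNext = Math.max(maxNext, v);
        q[action] += alpha * (reward + gamma * maxNext - q[action]);
    }
}
```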

2.2.  Parallel Reinforcement Learning

Parallel Learning (PL) is an emerging paradigm within RL where multiple agents pool their experiences while learning concurrently on a problem, thus improving performance and decreasing convergence times.   In contrast to mainstream Multi Agent Reinforcement Learning (MARL) research, PL is typically applied to speeding up the convergence of single agent RL problems, rather than examining competitive or cooperative strategies and emergent behaviour as in MARL. PL agents typically influence each other’s behaviour by sharing information, rather than the direct agent interactions that typically occur in MARL. This is because PL agents learn in separate instances of the same problem, rather than multiple agents learning in a single problem instance as is common in MARL literature.

Although several authors have reported promising results using Parallel Reinforcement Learning (PRL) approaches (see e.g. 4,5,6,7 ), surprisingly little research has been published in this area to date. The main focus so far has been on testing PRL approaches using abstract problem domains like Gridworld, with the exception of Barrett et al. 4 , who also tested their algorithm on a realistic cloud resource allocation problem. In general, previously published works in PL have reported improvements in convergence times and/or quality of the solution reached. These results are derived mainly from experimentation using simple abstract problem domains, but we hypothesise that PL techniques may also be effective when applied to more complex Traffic Signal Control problems.

3.  Parallel Reinforcement Learning for Traffic Signal Control

Traffic Signal Control is a complex and highly stochastic problem domain, which presents a number of significant challenges when compared with the traditional abstract problem domains (e.g.  Gridworld) studied in RL research. Specifically, the complexity of transportation networks and the number of independent entities involved is on such a scale that a learning agent cannot possibly know every detail about the environment state, therefore making these problems into Partially Observable Markov Decision Processes.

In RL-TSC, each intersection is typically controlled by a single agent. Each agent has the responsibility of determining the light switching sequence at its assigned intersection. A network of traffic signal control agents may then be considered as a Multi Agent System. Reinforcement Learning has been shown to be a promising approach when applied to urban traffic signal control (e.g. 2,8,9 ), and offers many benefits; RL agents can learn online to continuously improve their performance, as well as adapting readily to changes in traffic demand. Traffic control problems make very attractive testbeds for emerging RL approaches, and present a number of non-trivial challenges such as developing strategies for coordination and information sharing between individual agents. For a comprehensive review of the usage of learning agents in Traffic Signal Control, we refer the interested reader to a review paper published by Mannion et al. 8 .

A significant challenge of applying RL to traffic signal control relates to the number of possible state action combinations for complex intersections with many phases. This problem is referred to as the Curse of Dimensionality in RL literature. As RL agents encounter more complex problems, convergence times and the quality of the policy learned tend to suffer. Here we apply the principles of PRL to develop a method which can speed up learning and reach a better final policy compared to conventional RL-TSC approaches.

Our PRL implementation consists of two types of agents: Master and Slave. Both types of agents are based on Q-Learning, and a single Master agent may be used to solve a problem on its own, or with the aid of one or more Slave agents. The Master and Slave agents may learn simultaneously on different instances of the same problem (or similar, relevant problems). The Slave agents impart their experience to the Master agent by means of a shared experience pool, in the form of a global Q matrix, as shown in Fig. 1. The Master and Slave agents each have their own instance of the problem environment to learn on, and at each timestep the knowledge learned by all agents is synchronised in the global Q matrix, in accordance with Equation 1. The Master agent may then draw on this experience to aid its action selection in the next timestep. Experimental results are measured from the instance of the problem environment owned by the Master agent in all of our tests. Here we use the same state, action and reward definitions as used by El-Tantawy et al. 2 , so our PRL method can be considered to be an extension of this approach.
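The listing below sketches one plausible realisation of this shared experience pool: each parallel learner contributes its latest (s, a, r, s') experience at every timestep, and each experience is applied to a single global Q table using the update of Equation 1. The exact synchronisation mechanism shown here is illustrative rather than definitive.

```java
import java.util.List;

/**
 * Sketch of the shared experience pool: every agent (Master and Slaves) applies its
 * latest (s, a, r, s') experience to a single global Q table each timestep, using the
 * same update rule as Equation 1. The synchronisation structure is an assumption.
 */
public class GlobalQPool {
    private final QLearner globalQ; // shared table; QLearner from the earlier sketch

    public GlobalQPool(QLearner globalQ) { this.globalQ = globalQ; }

    /** One experience tuple produced by an agent in its own environment instance. */
    public record Experience(String state, int action, double reward, String nextState) {}

    /** Merge this timestep's experience from all parallel learners into the global table. */
    public void synchronise(List<Experience> experiences) {
        for (Experience e : experiences) {
            globalQ.update(e.state(), e.action(), e.reward(), e.nextState());
        }
    }

    /** The Master consults the shared table when choosing its next action. */
    public double[] qValuesFor(String state) {
        return globalQ.qValues(state);
    }
}
```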

For each agent, state is defined as a vector of dimension 2 + P, shown in Equation 2 below. The first 2 components are the index of the current phase (Pc) and the elapsed time in the current phase (PTE). The remaining P components are the queue lengths (QLi) for each phase. For a given state s, the state vector is defined as follows:

s = [ Pc, PTE, QL1, . . . , QLP ]                                                           (2)

We limit the maximum number of queueing vehicles considered by an agent to 20, and the maximum phase elapsed time considered is limited to 30 seconds. By imposing these limits, we reduce the possible number of states considered by an agent. A vehicle is considered to be queueing at a junction if its approach speed is less than 10 km/hr.
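A minimal sketch of how such a state vector can be encoded, including the caps on queue length and phase elapsed time described above, is given below; the encoding to a string key is an illustrative choice for a tabular learner rather than a detail of our implementation.

```java
import java.util.Arrays;

/** Builds the state vector of Equation 2, applying the caps described in the text. */
public class StateEncoder {
    private static final int MAX_QUEUE = 20;         // queueing vehicles counted per phase
    private static final int MAX_ELAPSED_SEC = 30;   // phase elapsed time considered

    /**
     * @param currentPhase  index of the current phase (Pc)
     * @param elapsedSec    elapsed time in the current phase (PTE), in seconds
     * @param queueLengths  queue length per phase (QL1 .. QLP)
     * @return a string key usable in the tabular Q learner
     */
    public String encode(int currentPhase, int elapsedSec, int[] queueLengths) {
        int pte = Math.min(elapsedSec, MAX_ELAPSED_SEC);
        int[] capped = new int[queueLengths.length];
        for (int i = 0; i < queueLengths.length; i++) {
            capped[i] = Math.min(queueLengths[i], MAX_QUEUE);
        }
        return currentPhase + "|" + pte + "|" + Arrays.toString(capped);
    }
}
```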

The actions available to the agents at each time step are:  to keep the currently displayed green and red signals, or to set a green light for a different phase.  Phases are subject to a minimum length of 10 seconds, to eliminate unreasonably low phase lengths from consideration. There is no fixed cycle length, and agents are free to extend the current phase or switch to the next phase as they see fit.  When changing phases, an amber signal is displayed for 3 seconds, followed by an all red period of 2 seconds, followed by a green signal to the next phase.
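The following sketch shows one simple way of enforcing these timing constraints around the agent's phase choices; the constants come from the description above, while the state-machine structure itself is illustrative and not our controller's actual code.

```java
/** Enforces the signal sequencing described above: 10 s minimum phase length,
 *  then 3 s amber and 2 s all-red before the next phase receives green.
 *  The structure is illustrative, not the authors' implementation. */
public class PhaseController {
    private static final int MIN_PHASE_SEC = 10;
    private static final int AMBER_SEC = 3;
    private static final int ALL_RED_SEC = 2;

    private int currentPhase = 0;
    private int elapsedInPhase = 0;

    /** True if the agent is currently allowed to switch to another phase. */
    public boolean canSwitch() {
        return elapsedInPhase >= MIN_PHASE_SEC;
    }

    /** Called once per 1 s simulation timestep with the agent's chosen phase. */
    public void step(int requestedPhase) {
        if (requestedPhase != currentPhase && canSwitch()) {
            // In the simulator this switch would trigger AMBER_SEC of amber followed by
            // ALL_RED_SEC of all-red before requestedPhase is shown green.
            currentPhase = requestedPhase;
            elapsedInPhase = 0;
        } else {
            elapsedInPhase++;
        }
    }

    public int currentPhase() { return currentPhase; }
}
```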

Actions are selected using the ε-greedy strategy, where a random action is chosen with probability ε, or the action with the best expected reward is chosen with the remaining probability 1 − ε. The value of ε is set to 0.05. The learning rate α is set to 0.08, while the discount factor γ is set to 0.8. These values for ε, α and γ are used by all learning agents in our experiments. All agents begin with their Q values for each state action pair initialised to zero at the start of each experiment.
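A minimal ε-greedy selector over the Q values of the current state might look as follows (illustrative only):

```java
import java.util.Random;

/** epsilon-greedy selection over the Q values of the current state (epsilon = 0.05 in our experiments). */
public class EpsilonGreedy {
    private final Random random = new Random();
    private final double epsilon;

    public EpsilonGreedy(double epsilon) { this.epsilon = epsilon; }

    public int selectAction(double[] qValues) {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(qValues.length);        // explore: random action
        }
        int best = 0;                                     // exploit: argmax over Q values
        for (int a = 1; a < qValues.length; a++) {
            if (qValues[a] > qValues[best]) best = a;
        }
        return best;
    }
}
```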

The reward received by an agent for selecting an action a in a given state s and transitioning to a resultant state s' is defined as the difference between the current and previous cumulative waiting times (CWT) of vehicles queueing at the junction. This reward function is shown in Equation 3 below. An action that results in an increase in CWT receives a negative reward, while an action that results in a decrease in CWT results in a positive reward.

R(s, a, s') = CWTs − CWTs'                                                                                                                              (3)
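The reward of Equation 3 can be computed incrementally by tracking the cumulative waiting time between decisions, as in the illustrative sketch below:

```java
/** Reward of Equation 3: change in cumulative waiting time (CWT) at the junction.
 *  A drop in CWT yields a positive reward, an increase yields a negative one. */
public class WaitingTimeReward {
    private double previousCwt = 0.0;

    /** @param currentCwt cumulative waiting time of all queueing vehicles, in seconds */
    public double rewardFor(double currentCwt) {
        double reward = previousCwt - currentCwt; // R(s, a, s') = CWT_s - CWT_s'
        previousCwt = currentCwt;
        return reward;
    }
}
```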

4.  Experimental Design

Fig. 2: Test Junctions. (a) 2 Phase Junction; (b) 3 Phase Junction.

The microscopic traffic simulation package SUMO (Simulation of Urban MObility) 10 is the basis for our experimental setup. Agent logic is defined in our external framework, which is implemented in Java. The simulation timestep length is 1 second for all experiments. Each RL agent is responsible for controlling the light sequences of a single junction. The TraaS library 11 is used to feed simulation data to the agents, and also to send commands from the agents back to SUMO.
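For readers wishing to reproduce a similar setup, the sketch below shows the basic shape of a Java simulation loop driven through TraaS. The call names follow the published TraaS examples, but the exact signatures should be checked against the TraaS release used; the configuration file name and lane ID are placeholders, not values from our experiments.

```java
import de.tudresden.sumo.cmd.Lane;
import it.polito.appeal.traci.SumoTraciConnection;

/**
 * Minimal sketch of the simulation loop: Java agent logic driving SUMO through TraaS.
 * Call names follow the TraaS examples; verify against the TraaS version in use.
 */
public class SumoLoop {
    public static void main(String[] args) throws Exception {
        SumoTraciConnection conn = new SumoTraciConnection("sumo", "junction.sumocfg");
        conn.addOption("step-length", "1");           // 1 s timestep, as in the experiments
        conn.runServer();

        for (int step = 0; step < 7200; step++) {     // one two-hour episode
            // Read the halting-vehicle count on an approach lane (queue length input).
            int halting = (Integer) conn.do_job_get(Lane.getLastStepHaltingNumber("north_in_0"));
            // ... agent observes its state, selects an action, sends a phase command ...
            conn.do_timestep();
        }
        conn.close();
    }
}
```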

The hourly traffic demand D used in our experiments is defined as a step function. The demand step function comprises a base flow b, an episode length e, a step demand increase h, and a step interval i. Demand at time t into a particular episode is defined as shown in Equation 4 below:

 

The increase in demand due to this function is computed at intervals equal to i. This time-varying traffic demand is a more realistic and demanding flow pattern for the agents to control, compared to a constant hourly demand definition. These step functions aim to simulate a peak in demand similar to a morning or evening rush hour. The episode length e used for all intersections is two hours. Each experiment is run for 75 successive two hour episodes, with the same demand step function repeated over each two hour episode.
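As an illustration of this demand profile, the sketch below implements a step function with the shape described: demand steps up from the base flow b by h every interval i towards the stated peak, then steps back down towards b over the remainder of the episode. The parameter values shown are those of the 2 Phase junction; this is one plausible reading of the profile for illustration, not necessarily identical to Equation 4.

```java
/**
 * Illustrative rise-then-fall demand step function. The symmetric min-of-two-ramps
 * form is an assumption made for this sketch, not the authors' exact formula.
 */
public class StepDemand {
    private final double base;      // b, veh/hr
    private final double stepSize;  // h, veh/hr
    private final double interval;  // i, seconds
    private final double peak;      // veh/hr
    private final double episode;   // e, seconds

    public StepDemand(double base, double stepSize, double interval, double peak, double episode) {
        this.base = base; this.stepSize = stepSize; this.interval = interval;
        this.peak = peak; this.episode = episode;
    }

    /** Hourly demand at time t (seconds) into the episode. */
    public double demandAt(double t) {
        double rising  = base + stepSize * Math.floor(t / interval);
        double falling = base + stepSize * Math.floor((episode - t) / interval);
        return Math.min(peak, Math.min(rising, falling));
    }

    public static void main(String[] args) {
        // 2 Phase junction numbers: b = 1000 veh/hr, h = 250 veh/hr, i = 15 min,
        // peak = 1750 veh/hr, episode e = 2 h.
        StepDemand d = new StepDemand(1000, 250, 900, 1750, 7200);
        for (int t = 0; t < 7200; t += 900) {
            System.out.printf("t = %4d s  demand = %.0f veh/hr%n", t, d.demandAt(t));
        }
    }
}
```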

The first experimental scenario is a simplified junction with 2 phases:  North and East (see Fig.  2a).  Here two intersecting one-way streets are controlled by a set of traffic lights, with the number of phases P = 2. The base flow for the step function is 1000 vehicles per hour (veh/hr).  The demand level rises by 250 veh/hr every 15 minutes, reaching a peak of 1750 veh/hr, before stepping down again to a value of 1000 veh/hr. There are four possible routes through the intersection, to which vehicles are randomly assigned upon creation in the simulator.

The second test scenario consists of three intersecting two-way streets, which form a T junction (shown in Fig. 2b).  Here there are 6 possible routes through the junction arising from the lane configuration,  and three phases: North, East and South. The base flow is 1000 veh/hr, rising by 100 veh/hr every 15 minutes to a peak of 1300 veh/hr, before returning to 1000 veh/hr.

For each experimental scenario, we first test a single RL agent, based on those used by El-Tantawy et al. 2 . This approach has already been shown to offer performance improvements compared to real world traffic control systems based on fixed-time control, semi-actuated control, and SCOOT control.

We then test our parallel learning algorithm against this, with 2, 3 and 4 agents learning simultaneously on the same problem. As other works have already proved the efficacy of RL-TSC approaches in test scenarios with multiple intersections (see e.g. 2,8 ), we have decided to focus on using single junction test cases to clearly illustrate the benefit of our PRL approach without adding unnecessary additional complexity.

5.  Experimental Results and Discussion

Fig. 3: Experimental Results. (a) Average Waiting Times; (b) Average Queue Lengths; (c) Average Number of States Visited.

We evaluate each approach tested using the following parameters: Average Waiting Times (AWT), Average Queue Lengths (AQL), and Average Number of States Visited (ANSV). The values reported for AWT and AQL are the average value for each two hour episode, while for ANSV the actual value at the end of each two hour episode is reported. The results presented are the average of 10 runs, and are summarised in Table 1.

Fig.  3a shows the average vehicle waiting times (AWT) for the junctions tested.  This plot shows an improvement in AWT compared to a single agent when multiple agents are applied to the same problem. The Average Queue Lengths (AQL) for each experiment are plotted in Fig.   3b.   Similar to our findings in terms of AWT, our PRL approach beats a single learner on all the intersections tested, resulting in lower AQL values.  Adding additional agents on both the 2 and 3 Phase junctions resulted in a further decrease in AWT and AQL, which scaled quite well relative to the number of additional learners.

Another benefit offered by our PRL approach is the increased exploration when compared with a single learning agent. Fig. 3c presents the Average Number of States Visited (ANSV) at the end of each episode for all tests. On both tested junctions, our PRL approach results in an increased number of states being reached when compared with a single agent. Multiple agents learning on the same problem results in increased diversity in terms of action selection choices, leading to knowledge being gained about a higher number of possible system states.

To ensure the significance of the results, a series of two-tailed t-tests were performed. Here we compared the average values of AWT, AQL, and ANSV over the final 10 episodes of each experiment for the single agent and PRL with 2 agents. The differences in the means were deemed to be significant if the p value was less than 0.05. PRL with 2 agents was found to offer a statistically significant improvement over the single agent approach in both test cases for all experimental parameters measured.
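As an illustration of this kind of check, the snippet below runs a two-tailed t-test on two samples of per-episode averages using the Apache Commons Math library; the input values are hypothetical and the snippet is not our actual analysis code.

```java
import org.apache.commons.math3.stat.inference.TTest;

/** Illustrative two-tailed t-test on per-episode averages (hypothetical data). */
public class SignificanceCheck {
    public static void main(String[] args) {
        // Hypothetical AWT values over the final 10 episodes for two configurations.
        double[] singleAgent  = {41.9, 41.2, 41.7, 41.4, 41.5, 41.3, 41.6, 41.2, 41.8, 41.9};
        double[] prlTwoAgents = {39.9, 39.6, 40.1, 39.8, 39.7, 40.0, 39.9, 39.5, 40.2, 39.7};

        double pValue = new TTest().tTest(singleAgent, prlTwoAgents); // two-sided p-value
        System.out.printf("p = %.4f -> %s at the 0.05 level%n",
                pValue, pValue < 0.05 ? "significant" : "not significant");
    }
}
```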

The observed improvement in performance also appears to scale well with the number of PRL agents applied to the problem. Reductions in AWT and AQL of the order of up to 7% and 5% respectively are possible when using 4 PRL agents compared to a standard single learning agent approach. Increases of up to 27% recorded in ANSV show that PRL increases exploration when compared to a single learner.

Table 1: Summary of Experimental Results (Average over Final 10 Episodes)

Experiment           AWT (s)   % Reduction AWT   AQL (veh)   % Reduction AQL   ANSV        % Increase ANSV
2 Phase, 1 agent     41.45     -                 12.74       -                 6,170.02    -
2 Phase, 2 agents    39.84     3.88 %            12.37       2.90 %            6,874.23    11.41 %
2 Phase, 3 agents    38.66     6.73 %            12.23       4.00 %            7,425.38    20.35 %
2 Phase, 4 agents    38.29     7.62 %            12.12       4.87 %            7,868.37    27.53 %
3 Phase, 1 agent     85.52     -                 15.06       -                 13,001.19   -
3 Phase, 2 agents    82.45     3.59 %            14.77       1.93 %            13,326.58   2.50 %
3 Phase, 3 agents    80.40     5.99 %            14.42       4.25 %            15,035.03   15.64 %
3 Phase, 4 agents    78.86     7.79 %            14.21       5.64 %            16,322.52   25.55 %

6.  Conclusions and Future Work

Here we have presented a novel method of Parallel Reinforcement Learning for Traffic Signal Control, and shown experimentally that it outperforms a single learning agent approach. Our PRL algorithm increased exploration, while also reducing waiting times and queue lengths when compared to a single RL agent learning on the same problem. By testing 2, 3 and 4 agents learning in parallel, we have also shown that additional performance gains are possible by increasing the number of parallel learners. In the future we plan to test this method more extensively on simulated real world traffic networks with multiple signalised junctions. We also wish to investigate the effect of using alternative exploration strategies with PRL. In previous work 7 we partitioned the state action space of a simple PRL problem, with a single PRL agent responsible for learning about each subset of the state action space, allowing for more focused exploration. Using this exploration strategy in conjunction with our PRL algorithm for TSC has produced promising initial results, and will allow for further improvements in performance beyond those presented in this paper.
