Expressway Vehicle Decision-Making and Control Based on A2C Reinforcement Learning
In the previous post on DQN-based highway decision control, a DQN agent was trained for vehicle decision-making but never produced a model with a high reward. This post tests the A2C algorithm on the same task.
Dependency versions
gym == 0.21.0
stable-baselines3 == 1.6.2
highway-env == 1.5
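Assuming a pip-based setup, the pinned versions above can be installed in one step (an install sketch; adjust for your own environment or virtualenv):

```shell
# Pin the exact versions used in this post
pip install gym==0.21.0 stable-baselines3==1.6.2 highway-env==1.5
```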
Environment configuration
Inspect the environment configuration:
import gym
import highway_env
from stable_baselines3 import A2C
# Create environment
env = gym.make("highway-fast-v0")
print(env.config)
The output is:
{
    'observation': {'type': 'Kinematics'},
    'action': {'type': 'DiscreteMetaAction'},
    'simulation_frequency': 5,
    'policy_frequency': 1,
    'other_vehicles_type': 'highway_env.vehicle.behavior.IDMVehicle',
    'screen_width': 600,
    'screen_height': 150,
    'centering_position': [0.3, 0.5],
    'scaling': 5.5,
    'show_trajectories': False,
    'render_agent': True,
    'offscreen_rendering': False,
    'manual_control': False,
    'real_time_rendering': False,
    'lanes_count': 3,
    'vehicles_count': 20,
    'controlled_vehicles': 1,
    'initial_lane_id': None,
    'duration': 30,
    'ego_spacing': 1.5,
    'vehicles_density': 1,
    'collision_reward': -1,
    'right_lane_reward': 0.1,
    'high_speed_reward': 0.4,
    'lane_change_reward': 0,
    'reward_speed_range': [20, 30],
    'offroad_terminal': False
}
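These defaults can be overridden before the environment is reset. As a sketch (the keys below are copied from the config printed above; in highway-env, overrides are applied with `env.configure`, and take effect on the next `env.reset()`):

```python
# Override a few of the defaults printed above (keys copied from env.config).
config_overrides = {
    "lanes_count": 4,      # widen the road to four lanes
    "vehicles_count": 30,  # add more surrounding traffic
    "duration": 40,        # longer episodes (in policy steps)
}

# Applied to a highway-env environment as (sketch):
# env.configure(config_overrides)
# env.reset()  # the new config takes effect on reset
print(config_overrides["lanes_count"])  # → 4
```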
The action space of the environment is:
ACTIONS_ALL = {
    0: 'LANE_LEFT',   # change to the left lane
    1: 'IDLE',        # idle (no action)
    2: 'LANE_RIGHT',  # change to the right lane
    3: 'FASTER',      # accelerate
    4: 'SLOWER'       # decelerate
}
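When logging rollouts, the raw action indices are hard to read. A small helper (hypothetical, not part of highway-env) can translate them back to names using the mapping above:

```python
# Mapping copied from the environment's action space above.
ACTIONS_ALL = {
    0: 'LANE_LEFT',
    1: 'IDLE',
    2: 'LANE_RIGHT',
    3: 'FASTER',
    4: 'SLOWER',
}

def describe_actions(indices):
    """Translate a list of action indices into their names."""
    return [ACTIONS_ALL[i] for i in indices]

print(describe_actions([3, 1, 4]))  # → ['FASTER', 'IDLE', 'SLOWER']
```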
Model training
import gym
import highway_env
from stable_baselines3 import A2C

# Create environment
env = gym.make("highway-fast-v0")
model = A2C("MlpPolicy",
            env,
            tensorboard_log="./logs",
            verbose=1)
model.learn(total_timesteps=25000)
model.save("a2c_highway")
The training curve is shown below:
As the figure shows, the model's reward gradually stabilizes after roughly 10K training steps.
Model testing
import gym
import highway_env
from stable_baselines3 import A2C

# Create environment
env = gym.make("highway-fast-v0")
# Load the trained model
model = A2C.load("a2c_highway", env=env)

episodes = 10
for ep in range(episodes):
    obs = env.reset()
    done = False
    rewards = 0
    actions = []
    while not done:
        action, _state = model.predict(obs, deterministic=True)
        action = action.item()
        actions.append(action)
        obs, reward, done, info = env.step(action)
        env.render()
        rewards += reward
    print('actions: {}'.format(actions))
    print('rewards: {}'.format(rewards))
Model output:
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 22.020221169036347
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 21.020221169036326
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 20.02022116903634
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 21.020221169036326
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 22.020221169036347
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 21.020221169036326
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 22.020221169036347
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 22.020221169036347
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 21.020221169036326
actions: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
rewards: 20.02022116903634
As the output shows, under the environment's default reward function the highest-reward policy is to execute action 4 at every step; per the action space above, the ego vehicle simply keeps decelerating ('SLOWER'). The corresponding test video: highway_fast_a2c.
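As a quick sanity check, the ten episode returns printed above can be summarized (values copied from the output, rounded to two decimals); the tight spread confirms the policy is degenerate but very consistent:

```python
# Episode returns copied (rounded) from the test output above.
returns = [22.02, 21.02, 20.02, 21.02, 22.02,
           21.02, 22.02, 22.02, 21.02, 20.02]

mean_return = sum(returns) / len(returns)
print(round(mean_return, 2))  # → 21.22
```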