In this section's example, we train a reinforcement learning agent to control the movement of a car in the MountainCar-v0 environment from OpenAI Gym. The goal of the task is for the agent to learn how to drive the car so that it successfully reaches the other side of the valley.
Example 6-8: Self-driving mountain car (source path: daima\6\MountainCar.ipynb)
6.5.1 Project Overview
This project is an extended version of the MountainCar-v0 problem. In this classic control problem, a car sits on a one-dimensional track between two "mountains". The goal is to get the underpowered car to the top of the mountain on the right; the only way to influence the car is to push it left or right with a fixed amount of force. We therefore use TF-Agents to apply reinforcement learning so that the car learns to drive back and forth, building up momentum until it reaches the summit. After studying the MountainCar-v0 environment in detail, we modified it in two ways:
- First, we changed the reward of the original environment to make training easier.
- Second, we changed the car's track to make the problem harder: the car now has to climb over two mountains.
This example uses the TF-Agents library, a third-party library that provides well-tested, modular components that can be modified and extended, making it easier to design, implement, and test new RL algorithms. TF-Agents supports rapid code iteration and comes with good test integration and benchmarking.
The overall workflow of this example is as follows:
- Set up the environment by installing the necessary dependencies and the packages used to render OpenAI Gym environments, including a virtual display for rendering.
- Mount a Google Drive directory in Colab to save training checkpoints and policies.
- Define a custom MountainCar environment named ChangeRewardMountainCarEnv. In this custom environment the reward function is modified: the step method computes the reward from the change in position and velocity, and adds a bonus when the goal is reached.
- The function RL_train trains a DQN agent with TensorFlow Agents. It sets up the Q-network, the optimizer, and the replay buffer, and defines the training loop in which data is collected and the agent's network is updated. It periodically evaluates the agent's performance and saves checkpoints.
- The code also includes functions for computing and plotting the average return and the average number of steps during training, as well as for creating videos that visualize the agent's behavior.
- The code trains a DQN agent both on the original MountainCar-v0 environment and on the modified NewMountainCarEnv environment with its custom reward function.
- The training progress is plotted, the trained agents are saved, and videos are created to visualize the agents' performance in both environments.
- Finally, the average reward and the average number of steps in both environments are computed and displayed.
In summary, this example shows how to use TensorFlow Agents to train a DQN agent to solve the MountainCar-v0 environment, both with the original reward function and with a modified environment that uses a custom reward function, and it provides visualization tools for evaluating the agent's performance.
6.5.2 Implementation
The example file MountainCar.ipynb was developed and run in Google Colab. The implementation proceeds as follows:
(1) Install the required libraries:
!sudo apt-get install -y xvfb ffmpeg
!pip install -q gym
!pip install -q 'imageio==2.4.0'
!pip install -q PILLOW
!pip install -q pyglet
!pip install -q pyvirtualdisplay
!pip install -q tf-agents
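In addition, the notebook mounts Google Drive (so that checkpoints and policies can be saved under tempdir) and imports the modules used in the following code. These cells are not reproduced in this excerpt; the snippet below is a reconstruction inferred from the code and should be treated as an assumption rather than the original cells:
# Mount Google Drive so checkpoints and policies can be saved (assumed cell; see the tempdir setting in step (4)).
from google.colab import drive
drive.mount('/content/drive')

# Start a virtual display so that Gym environments can be rendered in Colab.
import pyvirtualdisplay
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

import math
import os

import imageio
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from gym.envs.classic_control import MountainCarEnv
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import gym_wrapper, tf_py_environment, wrappers
from tf_agents.networks import q_network
from tf_agents.policies import policy_saver, random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common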
(2) The environment uses the following settings:
- Start state: the car's position on the x-axis is initialized to a uniformly random value in [−0.6, −0.4].
- Start velocity: the car's velocity is always initialized to 0.
- Episode termination: the episode ends when the car's position exceeds 0.5 (it has reached the hilltop) or when the episode length exceeds 200 steps.
- Track shape: the position corresponds to x and the height to y. The height is not part of the observation; it is determined by the position through y = sin(3x) (a quick way to plot this curve is sketched below).
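To get a feel for this track, you can plot the height curve yourself. The short snippet below is not part of the original notebook; it is only a quick visualization of the formula above:
import matplotlib.pyplot as plt
import numpy as np

# MountainCar-v0 positions range from -1.2 to 0.6; the goal is at 0.5.
xs = np.linspace(-1.2, 0.6, 200)
plt.plot(xs, np.sin(3 * xs))
plt.axvline(0.5, linestyle='--', label='goal position (0.5)')
plt.xlabel('position x')
plt.ylabel('height y = sin(3x)')
plt.legend()
plt.show()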
The code that implements this environment is shown below:
class ChangeRewardMountainCarEnv(MountainCarEnv):

    def __init__(self, goal_velocity=0):
        super(ChangeRewardMountainCarEnv, self).__init__(goal_velocity=goal_velocity)

    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))
        position, velocity = self.state
        #### Changed reward: mechanical energy (potential + kinetic) before the step
        past_reward = 100 * (np.sin(3 * position) * 0.0025 + 0.5 * velocity * velocity)
        velocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)
        velocity = np.clip(velocity, -self.max_speed, self.max_speed)
        position += velocity
        position = np.clip(position, self.min_position, self.max_position)
        if position == self.min_position and velocity < 0:
            velocity = 0
        done = bool(
            position >= self.goal_position and velocity >= self.goal_velocity
        )
        #### Changed reward: energy after the step; the reward is the energy gain
        now_reward = 100 * (np.sin(3 * position) * 0.0025 + 0.5 * velocity * velocity)
        reward = now_reward - past_reward
        if done:
            reward += 1
        self.state = (position, velocity)
        return np.array(self.state), reward, done, {}
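The function RL_train() defined in step (3) below also relies on two helpers, collect_data() and compute_avg_return(), which are not shown in this excerpt. A minimal sketch of them, modeled on the standard TF-Agents DQN tutorial (note that this version of compute_avg_return also returns the average episode length used later), could look like this:
def collect_step(environment, policy, buffer):
    # Take one step with the given policy and store the resulting transition in the replay buffer.
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
    # Collect a fixed number of environment steps into the replay buffer.
    for _ in range(steps):
        collect_step(env, policy, buffer)

def compute_avg_return(environment, policy, num_episodes=10):
    # Run the policy for num_episodes episodes and return the average return and average episode length.
    total_return = 0.0
    total_steps = 0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
            total_steps += 1
        total_return += episode_return
    return (total_return / num_episodes).numpy()[0], total_steps / num_episodes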
(3) Define the function RL_train(), which builds a DQN agent with TF-Agents and runs the training loop:
def RL_train(train_env, eval_env, fc_layer_params=(48, 64,), name='train'):
    global agent, random_policy, returns, steps
    # Q-network
    q_net = q_network.QNetwork(
        train_env.observation_spec(),
        train_env.action_spec(),
        fc_layer_params=fc_layer_params,
    )
    # Optimizer
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
    # DQN agent
    train_step_counter = tf.Variable(0)
    agent = dqn_agent.DqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        q_network=q_net,
        optimizer=optimizer,
        td_errors_loss_fn=common.element_wise_squared_loss,
        gamma=0.99,
        target_update_tau=0.005,
        train_step_counter=train_step_counter,
    )
    agent.initialize()
    # Policies
    eval_policy = agent.policy
    collect_policy = agent.collect_policy
    random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())
    # Replay buffer
    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=train_env.batch_size,
        max_length=replay_buffer_max_length)
    # Collect initial data
    collect_data(train_env, agent.policy, replay_buffer, initial_collect_steps)
    # Data pipeline
    dataset = replay_buffer.as_dataset(
        num_parallel_calls=4,
        sample_batch_size=batch_size,
        num_steps=2).prefetch(4)
    iterator = iter(dataset)
    # Trajectory: store one sample transition in the replay buffer
    time_step = train_env.current_time_step()
    action_step = agent.collect_policy.action(time_step)
    next_time_step = train_env.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    replay_buffer.add_batch(traj)
    # Reset the train step
    agent.train_step_counter.assign(0)
    # Evaluate the agent's policy once before training
    avg_return, avg_step = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    returns = [avg_return]
    steps = [avg_step]
    # Training loop
    for _ in range(num_iterations):
        # Collect a few steps using collect_policy and save them to the replay buffer.
        collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)
        # Sample a batch of data from the buffer and update the agent's network.
        experience, unused_info = next(iterator)
        train_loss = agent.train(experience).loss
        step = agent.train_step_counter.numpy()
        if step % log_interval == 0:
            print('step = {0}: loss = {1}'.format(step, train_loss))
        if step % eval_interval == 0:
            avg_return, avg_step = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
            print('step = {0}: Average Return = {1}, Average Steps = {2}'.format(step, avg_return, avg_step))
            returns.append(avg_return)
            steps.append(avg_step)
    # Save the agent and the policy
    checkpoint_dir = os.path.join(tempdir, 'checkpoint' + name)
    global_step = tf.compat.v1.train.get_or_create_global_step()
    train_checkpointer = common.Checkpointer(
        ckpt_dir=checkpoint_dir,
        max_to_keep=1,
        agent=agent,
        policy=agent.policy,
        replay_buffer=replay_buffer,
        global_step=global_step
    )
    policy_dir = os.path.join(tempdir, 'policy' + name)
    tf_policy_saver = policy_saver.PolicySaver(agent.policy)
    train_checkpointer.save(global_step)
    tf_policy_saver.save(policy_dir)
(4) In MountainCar-v0, the input data are the observations and the output data are the combination of the action and the reward. Next, set the hyperparameters, wrap the environments, and start training:
num_iterations = 100000 # @param {type:"integer"}
initial_collect_steps = 100 # @param {type:"integer"}
collect_steps_per_iteration = 1 # @param {type:"integer"}
replay_buffer_max_length = 100000 # @param {type:"integer"}
batch_size = 256 # @param {type:"integer"}
learning_rate = 1e-3 # @param {type:"number"}
log_interval = 200 # @param {type:"integer"}
num_eval_episodes = 10 # @param {type:"integer"}
eval_interval = 1000 # @param {type:"integer"}
tempdir = '/content/drive/MyDrive/5242/Project' # @param {type:"string"}
train_py_env = gym_wrapper.GymWrapper(
    ChangeRewardMountainCarEnv(),
    discount=1,
    spec_dtype_map=None,
    auto_reset=True,
    render_kwargs=None,
)
eval_py_env = gym_wrapper.GymWrapper(
    ChangeRewardMountainCarEnv(),
    discount=1,
    spec_dtype_map=None,
    auto_reset=True,
    render_kwargs=None,
)
train_py_env = wrappers.TimeLimit(train_py_env, duration=200)
eval_py_env = wrappers.TimeLimit(eval_py_env, duration=200)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
RL_train(train_env, eval_env, fc_layer_params = (48,64,), name = '_train')
(5) Plot two charts that visualize the average return and the average number of steps during training:
iterations = range(len(returns))
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.show()

iterations = range(len(steps))
plt.plot(iterations, steps)
plt.ylabel('Average Step')
plt.xlabel('Iterations')
plt.show()
The result is shown in Figure 6-3.
Figure 6-3 Execution result
(6) Write the function create_policy_eval_video() to create a policy-evaluation video:
def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
    filename = filename + ".mp4"
    with imageio.get_writer(filename, fps=fps) as video:
        for _ in range(num_episodes):
            time_step = eval_env.reset()
            video.append_data(eval_py_env.render())
            while not time_step.is_last():
                action_step = policy.action(time_step)
                time_step = eval_env.step(action_step.action)
                video.append_data(eval_py_env.render())
    return embed_mp4(filename)

create_policy_eval_video(saved_policy, "trained-agent", 5, 60)

N = 200
now_reward, now_step = compute_avg_return(eval_env, saved_policy, N)
print('Average reward for %d consecutive trials: %f' % (N, now_reward))
print('Average step for %d consecutive trials: %f' % (N, now_step))
create_policy_eval_video(random_policy, "random-agent", 5, 60)
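Note that embed_mp4() and saved_policy are used above but are not defined in this excerpt. Presumably embed_mp4() is the small helper from the TF-Agents tutorials that embeds an MP4 file in a Colab cell, and saved_policy is the policy saved by RL_train() and loaded back from Google Drive. A sketch under those assumptions:
import base64
import IPython

def embed_mp4(filename):
    # Embed an MP4 video file so that it plays inline in the Colab notebook.
    video = open(filename, 'rb').read()
    b64 = base64.b64encode(video)
    tag = '''
    <video width="640" height="480" controls>
      <source src="data:video/mp4;base64,{0}" type="video/mp4">
    Your browser does not support the video tag.
    </video>'''.format(b64.decode())
    return IPython.display.HTML(tag)

# Load the policy that RL_train() saved under tempdir (the name argument was '_train' in step (4)).
saved_policy = tf.saved_model.load(os.path.join(tempdir, 'policy_train'))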
(7) Modify the environment:
- Start state: the car's position on the x-axis is initialized to a uniformly random value in [−0.8, −0.2].
- Start velocity: the car's velocity is always initialized to 0.
- Episode termination: the episode ends when the car's position exceeds 4.7 (it has reached the top of the rightmost mountain) or when the episode length exceeds 500 steps.
- Track shape: the position corresponds to x and the height to y. The height is not part of the observation; it is determined by the position through the following piecewise formula (as implemented in the _cal_ypos method below, with x_L = −1.2):
  y = 3(x − x_L)cos(3x_L) + sin(3x_L)   for x < x_L
  y = sin(3x)                           for x_L ≤ x < π/2
  y = −3sin(x) + 2                      for x ≥ π/2
The code that implements this environment is as follows:
class NewMountainCarEnv(MountainCarEnv):

    def __init__(self, goal_velocity=0):
        super(NewMountainCarEnv, self).__init__(goal_velocity=goal_velocity)
        self.min_position = -2
        self.left_position = -1.2
        self.middle_position = np.pi / 2
        self.max_position = 5
        self.max_speed = 0.2
        self.goal_position = 4.7
        self.goal_velocity = goal_velocity

    def _cal_ypos(self, x):
        # Piecewise track height: the car now has to cross two mountains.
        if x < self.left_position:
            return 3 * (x - self.left_position) * np.cos(3 * self.left_position) + np.sin(3 * self.left_position)
        elif x < self.middle_position:
            return np.sin(3 * x)
        else:
            return -3 * np.sin(x) + 2

    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))
        position, velocity = self.state
        #### Changed reward
        past_reward = 100 * (self._cal_ypos(position) * 0.0025 + 0.5 * velocity * velocity)
        if position < self.left_position:
            velocity += (action - 1) * self.force + math.cos(3 * self.left_position) * (-self.gravity)
        elif position < self.middle_position:
            velocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)
        else:
            velocity += (action - 1) * self.force - math.cos(position) * (-self.gravity)
        velocity = np.clip(velocity, -self.max_speed, self.max_speed)
        position += velocity
        position = np.clip(position, self.min_position, self.max_position)
        if (position == self.min_position and velocity < 0):
            velocity = 0
        done = bool(
            position >= self.goal_position and velocity >= self.goal_velocity
        )
        #### Changed reward
        now_reward = 100 * (self._cal_ypos(position) * 0.0025 + 0.5 * velocity * velocity)
        reward = now_reward - past_reward
        if done:
            reward += 5
        self.state = (position, velocity)
        return np.array(self.state), reward, done, {}

    def reset(self):
        self.state = np.array([self.np_random.uniform(low=-0.8, high=-0.2), 0])
        return np.array(self.state)

    def _height(self, xs):
        try:
            ys = []
            for s in xs:
                ys += [self._cal_ypos(s) * .45 + .55]
            return np.asarray(ys)
        except:
            return self._cal_ypos(xs) * .45 + .55

    def render(self, mode='human'):
        screen_width = 600
        screen_height = 400
        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 20
        carheight = 10
        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(screen_width, screen_height)
            xs = np.linspace(self.min_position, self.max_position, 300)
            ys = self._height(xs)
            xys = list(zip((xs - self.min_position) * scale, ys * scale))
            self.track = rendering.make_polyline(xys)
            self.track.set_linewidth(4)
            self.viewer.add_geom(self.track)
            clearance = 5
            l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
            car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])
            car.add_attr(rendering.Transform(translation=(0, clearance)))
            self.cartrans = rendering.Transform()
            car.add_attr(self.cartrans)
            self.viewer.add_geom(car)
            frontwheel = rendering.make_circle(carheight / 2.5)
            frontwheel.set_color(.5, .5, .5)
            frontwheel.add_attr(
                rendering.Transform(translation=(carwidth / 4, clearance))
            )
            frontwheel.add_attr(self.cartrans)
            self.viewer.add_geom(frontwheel)
            backwheel = rendering.make_circle(carheight / 2.5)
            backwheel.add_attr(
                rendering.Transform(translation=(-carwidth / 4, clearance))
            )
            backwheel.add_attr(self.cartrans)
            backwheel.set_color(.5, .5, .5)
            self.viewer.add_geom(backwheel)
            flagx = (self.goal_position - self.min_position) * scale
            flagy1 = self._height(self.goal_position) * scale
            flagy2 = flagy1 + 50
            flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon(
                [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)]
            )
            flag.set_color(.8, .8, 0)
            self.viewer.add_geom(flag)
………
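The notebook then trains a second DQN agent on NewMountainCarEnv in the same way as in step (4). That cell is not reproduced in this excerpt; based on the settings above (an episode limit of 500 steps), it presumably looks roughly like the following sketch, in which the run name '_new_train' is only an assumed label:
# Wrap the modified environment and train a second agent on it (sketch mirroring step (4)).
train_py_env = gym_wrapper.GymWrapper(NewMountainCarEnv(), discount=1, auto_reset=True)
eval_py_env = gym_wrapper.GymWrapper(NewMountainCarEnv(), discount=1, auto_reset=True)
train_py_env = wrappers.TimeLimit(train_py_env, duration=500)   # episodes now last up to 500 steps
eval_py_env = wrappers.TimeLimit(eval_py_env, duration=500)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
RL_train(train_env, eval_env, fc_layer_params=(48, 64,), name='_new_train')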
(8) Plot the two new charts of the average return and the average number of steps:
iterations = range(len(returns))
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.show()

iterations = range(len(steps))
plt.plot(iterations, steps)
plt.ylabel('Average Step')
plt.xlabel('Iterations')
plt.show()
The result is shown in Figure 6-4.
Figure 6-4 The new average return curves
Open the generated MP4 file to watch the animation for the modified environment: the car now has to climb over two mountains on its own, as shown in Figure 6-5.
Figure 6-5 Automatically climbing over two mountains