Training Atari Games with TF-Agents' Episodic Policy Gradient (REINFORCE) Algorithm

In my previous blog post I used the DQN model from TensorFlow's TF-Agents library to train on the Atari PONG game, with very good results. This time I want to try the episodic policy gradient model and see whether it can reach the same level. For an introduction to the episodic policy gradient algorithm, see my earlier post 强化学习笔记(5)-回合策略梯度算法_gzroy的博客-CSDN博客.

TF-Agents provides a ReinforceAgent that implements the episodic policy gradient algorithm. This agent needs an Actor Network that takes the environment observation as input and outputs an action preference value $h(s,a;\theta)$ for each action; applying Softmax to these values gives $\pi(a|s;\theta)$, i.e. the probability of each action. The loss function is defined as $-\gamma^{t}G_{t}\ln\pi(A_{t}|S_{t};\theta)$, where $t$ is a step within the episode and $G_{t}$ is the return from that step onward. Continually optimizing $\theta$ to reduce this loss increases the expected episode return, and so leads to the optimal policy.
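
To make the loss concrete, here is a minimal sketch of how the per-episode REINFORCE loss could be computed from raw network outputs. It only illustrates the formula above and is not the TF-Agents implementation; the names reinforce_loss, logits, actions and rewards are my own:

import tensorflow as tf

def reinforce_loss(logits, actions, rewards, gamma=0.99):
    # logits: [T, num_actions] network outputs h(s, a; theta) for one episode
    # actions: T int32 action indices A_t; rewards: Python list of T float rewards
    T = len(rewards)
    # Compute the return G_t for every step by working backwards through the episode
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = tf.constant(returns[::-1], dtype=tf.float32)
    # log pi(A_t|S_t; theta) is the log-softmax of the network outputs at the taken actions
    log_probs = tf.nn.log_softmax(logits)
    log_pi_a = tf.gather_nd(log_probs, tf.stack([tf.range(T), actions], axis=1))
    # Loss = -sum_t gamma^t * G_t * ln pi(A_t|S_t; theta)
    discounts = tf.pow(gamma, tf.range(T, dtype=tf.float32))
    return -tf.reduce_sum(discounts * returns * log_pi_a)

Once we hand ReinforceAgent a batched Trajectory (as we do later in the training loop), it computes its own version of this loss internally, so this helper is never called in the code below.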

First, load the Atari game environment with the following code:

from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.trajectories import trajectory, Trajectory, time_step, TimeStep
from tf_agents.specs import tensor_spec
from tqdm import trange
import tensorflow as tf
from tensorflow import keras
from tf_agents.agents import ReinforceAgent
from tf_agents.utils import common
from tf_agents.networks.actor_distribution_network import ActorDistributionNetwork
import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib notebook
import os

env_name = 'PongDeterministic-v4'
train_py_env = suite_gym.load(env_name, max_episode_steps=0)
eval_py_env = suite_gym.load(env_name, max_episode_steps=0)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
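
As an optional sanity check (my own addition, not part of the original workflow), we can print the raw specs before changing anything; for PongDeterministic-v4 the observation is expected to be a 210*160*3 uint8 frame and the action spec a small set of discrete actions:

print(train_env.observation_spec())   # expected: uint8 frames of shape (210, 160, 3)
print(train_env.action_spec())        # discrete Pong actions (NOOP, FIRE, UP, DOWN, ...)
print(train_env.time_step_spec())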

Because we want to feed the network the most recent 4 frames of the environment, we need to modify the environment's default TimeStep spec. Each raw frame is 210*160*3; we shrink it, convert it to grayscale, and stack the 4 frames together, so the adjusted input is 110*84*4, as in the following code:

input_shape = (110, 84, 4)
time_step_spec = train_env.time_step_spec()
new_observation_spec = tensor_spec.BoundedTensorSpec(input_shape, tf.uint8, 0, 255, 'observation')
new_time_step_spec = time_step.TimeStep(
    time_step_spec.step_type, 
    time_step_spec.reward, 
    time_step_spec.discount, 
    new_observation_spec)

Next, define a ReinforceAgent. The neural network inside it has three convolutional layers that extract features from the observed images, followed by a fully connected layer from which the action probabilities are produced.

gamma = 0.99
learning_rate = 0.0005 

actor_net = ActorDistributionNetwork(
    new_observation_spec,
    train_env.action_spec(),
    preprocessing_layers=tf.keras.layers.Rescaling(scale=1./127.5, offset=-1),
    conv_layer_params=[(32,8,4), (64,4,2), (64,3,1)],
    fc_layer_params=(512,))

global_step = tf.compat.v1.train.get_or_create_global_step()

agent = ReinforceAgent(
    new_time_step_spec,
    train_env.action_spec(),
    actor_network=actor_net,
    gamma=gamma,
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    train_step_counter=global_step)

agent.initialize()
action_min = agent.collect_data_spec.action.minimum
action_max = agent.collect_data_spec.action.maximum
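
Before training, a quick check (my own addition, not required for training) is to push a dummy observation that matches the new spec through the collect policy and confirm that the sampled action falls inside the valid range:

dummy_observation = tf.zeros((1,) + input_shape, dtype=tf.uint8)
dummy_time_step = TimeStep(
    step_type=tf.constant([0], dtype=tf.int32),
    reward=tf.constant([0.0], dtype=tf.float32),
    discount=tf.constant([1.0], dtype=tf.float32),
    observation=dummy_observation)
sampled = agent.collect_policy.action(dummy_time_step).action
assert action_min <= tf.squeeze(sampled).numpy() <= action_max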

Define a helper function that resizes each frame returned by the environment, converts it to grayscale, and stacks it together with the previous three frames:

def get_observation(images, observation):
    image = tf.squeeze(observation)
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, [input_shape[0],input_shape[1]])
    image = tf.cast(image, tf.uint8)
    image = tf.squeeze(image)
    if len(images)==0:
        images = [image, image, image, image]
    images = images[1:]
    images.append(image)
    observation = tf.stack(images)
    observation = tf.transpose(observation, perm=[1,2,0])
    return images, observation
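
For example (an illustrative check, not part of the training script), the first call after a reset fills the frame buffer with four copies of the initial frame and returns a 110*84*4 observation:

images = []
first_step = train_env.reset()
images, stacked = get_observation(images, first_step.observation)
print(stacked.shape)   # expected: (110, 84, 4)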

Define a plotting class to display the loss values and episode rewards recorded during training:

class Chart:
    def __init__(self):
        self.fig, self.ax = plt.subplots(figsize = (8, 6))
 
    def plot(self, data, x_name, y_name, hue_name):
        self.ax.clear()
        sns.lineplot(data=data, x=x_name, y=y_name, hue=hue_name, ax=self.ax)
        self.fig.canvas.draw()

chart_reward = Chart()
chart_loss = Chart()

Because the whole training process takes a very long time, define a checkpointer to save the model parameters periodically:

checkpoint_dir = os.path.join('./', 'checkpoint')
train_checkpointer = common.Checkpointer(
    ckpt_dir=checkpoint_dir,
    max_to_keep=1,
    agent=agent,
    policy=agent.policy,
    global_step=global_step
)

# If continuing training, restore from the checkpoint and reload the recorded metrics
continue_training = False  # set to True to resume from a previously saved checkpoint
if continue_training:
    train_checkpointer.initialize_or_restore()
    global_step = tf.compat.v1.train.get_global_step()
    rewards_df = pd.read_csv('rewards_df.csv')
    loss_df = pd.read_csv('loss_df.csv')
else:
    rewards_df = pd.DataFrame(data=None, columns=['step','reward','type'])
    loss_df = pd.DataFrame(data=None, columns=['step','loss','type'])

Now we can start training. In each iteration the model first plays one full episode to collect trajectory data, the collected data is wrapped into a Trajectory, and the ReinforceAgent's train method is called on it. Every 100 iterations we record the average loss and the episode reward so we can watch the training progress:

agent.train = common.function(agent.train)
num_iterations = 1000

total_loss = 0
for _ in trange(num_iterations):
    time_step = train_env.reset()
    images = []
    observations = []
    actions = []
    policy_infos = []
    rewards = []
    discounts = []
    episode_reward = 0
    # Play one episode, collecting the stacked observations, actions, rewards and discounts
    while not time_step.is_last():
        images, observation = get_observation(images, time_step.observation)
        step_type = tf.squeeze(time_step.step_type)
        observations.append(observation)
        time_step = TimeStep(time_step.step_type, time_step.reward, time_step.discount, tf.expand_dims(observation, axis=0))
        action = tf.squeeze(agent.policy.action(time_step).action)
        actions.append(action)
        time_step = train_env.step(action)
        next_step_type = tf.squeeze(time_step.step_type)
        reward = tf.squeeze(time_step.reward)
        rewards.append(reward)
        discount = tf.squeeze(time_step.discount)
        discounts.append(discount)
        episode_reward += reward.numpy()
    # Wrap the whole episode into a batched Trajectory and train the agent on it
    observation_t = tf.stack(observations)
    action_t = tf.stack(actions)
    reward_t = tf.stack(rewards)
    discount_t = tf.stack(discounts)
    traj = trajectory.from_episode(observation=observation_t, action=action_t, reward=reward_t, discount=discount_t, policy_info=reward_t)
    batch = tf.nest.map_structure(lambda t: tf.expand_dims(t, 0), traj)
    train_loss = agent.train(batch).loss
    total_loss += train_loss
    if (global_step%100)==0:
        loss_df = loss_df.append({'step':global_step.numpy(), 'loss':total_loss.numpy()/100, 'type':'train'}, ignore_index=True)
        chart_loss.plot(loss_df, 'step', 'loss', 'type')
        rewards_df = rewards_df.append({'step':global_step.numpy(), 'reward':episode_reward, 'type':'train'}, ignore_index=True)
        chart_reward.plot(rewards_df, 'step', 'reward', 'type')
        train_checkpointer.save(global_step)
        total_loss = 0
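
After (or during) training, a simple way to gauge the learned policy is to play one episode on eval_env and report its total reward. The sketch below is my own addition, reusing get_observation, and is not part of the original training loop:

def evaluate_one_episode():
    images = []
    time_step = eval_env.reset()
    total_reward = 0.0
    while not time_step.is_last():
        images, observation = get_observation(images, time_step.observation)
        eval_step = TimeStep(time_step.step_type, time_step.reward, time_step.discount,
                             tf.expand_dims(observation, axis=0))
        action = tf.squeeze(agent.policy.action(eval_step).action)
        time_step = eval_env.step(action)
        total_reward += tf.squeeze(time_step.reward).numpy()
    return total_reward

print('Evaluation episode reward:', evaluate_one_episode())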
