Tianshou + OpenAI Gym Reinforcement Learning on Atari Game Environments (with Complete Code)

Environment Setup

Installing tianshou + PyTorch

1. First, install the tianshou library

pip install tianshou

2. Since Tianshou is built on PyTorch, you also need to install a PyTorch build that matches your machine (CUDA version, OS, etc.)

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
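After installing, a quick sanity check (a minimal sketch) confirms that PyTorch is installed and can see your GPU:

import torch

print(torch.__version__)          # e.g. 1.9.0+cu111 for the command above
print(torch.cuda.is_available())  # True if the CUDA build matches your GPU driver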

Installing gym + the Atari environments

1. First, install gym

pip install gym

2. Import the Atari ROMs
Download URL: http://www.atarimania.com/rom_collection_archive_atari_2600_roms.html
Download the ROM collection to your Downloads directory and extract it there.

Then run:

python3 -m atari_py.import_roms ./

Verify the installation with the following code:

import gym
import time
env = gym.make('MsPacman-ram-v0')  # create the game environment
env.seed(1)  # optional: fix the random seed so the run is reproducible

s = env.reset()  # reset the environment and get the initial state
while True:  # one iteration per environment step
    time.sleep(0.05)
    env.render()  # render the environment
    a = env.action_space.sample()  # the agent picks a random action
    s_, r, done, info = env.step(a)  # next state, reward, done flag and extra info after taking action a
    # print("s_", s_)
    print("r", r)
    # print("done", done)
    # print("info", info)
    if done:
        break

Other notes:

NOTE 1: env.render() throws an error

 If you are connecting via SSH or another headless remote shell, remember to comment out
    env.render()  # render the environment
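Alternatively, a small flag keeps the same loop usable both locally and over SSH (a minimal sketch; RENDER is a name introduced here just for illustration):

import gym
import time

RENDER = False  # set to True only when a display is available

env = gym.make('MsPacman-ram-v0')
s = env.reset()
while True:
    a = env.action_space.sample()
    s_, r, done, info = env.step(a)
    if RENDER:
        env.render()
        time.sleep(0.05)
    if done:
        break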

NOTE 2: Windows installation issue: 'module could not be found' when running gym.make

  The installation steps on Windows are similar, but you may hit a missing C++ runtime library error. The following steps fix it:

1. Uninstall gym and atari-py (if already installed):

pip uninstall atari-py
pip uninstall gym[atari]

2. Download the Visual Studio Build Tools: https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=BuildTools&rel=16

3. Run the Build Tools installer, select "C++ build tools", and install it.

4. Restart your computer

5. Install cmake, atari-py, and gym

pip install cmake
pip install atari-py
pip install gym[atari]

6. Test the installation:

import atari_py
print(atari_py.list_games())

Reference:

Fix for the Windows error 'module could not be found' when running gym.make for an Atari environment
openai/atari-py installation reference
openai/atari environment
gym demo code
PyTorch official site (find the PyTorch build that matches your machine)

Reinforcement learning for Atari games with RAM-type input: code implementation

To start, a quick plug for the tianshou reinforcement learning library:

tianshou GitHub repository
tianshou Tutorial

Walking through the official Deep Q-learning example

We can first look at how Deep Q-learning is implemented, following the tianshou Tutorial:

  1. Create the environment
import gym
import tianshou as ts
env = gym.make('CartPole-v0')
  2. Set up vectorized environment wrappers
    Tianshou supports parallel sampling for all algorithms. It provides four types of vectorized environment wrappers: DummyVectorEnv, SubprocVectorEnv, ShmemVectorEnv, and RayVectorEnv. They can be used as follows (more details can be found in the Parallel Sampling documentation):
train_envs = gym.make('CartPole-v0')
test_envs = gym.make('CartPole-v0')

train_envs = ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)])
test_envs = ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(100)])
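The other wrappers are drop-in replacements when you want real parallelism; for example, a subprocess-based version of the same setup (a minimal sketch):

train_envs = ts.env.SubprocVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)])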

NOTE:
If you use your own environment, make sure its seed method is implemented correctly (for example, as below); otherwise the parallel environments may all produce identical outputs.

def seed(self, seed):
    np.random.seed(seed)
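For context, a minimal sketch of where such a seed method lives in a custom environment (MyEnv is a hypothetical class name; reset, step, and the spaces are omitted):

import gym
import numpy as np

class MyEnv(gym.Env):
    def seed(self, seed=None):
        # without this, parallel copies of the environment may all behave identically
        np.random.seed(seed)
        return [seed]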
  3. Build the network
import torch, numpy as np
from torch import nn

class Net(nn.Module):
    def __init__(self, state_shape, action_shape):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(np.prod(state_shape), 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state

state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.shape or env.action_space.n
net = Net(state_shape, action_shape)
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
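Before handing the network to a policy, a quick forward pass (a sanity-check sketch) confirms that the input and output shapes line up:

obs = env.reset()                        # a single CartPole observation, shape (4,)
logits, _ = net(np.expand_dims(obs, 0))  # add a batch dimension -> input shape (1, 4)
print(logits.shape)                      # torch.Size([1, 2]) for CartPole-v0's two actions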
  4. Instantiate the policy
policy = ts.policy.DQNPolicy(net, optim, discount_factor=0.9, estimation_step=3, target_update_freq=320)
  5. Set up the collectors
    The Collector is a key concept in tianshou. It allows the policy to interact conveniently with different types of environments. At each step, the collector lets the policy run for (at least) a specified number of steps or episodes and stores the resulting data in the replay buffer.
train_collector = ts.data.Collector(policy, train_envs, ts.data.VectorReplayBuffer(20000, 10), exploration_noise=True)
test_collector = ts.data.Collector(policy, test_envs, exploration_noise=True)
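As a quick check that everything is wired up, you can collect a handful of transitions with the still-untrained policy (a sketch; the exact keys of the returned statistics dict vary across tianshou versions):

stats = train_collector.collect(n_step=100)  # run the untrained policy for 100 environment steps
print(stats)                                 # e.g. number of steps/episodes collected and episode returns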
  6. Train the policy with a trainer
    The trainer automatically stops training once the policy reaches the stop condition stop_fn on the test collector. Since DQN is an off-policy algorithm, we use offpolicy_trainer():
result = ts.trainer.offpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
    update_per_step=0.1, episode_per_test=100, batch_size=64,
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),
    stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold)
print(f'Finished training! Use {result["duration"]}')

(See the tianshou hyper-parameter-tutorial for what each of these arguments means.)
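The trainer can also log training curves to TensorBoard by passing a logger; a minimal sketch matching what the full script below does (the log directory is just an example):

from torch.utils.tensorboard import SummaryWriter
from tianshou.utils import BasicLogger

writer = SummaryWriter('log/CartPole-v0/dqn')  # directory for TensorBoard event files
logger = BasicLogger(writer)
# pass logger=logger to ts.trainer.offpolicy_trainer(...) to record the training curves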

  7. Save / load the weights
torch.save(policy.state_dict(), 'dqn.pth')
policy.load_state_dict(torch.load('dqn.pth'))
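Besides the network weights, the collected replay buffer itself can be pickled, which the full script at the end of this post does to dump an expert dataset (a sketch; the file name here is just an example):

import pickle

pickle.dump(train_collector.buffer, open('dqn_buffer.pkl', 'wb'))  # save the replay buffer
buf = pickle.load(open('dqn_buffer.pkl', 'rb'))                    # load it back later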
  8. Watch the agent's performance
policy.eval()
policy.set_eps(0.05)
collector = ts.data.Collector(policy, env, exploration_noise=True)
collector.collect(n_episode=1, render=1 / 35)
  9. Write your own training loop
# pre-collect at least 5000 transitions with random action before training
train_collector.collect(n_step=5000, random=True)

policy.set_eps(0.1)
for i in range(int(1e6)):  # total step
    collect_result = train_collector.collect(n_step=10)

    # once if the collected episodes' mean returns reach the threshold,
    # or every 1000 steps, we test it on test_collector
    if collect_result['rews'].mean() >= env.spec.reward_threshold or i % 1000 == 0:
        policy.set_eps(0.05)
        result = test_collector.collect(n_episode=100)
        if result['rews'].mean() >= env.spec.reward_threshold:
            print(f'Finished training! Test mean returns: {result["rews"].mean()}')
            break
        else:
            # back to training eps
            policy.set_eps(0.1)

    # train policy with a sampled batch data from buffer
    losses = policy.update(64, train_collector.buffer)

Adapting the Deep Q-learning example

Example path: tianshou/test/discrete/test_dqn.py

import os
import gym
import torch
import pickle
import pprint
import argparse
import numpy as np
from torch.utils.tensorboard import SummaryWriter

from tianshou.policy import DQNPolicy
from tianshou.utils import BasicLogger
from tianshou.env import DummyVectorEnv
from tianshou.utils.net.common import Net
from tianshou.trainer import offpolicy_trainer
from tianshou.data import Collector, VectorReplayBuffer, PrioritizedVectorReplayBuffer

'''
max_epoch: the maximum number of training epochs; training may stop earlier if the stop_fn condition is met

step_per_epoch: the number of environment steps to run in each epoch

step_per_collect: the number of environment steps to collect before each round of network updates; with the settings above the network is updated after every 10 collected frames

episode_per_test: the number of episodes (rollouts) used for each evaluation

batch_size: the batch size of data sampled from the buffer for each policy update

train_fn: a hook called before training in each epoch; its inputs are the current epoch number and the total number of environment steps taken so far. Here it sets epsilon to 0.1 before each training phase

test_fn: a hook called before testing in each epoch, with the same inputs. Here it sets epsilon to 0.05 before each test phase

stop_fn: the stopping condition; it receives the current mean undiscounted return and returns whether training should stop

logger/writer: tianshou supports TensorBoard; see the SummaryWriter / BasicLogger setup below in this script
'''
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--task', type=str, default='MsPacman-ram-v0')
    parser.add_argument('--seed', type=int, default=1626)
    parser.add_argument('--eps-test', type=float, default=0.05)
    parser.add_argument('--eps-train', type=float, default=0.1)
    parser.add_argument('--buffer-size', type=int, default=20000)
    parser.add_argument('--lr', type=float, default=1e-3)
    parser.add_argument('--gamma', type=float, default=0.9)
    parser.add_argument('--n-step', type=int, default=3)
    parser.add_argument('--target-update-freq', type=int, default=320)
    parser.add_argument('--epoch', type=int, default=20)
    parser.add_argument('--step-per-epoch', type=int, default=20000)
    parser.add_argument('--step-per-collect', type=int, default=10)
    parser.add_argument('--update-per-step', type=float, default=0.1)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--hidden-sizes', type=int,
                        nargs='*', default=[512, 256, 128, 128])
    parser.add_argument('--training-num', type=int, default=10)
    parser.add_argument('--test-num', type=int, default=100)
    parser.add_argument('--logdir', type=str, default='log')
    parser.add_argument('--render', type=float, default=0.)  # seconds between rendered frames; adjust this to change the display FPS
    parser.add_argument('--prioritized-replay',
                        action="store_true", default=False)
    parser.add_argument('--alpha', type=float, default=0.6)
    parser.add_argument('--beta', type=float, default=0.4)
    parser.add_argument(
        '--save-buffer-name', type=str,
        default="./expert_DQN_MsPacman-ram-v0.pkl")
    parser.add_argument(
        '--device', type=str,
        default='cuda' if torch.cuda.is_available() else 'cpu')
    args = parser.parse_known_args()[0]
    return args


def test_dqn(args=get_args()):
    env = gym.make(args.task)
    args.state_shape = env.observation_space.shape or env.observation_space.n
    args.action_shape = env.action_space.shape or env.action_space.n


    print("args.state_shape {} ,args.action_shape {} ".format(args.state_shape,args.action_shape))

    # train_envs = gym.make(args.task)
    # you can also use tianshou.env.SubprocVectorEnv
    train_envs = DummyVectorEnv(
        [lambda: gym.make(args.task) for _ in range(args.training_num)])
    # test_envs = gym.make(args.task)
    test_envs = DummyVectorEnv(
        [lambda: gym.make(args.task) for _ in range(args.test_num)])
    # seed
    print("seed")
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)
    # Q_param = V_param = {"hidden_sizes": [128]}
    # model
    net = Net(args.state_shape, args.action_shape,
              hidden_sizes=args.hidden_sizes, device=args.device,
              # dueling=(Q_param, V_param),
              ).to(args.device)
    optim = torch.optim.Adam(net.parameters(), lr=args.lr)
    policy = DQNPolicy(
        net, optim, args.gamma, args.n_step,
        target_update_freq=args.target_update_freq)
    # buffer
    if args.prioritized_replay:
        buf = PrioritizedVectorReplayBuffer(
            args.buffer_size, buffer_num=len(train_envs),
            alpha=args.alpha, beta=args.beta)
    else:
        buf = VectorReplayBuffer(args.buffer_size, buffer_num=len(train_envs))
    # collector
    train_collector = Collector(policy, train_envs, buf, exploration_noise=True)
    test_collector = Collector(policy, test_envs, exploration_noise=True)
    # policy.set_eps(1)
    train_collector.collect(n_step=args.batch_size * args.training_num)
    # log
    log_path = os.path.join(args.logdir, args.task, 'dqn')
    writer = SummaryWriter(log_path)
    logger = BasicLogger(writer)

    def save_fn(policy):
        torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth'))

    def stop_fn(mean_rewards):
        print("mean_rewards",mean_rewards)
        return mean_rewards >= 5000

    def train_fn(epoch, env_step):
        # eps annealing, just a demo
        if env_step <= 10000:
            policy.set_eps(args.eps_train)
        elif env_step <= 50000:
            eps = args.eps_train - (env_step - 10000) / \
                40000 * (0.9 * args.eps_train)
            policy.set_eps(eps)
        else:
            policy.set_eps(0.1 * args.eps_train)

    def test_fn(epoch, env_step):
        policy.set_eps(args.eps_test)

    # trainer
    print("trainer {} ".format("trainer"))
    result = offpolicy_trainer(
        policy, train_collector, test_collector, args.epoch,
        args.step_per_epoch, args.step_per_collect, args.test_num,
        args.batch_size, update_per_step=args.update_per_step, train_fn=train_fn,
        test_fn=test_fn, stop_fn=stop_fn, save_fn=save_fn, logger=logger)
    assert stop_fn(result['best_reward'])

    if __name__ == '__main__':
        pprint.pprint(result)
        # Let's watch its performance!
        env = gym.make(args.task)
        policy.eval()
        policy.set_eps(args.eps_test)
        collector = Collector(policy, env)
        result = collector.collect(n_episode=1, render=args.render)
        rews, lens = result["rews"], result["lens"]
        print(f"Final reward: {rews.mean()}, length: {lens.mean()}")

    # save buffer in pickle format, for imitation learning unittest


    buf = VectorReplayBuffer(args.buffer_size, buffer_num=len(test_envs))
    policy.set_eps(0.2)
    collector = Collector(policy, test_envs, buf, exploration_noise=True)
    result = collector.collect(n_step=args.buffer_size)
    pickle.dump(buf, open(args.save_buffer_name, "wb"))
    print(result["rews"].mean())


def test_pdqn(args=get_args()):
    args.prioritized_replay = True
    args.gamma = .95
    args.seed = 1
    test_dqn(args)


if __name__ == '__main__':
    test_dqn(get_args())
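To try the prioritized-replay variant, you can reuse test_pdqn or flip the flag yourself; a minimal sketch, assuming the script above is saved as test_dqn.py (the file name is illustrative):

from test_dqn import get_args, test_dqn

args = get_args()
args.prioritized_replay = True  # switch to PrioritizedVectorReplayBuffer
args.gamma = 0.95
test_dqn(args)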

Testing the trained model

import os
import gym
import torch
import pickle
import pprint
import argparse
import numpy as np
from torch.utils.tensorboard import SummaryWriter

from tianshou.policy import DQNPolicy
from tianshou.utils import BasicLogger
from tianshou.env import DummyVectorEnv
from tianshou.utils.net.common import Net
from tianshou.trainer import offpolicy_trainer
from tianshou.data import Collector, VectorReplayBuffer, PrioritizedVectorReplayBuffer
import tianshou as ts

env = gym.make('MsPacman-ram-v0')
state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.shape or env.action_space.n
print("args.state_shape {} ,args.action_shape {} ".format(args.state_shape,args.action_shape))
np.random.seed(0519)
torch.manual_seed(0628)


net = Net(state_shape, action_shape,
              hidden_sizes=[512, 256, 128, 128], 
              device='cuda' if torch.cuda.is_available() else 'cpu',
              # dueling=(Q_param, V_param),
              ).to('cuda' if torch.cuda.is_available() else 'cpu')
optim = torch.optim.Adam(net.parameters(), lr=1e-3)

policy = DQNPolicy(net, optim, discount_factor=0.9, estimation_step=3, target_update_freq=320)

policy.load_state_dict(torch.load('./log/MsPacman-ram-v0/dqn/policy.pth'))

for _ in range(100):
    policy.eval()
    policy.set_eps(0.05)
    collector = ts.data.Collector(policy, env, exploration_noise=True)
    collector.collect(n_episode=1, render=1 / 30)