AI Plays Games (2): Pendulum

This post shows how to use Q-learning to solve the swing-up task in the Pendulum-v0 environment, covering the discretization of the observations, the training- and test-mode code, and how the goal was finally reached after 5400-odd episodes.


A Few Words

Pendulum is a single rod attached to a pivot; by applying torque at the pivot, the goal is to swing the rod up and keep it upright (angle = 0, angular velocity = 0). My original target was Acrobot, two rods connected by a joint, but that turned out to be too hard, so I am starting with Pendulum-v0.
For Pendulum I had planned to use DDPG, but my machine only has a CPU and training was painfully slow, and I did not feel like reinstalling the system to set up CUDA, so it is Q-learning once again. If you have questions about the algorithm, see the previous post in this series.
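For intuition about the scores that appear later: as far as I recall from the gym source, Pendulum-v0 penalizes every step by the squared angle, angular velocity and torque, so rewards are always non-positive and "higher" simply means "closer to zero". The tiny sketch below is illustration only (approx_reward is a made-up helper mirroring that cost, not anything from gym) and just shows why the target used later is a negative number like -170.

import math

def approx_reward(theta, theta_dot, torque):
    # roughly Pendulum-v0's per-step reward: -(theta^2 + 0.1*theta_dot^2 + 0.001*torque^2),
    # with theta normalized to [-pi, pi]
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

print(approx_reward(0.0, 0.0, 0.0))      # upright and at rest: 0.0, the best possible step
print(approx_reward(math.pi, 0.0, 0.0))  # hanging straight down: about -9.87 per step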

Environment

Windows 10, Python 3.7, numpy, gym

Sources

深度强化学习(六):连续动作空间的问题 solves this task with DDPG and is fairly well written. I also recommend another author's post, Deep Reinforcement Learning - 1. DDPG原理和算法, which explains the algorithm quite clearly.

Approach

  1. Discretize the observations so they can index a Q-table.
  2. Reduce the (cos theta, sin theta) pair in the observation to a single angle theta (a small sketch of this follows the list).
  3. Split the code into a training mode and a test mode; training automatically saves the learned Q-table, which can then be replayed in test mode.
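To make ideas 1 and 2 concrete, here is a minimal standalone sketch of how an observation (cos theta, sin theta, theta_dot) can be collapsed to theta and then binned into a single table index. It is illustration only, not part of the script below: to_index and its example inputs are made up, it uses math.atan2 instead of the acos-plus-sign trick used later (both recover the same angle), and the bin counts (20, 30) simply match the script.

import math
import numpy as np

def to_index(cos_t, sin_t, theta_dot, n_theta=20, n_dot=30):
    theta = math.atan2(sin_t, cos_t)                      # angle in (-pi, pi]
    theta_edges = np.linspace(-math.pi, math.pi, n_theta + 1)[1:-1]
    dot_edges = np.linspace(-8.0, 8.0, n_dot + 1)[1:-1]
    i = np.digitize(theta, theta_edges)                   # bin index 0 .. n_theta-1
    j = np.digitize(theta_dot, dot_edges)                 # bin index 0 .. n_dot-1
    return i + n_theta * j                                # single index in 0 .. n_theta*n_dot-1

print(to_index(1.0, 0.0, 0.0))    # upright and at rest
print(to_index(-1.0, 0.0, 4.0))   # hanging down and spinning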

代码

The code template is the same as last time; it comes from the same author's post.

import gym
import numpy as np
import math

'''
Q-learning with a discretized state space; results are mediocre, consider DDPG instead.
'''
env = gym.make('Pendulum-v0')
mode = 'train'  # 'train' or 'test'
max_number_of_steps = 200  # number of time steps per episode
# ------ success condition: mean score of the last 100 episodes is above -170 ------
goal_average_steps = -170  # adjustable target; -400 is usually reached within ~1000 episodes, -170 takes ~6000
num_consecutive_iterations = 100
# -----------------------------------------------------------------------------------
num_episodes = 10000  # run at most 10000 training episodes
last_time_steps = np.zeros(num_consecutive_iterations)  # scores of the last 100 episodes (a sliding window of size 100)
# theta is split into 20 bins, thetaDot into 30 bins; 5 discrete actions
q_table = np.random.uniform(low=-1, high=1, size=(30 * 20, 5))


# Binning helper: split [clip_min, clip_max] into num equal segments and return the interior
# boundaries; np.digitize then maps a value x lying in segment i to the index i
def bins(clip_min, clip_max, num):
    return np.linspace(clip_min, clip_max, num + 1)[1:-1]


# Discretize a continuous observation into a single Q-table row index
def digitize_state(observation):
    cosTheta, sinTheta, thetaDot = observation
    theta = math.acos(cosTheta)  # acos gives [0, pi]; flip the sign when sin < 0 to recover (-pi, pi]
    if sinTheta < 0:
        theta *= -1
    # discretize each continuous feature separately (binning)
    digitized = [np.digitize(theta, bins=bins(-math.pi, math.pi, 20)),
                 np.digitize(thetaDot, bins=bins(-8.0, 8.0, 30))]
    return digitized[0] + 20 * digitized[1]  # combine the two bin indices into one index in [0, 600)


# Given the current state/action and its feedback (next observation and reward),
# update the Q-table and return the next action and next state
def get_action(state, action, observation, reward):
    next_state = digitize_state(observation)  # discretize the next observation
    epsilon = 0.5 * (0.99 ** episode)  # exploration rate decays with the episode number
    if epsilon <= np.random.uniform(0, 1):
        next_action = np.argmax(q_table[next_state])  # exploit: best action from the table
    else:
        next_action = np.random.randint(0, 5)  # explore: random action in 0..4 (upper bound is exclusive)
    # -------------------------------------update q_table----------------------------------
    alpha = 0.2  # learning rate
    gamma = 0.9  # reward discount factor
    # note: this uses the epsilon-greedy next_action (SARSA-style) rather than the max over actions
    q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * (
            reward + gamma * q_table[next_state, next_action])
    # -------------------------------------------------------------------------------------------
    return next_action, next_state


if mode == 'train':
    # play one episode after another
    for episode in range(num_episodes):
        observation = env.reset()  # reset the environment for this episode
        state = digitize_state(observation)  # initial discretized state
        action = np.argmax(q_table[state])  # initial action from the table
        episode_reward = 0
        # each episode consists of a fixed number of time steps
        for t in range(max_number_of_steps):
            observation, reward, done, info = env.step([action - 2])  # map discrete action 0..4 to torque -2..2
            action, state = get_action(state, action, observation, reward)  # choose the next action and update the table
            episode_reward += reward
            if done:
                print('%d Episode finished after %f time steps / mean %f' % (episode, t + 1, last_time_steps.mean()))
                last_time_steps = np.hstack((last_time_steps[1:], episode_reward))  # sliding window of the last 100 scores
                break
                break
        if last_time_steps.mean() >= goal_average_steps and episode > 100:
            print('Episode %d train agent successfully!' % episode)
            np.save('q_table', q_table)
            break
    if episode == num_episodes - 1:
        print('Failed!')
else:
    # test mode: load the saved Q-table and run the greedy policy with rendering
    q_table = np.load('q_table.npy')
    while 1:
        observation = env.reset()
        state = digitize_state(observation)
        action = np.argmax(q_table[state])
        for t in range(max_number_of_steps):
            observation, reward, done, info = env.step([action - 2])
            env.render()
            state = digitize_state(observation)
            action = np.argmax(q_table[state])
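As a quantitative check, a small evaluation routine can be appended to the script: it replays the saved Q-table greedily (no exploration, no rendering) for 100 episodes and prints the mean score, which can be compared with the -170 goal. This is only a sketch under the same old gym API as above (Pendulum-v0, env.reset() returning the observation, env.step() returning four values); evaluate_greedy is a made-up helper that reuses digitize_state from the script.

def evaluate_greedy(q_table, episodes=100, steps=200):
    # replay the greedy policy and return the mean episode reward
    eval_env = gym.make('Pendulum-v0')
    scores = []
    for _ in range(episodes):
        observation = eval_env.reset()
        total = 0.0
        for _ in range(steps):
            state = digitize_state(observation)
            action = np.argmax(q_table[state])
            observation, reward, done, info = eval_env.step([action - 2])
            total += reward
            if done:
                break
        scores.append(total)
    eval_env.close()
    return np.mean(scores)

print(evaluate_greedy(np.load('q_table.npy')))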

Results

Training succeeded after 5400-odd episodes. I was too lazy to capture a GIF of the result, so no animation here.


5445 Episode finished after 200.000000 time steps / mean -170.884109
5446 Episode finished after 200.000000 time steps / mean -174.619459
5447 Episode finished after 200.000000 time steps / mean -172.011839
5448 Episode finished after 200.000000 time steps / mean -173.126531
5449 Episode finished after 200.000000 time steps / mean -173.111120
5450 Episode finished after 200.000000 time steps / mean -173.087864
5451 Episode finished after 200.000000 time steps / mean -171.905497
Episode 5451 train agent successfully!

Process finished with exit code 0

Summary

Q-learning really is a good fit once the problem is discretized: it is easy to implement and runs efficiently. I did try DDPG earlier, but it was far too slow on my machine; next time I will rent a server and give it another go.
