Training DQN on the Atari Game Pong

This post describes training a deep Q-network (DQN) on the Atari game Pong; the model starts to converge after about 1000 episodes. The author provides an implementation based on PyTorch 1.8.0 and CUDA 10.2, and the reward curve shows the training progress. See the referenced GitHub repository for more details.

Training results for DQN on the Atari game Pong: the agent starts to converge after about 1000 episodes.
The trained model is included with the code.
torch = 1.8.0 + CUDA 10.2
Python = 3.8
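A quick sanity check that the local environment roughly matches these versions, using standard PyTorch introspection (nothing here is specific to this post):

```
import sys
import torch

# Confirm the interpreter and PyTorch build match the versions listed above
print(sys.version.split()[0])       # expected: 3.8.x
print(torch.__version__)            # expected: 1.8.0 (CUDA 10.2 build)
print(torch.version.cuda)           # expected: '10.2'
print(torch.cuda.is_available())    # True if a GPU and the CUDA runtime are usable
```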
Reward curve:
[Figure: reward curve — DQN training results]

Reference code: https://github.com/jmichaux/dqn-pytorch
Code for this post: https://pan.baidu.com/s/1hvjfO3C5XNO0XjZga6vceQ
Extraction code: mhkz
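Since the download ships trained weights alongside the code, restoring them for greedy play might look like the following minimal sketch. It assumes the DQN class and preprocess helper from the listing further below have been placed in an importable module (here called dqn_pong, a placeholder) and that the checkpoint is named dqn_pong.pth (also a placeholder, not necessarily the file name in the archive):

```
import gym
import torch

from dqn_pong import DQN, preprocess  # hypothetical module holding the listing below

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env = gym.make('Pong-v0')

# The network expects (C, H, W) input; Pong frames come as (H, W, C)
h, w, c = env.observation_space.shape
model = DQN((c, h, w), env.action_space.n).to(device)

# 'dqn_pong.pth' is a placeholder; substitute the checkpoint file from the download
model.load_state_dict(torch.load('dqn_pong.pth', map_location=device))
model.eval()

# Play one episode greedily with the restored weights (classic gym reset/step API)
state = preprocess(env.reset())
done, total_reward = False, 0.0
while not done:
    with torch.no_grad():
        q_values = model(torch.from_numpy(state).unsqueeze(0).to(device))
    next_state, reward, done, _ = env.step(q_values.argmax(dim=1).item())
    state = preprocess(next_state)
    total_reward += reward
print('Episode reward:', total_reward)
```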

Below is a complete PyTorch implementation of the DQN algorithm for Pong:

```
import gym
import random
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque


def preprocess(obs):
    # Atari frames come as (H, W, C) uint8; convert to (C, H, W) float in [0, 1]
    return np.transpose(obs, (2, 0, 1)).astype(np.float32) / 255.0


# Q-network: three conv layers followed by two fully connected layers
class DQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(DQN, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions)
        )

    def _get_conv_out(self, shape):
        # Dummy forward pass to find the flattened size of the conv output
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)


# DQN agent with epsilon-greedy exploration, experience replay and a target network
class DQNAgent:
    def __init__(self, env):
        self.env = env
        self.replay_buffer = deque(maxlen=10000)
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.batch_size = 32
        # Observations are (H, W, C); the network expects (C, H, W)
        h, w, c = env.observation_space.shape
        input_shape = (c, h, w)
        self.model = DQN(input_shape, env.action_space.n).to(device)
        self.target_model = DQN(input_shape, env.action_space.n).to(device)
        self.target_model.load_state_dict(self.model.state_dict())
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.00025)

    def act(self, state):
        # Epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return self.env.action_space.sample()
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        q_values = self.model(state)
        return q_values.max(1)[1].item()

    def replay(self):
        if len(self.replay_buffer) < self.batch_size:
            return
        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        states = torch.FloatTensor(np.array(states)).to(device)
        actions = torch.LongTensor(actions).to(device)
        rewards = torch.FloatTensor(rewards).to(device)
        next_states = torch.FloatTensor(np.array(next_states)).to(device)
        dones = torch.FloatTensor(dones).to(device)
        # Q(s, a) for the actions actually taken
        q_values = self.model(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)
        # Bootstrapped target from the (periodically synced) target network
        next_q_values = self.target_model(next_states).max(1)[0]
        expected_q_values = rewards + self.gamma * next_q_values * (1 - dones)
        loss = F.mse_loss(q_values, expected_q_values.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def train(self, num_episodes):
        rewards = []
        for i in range(num_episodes):
            # Classic gym API (pre-0.26): reset() returns obs, step() returns 4 values
            state = preprocess(self.env.reset())
            done = False
            episode_reward = 0
            while not done:
                action = self.act(state)
                next_state, reward, done, _ = self.env.step(action)
                next_state = preprocess(next_state)
                self.replay_buffer.append((state, action, reward, next_state, done))
                state = next_state
                episode_reward += reward
                self.replay()
            self.update_target_model()
            rewards.append(episode_reward)
            self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            print("Episode {}: reward = {}, epsilon = {}".format(i, episode_reward, self.epsilon))
        return rewards

    def update_target_model(self):
        self.target_model.load_state_dict(self.model.state_dict())


# Device, environment and hyperparameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env = gym.make('Pong-v0')
num_episodes = 1000

# Create the agent and train it
agent = DQNAgent(env)
rewards = agent.train(num_episodes)

# Plot the training curve
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Training Curve')
plt.show()
```

Note that PyTorch and the Gym library (with the Atari environments) must be installed before running the code. The implementation uses experience replay and a separate target network to improve performance and stability. It prints each episode's reward and exploration rate during training and plots the training curve at the end.
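The listing above trains directly on raw 210x160 RGB frames, which works but is slow and memory-hungry; standard Atari DQN pipelines (including the original DeepMind setup) instead use grayscale 84x84 frames stacked 4 deep. A minimal sketch using gym's built-in wrappers, assuming a gym release that ships AtariPreprocessing and FrameStack and still uses the classic reset/step API:

```
import gym
import numpy as np
from gym.wrappers import AtariPreprocessing, FrameStack

def make_pong_env():
    # AtariPreprocessing applies its own 4-frame skip, so it needs a NoFrameskip variant
    env = gym.make('PongNoFrameskip-v4')
    # Grayscale, resize to 84x84, scale pixels to [0, 1]
    env = AtariPreprocessing(env, screen_size=84, grayscale_obs=True, scale_obs=True)
    # Stack the last 4 frames so the network can infer the ball's velocity
    env = FrameStack(env, num_stack=4)
    return env

env = make_pong_env()
obs = env.reset()
# (4, 84, 84): already channel-first, so it can be fed to the DQN above without transposing
print(np.array(obs).shape)
```

With this environment, the agent can be built as DQN((4, 84, 84), env.action_space.n) and the preprocess step in the listing drops out, which also shrinks the replay buffer considerably.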