Using Q-Learning to Master Atari Games

Q-Learning and other reinforcement learning algorithms have recently taken over the machine learning scene, most visibly with AlphaGo's victory over former champion Lee Se-dol. I thought I would take on another project, a simpler one: using Q-Learning to play Atari games.

What is Reinforcement Learning?

Reinforcement learning uses the concept of rewards and punishments to train a machine learning algorithm. At first, the algorithm explores the environment, that is, it performs random actions to see what happens. Afterwards, once it has observed the different possibilities, it starts to exploit the environment, that is, it takes the actions that lead to the most reward.

Photo by Aloïs Moubax from Pexels

Imagine you were teaching a dog to pick up a tennis ball:

Every time the dog picks up the tennis ball, it gets a treat. Every time the dog fails to pick up the tennis ball, it does not get a treat. This creates an understanding that since treats are good, and since picking up the ball leads to treats, picking up the ball is good.

This is exactly how agents learn via reinforcement learning.

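To make the analogy concrete, here is a minimal, purely illustrative sketch of explore-then-exploit in Python; the action names and reward values are hypothetical and not part of the project code that follows.

import random

# Hypothetical estimated rewards for two actions the dog could take
estimated_reward = {'pick_up_ball': 0.0, 'ignore_ball': 0.0}

def choose_action(exploration_rate):
    if random.uniform(0, 1) < exploration_rate:
        return random.choice(list(estimated_reward))        # explore: try a random action
    return max(estimated_reward, key=estimated_reward.get)  # exploit: best known action

# Early on, a high exploration rate means random actions; each treat nudges the
# estimate upward, and later the agent exploits the action with the highest estimate.
estimated_reward['pick_up_ball'] += 1.0      # the dog got a treat
print(choose_action(exploration_rate=0.0))   # prints 'pick_up_ball'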

Advantages of Reinforcement Learning:

  • Possible to exceed the dataset

When DeepMind started designing AlphaGo, they knew they couldn't use labelled data to train the network: if they trained the neural network to mimic the best human player, the network would, at best, only be as good as the best human player. Reinforcement learning has no defined boundaries: it therefore has the potential to exceed the dataset. This is why I believe that reinforcement learning, especially Deep Q-Networks (Q-tables linked with deep neural networks), will be at the forefront of machine learning.

What is Q-Learning?

Q-learning uses a large Q-table to store a value for every possible action, given each state of the environment.

It follows the same concepts of exploration and exploitation, which optimize the algorithm's performance.
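
As a minimal sketch of what that table looks like, here is the standard tabular Q-learning update rule written out in Python; the table sizes and the sample transition are illustrative assumptions, not values from the project below.

import numpy as np

n_states, n_actions = 4, 2   # illustrative sizes for a tiny environment
q_table = np.zeros((n_states, n_actions))

learning_rate = 0.1          # how strongly each new observation overwrites the old estimate
discount_rate = 0.99         # how heavily future rewards count toward the current Q-value

def update_q(state, action, reward, new_state):
    # Blend the old Q-value with the reward plus the best estimated future value
    best_future = np.max(q_table[new_state, :])
    q_table[state, action] = (1 - learning_rate) * q_table[state, action] + \
        learning_rate * (reward + discount_rate * best_future)

update_q(state=0, action=1, reward=1.0, new_state=2)
print(q_table[0, 1])  # 0.1: the reward has started to pull the estimate upward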

The Code:

Now that you have a good grasp of reinforcement learning, with an emphasis on Q-learning, I can get into the main project: allowing computers to master Atari games.

The game I will be using in my example is Space Invaders, but any Atari game in the OpenAI Gym library is feasible.

import numpy as np
import gym
import random
import time
from IPython.display import clear_output

env = gym.make('SpaceInvaders-v0')
action_space_size = env.action_space.n      # one Q-table column per action
state_space_size = env.observation_space.n  # one row per state (works only for discrete state spaces)
q_table = np.zeros((state_space_size, action_space_size))

This script imports the necessary libraries and creates the Space Invaders environment. Each environment has actions and states: actions are the possible moves, and states are specific circumstances of the environment. Both are needed to create the Q-table, which stores a reward estimate for each action in each state.
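
As a quick sanity check (my own sketch, not part of the original script), you can inspect the environment's spaces and take one random step before filling in the table:

import gym

env = gym.make('SpaceInvaders-v0')
print(env.action_space)       # the set of moves the agent can make
print(env.observation_space)  # the form each state takes

state = env.reset()                               # start a new episode
action = env.action_space.sample()                # pick one random action
new_state, reward, done, info = env.step(action)  # apply it to the environment
print(reward, done)                               # reward for that step, and whether the game ended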

episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
decay_rate = 0.01

These are the variables that define training. Episodes is the reinforcement-learning equivalent of epochs. max_steps_per_episode limits the number of moves the computer can make during each episode, adding a time limit. learning_rate controls how quickly the Q-values are updated, and discount_rate determines how heavily future rewards are weighted in the Q-values.

exploration_rate = 1 is the probability that the model will choose to explore rather than exploit. The maximum and minimum exploration rates create boundaries that encourage exploration while still allowing exploitation. The decay rate gradually lowers the exploration rate: once the Q-table is accurate and well filled, continuing to randomly observe the environment is inefficient.
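
As a rough sketch using the values above, printing the decayed exploration rate at a few episode counts shows how quickly the agent shifts from exploring to exploiting:

import numpy as np

max_exploration_rate = 1
min_exploration_rate = 0.01
decay_rate = 0.01

for episode in [0, 100, 500, 1000, 5000]:
    rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-decay_rate * episode)
    print(episode, round(rate, 3))
# Prints roughly 1.0, 0.374, 0.017, 0.01, 0.01: almost pure exploration at first,
# almost pure exploitation after a few hundred episodes.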

rewards = []

for episode in range(episodes):
    state = env.reset()
    done = False
    rewards_current_episode = 0
    for step in range(max_steps_per_episode):
        # Epsilon-greedy: exploit the Q-table, or explore with a random action
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()
        new_state, reward, done, info = env.step(action)
        # Q-learning update: blend the old value with the reward plus discounted future value
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        state = new_state
        rewards_current_episode += reward
        if done:
            break
    # Decay the exploration rate so the agent gradually shifts from exploring to exploiting
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-decay_rate * episode)
    rewards.append(rewards_current_episode)

This simple script is actually the backbone of the program: filling in the Q-table. Let's go through the main loop: for every step, up to the maximum number of steps, exploration_rate_threshold is a random value between 0 and 1. If this value is larger than the exploration rate, the agent exploits (it picks the best known action from the Q-table); otherwise, it explores (it picks a random action). It then takes the chosen action and updates the Q-table with the reward it receives. If the game is over, the loop breaks.

rewards_per_thousand = np.split(np.array(rewards), episodes // 1000)
count = 1000
print('Average per thousand')
for r in rewards_per_thousand:
    print(count, ": ", str(sum(r / 1000)))
    count += 1000

This script essentially shows whether the agent is learning: if the average reward over the last thousand episodes is significantly better than at the beginning, the reinforcement learning algorithm is working. Try changing the parameters (such as episodes, the exploration rate, and the maximum and minimum exploration rates) to see how the results change.
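
If you prefer a picture to printed numbers, a short sketch with matplotlib (my own addition, assuming matplotlib is installed) plots the same per-thousand averages:

import matplotlib.pyplot as plt

averages = [sum(r) / 1000 for r in rewards_per_thousand]   # average reward in each block of 1000 episodes
thousands = range(1000, episodes + 1, 1000)

plt.plot(thousands, averages)
plt.xlabel('Episode')
plt.ylabel('Average reward per thousand episodes')
plt.show()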

for episode in range(1):
    state = env.reset()
    done = False
    print(episode + 1)
    time.sleep(1)
    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        env.render()                           # draw the current frame
        time.sleep(0.3)
        action = np.argmax(q_table[state, :])  # always exploit the learned Q-table
        new_state, reward, done, info = env.step(action)
        state = new_state
        if done:
            break
env.close()

This is the fun part: the algorithm starts playing the game in front of you, using the Q-table it has learned. Sit back and reap the rewards of your work.

Thank you for reading my article!

Translated from: https://towardsdatascience.com/using-q-learning-to-master-atari-games-6626197d789b
