Snake Played by a Deep Reinforcement Learning Agent

Ever since I watched the Netflix documentary AlphaGo, I have been fascinated by Reinforcement Learning. Reinforcement Learning is comparable with learning in real life: you see something, you do something, and your act has positive or negative consequences. You learn from the consequences and adjust your actions accordingly. Reinforcement Learning has many applications, like autonomous driving, robotics, trading and gaming. In this post, I will show how the computer can learn to play the game Snake using Deep Reinforcement Learning.


The Basics

If you are familiar with Deep Reinforcement Learning, you can skip the following two sections.


Reinforcement Learning

The concept behind Reinforcement Learning (RL) is easy to grasp. An agent learns by interacting with an environment. The agent chooses an action, and receives feedback from the environment in the form of states (or observations) and rewards. This cycle continues forever or until the agent ends in a terminal state. Then a new episode of learning starts. Schematically, it looks like this:


[Image: Reinforcement Learning: an agent interacts with the environment by choosing actions and receiving observations (or states) and rewards.]
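
In code, this cycle is just a loop. Below is a minimal sketch, assuming hypothetical env and agent objects with a gym-style interface (not the actual implementation used later in this article):

# A minimal sketch of the interaction cycle; `env` and `agent` are hypothetical
# objects with a gym-style interface, not the actual Snake implementation.
def run_episode(env, agent):
    state = env.reset()                               # start a new episode
    total_reward, done = 0, False
    while not done:                                   # until a terminal state is reached
        action = agent.choose_action(state)           # the agent picks an action
        next_state, reward, done = env.step(action)   # the environment responds
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    return total_reward                               # the sum the agent tries to maximize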

The goal of the agent is to maximize the sum of the rewards during an episode. At the beginning of the learning phase, the agent explores a lot: it tries different actions in the same state. It needs this information to find the best possible actions for each state. As learning continues, exploration decreases. Instead, the agent will exploit what it has learned: it chooses the action that maximizes the reward, based on its experience.

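A common way to implement this trade-off is an epsilon-greedy policy, which the epsilon parameters shown later also hint at: act randomly with probability epsilon, otherwise pick the action with the highest predicted value, and let epsilon decay over time. A small sketch (the Q-values would come from the network described below):

import random
import numpy as np

def choose_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore: try a random action
    return int(np.argmax(q_values))              # exploit: best action so far

# after each episode, exploration decreases:
# epsilon = max(epsilon_min, epsilon * epsilon_decay)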

Deep Reinforcement Learning

Deep Learning uses artificial neural networks to map inputs to outputs. Deep Learning is powerful because it can approximate any function with only one hidden layer¹. How does it work? The network consists of layers of nodes. The first layer is the input layer. The hidden layers then transform the data with weights and activation functions. The last layer is the output layer, where the target is predicted. By adjusting the weights, the network can learn patterns and improve its predictions.



As the name suggests, Deep Reinforcement Learning is a combination of Deep Learning and Reinforcement Learning. By using the states as the input, values for actions as the output and the rewards for adjusting the weights in the right direction, the agent learns to predict the best action for a given state.

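In practice, the reward enters the weight update through the training target for the chosen action. Here is a sketch of the standard Q-learning target used in DQN-style agents (the model here is a hypothetical Keras network that maps a state to one value per action; the actual implementation may differ in details):

import numpy as np

def q_target(model, state, action, reward, next_state, done, gamma=0.95):
    """Build the training target for one transition using the Q-learning rule."""
    target = model.predict(state[np.newaxis], verbose=0)[0]
    if done:
        target[action] = reward                                  # terminal: only the reward counts
    else:
        next_q = model.predict(next_state[np.newaxis], verbose=0)[0]
        target[action] = reward + gamma * np.max(next_q)         # reward + discounted future value
    return target  # fitting the network on (state, target) nudges the weights in the right direction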

Deep Reinforcement Learning in Action

Let’s apply these techniques to the famous game Snake. I bet you know the game: the goal is to grab as many apples as possible without running into a wall or the snake’s body. I built the game in Python with the turtle library.


[Image: Me playing Snake.]

Defining Actions, Rewards and States

To prepare the game for an RL agent, let’s formalize the problem. Defining the actions is easy: the agent can choose to go up, right, down or left. The rewards and the state space are a bit harder. There are multiple solutions, and some will work better than others. For now, let’s try the following. If the snake grabs an apple, give a reward of 10. If the snake dies, the reward is -100. To help the agent, give a reward of 1 if the snake comes closer to the apple, and a reward of -1 if the snake moves away from the apple.

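A sketch of how this reward scheme could be computed each step (the arguments are illustrative flags and distances, not the exact code of the project):

def compute_reward(ate_apple, died, old_distance, new_distance):
    """Illustrative reward scheme: +10 for an apple, -100 for dying, +/-1 for distance."""
    if died:
        return -100
    if ate_apple:
        return 10
    return 1 if new_distance < old_distance else -1   # closer to the apple or not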

There are a lot of options for the state: you can choose to give scaled coordinates of the snake and the apple, or to give the directions to the apple’s location. It is important to add the locations of obstacles (the wall and the body) so the agent learns to avoid dying. Below is a summary of the actions, state and rewards. Later in the article you can see how adjustments to the state affect performance.


[Image: Actions, rewards and state]
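
As an illustration, such a state vector could be assembled like this (the helper booleans are made-up names; the real get_state(self) on GitHub may be organized differently):

import numpy as np

def get_state_sketch(apple_up, apple_right, apple_down, apple_left,
                     obstacle_up, obstacle_right, obstacle_down, obstacle_left,
                     direction_one_hot):
    """Illustrative state: apple direction, nearby obstacles, current snake direction."""
    return np.array([apple_up, apple_right, apple_down, apple_left,
                     obstacle_up, obstacle_right, obstacle_down, obstacle_left,
                     *direction_one_hot], dtype=np.float32)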

Creating the Environment and the Agent

By adding some methods to the Snake program, it’s possible to create a Reinforcement Learning environment. The added methods are reset(self), step(self, action) and get_state(self). Besides this, it’s necessary to calculate the reward every time the agent takes a step (check out run_game(self)).

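For orientation, the shape of such an environment wrapper looks roughly like this (a sketch; the method bodies are placeholders, see the GitHub repository for the real code):

class SnakeEnvSketch:
    """Sketch of the RL interface added to the Snake game; bodies are placeholders."""

    def reset(self):
        """Restart the game and return the initial state."""
        raise NotImplementedError

    def step(self, action):
        """Apply the action, update the game and return (next_state, reward, done)."""
        raise NotImplementedError

    def get_state(self):
        """Build the state vector: apple direction, nearby obstacles, snake direction."""
        raise NotImplementedError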

The agent uses a Deep Q Network to find the best actions. The parameters are:


# epsilon sets the level of exploration and decreases over time
param['epsilon'] = 1
param['epsilon_min'] = .01
param['epsilon_decay'] = .995

# gamma: value immediate (gamma=0) or future (gamma=1) rewards
param['gamma'] = .95

# the batch size is needed for replaying previous experiences
param['batch_size'] = 500

# neural network parameters
param['learning_rate'] = 0.00025
param['layer_sizes'] = [128, 128, 128]
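
With these parameters, the Q-network itself could be built roughly like this (a sketch with Keras; only layer_sizes and learning_rate come from the parameters above, while the activations, input size and mse loss are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(param, state_size, n_actions=4):
    """Sketch: dense Q-network mapping a state to one Q-value per action."""
    model = Sequential()
    model.add(Dense(param['layer_sizes'][0], input_shape=(state_size,), activation='relu'))
    for size in param['layer_sizes'][1:]:
        model.add(Dense(size, activation='relu'))
    model.add(Dense(n_actions, activation='linear'))   # one output per action: up, right, down, left
    model.compile(loss='mse', optimizer=Adam(learning_rate=param['learning_rate']))
    return model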

If you are interested in the code, you can find it on my GitHub.


Snake Played by the Agent

Now it is time for the key question! Does the agent learn to play the game? Let’s find out by observing how the agent interacts with the environment.


In the first games, the agent has no clue:


[Image: The first games.]

The first apple! It still seems like the agent doesn’t know what it is doing…


[Image: Finds the first apple… and hits the wall.]

End of game 13 and beginning of game 14:


[Image: Improving!]

The agent learns: it doesn’t take the shortest path, but it finds its way to the apples.


Game 30:


[Image: Good job! New high score!]

Wow, the agent avoids the body of the snake and finds a fast way to the apples, after playing only 30 games!


Playing with the State Space

The agent learns to play Snake (with experience replay), but maybe it’s possible to change the state space and achieve similar or better performance. Let’s try the following four state spaces (a sketch of how their state vectors could be composed follows the list):


  1. State space ‘no direction’: don’t give the agent the direction the snake is going.
  2. State space ‘coordinates’: replace the location of the apple (up, right, down and/or left) with the coordinates of the apple (x, y) and the snake (x, y). The coordinates are scaled between 0 and 1.
  3. State space ‘direction 0 or 1’: the original state space.
  4. State space ‘only walls’: don’t tell the agent when the body is up, right, down or left, only tell it if there’s a wall.
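
Here is a sketch of how the four variants could differ in the state vector they produce (the component names are made up for illustration):

import numpy as np

def build_state(variant, apple_dirs, obstacles, wall_flags, scaled_coords, snake_dir):
    """Illustrative composition of the four state-space variants (names are made up)."""
    if variant == 'no direction':
        return np.array([*apple_dirs, *obstacles], dtype=np.float32)
    if variant == 'coordinates':
        return np.array([*scaled_coords, *obstacles, *snake_dir], dtype=np.float32)
    if variant == 'direction 0 or 1':                  # the original state space
        return np.array([*apple_dirs, *obstacles, *snake_dir], dtype=np.float32)
    if variant == 'only walls':
        return np.array([*apple_dirs, *wall_flags, *snake_dir], dtype=np.float32)
    raise ValueError(f'unknown variant: {variant}')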

Can you make a guess and rank them from the best state space to the worst after playing 50 games?


[Image: An agent playing Snake prevents seeing the answer :)]

Made your guess?


Here is a graph with the performance using the different state spaces:


[Image: Defining the right state accelerates learning! This graph shows the mean return of the last twenty games for the different state spaces.]

It is clear that the agent using the state space with the directions (the original state space) learns fast and achieves the highest return. But the state space using the coordinates is still improving, and maybe it can reach the same performance if it trains longer. A reason for the slow learning might be the number of possible states: 20⁴*2⁴*4 = 10,240,000 different states are possible (the snake canvas is 20*20 steps, there are 2⁴ options for obstacles, and 4 options for the current direction). For the original state space the number of possible states is equal to: 3²*2⁴*4 = 576 (3 options each for above/below and left/right). 576 is more than 17,000 times smaller than 10,240,000. This influences the learning process.

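The counts follow directly from multiplying the number of values each state component can take:

# coordinate state space: 4 coordinates with 20 values each, 2**4 obstacle
# combinations and 4 current directions
print(20**4 * 2**4 * 4)          # 10240000

# original state space: 3 relative apple positions per axis, 2**4 obstacle
# combinations and 4 current directions
print(3**2 * 2**4 * 4)           # 576

print((20**4 * 2**4 * 4) // (3**2 * 2**4 * 4))   # 17777, i.e. more than 17,000 times smaller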

Playing with the Rewards

What about the rewards? Is there a better way to program them?


Recall that our rewards were formatted like this:


[Image: the reward table described above]

Blooper #1: Walk in Circles

What if we change the reward of -1 to 1? By doing this, the agent will receive a reward of 1 every time it survives a time step. This can slow down learning in the beginning, but in the end the agent won’t die, and that’s a pretty important part of the game!


Well, does it work? The agent quickly learns how to avoid dying:


[Image: Agent receives a reward of 1 for surviving a time step.]

-1, come back please!


Blooper #2: Hit the Wall

Next try: change the reward for coming closer to the apple to -1, and the reward for grabbing an apple to 100. What will happen? You might think: the agent receives a -1 for every time step, so it will run to the apples as fast as possible! This could be true, but there’s another thing that might happen…


[Image: The agent runs into the nearest wall to minimize the negative return.]

Experience Replay

One secret behind the fast learning of the agent (it only needs 30 games) is experience replay. With experience replay, the agent stores previous experiences and uses them to learn faster. At every normal step, a number of replay steps (the batch_size parameter) is performed. This works so well for Snake because, given the same state-action pair, there is low variance in the reward and the next state.

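A minimal sketch of experience replay: a memory of transitions plus a training step on a random minibatch (the buffer size is an arbitrary choice here, and the target construction mirrors the Q-learning sketch shown earlier; the article’s implementation may differ in details):

import random
from collections import deque
import numpy as np

memory = deque(maxlen=2500)   # stores (state, action, reward, next_state, done) tuples

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(model, batch_size, gamma=0.95):
    """Train on a random minibatch of stored experiences."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    targets = model.predict(states, verbose=0)          # current Q-value estimates
    next_q = model.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        targets[i][action] = reward if done else reward + gamma * np.max(next_q[i])
    model.fit(states, targets, epochs=1, verbose=0)      # one gradient update on the batch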

Blooper #3: No Experience Replay

Is experience replay really that important? Let’s remove it! For this experiment, a reward of 100 for eating an apple is used.


This is the agent without using experience replay after playing 2500 games:


[Image: Training without experience replay. Even though the agent played 2500 (!) games, the agent can’t play Snake. Fast playing, otherwise it would take days to reach the 10000 games.]

After 3000 games, the highest number of apples caught in one game is 2.


After 10000 games, the highest number is 3… Was that learning, or was it luck?


It seems that experience replay indeed helps a lot, at least for these parameters, rewards and this state space. How many replay steps per step are necessary? The answer might surprise you. To answer this question, we can play with the batch_size parameter (mentioned in the section Creating the Environment and the Agent). In the original experiment the value of batch_size was 500.


An overview of returns with different experience replay batch sizes:


[Image: Training 200 games with 3 different batch sizes: 1 (no experience replay), 2 and 4. Mean return of the previous 20 episodes.]

Even with a batch size of 2, the agent learns to play the game. In the graph you can see the impact of increasing the batch size: the same performance is reached more than 100 games earlier if batch size 4 is used instead of batch size 2.


Conclusions

The solution presented in this article gives results. The agent learns to play snake and achieves a high score (number of apples eaten) between 40 and 60 after playing 50 games. That is way better than a random agent!


The attentive reader would say: ‘The maximum score for this game is 399. Why doesn’t the agent achieve a score anywhere close to 399? There’s a huge difference between 60 and 399!’ That’s right! And there is a problem with the solution from this article: the agent does not learn to avoid enclosing itself. The agent learns to avoid obstacles directly surrounding the snake’s head, but it can’t see the whole game. So the agent will enclose itself and die, especially when the snake is longer.


[Image: Enclosing.]

An interesting way to solve this problem is to use pixels and Convolutional Neural Networks in the state space². Then it is possible for the agent to ‘see’ the whole game, instead of just the nearby obstacles. It can learn to recognize the places it should go to avoid enclosing itself and get the maximum score.

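A sketch of what a pixel-based Q-network in the spirit of [2] could look like (the input resolution, frame stacking and layer sizes are arbitrary choices here, not a tested architecture for this Snake game):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

pixel_model = Sequential([
    Conv2D(16, (8, 8), strides=4, activation='relu', input_shape=(84, 84, 4)),  # 4 stacked frames
    Conv2D(32, (4, 4), strides=2, activation='relu'),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(4, activation='linear'),   # one Q-value per action
])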

[1] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators (1989), Neural networks 2.5: 359–366


[2] Mnih et al, Playing Atari with Deep Reinforcement Learning (2013)


Translated from: https://towardsdatascience.com/snake-played-by-a-deep-reinforcement-learning-agent-53f2c4331d36
