An introduction to Policy Gradients with Cartpole and Doom

by Thomas Simonini

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In the last two articles about Q-learning and Deep Q learning, we worked with value-based reinforcement learning algorithms. To choose which action to take given a state, we take the action with the highest Q-value (maximum expected future reward I will get at each state). As a consequence, in value-based learning, a policy exists only because of these action-value estimates.

Today, we’ll learn a policy-based reinforcement learning technique called Policy Gradients.

We’ll implement two agents. The first will learn to keep the pole balanced (CartPole).

The second will be an agent that learns to survive in a hostile Doom environment by collecting health.

In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we directly learn the policy function that maps states to actions (selecting actions without using a value function).

It means that we directly try to optimize our policy function π without worrying about a value function. We’ll directly parameterize π (select an action without a value function).

Sure, we can use a value function to optimize the policy parameters. But the value function will not be used to select an action.

In this article you’ll learn:

  • What Policy Gradients are, and their advantages and disadvantages
  • How to implement them in Tensorflow.

Why use Policy-Based methods?

Two types of policy

A policy can be either deterministic or stochastic.

A deterministic policy is a policy that maps states to actions. You give it a state and the function returns an action to take.

Deterministic policies are used in deterministic environments. These are environments where the actions taken determine the outcome. There is no uncertainty. For instance, when you play chess and you move your pawn from A2 to A3, you’re sure that your pawn will move to A3.

On the other hand, a stochastic policy outputs a probability distribution over actions.

It means that instead of being sure of taking a given action a (for instance, left), there is a probability we’ll take a different one (in this example, a 30% chance that we go south).

The stochastic policy is used when the environment is uncertain. We call this process a Partially Observable Markov Decision Process (POMDP).

Most of the time we’ll use this second type of policy.
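To make the distinction concrete, here is a minimal sketch in plain Python/NumPy (the four actions, the example scores, and the softmax parameterization are illustrative assumptions, not taken from the article):

```python
import numpy as np

actions = ["left", "right", "up", "down"]

def deterministic_policy(scores):
    """Maps a state (represented here by its action scores) to a single action."""
    return actions[int(np.argmax(scores))]

def stochastic_policy(scores):
    """Outputs a probability distribution over actions and samples one from it."""
    probs = np.exp(scores - np.max(scores))   # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.choice(actions, p=probs), probs

scores = np.array([1.0, 2.0, 0.5, 0.1])       # e.g. produced by a network for one state
print(deterministic_policy(scores))           # always "right"
print(stochastic_policy(scores))              # usually "right", sometimes another action
```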

Advantages

But Deep Q Learning is really great! Why use policy-based reinforcement learning methods?

There are three main advantages in using Policy Gradients.

Convergence

For one, policy-based methods have better convergence properties.

The problem with value-based methods is that they can oscillate a lot during training. This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

On the other hand, with policy gradient, we just follow the gradient to find the best parameters. We see a smooth update of our policy at each step.

Because we follow the gradient to find the best parameters, we’re guaranteed to converge on a local maximum (worst case) or global maximum (best case).

Policy gradients are more effective in high-dimensional action spaces

The second advantage is that policy gradients are more effective in high dimensional action spaces, or when using continuous actions.

The problem with Deep Q-learning is that, at each time step, it assigns a score (maximum expected future reward) to each possible action, given the current state.

But what if we have an infinite number of possible actions?

For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel by 15°, 17.2°, 19.4°, honking…). We’ll need to output a Q-value for each possible action!

On the other hand, in policy-based methods, you just adjust the parameters directly: thanks to that you’ll start to understand what the maximum will be, rather than computing (estimating) the maximum directly at every step.
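As an illustration, a stochastic policy for a single continuous action (say, a steering angle) can simply output the parameters of a distribution and sample from it; the linear parameterization and the state below are toy assumptions, not from the article:

```python
import numpy as np

state = np.array([0.2, -0.1, 0.4, 0.05])    # a hypothetical 4-dimensional state
w_mean = np.zeros(4)                        # policy parameters for the mean
log_std = 0.0                               # policy parameter for the spread

# The policy outputs a distribution over the continuous action,
# instead of one Q-value per (impossible to enumerate) action.
mean = state @ w_mean
std = np.exp(log_std)
steering_angle = np.random.normal(mean, std)   # sampled action, e.g. in degrees
```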

Policy gradients can learn stochastic policies

A third advantage is that policy gradient can learn a stochastic policy, while value functions can’t. This has two consequences.

One of these is that we don’t need to implement an exploration/exploitation trade-off ourselves. A stochastic policy allows our agent to explore the state space without always taking the same action. This is because it outputs a probability distribution over actions. As a consequence, it handles the exploration/exploitation trade-off without hard-coding it.

We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when we have two states that seem to be (or actually are) the same, but need different actions.

Let’s take an example. We have an intelligent vacuum cleaner, and its goal is to suck up the dust and avoid killing the hamsters.

Our vacuum cleaner can only perceive where the walls are.

The problem: the two red cases are aliased states, because in each of them the agent perceives an upper and a lower wall.

Under a deterministic policy, the policy will either always move right when in a red state, or always move left. Either case will cause our agent to get stuck and never suck up the dust.

Under a value-based RL algorithm, we learn a quasi-deterministic policy (“epsilon greedy strategy”). As a consequence, our agent can spend a lot of time before finding the dust.

On the other hand, an optimal stochastic policy will randomly move left or right in grey states. As a consequence it will not be stuck and will reach the goal state with high probability.

Disadvantages

Naturally, Policy gradients have one big disadvantage. A lot of the time, they converge on a local maximum rather than on the global optimum.

Unlike Deep Q-Learning, which always tries to reach the maximum, policy gradients converge more slowly, step by step. They can take longer to train.

However, we’ll see there are solutions to this problem.

We have our policy π, which has a parameter θ. This π outputs a probability distribution over actions.

Awesome! But how do we know if our policy is good?

Remember that learning the policy can be seen as an optimization problem. We must find the best parameters (θ) that maximize a score function, J(θ).

There are two steps:

  • Measure the quality of a π (policy) with a policy score function J(θ)
  • Use policy gradient ascent to find the best parameter θ that improves our π.

The main idea here is that J(θ) will tell us how good our π is. Policy gradient ascent will help us find the best policy parameters, so as to increase the probability of taking good actions.

First Step: the Policy Score function J(θ)

To measure how good our policy is, we use a function called the objective function (or Policy Score Function) that calculates the expected reward of the policy.

Three methods work equally well for optimizing policies. The choice depends only on the environment and the objectives you have.

First, in an episodic environment, we can use the start value. Calculate the mean of the return from the first time step (G1). This is the cumulative discounted reward for the entire episode.

The idea is simple. If I always start in some state s1, what’s the total reward I’ll get from that start state until the end?

We want to find the policy that maximizes G1, because it will be the optimal policy. This is due to the reward hypothesis explained in the first article.

For instance, in Breakout, I play a new game, but I lose the ball after destroying 20 bricks (end of the game). New episodes always begin at the same state.

I calculate the score using J1(θ). Hitting 20 bricks is good, but I want to improve the score. To do that, I’ll need to improve the probability distributions of my actions by tuning the parameters. This happens in step 2.

In a continuous environment, we can use the average value, because we can’t rely on a specific start state.

Each state value is now weighted by the probability of occurrence of the respective state (because some states occur more often than others).

Third, we can use the average reward per time step. The idea here is that we want to get the most reward per time step.
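In the usual notation (with γ the discount factor, d(s) the distribution of states visited under π_θ, and V(s) the state value), these three objectives can be written as:

\[
J_1(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ G_1 \right] = \mathbb{E}_{\pi_\theta}\!\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \dots \right]
\]
\[
J_{avgV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)
\qquad
J_{avgR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, R(s, a)
\]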

Second step: Policy gradient ascent

We have a Policy score function that tells us how good our policy is. Now, we want to find a parameter θ that maximizes this score function. Maximizing the score function means finding the optimal policy.

To maximize the score function J(θ), we need to do gradient ascent on policy parameters.

Gradient ascent is the inverse of gradient descent. Remember that the gradient always points in the direction of steepest change.

In gradient descent, we take the direction of the steepest decrease in the function. In gradient ascent we take the direction of the steepest increase of the function.

Why gradient ascent and not gradient descent? Because we use gradient descent when we have an error function that we want to minimize.

But, the score function is not an error function! It’s a score function, and because we want to maximize the score, we need gradient ascent.

The idea is to find the gradient to the current policy π that updates the parameters in the direction of the greatest increase, and iterate.
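Concretely, each iteration applies the usual gradient ascent update, where α is the learning rate (a symbol introduced here for the step size; the article itself does not name it):

\[
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
\]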

Okay, now let’s implement that mathematically. This part is a bit hard, but it’s fundamental to understand how we arrive at our gradient formula.

We want to find the best parameters θ*, that maximize the score:
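In symbols, using the standard argmax notation:

\[
\theta^* = \arg\max_\theta \, J(\theta)
\]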

Our score function can be defined as:
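In the usual notation, with r_t the reward collected at time step t while following π_θ, this is:

\[
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} r_t \right]
\]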

This is the total sum of expected rewards, given the policy.

Now, because we want to do gradient ascent, we need to differentiate our score function J(θ).

Our score function J(θ) can be also defined as:
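Written explicitly as a sum over trajectories τ, where π(τ; θ) is the probability of trajectory τ under parameters θ and R(τ) is its total reward:

\[
J(\theta) = \sum_{\tau} \pi(\tau; \theta)\, R(\tau)
\]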

We wrote the function in this way to show the problem we face here.

We know that policy parameters change how actions are chosen, and as a consequence, what rewards we get and which states we will see and how often.

So, it can be challenging to change the policy in a way that ensures improvement. This is because performance depends both on the action selections and on the distribution of states in which those selections are made.

Both of these are affected by policy parameters. The effect of policy parameters on the actions is simple to find, but how do we find the effect of policy on the state distribution? The function of the environment is unknown.

As a consequence, we face a problem: how do we estimate the ∇ (gradient) with respect to policy θ, when the gradient depends on the unknown effect of policy changes on the state distribution?

The solution will be to use the Policy Gradient Theorem. This provides an analytic expression for the gradient ∇ of J(θ) (performance) with respect to policy θ that does not involve the differentiation of the state distribution.

So let’s calculate:
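Starting from the trajectory form of J(θ) above and moving the gradient inside the sum:

\[
\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} \pi(\tau;\theta)\, R(\tau) = \sum_{\tau} \nabla_\theta \pi(\tau;\theta)\, R(\tau)
\]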

Remember, we’re in a situation of stochastic policy. This means that our policy outputs a probability distribution π(τ ; θ). It outputs the probability of taking these series of steps (s0, a0, r0…), given our current parameters θ.
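For a trajectory τ = (s0, a0, r0, s1, a1, r1, …), this probability factors into the policy terms and the (unknown) environment dynamics:

\[
\pi(\tau;\theta) = P(s_0)\, \prod_{t} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)
\]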

But, differentiating a probability function is hard, unless we can transform it into a logarithm. This makes it much simpler to differentiate.

Here we’ll use the likelihood ratio trick that replaces the resulting fraction into log probability.
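The trick is the identity ∇_θ π(τ;θ) = π(τ;θ) ∇_θ log π(τ;θ), which turns the awkward ratio ∇_θ π(τ;θ) / π(τ;θ) into the gradient of a log probability:

\[
\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta \pi(\tau;\theta)\, R(\tau)
= \sum_{\tau} \pi(\tau;\theta)\, \nabla_\theta \log \pi(\tau;\theta)\, R(\tau)
\]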

Now let’s convert the summation back to an expectation:
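Since each trajectory is weighted by its probability π(τ;θ), the sum is exactly an expectation over trajectories sampled from the current policy:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi(\tau;\theta)\, R(\tau) \right]
\]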

As you can see, we only need to compute the derivative of the log policy function.

Now that we’ve done that, and it was a lot, we can conclude about policy gradients:
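Because the dynamics terms P(s_{t+1} | s_t, a_t) inside log π(τ;θ) do not depend on θ, they vanish under the gradient, leaving only the policy terms; this is the quantity we will estimate from sampled episodes:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; R(\tau) \right]
\]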

This policy gradient tells us how we should shift the policy distribution, by changing the parameters θ, if we want to achieve a higher score.

R(tau) is like a scalar value score:

  • If R(tau) is high, it means that on average we took actions that lead to high rewards. We want to push up the probabilities of the actions seen (increase the probability of taking these actions).
  • On the other hand, if R(tau) is low, we want to push down the probabilities of the actions seen.

This policy gradient causes the parameters to move most in the direction that favors the actions with the highest return.

Monte Carlo Policy Gradients

In our notebook, we’ll use this approach to design the policy gradient algorithm. We use Monte Carlo because our tasks can be divided into episodes.

Initialize θ
for each episode τ = S0, A0, R1, S1, …, ST:
    for t <-- 1 to T-1:
        Δθ = α ∇θ(log π(St, At, θ)) Gt
        θ = θ + Δθ

In other words, for each episode:
    At each time step within that episode:
        Compute the log probabilities produced by our policy function.
        Multiply them by the score function.
        Update the weights.
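As a minimal, self-contained sketch of this update (plain NumPy, a linear softmax policy, and a made-up 3-step episode; the environment interaction, network architecture, and hyperparameters of the actual notebook are not reproduced here):

```python
import numpy as np

n_state_features, n_actions = 4, 2            # CartPole-like sizes (an assumption)
theta = np.zeros((n_state_features, n_actions))
alpha, gamma = 0.01, 0.95                     # learning rate and discount factor

def policy(state, theta):
    """Softmax policy: returns a probability distribution over actions."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def discounted_returns(rewards, gamma):
    """Compute Gt for every time step of the episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

def reinforce_update(states, actions, rewards, theta):
    """One Monte Carlo policy gradient update: θ <- θ + α ∇θ log π(At|St, θ) Gt."""
    for s, a, G in zip(states, actions, discounted_returns(rewards, gamma)):
        probs = policy(s, theta)
        grad_log_pi = np.outer(s, -probs)     # ∇θ log π for a linear softmax policy:
        grad_log_pi[:, a] += s                # equals outer(s, one_hot(a) - probs)
        theta = theta + alpha * grad_log_pi * G
    return theta

# A fake recorded episode, just to show the call:
states  = [np.random.randn(n_state_features) for _ in range(3)]
actions = [0, 1, 0]
rewards = [1.0, 1.0, 1.0]
theta = reinforce_update(states, actions, rewards, theta)
```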

But we face a problem with this algorithm. Because we only calculate R at the end of the episode, we average over all the actions. Even if some of the actions taken were very bad, if our score is quite high, we will rate all the actions as good on average.

So to have a correct policy, we need a lot of samples… which results in slow learning.

How to improve our Model?

We’ll see in the next articles some improvements:

  • Actor Critic: a hybrid between value-based algorithms and policy-based algorithms.
  • Proximal Policy Gradients: ensures that the deviation from the previous policy stays relatively small.

Let’s implement it with Cartpole and Doom

We made a video where we implement a Policy Gradient agent with Tensorflow that learns to play Doom in a Deathmatch environment.

You can directly access the notebooks in the Deep Reinforcement Learning Course repo.

Cartpole:

Doom:

That’s all! You’ve just created an agent that learns to survive in a Doom environment. Awesome!

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, change the learning rate, use a harder environment …and so on. Have fun!

In the next article, I will discuss the last improvements in Deep Q-learning:

  • Fixed Q-values
  • Prioritized Experience Replay
  • Double DQN
  • Dueling Networks

If you liked my article, please click the clap button below as many times as you liked the article, so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep Learning, Stay awesome!

Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Translated from: https://www.freecodecamp.org/news/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f/
