Reinforcement Learning - Dynamic Programming (Reinforcement Learning - Part 4)

FAU Lecture Notes on Deep Learning

These are the lecture notes for FAU's YouTube lecture "Deep Learning". This is a full transcript of the lecture video and matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!


Sonic the Hedgehog has also been looked at with respect to reinforcement learning. Image created using gifify. Source: YouTube.

Welcome back to deep learning! Today we want to discuss a couple of reinforcement learning approaches other than the policy iteration concept that you saw in the previous video. So let's have a look at what I've got for you today. We will look at other solution methods.

Image under CC BY 4.0 from the Deep Learning Lecture.

You see that the policy and value iteration that we discussed earlier require updated policies during learning to obtain better approximations of our optimal state-value function. So, these are called on-policy algorithms because you need a policy, and this policy is being updated. Additionally, we assumed that the state transitions and the rewards are known. So, the probability density functions that produce the new states and the new rewards are known. If they are not, then you can't apply the previous concept. This is very important, and of course there are methods that relax this assumption. These methods mostly differ in how they perform the policy evaluation. So, let's look at a couple of those alternatives.

Image under CC BY 4.0 from the Deep Learning Lecture.

The first one that I want to show you is based on Monte Carlo techniques. This applies only to episodic tasks. Here, the idea is off-policy: you learn the optimal state value by following an arbitrary policy. It doesn't matter what policy you're using, and it could even be multiple policies. Of course, you still have the exploration/exploitation dilemma, so you want to choose policies that really visit all of the states. You don't need information about the dynamics of the environment because you can simply run many episodes and try to reach all of the possible states. So, you generate those episodes using some policy. Then, you loop in backward direction over one episode and accumulate the expected future reward. Because you have played the game until the end, you can go backward in time over this episode and accumulate the different rewards that have been obtained. If a state was not yet visited in this episode, you append the accumulated return to a list for that state, and you use this list to compute the update for the state-value function: the new state value is simply the average over the list for that specific state. This allows you to update your state values, and this way you can iterate in order to approach the optimal state-value function.
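The backward loop over an episode can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the episode format (a list of `(state, reward)` pairs), the function name `first_visit_mc`, and the toy episodes are all made up for this example and do not come from the lecture. Note that averaging the returns in this plain form estimates the values of whatever policy produced the episodes.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate state values from complete episodes.

    Each episode is a list of (state, reward) pairs, where `reward` is the
    reward observed after leaving `state` (an assumed, illustrative format).
    """
    returns = defaultdict(list)  # state -> list of sampled returns
    values = {}
    for episode in episodes:
        states = [s for s, _ in episode]
        g = 0.0
        # Walk backward so the return can be accumulated incrementally.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            # First-visit rule: record the return only for the earliest
            # occurrence of the state within this episode.
            if state not in states[:t]:
                returns[state].append(g)
                values[state] = sum(returns[state]) / len(returns[state])
    return values


# Three hand-written toy episodes that all end in state "C" with reward 1.
episodes = [
    [("A", 0.0), ("B", 0.0), ("C", 1.0)],
    [("A", 0.0), ("C", 1.0)],
    [("B", 0.0), ("A", 0.0), ("B", 0.0), ("C", 1.0)],
]
values = first_visit_mc(episodes, gamma=0.5)
```

No model of the environment appears anywhere in the sketch; the only inputs are the sampled episodes themselves.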

Image under CC BY 4.0 from the Deep Learning Lecture.

Now, another concept is temporal difference learning. This is an on-policy method. Again, it does not need information about the dynamics of the environment. Here, the scheme is that you loop and follow a certain policy. You use an action from the policy to observe the reward and the new state. Then you update your state-value function: you take the previous value of the old state and add α, which weights the influence of the new observation, times the temporal difference, that is, the new reward plus the discounted value of the new state under the old state-value function, minus the value of the old state. This way, you can generate updates, and this actually converges to the optimal solution. A variant of this estimates the action-value function instead and is known as SARSA.
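The update described here can be written as V(s) ← V(s) + α [r + γ V(s') − V(s)]. The following Python sketch applies it to a hypothetical two-state chain (A → B → terminal). The function name `td0_update`, the dictionary-based value table, and the toy transitions are assumptions made for illustration only.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.9, terminal=False):
    """Apply one temporal difference update to `values` and return the TD error."""
    # The value of a terminal state is taken to be zero.
    bootstrap = 0.0 if terminal else gamma * values.get(next_state, 0.0)
    td_error = reward + bootstrap - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


# Toy chain: A -> B with reward 0, then B -> terminal with reward 1.
values = {"A": 0.0, "B": 0.0}
for _ in range(200):
    td0_update(values, "A", 0.0, "B", alpha=0.5, gamma=0.9)
    td0_update(values, "B", 1.0, None, alpha=0.5, gamma=0.9, terminal=True)
```

In contrast to the Monte Carlo approach, each update uses only a single observed transition, so there is no need to wait for the end of an episode.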

Image under CC BY 4.0 from the Deep Learning Lecture.

Q learning is an off-policy method. It's a temporal difference type of method, and it also does not require information about the dynamics of the environment. Here, the idea is that you loop and follow a policy derived from your action-value function. For example, you could use an ε-greedy type of approach. Then, you use the action from the policy to observe your reward and your new state. Next, you update your action-value function: you take the previous action value and add some weighting factor times the observed reward plus the discounted maximum action value, over all actions, in the newly generated state according to what you already know, minus the action value of the previous state and action. So it's again a kind of temporal difference that you are using here in order to update your action-value function.
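To make the loop concrete, here is a small tabular sketch in Python with an ε-greedy behavior policy. The corridor environment (`step`), the constants, and all hyperparameter values are hypothetical choices for this illustration and are not taken from the lecture.

```python
import random

N_STATES, GOAL = 4, 3  # corridor 0-1-2-3, state 3 is terminal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Hypothetical toy environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice([LEFT, RIGHT])
    best = max(q_row)
    # Break ties randomly so that the untrained agent does not get stuck.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # Off-policy target: best known action value in the new state.
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (
                reward + gamma * best_next - q[state][action]
            )
            state = next_state
    return q


q = q_learning()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(GOAL)]
```

The behavior policy explores via ε-greedy, while the update target always uses the maximum over the next state's action values. This mismatch between the policy followed and the policy evaluated is what makes the method off-policy.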

Image under CC BY 4.0 from the Deep Learning Lecture.

Well, if you have universal function approximators, what about just parameterizing your policy with weights w and some loss function? This is known as the policy gradient, and this particular instance is called REINFORCE. So, you generate an episode using your policy and your weights. Then, you go forward in your episode from time 0 to time T − 1. At each step, you can compute the gradient with respect to the weights, and you use this gradient to update your weights. This is very similar to what we have previously seen in our learning approaches. You can see that using the gradient over the policy gives you an idea of how to update the weights, again with a learning rate. We are really close to our machine learning ideas from earlier now.
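A minimal sketch of such an update, assuming a tabular softmax policy whose weights are one preference value per state and action, could look as follows in Python. The function names, the episode format of `(state, action, reward)` triples, and the one-step toy task are assumptions made for this illustration. A neural network policy would replace the hand-written gradient with automatic differentiation.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """Apply one REINFORCE pass over a finished episode.

    theta[s][a] are the policy weights (action preferences) and `episode`
    is a list of (state, action, reward) triples.
    """
    # Backward pass: compute the return G_t for every time step.
    returns, g = [0.0] * len(episode), 0.0
    for t in range(len(episode) - 1, -1, -1):
        g = episode[t][2] + gamma * g
        returns[t] = g
    # Forward pass from t = 0 to T - 1: gradient ascent on ln pi(a_t | s_t).
    for t, (state, action, _) in enumerate(episode):
        probs = softmax(theta[state])
        for a in range(len(probs)):
            # Gradient of the log-softmax with respect to preference a.
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            theta[state][a] += alpha * (gamma ** t) * returns[t] * grad_log_pi
    return theta


# Hypothetical one-step task: action 1 pays reward 1, action 0 pays nothing.
rng = random.Random(0)
theta = [[0.0, 0.0]]
for _ in range(500):
    action = rng.choices([0, 1], weights=softmax(theta[0]))[0]
    reward = 1.0 if action == 1 else 0.0
    reinforce_update(theta, [(0, action, reward)])
p_rewarded = softmax(theta[0])[1]
```

Actions that were followed by a high return become more likely, and the learning rate `alpha` plays the same role as in ordinary gradient-based training.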

Image under CC BY 4.0 from the Deep Learning Lecture.

This is why we talk in the next video about deep Q learning, which is a kind of deep learning version of reinforcement learning. So, I hope you liked this video. You've now seen other options for determining the optimal state-value and action-value functions. We have seen that there are many different ideas that no longer require exact knowledge of how future states and future rewards are generated. With these ideas, you can also do reinforcement learning, in particular with the idea of the policy gradient. We've seen that this is very much compatible with what we've seen earlier in this class regarding our machine learning and deep learning methods. We will talk about exactly this idea in the next video. So thank you very much for listening and see you in the next video. Bye-bye!

Sonic is still a challenge for today's reinforcement learning methods. Image created using gifify. Source: YouTube.

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.

Links

Link to Sutton's Reinforcement Learning in its 2018 draft, including Deep Q learning and Alpha Go details

Translated from: https://towardsdatascience.com/reinforcement-learning-part-4-3c51edd8c4bf
