Reinforcement Learning, Part 3: Non-Stationarity

Series’ Links:

  1. Introduction

  2. Multi-Armed Bandits | Notebook

  3. Non-Stationary | Notebook

Welcome to the third entry in a series on Reinforcement Learning. In the previous article we explored the first of the many scenarios we’re going to tackle: the Multi-Armed Bandit. In this setting, we’re presented with an environment with a fixed number of actions, and are tasked with finding the action that yields the greatest reward. We presented some strategies and measured their performance on this simple task.

In this article, we’re going to modify the previously presented environment and make it a little more dynamic. We will see how our previous strategies deal with non-stationary environments, and how we can do better.

Stationary vs. Non-Stationary:

Last time we began our story in a casino, filled with bandits at our disposal. Using this example, we built a simplified environment and developed a strong strategy to obtain high rewards: the ɛ-greedy agent. Using this strategy, we were able to find the best action given enough time, and therefore earn tons of reward. Our agent performed well because it had a good balance between exploring the environment and exploiting its knowledge. This balance allowed the agent to learn how the environment behaves, while also receiving high rewards along the way. But there’s a small assumption our agent is making in order to behave so optimally, and that is that the environment is static and non-changing. What do we mean by this, and where is our agent making such an assumption?

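To make this balance concrete, here is a minimal sketch of ɛ-greedy action selection with sample-average value estimates, along the lines of the agent from the previous article; the class and attribute names here are illustrative, not taken from the series’ notebooks:

```python
import numpy as np

class EpsilonGreedyAgent:
    """Keeps a running (sample-average) estimate of each action's value."""

    def __init__(self, n_actions, epsilon=0.1):
        self.epsilon = epsilon
        self.q_estimates = np.zeros(n_actions)    # estimated value per action
        self.action_counts = np.zeros(n_actions)  # times each action was taken

    def select_action(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.q_estimates))
        return int(np.argmax(self.q_estimates))

    def update(self, action, reward):
        # Incremental sample-average update: Q <- Q + (1/N) * (R - Q)
        self.action_counts[action] += 1
        step = 1.0 / self.action_counts[action]
        self.q_estimates[action] += step * (reward - self.q_estimates[action])
```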

Stationary

When we mention the word “stationary”, we’re talking about the underlying behavior of our environment. If you remember from last time, the environment is defined to have a set of actions that, upon interaction, yield a random reward. Even though the rewards are random, they are generated around a central or mean value that every action has, which we called the true value. To see what I’m talking about, let’s have a look at one of the animated interactions we saw in the previous article.

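As a rough sketch of what “stationary” means here (assuming the Gaussian reward model used for the bandit in the previous article), each action keeps one fixed true value, and rewards are merely noisy samples around it:

```python
import numpy as np

class StationaryBandit:
    """k-armed bandit whose true action values never change."""

    def __init__(self, k=10):
        # One fixed "true value" per action, drawn once at construction.
        self.true_values = np.random.normal(loc=0.0, scale=1.0, size=k)

    def step(self, action):
        # Reward is noisy, but always centered on the action's true value.
        return np.random.normal(loc=self.true_values[action], scale=1.0)
```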

Example of an ɛ-greedy agent interacting with a static environment. Image by Author

Observe how the true values (red dots) are static. Even though the environment generates random rewards, each action has a true expected value that never changes. This is a big assumption to make, and one that almost no valuable real-world scenario will follow. If, for example, our casino analogy were static, then casinos would quickly go out of business! So, how can we portray a more realistic scenario without making the problem that much more complex?

Non-Stationary

Making the multi-armed bandit non-stationary simply means letting each action’s true value change over time, instead of keeping it fixed for the whole interaction.
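
As a rough sketch, assuming the common random-walk formulation in which every true value takes a small Gaussian step after each interaction, a non-stationary bandit could look like this (the class name and drift parameters are illustrative):

```python
import numpy as np

class NonStationaryBandit:
    """k-armed bandit whose true action values drift over time."""

    def __init__(self, k=10, drift_std=0.01):
        self.true_values = np.random.normal(loc=0.0, scale=1.0, size=k)
        self.drift_std = drift_std

    def step(self, action):
        reward = np.random.normal(loc=self.true_values[action], scale=1.0)
        # After every interaction, all true values take a small random-walk
        # step, so the best action can change as time goes on.
        self.true_values += np.random.normal(scale=self.drift_std,
                                             size=self.true_values.shape)
        return reward
```

With this change, the action that looks best early on may stop being the best later, which is exactly what will challenge the strategies we built for the static casino.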
