The K-Armed Bandit: An Introductory Lesson to Reinforcement Learning

Instructive and Evaluative Feedback

In supervised learning, your algorithm/model gets instructive feedback. This means it is told what the correct choice would have been, and it then updates itself to reduce its error and make its predictions more accurate. In reinforcement learning, you give the algorithm evaluative feedback. This tells your algorithm how good an action was, but not what the best action was. How good the action was is known as the reward. The RL algorithm runs through a simulation in which it learns how to maximize this reward.

The Best Application for RL

The best applications of RL are those where you can simulate well the environment it operates in. If we wanted to teach an RL program to drive a car, we could just let it drive a car; this would be a perfect simulation. If we wanted to teach it to call plays in an American football game, we would let it play a bunch of games of Madden; this is not a perfect simulation, since we are using a video game to represent the real world. Designing the feedback for the RL algorithm in either of these scenarios can be quite complex, so I will introduce some of the basic concepts of RL in a simple situation called the K-Armed Bandit.

The K-Armed Bandit’s Simulation

Imagine you are in a room with 1000 slot machines, each of which is free for you to play (this makes you a bandit, since you can’t lose any money). In this situation, there are 1000 arms for you to pull, so in this case you are a 1000-armed bandit. Each slot machine gives you a payout that is normally distributed, each with a different mean and variance. If you could pull the levers an infinite number of times, you could take a massive sample from each slot machine, find the one that maximizes the expected value of your payout, and then pull that lever over and over again. To make things interesting, let’s say you have a limit of 5000 pulls of the lever. How would you optimize your payout in this scenario? It’s hard to say, and it would be slow to experiment and find out, since it would take time to pull every lever and record the information. This is a great application for a reinforcement learning bot.
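Here is a minimal sketch of this simulation in Python, assuming NumPy; the specific means and variances assigned to the machines below are illustrative choices, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

K = 1000       # number of slot machines (arms)
PULLS = 5000   # total lever pulls allowed

# Each machine pays out from its own normal distribution.
# These means and standard deviations are arbitrary, illustrative values.
true_means = rng.normal(loc=20.0, scale=10.0, size=K)
true_stds = rng.uniform(1.0, 15.0, size=K)

def pull(machine: int) -> float:
    """Sample one payout from the chosen machine's distribution."""
    return float(rng.normal(true_means[machine], true_stds[machine]))
```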

Exploration v. Exploitation

The bot begins by pulling one of the levers. It turns up $500: that’s great! The bot’s current belief about the machines is that they all pay out $0, except this one, which pays out $500. The bot decides to stay on this machine for its remaining 4999 pulls of the lever. This is known as an exploitative approach, since the bot exploits its current knowledge of the payouts. If the bot decided to switch to a new machine, this would be known as explorative, since the bot would be gathering information on how to maximize reward. An exploitative strategy is also known as a greedy strategy.
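One simple way to represent the bot’s belief, continuing the sketch above, is a running average of observed payouts per machine; this is my own illustration, not code from the original article.

```python
# Continuing the sketch above: track a running-average belief per machine.
estimates = np.zeros(K)            # believed payout, starting at $0
counts = np.zeros(K, dtype=int)    # how many times each machine has been pulled

def update_estimate(machine: int, payout: float) -> None:
    """Incremental mean: new_avg = old_avg + (payout - old_avg) / n."""
    counts[machine] += 1
    estimates[machine] += (payout - estimates[machine]) / counts[machine]

def greedy_action() -> int:
    """Exploit: always pull the machine with the highest believed payout."""
    return int(np.argmax(estimates))
```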

Quantifying Greed

You could instead set up a system where the bot exploits its current knowledge by default (i.e. picks the machine with the biggest estimated payout), but some percentage of the time randomly selects another machine to pull. This percentage is usually denoted by the variable epsilon. A bot with an epsilon of .1 will explore 10% of the time; this bot is greedier than a bot with an epsilon of .5, which explores half the time. You can then simulate your environment with many bots of different epsilons to find the epsilon that maximizes reward, and then mimic the best bot in the real world. When an individual slot machine has a large variance in payout, exploration is more helpful. When each machine pays out the exact same amount every time (it pays out with a variance of 0), a greedier epsilon is preferred. With perfect knowledge of all the machines before the first pull of the lever, a bot with an epsilon of 0 is best, since it already knows the best machine and there is no benefit to exploring.
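An epsilon-greedy selection rule, continuing the same sketch; the epsilon of 0.1 below is just one of the values you might try in simulation.

```python
def epsilon_greedy_action(epsilon: float) -> int:
    """Explore a random machine with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(K))   # explore: pick any machine at random
    return int(np.argmax(estimates))  # exploit: pick the best-looking machine

# Run one bot for the full budget of pulls, e.g. with epsilon = 0.1.
epsilon = 0.1
total_reward = 0.0
for _ in range(PULLS):
    machine = epsilon_greedy_action(epsilon)
    payout = pull(machine)
    update_estimate(machine, payout)
    total_reward += payout
```

Repeating this loop for several values of epsilon and comparing total_reward is the simulation described above for picking the best bot.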

Optimistic Expectations

A good way of encouraging exploration, even among greedy bots, is to change the bot’s expectations for machines it hasn’t tried. Earlier, I described our bot as expecting a payout of $0 from any machine it had no information on. We could alter this so that the initial expectation is that a machine has a payout of $25. Now, if our first pull of the lever has a payout of $5, even a greedy bot will want to switch to another machine, since another machine would appear to maximize reward.
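In the sketch above, this only changes how the estimates are initialized; the $25 figure mirrors the example in the text.

```python
# Optimistic initial values: start every untried machine at $25 instead of $0.
# A machine that actually pays out around $5 then looks worse than every
# untried machine, so even a greedy bot keeps sampling new machines until
# the optimism wears off.
estimates = np.full(K, 25.0)
counts = np.zeros(K, dtype=int)
```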

Payout Drift

Previously, our bot kept track of the mean reward from each machine and pulled the best one unless it was randomly told to explore. If the payout that each machine gives begins to change over time, meaning that the mean of its normal distribution increases or decreases, there are two simple ways to adjust our bot. The first is just to explore more: by picking a greater epsilon we can improve the bot. This doesn’t require altering any code, just repeating the simulation and picking the new best bot. A better method is, instead of keeping our information about the machines as just a mean and variance, to build a linear regression model of how each machine’s payout changes over time. This would stop a greedy bot from exploiting a machine that no longer has the highest payout.
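One hedged way to act on that idea, continuing the sketch: keep each machine’s payout history and fit a simple linear trend per machine, so a greedy choice can use the predicted current payout rather than the all-time average. The function names and the fallback behaviour here are my own assumptions.

```python
from collections import defaultdict

history = defaultdict(list)   # machine -> list of (pull_number, payout)

def record(machine: int, pull_number: int, payout: float) -> None:
    history[machine].append((pull_number, payout))

def predicted_payout(machine: int, now: int) -> float:
    """Fit payout ~ slope * time + intercept and predict the payout now."""
    observations = history[machine]
    if len(observations) < 2:          # not enough data to fit a trend line
        return estimates[machine]      # fall back to the running average
    times, payouts = zip(*observations)
    slope, intercept = np.polyfit(times, payouts, deg=1)
    return float(slope * now + intercept)
```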

Translated from: https://towardsdatascience.com/the-k-armed-bandit-an-introductory-lesson-to-reinforcement-learning-4d51f5e71fdd
