A Brief Introduction to Reinforcement Learning

by ADL

Reinforcement learning is an area of machine learning in which an agent learns to behave in an environment by performing certain actions and observing the rewards/results it gets from those actions.

With the advancements in robotic arm manipulation, Google DeepMind's AlphaGo beating a professional Go player, and more recently the OpenAI team beating a professional Dota 2 player, the field of reinforcement learning has really exploded in recent years.

In this article, we’ll discuss:

  • What reinforcement learning is, along with core concepts like rewards, tasks, and so on
  • 3 categorizations of reinforcement learning

What is Reinforcement Learning?

Let’s start the explanation with an example — say there is a small baby who starts learning how to walk.

Let’s divide this example into two parts:

1. Baby starts walking and successfully reaches the couch

Since the couch is the end goal, the baby and the parents are happy.

So, the baby is happy and receives appreciation from her parents. It’s positive — the baby feels good (Positive Reward +n).

2. Baby starts walking and falls due to some obstacle in between and gets bruised.

Ouch! The baby gets hurt and is in pain. It’s negative — the baby cries (Negative Reward -n).

That's how we humans learn: by trial and error. Reinforcement learning is conceptually the same, but it is a computational approach to learning through actions.

Reinforcement Learning

Let's suppose that our reinforcement learning agent is learning to play Mario as an example. The reinforcement learning process can be modeled as an iterative loop that works as follows:

  • The RL agent receives state S⁰ from the environment, i.e. Mario

  • Based on that state S⁰, the RL agent takes an action A⁰, say — our RL agent moves right. Initially, this is random.

  • Now, the environment is in a new state (new frame from Mario or the game engine)

  • The environment gives some reward R¹ to the RL agent. It probably gives a +1 because the agent is not dead yet.

This RL loop continues until we are dead or we reach our destination, and it continuously outputs a sequence of state, action and reward.

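Conceptually, this loop can be written down in a few lines. The tiny environment and agent below are made up purely for illustration (they are not Mario or any particular RL library); the point is just the shape of the state ➡ action ➡ reward loop:

```python
import random

class ToyEnv:
    """A stand-in environment (not Mario): walk right from position 0 to 5."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos += 1 if action == "right" else -1
        done = self.pos >= 5 or self.pos < 0
        reward = 1 if self.pos >= 5 else 0   # +1 only for reaching the goal
        return self.pos, reward, done

class RandomAgent:
    """Initially the agent just acts at random, as described above."""
    def choose_action(self, state):
        return random.choice(["left", "right"])

env, agent = ToyEnv(), RandomAgent()
state, done, total_reward = env.reset(), False, 0
while not done:                               # the state-action-reward loop
    action = agent.choose_action(state)       # take action A based on state S
    state, reward, done = env.step(action)    # environment returns new state and reward R
    total_reward += reward                    # the quantity the agent tries to maximize
```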

The basic aim of our RL agent is to maximize the reward.

Reward Maximization

The RL agent basically works on the hypothesis of reward maximization. That's why the RL agent must take the best possible actions in order to maximize the reward.

The cumulative reward at each time step, with the respective action taken, is written as follows.
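
In standard notation, writing $R_{t+1}$ for the reward received after the action taken at time step $t$, this cumulative return is:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots = \sum_{k=0}^{\infty} R_{t+k+1}$$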

However, things don't work this way when we simply sum up all the rewards.

Let us understand this in detail:

Let us say our RL agent (a robotic mouse) is in a maze which contains cheese, electric shocks, and cats. The goal is to eat the maximum amount of cheese before being eaten by the cat or getting an electric shock.

It seems obvious to eat the cheese near us rather than the cheese close to the cat or the electric shock, because the closer we are to the electric shock or the cat, the greater the danger of dying. As a result, the reward near the cat or the electric shock, even if it is bigger (more cheese), will be discounted. This is done because of the uncertainty factor.

It makes sense, right?

Discounting of rewards works like this:

We define a discount rate called gamma. It should be between 0 and 1. The larger the gamma, the smaller the discount and vice versa.

So, our cumulative expected (discounted) reward is written as follows.
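
In standard notation, with discount rate $\gamma$, the discounted return is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$

As a quick illustration, here is a minimal Python sketch (the function name and the reward list are made up for this example) that computes the discounted return for a finite list of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the k-th future reward by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards further in the future contribute less to the total.
print(discounted_return([1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```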

Tasks and their types in reinforcement learning

A task is a single instance of a reinforcement learning problem. We basically have two types of tasks: continuous and episodic.

Continuous tasks

These are the types of tasks that continue forever. For instance, an RL agent that does automated forex/stock trading.

In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment. There is no starting point or end state.

The RL agent has to keep running until we decide to manually stop it.

Episodic tasks

In this case, we have a starting point and an ending point called the terminal state. This creates an episode: a list of States (S), Actions (A), Rewards (R).

For example, consider playing a game of Counter-Strike, where we either shoot our opponents or get killed by them. We shoot all of them and complete the episode, or we are killed. So, there are only two ways for an episode to end.
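
For intuition, an episode can be stored as a simple trajectory of (state, action, reward) steps. This is only a minimal sketch; the field names and example values are made up:

```python
from typing import NamedTuple, List

class Step(NamedTuple):
    state: int     # e.g. an index or frame identifier
    action: int    # e.g. 0 = move left, 1 = move right
    reward: float  # reward received after taking the action

# One episode = the full list of steps from the start state to the terminal state.
Episode = List[Step]

episode: Episode = [Step(0, 1, 0.0), Step(1, 1, 0.0), Step(2, 0, 1.0)]
total_reward = sum(step.reward for step in episode)
```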

Exploration and exploitation trade-off

There is an important concept in reinforcement learning: the exploration and exploitation trade-off. Exploration is all about finding more information about the environment, whereas exploitation is using already-known information to maximize the reward.

Real-life example: say you go to the same restaurant every day. You are basically exploiting. But on the other hand, if you search for a new restaurant every time before going to any one of them, then it's exploration. Exploration is very important in the search for future rewards, which might be higher than the nearby rewards.

In the above game, our robotic mouse can get a good amount of small cheese (+0.5 each). But at the top of the maze there is a big sum of cheese (+100). So, if we only focus on the nearest reward, our robotic mouse will never reach the big sum of cheese; it will just exploit.

But if the robotic mouse does a little bit of exploration, it can find the big reward, i.e. the big cheese.

This is the basic concept of the exploration and exploitation trade-off.

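A common way to balance the two (standard practice, not something specific to this article) is an epsilon-greedy rule: with a small probability the agent explores a random action, otherwise it exploits the action it currently values most. Here is a minimal sketch, with a made-up value table for the robotic mouse:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))      # explore
    return max(action_values, key=action_values.get)   # exploit

# Hypothetical estimates: small cheese nearby vs. big cheese far away.
values = {"go_to_small_cheese": 0.5, "go_towards_big_cheese": 0.0}
action = epsilon_greedy(values, epsilon=0.2)
```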

Approaches to Reinforcement Learning

Let us now understand the approaches to solving reinforcement learning problems. Basically there are 3 approaches, but we will only cover the 2 major approaches in this article:

1. Policy-based approach

In policy-based reinforcement learning, we have a policy which we need to optimize. The policy basically defines how the agent behaves:

We learn a policy function which helps us map each state to the best action.

Going deeper into policies, we further divide them into two types:

  • Deterministic: a policy at a given state (s) will always return the same action (a). That is, it is pre-mapped as S=(s) ➡ A=(a).

  • Stochastic: it gives a probability distribution over the different actions, i.e. Stochastic Policy ➡ p(A = a | S = s). A minimal sketch of both policy types follows this list.
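
Here is that minimal sketch, assuming a toy setting with a handful of named states and actions (the tables are invented purely for illustration):

```python
import random

# Deterministic policy: each state is pre-mapped to exactly one action.
deterministic_policy = {"s0": "right", "s1": "jump", "s2": "left"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"right": 0.8, "left": 0.1, "jump": 0.1},
    "s1": {"right": 0.2, "left": 0.1, "jump": 0.7},
}

def act_stochastic(state):
    dist = stochastic_policy[state]                  # p(A = a | S = s)
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]
```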

2. Value-based approach

In value-based RL, the goal of the agent is to optimize the value function V(s), which is defined as a function that tells us the maximum expected future reward the agent will get in each state.

The value of each state is the total amount of the reward an RL agent can expect to collect over the future, from a particular state.

The agent will use this value function to select which state to move to at each step. The agent will always take the state with the biggest value.

In the example below, we see that at each step we will take the biggest value to achieve our goal: 1 ➡ 3 ➡ 4 ➡ 6, and so on…
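
As a rough sketch of that greedy selection, suppose we already have estimated state values and know which states are reachable from which; both tables below are hypothetical and only mirror the 1 ➡ 3 ➡ 4 ➡ 6 idea:

```python
# Hypothetical state values and transitions, only for illustration.
state_values = {1: 0.1, 2: 0.3, 3: 0.5, 4: 0.7, 5: 0.2, 6: 1.0}
neighbours = {1: [2, 3], 3: [4, 5], 4: [6], 6: []}

def greedy_path(start):
    """Repeatedly move to the reachable state with the biggest value."""
    path, state = [start], start
    while neighbours.get(state):
        state = max(neighbours[state], key=state_values.get)
        path.append(state)
    return path

print(greedy_path(1))  # [1, 3, 4, 6]
```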

The game of Pong: an intuitive case study

Let us take a real-life example of playing Pong. This case study will just introduce you to the intuition of how reinforcement learning works. We will not get into details in this example, but in the next article we will certainly dig deeper.

Suppose we teach our RL agent to play the game of Pong.

Basically, we feed the game frames (new states) to the RL algorithm and let the algorithm decide whether to go up or down. This network is called a policy network, which we will discuss in our next article.

The method used to train this algorithm is called the policy gradient. We feed in frames from the game engine, and the algorithm produces an initially random output which earns a reward, and this reward is fed back to the algorithm/network. This is an iterative process.

We will discuss policy gradients in greater detail in the next article.

In the context of the game, the scoreboard acts as a reward or feedback to the agent. Whenever the agent scores +1, it understands that the action it took in that state was good enough.

Now we will train the agent to play the Pong game. To start, we will feed in a bunch of game frames (states) to the network/algorithm and let the algorithm decide the action. The initial actions of the agent will obviously be bad, but our agent can sometimes be lucky enough to score a point, and this might be a random event. But due to this lucky random event, it receives a reward, and this helps the agent understand that the series of actions was good enough to fetch a reward.
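
To make this concrete, here is a minimal sketch of what such a policy network could look like, loosely following the well-known "Pong from pixels" setup: a flattened, preprocessed frame goes in and a single probability of moving up comes out. The 80×80 input size, 200 hidden units, and random untrained weights are assumptions for illustration; training these weights with policy gradients is left for the next article:

```python
import numpy as np

# A tiny two-layer policy network: flattened frame in, P(move up) out.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((200, 80 * 80)) * 0.01   # hidden layer weights (untrained)
W2 = rng.standard_normal(200) * 0.01              # output layer weights (untrained)

def policy_forward(frame):
    """Map a preprocessed game frame to the probability of moving up."""
    hidden = np.maximum(0, W1 @ frame)        # ReLU hidden layer
    logit = W2 @ hidden
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid -> P(action = up)

frame = rng.random(80 * 80)                   # stand-in for a real preprocessed frame
prob_up = policy_forward(frame)
action = "up" if rng.random() < prob_up else "down"   # sample the action
```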

So, in the future, the agent is likely to take the actions that fetch a reward over the actions that do not. Intuitively, the RL agent is learning to play the game.

Limitations

During training, when the agent loses an episode, the algorithm will discard or lower the likelihood of the whole series of actions taken in that episode.

But if the agent was performing well from the start of the episode, and it lost the game only because of the last 2 actions, it does not make sense to discard all the actions. Rather, it makes sense to just remove the last 2 actions which resulted in the loss.

This is called the Credit Assignment Problem. This problem arises because of a sparse reward setting. That is, instead of getting a reward at every step, we get the reward at the end of the episode. So, it's up to the agent to learn which actions were correct and which actions actually led to losing the game.
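
To see why this is a problem, note that with a sparse reward the only learning signal is the single end-of-episode outcome, which then gets attached to every action in the episode. A toy sketch (the action labels are invented for illustration):

```python
# Sketch of the sparse-reward / credit-assignment issue: a single terminal
# reward (+1 win, -1 loss) is the only signal for the whole episode, so every
# action in a lost episode gets pushed down, even the good early ones.
episode_actions = ["good", "good", "good", "bad", "bad"]  # hypothetical labels
final_reward = -1                                         # we lost the game

per_step_signal = [final_reward for _ in episode_actions]
print(per_step_signal)  # [-1, -1, -1, -1, -1] -> early "good" actions are penalized too
```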

So, due to this sparse reward setting in RL, the algorithm is very sample-inefficient. This means that a huge number of training examples have to be fed in to train the agent. But the fact is that sparse reward settings fail in many circumstances due to the complexity of the environment.

So, there is something called reward shaping which is used to solve this. But again, reward shaping also suffers from some limitations, as we need to design a custom reward function for every game.

Closing Note

Today, reinforcement learning is an exciting field of study. Major developments have been made in the field, of which deep reinforcement learning is one.

We will cover deep reinforcement learning in our upcoming articles. This article covers a lot of concepts. Please take your time to understand the basic concepts of reinforcement learning.

But I would like to mention that reinforcement learning is not a secret black box. Whatever advancements we are seeing today in the field of reinforcement learning are a result of bright minds working day and night on specific applications.

Next time we’ll work on a Q-learning agent and also cover some more basic stuff in reinforcement learning.

Until then, enjoy AI…

Important: This article is the 1st part of the Deep Reinforcement Learning series. The complete series will be available both in text-readable form on Medium and in video-explanatory form on my YouTube channel.

For a deeper and more intuitive understanding of reinforcement learning, I would recommend that you watch the video below:

Subscribe to my YouTube channel for more AI videos: ADL.

If you liked my article, please click the clap button, as it keeps me motivated to write, and please follow me on Medium.

If you have any questions, please let me know in a comment below or on Twitter. Subscribe to my YouTube channel for more tech videos: ADL.

Translated from: https://www.freecodecamp.org/news/a-brief-introduction-to-reinforcement-learning-7799af5840db/
