Reinforcing the Science Behind Reinforcement Learning

Deep Reinforcement Learning and Reinforcement Learning

Machine Learning, Reinforcement Learning

You’re getting bored stuck in lockdown, so you decided to play computer games to pass the time.

You launched Chess and chose to play against the computer, and you lost!

But how did that happen? How can you lose against a machine that came into existence like 50 years ago?

Photo by Piotr Makowski on Unsplash

This is the magic of Reinforcement learning.

Reinforcement learning falls under the umbrella of Machine Learning. It aims at developing intelligent behavior in a complex, dynamic environment. Nowadays, since the range of AI is expanding enormously, we can easily see its importance all around us. From autonomous driving, recommender search engines, and computer games to robot skills, AI is playing a vital role.

Pavlov’s Conditioning

When we think about AI, we perceive it as something about the future, but the idea actually takes us back to the late 19th century. Ivan Pavlov, a Russian physiologist, was studying the salivation effect in dogs. He was interested in knowing how much dogs salivate when they see food but, while conducting the experiment, he noticed that the dogs were salivating even before seeing any food. Following up on that observation, Pavlov would ring a bell before feeding them and, as expected, they again started salivating. The reason behind this behavior is their ability to learn: they had learned that after the bell, they would be fed. Another thing to ponder is that a dog doesn’t salivate because the bell is ringing, but because, given its past experiences, it has learned that food will follow the bell.

Photo by engin akyurt on Unsplash

What is Reinforcement Learning?

Reinforcement Learning is a set of Machine Learning techniques that enable an AI agent to interact with its environment and thus learn from its own sequence of actions and experiences.

For the sake of illustration, imagine you’re stuck on an isolated island. You can expect yourself to freak out at first but, with no options left, you’ll start fighting for your survival. You’ll look for a place to hunt, you’ll look for a place to sleep, and you’ll work out what to eat and what to avoid. If staying in a particular place keeps you safe, you’ll note it as a correct action to repeat and, likewise, if you ate some animal that led to diarrhea, you’ll avoid eating it in the future. Your actions will become better over time, and you’ll adjust to the new environment by learning. Reinforcement learning follows the same method: we expect the agent to experience the new environment, track its actions and their consequences by discovering errors and rewards, and learn to get better, i.e., aim at maximizing the reward.

Photo by Aleks Dahlberg on Unsplash

But, how does it compare against supervised learning?

It is possible to use a supervised learning method instead of reinforcement learning techniques. But, for that, we would need a really large dataset covering every action and its consequence. Another unfavorable outcome would be limited learning: suppose we record the actions of the best player; he is still not perfect, and a machine that imitates his actions might become as great as him but will never be able to exceed his scores.

And, how does it stand against Unsupervised Learning?

In unsupervised learning, there is no direct connection between input and output; rather, it aims at recognizing patterns. Reinforcement learning, on the contrary, is all about learning from the output produced by past inputs.

Then, is it Deep Learning?

Deep learning irrefutably comes under the umbrella of Machine Learning and is capable of tackling complex problems that require human-like intelligence.

The Venn diagram shows the relation between all the Machine Learning techniques. According to the Universal Approximation Theorem (UAT), neural nets can approximate virtually any function, but they are not necessarily the optimal solution to every problem, as they require a lot of data to process and are often challenging to interpret.
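
Roughly, the theorem says: for any continuous function f on a compact set K and any tolerance ε > 0, there exists a single-hidden-layer network g (with a suitable non-linear activation and enough units) such that

$$ \sup_{x \in K} \lvert f(x) - g(x) \rvert < \varepsilon $$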

Venn diagram of Machine Learning techniques

Analyzing the figure shows that we are not required to use Deep Learning for every Reinforcement Learning problem, which clears up the myth that Reinforcement Learning depends solely on Deep Learning.

How does Reinforcement Learning work?

In Reinforcement Learning, we focus on the interaction between an Agent and an Environment.

  • An Agent can be regarded as the “solution”, which is a computer program that we expect to make decisions to solve decision-making problems.

  • An Environment can be regarded as the “problem”, which is where the decision taken by the agent is implemented.

For example, in the case of a chess game, we can consider the Agent to be one of the players, while the Environment consists of the board and the opponent.

Both components are inter-dependent: the Agent tries to adjust its actions based on the influence of the Environment, and the Environment reacts to the Agent’s actions.

The Environment is bound by a set of variables that are usually associated with the decision-making problem. The set of all possible values of these variables can be regarded as the state space. A state is an element of the state space, i.e., a particular value the variables take.

At each state, the Environment provides a set of actions to the Agent, from which the Agent should choose one. The Agent tries to influence the Environment using these actions, and the Environment may change states as a response to the Agent’s actions. The transition function is what keeps track of these associations.
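
As a minimal sketch (the states and actions below are invented purely for illustration, not taken from the original article), a transition function for a tiny deterministic environment can be written as a plain dictionary mapping (state, action) pairs to next states:

```python
# Toy deterministic transition function: (state, action) -> next state.
# The states and actions are hypothetical, chosen only for illustration.
TRANSITIONS = {
    ("start", "left"): "lake",
    ("start", "right"): "forest",
    ("lake", "fish"): "fed",
    ("forest", "hunt"): "fed",
    ("forest", "sleep"): "rested",
}

def transition(state: str, action: str) -> str:
    """Return the next state; stay in place if the action is unavailable."""
    return TRANSITIONS.get((state, action), state)

print(transition("start", "left"))  # -> lake
print(transition("lake", "fish"))   # -> fed
```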

The Environment either rewards or penalizes the Agent based on its actions. The reward is the positive feedback provided when the last action of the Agent contributes to achieving the desired goal. The penalty is the negative feedback provided by the Environment when the last action of the Agent results in a deviation from the goal. The Agent’s goal is to maximize the overall reward and keep making its actions better in order to achieve the desired final result.

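To make this loop concrete, here is a minimal sketch in Python (the environment, its states, and its rewards are invented for illustration; this is not code from the original article). A naive agent picks random actions, the environment answers with a new state and a reward or penalty, and the episode's total reward is accumulated:

```python
import random

class IslandEnv:
    """Tiny toy environment: the agent looks for food on an island."""

    ACTIONS = ["hunt", "fish", "sleep"]

    def __init__(self):
        self.state = "hungry"

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        if action in ("hunt", "fish"):
            self.state = "fed"
            return self.state, +1.0, True   # positive feedback: goal reached
        self.state = "hungry"
        return self.state, -0.1, False      # penalty: still hungry

env = IslandEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice(IslandEnv.ACTIONS)   # a (very naive) agent
    state, reward, done = env.step(action)
    total_reward += reward
print("total reward:", total_reward)
```

A real Reinforcement Learning agent would replace the random choice with a policy that improves based on the rewards it has seen.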

Another thing that Reinforcement Learning requires is a lot of training time, as the rewards aren’t disclosed to the Agent until the end of an episode (game). For example, if our computer plays chess against us and wins, then it will be rewarded (as our desired outcome was to win), but it still needs to figure out which of its actions earned that reward, and that can only be achieved when it is given a ton of training time and data.

How does Reinforcement Learning learn? (Q-learning)

Goal: To maximize the total reward

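In symbols (written out here for reference), if an episode lasts T steps and yields rewards r_1, r_2, …, r_T, the total reward the Agent tries to maximize is simply their sum:

$$ R = r_1 + r_2 + r_3 + \dots + r_T = \sum_{t=1}^{T} r_t $$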

We expect the rewards to come early, so as to make our training faster and thus quickly achieve the desired outcomes.

Ideal Case

But, in a real case, we encounter late rewards, and to penalize late rewards we introduce a discount factor (γ).

Real Case
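
With a discount factor γ between 0 and 1, rewards that arrive later are weighted less; the resulting discounted return (stated here in its standard form for reference) is:

$$ R = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots = \sum_{t=1}^{T} \gamma^{\,t-1} r_t $$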

In a real-case scenario, as we move towards the right (i.e., towards rewards further in the future), the uncertainty increases.

Q-learning
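
For reference, the standard tabular Q-learning update (stated here in its usual textbook form, not copied from the article’s figure) nudges the current estimate towards the observed reward plus the discounted value of the best next action:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big] $$

Here α is the learning rate and γ is the discount factor introduced above.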

Bellman Equation

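The action-value form of the Bellman optimality equation (its usual textbook statement) says that the value of a state–action pair equals the expected immediate reward plus the discounted value of the best action in the next state:

$$ Q^*(s, a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a \,\big] $$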

Our goal was to maximize the reward, or, equivalently, to minimize the error (loss).

To minimize the loss, we can implement Gradient Descent using a mean-squared error loss.
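
One common way to write this down (a sketch assuming the Q-values come from a model with parameters θ, as in deep Q-learning) is a mean-squared error between the current estimate and the Bellman target, which gradient descent then minimizes:

$$ L(\theta) = \Big( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \Big)^{2} $$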


Exploration v/s Exploitation trade-off

Other interesting components of Reinforcement Learning are Exploration and Exploitation. To obtain quick rewards, an Agent should follow its past good experiences. But to discover those good actions in the first place, it has to try different actions at first.

In a nutshell, to obtain quick rewards an Agent has to exploit, but it is also expected to explore to make its actions better, which might help it get a better reward in the long run.
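
A common way to balance the two is an ε-greedy rule: with a small probability ε the Agent explores a random action, otherwise it exploits the action with the highest estimated value. A minimal sketch (the value estimates below are hypothetical, used only for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# Hypothetical value estimates for three actions.
q = {"spot_1": -1.0, "spot_2": 0.5, "spot_3": 0.0}
print(epsilon_greedy(q, epsilon=0.1))
```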

Let’s get back to the island. You have three spots for fishing, and each is home to one type of fish: spot 1 is the habitat of black fish, which are poisonous; spot 2 is home to orange fish, which are delicious as well as nutritious; and spot 3 holds grey fish, which are the best in terms of nutrition and taste. The goal is to never eat a black fish and to try to have a grey one.

Spot 1 vs Spot 2 vs Spot 3

Let’s assume that on Day 1 you chose spot 1 for fishing and ended up eating a black fish and having diarrhea. On Day 2, you reached spot 2 and ended up having a delicious meal. Now, your instincts will try to exploit the path you’ve chosen, i.e., the road to spot 2, because as per your past experiences spot 2 seems to be the better policy. Hence, your mind will be stuck in a policy where it settles for a moderate reward.

Exploration: Helps you to try various actions; good in the beginning.

Exploitation: Samples good experiences from the past; needs memory space; good at the end.
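
To see how this plays out on the island example, here is a small simulated sketch (the reward numbers and noise below are invented for illustration): a purely exploiting agent locks onto spot 2 after its first good meal, while an ε-greedy agent keeps occasionally trying other spots and will typically learn that spot 3 is even better.

```python
import random

# Invented average rewards: spot 1 is harmful, spot 2 is good, spot 3 is best.
SPOT_REWARD = {"spot_1": -1.0, "spot_2": 0.5, "spot_3": 1.0}

def run(epsilon, days=200, seed=0):
    rng = random.Random(seed)
    value = {spot: 0.0 for spot in SPOT_REWARD}   # estimated value per spot
    visits = {spot: 0 for spot in SPOT_REWARD}
    total = 0.0
    for _ in range(days):
        if rng.random() < epsilon:
            spot = rng.choice(list(SPOT_REWARD))          # explore
        else:
            spot = max(value, key=value.get)              # exploit
        reward = SPOT_REWARD[spot] + rng.gauss(0, 0.1)    # noisy outcome
        visits[spot] += 1
        # Incremental average: update the estimated value of the chosen spot.
        value[spot] += (reward - value[spot]) / visits[spot]
        total += reward
    return total, value

print("pure exploitation:", run(epsilon=0.0))
print("epsilon-greedy   :", run(epsilon=0.1))
```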

Conclusion

Hopefully, this article helps you understand Reinforcement Learning in the best possible way and also assists you with its practical usage.

As always, thank you so much for reading, and please share this article if you found it useful!

Feel free to connect:

LinkedIn ~ https://www.linkedin.com/in/dakshtrehan/

Instagram ~ https://www.instagram.com/_daksh_trehan_/

Github ~ https://github.com/dakshtrehan

Follow for further Machine Learning / Deep Learning blogs.

Medium ~ https://medium.com/@dakshtrehan

Want to learn more?

Detecting COVID-19 Using Deep Learning

The Inescapable AI Algorithm: TikTok

An insider’s guide to Cartoonization using Machine Learning

Why are YOU responsible for George Floyd’s Murder and Delhi Communal Riots?

Decoding science behind Generative Adversarial Networks

Understanding LSTM’s and GRU’s

Recurrent Neural Network for Dummies

Convolution Neural Network for Dummies

Diving Deep into Deep Learning

Why Choose Random Forest and Not Decision Trees

Clustering: What it is? When to use it?

Start off your ML Journey with k-Nearest Neighbors

Naive Bayes Explained

Activation Functions Explained

Parameter Optimization Explained

Gradient Descent Explained

Logistic Regression Explained

Linear Regression Explained

Determining Perfect Fit for your ML Model

Cheers

Translated from: https://medium.com/@dakshtrehan/reinforcing-the-science-behind-reinforcement-learning-d2643ca39b51
