An Intuitive Introduction to Reinforcement Learning

Reinforcement Learning is the type of learning that is closest to the way humans learn.

Reinforcement Learning, as opposed to supervised and unsupervised learning, is a goal-oriented learning technique. It is based on operating in an environment in which a hypothetical person (the agent) is expected to take a decision (action) from a set of possible decisions and to maximize the profit (reward) obtained by making that decision, iteratively learning to select decisions that lead to the desired goal (basically trial and error). I’ll explain this in greater detail as we proceed through the article.
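
To make the agent-environment vocabulary concrete, here is a minimal sketch of that interaction loop in Python. The toy number-line environment, its reward values, and the step function are all invented for illustration; they are not from the original article, and the agent here simply acts at random rather than learning anything yet.

```python
import random

# A made-up toy environment: the agent starts at position 0 on a number line
# and tries to reach position 5 by stepping left or right. Reaching the goal
# yields a positive reward; every other step costs a little.
GOAL = 5
ACTIONS = [-1, +1]  # the set of possible decisions (step left / step right)

def step(state, action):
    """Apply an action to the environment and return (next_state, reward, done)."""
    next_state = state + action
    if next_state == GOAL:
        return next_state, 1.0, True
    return next_state, -0.1, False

state = 0
for t in range(20):                  # trial-and-error interaction loop
    action = random.choice(ACTIONS)  # a purely random policy, for now
    state, reward, done = step(state, action)
    print(f"t={t}: action={action:+d}, state={state}, reward={reward}")
    if done:
        break
```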

In this article, I’ll be discussing the fundamentals of Reinforcement Learning or RL (with examples wherever possible).

Supervised? Unsupervised? Reinforcement Learning!

First things first! Before we even start talking about RL, let’s see how exactly it differs from supervised and unsupervised learning techniques.

Let’s consider the example of a kid who is learning to ride a bicycle. We’ll see how this problem would be addressed if the kid were to learn in a supervised, unsupervised, or reinforcement learning way:

  1. Supervised Learning: Now, if the kid starts calculating the force he needs to apply on the pedal, or the angle he needs to maintain with the ground to stay balanced, and he optimizes these calculations every time he rides the bicycle to perfect his riding skills, then he would be said to be learning in a supervised way.

  2. Unsupervised Learning: Whereas, if the kid starts watching thousands of other people riding bicycles, and based on that observation he starts to figure out what exactly needs to be done to ride a bicycle, then he would be said to have learned in an unsupervised way.

  3. Reinforcement Learning: Finally, if he’s given a few options like hitting the pedal, turning the handle left or right, applying the brakes, etc., and the freedom to try whatever he wants among these options in order to ride the bicycle successfully, he’d first do it wrong and fail (maybe fall off); but eventually, after a few failed attempts, he’ll figure out how to do it and finally succeed. This case is an example of reinforcement learning.

Well, now you know why it is said to be the closest to the way humans learn! You can expect the topics to get a little more formal as we proceed further.

Exploration vs. Exploitation

Let’s continue with the example of the kid, who knows a set of actions that he can perform to ride a bicycle. So, consider a scenario where he has finally figured out that hitting the pedal continuously drives the bicycle forward. However, he doesn’t realize that at some point he has to stop (i.e. applying the brakes at the right time is an integral part of riding a bicycle). But he’s happy that he now knows how to ride the bicycle and doesn’t care about future events. Let’s call his happiness the ‘reward’, meaning that he is rewarded for his action of hitting the pedal. And since he’s being rewarded, he is purely ‘exploiting’ the current action, i.e. pedaling, not knowing that in the end he might crash somewhere, which would leave him far from achieving his ultimate goal: riding the bicycle correctly.

Now, he can ‘explore’ other options from the set of available actions instead of just pedaling. Eventually, he’ll be able to stop the bicycle whenever he wants to. In a similar fashion, he’ll learn how to take a turn, and in this way he’d become a good rider.

But too much of anything is bad! We saw that too much exploitation can lead to failure. In the same way, too much exploration is also bad: for example, if the kid just randomly changes his action at every instant, he’ll be nowhere near riding the bike, will he? So basically it’s a trade-off, known as the Exploration-Exploitation Dilemma, and it is one of the major considerations when solving an RL problem.
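
A common way to balance the two (not named in the article, but a standard RL heuristic) is an epsilon-greedy rule: with a small probability the agent explores a random action, and otherwise it exploits the action that currently looks best. Here is a minimal sketch; the bicycle actions and their value estimates are invented purely for illustration.

```python
import random

ACTIONS = ["pedal", "steer_left", "steer_right", "brake"]

# Hypothetical estimates of how good each action has looked so far,
# learned from the kid's previous tries in the current state.
value_estimates = {"pedal": 0.8, "steer_left": 0.1, "steer_right": 0.1, "brake": 0.3}

def choose_action(epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                      # explore
    return max(value_estimates, key=value_estimates.get)   # exploit

print([choose_action() for _ in range(10)])
```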

Note that the kid decides his action at a given instant on the basis of his current ‘state’ with respect to the environment, i.e. his current motion/position while cycling, together with the rewards obtained from previous tries. (This decision-making mechanism is what RL is all about.)

Building Blocks of an RL Problem

  1. Policy: A policy defines the behavior of an RL agent. In our example, the policy would be the way the kid decides which action to choose among the available ones (the kid is the agent).

  2. Reward: Rewards define the goal of the problem. At each step, the environment sends a reward to the agent. In our example, the pleasure of riding the bicycle, or the pain of falling off, would be the reward (the second case could be referred to as a penalty).

  3. Value Function: The reward is the environment’s immediate response to the agent. However, we are interested in maximizing the reward in the long run, and this is what value functions capture. Formally, the value of a state is the total reward an agent can expect to accumulate over the future, starting from that state (Sutton & Barto). If the kid thinks through what could happen in the future if he opts for a particular action, say over the next few hundred meters, then that could be called the value (a small numerical sketch of this idea follows this list).

  4. Model: A model of the environment is a tool for planning. It mimics the actual environment and hence can be used to make inferences about how the environment would behave. For example, given a state and an action, the model might predict the resultant next state and next reward (Sutton & Barto). And of course, RL mechanisms can be classified into model-based and model-free methods.
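
To make the value-function idea in point 3 a little more concrete, here is a small numerical sketch of how a state’s value relates to rewards. The reward sequences are made up, and the discount factor gamma comes from the standard Sutton & Barto formulation (the article itself does not introduce discounting); the state’s value is estimated here by simply averaging the returns of a few hypothetical rides (a Monte Carlo style estimate).

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward accumulated from some starting state onward."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two hypothetical rides that both start from the same state:
ride_1 = [0.1, 0.1, 0.1, 1.0]    # steady pedaling followed by a successful stop
ride_2 = [0.1, 0.1, -1.0]        # pedaling followed by a crash

# The value of the state is the *expected* return from it; here we simply
# average the returns observed over the two sample rides.
value_estimate = (discounted_return(ride_1) + discounted_return(ride_2)) / 2
print(round(value_estimate, 3))
```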

Conclusion

We’ve now got an intuition about what an RL problem looks like and how one can address it. Moreover, we distinguished RL from supervised and unsupervised learning. Reinforcement Learning is far more intricate than the outline laid out in this article, but this should be enough to make the fundamental concepts clear.

Translated from: https://towardsdatascience.com/an-intuitive-introduction-to-reinforcement-learning-ef8f004da55c
