Understanding Reinforcement Learning

#REINFORCEDSERIES

“Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.”

Reinforcement learning has become a very important research topic and has found significant applications in several domains. In the Reinforced series we will take a look at the core concepts of reinforcement learning and understand the basic principles behind how they work, along with applications. This article introduces the topic and explains the basic terminology associated with reinforcement learning.

Divisions of Machine Learning

[Figure: the three divisions of machine learning — supervised learning, unsupervised learning, and reinforcement learning]

Machine Learning mostly contains three types of problems, as shown in the above figure.

Why Reinforcement Learning?

Let’s explore the types to find out.

Supervised learning learns from a set of labeled examples. From the instances and the labels, supervised learning models try to find correlations among the features used to describe an instance and learn how each feature contributes to the label corresponding to that instance. On receiving an unseen instance, the goal of supervised learning is to correctly label the instance based on its features.

Unsupervised learning deals with data instances only. This approach tries to group data and form clusters based on the similarity of features. If two instances have similar features and are placed in close proximity in feature space, there is a high chance the two instances belong to the same cluster. On getting an unseen instance, the algorithm tries to find which cluster the instance should belong to based on its features.

From the above discussion, we can get an intuition: both processes are single-instance, single-prediction processes. So, they can be called single decision processes.

Having explored the other two types, it is evident that neither of them can be used to reach a goal state in a multiple decision process such as a game. To play a game, we need to make multiple choices and predictions during the course of the game to achieve success, so games can be called multiple decision processes. This is where we need the third type of algorithm, called reinforcement learning algorithms. This class of algorithms is based on decision-making chains, which lets them support multiple decision processes.

The whole basic idea of reinforcement learning is based on Markov processes. So, let’s study them in detail.

Markov Process

Markov Assumption:

The Markov assumption states that the probability P of an event X at time t+1 depends only on the behavior of the event X at time t and is independent of the behavior of X at times t = 0, 1, …, t-1; i.e., P(X(t+1)) depends only on X(t).

So, mathematically:

P( X(t+1) | X(t) )= P( X(t+1) | H(t) )

Where H(t) = {X(0), X(1), X(2), …, X(t)}

H(t) represents the history of the steps of the process. The equation says that the probability of X(t+1) given X(t) is equal to the probability of X(t+1) given the entire history of the steps of the process.

Markov Assumption presents a memoryless approach.

Any process or system that satisfies the Markov assumption is termed a Markov process.

Markov Chain: It is a stochastic or probability-based model that consists of a sequence of events. The probability of each event occurring depends only on the state attained by the previous event. When we apply the Markov property to a random sequential process, a Markov chain is obtained.

A Markov process or Markov Chain is represented by a tuple (S, P).

  • S represents the set of states and

  • P represents the state transition probability matrix. P(s', s) gives the probability that the process will reach state s' at time t+1 if it was in state s at time t.

For instance:

[Figure: a three-state Markov chain with states 0, 1, and 2 and the transition probabilities listed below]

Suppose the above diagram describes a Markov process. Then:

S -> [0, 1, 2]

P -> [ [0.1, 0.4, 0.5], [0.3, 0.2, 0.5], [0.5, 0.3, 0.2] ]

P is given as:

[ [ P(X(t+1)=0 | X(t)=0)  P(X(t+1)=1 | X(t)=0)  P(X(t+1)=2 | X(t)=0) ],
  [ P(X(t+1)=0 | X(t)=1)  P(X(t+1)=1 | X(t)=1)  P(X(t+1)=2 | X(t)=1) ],
  [ P(X(t+1)=0 | X(t)=2)  P(X(t+1)=1 | X(t)=2)  P(X(t+1)=2 | X(t)=2) ] ]

Given:

P(X(0) = 0) = 0.3

P(X(0) = 1) = 0.4

P(X(0) = 2) = 0.3

The statements above give the probability that the process starts from each state; i.e., the probability that the state at t = 0 is state 0 is 0.3, and so on.

P is a matrix; it is written as a list of rows here due to editor constraints.

From state 0, the process reaches state 0 with probability 0.1, state 1 with probability 0.4, and so on.

If we pick a sequence, say 2 → 1 → 0, then according to the Markov property and the chain rule,

P(210) = P(X(0)=2) . P(X(1)=1 | X(0)=2) . P(X(2)=0 | X(1)=1)

So, P(210) = 0.3 x 0.3 x 0.3 = 0.027
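
To make the calculation concrete, here is a small Python sketch; S, P, and the initial distribution are taken from the example above, and the function name is just for illustration.

```python
# Sketch: probability of observing a state sequence in the example Markov chain.
# S, P, and the initial distribution are the ones from the example above.

S = [0, 1, 2]
P = [[0.1, 0.4, 0.5],   # P[s][s'] = P(X(t+1)=s' | X(t)=s)
     [0.3, 0.2, 0.5],
     [0.5, 0.3, 0.2]]
init = [0.3, 0.4, 0.3]  # P(X(0)=s)

def sequence_probability(seq):
    """Chain rule under the Markov assumption: P(x0) * product of P(x_{t+1} | x_t)."""
    prob = init[seq[0]]
    for s, s_next in zip(seq, seq[1:]):
        prob *= P[s][s_next]
    return prob

print(sequence_probability([2, 1, 0]))  # 0.3 * 0.3 * 0.3 = 0.027
```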

Reinforcement Learning Setting

[Figure: the reinforcement learning loop — the agent takes actions in the environment and receives new states and rewards]

Reinforcement learning is widely used for building game-playing agents, so let's understand the setting with respect to a game. Let's go with one of the most common examples, Grid World.

[Figure: a Grid World — an actor in the first box must reach the ball in the third box, with blocked boundaries shown in red]

Say the actor in the first box has to reach the third box, which holds a ball he needs. So, the third box is the goal state. The red boundaries are blocks, i.e., he/she cannot pass through them. Here the man is called an Agent. It moves from its initial position to achieve its final goal. Now, when we play a game, the game offers several obstacles at several stages. The entire setting in which the game is played is called the Environment. To move from the initial position to the final or goal position, the agent has to make a move at every stage of the game. These moves are called Actions. In this case we can take only 4 actions; similarly, every game typically has a definite set of actions that we can take at each stage. Each of the boxes or stages is termed a State of the environment. The states give the position of the agent with respect to the environment.

The agent keeps taking actions and changing states until it reaches the final or goal state. But the question is: how would an agent know which state is the goal state, and which action sequence it should follow to reach the goal state while keeping the number of steps optimal? As an indicator of these facts, each stage or state of the environment provides a Reward to the agent. The goal state in our case gives a reward of +10 and the others give a reward of -1.

Reinforcement learning aims to train the agent to reach the goal state from any initial state. This is done by "positively reinforcing" an action at a particular state that helps the agent reach the goal state, by providing the agent a higher reward for taking that action in that state. It is a way of signaling the agent to take that action at that particular step. The aim of the agent is to maximize the reward while reaching the goal state, so as to satisfy optimality.

So, the agent starts from an initial state of the environment and takes an action given that state. With each action taken by the agent, the environment reacts with a new state, which may or may not differ from the previous state, and a reward. This describes the whole reinforcement learning setting.
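
As a minimal sketch of this interaction loop (the `GridWorld` class, its `reset`/`step` methods, and the random policy are hypothetical stand-ins, not any real library's API):

```python
import random

# Minimal sketch of the agent-environment loop described above.
# GridWorld, reset(), and step() are illustrative stand-ins, not a real API.

class GridWorld:
    """Toy 1-D grid: states 0, 1, 2; the goal is state 2 (+10), every other step costs -1."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                       # action: -1 = move left, +1 = move right
        self.state = min(max(self.state + action, 0), 2)
        done = (self.state == 2)
        reward = 10 if done else -1
        return self.state, reward, done

env = GridWorld()
state, done, episode_return = env.reset(), False, 0
while not done:
    action = random.choice([-1, +1])              # a random policy, just for illustration
    state, reward, done = env.step(action)
    episode_return += reward
print("episode return:", episode_return)
```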

Types of Reinforcement Learning Agents

Model-Based: In this type, the agent tries to "model the environment". The agent captures the features and behaviors of the environment and its states. Using these observations, the agent tries to replicate the original environment and create a virtual model with behavior similar to the original. Instead of interacting with the original environment, the agent interacts with the model to simulate future movements, actions, and responses. Such agents are fast because they do not actually interact with the environment, so they do not need to wait for its response. They are risky because, if the behaviors are wrongly interpreted during observation, the whole model will be built incorrectly.

Model-Free: In this type, the agent interacts with the environment directly. Such agents have policies and value functions. Though this type of learning is slow, it is reliable. The agent takes actions, collects rewards, positive or negative, and updates its choice of states using the reward function.

Here we will be talking about Model-Based Learning.

Components of Reinforcement Learning:

The basic components of Reinforcement Learning are:

  1. Model: Representation of the world or environment.

  2. Policy: A mapping from states to actions. It tells the agent which action to take in a given state.

  3. Value Function: We have talked about rewards and states. The value function is a function of the reward of the current state and also the rewards of future steps. It shows how important or valuable a state of the environment is to the agent. It is a measure of the maximum reward an agent can expect to get in the future from a given state. The higher the value function, the more favorable the state. One thing to notice is that the value function is a property of the state of the environment and not of the agent; it just serves as a reference or guidance for the agent. (A minimal sketch of these three components follows this list.)
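
As a rough sketch, assuming illustrative names that are not from any library, the three components can be pictured like this:

```python
# Rough sketch of the three components; class and attribute names are illustrative only.

class Model:
    """Representation of the world: transition dynamics and reward dynamics."""
    def __init__(self, transition_probs, rewards):
        self.P = transition_probs      # e.g. P[s][s'] = probability of moving s -> s'
        self.R = rewards               # e.g. R[s] = expected reward in state s

class Policy:
    """Mapping from states to actions."""
    def __init__(self, state_to_action):
        self.state_to_action = state_to_action

    def act(self, state):
        return self.state_to_action[state]

# A value function assigns one number per state: an estimate of the discounted
# future reward obtainable from that state under a given policy (values made up).
value_function = {0: 1.5, 1: 2.0, 2: 5.0}
```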

Components of a Model:

We have learned that the model reflects the original environment. So far we have seen that the environment only provides the next state and the reward obtained. So, the model must also contain those pieces of information. The model has two main components:

  1. Transition Dynamics: It provides information about the probabilities with which an agent moves from one state to another.

P(S(t+1)=s’ | S(t)=s, A(t)=a )

The equation gives the probability of reaching state s' at time t+1 after taking action a in state s at time t.

  2. Reward Dynamics: It provides information about the reward an agent receives when it takes an action in a state.

R( S(t)=s, A(t)=a ) = E( r(t) | S(t)=s, A(t)=a )

The reward for taking action "a" in state s is the mathematical expectation of the reward r(t) given that state and action.

The mathematical expectation is the probability-weighted sum of the rewards that can be obtained if we take a particular action "a" in the state "s".

If we check with an example:

[Figure: the same three-state environment used in the Markov chain example]

If we consider this to be our environment,

our transition dynamics look like:

[ [0.1, 0.4, 0.5],
  [0.3, 0.2, 0.5],
  [0.5, 0.3, 0.2] ]

And our reward dynamics look like: [2, 3, 5]
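
Putting the two pieces together, the example model can be written down directly; `sample_next_state` is a hypothetical helper that shows how a model-based agent could query its internal model instead of the real environment.

```python
import random

# The example model above written as plain data plus a sampling helper.
# sample_next_state() is a hypothetical helper, not part of any library.

transition_dynamics = [[0.1, 0.4, 0.5],
                       [0.3, 0.2, 0.5],
                       [0.5, 0.3, 0.2]]    # row s, column s'
reward_dynamics = [2, 3, 5]                # reward associated with states 0, 1, 2

def sample_next_state(state):
    """Draw the next state from the model's transition distribution for `state`."""
    return random.choices([0, 1, 2], weights=transition_dynamics[state])[0]

s = 0
s_next = sample_next_state(s)
print("next state:", s_next, "reward:", reward_dynamics[s_next])
```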

Policies

We have seen that a policy is a mapping from states to actions. It determines which action to take in each state of the environment.

PI: S -> A

PI is the policy, S is the set of states, and A is the set of actions.

Now, let’s consider two cases.

  1. Say the agent takes action 1 in state s1. According to the policy, the action should take the agent to state s2. On executing the action in that state, the agent reaches state s2 on a confirmed basis, i.e., with probability 1. Such a policy is called a Deterministic Policy.

On transition, the agent reaches the desired state with a 100% probability.

[Figure: a deterministic transition — action 1 in state s1 always leads to state s2]

2. Now, in the second case, on taking action 1 in state s1, the agent reaches state s2 with a probability of, say, 0.7. It reaches state s1 with probability 0.1 and state s0 with probability 0.2. So, we reach the next state on a probabilistic basis and not on a confirmed basis. Such a policy is called a Stochastic Policy.

[Figure: a stochastic transition — action 1 in state s1 leads to s2, s0, or s1 with probabilities 0.7, 0.2, and 0.1]

On transition, the agent reaches the desired state with a 70% probability. Such a policy represents the slip concept in reinforcement learning.
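
A minimal sketch of the two cases, with made-up state names and the 0.7 / 0.2 / 0.1 probabilities from the example above:

```python
import random

# Deterministic: taking action 1 in s1 always lands in s2 (probability 1).
# Stochastic (slip): taking action 1 in s1 lands in s2 / s0 / s1 with 0.7 / 0.2 / 0.1.

def transition_deterministic(state, action):
    table = {"s1": {1: "s2"}}
    return table[state][action]

def transition_stochastic(state, action):
    table = {"s1": {1: (["s2", "s0", "s1"], [0.7, 0.2, 0.1])}}
    next_states, probs = table[state][action]
    return random.choices(next_states, weights=probs)[0]

print(transition_deterministic("s1", 1))   # always "s2"
print(transition_stochastic("s1", 1))      # "s2" about 70% of the time
```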

One thing to note is that the rewards obtained depend on the sequence of actions taken in states, i.e., on the state-action mapping or policy. If the policy changes, everything changes.

Value Function

The value function is the expected discounted sum of future rewards under a particular policy. Mathematical expectation is the probabilistic summation.

[Figure: state S1 transitioning to states S2, S3, and S4 with probabilities P1, P2, and P3]

P1, P2, P3 are the probabilities of reaching S2, S3, S4 respectively. R2, R3, R4 are the already calculated expectations of future rewards at S2, S3, and S4.

E(S1) = P1.R2 + P2.R3 + P3.R4

It is a discounted sum of expectations of future rewards.

So, it is given by:

V(PI) = E(PI) [ R(t) + gamma*R(t+1) + gamma^2*R(t+2) + gamma^3*R(t+3) + … | S(t) = s ]

where PI is the policy, gamma is the discount factor, and R(t+1) is the reward at the state reached at time t+1 according to the policy PI.

Now, why the discount factor?

Let's see what happens without a discount factor.

V(PI) = E(PI) [ R(t) + R(t+1) + R(t+2) + R(t+3) + … | S(t) = s ]

If we talk about an infinite process, this value explodes and grows without bound. To prevent this we use a discount factor. The discount factor is a value less than 1, so as we keep increasing its power, the weighted contribution keeps decreasing. After some number of time steps, the rewards at those steps become insignificant. For example, if we consider gamma = 0.9, the rewards at and after roughly (t+10) lose significance at time t. So, we no longer face the infinity problem.

The discount factor also serves as the weight balance between current rewards and future rewards. Elaborately speaking, if the discount factor is 0.9 we focus on roughly the next 10 time steps' rewards; if it is 0.99 we focus on roughly the next 100 time steps' rewards (on the order of 1/(1-gamma)). So, it is key to the tradeoff that controls how much future rewards matter.
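
To see the effect of the discount factor concretely, here is a small sketch that sums a long constant reward stream; the numbers are purely illustrative.

```python
# Sketch: discounted return of a long constant reward stream, showing why gamma < 1
# keeps the sum finite and down-weights far-future rewards.

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0] * 1000                       # a long stream of constant rewards
print(discounted_return(rewards, 0.9))       # ~10, i.e. about 1 / (1 - 0.9)
print(discounted_return(rewards, 0.99))      # ~100, i.e. about 1 / (1 - 0.99)
# With gamma = 1 the same sum would simply grow with the length of the episode.
```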

The value function of a state determines how good a particular state of the environment is for the agent, or how much reward the agent can collect while reaching the goal state by following that particular policy. The higher the value function, the better the state, because the agent will get more reward.

Value functions are also used for comparing policies: the higher the value functions of the states, the better the policy.

Exploration and Exploitation

According to the theory of reinforcement learning, the agent has to be able to reach the goal state from any initial state. To do that, the agent needs to move around and observe the environment, i.e., it must learn how the environment reacts to an action taken in a given state. To accomplish this, the agent takes random actions outside its policy and notes the response of the environment. This is called Exploration.

Now, if the agent keeps on exploring, it will never obtain the optimal way of reaching the goal state. For that, it needs to follow the policy established by its previous observations of the environment. Here the agent is actually using the policy obtained from previously gathered experience or knowledge. This is called Exploitation.

If the agent keeps on exploiting, it will never get to know the entire state space. So, there is a tradeoff between exploitation and exploration, and to achieve optimality we need a proper balance of both.
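
One common way to balance the two, shown here only as an illustrative sketch (the value estimates are made up), is an epsilon-greedy rule: with probability epsilon take a random exploratory action, otherwise exploit the best-known action.

```python
import random

# Sketch of epsilon-greedy action selection; the action-value estimates are made up.

action_value_estimates = {"left": -1.0, "right": 2.5, "up": 0.3}

def epsilon_greedy(estimates, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.choice(list(estimates))          # exploration
    return max(estimates, key=estimates.get)           # exploitation

counts = {a: 0 for a in action_value_estimates}
for _ in range(1000):
    counts[epsilon_greedy(action_value_estimates)] += 1
print(counts)   # "right" dominates, but the other actions still get tried occasionally
```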

Evaluation and Control

Evaluation is the process of determining the value functions of the states of a particular policy.

Control is the optimization of value functions in order to find the optimal policy.
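
As a rough sketch of evaluation, assuming the example transition and reward dynamics from earlier, the state values of that model can be computed iteratively; control would then adjust the policy toward higher-valued choices.

```python
# Rough sketch of evaluation: iteratively computing state values for the example
# model from earlier (transition dynamics P, reward dynamics R) under a discount gamma.

P = [[0.1, 0.4, 0.5],
     [0.3, 0.2, 0.5],
     [0.5, 0.3, 0.2]]
R = [2, 3, 5]
gamma = 0.9

V = [0.0, 0.0, 0.0]
for _ in range(200):   # repeat the backup until the values stop changing noticeably
    V = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(3)) for s in range(3)]

print([round(v, 2) for v in V])   # higher values indicate more favorable states
```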

Conclusion

We have taken a look at the underlying principles of reinforcement learning and the basic associated terminology. Please check the full Reinforced series to get a complete picture.

Thanks for reading!

Source: https://medium.com/nerd-for-tech/understanding-reinforcement-learning-i-8613218441e5
