An introduction to Q-Learning: reinforcement learning

by ADL

This article is the second part of my “Deep reinforcement learning” series. The complete series shall be available both on Medium and in videos on my YouTube channel.

In the first part of the series we learnt the basics of reinforcement learning.

Q-learning is a value-based learning algorithm in reinforcement learning. In this article, we learn about Q-Learning and its details:

  • What is Q-Learning?
  • Mathematics behind Q-Learning
  • Implementation using Python

Q-Learning — a simplistic overview

Let’s say that a robot has to cross a maze and reach the end point. There are mines, and the robot can only move one tile at a time. If the robot steps onto a mine, the robot is dead. The robot has to reach the end point in the shortest time possible.

The scoring/reward system is as below:

  1. The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the goal as fast as possible.
  2. If the robot steps on a mine, the point loss is 100 and the game ends.
  3. If the robot gets power ⚡️, it gains 1 point.
  4. If the robot reaches the end goal, the robot gets 100 points.

Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path without stepping on a mine?

So, how do we solve this?

Introducing the Q-Table

Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future rewards for each action at each state. Basically, this table will guide us to the best action at each state.

There are four possible actions at each non-edge tile. When the robot is in a state, it can move up, down, left, or right.

So, let’s model this environment in our Q-Table.

In the Q-Table, the columns are the actions and the rows are the states.

Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.

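As a purely illustrative sketch (the state labels are placeholders, not names from the article), a five-state, four-action Q-table starts out looking like this:

    State     Up    Down    Left    Right
    Start      0      0       0       0
    Tile 1     0      0       0       0
    Tile 2     0      0       0       0
    Tile 3     0      0       0       0
    End        0      0       0       0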

But the questions are:

  • How do we calculate the values of the Q-table?
  • Are the values available or predefined?

To learn each value of the Q-table, we use the Q-Learning algorithm.

Mathematics: the Q-Learning algorithm

Q-function

The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

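The update rule itself appears as an image in the original post; written out in standard notation, the Q-learning update based on the Bellman equation is:

    New Q(s, a) = Q(s, a) + α * [ R(s, a) + γ * max_a' Q(s', a') - Q(s, a) ]

Here α is the learning rate, γ is the discount factor, R(s, a) is the reward for taking action a in state s, s' is the resulting next state, and max_a' Q(s', a') is the largest Q-value among the actions available from that next state.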

Using the above function, we get the values of Q for the cells in the table.

When we start, all the values in the Q-table are zeros.

There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.

Now, let’s understand how the updating takes place.

Introducing the Q-learning algorithm process

The Q-learning process is made up of a few repeating steps. Let's understand each of these steps in detail.

Step 1: initialize the Q-Table

We will first build a Q-table. There are n columns, where n = the number of actions. There are m rows, where m = the number of states. We will initialise the values to 0.

In our robot example, we have four actions (a=4) and five states (s=5). So we will build a table with four columns and five rows.

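As a minimal sketch of Step 1 in Python (assuming NumPy, which the article does not explicitly name), the table is just a 5 × 4 array of zeros:

    import numpy as np

    n_actions = 4   # up, down, left, right
    n_states = 5    # the five tiles/states in the maze example

    # Rows are states, columns are actions; every entry starts at zero.
    Q = np.zeros((n_states, n_actions))
    print(Q.shape)  # -> (5, 4)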

Steps 2 and 3: choose and perform an action

This combination of steps is repeated for an undefined amount of time: it runs until we stop the training, or until the training loop terminates as defined in the code.

We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier, when the episode initially starts, every Q-value is 0.

So now the concept of the exploration/exploitation trade-off comes into play.

We’ll use something called the epsilon greedy strategy.

In the beginning, the epsilon rate will be higher. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not yet know anything about the environment.

As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.

During the process of exploration, the robot progressively becomes more confident in estimating the Q-values.

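A minimal sketch of epsilon-greedy action selection, assuming the NumPy Q-table from Step 1 and an integer state index (the decay schedule shown is illustrative, not taken from the article):

    import random
    import numpy as np

    def choose_action(Q, state, epsilon):
        """Explore with probability epsilon, otherwise exploit the best known action."""
        if random.random() < epsilon:
            return random.randrange(Q.shape[1])   # explore: random action index
        return int(np.argmax(Q[state]))           # exploit: highest-scoring action

    # Illustrative decay: epsilon shrinks towards a small floor as training proceeds.
    epsilon, min_epsilon, decay = 1.0, 0.01, 0.995
    epsilon = max(min_epsilon, epsilon * decay)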

For the robot example, there are four actions to choose from: up, down, left, and right. We are starting the training now — our robot knows nothing about the environment. So the robot chooses a random action, say right.

We can now update the Q-values for being at the start and moving right using the Bellman equation.

Steps 4 and 5: evaluate

Now we have taken an action and observed an outcome and a reward. We need to update the function Q(s,a).

In the case of the robot game, to reiterate, the scoring/reward structure is:

  • power = +1
  • mine = -100
  • end = +100
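In code, the update from the equation above is a single line. Here is a minimal sketch (the alpha and gamma values are illustrative, and reward would be one of the values listed above):

    import numpy as np

    alpha = 0.1   # learning rate (illustrative value)
    gamma = 0.9   # discount factor (illustrative value)

    def update_q(Q, state, action, reward, next_state):
        """Apply the Bellman update to a single (state, action) entry of the Q-table."""
        best_next = np.max(Q[next_state])
        Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])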

We will repeat this again and again until the learning is stopped. In this way the Q-Table will be updated.

Python implementation of Q-Learning

The concept and code implementation are explained in my video.

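For readers who prefer text, here is a compact sketch of how the pieces above fit together into a training loop. The env object and its reset() / step(action) -> (next_state, reward, done) interface are assumptions made for illustration, not the article's actual code:

    import random
    import numpy as np

    def train(env, n_states, n_actions, episodes=1000,
              alpha=0.1, gamma=0.9, epsilon=1.0, min_epsilon=0.01, decay=0.995):
        """Tabular Q-learning: explore with epsilon-greedy, update with the Bellman rule."""
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            state = env.reset()                          # assumed: returns an integer state index
            done = False
            while not done:
                # Steps 2 and 3: choose and perform an action.
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done = env.step(action)   # assumed interface
                # Steps 4 and 5: evaluate and update Q(s, a).
                best_next = np.max(Q[next_state])
                Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
                state = next_state
            # Gradually shift from exploration to exploitation.
            epsilon = max(min_epsilon, epsilon * decay)
        return Q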

Subscribe to my YouTube channel for more AI videos: ADL.

At last…let us recap

  • Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
  • Our goal is to maximize the value function Q.
  • The Q table helps us to find the best action for each state.
  • It helps to maximize the expected reward by selecting the best of all possible actions.
  • Q(state, action) returns the expected future reward of that action at that state.
  • This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the Bellman equation.
  • Initially we explore the environment and update the Q-Table. When the Q-Table is ready, the agent will start to exploit the environment and start taking better actions.

Next time we’ll work on a deep Q-learning example.

Until then, enjoy AI.

Important: As stated earlier, this article is the second part of my “Deep Reinforcement Learning” series. The complete series shall be available both in articles on Medium and in videos on my YouTube channel.

If you liked my article, please click the clap icon to help me stay motivated to write articles. Please follow me on Medium and other social media:

If you have any questions, please let me know in a comment below or on Twitter.

Subscribe to my YouTube channel for more tech videos.

Translated from: https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
