by Thomas Simonini
Diving deeper into Reinforcement Learning with Q-Learning
This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.
Today we’ll learn about Q-Learning. Q-Learning is a value-based Reinforcement Learning algorithm.
This article is the second part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus of the course. See the first article here.
In this article you’ll learn:
- What Q-Learning is
- How to implement it with Numpy
The big picture: the Knight and the Princess
Let’s say you’re a knight and you need to save the princess trapped in the castle shown on the map above.
You can move one tile at a time. The enemy can’t move, but if you land on the same tile as the enemy, you will die. Your goal is to reach the castle by the fastest route possible. This can be evaluated using a “points scoring” system.
- You lose 1 point at each step (losing points at each step helps our agent to be fast).
- If you touch an enemy, you lose 100 points, and the episode ends.
- If you reach the castle, you win and get +100 points.
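The scoring rules above can be sketched as a small reward function. The tile coordinates, enemy positions, and castle position below are illustrative assumptions, not taken from the article’s map:

```python
# Scoring rules from the text, as a minimal sketch.
STEP_REWARD = -1      # every step costs a point, so faster routes score higher
ENEMY_REWARD = -100   # touching an enemy ends the episode
CASTLE_REWARD = 100   # reaching the castle wins the episode

def reward(tile, enemy_tiles, castle_tile):
    """Return (reward, episode_done) for stepping onto `tile`."""
    if tile in enemy_tiles:
        return ENEMY_REWARD, True   # died: lose 100 points, episode ends
    if tile == castle_tile:
        return CASTLE_REWARD, True  # won: gain 100 points, episode ends
    return STEP_REWARD, False       # ordinary step: lose 1 point, keep going

print(reward((1, 2), {(1, 2)}, (3, 3)))  # -> (-100, True)
```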
The question is: how do you create an agent that will be able to do that?
Here’s a first strategy. Let’s say our agent tries to go to each tile, and then colors each tile: green for “safe,” and red if not.
Then, we can tell our agent to take only green tiles.
But the problem is that this isn’t really helpful. We don’t know the best tile to take when green tiles are adjacent to each other, so our agent can fall into an infinite loop while trying to find the castle!
Introducing the Q-table
Here’s a second strategy: create a table where we’ll calculate the maximum expected future reward, for each action at each state.
Thanks to that, we’ll know what’s the best action to take for each state.
Each state (tile) allows four possible actions. These are moving left, right, up, or down.
In terms of computation, we can transform this grid into a table.
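The grid-to-table idea can be sketched with Numpy: one row per state (tile), one column per action. The 4x4 grid size here is an illustrative assumption, not the article’s actual map:

```python
import numpy as np

n_states = 16   # e.g. a 4x4 grid of tiles, one state per tile (assumed size)
n_actions = 4   # left, right, up, down

# The Q-table starts at zero: the agent knows nothing about any state yet.
q_table = np.zeros((n_states, n_actions))

# The best action in a state is the column with the highest Q-value
# in that state's row (ties are broken by the lowest action index).
best_action = int(np.argmax(q_table[0]))

print(q_table.shape)  # -> (16, 4)
```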