How to Structure a Reinforcement Learning Project (Part 2)

Hello, hello! Welcome back to the second part of my series on how to structure RL projects!

  1. Start the Journey: Frame your Problem as an RL Problem

  2. Choose your Weapons: All the Tools You Need to Build a Working RL Environment (We are Here!)

  3. Face the Beast: Pick your RL (or Deep RL) Algorithm

  4. Tame the Beast: Test the Performance of the Algorithm

  5. Set it Free: Prepare your Project for Deployment/Publishing

In this post, we discuss the second part of this series:


Choose your Weapons: All the Tools You Need to Build a Working RL Environment


(Note: This post is a follow-up to the first part; in case you missed it, please take a look at it here.)

In the previous article, we learned how to frame our problem into a Reinforcement Learning problem. Today we are moving to the next step: building the needed infrastructure for running Reinforcement Learning algorithms.


1. Investigate the Environment Structure

Before implementing any Reinforcement Learning algorithm, we need to familiarize ourselves with the environment that we are using. For the sake of clarity, we will refer to our environment as the simulation, as in most cases the agent is added to a simulated environment that mimics the real environment where the agent will be deployed.


Figure 2: MuJoCo is a famous robotics simulator used extensively in RL (source: MuJoCo)

First, let’s investigate the nature of the simulator we are using:


Is it an Event-Driven Simulator?

There are various types of simulators that can be adapted for RL, and one of the most famous simulation paradigms is the Event-driven simulation.


In this paradigm, the simulation does not run over a continuous interval of time, but rather over a sequence of “events”. The simulator schedules in advance the events that might be happening at specific instants, and between any two time instants where no events are happening, the simulator just “glosses over” this interval and “moves on” to the next event.


Because events are scheduled beforehand in such a simulator, the results might have been already predicted and it becomes hard (but not impossible) for an agent to intervene in changing the outcomes of a simulation.

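To make this concrete, below is a minimal, hypothetical sketch of an event-driven loop in Python; the event names and timestamps are invented purely for illustration. The simulator keeps a priority queue of scheduled events and jumps directly from one instant to the next, skipping the idle time in between.

import heapq

# Hypothetical event-driven simulation loop: time jumps between scheduled events.
event_queue = []  # priority queue of (time, event_name) pairs
heapq.heappush(event_queue, (2.0, "request_arrives"))
heapq.heappush(event_queue, (5.5, "machine_fails"))
heapq.heappush(event_queue, (9.0, "simulation_ends"))

current_time = 0.0
while event_queue:
    event_time, event_name = heapq.heappop(event_queue)
    current_time = event_time  # "gloss over" the empty interval and move on
    print(f"t={current_time}: handling {event_name}")
    # Handling an event may schedule new events further in the future here.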

What Type of Simulator is Employed?

Another important aspect is the type of simulator, since it would affect the type of connection and interface library to be employed later on.


Let us take the example of a Unity game simulator or an Atari game emulator. In this case, the agent has to participate in a specific game continuously over a period of time, performing specific game actions based on specific observations and receiving a reward signal accordingly.

Figure 3: Stella, an emulator of the Atari 2600 gaming console (source: Stella)

But the problem is: how will the agent, which is itself a running program, connect to the running emulator at the same time? What is the best way to synchronize the two programs?

  • One solution is to use HTTP POST and GET requests to be able to connect to the emulator over an HTTP connection in a consistent and guaranteed manner.


  • In other cases, the simulator might be a running code with which the agent will interact. In such cases, a client-server socket program might be easier to code and control later on.


Figure 4: Another example of a running program is the maze game; the agent has to communicate with the maze environment for every new action (source: UnityList)

We will explain the different types of connections in section 4.


2. Choose the Environment Interface

Next, it is time to learn how to correctly define the interface that connects the environment to the RL algorithm. Many Python libraries, such as OpenAI Baselines, Tensorforce, and tf-agents, were built specifically to simplify training RL agents. For a helpful comparison between the existing libraries, please refer to this insightful article.

After choosing your library, you soon realize that almost all of the libraries rely on specific formats for defining the environment interface, such as the OpenAI Gym format and the PyEnvironment format.

We will discuss the OpenAI Gym format, as it is one of the most famous and widely used formats.

OpenAI Gym is "a toolkit for developing and comparing reinforcement learning algorithms" developed by OpenAI. It houses a variety of built-in environments that you can use directly, such as CartPole, PacMan, etc.

Figure 5: OpenAI Gym is a toolkit for developing and training RL algorithms (source: Towards Data Science)

And most importantly, it allows you to create your custom environment interface using the gym format:


import gym
from gym import error, spaces, utils
from gym.utils import seeding

class Basic(gym.Env):
    def __init__(self):
        pass

    def step(self, action):
        pass

    def reset(self):
        pass

    def render(self, mode='human'):
        pass

    def close(self):
        pass

Let’s dissect each part:


def __init__(self): pass

This is the environment's constructor, where we initialize the environment variables and define the action space and the observation space.

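As a hedged illustration, assuming a toy problem with four discrete actions and a three-dimensional continuous observation (the sizes and bounds are placeholders, not part of any real environment), the constructor could look like this:

import numpy as np
import gym
from gym import spaces

class Basic(gym.Env):
    def __init__(self):
        # Four discrete actions, e.g. up/down/left/right (illustrative only)
        self.action_space = spaces.Discrete(4)
        # A 3-dimensional continuous observation vector with placeholder bounds
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        # Any other environment variables can be initialized here
        self.state = np.zeros(3, dtype=np.float32)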

def step(self, action): pass

We use this method to allow the agent to take an action inside the environment; the method then returns the new observation of the environment along with a reward value (in the Gym format, step typically also returns a done flag and an info dictionary).

def reset(self): pass

This method resets the environment to its initial state, and returns the initial observation of the environment. It is recommended to use this method at the beginning of each simulation to have access to the initial observation of the environment.


def render(self, mode='human'): pass

The render method is usually used to provide a visual presentation of the environment.


def close(self): pass

The close method closes the connection with the environment, resets the simulator and stops the agent from interacting with this instance of the simulator.

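Putting the pieces together, here is a hedged sketch of a complete, if trivial, custom environment in the Gym format: a toy task in which the agent moves a counter toward a target value. Every detail (the spaces, the reward, the episode length) is invented purely for illustration.

import numpy as np
import gym
from gym import spaces

class CounterEnv(gym.Env):
    """Toy environment: move a counter toward the target value 5."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)   # 0 = decrement, 1 = increment
        self.observation_space = spaces.Box(low=-20.0, high=20.0, shape=(1,), dtype=np.float32)
        self.counter = 0.0
        self.steps = 0

    def step(self, action):
        self.counter += 1.0 if action == 1 else -1.0
        self.steps += 1
        reward = -abs(5.0 - self.counter)                 # closer to the target = higher reward
        done = self.counter == 5.0 or self.steps >= 20    # stop at the target or after 20 steps
        return np.array([self.counter], dtype=np.float32), reward, done, {}

    def reset(self):
        self.counter = 0.0
        self.steps = 0
        return np.array([self.counter], dtype=np.float32)

    def render(self, mode='human'):
        print(f"counter = {self.counter}")

    def close(self):
        pass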

3. Understand the Sequence of Events

Before training any RL agent, we need to set up the sequence of actions and observations that occur when the agent is interacting with the environment.


This kind of design decision might look trivial at first, but it is a delicate, yet essential, part of the design. Usually, a correct sequence of events should follow the diagram below:

Figure 6: Sequence of interactions (source: Anis In Dataland)

Confirming that your design fits neatly with this paradigm guarantees that the interactions between the agent and environment are correct and synchronized.

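In code, this sequence maps onto the familiar reset/step loop of the Gym format. The sketch below assumes the classic Gym API (reset returns the initial observation; step returns the observation, reward, done flag, and info dictionary) and uses the built-in CartPole-v1 environment with a random placeholder agent, purely for illustration:

import gym

env = gym.make("CartPole-v1")   # any Gym-style environment works here
observation = env.reset()       # 1. reset the environment and get the initial observation

done = False
while not done:
    action = env.action_space.sample()                   # 2. the agent chooses an action (random placeholder)
    observation, reward, done, info = env.step(action)   # 3. the environment returns the new observation and reward
    # 4. the agent would update itself from (observation, reward) and the loop repeats

env.close()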

4. Implement the Pipeline Connecting the Environment and the Agent

Figure 7: Photo on Unsplash

Our last step is to figure out the best way to build the pipeline connecting the environment and the RL agent. Below I will list some of the tools I came across while developing the pipeline for my own environment. These tools can be applied to almost any other environment:

1. Client-Server Socket Programming

Figure 8: A socket programming illustration by Real Python

The agent is the program that interacts with the environment across multiple simulations over time, so it is natural to consider the agent as the server. The server spawns connections with multiple clients, which are the simulations of the environment.

Such a structure is the right fit whenever connection speed is important; protocols such as TCP and UDP can provide a fast connection between the client and the server. Although we gain in terms of connection speed, we might need a more complicated implementation, since we are in charge of synchronizing the transport of data between the client and the server.

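Below is a minimal sketch of this idea, assuming the agent acts as a TCP server on localhost and the simulation connects to it as a client; the port number and the JSON message format are illustrative choices, not a fixed protocol:

import json
import socket

# Agent side: a TCP server that receives observations and replies with actions.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 9999))   # illustrative address and port
server.listen(1)
conn, _ = server.accept()          # the simulation connects as a client

while True:
    data = conn.recv(4096)
    if not data:
        break
    message = json.loads(data.decode())   # e.g. {"observation": [...], "reward": 0.0, "done": false}
    if message.get("done"):
        break
    action = 0                            # placeholder: the agent's policy would pick an action here
    conn.sendall(json.dumps({"action": action}).encode())

conn.close()
server.close()

The simulation side would mirror this with a client socket that connects to the same address, sends each observation, and waits for the action before advancing the simulation.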

2. HTTP Requests

Figure 9: An HTTP requests illustration by Real Python

A more sophisticated structure would be to establish an HTTP connection between the client and the server. Although using HTTP is easier than other protocols, we lose in terms of connection speed due to the additional layer that HTTP adds.

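As a hedged sketch, suppose the simulator exposes hypothetical /reset and /step endpoints over HTTP (the URL and JSON fields below are invented for illustration); the agent side could then be written with the requests library:

import requests

BASE_URL = "http://localhost:5000"   # hypothetical address of the simulator's HTTP server

# Ask the simulator to reset and return the initial observation
observation = requests.post(f"{BASE_URL}/reset").json()["observation"]

done = False
while not done:
    action = 0   # placeholder for the agent's chosen action
    response = requests.post(f"{BASE_URL}/step", json={"action": action})
    payload = response.json()   # e.g. {"observation": [...], "reward": 1.0, "done": false}
    observation, reward, done = payload["observation"], payload["reward"], payload["done"]

On the simulator side, a small HTTP server (for example, one written with Flask or a similar framework) would wrap the environment and translate these requests into calls to its reset and step logic.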

If you are looking for a detailed guide for implementing the previous techniques in Python, I highly recommend this tutorial for socket programming and this one for HTTP requests by Real Python.


3. Input/Output from/to External Files/Console

Another approach would be as follows:


  1. The agent reads the data output by the environment through an external file or the console

  2. The environment receives the next action to perform as input from the console or an external file

  3. The environment sends the observation and the reward to an external file or the console

  4. We repeat this cycle until the simulation ends or the maximum number of steps is reached

However, we need to take care of the type of file and the type of I/O operations used; otherwise, the connection speed might be affected.

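Here is a minimal sketch of the agent side of such a loop, assuming the environment writes each observation to an agreed-upon file and reads the next action from another; the file names and the JSON format are illustrative only:

import json
import time
from pathlib import Path

obs_file = Path("observation.json")   # written by the environment (illustrative name)
action_file = Path("action.json")     # read by the environment (illustrative name)

for step in range(1000):              # cap on the number of steps
    while not obs_file.exists():      # wait until the environment has produced an observation
        time.sleep(0.01)
    payload = json.loads(obs_file.read_text())   # e.g. {"observation": [...], "reward": 0.0, "done": false}
    obs_file.unlink()                             # consume the file so it is not read twice
    if payload["done"]:
        break
    action = 0                                    # placeholder for the agent's chosen action
    action_file.write_text(json.dumps({"action": action}))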

5. Test the Environment

I cannot emphasize enough how important it is to test, test, and test the environment, the interface, and the connection pipeline. We need to be confident that our implementation is working properly; otherwise, our RL agent might not learn correctly or, even worse, might learn to exploit the errors or loopholes found in the environment!

Figure 10: In OpenAI's famous Hide and Seek RL experiment, the seekers learned to exploit the simulator physics to "surf" over the boxes and catch the hiders! (source: OpenAI)

Below are some procedures I found helpful to gauge the functioning of the environment:


  • Using a random agent: Simply put, using an agent whose actions are randomly generated can be helpful for checking the performance of the agent in the environment. Since all actions have equal probabilities, we can make sure that many corner cases are being tested and that the environment runs successfully.

  • Recording the number of steps: Running a random agent can give us an estimate of the number of steps an average agent might take inside the simulation. It can also help us estimate the duration of each simulation, and therefore the duration of the training phase (a short sketch combining both checks follows this list).

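Both checks can be combined in a short test script. The sketch below assumes a Gym-style environment (the built-in CartPole-v1 stands in for your custom one) and runs a random agent for a few episodes while recording episode lengths and durations:

import time
import gym

env = gym.make("CartPole-v1")   # stand-in for your custom environment
episode_lengths = []

for episode in range(10):
    observation = env.reset()
    done, steps, start = False, 0, time.time()
    while not done:
        action = env.action_space.sample()                 # random agent: uniform over all actions
        observation, reward, done, info = env.step(action)
        steps += 1
    episode_lengths.append(steps)
    print(f"episode {episode}: {steps} steps in {time.time() - start:.3f} s")

print("average steps per episode:", sum(episode_lengths) / len(episode_lengths))
env.close()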

Conclusion

In this article, we learned about the tools we need to build an RL environment and a communication pipeline connecting the environment and the agent.


With that, we reach the end of the second part of this series!

Wow! Look how far we have come!

In the next part, "Face the Beast: Pick your RL (or Deep RL) Algorithm", I will discuss how to choose the right algorithm for training a successful RL agent, along with all the tools you might need!

Buckle up and stay tuned!



Originally published at https://anisdismail.com on August 24, 2020.


Translated from: https://medium.com/analytics-vidhya/how-to-structure-a-reinforcement-learning-project-part-2-533a168e11c2
