Learning OpenAI Gym

天涯遍地是小草

Categories: Machine Learning, Python. Tags: OpenAI, machine learning, Python, Gym, reinforcement learning

Article link: https://blog.csdn.net/u012692537/article/details/79418566

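I. Introduction to Gym

I have recently been studying reinforcement learning, and the video course I am following uses OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms.

According to the course, OpenAI has since released other tools, and Google and others offer similar ones, but Gym is already good enough for learning.

OpenAI Gym is a toolkit for developing and comparing RL algorithms. It is compatible with numerical computation libraries such as TensorFlow or Theano, and it supports Python. See the official Gym documentation.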

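1. Components of OpenAI Gym

OpenAI Gym consists of two parts:

The gym open-source library
A collection of test problems, each called an environment, that you can use to develop your own reinforcement learning algorithms. The environments share a common interface, which lets you write general algorithms. Examples: Atari, CartPole.
The OpenAI Gym service
A site and API that let users compare the performance of the algorithms they have trained.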

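2. Reinforcement Learning and OpenAI Gym

Reinforcement learning (RL) is the branch of machine learning concerned with making a sequence of decisions. It assumes an agent situated in an environment. At each step, the agent takes an action and then receives an observation and a reward from the environment. An RL algorithm seeks to maximize the agent's total reward through a learning process, usually involving a lot of trial and error, in an environment it initially knows nothing about.

There are two basic concepts in reinforcement learning: the environment, that is, the outside world, and the agent, that is, the algorithm you write. The agent sends an action to the environment, and the environment replies with an observation and a reward.

The core interface of OpenAI Gym is Env, the unified environment interface. Env has the following core methods:
env.reset(self): resets the environment's state and returns an observation
env.step(self, action): advances the environment by one timestep and returns observation, reward, done, info
env.render(self, mode='human', close=False): renders one frame of the environment. The default mode usually does something human-friendly, such as popping up a window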
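
To make the reset/step contract concrete, here is a minimal sketch of a toy environment written in plain Python. CoinGuessEnv and run_episode are hypothetical names invented for this illustration and are not part of gym; the only thing borrowed from the text above is the shape of the interface, where reset() returns an observation and step(action) returns observation, reward, done, info.

```python
import random


class CoinGuessEnv:
    """Toy environment (hypothetical, not part of gym) that mimics reset/step."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1.0 if the action matches the coin that was shown, else 0.0.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps >= self.max_steps
        info = {"steps": self.steps}
        return self.coin, reward, done, info


def run_episode(env, policy):
    """Run one episode with the given policy and return the total reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


# A policy that simply copies the observation always guesses correctly here.
print(run_episode(CoinGuessEnv(), lambda obs: obs))
```

Because run_episode only relies on reset() and step(), the same loop would work with any object that follows this interface, which is the point of having a shared Env interface.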

3. Installing OpenAI Gym

Install with pip

$  pip install gym #minimal install
or
$  pip install gym[all] #full install, fetch gym as a package

Other installation methods:
see the referenced articles

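II. Getting Started with Gym

What follows is largely a simplified rendering of the official documentation.
Some passages are quoted directly from the original.

A first example

The official sample code below runs an instance of the CartPole-v0 environment for 1000 timesteps. With random actions, this system loses stability very quickly.
cart-pole: an inverted pendulum balanced on a cart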

import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action

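To see some other environments in action, try replacing CartPole-v0 above with MountainCar-v0, MsPacman-v0 (requires the Atari dependency), or Hopper-v1 (requires the MuJoCo dependencies). All of these environments inherit from the Env base class.

If any dependency is missing, you get an error message telling you what is missing.

Observations

To do better than taking a random action at each step, it helps to know what our actions are actually doing to the environment.

The environment's step function returns exactly this information. step returns four values: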

observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

In short: observation is what the agent sees of the environment, reward is the payoff from the previous action (not a running total), done signals that the episode has ended and the environment should be reset, and info is diagnostic data that an agent may not learn from in official evaluations.

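This is an implementation of the classic agent-environment loop. At each timestep, the agent chooses an action, and the environment returns an observation and a reward.

Figure: the agent-environment loop

The process starts by calling reset(), which returns an initial observation. So a more proper way of writing the previous code is to respect the done flag: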

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

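When done is True, control has failed and the episode ends. The return of each episode is simply the number of timesteps it lasted, t+1, so the longer the pole stays up, the larger the return. In the code above, the agent picks its actions at random.
Part of the output from one sample run: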

[ 0.04525821 -0.01058438  0.01658315 -0.0113601 ]
[ 0.04504652  0.18429587  0.01635595 -0.29876505]
[ 0.04873244  0.3791809   0.01038065 -0.58624507]
[ 0.05631606  0.57415593 -0.00134425 -0.87564   ]
[ 0.06779918  0.37905228 -0.01885705 -0.58338   ]
[ 0.07538022  0.18419951 -0.03052465 -0.29669646]
[ 0.07906421 -0.0104743  -0.03645858 -0.01379463]
[ 0.07885473  0.18515104 -0.03673447 -0.31775408]
[ 0.08255775 -0.00942898 -0.04308956 -0.03687847]
[ 0.08236917 -0.20390738 -0.04382713  0.24190395]
[ 0.07829102 -0.39837677 -0.03898905  0.52044688]
[ 0.07032349 -0.5929288  -0.02858011  0.80059327]
[ 0.05846491 -0.78764733 -0.01256824  1.08415037]
[ 0.04271196 -0.98260121  0.00911476  1.37286313]
[ 0.02305994 -0.7875944   0.03657203  1.08304477]
[ 0.00730805 -0.59297367  0.05823292  0.80205866]
[-0.00455142 -0.3986966   0.0742741   0.52824783]
[-0.01252535 -0.59478063  0.08483905  0.84337947]
[-0.02442097 -0.7909516   0.10170664  1.16149035]
[-0.04024    -0.59729068  0.12493645  0.90235035]
[-0.05218581 -0.40406247  0.14298346  0.65140302]
[-0.06026706 -0.21119085  0.15601152  0.40694193]
[-0.06449088 -0.40814108  0.16415035  0.74446085]
[-0.0726537  -0.21561904  0.17903957  0.50760351]
[-0.07696608 -0.41275194  0.18919164  0.85093302]
[-0.08522112 -0.2206433   0.2062103   0.62320299]
Episode finished after 26 timesteps
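
Each row printed above is one observation. As a small step beyond random actions, here is a sketch of a hand-written policy that reads the observation. It rests on two assumptions that this article does not state and that you should check against the CartPole documentation: the observation is laid out as [cart position, cart velocity, pole angle, pole angular velocity], and action 0 pushes left while action 1 pushes right. The name lean_policy is made up for this example.

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning to."""
    # Assumed layout: [cart position, cart velocity, pole angle, pole angular velocity]
    pole_angle = observation[2]
    # Assumed action coding: 1 pushes right, 0 pushes left.
    return 1 if pole_angle > 0 else 0


# Two observations copied from the sample output above.
sample_observations = [
    [0.04525821, -0.01058438, 0.01658315, -0.0113601],
    [0.06779918, 0.37905228, -0.01885705, -0.58338],
]
actions = [lean_policy(obs) for obs in sample_observations]
print(actions)  # [1, 0]
```

To try it, replace action = env.action_space.sample() in the loop with action = lean_policy(observation). This is only a heuristic, so it should not be expected to balance the pole indefinitely.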

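Spaces

In the examples above, we have been sampling random actions from the environment's action space. But what exactly are those actions? Every environment comes with an action_space and an observation_space. These attributes are of type Space, and they describe the format of valid actions and observations.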

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

The Discrete space allows a fixed range of non-negative numbers, so in this case valid actions are either 0 or 1. The Box space represents an n-dimensional box, so valid observations will be an array of 4 numbers. We can also check the Box’s bounds:

print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])

This introspection can be helpful to write generic code that works for many different environments. Box and Discrete are the most common Spaces. You can sample from a Space or check that something belongs to it:

from gym import spaces
space = spaces.Discrete(8) # Set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8

For CartPole-v0, one of the actions applies force to the left and the other applies force to the right.
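
To show what sample and contains mean for the two kinds of space, here is a rough pure-Python sketch. ToyDiscrete and ToyBox are stand-ins written for this article, not the real gym.spaces classes, and they omit most of what the real ones do (dtypes, shapes, seeding). The bounds are the CartPole-v0 values printed earlier.

```python
import random


class ToyDiscrete:
    """Stand-in for a discrete space: the integers 0 .. n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Pick one of the n valid values uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Stand-in for a box space: one [low, high] interval per dimension."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low = list(low)
        self.high = list(high)

    def contains(self, x):
        # Valid only if the length matches and every entry is within bounds.
        return len(x) == len(self.low) and all(
            lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high)
        )


inf = float("inf")
action_space = ToyDiscrete(2)
observation_space = ToyBox(
    low=[-2.4, -inf, -0.20943951, -inf],
    high=[2.4, inf, 0.20943951, inf],
)

# A pole angle of 0.017 is inside the bounds, 0.5 is outside.
in_bounds = observation_space.contains([0.045, -0.011, 0.017, -0.011])
out_of_bounds = observation_space.contains([0.0, 0.0, 0.5, 0.0])
print(in_bounds, out_of_bounds)  # True False
```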

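Available Environments

Gym comes with a diverse suite of environments. For a bird's-eye view, see the full list of environments.
Gym's main purpose is to provide a large collection of environments that expose a common interface and are versioned to allow for comparisons. To see which environments your installation provides: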

from gym import envs
print(envs.registry.all())
#> [EnvSpec(DoubleDunk-v0), EnvSpec(InvertedDoublePendulum-v0), EnvSpec(BeamRider-v0), EnvSpec(Phoenix-ram-v0), EnvSpec(Asterix-v0), EnvSpec(TimePilot-v0), EnvSpec(Alien-v0), EnvSpec(Robotank-ram-v0), EnvSpec(CartPole-v0), EnvSpec(Berzerk-v0), EnvSpec(Berzerk-ram-v0), EnvSpec(Gopher-ram-v0), ...
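
The ids in this listing end in a version suffix such as -v0, which is what makes results comparable across versions of the same environment. As an illustration only, the sketch below splits such an id into its name and version. parse_env_id and the regular expression are my own assumptions for this example and are not how gym itself parses ids.

```python
import re

# Assumed id shape: "<name>-v<integer>", as seen in the listing above.
ENV_ID_PATTERN = re.compile(r"^(?P<name>[\w:.-]+?)-v(?P<version>\d+)$")


def parse_env_id(env_id):
    """Split an id like 'CartPole-v0' into ('CartPole', 0)."""
    match = ENV_ID_PATTERN.match(env_id)
    if match is None:
        raise ValueError("not a versioned environment id: {!r}".format(env_id))
    return match.group("name"), int(match.group("version"))


print(parse_env_id("CartPole-v0"))     # ('CartPole', 0)
print(parse_env_id("Phoenix-ram-v0"))  # ('Phoenix-ram', 0)
```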

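Background: Why Gym? (2016)

If you are interested, follow the references to the official documentation and scroll to the bottom. I have left this part untranslated.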

Reinforcement learning (RL) is the subfield of machine learning concerned with decision making and motor control. It studies how an agent can learn how to achieve goals in a complex, uncertain environment. It’s exciting for two reasons:

RL is very general, encompassing all problems that involve making a sequence of decisions.
RL algorithms have started to achieve good results in many difficult environments.

However, RL research is also slowed down by two factors:

The need for better benchmarks.
Lack of standardization of environments used in publications.

Gym is an attempt to fix both problems.