2021-08-16 Reinforcement Learning Environments - Gym - Gym Wrappers

Gym Wrappers

Wrappers are an extremely powerful feature made available to us courtesy of OpenAI's Gym.

Purpose

  1. Add functionality to environments, such as modifying the observations and rewards fed to our agent.
  2. Preprocess observations to make them easier to learn from, e.g., image inputs.

Usage

  1. The gym.Wrapper class inherits from the gym.Env class.
  2. Implementing the gym.Wrapper class requires defining an __init__ method that accepts the environment to be extended as a parameter.
  3. Example:
import gym

class BasicWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.env = env

    def step(self, action):
        next_state, reward, done, info = self.env.step(action)
        # modify next_state, reward, done, info here as needed ...
        return next_state, reward, done, info

env = BasicWrapper(gym.make("CartPole-v0"))

With this structure, you can modify the observations, actions, and rewards produced by the underlying environment, as in the sketch below.
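
As a concrete illustration, here is a minimal sketch that subtracts a small constant from every reward inside step(). The name TimePenaltyWrapper and the penalty value are assumptions for this example, not from the original.

import gym

class TimePenaltyWrapper(gym.Wrapper):
    # hypothetical example: subtract a small time penalty from every reward
    def __init__(self, env, penalty=0.01):
        super().__init__(env)
        self.penalty = penalty

    def step(self, action):
        next_state, reward, done, info = self.env.step(action)
        # modify the reward before returning it to the agent
        return next_state, reward - self.penalty, done, info

env = TimePenaltyWrapper(gym.make("CartPole-v0"))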

Details

  1. gym.ObservationWrapper: Used to modify the observations returned by the environment.
class ObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def observation(self, obs):
        # modify obs
        return obs
  2. gym.RewardWrapper: Used to modify the rewards returned by the environment (see the reward-clipping sketch after this list).
class RewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def reward(self, rew):
        # modify rew
        return rew
  3. gym.ActionWrapper: Used to modify the actions passed to the environment.
class ActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def action(self, act):
        # modify act
        return act
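
A common concrete use of gym.RewardWrapper is reward clipping, as in many Atari preprocessing pipelines. Below is a minimal sketch under that assumption; the name ClipRewardWrapper is mine, not from the original.

import gym
import numpy as np

class ClipRewardWrapper(gym.RewardWrapper):
    # hypothetical example: clip every reward to the range [-1, 1]
    def reward(self, rew):
        return float(np.clip(rew, -1.0, 1.0))

env = ClipRewardWrapper(gym.make("CartPole-v0"))
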
Usage examples:
  1. ObservationWrapper:
    Example: wrap an environment whose observation space is gym.spaces.Discrete to produce a one-hot encoding of the discrete states, for use in, for example, neural networks.
import gym
import numpy as np

class DiscreteToBoxWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete), \
            "Should only be used to wrap Discrete envs."
        self.n = self.observation_space.n
        self.observation_space = gym.spaces.Box(0, 1, (self.n,))

    def observation(self, obs):
        # one-hot encode the discrete state index
        new_obs = np.zeros(self.n)
        new_obs[obs] = 1
        return new_obs

env = DiscreteToBoxWrapper(gym.make("FrozenLake-v0"))
T = 10
s_t = env.reset()
for t in range(T):
    a_t = env.action_space.sample()
    s_t, r_t, done, info = env.step(a_t)
    print(s_t)
    if done:
        s_t = env.reset()

Output:

[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  2. ActionWrapper:
import gym
from typing import TypeVar
import random

Action = TypeVar('Action')
class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env, epsilon=0.1):
        super(RandomActionWrapper, self).__init__(env)
        self.epsilon = epsilon

    def action(self, action):
        if random.random() < self.epsilon:
            print("Random!")
            return self.env.action_space.sample()
        return action

if __name__ == "__main__":
    env = RandomActionWrapper(gym.make("CartPole-v0"))
    obs = env.reset()
    total_reward = 0.0
    while True:
        obs, reward, done, _ = env.step(0)
        total_reward += reward
        if done:
            break

    print("Reward got: %.2f" % total_reward)
# Output
rl_book_samples/ch02$ python 03_random_actionwrapper.py
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Random!
Random!
Random!
Random!
Reward got: 12.00
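
The two usage examples above cover discrete observations and random actions. For the image-input preprocessing mentioned at the beginning, an ObservationWrapper can be used in the same way. Below is a minimal sketch, assuming the wrapped environment returns HxWx3 RGB arrays; GrayScaleWrapper is a hypothetical name.

import gym
import numpy as np

class GrayScaleWrapper(gym.ObservationWrapper):
    # hypothetical example: convert RGB image observations to grayscale
    def __init__(self, env):
        super().__init__(env)
        h, w, _ = env.observation_space.shape
        self.observation_space = gym.spaces.Box(0, 255, (h, w), dtype=np.uint8)

    def observation(self, obs):
        # average over the color channels to get a single grayscale channel
        return obs.mean(axis=-1).astype(np.uint8)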

Beyond the Wrapper Class

We can add functionality to the environment, such as providing auxiliary observation functions that allow multiple preprocessing streams to occur. In other words, we can build classes with richer functionality.

Problem:

In more complex applications of deep reinforcement learning, evaluating the policy can take significantly longer than stepping the environment. This means that the majority of computational time is spent choosing actions, which makes data collection slow.

Since deep reinforcement learning is extremely data intensive (often requiring millions of timesteps of experience to achieve good performance), we should prioritize rapidly acquiring data.

To collect data faster, replicate the environment n times to form a vector of environments:

import gym
import numpy as np

class VectorizedEnvWrapper(gym.Wrapper):
    def __init__(self, make_env, num_envs=1):
        super().__init__(make_env())
        self.num_envs = num_envs
        # keep num_envs independent copies of the environment
        self.envs = [make_env() for env_index in range(num_envs)]
    
    def reset(self):
        return np.asarray([env.reset() for env in self.envs])
    
    def reset_at(self, env_index):
        return self.envs[env_index].reset()
    
    def step(self, actions):
        next_states, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            next_state, reward, done, info = env.step(action)
            next_states.append(next_state)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return np.asarray(next_states), np.asarray(rewards), \
            np.asarray(dones), np.asarray(infos)

num_envs = 128
env = VectorizedEnvWrapper(lambda: gym.make("CartPole-v0"), num_envs=num_envs)
T = 10
observations = env.reset()
for t in range(T):
    actions = np.random.randint(env.action_space.n, size=num_envs)
    observations, rewards, dones, infos = env.step(actions)  
    for i in range(len(dones)):
        if dones[i]:
            observations[i] = env.reset_at(i)
print(observations.shape)
print(rewards.shape)
print(dones.shape)
# Output
(128, 4)
(128,)
(128,)

Figure 1: The hierarchy of Wrapper classes in Gym

References:

https://hub.packtpub.com/openai-gym-environments-wrappers-and-monitors-tutorial/

https://alexandervandekleut.github.io/gym-wrappers/

Monitor the agent

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    env = gym.wrappers.Monitor(env, "recording")
  • The second argument is the name of the directory that Monitor writes its results to. The directory must not already exist, otherwise an exception is raised: either delete the existing directory or pass force=True to Monitor (see the sketch after this list).

  • The Monitor class requires FFmpeg to be present on the system; it is used to convert the observations into a video file.

  • Three requirements (any one of them suffices):

    • The code should be run in an X11 session with the OpenGL extension (GLX)
    • The code should be started in an Xvfb virtual display
    • You can use X11 forwarding in an ssh connection
  • Video recording:
    The video is produced by taking screenshots of the rendered frames. Some environments obtain their frames via OpenGL, which can be a problem on a cloud server with no screen or graphical interface. One solution is a virtual graphical display called Xvfb (install the xvfb package), which opens a virtual display on the server.
    Another solution is to use an X11 server, or to use X11 forwarding over ssh so that the cloud server's display is shown locally.
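
Following the note about the output directory above, a minimal sketch of passing force=True so that an existing "recording" directory is reused (the directory name is just an example):

import gym

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    # force=True lets Monitor reuse the "recording" directory if it already exists
    env = gym.wrappers.Monitor(env, "recording", force=True)
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(env.action_space.sample())
    env.close()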

Reference:
https://hub.packtpub.com/openai-gym-environments-wrappers-and-monitors-tutorial/
