Problems encountered when using the gym library for reinforcement learning (continuously updated...)

1. env = gym.make('CartPole-v0')

env = gym.make('CartPole-v0')
env.render()

Error: gym.error.ResetNeeded: Cannot call `env.render()` before calling `env.reset()`, if this is a intended action, set `disable_render_order_enforcing=True` on the OrderEnforcer wrapper.

Analysis: this is caused by newer versions of the gym library. The render() function acts as the rendering engine, but in newer versions the render mode is fixed when the environment is created, so a bare call to env.render() 【or env.render(mode='human')】 is simply ignored; in addition, the OrderEnforcing wrapper requires env.reset() to be called before anything is rendered. Either downgrade gym, or change the code above to:

env = gym.make('CartPole-v0', render_mode='human')
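
Putting both requirements together, a minimal working sketch (assuming gym >= 0.26, where reset() also returns an (observation, info) pair):

    import gym

    # The render mode is fixed at creation time in newer gym versions,
    # and reset() must be called before anything is rendered.
    env = gym.make('CartPole-v0', render_mode='human')
    observation, info = env.reset()
    env.render()   # a window opens; frames also refresh automatically on each step()
    env.close()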

2. observation_, reward, done, info = env.step(action)

observation_, reward, done, info = env.step(action)

Error: ValueError: too many values to unpack (expected 4).

Analysis: this is also caused by newer gym versions. env.step() now returns five values rather than four, but only four variables are defined on the left-hand side, hence the error. Change it to:

observation_, reward, terminated, truncated, info = env.step(action)
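
Note that the fourth return value is the new truncated flag, not part of info, so the info dictionary must be the last variable. If downstream code still expects the old single done flag, the two booleans can be merged; a minimal loop sketch under that assumption (the random policy stands in for a real one):

    import gym

    env = gym.make('CartPole-v0')
    observation, info = env.reset(seed=0)

    for _ in range(200):
        action = env.action_space.sample()  # stand-in for a real policy
        observation_, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated      # recover the old-style `done` flag
        if done:
            observation, info = env.reset()

    env.close()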

The step() function, for reference (the grid-world example from the gym documentation, which shows where the five return values come from):

    def step(self, action):
        # Map the action to the direction we walk in
        direction = self._action_to_direction[action]
        # We use np.clip to make sure we don't leave the grid
        self._agent_location = np.clip(self._agent_location + direction, 0, self.size - 1)
        terminated = np.array_equal(self._agent_location, self._target_location)
        reward = 1 if terminated else 0
        observation = self._get_obs()
        info = self._get_info()

        if self.render_mode == "human":
            self._render_frame()

        return observation, reward, terminated, False, info

【1】observation (object): this will be an element of the environment's :attr:`observation_space`. This may, for instance, be a numpy array containing the positions and velocities of certain objects. The environment's state information.
【2】reward (float): The amount of reward returned as a result of taking the action.
【3】terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results.
【4】truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied, typically a time limit; it ends the episode before a terminal state is reached. In the grid-world example above it is simply hard-coded to False.
【5】info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent's performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. It also can contain information that distinguishes truncation and termination, however this is deprecated in favour of returning two booleans, and will be removed in a future version. See the sketch after this list for how the two booleans behave in practice.
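
To see the two flags separately in practice, a small sketch (CartPole-v0 is wrapped in a TimeLimit of 200 steps, so an episode ends either because the pole falls, terminated=True, or because the step limit fires, truncated=True):

    import gym

    env = gym.make('CartPole-v0')  # registered with max_episode_steps=200
    observation, info = env.reset(seed=0)

    while True:
        observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated or truncated:
            # terminated: the pole fell over (an MDP terminal state)
            # truncated:  the TimeLimit wrapper cut the episode off
            print(f"terminated={terminated}, truncated={truncated}")
            break

    env.close()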
