[DQN] OpenAI Gym - CartPole

This article shows how to use the DQN algorithm to solve the inverted-pendulum (pole) balancing task in OpenAI Gym's CartPole-v0 environment. It walks through the environment's observations, actions, reward mechanism, and its reset and termination conditions, explains how to obtain observations and rewards by interacting with the environment, and introduces the family of environments provided by the gym library. Finally, it gives an overview of the DQN network structure and training procedure.

CartPole-v0

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.

The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over.

A reward of +1 is provided for every timestep that the pole remains upright.

The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

Environment

Observation

Type: Box(4)

Num  Observation            Min        Max
0    Cart Position          -2.4       2.4
1    Cart Velocity          -Inf       Inf
2    Pole Angle             ~ -41.8°   ~ 41.8°
3    Pole Velocity At Tip   -Inf       Inf
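If you want to verify these bounds programmatically, the observation space can be inspected directly. A minimal sketch, assuming the classic gym API used throughout this post:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)       # Box(4,): position, velocity, angle, tip velocity
print(env.observation_space.high)  # per-dimension upper bounds
print(env.observation_space.low)   # per-dimension lower bounds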

Actions

Type: Discrete(2)

Num  Action
0    Push cart to the left
1    Push cart to the right

Note: the amount by which the cart's velocity is reduced or increased is not fixed; it depends on the angle the pole is pointing, because the pole's center of gravity changes the amount of energy needed to move the cart underneath it.
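The two discrete actions can likewise be confirmed by querying and sampling env.action_space; a short sketch under the same assumptions:

import gym

env = gym.make('CartPole-v0')
print(env.action_space)           # Discrete(2)
print(env.action_space.n)         # number of actions: 2
print(env.action_space.sample())  # a random action: 0 (push left) or 1 (push right)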

Reward

Reward is 1 for every step taken, including the termination step

Starting State

All observations are assigned a uniform random value between ±0.05
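To see this, call env.reset() a few times and check that every component of the returned observation lies in [-0.05, 0.05]. A quick sketch, assuming the classic gym API where reset() returns only the initial observation:

import gym

env = gym.make('CartPole-v0')
for _ in range(3):
    obs = env.reset()  # initial observation, four values drawn uniformly from [-0.05, 0.05]
    print(obs)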

Episode Termination

Pole Angle is more than ±20.9°

Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)

Episode length is greater than 200

Solved Requirements

Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
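That criterion can be checked with a simple evaluation loop that averages the returns of 100 consecutive episodes. The sketch below uses a random policy (which will fall far short of 195) purely to show where a trained agent's action selection would plug in:

import gym
import numpy as np

env = gym.make('CartPole-v0')
returns = []
for episode in range(100):
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()          # placeholder: a trained DQN would act greedily here
        obs, reward, done, info = env.step(action)
        total_reward += reward
    returns.append(total_reward)

average = np.mean(returns)
print('average return over 100 episodes:', average)
print('solved:', average >= 195.0)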

First, a minimal example that simply takes random actions (only one run here, with no episode handling):

import gym

env = gym.make('CartPole-v0')
env.reset()                              # start here
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action

Observations

If we ever want to do better than take random actions at each step, it'd probably be good to actually know what our actions are doing to the environment.

The environment's step function returns exactly what we need. In fact, step returns four values. These are:

observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

done (boolean): whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.
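Putting the four return values together, the random-action example above can be extended to reset the environment whenever done is True. This is essentially the standard loop from the gym documentation, sketched with the same classic API:

import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        action = env.action_space.sample()                   # still random; a learned policy would go here
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
env.close()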
