[DQN] OpenAI Gym - CartPole

This article shows how to use the DQN algorithm to solve the inverted-pendulum (pole) balancing task in OpenAI Gym's CartPole-v0 environment. It walks through the environment's observations, actions, reward mechanism, and its reset and termination conditions, explains how to obtain observations and rewards by interacting with the environment, and introduces the family of environments provided by the gym library. Finally, it gives an overview of the DQN network structure and training procedure.

CartPole-v0

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.

The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over.

A reward of +1 is provided for every timestep that the pole remains upright.

The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

Environment

Observation

Type: Box(4)

Num  Observation            Min        Max
0    Cart Position          -2.4       2.4
1    Cart Velocity          -Inf       Inf
2    Pole Angle             ~ -41.8°   ~ 41.8°
3    Pole Velocity At Tip   -Inf       Inf
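If you want to verify these bounds programmatically, the observation space can be inspected directly. A minimal sketch, assuming the classic gym API used throughout this post:

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)       # Box(4,): position, velocity, angle, tip velocity
print(env.observation_space.high)  # per-dimension upper bounds
print(env.observation_space.low)   # per-dimension lower bounds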

Actions

Type: Discrete(2)

Num  Action
0    Push cart to the left
1    Push cart to the right

Note: the amount by which the cart's velocity is reduced or increased is not fixed; it depends on the angle the pole is pointing, because the pole's center of gravity changes the amount of energy needed to move the cart underneath it.
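The two discrete actions can likewise be confirmed by querying and sampling env.action_space; a short sketch under the same assumptions:

import gym

env = gym.make('CartPole-v0')
print(env.action_space)           # Discrete(2)
print(env.action_space.n)         # number of actions: 2
print(env.action_space.sample())  # a random action: 0 (push left) or 1 (push right)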

Reward

Reward is 1 for every step taken, including the termination step

Starting State

All observations are assigned a uniform random value between ±0.05
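To see this, call env.reset() a few times and check that every component of the returned observation lies in [-0.05, 0.05]. A quick sketch, assuming the classic gym API where reset() returns only the initial observation:

import gym

env = gym.make('CartPole-v0')
for _ in range(3):
    obs = env.reset()  # initial observation, four values drawn uniformly from [-0.05, 0.05]
    print(obs)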

Episode Termination

Pole Angle is more than ±20.9°

Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)

Episode length is greater than 200

Solved Requirements

Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
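That criterion can be checked with a simple evaluation loop that averages the returns of 100 consecutive episodes. The sketch below uses a random policy (which will fall far short of 195) purely to show where a trained agent's action selection would plug in:

import gym
import numpy as np

env = gym.make('CartPole-v0')
returns = []
for episode in range(100):
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()          # placeholder: a trained DQN would act greedily here
        obs, reward, done, info = env.step(action)
        total_reward += reward
    returns.append(total_reward)

average = np.mean(returns)
print('average return over 100 episodes:', average)
print('solved:', average >= 195.0)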

First, a minimal example that simply takes random actions (only one run here, with no episode handling):

import gym

env = gym.make('CartPole-v0')
env.reset()                              # start here
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action

Observations

If we ever want to do better than take random actions at each step, it'd probably be good to actually know what our actions are doing to the environment.

The environment's step function returns exactly what we need. In fact, step returns four values. These are:

observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

done (boolean): whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.
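Putting the four return values together, the random-action example above can be extended to reset the environment whenever done is True. This is essentially the standard loop from the gym documentation, sketched with the same classic API:

import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        action = env.action_space.sample()                   # still random; a learned policy would go here
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
env.close()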
