Reinforcement Learning with a Custom Gym Environment

This article shows how to create a custom gym environment for a path-planning problem. The environment is a 5x5 grid, and the goal is to move from the start (0,0) to the goal (4,4) using four actions: up, down, left, and right. Each step yields a reward of -1, reaching the goal yields 10, and an episode terminates after more than 200 steps. The environment is then trained with the PPO algorithm from stable-baselines3 and the model is saved. In testing, the trained model produces an effective path-selection policy.

This post implements a simple reinforcement learning demo built on a custom gym environment, following the blog post "使用gym创建一个自定义环境" (Creating a custom environment with gym).

1. Dependency versions

gym == 0.21.0
stable-baselines3 == 1.6.2
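Assuming a pip-based setup, these pinned versions can be installed with:

pip install gym==0.21.0 stable-baselines3==1.6.2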

2. Scenario description

Start: (0, 0)
Goal: (4, 4)
Action space: {0: up, 1: down, 2: left, 3: right}
State space: the agent's grid coordinate
Objective: reach the goal from the start along the shortest path
Reward: 10 for reaching the goal, -1 for every other step
Termination: 1. the goal is reached; 2. more than 200 steps have been taken
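With this reward setting, the shortest path from (0, 0) to (4, 4) takes 8 moves (4 down and 4 right), so the best achievable episode return is 7 × (-1) + 10 = 3.0. This is the return a well-trained agent should reach, and it matches the test results in Section 4.3.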

3. Building the gym environment

"""
@Author: Fhz
@Create Date: 2023/4/6 22:08
@File: gym_test.py
@Description: 
@Modify Person Date: 
"""
import gym
from gym import Env
from gym import spaces
import numpy as np
from copy import deepcopy


class PathPlanning(Env):
    def __init__(self):
        self.rows = 5
        self.cols = 5
        self.start = [0, 0]
        self.goal = [4, 4]
        self.count = 0
        self.current_state = None
        self.action_space = spaces.Discrete(4)

        self.observation_space = spaces.Box(low=np.array([0, 0]), high=np.array([4, 4]), dtype=np.float32)

    def step(self, action):
        self.count = self.count + 1
        new_state = deepcopy(self.current_state)
        if action == 0:    # up: move to a smaller row index
            new_state[0] = max(new_state[0] - 1, 0)
        elif action == 1:  # down: move to a larger row index
            new_state[0] = min(new_state[0] + 1, self.rows - 1)
        elif action == 2:  # left: move to a smaller column index
            new_state[1] = max(new_state[1] - 1, 0)
        elif action == 3:  # right: move to a larger column index
            new_state[1] = min(new_state[1] + 1, self.cols - 1)
        else:
            raise Exception("Invalid action")
        self.current_state = new_state

        if self.current_state[1] == self.goal[1] and self.current_state[0] == self.goal[0]:
            done = True
            reward = 10.0
        else:
            done = False
            reward = -1
        if self.count > 200:
            done = True

        info = {}
        return self.current_state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.count = 0
        self.current_state = deepcopy(self.start)  # copy so the start coordinates are never mutated
        return self.current_state
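Before handing the environment to a learning algorithm, it can be sanity-checked with a random rollout; a minimal sketch:

if __name__ == '__main__':
    env = PathPlanning()
    obs = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = env.action_space.sample()  # pick a random action
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print("Random policy return: {}".format(total_reward))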

4. Reinforcement learning with stable-baselines3

4.1 Hyperparameters

Policy network: MlpPolicy with two hidden layers of 64 units
Learning rate: 5e-4
Batch size: 32
Discount factor (gamma): 0.8
Training timesteps: 5e4
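Before training, stable-baselines3 also ships an environment checker that flags common problems in custom environments (wrong return types, observation/space mismatches); a minimal sketch using the class from Section 3:

from stable_baselines3.common.env_checker import check_env

env = PathPlanning()
check_env(env, warn=True)  # warns or raises if the env does not follow the gym API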

"""
@Author: Fhz
@Create Date: 2023/4/6 22:08
@File: gym_test.py
@Description: 
@Modify Person Date: 
"""
import gym
from gym import Env
from gym import spaces
import numpy as np
from copy import deepcopy
from stable_baselines3 import PPO


class PathPlanning(Env):
    def __init__(self):
        self.rows = 5
        self.cols = 5
        self.start = [0, 0]
        self.goal = [4, 4]
        self.count = 0
        self.current_state = None
        self.action_space = spaces.Discrete(4)

        self.observation_space = spaces.Box(low=np.array([0, 0]), high=np.array([4, 4]))

    def step(self, action):
        self.count = self.count + 1
        new_state = deepcopy(self.current_state)
        if action == 0:  # up
            new_state[0] = max(new_state[0] - 1, 0)
        elif action == 1:  # down
            new_state[0] = min(new_state[0] + 1, self.cols - 1)
        elif action == 2:  # left
            new_state[1] = max(new_state[1] - 1, 0)
        elif action == 3:  # right
            new_state[1] = min(new_state[1] + 1, self.rows - 1)
        else:
            raise Exception("Invalid action")
        self.current_state = new_state

        if self.current_state[1] == self.goal[1] and self.current_state[0] == self.goal[0]:
            done = True
            reward = 10.0
        else:
            done = False
            reward = -1
        if self.count > 200:
            done = True

        info = {}
        return self.current_state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.count = 0
        self.current_state = self.start
        return self.current_state


if __name__ == '__main__':
    env = PathPlanning()

    model = PPO('MlpPolicy', env,
                policy_kwargs=dict(net_arch=[64, 64]),
                learning_rate=5e-4,
                batch_size=32,
                gamma=0.8,
                verbose=1,
                tensorboard_log="PPO_define/")

    model.learn(int(5e4))
    model.save("PPO_define/PPOmodel")

4.2 Training results

(Training curve screenshots omitted.)

4.3 Testing the trained model

if __name__ == '__main__':
    env = PathPlanning()

    ACTIONS_ALL = {
        0: 'Up',
        1: 'Down',
        2: 'Left',
        3: 'Right',
    }

    # load the trained model
    model = PPO.load("PPO_define/PPOmodel", env=env)

    episodes = 10

    for ep in range(episodes):
        obs = env.reset()
        done = False
        rewards = 0
        while not done:
            action, _state = model.predict(obs, deterministic=True)
            action = action.item()
            print("The action is: {}".format(ACTIONS_ALL[action]))
            obs, reward, done, info = env.step(action)
            env.render()
            rewards += reward
        print(rewards)
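The env.render() call in the loop above does nothing because render is left as a no-op in the environment; for quick visual debugging it could be replaced with a simple text-grid printout, e.g. this sketch:

    def render(self):
        # print the grid: 'A' marks the agent, 'G' the goal, '.' empty cells
        for r in range(self.rows):
            line = ""
            for c in range(self.cols):
                if [r, c] == self.current_state:
                    line += "A "
                elif [r, c] == self.goal:
                    line += "G "
                else:
                    line += ". "
            print(line)
        print()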

Output (the trained policy produces the same shortest-path trajectory and a return of 3.0 in all 10 test episodes; one episode is shown):

The action is: Down
The action is: Down
The action is: Right
The action is: Right
The action is: Down
The action is: Right
The action is: Down
The action is: Right
3.0

The agent reaches the goal in the minimum of 8 moves, collecting the optimal return derived in Section 2.
