TF2.x Reinforcement Learning Cookbook (P0-P13): building the environment and reward mechanism for training RL agents, and implementing NN-based RL policies for discrete action spaces and discrete decision problems

Translator's Preface

Understand and apply the TensorFlow 2.x framework and its related tool libraries.

Gain practical knowledge and skills for building, training, and deploying reinforcement learning systems with TensorFlow 2.x.

  1. The book first introduces how to build reinforcement learning environments.
  2. It then explains model-based and model-free reinforcement learning algorithms, goes deeper into more advanced RL algorithms, and shows how to apply them to train agents.
  3. Finally, it covers how to deploy agents to the cloud, how to accelerate agent development with distributed training, and how to build cross-platform applications for the web, mobile devices, and other platforms.

Preface

Chapter 1, Developing the building blocks of deep reinforcement learning using TensorFlow 2.x

Chapter 2, Implementing value-based, policy-based, and actor-critic deep reinforcement learning algorithms

Chapter 3, Implementing advanced RL algorithms (Deep Q-Networks, Double and Dueling DQN, Deep Recurrent Q-Networks, Asynchronous Advantage Actor-Critic, Proximal Policy Optimization, Deep Deterministic Policy Gradient)

Chapter 4, Reinforcement learning in the real world: building a cryptocurrency trading agent

Chapter 5, Reinforcement learning in the real world: building a stock/share trading agent

Chapter 6, Reinforcement learning in the real world: building an agent to complete your to-do list

Chapter 7, Deploying deep reinforcement learning agents to the cloud

Chapter 8, Accelerating deep reinforcement learning agent development with distributed training

Chapter 9, Multi-platform deployment of deep reinforcement learning agents

Chapter 1: Developing the Building Blocks of Deep Reinforcement Learning Using TensorFlow 2.x

  • A concrete description of the fundamentals of deep reinforcement learning (DRL)
  • Developing neural-network-based agents in OpenAI Gym reinforcement learning environments, and methods for building neural-network agents that solve DRL problems over discrete and continuous value spaces

1.1 Technical requirements

Python 3.7

1.2 Building an environment and reward mechanism for training reinforcement learning agents

Gridworld is a learning environment for training reinforcement learning agents.

[Figure: the Gridworld environment layout]

Blue: agent position

Green: goal position

Red: bomb position

Gray: wall

The agent must reach the goal without stepping on the bomb.

Getting ready

import copy
import sys
import time
import gym
import numpy as np

How to do it

We build the learning environment, which supplies observations (states) to the agent, lets the agent execute a set of actions, and returns a reward and the new state in response to each action.
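
Before walking through the steps, here is a minimal sketch of that interaction contract under the classic gym API used throughout this chapter; the run_episode helper below is illustrative only and is not part of the book's code:

    import gym

    def run_episode(env: gym.Env) -> float:
        """Roll out one episode with random actions and return the total reward."""
        obs = env.reset()                               # reset() -> initial observation (state)
        done, total_reward = False, 0.0
        while not done:
            action = env.action_space.sample()          # any valid action from the action space
            obs, reward, done, info = env.step(action)  # step() -> (observation, reward, done, info)
            total_reward += reward
        return total_reward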

  1. First, define the mapping between the different cell states used in the environment and their colors

    # Grid cell state and color mapping
    EMPTY = BLACK = 0
    WALL = GRAY = 1
    AGENT = BLUE = 2
    BOMB = RED = 3
    GOAL = GREEN = 4
    
  2. Generate a color map using RGB intensity values

    # RGB color value table
    COLOR_MAP = {
        BLACK: [0.0, 0.0, 0.0],
        GRAY: [0.5, 0.5, 0.5],
        BLUE: [0.0, 0.0, 1.0],
        RED: [1.0, 0.0, 0.0],
        GREEN: [0.0, 1.0, 0.0],
    }
    
  3. Define the action mapping

    # Action mapping
    NOOP = 0
    DOWN = 1
    UP = 2
    LEFT = 3
    RIGHT = 4
    
  4. Create the GridworldEnv class, and in __init__() define class variables including the observation space and the action space

    class GridworldEnv(gym.Env):
        def __init__(self, max_steps=100):
            """Initialize Gridworld
    
            Args:
                max_steps (int, optional): Max steps per episode. Defaults to 100.
            """
    
  5. Define the layout of the Gridworld environment using the cell-state mapping

            self.grid_layout = """
            1 1 1 1 1 1 1 1
            1 2 0 0 0 0 0 1
            1 0 1 1 1 0 0 1
            1 0 1 0 1 0 0 1
            1 0 1 4 1 0 0 1
            1 0 3 0 0 0 0 1
            1 0 0 0 0 0 0 1
            1 1 1 1 1 1 1 1
            """
    

    These values correspond to the cell-state mapping defined in step 1.

  6. Define the observation space

            self.initial_grid_state = np.fromstring(self.grid_layout, dtype=int, sep=" ")
            self.initial_grid_state = self.initial_grid_state.reshape(8, 8)
            self.grid_state = copy.deepcopy(self.initial_grid_state)
            self.observation_space = gym.spaces.Box(
                low=0, high=6, shape=self.grid_state.shape
            )
            self.img_shape = [256, 256, 3]
            self.metadata = {"render.modes": ["human"]}
    
  7. Define the action space and the mapping between actions and how the agent moves on the grid

            self.action_space = gym.spaces.Discrete(5)
            self.actions = [NOOP, UP, DOWN, LEFT, RIGHT]
            self.action_pos_dict = {
                NOOP: [0, 0],
                UP: [-1, 0],
                DOWN: [1, 0],
                LEFT: [0, -1],
                RIGHT: [0, 1],
            }
            
    
  8. Initialize the agent's start state and the goal state using the get_state() method

            (self.agent_state, self.goal_state) = self.get_state()
            self.step_num = 0  # To keep track of number of steps
            self.max_steps = max_steps
            self.done = False
            self.info = {"status": "Live"}
            self.viewer = None
    
  9. Implement the get_state() method

        def get_state(self):
            start_state = np.where(self.grid_state == AGENT)
            goal_state = np.where(self.grid_state == GOAL)
    
            start_or_goal_not_found = not (start_state[0] and goal_state[0])
            if start_or_goal_not_found:
                sys.exit(
                    "Start and/or Goal state not present in the Gridworld. "
                    "Check the Grid layout"
                )
            start_state = (start_state[0][0], start_state[1][0])
            goal_state = (goal_state[0][0], goal_state[1][0])
    
            return start_state, goal_state
    
  10. Implement the step() method, which executes an action and returns the next state, the associated reward, and whether the episode has ended

        def step(self, action):
            """Return next observation, reward, done , info"""
            action = int(action)
            reward = 0.0
    
            next_state = (
                self.agent_state[0] + self.action_pos_dict[action][0],
                self.agent_state[1] + self.action_pos_dict[action][1],
            )
    
            next_state_invalid = (
                next_state[0] < 0 or next_state[0] >= self.grid_state.shape[0]
            ) or (next_state[1] < 0 or next_state[1] >= self.grid_state.shape[1])
            if next_state_invalid:
                # Leave the agent state unchanged
                next_state = self.agent_state
                self.info["status"] = "Next state is invalid"
    
            next_agent_state = self.grid_state[next_state[0], next_state[1]]
    
  11. Assign the rewards and, at the end, return grid_state, reward, done, and info

            # Calculate reward
            if next_agent_state == EMPTY:
                # Move agent from previous state to the next state on the grid
                self.info["status"] = "Agent moved to a new cell"
                self.grid_state[next_state[0], next_state[1]] = AGENT
                self.grid_state[self.agent_state[0], self.agent_state[1]] = EMPTY
                self.agent_state = copy.deepcopy(next_state)
    
            elif next_agent_state == WALL:
                self.info["status"] = "Agent bumped into a wall"
                reward = -0.1
            # Terminal states
            elif next_agent_state == GOAL:
                self.info["status"] = "Agent reached the GOAL "
                self.done = True
                reward = 1
            elif next_agent_state == BOMB:
                self.info["status"] = "Agent stepped on a BOMB"
                self.done = True
                reward = -1
            # elif next_agent_state == AGENT:
            else:
                # NOOP or next state is invalid
                self.done = False
    
            self.step_num += 1
    
            # Check if max steps per episode has been reached
            if self.step_num >= self.max_steps:
                self.done = True
                self.info["status"] = "Max steps reached"
    
            if self.done:
                done = True
                terminal_state = copy.deepcopy(self.grid_state)
                terminal_info = copy.deepcopy(self.info)
                _ = self.reset()
                return (terminal_state, reward, done, terminal_info)
    
            return self.grid_state, reward, self.done, self.info
    
  12. Implement the reset() method, which resets the Gridworld environment at the end of an episode or whenever a reset is requested

        def reset(self):
            self.grid_state = copy.deepcopy(self.initial_grid_state)
            (
                self.agent_state,
                self.agent_goal_state,
            ) = self.get_state()
            self.step_num = 0
            self.done = False
            self.info["status"] = "Live"
            return self.grid_state
    
  13. To visualize the Gridworld state, implement a render method that converts the grid_layout defined in step 5 into an image and displays it

        def gridarray_to_image(self, img_shape=None):
            if img_shape is None:
                img_shape = self.img_shape
            observation = np.random.randn(*img_shape) * 0.0
            scale_x = int(observation.shape[0] / self.grid_state.shape[0])
            scale_y = int(observation.shape[1] / self.grid_state.shape[1])
            for i in range(self.grid_state.shape[0]):
                for j in range(self.grid_state.shape[1]):
                    for k in range(3):  # 3-channel RGB image
                        pixel_value = COLOR_MAP[self.grid_state[i, j]][k]
                        observation[
                            i * scale_x : (i + 1) * scale_x,
                            j * scale_y : (j + 1) * scale_y,
                            k,
                        ] = pixel_value
            return (255 * observation).astype(np.uint8)
        
        def render(self, mode="human", close=False):
            if close:
                if self.viewer is not None:
                    self.viewer.close()
                    self.viewer = None
                return
    
            img = self.gridarray_to_image()
            if mode == "rgb_array":
                return img
            elif mode == "human":
                from gym.envs.classic_control import rendering
    
                if self.viewer is None:
                    self.viewer = rendering.SimpleImageViewer()
                self.viewer.imshow(img)
    

    On Windows, one of the dependency's .py files may raise the following error:

        if not (_dc := user32.GetDC(window_handle)):
                    ^
    SyntaxError: invalid syntax

    The := (walrus) operator was only introduced in Python 3.8, so it is a syntax error under the Python 3.7 environment used here. Either upgrade to Python 3.8+ or split the assignment into two lines:

        _dc = user32.GetDC(window_handle)
        if not _dc:

  14. Test run

    if __name__ == "__main__":
        env = GridworldEnv(max_steps=1000)
        obs = env.reset()
        done = False
        step_num = 1
        # Run one episode
        while not done:
            # Sample a random action from the action space
            action = env.action_space.sample()
            next_obs, reward, done, info = env.step(action)
            # Sleep briefly (0.1 s) between steps so the rendering is watchable
            time.sleep(0.1)
            print(f"step#:{step_num} reward:{reward} done:{done} info:{info}")
            step_num += 1
            env.render()
        env.close()
    

[Figure: the rendered Gridworld during the random-agent test run]

How it works

grid_layout represents the state of the learning environment.

The Gridworld environment defines an observation space, an action space, and a reward mechanism, thereby implementing a Markov decision process (MDP).

A valid action is sampled from the environment's action space and executed in the environment, which responds to the agent's action with a new observation, a reward, and a Boolean flag indicating whether the episode has ended.
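
As a quick sanity check of this MDP interface, the following minimal sketch (assuming the GridworldEnv class above has been saved as envs/gridworld.py, the module path used later in this chapter) samples a valid action, steps the environment once, and prints the response:

    from envs.gridworld import GridworldEnv

    env = GridworldEnv(max_steps=10)
    obs = env.reset()                                # the 8x8 grid of cell-state codes
    action = env.action_space.sample()               # a valid action from Discrete(5)
    assert env.action_space.contains(action)
    next_obs, reward, done, info = env.step(action)  # the environment's response
    print(f"reward: {reward} done: {done} info: {info}")
    env.close()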

1.3 Implementing neural-network-based reinforcement learning policies for discrete action spaces and discrete decision-making problems

A simple linear function can be used to represent an agent's policy, but it does not scale to complex problems.

Neural-network-based policy networks are an essential component of advanced reinforcement learning and deep reinforcement learning, and they apply to general decision-making problems.

This section implements, in TensorFlow 2.x, an agent with a neural-network-based policy. The agent can act in the Gridworld environment, and with little or no modification it can also act in any other environment with a discrete action space.

Getting ready

import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

import tensorflow_probability as tfp

How to do it

  1. A binary policy distribution

    binary_policy = tfp.distributions.Bernoulli(probs=0.5)
    for i in range(5):
        action = binary_policy.sample(1)
        print("Action: ", action)
    

    The output will look something like the following (the Bernoulli distribution does not produce deterministic samples):

    Action: tf.Tensor([1], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)

  2. Take a quick look at the binary policy distribution

    sample_actions = binary_policy.sample(500)
    sns.displot(sample_actions)
    

    [Figure: distribution plot of the 500 samples drawn from the binary policy]

  3. Implement a discrete policy distribution

    action_dim = 4  # Dimension of the discrete action space
    action_probability = [0.25, 0.25, 0.25, 0.25]
    discrete_policy = tfp.distributions.Multinomial(
        probs=action_probability, total_count=1)
    for i in range(5):
        action = discrete_policy.sample(1)
        print(action)
    
    

    The output will look something like this:

    tf.Tensor([[1. 0. 0. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[0. 0. 1. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[0. 0. 1. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[1. 0. 0. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[0. 0. 1. 0.]], shape=(1, 4), dtype=float32)

  4. Look at the distribution of the discrete policy

    sns.displot(discrete_policy.sample(1).numpy())
    

    [Figure: distribution plot of samples drawn from the discrete policy]

  5. Compute the entropy of the discrete policy

    def entropy(action_probs):
        return -tf.reduce_sum(action_probs * tf.math.log(action_probs), axis=-1)  # H(p) = -Σ p(x) log p(x)
    
    action_probability = [0.25, 0.25, 0.25, 0.25]
    print(entropy(action_probability))
    

    tf.Tensor(1.3862944, shape=(), dtype=float32)

    For a uniform distribution over four actions the entropy is ln(4) ≈ 1.3863, which matches the output above. (A sketch showing how to compute the entropy from the agent's action logits appears after this list.)

  6. Implement a discrete policy class, DiscretePolicy()

    import numpy as np
    
    
    class DiscretePolicy(object):
        def __init__(self, num_actions):
            self.action_dim = num_actions
    
        def sample(self, action_logits):
            self.distribution = tfp.distributions.Multinomial(
                logits=action_logits, total_count=1)
            return self.distribution.sample(1)
    
        def get_action(self, action_logits):
            action = self.sample(action_logits)
            return np.where(action)[-1]
    
        def entropy(self, action_probability):
            return -tf.reduce_sum(action_probability * tf.math.log(action_probability), axis=-1)
    
  7. Implement an evaluate() function to evaluate an agent in a given environment

    def evaluate(agent, env, render=True):
        global info
        obs, episode_reward, done, step_num = env.reset(), 0.0, False, 0
        while not done:
            action = agent.get_action(obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            step_num += 1
            if render:
                env.render()
        return episode_reward, step_num, done, info
    
  8. Implement a neural-network-based Brain() class using TensorFlow 2.x

    class Brain(keras.Model):
        def __init__(self, action_dim=5,
                     input_shape=(1, 8 * 8)):
            """Initialize the Agent's Brain model
            Args:
                action_dim (int): Number of actions
                input_shape (tuple): Shape of the input tensor
            """
            super(Brain, self).__init__()
            self.dense1 = layers.Dense(32, input_shape=input_shape, activation='relu')
            self.logits = layers.Dense(action_dim)
    
        def call(self, inputs):
            x = tf.convert_to_tensor(inputs)
            if len(x.shape) >= 2 and x.shape[0] != 1:
                x = tf.reshape(x, (1, -1))
            return self.logits(self.dense1(x))
    
        def process(self, observations):
            action_logits = self.predict_on_batch(observations)
            return action_logits
    
  9. Implement a simple Agent() class that uses a DiscretePolicy object to act in discrete environments

    class Agent(object):
        def __init__(self, action_dim=5, input_shape=(1, 8 * 8)):
            self.brain = Brain(action_dim, input_shape)
            self.policy = DiscretePolicy(action_dim)
    
        def get_action(self, obs):
            action_logits = self.brain.process(obs)
            action = self.policy.get_action(np.squeeze(action_logits, 0))
            return action
    
  10. Test the agent in GridworldEnv()

    from envs.gridworld import GridworldEnv
    env = GridworldEnv(500)
    agent = Agent(env.action_space.n, env.observation_space.shape)
    reward, steps, done, info = evaluate(agent, env)  # evaluate() returns (episode_reward, step_num, done, info)
    print(f"Steps: {steps}, Reward: {reward}, Done: {done}, Info: {info}")
    env.close()
    
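
As mentioned in step 5, the following minimal sketch (reusing the Brain and DiscretePolicy classes defined above; the all-zero fake_obs is just a stand-in for a flattened grid observation) shows how the agent's raw action logits are turned into a probability distribution with a softmax, and how the policy entropy is computed from it:

    import numpy as np
    import tensorflow as tf

    brain = Brain(action_dim=5, input_shape=(1, 8 * 8))
    policy = DiscretePolicy(num_actions=5)

    fake_obs = np.zeros((1, 8 * 8), dtype=np.float32)  # stand-in for a flattened grid observation
    action_logits = brain.process(fake_obs)            # raw, unnormalized scores for each action
    action_probs = tf.nn.softmax(action_logits)        # normalize the logits into probabilities
    print("Action probabilities:", action_probs.numpy())
    print("Policy entropy:", policy.entropy(action_probs).numpy())

With freshly initialized weights and an all-zero input, the logits are all zero (Dense layers use zero bias initialization by default), so the probabilities are uniform and the entropy is ln(5) ≈ 1.609, the maximum possible for five actions.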

[Figure: console output from evaluating the agent in GridworldEnv]

How it works

One of the core components of a reinforcement learning agent is the policy function, which maps states to actions.

Formally, a policy is a distribution over actions that gives the probability of choosing each action in a given state.

In a binary action space, the policy can be represented by a Bernoulli distribution: the probability of taking action $x=1$ is $p(x=1) = \phi$ and the probability of taking action $x=0$ is $p(x=0) = 1 - \phi$, so the probability distribution is:

$$p(x) = \phi^x (1-\phi)^{1-x}$$

When the agent can take one of k possible actions in the environment, a discrete probability distribution can be used to represent the reinforcement learning agent's policy.

In general, such a distribution describes the possible outcomes of a random variable that takes on one of k possible categories; it is known as the categorical distribution and is the generalization of the Bernoulli distribution to k events.
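
For illustration (this is not part of the book's code), such a categorical policy over k actions can also be expressed directly with tfp.distributions.Categorical, which samples action indices rather than one-hot vectors:

    import tensorflow_probability as tfp

    k = 4
    categorical_policy = tfp.distributions.Categorical(probs=[0.25] * k)

    actions = categorical_policy.sample(5)  # five action indices drawn from {0, 1, 2, 3}
    print("Sampled actions:", actions.numpy())
    print("Entropy:", categorical_policy.entropy().numpy())  # ln(4) ≈ 1.3863 for a uniform policy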
