TF2.x Reinforcement Learning Cookbook (P0-P13): building the environment and reward mechanism for training RL agents, and implementing NN-based RL policies for discrete action spaces and discrete decision problems

Translator's Preface

Understand and apply the TensorFlow 2.x framework and its related tool libraries.

Gain practical knowledge and skills for building, training, and deploying reinforcement learning systems with TensorFlow 2.x.

  1. The book first introduces how to build reinforcement learning environments.
  2. It then explains model-based and model-free reinforcement learning algorithms, goes deeper into more advanced RL algorithms, and shows how to apply them to train agents.
  3. Finally, it covers how to deploy agents to the cloud, how to accelerate agent development with distributed training, and how to build cross-platform applications for the web, mobile devices, and other platforms.

Preface

Chapter 1, Developing the building blocks of deep reinforcement learning using TensorFlow 2.x

Chapter 2, Implementing value-based, policy-based, and actor-critic deep reinforcement learning algorithms

Chapter 3, Implementing advanced RL algorithms (Deep Q-Networks, Double and Dueling DQN, Deep Recurrent Q-Networks, Asynchronous Advantage Actor-Critic, Proximal Policy Optimization, Deep Deterministic Policy Gradient)

Chapter 4, Reinforcement learning in the real world: building a cryptocurrency trading agent

Chapter 5, Reinforcement learning in the real world: building a stock/share trading agent

Chapter 6, Reinforcement learning in the real world: building an agent to complete your to-do list

Chapter 7, Deploying deep reinforcement learning agents to the cloud

Chapter 8, Accelerating deep reinforcement learning agent development with distributed training

Chapter 9, Multi-platform deployment of deep reinforcement learning agents

Chapter 1: Developing the Building Blocks of Deep Reinforcement Learning Using TensorFlow 2.x

  • A concrete description of the fundamentals of deep reinforcement learning (DRL)
  • Developing neural-network-based agents in OpenAI Gym reinforcement learning environments, and methods for building neural-network agents that solve DRL problems over discrete and continuous value spaces

1.1 Technical requirements

Python 3.7

1.2 Building an environment and reward mechanism for training reinforcement learning agents

Gridworld is a learning environment for training reinforcement learning agents.

[Figure: the Gridworld environment layout]

Blue: agent position

Green: goal position

Red: bomb position

Gray: wall

The agent must reach the goal without stepping on the bomb.

Getting ready

import copy
import sys
import time
import gym
import numpy as np

How to do it

We build the learning environment, which supplies observations (states) to the agent, lets the agent execute a set of actions, and returns a reward and the new state in response to each action.
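
Before walking through the steps, here is a minimal sketch of that interaction contract under the classic gym API used throughout this chapter; the run_episode helper below is illustrative only and is not part of the book's code:

    import gym

    def run_episode(env: gym.Env) -> float:
        """Roll out one episode with random actions and return the total reward."""
        obs = env.reset()                               # reset() -> initial observation (state)
        done, total_reward = False, 0.0
        while not done:
            action = env.action_space.sample()          # any valid action from the action space
            obs, reward, done, info = env.step(action)  # step() -> (observation, reward, done, info)
            total_reward += reward
        return total_reward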

  1. First, define the mapping between the different cell states used in the environment and their colors

    # Grid cell state and color mapping
    EMPTY = BLACK = 0
    WALL = GRAY = 1
    AGENT = BLUE = 2
    BOMB = RED = 3
    GOAL = GREEN = 4
    
  2. Generate a color map using RGB intensity values

    # RGB color value table
    COLOR_MAP = {
        BLACK: [0.0, 0.0, 0.0],
        GRAY: [0.5, 0.5, 0.5],
        BLUE: [0.0, 0.0, 1.0],
        RED: [1.0, 0.0, 0.0],
        GREEN: [0.0, 1.0, 0.0],
    }
    
  3. Define the action mapping

    # Action mapping
    NOOP = 0
    DOWN = 1
    UP = 2
    LEFT = 3
    RIGHT = 4
    
  4. Create the GridworldEnv class, and in __init__() define class variables including the observation space and the action space

    class GridworldEnv(gym.Env):
        def __init__(self, max_steps=100):
            """Initialize Gridworld
    
            Args:
                max_steps (int, optional): Max steps per episode. Defaults to 100.
            """
    
  5. Define the layout of the Gridworld environment using the cell-state mapping

            self.grid_layout = """
            1 1 1 1 1 1 1 1
            1 2 0 0 0 0 0 1
            1 0 1 1 1 0 0 1
            1 0 1 0 1 0 0 1
            1 0 1 4 1 0 0 1
            1 0 3 0 0 0 0 1
            1 0 0 0 0 0 0 1
            1 1 1 1 1 1 1 1
            """
    

    These values correspond to the cell-state mapping defined in step 1.

  6. Define the observation space

            self.initial_grid_state = np.fromstring(self.grid_layout, dtype=int, sep=" ")
            self.initial_grid_state = self.initial_grid_state.reshape(8, 8)
            self.grid_state = copy.deepcopy(self.initial_grid_state)
            self.observation_space = gym.spaces.Box(
                low=0, high=6, shape=self.grid_state.shape
            )
            self.img_shape = [256, 256, 3]
            self.metadata = {"render.modes": ["human"]}
    
  7. Define the action space and the mapping between actions and how the agent moves on the grid

            self.action_space = gym.spaces.Discrete(5)
            self.actions = [NOOP, UP, DOWN, LEFT, RIGHT]
            self.action_pos_dict = {
                NOOP: [0, 0],
                UP: [-1, 0],
                DOWN: [1, 0],
                LEFT: [0, -1],
                RIGHT: [0, 1],
            }
            
    
  8. Initialize the agent's start state and the goal state using the get_state() method

            (self.agent_state, self.goal_state) = self.get_state()
            self.step_num = 0  # To keep track of number of steps
            self.max_steps = max_steps
            self.done = False
            self.info = {"status": "Live"}
            self.viewer = None
    
  9. Implement the get_state() method

        def get_state(self):
            start_state = np.where(self.grid_state == AGENT)
            goal_state = np.where(self.grid_state == GOAL)
    
            start_or_goal_not_found = not (start_state[0] and goal_state[0])
            if start_or_goal_not_found:
                sys.exit(
                    "Start and/or Goal state not present in the Gridworld. "
                    "Check the Grid layout"
                )
            start_state = (start_state[0][0], start_state[1][0])
            goal_state = (goal_state[0][0], goal_state[1][0])
    
            return start_state, goal_state
    
  10. Implement the step() method, which executes an action and returns the next state, the associated reward, and whether the episode has ended

        def step(self, action):
            """Return next observation, reward, done , info"""
            action = int(action)
            reward = 0.0
    
            next_state = (
                self.agent_state[0] + self.action_pos_dict[action][0],
                self.agent_state[1] + self.action_pos_dict[action][1],
            )
    
            next_state_invalid = (
                next_state[0] < 0 or next_state[0] >= self.grid_state.shape[0]
            ) or (next_state[1] < 0 or next_state[1] >= self.grid_state.shape[1])
            if next_state_invalid:
                # Leave the agent state unchanged
                next_state = self.agent_state
                self.info["status"] = "Next state is invalid"
    
            next_agent_state = self.grid_state[next_state[0], next_state[1]]
    
  11. Assign the rewards and, at the end, return grid_state, reward, done, and info

            # Calculate reward
            if next_agent_state == EMPTY:
                # Move agent from previous state to the next state on the grid
                self.info["status"] = "Agent moved to a new cell"
                self.grid_state[next_state[0], next_state[1]] = AGENT
                self.grid_state[self.agent_state[0], self.agent_state[1]] = EMPTY
                self.agent_state = copy.deepcopy(next_state)
    
            elif next_agent_state == WALL:
                self.info["status"] = "Agent bumped into a wall"
                reward = -0.1
            # Terminal states
            elif next_agent_state == GOAL:
                self.info["status"] = "Agent reached the GOAL "
                self.done = True
                reward = 1
            elif next_agent_state == BOMB:
                self.info["status"] = "Agent stepped on a BOMB"
                self.done = True
                reward = -1
            # elif next_agent_state == AGENT:
            else:
                # NOOP or next state is invalid
                self.done = False
    
            self.step_num += 1
    
            # Check if max steps per episode has been reached
            if self.step_num >= self.max_steps:
                self.done = True
                self.info["status"] = "Max steps reached"
    
            if self.done:
                done = True
                terminal_state = copy.deepcopy(self.grid_state)
                terminal_info = copy.deepcopy(self.info)
                _ = self.reset()
                return (terminal_state, reward, done, terminal_info)
    
            return self.grid_state, reward, self.done, self.info
    
  12. Implement the reset() method, which resets the Gridworld environment at the end of an episode or whenever a reset is requested

        def reset(self):
            self.grid_state = copy.deepcopy(self.initial_grid_state)
            (
                self.agent_state,
                self.agent_goal_state,
            ) = self.get_state()
            self.step_num = 0
            self.done = False
            self.info["status"] = "Live"
            return self.grid_state
    
  13. To visualize the Gridworld state, implement a render method that converts the grid_layout defined in step 5 into an image and displays it

        def gridarray_to_image(self, img_shape=None):
            if img_shape is None:
                img_shape = self.img_shape
            observation = np.random.randn(*img_shape) * 0.0
            scale_x = int(observation.shape[0] / self.grid_state.shape[0])
            scale_y = int(observation.shape[1] / self.grid_state.shape[1])
            for i in range(self.grid_state.shape[0]):
                for j in range(self.grid_state.shape[1]):
                    for k in range(3):  # 3-channel RGB image
                        pixel_value = COLOR_MAP[self.grid_state[i, j]][k]
                        observation[
                            i * scale_x : (i + 1) * scale_x,
                            j * scale_y : (j + 1) * scale_y,
                            k,
                        ] = pixel_value
            return (255 * observation).astype(np.uint8)
        
        def render(self, mode="human", close=False):
            if close:
                if self.viewer is not None:
                    self.viewer.close()
                    self.viewer = None
                return
    
            img = self.gridarray_to_image()
            if mode == "rgb_array":
                return img
            elif mode == "human":
                from gym.envs.classic_control import rendering
    
                if self.viewer is None:
                    self.viewer = rendering.SimpleImageViewer()
                self.viewer.imshow(img)
    

    On Windows, one of the dependency's .py files may raise the following error:

        if not (_dc := user32.GetDC(window_handle)):
                    ^
    SyntaxError: invalid syntax

    The := (walrus) operator was only introduced in Python 3.8, so it is a syntax error under the Python 3.7 environment used here. Either upgrade to Python 3.8+ or split the assignment into two lines:

        _dc = user32.GetDC(window_handle)
        if not _dc:

  14. Test run

    if __name__ == "__main__":
        env = GridworldEnv(max_steps=1000)
        obs = env.reset()
        done = False
        step_num = 1
        # Run one episode
        while not done:
            # Sample a random action from the action space
            action = env.action_space.sample()
            next_obs, reward, done, info = env.step(action)
            # Sleep briefly (0.1 s) between steps so the rendering is watchable
            time.sleep(0.1)
            print(f"step#:{step_num} reward:{reward} done:{done} info:{info}")
            step_num += 1
            env.render()
        env.close()
    

[Figure: the rendered Gridworld during the random-agent test run]

How it works

grid_layout represents the state of the learning environment.

The Gridworld environment defines an observation space, an action space, and a reward mechanism, thereby implementing a Markov decision process (MDP).

A valid action is sampled from the environment's action space and executed in the environment, which responds to the agent's action with a new observation, a reward, and a Boolean flag indicating whether the episode has ended.
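
As a quick sanity check of this MDP interface, the following minimal sketch (assuming the GridworldEnv class above has been saved as envs/gridworld.py, the module path used later in this chapter) samples a valid action, steps the environment once, and prints the response:

    from envs.gridworld import GridworldEnv

    env = GridworldEnv(max_steps=10)
    obs = env.reset()                                # the 8x8 grid of cell-state codes
    action = env.action_space.sample()               # a valid action from Discrete(5)
    assert env.action_space.contains(action)
    next_obs, reward, done, info = env.step(action)  # the environment's response
    print(f"reward: {reward} done: {done} info: {info}")
    env.close()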

1.3 Implementing neural-network-based reinforcement learning policies for discrete action spaces and discrete decision-making problems

A simple linear function can be used to represent an agent's policy, but it does not scale to complex problems.

Neural-network-based policy networks are an essential component of advanced reinforcement learning and deep reinforcement learning, and they apply to general decision-making problems.

This section implements, in TensorFlow 2.x, an agent with a neural-network-based policy. The agent can act in the Gridworld environment, and with little or no modification it can also act in any other environment with a discrete action space.

Getting ready

import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

import tensorflow_probability as tfp

How to do it

  1. A binary policy distribution

    binary_policy = tfp.distributions.Bernoulli(probs=0.5)
    for i in range(5):
        action = binary_policy.sample(1)
        print("Action: ", action)
    

    The output will look something like the following (the Bernoulli distribution does not produce deterministic samples):

    Action: tf.Tensor([1], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)
    Action: tf.Tensor([0], shape=(1,), dtype=int32)

  2. Take a quick look at the binary policy distribution

    sample_actions = binary_policy.sample(500)
    sns.displot(sample_actions)
    

    [Figure: distribution plot of the 500 samples drawn from the binary policy]

  3. Implement a discrete policy distribution

    action_dim = 4  # Dimension of the discrete action space
    action_probability = [0.25, 0.25, 0.25, 0.25]
    discrete_policy = tfp.distributions.Multinomial(
        probs=action_probability, total_count=1)
    for i in range(5):
        action = discrete_policy.sample(1)
        print(action)
    
    

    The output will look something like this:

    tf.Tensor([[1. 0. 0. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[0. 0. 1. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[0. 0. 1. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[1. 0. 0. 0.]], shape=(1, 4), dtype=float32)
    tf.Tensor([[0. 0. 1. 0.]], shape=(1, 4), dtype=float32)

  4. Look at the distribution of the discrete policy

    sns.displot(discrete_policy.sample(1).numpy())
    

    [Figure: distribution plot of samples drawn from the discrete policy]

  5. Compute the entropy of the discrete policy

    def entropy(action_probs):
        return -tf.reduce_sum(action_probs * tf.math.log(action_probs), axis=-1)  # H(p) = -Σ p(x) log p(x)
    
    action_probability = [0.25, 0.25, 0.25, 0.25]
    print(entropy(action_probability))
    

    tf.Tensor(1.3862944, shape=(), dtype=float32)

    For a uniform distribution over four actions the entropy is ln(4) ≈ 1.3863, which matches the output above. (A sketch showing how to compute the entropy from the agent's action logits appears after this list.)

  6. Implement a discrete policy class, DiscretePolicy()

    import numpy as np
    
    
    class DiscretePolicy(object):
        def __init__(self, num_actions):
            self.action_dim = num_actions
    
        def sample(self, action_logits):
            self.distribution = tfp.distributions.Multinomial(
                logits=action_logits, total_count=1)
            return self.distribution.sample(1)
    
        def get_action(self, action_logits):
            action = self.sample(action_logits)
            return np.where(action)[-1]
    
        def entropy(self, action_probability):
            return -tf.reduce_sum(action_probability * tf.math.log(action_probability), axis=-1)
    
  7. Implement an evaluate() function to evaluate an agent in a given environment

    def evaluate(agent, env, render=True):
        global info
        obs, episode_reward, done, step_num = env.reset(), 0.0, False, 0
        while not done:
            action = agent.get_action(obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            step_num += 1
            if render:
                env.render()
        return episode_reward, step_num, done, info
    
  8. Implement a neural-network-based Brain() class using TensorFlow 2.x

    class Brain(keras.Model):
        def __init__(self, action_dim=5,
                     input_shape=(1, 8 * 8)):
            """Initialize the Agent's Brain model
            Args:
                action_dim (int): Number of actions
                input_shape (tuple): Shape of the input tensor
            """
            super(Brain, self).__init__()
            self.dense1 = layers.Dense(32, input_shape=input_shape, activation='relu')
            self.logits = layers.Dense(action_dim)
    
        def call(self, inputs):
            x = tf.convert_to_tensor(inputs)
            if len(x.shape) >= 2 and x.shape[0] != 1:
                x = tf.reshape(x, (1, -1))
            return self.logits(self.dense1(x))
    
        def process(self, observations):
            action_logits = self.predict_on_batch(observations)
            return action_logits
    
  9. Implement a simple Agent() class that uses a DiscretePolicy object to act in discrete environments

    class Agent(object):
        def __init__(self, action_dim=5, input_shape=(1, 8 * 8)):
            self.brain = Brain(action_dim, input_shape)
            self.policy = DiscretePolicy(action_dim)
    
        def get_action(self, obs):
            action_logits = self.brain.process(obs)
            action = self.policy.get_action(np.squeeze(action_logits, 0))
            return action
    
  10. Test the agent in GridworldEnv()

    from envs.gridworld import GridworldEnv
    env = GridworldEnv(500)
    agent = Agent(env.action_space.n, env.observation_space.shape)
    reward, steps, done, info = evaluate(agent, env)  # evaluate() returns (episode_reward, step_num, done, info)
    print(f"Steps: {steps}, Reward: {reward}, Done: {done}, Info: {info}")
    env.close()
    
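
As mentioned in step 5, the following minimal sketch (reusing the Brain and DiscretePolicy classes defined above; the all-zero fake_obs is just a stand-in for a flattened grid observation) shows how the agent's raw action logits are turned into a probability distribution with a softmax, and how the policy entropy is computed from it:

    import numpy as np
    import tensorflow as tf

    brain = Brain(action_dim=5, input_shape=(1, 8 * 8))
    policy = DiscretePolicy(num_actions=5)

    fake_obs = np.zeros((1, 8 * 8), dtype=np.float32)  # stand-in for a flattened grid observation
    action_logits = brain.process(fake_obs)            # raw, unnormalized scores for each action
    action_probs = tf.nn.softmax(action_logits)        # normalize the logits into probabilities
    print("Action probabilities:", action_probs.numpy())
    print("Policy entropy:", policy.entropy(action_probs).numpy())

With freshly initialized weights and an all-zero input, the logits are all zero (Dense layers use zero bias initialization by default), so the probabilities are uniform and the entropy is ln(5) ≈ 1.609, the maximum possible for five actions.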

[Figure: console output from evaluating the agent in GridworldEnv]

How it works

One of the core components of a reinforcement learning agent is the policy function, which maps states to actions.

Formally, a policy is a distribution over actions that gives the probability of choosing each action in a given state.

In a binary action space, the policy can be represented by a Bernoulli distribution: the probability of taking action $x=1$ is $p(x=1) = \phi$ and the probability of taking action $x=0$ is $p(x=0) = 1 - \phi$, so the probability distribution is:

$$p(x) = \phi^x (1-\phi)^{1-x}$$

When the agent can take one of k possible actions in the environment, a discrete probability distribution can be used to represent the reinforcement learning agent's policy.

In general, such a distribution describes the possible outcomes of a random variable that takes on one of k possible categories; it is known as the categorical distribution and is the generalization of the Bernoulli distribution to k events.
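
For illustration (this is not part of the book's code), such a categorical policy over k actions can also be expressed directly with tfp.distributions.Categorical, which samples action indices rather than one-hot vectors:

    import tensorflow_probability as tfp

    k = 4
    categorical_policy = tfp.distributions.Categorical(probs=[0.25] * k)

    actions = categorical_policy.sample(5)  # five action indices drawn from {0, 1, 2, 3}
    print("Sampled actions:", actions.numpy())
    print("Entropy:", categorical_policy.entropy().numpy())  # ln(4) ≈ 1.3863 for a uniform policy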
