RLlib Notes

Ray is an open-source library developed by the UC Berkeley RISELab. It provides distributed training for reinforcement learning, hyperparameter tuning (Tune), and a customizable reinforcement learning framework (RLlib).

rllib install

pip install -U ray
pip install -U "ray[tune]"
pip install -U "ray[rllib]"

Atari, PyTorch, TensorFlow, etc. are not installed automatically; install them yourself if you need them.

Standard RL + hyperparameter tuning with Tune

We can build a trainer directly with RLlib and design the training procedure ourselves, combining it with demonstrations, imitation learning, or a custom experience replay. If we only need a standard model-free RL setup, however, we can pick the algorithm directly through ray.tune, because Tune also handles parameter optimization: for each algorithm parameter we can list several candidate values or sample from a range, and Tune then picks the configuration that maximizes the objective we specify. No manual tuning is needed, which makes it easier to organize experiments. Besides RL models, Tune can tune any other AI model as well.

Basic usage of Tune

Base algorithm + algorithm parameters + environment definition + stopping criteria

import ray
import ray.tune as tune

algo_config = {
    # environment info
        "env": "CartPole-v0", # a custom env ("my_env") must be registered first; see the registration section below
        "env_config": {},     # passed to the env constructor

        "log_level": "INFO",

    # model info
        "model": {
            # CNN
            "conv_filters": [], # [[out_channels, kernel, stride]], e.g. [[16,[4,4],2], [128,[6,6],3]]
            "conv_activation": "relu",

            # fully connected layers
            "fcnet_hiddens": [256, 256],
            "fcnet_activation": "tanh",

            # post fcnet
            # Sometimes the observation is a composite type, e.g. matrix + vector:
            # the matrix goes through the CNN, is concatenated with the vector,
            # and the result is passed through fully connected layers.
            # In that case leave the fcnet empty and use the post-fcnet stack instead.
            "post_fcnet_hiddens": [],  # e.g. [256, 256]
            "post_fcnet_activation": "linear",  # or "relu"

            # whether value and policy share part of the network (True or False)
            "vf_share_layers": True,

            ## LSTM settings
            # Whether to wrap the model with an LSTM.
            "use_lstm": False,
            # Max seq len for training the LSTM, defaults to 20.
            "max_seq_len": 20,
            # Size of the LSTM cell.
            "lstm_cell_size": 256,
            # Whether to feed a_{t-1} to LSTM (one-hot encoded if discrete).
            "lstm_use_prev_action": False,
            # Whether to feed r_{t-1} to LSTM.
            "lstm_use_prev_reward": False,
            # Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
            "_time_major": False,

            # preprocessor, attention and action settings are also available; see the reference at the end

        },

        # learning parameters
        "lr": tune.grid_search([0.0001, 0.005]),  # runs one trial per learning rate
        "gamma": 0.99,
        # parameters that are not set fall back to their default values

        # train batch
        "rollout_fragment_length": 200,
        "train_batch_size": 400,
        "batch_mode": "truncate_episodes",  # or "complete_episodes"
    }

analysis = tune.run(
    'PPO',
    config= algo_config,
    stop={
        "episode_reward_mean":100,   # 哪个条件先达到,都会结束 
        "timesteps_total":4000    # 条件是 result = trainer.train() ,result中的 信息
    }
)

print("best config: ", analysis.get_best_config(metric="episode_reward_mean", mode="max"))

Building your own training process

When training an RL agent we may want several training stages or training modes, or run several tasks (possibly of different kinds) at the same time, so we need to be able to define the training process ourselves.

RLlib basic architecture

The RLlib framework consists of two parts: the trainer and the rollout workers.

  1. The trainer holds the experience replay pool and the definition and update of the policy.
  2. Any number of workers can be created as needed. Each worker gets actions from the policy and collects experience either synchronously or asynchronously; the experience is sent to the trainer, whose policy is then updated from it.

Building the RL setup in RLlib

Designing the environment

The environment should follow the gym.Env interface.

Initialization

It must define observation_space and action_space, which RLlib uses later to build the inputs of the policy network.

Basic skeleton

# the env must follow the gym.Env interface
class MyEnv(gym.Env):
    def __init__(self, env_config):  # use only the dict env_config to pass all parameters,
                                     # unless the env is created via register_env, in which case
                                     # env_creator has to map env_config onto the constructor
        self.action_space = <gym.Space>
        self.observation_space = <gym.Space>
    def reset(self):
        return <obs>
    def step(self, action):
        return <obs>, <reward: float>, <done: bool>, <info: dict>

gym observation_space configuration

Using the environment

import gym, ray
from ray.rllib.algorithms import ppo

ray.init()
algo = ppo.PPO(env=MyEnv, config={
    "env_config": {},  # config to pass to the env class
})

Registering the environment

We can register our custom environment under a string name that RLlib recognizes, and then refer to it directly by that string.
Note that gym's registry is not fully compatible with Ray, so use Ray's register_env for the registration.

from ray.tune.registry import register_env

def env_creator(env_config):   # env_config corresponds to the env_config dict passed when building the trainer
    return MyEnv(...)  # return an env instance

register_env("my_env", env_creator)  # register the env name together with its creator function
algo = ppo.PPO(env="my_env", config={
    "env_config": {},
})

Creating multiple environments

Sometimes a single worker needs to learn over several environments, so the env definition must be able to create one of several envs.

When num_envs_per_worker > 1, each worker owns several env copies; the env can then read env_config.worker_index and env_config.vector_index to find out which worker and which env slot it belongs to.

class MultiEnv(gym.Env):
    def __init__(self, env_config):
        # pick actual env based on worker and env indexes
        self.env = gym.make(
            choose_env_for(env_config.worker_index, env_config.vector_index))
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space
    def reset(self):
        return self.env.reset()
    def step(self, action):
        return self.env.step(action)

register_env("multienv", lambda config: MultiEnv(config))

Notes

If we want to log from inside the environment, the logging setup must also be placed inside the environment; it will then run inside each worker, as in the sketch below.
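For instance, a minimal sketch (the logger name, the toy spaces and the reward are illustrative placeholders) of an env that configures its own logging in __init__, so the setup happens inside the rollout worker process:

import logging

import gym

class LoggingEnv(gym.Env):
    """Minimal sketch: logging is configured inside the env, so it runs in the worker."""

    def __init__(self, env_config):
        # When RLlib builds the env, env_config is an EnvContext, so worker_index is available.
        worker_idx = getattr(env_config, "worker_index", 0)
        self.logger = logging.getLogger("my_env.worker_{}".format(worker_idx))
        self.logger.setLevel(logging.INFO)
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(4,))

    def reset(self):
        self.logger.info("episode reset")
        return self.observation_space.sample()

    def step(self, action):
        self.logger.debug("step with action %s", action)
        return self.observation_space.sample(), 0.0, True, {}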

Scaling experience collection

  1. Create several envs inside each worker: set {"num_envs_per_worker": M}.
  2. Create several workers: set {"num_workers": N}.
    Without extra workers there is only a single local worker, which both collects experience and performs updates; with N > 0 there are N remote workers that collect experience, while the local worker only updates.
    Multiple workers therefore raise the question of how to allocate CPU/GPU resources: the local worker is sized via num_gpus, remote workers via num_cpus_per_worker, num_gpus_per_worker and custom_resources_per_worker. GPUs can be allocated fractionally. (A config sketch follows the figure below.)
    (Figure: trainer-worker structure)
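A hedged sketch of the resource-related keys mentioned above (the numbers are placeholders, not recommendations):

scaling_config = {
    "num_workers": 4,              # 4 remote rollout workers; the local worker only updates
    "num_envs_per_worker": 2,      # each worker steps 2 env copies in parallel
    "num_gpus": 1,                 # resources for the local worker (fractions are allowed)
    "num_cpus_per_worker": 1,      # resources per remote worker
    "num_gpus_per_worker": 0.25,
    "custom_resources_per_worker": {},
}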

Algorithm selection

  1. If the environment is slow or cannot be copied (e.g. it interacts with a physical system), prefer sample-efficient off-policy methods such as DQN or SAC, keeping the default num_workers: 0 (single process); alternatively use offline batch RL training. For the GPU, set num_gpus: 1.

  2. If the environment is fast or the model is small, use several workers and a time-efficient algorithm such as PPO, IMPALA or APEX. Set num_gpus: 1, or num_gpus > 1 when several GPUs are available.

  3. If the neural network is large, reduce the number of workers and devote more resources to the policy computation. (Config sketches for the first two regimes follow this list.)
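As a rough sketch of the two regimes above (placeholder values, not tuned recommendations):

# Slow or non-copyable env: sample-efficient off-policy algorithm, single worker.
dqn_style_overrides = {"num_workers": 0, "num_gpus": 1}

# Fast env / small model: time-efficient algorithm (PPO, IMPALA, APEX) with many workers.
ppo_style_overrides = {"num_workers": 16, "num_gpus": 1}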

Training

import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
algo_config = ppo.DEFAULT_CONFIG.copy()  # the full default config is listed at the end of these notes

# modify the config as needed
algo_config["num_gpus"] = 0
algo_config["num_workers"] = 3  # PPO can use several rollout workers
algo_config["num_cpus_per_worker"] = 1  # CPUs per worker
# the driver (which computes the policy update) gets 1 CPU by default

# the default framework is TensorFlow; switch to torch here if desired
algo_config["framework"] = "torch"

# set the train batch size
algo_config["train_batch_size"] = 4000

# build the trainer and pick the algorithm
trainer = ppo.PPOTrainer(env="CartPole-v0", config=algo_config)
# if there is a previously saved checkpoint, restore it first:
trainer.restore(path)

for i in range(1000):
   # Perform one iteration of training the policy with PPO
   result = trainer.train() 
   print(pretty_print(result))

   if i % 100 == 0:
       checkpoint = trainer.save()
       print("checkpoint saved at", checkpoint)
# use the trained trainer to compute actions
# (trainer.restore(path) must have been called first to load the checkpoint)
action = trainer.compute_action(obs)

callbacks and custom metrics

These callback functions are invoked at different points in time (before sampling starts, after each sample batch, before each policy update, and so on); see the RLlib source for the full list of hooks.

The functions below are all methods of the DefaultCallbacks class, so a custom callbacks class should use it as its parent class.
import ray
from ray.rllib.agents.callbacks import DefaultCallbacks

class MyCallbacks(DefaultCallbacks):
    # override whichever per-phase hooks you need (signatures listed below)
    pass

def initDefaultCallbacks(logprint=False, record=False):
    """Return a DefaultCallbacks subclass with the selected hooks attached."""
    callback_cls = MyCallbacks

    # attach only the callback functions that are actually needed
    if logprint:
        def on_train_result(self, *, trainer, result: dict, **kwargs):
            print("trainer.train() result: {} -> {} episodes".format(
                trainer, result["episodes_this_iter"]))
        callback_cls.on_train_result = on_train_result

    if record:
        pass  # attach recording hooks here in the same way

    # return the DefaultCallbacks subclass itself (RLlib instantiates it)
    return callback_cls

Then point the trainer or Tune config at it:
'callbacks': initDefaultCallbacks(logprint=True)

Per-phase hook functions

#################### when a sub-environment is created
def on_sub_environment_created(
        self,
        *,
        worker: "RolloutWorker",
        sub_environment: EnvType,
        env_context: EnvContext,
        **kwargs,
    ) -> None:
    

#################### when a new trainer is created
def on_trainer_init(
    self,
    *,
    trainer: "Trainer",
    **kwargs,
) -> None:
############### at the start of each episode
def on_episode_start(
   self,
   *,
   worker: "RolloutWorker",
   base_env: BaseEnv,
   policies: Dict[PolicyID, Policy],
   episode: Episode,
   **kwargs,
) -> None:

# Here we can set up variables of our own before each episode starts;
# the episode.user_data dict and episode.custom_metrics can be used to store
# temporary custom values (a complete example follows the signature listing below).

####################### at every step
def on_episode_step(
    self,
    *,
    worker: "RolloutWorker",
    base_env: BaseEnv,
    policies: Optional[Dict[PolicyID, Policy]] = None,
    episode: Episode,
    **kwargs,
) -> None:
# collect whatever per-step values are needed

####################### after each episode
def on_episode_end(
    self,
    *,
    worker: "RolloutWorker",
    base_env: BaseEnv,
    policies: Dict[PolicyID, Policy],
    episode: Episode,
    **kwargs,
) -> None:

######################## after each call to sample()
def on_sample_end(
    self, *, worker: "RolloutWorker", samples: SampleBatch, **kwargs
) -> None:
######################### before learning on a batch
def on_learn_on_batch(
    self, *, policy: Policy, train_batch: SampleBatch, result: dict, **kwargs
) -> None:

########################## after training (once per train() result)
def on_train_result(self, *, trainer: "Trainer", result: dict, **kwargs) -> None:
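Putting the episode hooks together, here is a minimal sketch (assuming the same ray.rllib.agents.callbacks module path as above and a Box observation whose first component we want to track; both are assumptions):

import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks

class CustomMetricsCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env, policies, episode, **kwargs):
        # temporary per-episode storage
        episode.user_data["obs0"] = []

    def on_episode_step(self, *, worker, base_env, policies=None, episode, **kwargs):
        obs = episode.last_observation_for()
        if obs is not None:
            episode.user_data["obs0"].append(obs[0])

    def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
        # values put into custom_metrics show up (mean/min/max) in the train results
        episode.custom_metrics["obs0_mean"] = float(np.mean(episode.user_data["obs0"]))

# usage: algo_config["callbacks"] = CustomMetricsCallbacks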

Exploration settings


# All of the following configs go into Trainer.config.

# 1) Switching *off* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its `explore`
# param will result in no exploration.
# However, explicitly calling `compute_action(s)` with `explore=True` will
# still(!) result in exploration (per-call overrides default).
"explore": False,

# 2) Switching *on* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its
# explore param will result in exploration.
# However, explicitly calling `compute_action(s)` with `explore=False`
# will result in no(!) exploration (per-call overrides default).
"explore": True,

# 3) Example exploration_config usages:
# a) DQN: see rllib/agents/dqn/dqn.py
"explore": True,
"exploration_config": {
   # Exploration sub-class by name or full path to module+class
   # (e.g. "ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy")
   "type": "EpsilonGreedy",
   # Parameters for the Exploration class' constructor:
   "initial_epsilon": 1.0,
   "final_epsilon": 0.02,
   "epsilon_timesteps": 10000,  # Timesteps over which to anneal epsilon.
},

# b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead:
"explore": True,
"exploration_config": {
   "type": "SoftQ",
   # Parameters for the Exploration class' constructor:
   "temperature": 1.0,
},

# c) All policy-gradient algos and SAC: see rllib/agents/trainer.py
# Behavior: The algo samples stochastically from the
# model-parameterized distribution. This is the global Trainer default
# setting defined in trainer.py and used by all PG-type algos (plus SAC).
"explore": True,
"exploration_config": {
   "type": "StochasticSampling",
   "random_timesteps": 0,  # timesteps at beginning, over which to act uniformly randomly
},

customized evaluation during training

# Custom evaluation can be run in the middle of training.
# Define the evaluation frequency and duration in the config:
# Run one evaluation step on every 3rd `Trainer.train()` call.
{
    "evaluation_interval": 3,
}


# Every time we do run an evaluation step, run it for exactly 10 episodes.
{
    "evaluation_duration": 10,
    "evaluation_duration_unit": "episodes",
}
# Every time we do run an evaluation step, run it for close to 200 timesteps.
{
    "evaluation_duration": 200,
    "evaluation_duration_unit": "timesteps",
}

# Whether evaluation should use exploration can also be configured:
# Switching off exploration behavior for evaluation workers
# (see rllib/agents/trainer.py). Use any keys in this sub-dict that are
# also supported in the main Trainer config.
"evaluation_config": {
   "explore": False
}

In addition, you can define your own evaluation function when needed; a sketch follows.
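A hedged sketch of such a function, following the custom_eval_function signature described in the config reference at the end of these notes ((trainer, eval_workers) -> metrics dict). It assumes evaluation_num_workers > 0; the metrics helpers are the ones used in RLlib's own custom-evaluation example, so treat the exact imports as an assumption:

import ray
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes

def my_eval_fn(trainer, eval_workers):
    # Let every evaluation worker collect one round of samples and wait for them.
    ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    episodes, _ = collect_episodes(
        remote_workers=eval_workers.remote_workers(), timeout_seconds=600)
    metrics = summarize_episodes(episodes)
    metrics["my_custom_flag"] = 1.0  # add anything extra you want to report
    return metrics

# in the config:
# "custom_eval_function": my_eval_fn,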

curriculum learning

RLlib also supports curriculum learning, i.e. an easy-to-hard training schedule. (This part is still to be filled in; a sketch of the env_task_fn hook follows.)
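A hedged sketch based on the env_task_fn key in the config reference below: the env must be a TaskSettableEnv subclass, and the function maps the latest train results to a new task level (the reward threshold here is an arbitrary assumption):

def curriculum_fn(train_results, task_settable_env, env_ctx):
    # Move to the next (harder) task once the agent does well enough on the current one.
    current_task = task_settable_env.get_task()
    if train_results["episode_reward_mean"] > 200.0:
        return current_task + 1
    return current_task

# in the config:
# "env_task_fn": curriculum_fn,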

external agents and applications

Sometimes the environment interacts with the outside world and is therefore not stepped by RLlib itself. In that case we have to use the external application API, sketched below.
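For reference, a minimal client-side sketch using RLlib's PolicyClient; the server address and the my_external_system_* calls are hypothetical placeholders for your own application, and a matching policy server must already be running:

from ray.rllib.env.policy_client import PolicyClient

client = PolicyClient("http://localhost:9900", inference_mode="remote")

obs = my_external_system_reset()        # hypothetical: your application produces observations
episode_id = client.start_episode()
done = False
while not done:
    action = client.get_action(episode_id, obs)
    obs, reward, done = my_external_system_step(action)  # hypothetical
    client.log_returns(episode_id, reward)
client.end_episode(episode_id, obs)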

Algorithms

RLlib includes the common reinforcement learning algorithms; see the linked documentation for the full list.

user guide

Models, preprocessors and action distributions

The following describes how data flows through RLlib.
(Figure: dataflow in RLlib)

Built-in preprocessors

Different observation spaces are preprocessed differently.

  1. A Discrete obs space is automatically one-hot encoded, e.g. Discrete(3) with value 1 -> [0, 1, 0].
  2. A MultiDiscrete obs is one-hot encoded per element and the results are concatenated, e.g. MultiDiscrete([3, 4]) with value [1, 3] -> [0, 1, 0, 0, 0, 0, 1].
  3. Tuple and Dict spaces are also flattened automatically, but the original observation stays accessible: input_dict["obs"] holds the unflattened observation and input_dict["obs_flat"] the flattened one. Inside a policy's loss function the original structure can be recovered with dict_or_tuple_obs = restore_original_dimensions(input_dict["obs"], self.obs_space, "tf|torch").

Atari games

The default preprocessor is the DeepMind preprocessor (preprocessor_pref=deepmind in the config).
Alternatively set preprocessor_pref=rllib; the frame size dim x dim can then be set in the model config, along with grayscale=True and zero_mean=True (scale values to -1.0 ~ 1.0 instead of 0 ~ 1.0). A config sketch follows.
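The corresponding config keys, as a sketch (the values simply restate the options above; defaults are listed in the model config reference below):

atari_config = {
    "preprocessor_pref": "rllib",   # or "deepmind" (the default)
    "model": {
        "dim": 84,                  # resize frames to dim x dim
        "grayscale": True,
        "zero_mean": True,          # scale to [-1.0, 1.0] instead of [0.0, 1.0]
    },
}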

Default model config

Fully connected layers, convolution layers, RNNs and so on can be configured through the ModelConfigDict.

MODEL_DEFAULTS: ModelConfigDict = {
    # Experimental flag.
    # If True, try to use a native (tf.keras.Model or torch.Module) default
    # model instead of our built-in ModelV2 defaults.
    # If False (default), use "classic" ModelV2 default models.
    # Note that this currently only works for:
    # 1) framework != torch AND
    # 2) fully connected and CNN default networks as well as
    # auto-wrapped LSTM- and attention nets.
    "_use_default_native_models": False,
    # Experimental flag.
    # If True, user specified no preprocessor to be created
    # (via config._disable_preprocessor_api=True). If True, observations
    # will arrive in model as they are returned by the env.
    "_disable_preprocessor_api": False,
    # Experimental flag.
    # If True, RLlib will no longer flatten the policy-computed actions into
    # a single tensor (for storage in SampleCollectors/output files/etc..),
    # but leave (possibly nested) actions as-is. Disabling flattening affects:
    # - SampleCollectors: Have to store possibly nested action structs.
    # - Models that have the previous action(s) as part of their input.
    # - Algorithms reading from offline files (incl. action information).
    "_disable_action_flattening": False,

    # === Built-in options ===
    # FullyConnectedNetwork (tf and torch): rllib.models.tf|torch.fcnet.py
    # These are used if no custom model is specified and the input space is 1D.
    # Number of hidden layers to be used.
    "fcnet_hiddens": [256, 256],
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "fcnet_activation": "tanh",

    # VisionNetwork (tf and torch): rllib.models.tf|torch.visionnet.py
    # These are used if no custom model is specified and the input space is 2D.
    # Filter config: List of [out_channels, kernel, stride] for each filter.
    # Example:
    # Use None for making RLlib try to find a default filter setup given the
    # observation space.
    "conv_filters": None,
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "conv_activation": "relu",

    # Some default models support a final FC stack of n Dense layers with given
    # activation:
    # - Complex observation spaces: Image components are fed through
    #   VisionNets, flat Boxes are left as-is, Discrete are one-hot'd, then
    #   everything is concated and pushed through this final FC stack.
    # - VisionNets (CNNs), e.g. after the CNN stack, there may be
    #   additional Dense layers.
    # - FullyConnectedNetworks will have this additional FCStack as well
    # (that's why it's empty by default).
    "post_fcnet_hiddens": [],
    "post_fcnet_activation": "relu",

    # For DiagGaussian action distributions, make the second half of the model
    # outputs floating bias variables instead of state-dependent. This only
    # has an effect is using the default fully connected net.
    "free_log_std": False,
    # Whether to skip the final linear layer used to resize the hidden layer
    # outputs to size `num_outputs`. If True, then the last hidden layer
    # should already match num_outputs.
    "no_final_linear": False,
    # Whether layers should be shared for the value function.
    "vf_share_layers": True,

    # == LSTM ==
    # Whether to wrap the model with an LSTM.
    "use_lstm": False,
    # Max seq len for training the LSTM, defaults to 20.
    "max_seq_len": 20,
    # Size of the LSTM cell.
    "lstm_cell_size": 256,
    # Whether to feed a_{t-1} to LSTM (one-hot encoded if discrete).
    "lstm_use_prev_action": False,
    # Whether to feed r_{t-1} to LSTM.
    "lstm_use_prev_reward": False,
    # Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
    "_time_major": False,

    # == Attention Nets (experimental: torch-version is untested) ==
    # Whether to use a GTrXL ("Gru transformer XL"; attention net) as the
    # wrapper Model around the default Model.
    "use_attention": False,
    # The number of transformer units within GTrXL.
    # A transformer unit in GTrXL consists of a) MultiHeadAttention module and
    # b) a position-wise MLP.
    "attention_num_transformer_units": 1,
    # The input and output size of each transformer unit.
    "attention_dim": 64,
    # The number of attention heads within the MultiHeadAttention units.
    "attention_num_heads": 1,
    # The dim of a single head (within the MultiHeadAttention units).
    "attention_head_dim": 32,
    # The memory sizes for inference and training.
    "attention_memory_inference": 50,
    "attention_memory_training": 50,
    # The output dim of the position-wise MLP.
    "attention_position_wise_mlp_dim": 32,
    # The initial bias values for the 2 GRU gates within a transformer unit.
    "attention_init_gru_gate_bias": 2.0,
    # Whether to feed a_{t-n:t-1} to GTrXL (one-hot encoded if discrete).
    "attention_use_n_prev_actions": 0,
    # Whether to feed r_{t-n:t-1} to GTrXL.
    "attention_use_n_prev_rewards": 0,

    # == Atari ==
    # Set to True to enable 4x stacking behavior.
    "framestack": True,
    # Final resized frame dimension
    "dim": 84,
    # (deprecated) Converts ATARI frame to 1 Channel Grayscale image
    "grayscale": False,
    # (deprecated) Changes frame to range from [-1, 1] if true
    "zero_mean": True,

    # === Options for custom models ===
    # Name of a custom model to use
    "custom_model": None,
    # Extra options to pass to the custom classes. These will be available to
    # the Model's constructor in the model_config field. Also, they will be
    # attempted to be passed as **kwargs to ModelV2 models. For an example,
    # see rllib/models/[tf|torch]/attention_net.py.
    "custom_model_config": {},
    # Name of a custom action distribution to use.
    "custom_action_dist": None,
    # Custom preprocessors are deprecated. Please use a wrapper class around
    # your environment instead to preprocess observations.
    "custom_preprocessor": None,

    # Deprecated keys:
    # Use `lstm_use_prev_action` or `lstm_use_prev_reward` instead.
    "lstm_use_prev_action_reward": DEPRECATED_VALUE,
}

In the trainer config, model-related settings are passed through the model sub-dict:
algo_config = {
    # All model-related settings go into this sub-dict.
    "model": {
        # By default, the MODEL_DEFAULTS dict above will be used.

        # Change individual keys in that dict by overriding them, e.g.
        "fcnet_hiddens": [512, 512, 512],
        "fcnet_activation": "relu",
    },

    # ... other Trainer config keys, e.g. "lr" ...
    "lr": 0.00001,
}

Custom preprocessors and models

Custom preprocessors have been deprecated in RLlib; instead, do the preprocessing in gym wrapper classes around the environment.

import gym
import numpy as np
from ray.rllib.utils.numpy import one_hot

class OneHotEnv(gym.core.ObservationWrapper):
    # Override `observation` to custom process the original observation
    # coming from the env.
    def observation(self, observation):
        # E.g. one-hotting a float obs [0.0, 5.0[.
        return one_hot(observation, depth=5)


class ClipRewardEnv(gym.core.RewardWrapper):
    def __init__(self, env, min_, max_):
        super().__init__(env)
        self.min = min_
        self.max = max_

    # Override `reward` to custom process the original reward coming
    # from the env.
    def reward(self, reward):
        # E.g. simple clipping between min and max.
        return np.clip(reward, self.min, self.max)

A custom model can also be defined. (Still to be filled in; a sketch follows.)
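As a placeholder for that missing part, here is a minimal sketch of a custom PyTorch model registered with the ModelCatalog (the layer sizes and the model name are arbitrary assumptions):

import numpy as np
import torch.nn as nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

class MyTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        hidden = 128
        self._body = nn.Sequential(
            nn.Linear(int(np.prod(obs_space.shape)), hidden), nn.ReLU())
        self._logits = nn.Linear(hidden, num_outputs)
        self._value_branch = nn.Linear(hidden, 1)
        self._features = None

    def forward(self, input_dict, state, seq_lens):
        self._features = self._body(input_dict["obs_flat"].float())
        return self._logits(self._features), state

    def value_function(self):
        return self._value_branch(self._features).squeeze(1)

ModelCatalog.register_custom_model("my_torch_model", MyTorchModel)
# then in the trainer config:
# "model": {"custom_model": "my_torch_model", "custom_model_config": {}}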

supervised model losses

Expert experience can be brought in through imitation learning, e.g. by adding a supervised loss on expert data.

Hyperparameter tuning with Tune

import ray
from ray import tune
import time

ray.shutdown()
ray.init(ignore_reinit_error=True)

config = {
    "env": "CartPole-v0",   # or the name of a registered custom env
    "num_workers": 10,
    "framework": "torch",
    "num_gpus": 1,

    "vf_share_layers": tune.grid_search([True, False]),  # tune.grid_search runs one trial per value
    "lr": tune.grid_search([1e-4, 1e-5, 1e-6]),
}

stop_config ={ "timesteps_total":10000 }

result = tune.run(
	"PPO",
	stop = stop_config,
	config=config
)


Learning resources

Ray Rllib API summary
Ray Rllib example

Parameters (default COMMON_CONFIG)

COMMON_CONFIG: TrainerConfigDict = {
    # === Settings for Rollout Worker processes ===
    # Number of rollout worker actors to create for parallel sampling. Setting
    # this to 0 will force rollouts to be done in the trainer actor.
    "num_workers": 2,
    # Number of environments to evaluate vector-wise per worker. This enables
    # model inference batching, which can improve performance for inference
    # bottlenecked workloads.
    "num_envs_per_worker": 1,
    # When `num_workers` > 0, the driver (local_worker; worker-idx=0) does not
    # need an environment. This is because it doesn't have to sample (done by
    # remote_workers; worker_indices > 0) nor evaluate (done by evaluation
    # workers; see below).
    "create_env_on_driver": False,
    # Divide episodes into fragments of this many steps each during rollouts.
    # Sample batches of this size are collected from rollout workers and
    # combined into a larger batch of `train_batch_size` for learning.
    #
    # For example, given rollout_fragment_length=100 and train_batch_size=1000:
    #   1. RLlib collects 10 fragments of 100 steps each from rollout workers.
    #   2. These fragments are concatenated and we perform an epoch of SGD.
    #
    # When using multiple envs per worker, the fragment size is multiplied by
    # `num_envs_per_worker`. This is since we are collecting steps from
    # multiple envs in parallel. For example, if num_envs_per_worker=5, then
    # rollout workers will return experiences in chunks of 5*100 = 500 steps.
    #
    # The dataflow here can vary per algorithm. For example, PPO further
    # divides the train batch into minibatches for multi-epoch SGD.
    "rollout_fragment_length": 200,
    # How to build per-Sampler (RolloutWorker) batches, which are then
    # usually concat'd to form the train batch. Note that "steps" below can
    # mean different things (either env- or agent-steps) and depends on the
    # `count_steps_by` (multiagent) setting below.
    # truncate_episodes: Each produced batch (when calling
    #   RolloutWorker.sample()) will contain exactly `rollout_fragment_length`
    #   steps. This mode guarantees evenly sized batches, but increases
    #   variance as the future return must now be estimated at truncation
    #   boundaries.
    # complete_episodes: Each unroll happens exactly over one episode, from
    #   beginning to end. Data collection will not stop unless the episode
    #   terminates or a configured horizon (hard or soft) is hit.
    # With truncate_episodes, an update does not require complete episodes; the
    # batch size determines how much experience goes into each update.
    # With complete_episodes, every update uses whole episodes and the batch size
    # is the minimum number of samples (it determines how many episodes go into
    # each update).
    "batch_mode": "truncate_episodes",  

    # === Settings for the Trainer process ===
    # Discount factor of the MDP.
    "gamma": 0.99,
    # The default learning rate.
    "lr": 0.0001,
    # Training batch size, if applicable. Should be >= rollout_fragment_length.
    # Samples batches will be concatenated together to a batch of this size,
    # which is then passed to SGD.
    "train_batch_size": 200,
    # Arguments to pass to the policy model. See models/catalog.py for a full
    # list of the available model options.
    "model": MODEL_DEFAULTS,
    # Arguments to pass to the policy optimizer. These vary by optimizer.
    "optimizer": {},

    # === Environment Settings ===
    # Number of steps after which the episode is forced to terminate. Defaults
    # to `env.spec.max_episode_steps` (if present) for Gym envs.
    "horizon": None,
    # Calculate rewards but don't reset the environment when the horizon is
    # hit. This allows value estimation and RNN state to span across logical
    # episodes denoted by horizon. This only has an effect if horizon != inf.
    "soft_horizon": False,
    # Don't set 'done' at the end of the episode.
    # In combination with `soft_horizon`, this works as follows:
    # - no_done_at_end=False soft_horizon=False:
    #   Reset env and add `done=True` at end of each episode.
    # - no_done_at_end=True soft_horizon=False:
    #   Reset env, but do NOT add `done=True` at end of the episode.
    # - no_done_at_end=False soft_horizon=True:
    #   Do NOT reset env at horizon, but add `done=True` at the horizon
    #   (pretending the episode has terminated).
    # - no_done_at_end=True soft_horizon=True:
    #   Do NOT reset env at horizon and do NOT add `done=True` at the horizon.
    "no_done_at_end": False,
    # The environment specifier:
    # This can either be a tune-registered env, via
    # `tune.register_env([name], lambda env_ctx: [env object])`,
    # or a string specifier of an RLlib supported type. In the latter case,
    # RLlib will try to interpret the specifier as either an openAI gym env,
    # a PyBullet env, a ViZDoomGym env, or a fully qualified classpath to an
    # Env class, e.g. "ray.rllib.examples.env.random_env.RandomEnv".
    "env": None,
    # The observation- and action spaces for the Policies of this Trainer.
    # Use None for automatically inferring these from the given env.
    "observation_space": None,
    "action_space": None,
    # Arguments dict passed to the env creator as an EnvContext object (which
    # is a dict plus the properties: num_workers, worker_index, vector_index,
    # and remote).
    "env_config": {},
    # If using num_envs_per_worker > 1, whether to create those new envs in
    # remote processes instead of in the same worker. This adds overheads, but
    # can make sense if your envs can take much time to step / reset
    # (e.g., for StarCraft). Use this cautiously; overheads are significant.
    "remote_worker_envs": False,
    # Timeout that remote workers are waiting when polling environments.
    # 0 (continue when at least one env is ready) is a reasonable default,
    # but optimal value could be obtained by measuring your environment
    # step / reset and model inference perf.
    "remote_env_batch_wait_ms": 0,
    # A callable taking the last train results, the base env and the env
    # context as args and returning a new task to set the env to.
    # The env must be a `TaskSettableEnv` sub-class for this to work.
    # See `examples/curriculum_learning.py` for an example.
    "env_task_fn": None,
    # If True, try to render the environment on the local worker or on worker
    # 1 (if num_workers > 0). For vectorized envs, this usually means that only
    # the first sub-environment will be rendered.
    # In order for this to work, your env will have to implement the
    # `render()` method which either:
    # a) handles window generation and rendering itself (returning True) or
    # b) returns a numpy uint8 image of shape [height x width x 3 (RGB)].
    "render_env": False,
    # If True, stores videos in this relative directory inside the default
    # output dir (~/ray_results/...). Alternatively, you can specify an
    # absolute path (str), in which the env recordings should be
    # stored instead.
    # Set to False for not recording anything.
    # Note: This setting replaces the deprecated `monitor` key.
    "record_env": False,
    # Whether to clip rewards during Policy's postprocessing.
    # None (default): Clip for Atari only (r=sign(r)).
    # True: r=sign(r): Fixed rewards -1.0, 1.0, or 0.0.
    # False: Never clip.
    # [float value]: Clip at -value and + value.
    # Tuple[value1, value2]: Clip at value1 and value2.
    "clip_rewards": None,
    # If True, RLlib will learn entirely inside a normalized action space
    # (0.0 centered with small stddev; only affecting Box components).
    # We will unsquash actions (and clip, just in case) to the bounds of
    # the env's action space before sending actions back to the env.
    "normalize_actions": True,
    # If True, RLlib will clip actions according to the env's bounds
    # before sending them back to the env.
    # TODO: (sven) This option should be obsoleted and always be False.
    "clip_actions": False,
    # Whether to use "rllib" or "deepmind" preprocessors by default
    # Set to None for using no preprocessor. In this case, the model will have
    # to handle possibly complex observations from the environment.
    "preprocessor_pref": "deepmind",

    # === Debug Settings ===
    # Set the ray.rllib.* log level for the agent process and its workers.
    # Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level will also
    # periodically print out summaries of relevant internal dataflow (this is
    # also printed out once at startup at the INFO level). When using the
    # `rllib train` command, you can also use the `-v` and `-vv` flags as
    # shorthand for INFO and DEBUG.
    "log_level": "WARN",
    # Callbacks that will be run during various phases of training. See the
    # `DefaultCallbacks` class and `examples/custom_metrics_and_callbacks.py`
    # for more usage information.
    "callbacks": DefaultCallbacks,
    # Whether to attempt to continue training if a worker crashes. The number
    # of currently healthy workers is reported as the "num_healthy_workers"
    # metric.
    "ignore_worker_failures": False,
    # Whether - upon a worker failure - RLlib will try to recreate the lost worker as
    # an identical copy of the failed one. The new worker will only differ from the
    # failed one in its `self.recreated_worker=True` property value. It will have
    # the same `worker_index` as the original one.
    # If True, the `ignore_worker_failures` setting will be ignored.
    "recreate_failed_workers": False,
    # Log system resource metrics to results. This requires `psutil` to be
    # installed for sys stats, and `gputil` for GPU metrics.
    "log_sys_usage": True,
    # Use fake (infinite speed) sampler. For testing only.
    "fake_sampler": False,

    # === Deep Learning Framework Settings ===
    # tf: TensorFlow (static-graph)
    # tf2: TensorFlow 2.x (eager or traced, if eager_tracing=True)
    # tfe: TensorFlow eager (or traced, if eager_tracing=True)
    # torch: PyTorch
    "framework": "tf",
    # Enable tracing in eager mode. This greatly improves performance
    # (speedup ~2x), but makes it slightly harder to debug since Python
    # code won't be evaluated after the initial eager pass.
    # Only possible if framework=[tf2|tfe].
    "eager_tracing": False,
    # Maximum number of tf.function re-traces before a runtime error is raised.
    # This is to prevent unnoticed retraces of methods inside the
    # `..._eager_traced` Policy, which could slow down execution by a
    # factor of 4, without the user noticing what the root cause for this
    # slowdown could be.
    # Only necessary for framework=[tf2|tfe].
    # Set to None to ignore the re-trace count and never throw an error.
    "eager_max_retraces": 20,

    # === Exploration Settings ===
    # Default exploration behavior, iff `explore`=None is passed into
    # compute_action(s).
    # Set to False for no exploration behavior (e.g., for evaluation).
    "explore": True,
    # Provide a dict specifying the Exploration object's config.
    "exploration_config": {
        # The Exploration class to use. In the simplest case, this is the name
        # (str) of any class present in the `rllib.utils.exploration` package.
        # You can also provide the python class directly or the full location
        # of your class (e.g. "ray.rllib.utils.exploration.epsilon_greedy.
        # EpsilonGreedy").
        "type": "StochasticSampling",
        # Add constructor kwargs here (if any).
    },
    # === Evaluation Settings ===
    # Evaluate with every `evaluation_interval` training iterations.
    # The evaluation stats will be reported under the "evaluation" metric key.
    # Note that for Ape-X metrics are already only reported for the lowest
    # epsilon workers (least random workers).
    # Set to None (or 0) for no evaluation.
    "evaluation_interval": None,
    # Duration for which to run evaluation each `evaluation_interval`.
    # The unit for the duration can be set via `evaluation_duration_unit` to
    # either "episodes" (default) or "timesteps".
    # If using multiple evaluation workers (evaluation_num_workers > 1),
    # the load to run will be split amongst these.
    # If the value is "auto":
    # - For `evaluation_parallel_to_training=True`: Will run as many
    #   episodes/timesteps that fit into the (parallel) training step.
    # - For `evaluation_parallel_to_training=False`: Error.
    "evaluation_duration": 10,
    # The unit, with which to count the evaluation duration. Either "episodes"
    # (default) or "timesteps".
    "evaluation_duration_unit": "episodes",
    # Whether to run evaluation in parallel to a Trainer.train() call
    # using threading. Default=False.
    # E.g. evaluation_interval=2 -> For every other training iteration,
    # the Trainer.train() and Trainer.evaluate() calls run in parallel.
    # Note: This is experimental. Possible pitfalls could be race conditions
    # for weight synching at the beginning of the evaluation loop.
    "evaluation_parallel_to_training": False,
    # Internal flag that is set to True for evaluation workers.
    "in_evaluation": False,
    # Typical usage is to pass extra args to evaluation env creator
    # and to disable exploration by computing deterministic actions.
    # IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
    # policy, even if this is a stochastic one. Setting "explore=False" here
    # will result in the evaluation workers not using this optimal policy!
    "evaluation_config": {
        # Example: overriding env_config, exploration, etc:
        # "env_config": {...},
        # "explore": False
    },

    # === Replay Buffer Settings ===
    # Provide a dict specifying the ReplayBuffer's config.
    # "replay_buffer_config": {
    #     The ReplayBuffer class to use. Any class that obeys the
    #     ReplayBuffer API can be used here. In the simplest case, this is the
    #     name (str) of any class present in the `rllib.utils.replay_buffers`
    #     package. You can also provide the python class directly or the
    #     full location of your class (e.g.
    #     "ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer").
    #     "type": "ReplayBuffer",
    #     The capacity of units that can be stored in one ReplayBuffer
    #     instance before eviction.
    #     "capacity": 10000,
    #     Specifies how experiences are stored. Either 'sequences' or
    #     'timesteps'.
    #     "storage_unit": "timesteps",
    #     Add constructor kwargs here (if any).
    # },

    # Number of parallel workers to use for evaluation. Note that this is set
    # to zero by default, which means evaluation will be run in the trainer
    # process (only if evaluation_interval is not None). If you increase this,
    # it will increase the Ray resource usage of the trainer since evaluation
    # workers are created separately from rollout workers (used to sample data
    # for training).
    "evaluation_num_workers": 0,
    # Customize the evaluation method. This must be a function of signature
    # (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
    # Trainer.evaluate() method to see the default implementation.
    # The Trainer guarantees all eval workers have the latest policy state
    # before this function is called.
    "custom_eval_function": None,
    # Make sure the latest available evaluation results are always attached to
    # a step result dict.
    # This may be useful if Tune or some other meta controller needs access
    # to evaluation metrics all the time.
    "always_attach_evaluation_results": False,
    # Store raw custom metrics without calculating max, min, mean
    "keep_per_episode_custom_metrics": False,

    # === Advanced Rollout Settings ===
    # Use a background thread for sampling (slightly off-policy, usually not
    # advisable to turn on unless your env specifically requires it).
    "sample_async": False,

    # The SampleCollector class to be used to collect and retrieve
    # environment-, model-, and sampler data. Override the SampleCollector base
    # class to implement your own collection/buffering/retrieval logic.
    "sample_collector": SimpleListCollector,

    # Element-wise observation filter, either "NoFilter" or "MeanStdFilter".
    "observation_filter": "NoFilter",
    # Whether to synchronize the statistics of remote filters.
    "synchronize_filters": True,
    # Configures TF for single-process operation by default.
    "tf_session_args": {
        # note: overridden by `local_tf_session_args`
        "intra_op_parallelism_threads": 2,
        "inter_op_parallelism_threads": 2,
        "gpu_options": {
            "allow_growth": True,
        },
        "log_device_placement": False,
        "device_count": {
            "CPU": 1
        },
        # Required by multi-GPU (num_gpus > 1).
        "allow_soft_placement": True,
    },
    # Override the following tf session args on the local worker
    "local_tf_session_args": {
        # Allow a higher level of parallelism by default, but not unlimited
        # since that can cause crashes with many concurrent drivers.
        "intra_op_parallelism_threads": 8,
        "inter_op_parallelism_threads": 8,
    },
    # Whether to LZ4 compress individual observations.
    "compress_observations": False,
    # Wait for metric batches for at most this many seconds. Those that
    # have not returned in time will be collected in the next train iteration.
    "metrics_episode_collection_timeout_s": 180,
    # Smooth metrics over this many episodes.
    "metrics_num_episodes_for_smoothing": 100,
    # Minimum time interval to run one `train()` call for:
    # If - after one `step_attempt()`, this time limit has not been reached,
    # will perform n more `step_attempt()` calls until this minimum time has
    # been consumed. Set to None or 0 for no minimum time.
    "min_time_s_per_reporting": None,
    # Minimum train/sample timesteps to optimize for per `train()` call.
    # This value does not affect learning, only the length of train iterations.
    # If - after one `step_attempt()`, the timestep counts (sampling or
    # training) have not been reached, will perform n more `step_attempt()`
    # calls until the minimum timesteps have been executed.
    # Set to None or 0 for no minimum timesteps.
    "min_train_timesteps_per_reporting": None,
    "min_sample_timesteps_per_reporting": None,

    # This argument, in conjunction with worker_index, sets the random seed of
    # each worker, so that identically configured trials will have identical
    # results. This makes experiments reproducible.
    "seed": None,
    # Any extra python env vars to set in the trainer process, e.g.,
    # {"OMP_NUM_THREADS": "16"}
    "extra_python_environs_for_driver": {},
    # The extra python environments need to set for worker processes.
    "extra_python_environs_for_worker": {},

    # === Resource Settings ===
    # Number of GPUs to allocate to the trainer process. Note that not all
    # algorithms can take advantage of trainer GPUs. Support for multi-GPU
    # is currently only available for tf-[PPO/IMPALA/DQN/PG].
    # This can be fractional (e.g., 0.3 GPUs).
    "num_gpus": 0,
    # Set to True for debugging (multi-)?GPU functionality on a CPU machine.
    # GPU towers will be simulated by graphs located on CPUs in this case.
    # Use `num_gpus` to test for different numbers of fake GPUs.
    "_fake_gpus": False,
    # Number of CPUs to allocate per worker.
    "num_cpus_per_worker": 1,
    # Number of GPUs to allocate per worker. This can be fractional. This is
    # usually needed only if your env itself requires a GPU (i.e., it is a
    # GPU-intensive video game), or model inference is unusually expensive.
    "num_gpus_per_worker": 0,
    # Any custom Ray resources to allocate per worker.
    "custom_resources_per_worker": {},
    # Number of CPUs to allocate for the trainer. Note: this only takes effect
    # when running in Tune. Otherwise, the trainer runs in the main program.
    "num_cpus_for_driver": 1,
    # The strategy for the placement group factory returned by
    # `Trainer.default_resource_request()`. A PlacementGroup defines, which
    # devices (resources) should always be co-located on the same node.
    # For example, a Trainer with 2 rollout workers, running with
    # num_gpus=1 will request a placement group with the bundles:
    # [{"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}], where the first bundle is
    # for the driver and the other 2 bundles are for the two workers.
    # These bundles can now be "placed" on the same or different
    # nodes depending on the value of `placement_strategy`:
    # "PACK": Packs bundles into as few nodes as possible.
    # "SPREAD": Places bundles across distinct nodes as even as possible.
    # "STRICT_PACK": Packs bundles into one node. The group is not allowed
    #   to span multiple nodes.
    # "STRICT_SPREAD": Packs bundles across distinct nodes.
    "placement_strategy": "PACK",

    # TODO(jungong, sven): we can potentially unify all input types
    #     under input and input_config keys. E.g.
    #     input: sample
    #     input_config {
    #         env: Cartpole-v0
    #     }
    #     or:
    #     input: json_reader
    #     input_config {
    #         path: /tmp/
    #     }
    #     or:
    #     input: dataset
    #     input_config {
    #         format: parquet
    #         path: /tmp/
    #     }
    # === Offline Datasets ===
    # Specify how to generate experiences:
    #  - "sampler": Generate experiences via online (env) simulation (default).
    #  - A local directory or file glob expression (e.g., "/tmp/*.json").
    #  - A list of individual file paths/URIs (e.g., ["/tmp/1.json",
    #    "s3://bucket/2.json"]).
    #  - A dict with string keys and sampling probabilities as values (e.g.,
    #    {"sampler": 0.4, "/tmp/*.json": 0.4, "s3://bucket/expert.json": 0.2}).
    #  - A callable that takes an `IOContext` object as only arg and returns a
    #    ray.rllib.offline.InputReader.
    #  - A string key that indexes a callable with tune.registry.register_input
    "input": "sampler",
    # Arguments accessible from the IOContext for configuring custom input
    "input_config": {},
    # True, if the actions in a given offline "input" are already normalized
    # (between -1.0 and 1.0). This is usually the case when the offline
    # file has been generated by another RLlib algorithm (e.g. PPO or SAC),
    # while "normalize_actions" was set to True.
    "actions_in_input_normalized": False,
    # Specify how to evaluate the current policy. This only has an effect when
    # reading offline experiences ("input" is not "sampler").
    # Available options:
    #  - "wis": the weighted step-wise importance sampling estimator.
    #  - "is": the step-wise importance sampling estimator.
    #  - "simulation": run the environment in the background, but use
    #    this data for evaluation only and not for learning.
    "input_evaluation": ["is", "wis"],
    # Whether to run postprocess_trajectory() on the trajectory fragments from
    # offline inputs. Note that postprocessing will be done using the *current*
    # policy, not the *behavior* policy, which is typically undesirable for
    # on-policy algorithms.
    "postprocess_inputs": False,
    # If positive, input batches will be shuffled via a sliding window buffer
    # of this number of batches. Use this if the input data is not in random
    # enough order. Input is delayed until the shuffle buffer is filled.
    "shuffle_buffer_size": 0,
    # Specify where experiences should be saved:
    #  - None: don't save any experiences
    #  - "logdir" to save to the agent log dir
    #  - a path/URI to save to a custom output directory (e.g., "s3://bucket/")
    #  - a function that returns a rllib.offline.OutputWriter
    "output": None,
    # Arguments accessible from the IOContext for configuring custom output
    "output_config": {},
    # What sample batch columns to LZ4 compress in the output data.
    "output_compress_columns": ["obs", "new_obs"],
    # Max output file size (in bytes) before rolling over to a new file.
    "output_max_file_size": 64 * 1024 * 1024,

    # === Settings for Multi-Agent Environments ===
    "multiagent": {
        # Map of type MultiAgentPolicyConfigDict from policy ids to tuples
        # of (policy_cls, obs_space, act_space, config). This defines the
        # observation and action spaces of the policies and any extra config.
        "policies": {},
        # Keep this many policies in the "policy_map" (before writing
        # least-recently used ones to disk/S3).
        "policy_map_capacity": 100,
        # Where to store overflowing (least-recently used) policies?
        # Could be a directory (str) or an S3 location. None for using
        # the default output dir.
        "policy_map_cache": None,
        # Function mapping agent ids to policy ids.
        "policy_mapping_fn": None,
        # Determines those policies that should be updated.
        # Options are:
        # - None, for all policies.
        # - An iterable of PolicyIDs that should be updated.
        # - A callable, taking a PolicyID and a SampleBatch or MultiAgentBatch
        #   and returning a bool (indicating whether the given policy is trainable
        #   or not, given the particular batch). This allows you to have a policy
        #   trained only on certain data (e.g. when playing against a certain
        #   opponent).
        "policies_to_train": None,
        # Optional function that can be used to enhance the local agent
        # observations to include more state.
        # See rllib/evaluation/observation_function.py for more info.
        "observation_fn": None,
        # When replay_mode=lockstep, RLlib will replay all the agent
        # transitions at a particular timestep together in a batch. This allows
        # the policy to implement differentiable shared computations between
        # agents it controls at that timestep. When replay_mode=independent,
        # transitions are replayed independently per policy.
        "replay_mode": "independent",
        # Which metric to use as the "batch size" when building a
        # MultiAgentBatch. The two supported values are:
        # env_steps: Count each time the env is "stepped" (no matter how many
        #   multi-agent actions are passed/how many multi-agent observations
        #   have been returned in the previous step).
        # agent_steps: Count each individual agent step as one step.
        "count_steps_by": "env_steps",
    },

    # === Logger ===
    # Define logger-specific configuration to be used inside Logger
    # Default value None allows overwriting with nested dicts
    "logger_config": None,

    # === API deprecations/simplifications/changes ===
    # Experimental flag.
    # If True, TFPolicy will handle more than one loss/optimizer.
    # Set this to True, if you would like to return more than
    # one loss term from your `loss_fn` and an equal number of optimizers
    # from your `optimizer_fn`.
    # In the future, the default for this will be True.
    "_tf_policy_handles_more_than_one_loss": False,
    # Experimental flag.
    # If True, no (observation) preprocessor will be created and
    # observations will arrive in model as they are returned by the env.
    # In the future, the default for this will be True.
    "_disable_preprocessor_api": False,
    # Experimental flag.
    # If True, RLlib will no longer flatten the policy-computed actions into
    # a single tensor (for storage in SampleCollectors/output files/etc..),
    # but leave (possibly nested) actions as-is. Disabling flattening affects:
    # - SampleCollectors: Have to store possibly nested action structs.
    # - Models that have the previous action(s) as part of their input.
    # - Algorithms reading from offline files (incl. action information).
    "_disable_action_flattening": False,
    # Experimental flag.
    # If True, the execution plan API will not be used. Instead,
    # a Trainer's `training_iteration` method will be called as-is each
    # training iteration.
    "_disable_execution_plan_api": False,

    # If True, disable the environment pre-checking module.
    "disable_env_checking": False,

    # === Deprecated keys ===
    # Uses the sync samples optimizer instead of the multi-gpu one. This is
    # usually slower, but you might want to try it if you run into issues with
    # the default optimizer.
    # This will be set automatically from now on.
    "simple_optimizer": DEPRECATED_VALUE,
    # Whether to write episode stats and videos to the agent log dir. This is
    # typically located in ~/ray_results.
    "monitor": DEPRECATED_VALUE,
    # Replaced by `evaluation_duration=10` and
    # `evaluation_duration_unit=episodes`.
    "evaluation_num_episodes": DEPRECATED_VALUE,
    # Use `metrics_num_episodes_for_smoothing` instead.
    "metrics_smoothing_episodes": DEPRECATED_VALUE,
    # Use `min_[env|train]_timesteps_per_reporting` instead.
    "timesteps_per_iteration": 0,
    # Use `min_time_s_per_reporting` instead.
    "min_iter_time_s": DEPRECATED_VALUE,
    # Use `metrics_episode_collection_timeout_s` instead.
    "collect_metrics_timeout": DEPRECATED_VALUE,
}
# build the trainer
trainer = PPOTrainer(env="CartPole-v0", config={"train_batch_size": 4000,
                                                "env_config": {}})  # env_config is passed to the env class
# there is a separate trainer class for each algorithm
# when building a trainer, set the env and pass the parameters of the chosen RL algorithm