
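Ray is an open-source library developed by the RISELab at UC Berkeley. It supports distributed training for reinforcement learning, hyperparameter tuning (Tune), and custom reinforcement learning workflows (RLlib).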

Installing RLlib

pip install -U ray
pip install -U "ray[tune]"
pip install -U "ray[rllib]"



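We can create a trainer directly with RLlib and then design the training procedure ourselves, combining it with demonstrations, imitation learning, a custom experience replay, and so on. If we only need a standard model-free reinforcement learning setup, however, we can choose the algorithm through ray.tune, because Tune also handles hyperparameter optimization. For each algorithm parameter we can give Tune several candidate values, or have it sample values from a range. Tune then picks the parameter combination that maximizes the objective set by the programmer, so no manual tuning is needed and models are easier to manage. Tune is not limited to reinforcement learning; it can tune other AI models as well.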
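
To make the idea concrete, here is a minimal pure-Python sketch of what a grid search does conceptually: enumerate every combination of candidate values, score each one, and keep the best. It does not use Ray. The names expand_grid, pick_best, and the toy score function are hypothetical and only for illustration; Tune itself runs real training trials (and can run them in parallel) instead of calling a scoring function in a loop.

```python
import itertools


def expand_grid(search_space):
    """Yield one config dict per combination of candidate values."""
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        yield dict(zip(keys, values))


def pick_best(search_space, score):
    """Return the config with the highest score (like mode="max")."""
    return max(expand_grid(search_space), key=score)


def score(config):
    # Hypothetical stand-in for "train and report episode_reward_mean".
    penalty_lr = abs(config["lr"] - 1e-4)
    penalty_batch = 0.001 * config["train_batch_size"] / 4000
    return -penalty_lr - penalty_batch


space = {"lr": [1e-3, 1e-4, 1e-5], "train_batch_size": [400, 4000]}
best = pick_best(space, score)
print("number of trials:", len(list(expand_grid(space))))
print("best config:", best)
```

With three learning rates and two batch sizes the grid contains six trials, which is why grid searches grow quickly as more parameters are added.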

Basic Tune usage

Base algorithm + algorithm parameters + environment definition + stopping criteria

import ray
import ray.tune as tune

algo_config = {
    # Environment
    "env": "CartPole-v0",  # a custom name such as "my_env" must be registered first; see below
    "env_config": {},      # passed to the environment when it is created
    # Model
    "model": {
        # CNN: [[output_channel, kernel, stride], ...], e.g. [[16, [4, 4], 2], [128, [6, 6], 3]]
        "conv_filters": [],
        # Fully connected layers
        "fcnet_hiddens": [256, 256],
        # Post-FC stack.
        # Sometimes the network input is a complex type: matrix + vector.
        # We want the matrix to go through the CNN, be merged with the
        # vector, and then pass through fully connected layers.
        # In that case, set the fcnet to None and use the post-FC stack.
        "post_fcnet_hiddens": [],  # e.g. [256, 256]
        "post_fcnet_activation": "linear",  # or "relu"
        # Whether value function and policy share layers; set True or False
        "vf_share_layers": True,
        # LSTM settings
        # Whether to wrap the model with an LSTM.
        "use_lstm": False,
        # Max seq len for training the LSTM, defaults to 20.
        "max_seq_len": 20,
        # Size of the LSTM cell.
        "lstm_cell_size": 256,
        # Whether to feed a_{t-1} to LSTM (one-hot encoded if discrete).
        "lstm_use_prev_action": False,
        # Whether to feed r_{t-1} to LSTM.
        "lstm_use_prev_reward": False,
        # Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
        "_time_major": False,
        # Preprocessor, attention, and action options can also be set; see below.
    },
    # Learning parameters
    "lr": tune.grid_search([0.0001, 0.005]),  # one experiment per learning rate
    # Parameters that are not set fall back to their default values.
    # Training batch
    "rollout_fragment_length": 200,
    "train_batch_size": 400,
    "batch_mode": "truncate_episodes",  # or "complete_episodes"
}

analysis = tune.run(
    "PPO",  # algorithm name; PPO is used here only as an example
    config=algo_config,
    stop={
        # Training stops as soon as either condition is met.
        # The keys refer to fields of result = trainer.train().
        "episode_reward_mean": 100,
        "timesteps_total": 4000,
    },
)

print("best config: ", analysis.get_best_config(metric="episode_reward_mean", mode="max"))
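
The two batch settings in the config above interact. According to the RLlib config comments reproduced later in this post, rollout workers return fragments of rollout_fragment_length steps (multiplied by num_envs_per_worker), and these fragments are concatenated until train_batch_size is reached. A rough sketch of that arithmetic follows; the helper name fragments_per_train_batch is hypothetical, and rounding up to whole fragments is an assumption.

```python
import math


def fragments_per_train_batch(train_batch_size, rollout_fragment_length,
                              num_envs_per_worker=1):
    """Estimate how many worker fragments are needed for one train batch."""
    # Each sample() call returns one fragment per sub-environment.
    chunk = rollout_fragment_length * num_envs_per_worker
    return math.ceil(train_batch_size / chunk)


# Values from the config above: 400 / 200.
print(fragments_per_train_batch(400, 200))
# Example from the RLlib comments: 1000 / 100.
print(fragments_per_train_batch(1000, 100))
# Same, but with 5 envs per worker: chunks of 500 steps.
print(fragments_per_train_batch(1000, 100, num_envs_per_worker=5))
```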


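When training with reinforcement learning we may run several stages or kinds of training, or run several tasks (possibly of different types) at the same time. We therefore need to be able to define the training process ourselves.

RLlib basic framework

The RLlib framework consists of two parts: the trainer and the rollout workers.

  1. The trainer contains the experience replay pool and the definition and update of the policy.
  2. The number of workers can be chosen as needed. Each worker gets its actions from the policy and collects experience either synchronously or asynchronously. The collected experience is sent to the trainer, where the policy is updated from it.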
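
The division of labor described above can be sketched as a toy loop in plain Python. This is not RLlib code: the classes ToyRolloutWorker and ToyTrainer, the single-number "policy", and the update rule are all made up for illustration, and the sampling here is synchronous.

```python
import random


class ToyRolloutWorker:
    """Collects experience using a copy of the trainer's policy weights."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.weights = None

    def set_weights(self, weights):
        self.weights = dict(weights)

    def sample(self, num_steps):
        # One "experience" is (action, reward); a real worker steps an env.
        batch = []
        for _ in range(num_steps):
            action = 1 if self.rng.random() < self.weights["p_action_1"] else 0
            reward = 1.0 if action == 1 else 0.0
            batch.append((action, reward))
        return batch


class ToyTrainer:
    """Owns the policy, gathers experience from workers, updates the policy."""

    def __init__(self, num_workers):
        self.weights = {"p_action_1": 0.5}
        self.workers = [ToyRolloutWorker(seed=i) for i in range(num_workers)]

    def train(self, steps_per_worker=50):
        batch = []
        for worker in self.workers:
            worker.set_weights(self.weights)               # 1) broadcast policy
            batch.extend(worker.sample(steps_per_worker))  # 2) gather experience
        mean_reward = sum(r for _, r in batch) / len(batch)
        # 3) toy update: shift probability toward the rewarded action
        self.weights["p_action_1"] = min(1.0, self.weights["p_action_1"] + 0.1)
        return {"num_samples": len(batch), "mean_reward": mean_reward}


toy_trainer = ToyTrainer(num_workers=2)
toy_result = toy_trainer.train()
print(toy_result)
```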



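The environment should follow the gym.Env interface.

It must define observation_space and action_space, which RLlib later uses to build the inputs of the policy's neural network.

# The env must follow the gym.Env interface
import gym

class MyEnv(gym.Env):
    def __init__(self, env_config):
        # Take all parameters from the env_config dict only, unless the env
        # is created through register_env, in which case env_creator must
        # pass matching arguments.
        self.action_space = <gym.Space>
        self.observation_space = <gym.Space>
    def reset(self):
        return <obs>
    def step(self, action):
        return <obs>, <reward: float>, <done: bool>, <info: dict>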

gym.observation_space configuration


import gym, ray
from ray.rllib.algorithms import ppo

algo = ppo.PPO(env=MyEnv, config={
    "env_config": {},  # config to pass to env class
})


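We can register a custom environment so that RLlib recognizes it by a string name, which can then be used directly.
Note that the gym registry is not fully compatible with Ray, so please register the environment with Ray's register_env.

from ray.tune.registry import register_env

def env_creator(env_config):  # env_config is the dict passed as "env_config" when the trainer is built
    return MyEnv(env_config)  # return an env instance

register_env("my_env", env_creator)  # arguments: the environment name and the function that creates an instance
algo = ppo.PPO(env="my_env", config={
    "env_config": {},  # passed on to env_creator
})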


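Sometimes a single worker needs to learn from several environments, so several environments have to be defined together.

When num_envs_per_worker > 1, each worker holds several environments. Use env_config.worker_index and env_config.vector_index to obtain the worker id and the env id.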

class MultiEnv(gym.Env):
    def __init__(self, env_config):
        # pick actual env based on worker and env indexes
        self.env = gym.make(
            choose_env_for(env_config.worker_index, env_config.vector_index))
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space
    def reset(self):
        return self.env.reset()
    def step(self, action):
        return self.env.step(action)

register_env("multienv", lambda config: MultiEnv(config))
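
The helper choose_env_for is not defined in the snippet above. One possible implementation, purely as an illustration, cycles through a list of environment names so that different workers and sub-environments receive different tasks. The environment names and the assignment rule are assumptions; in practice the chosen environments need compatible observation and action spaces if they share one policy.

```python
# Hypothetical list of interchangeable environment ids.
ENV_NAMES = ["CartPole-v0", "CartPole-v1"]


def choose_env_for(worker_index, vector_index):
    """Map a (worker, sub-env) pair to an environment name, round-robin."""
    return ENV_NAMES[(worker_index + vector_index) % len(ENV_NAMES)]


# Remote workers are indexed from 1; index 0 is the local worker.
for worker_index in range(1, 3):
    for vector_index in range(2):
        print(worker_index, vector_index,
              choose_env_for(worker_index, vector_index))
```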




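  1. To create several envs inside each worker, set {"num_envs_per_worker": M}.
  2. To create several workers, set {"num_workers": N}.
    Without extra workers, a single local worker both generates experience and performs the updates. With N workers, N remote workers generate experience and the local worker only performs the updates.
    Multiple workers therefore raise the question of how to allocate CPU and GPU resources. The local worker is allocated through num_gpus; remote workers are allocated through num_cpus_per_worker, num_gpus_per_worker, and custom_resources_per_worker. GPUs can be allocated fractionally.
    [Figure: trainer-worker structure]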
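
As a rough aid for sizing a machine, the resource keys above can be added up as follows. This is a sketch under stated assumptions: the helper requested_resources is hypothetical, the driver is assumed to take num_cpus_for_driver CPUs (1 by default), and evaluation workers and custom resources are ignored.

```python
def requested_resources(config):
    """Rough total of CPUs and GPUs that one training run asks for."""
    n = config.get("num_workers", 0)
    cpus = (config.get("num_cpus_for_driver", 1)
            + n * config.get("num_cpus_per_worker", 1))
    gpus = (config.get("num_gpus", 0)
            + n * config.get("num_gpus_per_worker", 0))
    return {"CPU": cpus, "GPU": gpus}


# Three remote workers, each with a quarter of a GPU, plus one GPU
# for the local worker that performs the updates.
example = requested_resources(
    {"num_workers": 3, "num_gpus": 1, "num_gpus_per_worker": 0.25})
print(example)
```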


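  1. When the environment is slow or cannot be replicated (for example, when interacting with a physical system), prefer sample-efficient off-policy methods such as DQN or SAC. These default to num_workers: 0, i.e. a single process. Offline batch RL training is another option. To use a GPU, set num_gpus: 1.

  2. When the environment is fast or the model is small, use several workers with a time-efficient algorithm such as PPO, IMPALA, or APEX. To use a GPU, set num_gpus: 1; for multiple GPUs, set num_gpus > 1.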



import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()

algo_config = ppo.DEFAULT_CONFIG.copy()  # the default config is listed below

# Adjust the config as needed
algo_config["num_gpus"] = 0
algo_config["num_workers"] = 3  # PPO can use several workers
algo_config["num_cpus_per_worker"] = 1  # number of CPUs per worker
# The driver (which computes the policy update) uses 1 CPU by default

# The default framework is TensorFlow; switch to torch here if needed
algo_config["framework"] = "torch"

# Set the batch size
algo_config["train_batch_size"] = 4000

# Build the trainer, which fixes the algorithm
trainer = ppo.PPOTrainer(env="CartPole-v0", config=algo_config)
# If previously trained parameters exist, load them first with
# trainer.restore(path), where path points to a saved checkpoint.

for i in range(1000):
    # Perform one iteration of training the policy with PPO
    result = trainer.train()

    if i % 100 == 0:
        print(pretty_print(result))
        checkpoint = trainer.save()
        print("checkpoint saved at", checkpoint)

# Use the trained trainer to compute an action.
# Call trainer.restore(path) beforehand to load an earlier checkpoint;
# obs is an observation returned by the environment.
action = trainer.compute_action(obs)

callbacks and custom metrics

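These callback functions are invoked at different points during training (before sampling starts, after each sample, before the policy update, and so on).

All of the functions below are methods of the DefaultCallbacks class, so a custom callbacks class should inherit from it.
import ray
from ray.rllib.agents.callbacks import DefaultCallbacks

class Mycallbacks(DefaultCallbacks):
    # Define the hooks you want to run at specific points here
    pass

def initDefaultCallbacks(call_info, logprint=False, Record=False):
    # call_info carries user-defined settings; it is not used in this skeleton.
    # Attach only the callback functions that were requested.
    hooks = {}
    if logprint:
        def on_train_result(self, *, trainer, result: dict, **kwargs):
            print("trainer.train() result: {} -> {} episodes".format(
                trainer, result["episodes_this_iter"]))
        hooks["on_train_result"] = on_train_result

    if Record:
        pass  # add recording hooks here
    # Return a subclass of DefaultCallbacks
    return type("ConfiguredCallbacks", (Mycallbacks,), hooks)

Add the following to the config of the trainer or of tune:
"callbacks": initDefaultCallbacks(call_info=call_info, logprint=False, Record=False)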

Hooks for each stage

#################### When a sub-environment is created
def on_sub_environment_created(
        self, *,
        worker: "RolloutWorker",
        sub_environment: EnvType,
        env_context: EnvContext,
        **kwargs,
    ) -> None:

#################### When a new trainer is created
def on_trainer_init(
    self, *,
    trainer: "Trainer",
    **kwargs,
) -> None:

#################### At the start of each episode
def on_episode_start(
   self, *,
   worker: "RolloutWorker",
   base_env: BaseEnv,
   policies: Dict[PolicyID, Policy],
   episode: Episode,
   **kwargs,
) -> None:

# Here we can define any variables we need at the start of each episode.
# The episode.user_data dict and episode.custom_metrics can be used to
# store custom (temporary) values.

#################### At every step
def on_episode_step(
    self, *,
    worker: "RolloutWorker",
    base_env: BaseEnv,
    policies: Optional[Dict[PolicyID, Policy]] = None,
    episode: Episode,
    **kwargs,
) -> None:
# Update the per-step variables here

#################### At the end of each episode
def on_episode_end(
    self, *,
    worker: "RolloutWorker",
    base_env: BaseEnv,
    policies: Dict[PolicyID, Policy],
    episode: Episode,
    **kwargs,
) -> None:

#################### After each sample() call
def on_sample_end(
    self, *, worker: "RolloutWorker", samples: SampleBatch, **kwargs
) -> None:

#################### Before learning on a batch
def on_learn_on_batch(
    self, *, policy: Policy, train_batch: SampleBatch, result: dict, **kwargs
) -> None:

#################### After a training iteration
def on_train_result(self, *, trainer: "Trainer", result: dict, **kwargs) -> None:
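
The way the episode hooks work together can be tried out without Ray. In the sketch below, FakeEpisode is a stand-in that only provides the user_data and custom_metrics dicts, AngleCallbacks is a hypothetical class that would subclass DefaultCallbacks in real use, and the angle argument is passed in by hand (a real hook would read it from the episode's last observation).

```python
class FakeEpisode:
    """Stand-in for RLlib's Episode: only the two dicts used below."""

    def __init__(self):
        self.user_data = {}
        self.custom_metrics = {}


class AngleCallbacks:  # in RLlib this would subclass DefaultCallbacks
    def on_episode_start(self, *, episode, **kwargs):
        # Temporary per-episode storage.
        episode.user_data["angles"] = []

    def on_episode_step(self, *, episode, angle, **kwargs):
        episode.user_data["angles"].append(abs(angle))

    def on_episode_end(self, *, episode, **kwargs):
        # Reduce the per-step values to one reported metric.
        angles = episode.user_data["angles"]
        episode.custom_metrics["mean_abs_angle"] = sum(angles) / len(angles)


callbacks = AngleCallbacks()
episode = FakeEpisode()
callbacks.on_episode_start(episode=episode)
for angle in [0.5, -0.25, 0.75]:
    callbacks.on_episode_step(episode=episode, angle=angle)
callbacks.on_episode_end(episode=episode)
print(episode.custom_metrics)
```

Keeping raw per-step values in user_data and writing only the summary to custom_metrics keeps the reported results small.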

Exploration settings


# All of the following configs go into Trainer.config.

# 1) Switching *off* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its `explore`
# param will result in no exploration.
# However, explicitly calling `compute_action(s)` with `explore=True` will
# still(!) result in exploration (per-call overrides default).
"explore": False,

# 2) Switching *on* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its
# explore param will result in exploration.
# However, explicitly calling `compute_action(s)` with `explore=False`
# will result in no(!) exploration (per-call overrides default).
"explore": True,

# 3) Example exploration_config usages:
# a) DQN: see rllib/agents/dqn/dqn.py
"explore": True,
"exploration_config": {
   # Exploration sub-class by name or full path to module+class
   # (e.g. "ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy")
   "type": "EpsilonGreedy",
   # Parameters for the Exploration class' constructor:
   "initial_epsilon": 1.0,
   "final_epsilon": 0.02,
   "epsilon_timesteps": 10000,  # Timesteps over which to anneal epsilon.
},

# b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead:
"explore": True,
"exploration_config": {
   "type": "SoftQ",
   # Parameters for the Exploration class' constructor:
   "temperature": 1.0,
},

# c) All policy-gradient algos and SAC: see rllib/agents/trainer.py
# Behavior: The algo samples stochastically from the
# model-parameterized distribution. This is the global Trainer default
# setting defined in trainer.py and used by all PG-type algos (plus SAC).
"explore": True,
"exploration_config": {
   "type": "StochasticSampling",
   "random_timesteps": 0,  # timesteps at beginning, over which to act uniformly randomly
},
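
To illustrate what the EpsilonGreedy parameters in example a) mean, the sketch below anneals epsilon from initial_epsilon to final_epsilon over epsilon_timesteps and uses it to pick an action. It is plain Python, not RLlib's implementation; the linear schedule and the helper names epsilon_at and epsilon_greedy_action are assumptions made for illustration.

```python
import random


def epsilon_at(timestep, initial_epsilon=1.0, final_epsilon=0.02,
               epsilon_timesteps=10000):
    """Linearly interpolate epsilon, then hold it at final_epsilon."""
    fraction = min(max(timestep / epsilon_timesteps, 0.0), 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


for t in (0, 2500, 5000, 10000, 20000):
    print(t, round(epsilon_at(t), 3))

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=0.0))
```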

customized evaluation during training

# Custom evaluation can be run in the middle of training.
# The evaluation frequency and duration are defined in the config.
# Run one evaluation step on every 3rd `Trainer.train()` call.
    "evaluation_interval": 3,

# Every time we do run an evaluation step, run it for exactly 10 episodes.
    "evaluation_duration": 10,
    "evaluation_duration_unit": "episodes",
# Every time we do run an evaluation step, run it for close to 200 timesteps.
    "evaluation_duration": 200,
    "evaluation_duration_unit": "timesteps",

# Whether evaluation uses exploration can also be configured.
# Switching off exploration behavior for evaluation workers
# (see rllib/agents/trainer.py). Use any keys in this sub-dict that are
# also supported in the main Trainer config.
"evaluation_config": {
   "explore": False
},

In addition, a custom evaluation function can be defined if needed.
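
As a quick check of the evaluation_interval setting shown above, this small sketch lists the training iterations on which an evaluation step would run. The helper evaluation_iterations is hypothetical; it assumes iterations are counted from 1 and that None or 0 disables evaluation, as the config comments state.

```python
def evaluation_iterations(num_iterations, evaluation_interval):
    """Return the train() call numbers that also trigger an evaluation."""
    if not evaluation_interval:
        return []  # None or 0 means no evaluation
    return [i for i in range(1, num_iterations + 1)
            if i % evaluation_interval == 0]


print(evaluation_iterations(10, 3))
print(evaluation_iterations(10, None))
```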

curriculum learning


external agents and applications

Sometimes the environment interacts with the outside world and is therefore not controlled by RLlib. In that case we can only use the external application API.



user guide


dataflow in rllib


Different observation spaces are preprocessed differently.

  1. Discrete observation spaces are one-hot encoded automatically, e.g. Discrete(3) with value 1 -> [0, 1, 0].
  2. MultiDiscrete observations are one-hot encoded per element and the results are concatenated, e.g. MultiDiscrete([3, 4]) with value [1, 3] -> [0, 1, 0, 0, 0, 0, 1].
  3. Tuple and Dict observations are flattened automatically as well. The original observation remains accessible: input_dict["obs"] holds the observation before flattening and input_dict["obs_flat"] the flattened one. In the policy's loss function, dict_or_tuple_obs = restore_original_dimensions(input_dict["obs"], self.obs_space, "tf|torch") restores the original observation structure.
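
The two one-hot examples in the list can be reproduced with a few lines of plain Python. This is only a sketch of the encoding itself, not RLlib's preprocessor code; the helper names one_hot and flatten_multi_discrete are made up for illustration.

```python
def one_hot(value, n):
    """Encode an integer in [0, n) as a list with a single 1."""
    return [1 if i == value else 0 for i in range(n)]


def flatten_multi_discrete(values, nvec):
    """One-hot encode each element and concatenate the results."""
    flat = []
    for value, n in zip(values, nvec):
        flat.extend(one_hot(value, n))
    return flat


print(one_hot(1, 3))                           # [0, 1, 0]
print(flatten_multi_discrete([1, 3], [3, 4]))  # [0, 1, 0, 0, 0, 0, 1]
```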

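For Atari games

The default preprocessor is the DeepMind preprocessor, set in the config as preprocessor_pref=deepmind.
Alternatively, set preprocessor_pref=rllib. The model config can set the frame size to dim x dim, and also grayscale=True and zero_mean=True (which scales values to -1.0 ~ 1.0 instead of 0 ~ 1.0).

Default model config

ModelConfigDict can configure fully connected layers, convolutional layers, RNNs, and so on.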

MODEL_DEFAULTS: ModelConfigDict = {
    # Experimental flag.
    # If True, try to use a native (tf.keras.Model or torch.Module) default
    # model instead of our built-in ModelV2 defaults.
    # If False (default), use "classic" ModelV2 default models.
    # Note that this currently only works for:
    # 1) framework != torch AND
    # 2) fully connected and CNN default networks as well as
    # auto-wrapped LSTM- and attention nets.
    "_use_default_native_models": False,
    # Experimental flag.
    # If True, user specified no preprocessor to be created
    # (via config._disable_preprocessor_api=True). If True, observations
    # will arrive in model as they are returned by the env.
    "_disable_preprocessor_api": False,
    # Experimental flag.
    # If True, RLlib will no longer flatten the policy-computed actions into
    # a single tensor (for storage in SampleCollectors/output files/etc..),
    # but leave (possibly nested) actions as-is. Disabling flattening affects:
    # - SampleCollectors: Have to store possibly nested action structs.
    # - Models that have the previous action(s) as part of their input.
    # - Algorithms reading from offline files (incl. action information).
    "_disable_action_flattening": False,

    # === Built-in options ===
    # FullyConnectedNetwork (tf and torch): rllib.models.tf|torch.fcnet.py
    # These are used if no custom model is specified and the input space is 1D.
    # Number of hidden layers to be used.
    "fcnet_hiddens": [256, 256],
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "fcnet_activation": "tanh",

    # VisionNetwork (tf and torch): rllib.models.tf|torch.visionnet.py
    # These are used if no custom model is specified and the input space is 2D.
    # Filter config: List of [out_channels, kernel, stride] for each filter.
    # Example:
    # Use None for making RLlib try to find a default filter setup given the
    # observation space.
    "conv_filters": None,
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "conv_activation": "relu",

    # Some default models support a final FC stack of n Dense layers with given
    # activation:
    # - Complex observation spaces: Image components are fed through
    #   VisionNets, flat Boxes are left as-is, Discrete are one-hot'd, then
    #   everything is concated and pushed through this final FC stack.
    # - VisionNets (CNNs), e.g. after the CNN stack, there may be
    #   additional Dense layers.
    # - FullyConnectedNetworks will have this additional FCStack as well
    # (that's why it's empty by default).
    "post_fcnet_hiddens": [],
    "post_fcnet_activation": "relu",

    # For DiagGaussian action distributions, make the second half of the model
    # outputs floating bias variables instead of state-dependent. This only
    # has an effect is using the default fully connected net.
    "free_log_std": False,
    # Whether to skip the final linear layer used to resize the hidden layer
    # outputs to size `num_outputs`. If True, then the last hidden layer
    # should already match num_outputs.
    "no_final_linear": False,
    # Whether layers should be shared for the value function.
    "vf_share_layers": True,

    # == LSTM ==
    # Whether to wrap the model with an LSTM.
    "use_lstm": False,
    # Max seq len for training the LSTM, defaults to 20.
    "max_seq_len": 20,
    # Size of the LSTM cell.
    "lstm_cell_size": 256,
    # Whether to feed a_{t-1} to LSTM (one-hot encoded if discrete).
    "lstm_use_prev_action": False,
    # Whether to feed r_{t-1} to LSTM.
    "lstm_use_prev_reward": False,
    # Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
    "_time_major": False,

    # == Attention Nets (experimental: torch-version is untested) ==
    # Whether to use a GTrXL ("Gru transformer XL"; attention net) as the
    # wrapper Model around the default Model.
    "use_attention": False,
    # The number of transformer units within GTrXL.
    # A transformer unit in GTrXL consists of a) MultiHeadAttention module and
    # b) a position-wise MLP.
    "attention_num_transformer_units": 1,
    # The input and output size of each transformer unit.
    "attention_dim": 64,
    # The number of attention heads within the MultiHeadAttention units.
    "attention_num_heads": 1,
    # The dim of a single head (within the MultiHeadAttention units).
    "attention_head_dim": 32,
    # The memory sizes for inference and training.
    "attention_memory_inference": 50,
    "attention_memory_training": 50,
    # The output dim of the position-wise MLP.
    "attention_position_wise_mlp_dim": 32,
    # The initial bias values for the 2 GRU gates within a transformer unit.
    "attention_init_gru_gate_bias": 2.0,
    # Whether to feed a_{t-n:t-1} to GTrXL (one-hot encoded if discrete).
    "attention_use_n_prev_actions": 0,
    # Whether to feed r_{t-n:t-1} to GTrXL.
    "attention_use_n_prev_rewards": 0,

    # == Atari ==
    # Set to True to enable 4x stacking behavior.
    "framestack": True,
    # Final resized frame dimension
    "dim": 84,
    # (deprecated) Converts ATARI frame to 1 Channel Grayscale image
    "grayscale": False,
    # (deprecated) Changes frame to range from [-1, 1] if true
    "zero_mean": True,

    # === Options for custom models ===
    # Name of a custom model to use
    "custom_model": None,
    # Extra options to pass to the custom classes. These will be available to
    # the Model's constructor in the model_config field. Also, they will be
    # attempted to be passed as **kwargs to ModelV2 models. For an example,
    # see rllib/models/[tf|torch]/attention_net.py.
    "custom_model_config": {},
    # Name of a custom action distribution to use.
    "custom_action_dist": None,
    # Custom preprocessors are deprecated. Please use a wrapper class around
    # your environment instead to preprocess observations.
    "custom_preprocessor": None,

    # Deprecated keys:
    # Use `lstm_use_prev_action` or `lstm_use_prev_reward` instead.
    "lstm_use_prev_action_reward": DEPRECATED_VALUE,
}

In the trainer, these parameters are passed through the "model" key.
algo_config = {
    # All model-related settings go into this sub-dict.
    "model": {
        # By default, the MODEL_DEFAULTS dict above will be used.

        # Change individual keys in that dict by overriding them, e.g.
        "fcnet_hiddens": [512, 512, 512],
        "fcnet_activation": "relu",
    },

    # ... other Trainer config keys, e.g. "lr" ...
    "lr": 0.00001,
}

Custom preprocessors and models

RLlib has deprecated custom preprocessors, but preprocessing can still be done in wrapper classes around the environment.

import gym
import numpy as np
from ray.rllib.utils.numpy import one_hot

class OneHotEnv(gym.core.ObservationWrapper):
    # Override `observation` to custom process the original observation
    # coming from the env.
    def observation(self, observation):
        # E.g. one-hotting a float obs [0.0, 5.0[.
        return one_hot(observation, depth=5)

class ClipRewardEnv(gym.core.RewardWrapper):
    def __init__(self, env, min_, max_):
        # Initialize the wrapper so that self.env and the spaces are set.
        super().__init__(env)
        self.min = min_
        self.max = max_

    # Override `reward` to custom process the original reward coming
    # from the env.
    def reward(self, reward):
        # E.g. simple clipping between min and max.
        return np.clip(reward, self.min, self.max)


supervised model losses

Imitation learning can be used to incorporate expert experience.


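import ray
from ray import tune
import time


config = {
    "env": "CartPole-v0",  # placeholder; use your own (registered) environment
    "model": {
        # tune.grid_search selects parameter values automatically
        "vf_share_layers": tune.grid_search([True, False]),
    },
    "lr": tune.grid_search([1e-4, 1e-5, 1e-6]),
}

stop_config = {"timesteps_total": 10000}

result = tune.run(
    "PPO",  # algorithm name; PPO is used here only as an example
    config=config,
    stop=stop_config,
)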

Ray Rllib API summary
Ray Rllib example


COMMON_CONFIG: TrainerConfigDict = {
    # === Settings for Rollout Worker processes ===
    # Number of rollout worker actors to create for parallel sampling. Setting
    # this to 0 will force rollouts to be done in the trainer actor.
    "num_workers": 2,
    # Number of environments to evaluate vector-wise per worker. This enables
    # model inference batching, which can improve performance for inference
    # bottlenecked workloads.
    "num_envs_per_worker": 1,
    # When `num_workers` > 0, the driver (local_worker; worker-idx=0) does not
    # need an environment. This is because it doesn't have to sample (done by
    # remote_workers; worker_indices > 0) nor evaluate (done by evaluation
    # workers; see below).
    "create_env_on_driver": False,
    # Divide episodes into fragments of this many steps each during rollouts.
    # Sample batches of this size are collected from rollout workers and
    # combined into a larger batch of `train_batch_size` for learning.
    # For example, given rollout_fragment_length=100 and train_batch_size=1000:
    #   1. RLlib collects 10 fragments of 100 steps each from rollout workers.
    #   2. These fragments are concatenated and we perform an epoch of SGD.
    # When using multiple envs per worker, the fragment size is multiplied by
    # `num_envs_per_worker`. This is since we are collecting steps from
    # multiple envs in parallel. For example, if num_envs_per_worker=5, then
    # rollout workers will return experiences in chunks of 5*100 = 500 steps.
    # The dataflow here can vary per algorithm. For example, PPO further
    # divides the train batch into minibatches for multi-epoch SGD.
    "rollout_fragment_length": 200,
    # How to build per-Sampler (RolloutWorker) batches, which are then
    # usually concat'd to form the train batch. Note that "steps" below can
    # mean different things (either env- or agent-steps) and depends on the
    # `count_steps_by` (multiagent) setting below.
    # truncate_episodes: Each produced batch (when calling
    #   RolloutWorker.sample()) will contain exactly `rollout_fragment_length`
    #   steps. This mode guarantees evenly sized batches, but increases
    #   variance as the future return must now be estimated at truncation
    #   boundaries.
    # complete_episodes: Each unroll happens exactly over one episode, from
    #   beginning to end. Data collection will not stop unless the episode
    #   terminates or a configured horizon (hard or soft) is hit.
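    # With truncate_episodes, an update does not require complete episodes;
    # the batch size determines how much data is used.
    # With complete_episodes, every update uses complete episodes, and the
    # batch size is the minimum number of experiences (it determines how
    # many episodes go into each update).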
    "batch_mode": "truncate_episodes",  

    # === Settings for the Trainer process ===
    # Discount factor of the MDP.
    "gamma": 0.99,
    # The default learning rate.
    "lr": 0.0001,
    # Training batch size, if applicable. Should be >= rollout_fragment_length.
    # Samples batches will be concatenated together to a batch of this size,
    # which is then passed to SGD.
    "train_batch_size": 200,
    # Arguments to pass to the policy model. See models/catalog.py for a full
    # list of the available model options.
    "model": MODEL_DEFAULTS,
    # Arguments to pass to the policy optimizer. These vary by optimizer.
    "optimizer": {},

    # === Environment Settings ===
    # Number of steps after which the episode is forced to terminate. Defaults
    # to `env.spec.max_episode_steps` (if present) for Gym envs.
    "horizon": None,
    # Calculate rewards but don't reset the environment when the horizon is
    # hit. This allows value estimation and RNN state to span across logical
    # episodes denoted by horizon. This only has an effect if horizon != inf.
    "soft_horizon": False,
    # Don't set 'done' at the end of the episode.
    # In combination with `soft_horizon`, this works as follows:
    # - no_done_at_end=False soft_horizon=False:
    #   Reset env and add `done=True` at end of each episode.
    # - no_done_at_end=True soft_horizon=False:
    #   Reset env, but do NOT add `done=True` at end of the episode.
    # - no_done_at_end=False soft_horizon=True:
    #   Do NOT reset env at horizon, but add `done=True` at the horizon
    #   (pretending the episode has terminated).
    # - no_done_at_end=True soft_horizon=True:
    #   Do NOT reset env at horizon and do NOT add `done=True` at the horizon.
    "no_done_at_end": False,
    # The environment specifier:
    # This can either be a tune-registered env, via
    # `tune.register_env([name], lambda env_ctx: [env object])`,
    # or a string specifier of an RLlib supported type. In the latter case,
    # RLlib will try to interpret the specifier as either an openAI gym env,
    # a PyBullet env, a ViZDoomGym env, or a fully qualified classpath to an
    # Env class, e.g. "ray.rllib.examples.env.random_env.RandomEnv".
    "env": None,
    # The observation- and action spaces for the Policies of this Trainer.
    # Use None for automatically inferring these from the given env.
    "observation_space": None,
    "action_space": None,
    # Arguments dict passed to the env creator as an EnvContext object (which
    # is a dict plus the properties: num_workers, worker_index, vector_index,
    # and remote).
    "env_config": {},
    # If using num_envs_per_worker > 1, whether to create those new envs in
    # remote processes instead of in the same worker. This adds overheads, but
    # can make sense if your envs can take much time to step / reset
    # (e.g., for StarCraft). Use this cautiously; overheads are significant.
    "remote_worker_envs": False,
    # Timeout that remote workers are waiting when polling environments.
    # 0 (continue when at least one env is ready) is a reasonable default,
    # but optimal value could be obtained by measuring your environment
    # step / reset and model inference perf.
    "remote_env_batch_wait_ms": 0,
    # A callable taking the last train results, the base env and the env
    # context as args and returning a new task to set the env to.
    # The env must be a `TaskSettableEnv` sub-class for this to work.
    # See `examples/curriculum_learning.py` for an example.
    "env_task_fn": None,
    # If True, try to render the environment on the local worker or on worker
    # 1 (if num_workers > 0). For vectorized envs, this usually means that only
    # the first sub-environment will be rendered.
    # In order for this to work, your env will have to implement the
    # `render()` method which either:
    # a) handles window generation and rendering itself (returning True) or
    # b) returns a numpy uint8 image of shape [height x width x 3 (RGB)].
    "render_env": False,
    # If True, stores videos in this relative directory inside the default
    # output dir (~/ray_results/...). Alternatively, you can specify an
    # absolute path (str), in which the env recordings should be
    # stored instead.
    # Set to False for not recording anything.
    # Note: This setting replaces the deprecated `monitor` key.
    "record_env": False,
    # Whether to clip rewards during Policy's postprocessing.
    # None (default): Clip for Atari only (r=sign(r)).
    # True: r=sign(r): Fixed rewards -1.0, 1.0, or 0.0.
    # False: Never clip.
    # [float value]: Clip at -value and + value.
    # Tuple[value1, value2]: Clip at value1 and value2.
    "clip_rewards": None,
    # If True, RLlib will learn entirely inside a normalized action space
    # (0.0 centered with small stddev; only affecting Box components).
    # We will unsquash actions (and clip, just in case) to the bounds of
    # the env's action space before sending actions back to the env.
    "normalize_actions": True,
    # If True, RLlib will clip actions according to the env's bounds
    # before sending them back to the env.
    # TODO: (sven) This option should be obsoleted and always be False.
    "clip_actions": False,
    # Whether to use "rllib" or "deepmind" preprocessors by default
    # Set to None for using no preprocessor. In this case, the model will have
    # to handle possibly complex observations from the environment.
    "preprocessor_pref": "deepmind",

    # === Debug Settings ===
    # Set the ray.rllib.* log level for the agent process and its workers.
    # Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level will also
    # periodically print out summaries of relevant internal dataflow (this is
    # also printed out once at startup at the INFO level). When using the
    # `rllib train` command, you can also use the `-v` and `-vv` flags as
    # shorthand for INFO and DEBUG.
    "log_level": "WARN",
    # Callbacks that will be run during various phases of training. See the
    # `DefaultCallbacks` class and `examples/custom_metrics_and_callbacks.py`
    # for more usage information.
    "callbacks": DefaultCallbacks,
    # Whether to attempt to continue training if a worker crashes. The number
    # of currently healthy workers is reported as the "num_healthy_workers"
    # metric.
    "ignore_worker_failures": False,
    # Whether - upon a worker failure - RLlib will try to recreate the lost worker as
    # an identical copy of the failed one. The new worker will only differ from the
    # failed one in its `self.recreated_worker=True` property value. It will have
    # the same `worker_index` as the original one.
    # If True, the `ignore_worker_failures` setting will be ignored.
    "recreate_failed_workers": False,
    # Log system resource metrics to results. This requires `psutil` to be
    # installed for sys stats, and `gputil` for GPU metrics.
    "log_sys_usage": True,
    # Use fake (infinite speed) sampler. For testing only.
    "fake_sampler": False,

    # === Deep Learning Framework Settings ===
    # tf: TensorFlow (static-graph)
    # tf2: TensorFlow 2.x (eager or traced, if eager_tracing=True)
    # tfe: TensorFlow eager (or traced, if eager_tracing=True)
    # torch: PyTorch
    "framework": "tf",
    # Enable tracing in eager mode. This greatly improves performance
    # (speedup ~2x), but makes it slightly harder to debug since Python
    # code won't be evaluated after the initial eager pass.
    # Only possible if framework=[tf2|tfe].
    "eager_tracing": False,
    # Maximum number of tf.function re-traces before a runtime error is raised.
    # This is to prevent unnoticed retraces of methods inside the
    # `..._eager_traced` Policy, which could slow down execution by a
    # factor of 4, without the user noticing what the root cause for this
    # slowdown could be.
    # Only necessary for framework=[tf2|tfe].
    # Set to None to ignore the re-trace count and never throw an error.
    "eager_max_retraces": 20,

    # === Exploration Settings ===
    # Default exploration behavior, iff `explore`=None is passed into
    # compute_action(s).
    # Set to False for no exploration behavior (e.g., for evaluation).
    "explore": True,
    # Provide a dict specifying the Exploration object's config.
    "exploration_config": {
        # The Exploration class to use. In the simplest case, this is the name
        # (str) of any class present in the `rllib.utils.exploration` package.
        # You can also provide the python class directly or the full location
        # of your class (e.g. "ray.rllib.utils.exploration.epsilon_greedy.
        # EpsilonGreedy").
        "type": "StochasticSampling",
        # Add constructor kwargs here (if any).
    },

    # === Evaluation Settings ===
    # Evaluate with every `evaluation_interval` training iterations.
    # The evaluation stats will be reported under the "evaluation" metric key.
    # Note that for Ape-X metrics are already only reported for the lowest
    # epsilon workers (least random workers).
    # Set to None (or 0) for no evaluation.
    "evaluation_interval": None,
    # Duration for which to run evaluation each `evaluation_interval`.
    # The unit for the duration can be set via `evaluation_duration_unit` to
    # either "episodes" (default) or "timesteps".
    # If using multiple evaluation workers (evaluation_num_workers > 1),
    # the load to run will be split amongst these.
    # If the value is "auto":
    # - For `evaluation_parallel_to_training=True`: Will run as many
    #   episodes/timesteps as fit into the (parallel) training step.
    # - For `evaluation_parallel_to_training=False`: Error.
    "evaluation_duration": 10,
    # The unit in which to count the evaluation duration. Either "episodes"
    # (default) or "timesteps".
    "evaluation_duration_unit": "episodes",
    # Whether to run evaluation in parallel to a Trainer.train() call
    # using threading. Default=False.
    # E.g. evaluation_interval=2 -> For every other training iteration,
    # the Trainer.train() and Trainer.evaluate() calls run in parallel.
    # Note: This is experimental. Possible pitfalls could be race conditions
    # for weight synching at the beginning of the evaluation loop.
    "evaluation_parallel_to_training": False,
    # Internal flag that is set to True for evaluation workers.
    "in_evaluation": False,
    # Typical usage is to pass extra args to evaluation env creator
    # and to disable exploration by computing deterministic actions.
    # IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
    # policy, even if this is a stochastic one. Setting "explore=False" here
    # will result in the evaluation workers not using this optimal policy!
    "evaluation_config": {
        # Example: overriding env_config, exploration, etc:
        # "env_config": {...},
        # "explore": False
    },

    # === Replay Buffer Settings ===
    # Provide a dict specifying the ReplayBuffer's config.
    # "replay_buffer_config": {
    #     The ReplayBuffer class to use. Any class that obeys the
    #     ReplayBuffer API can be used here. In the simplest case, this is the
    #     name (str) of any class present in the `rllib.utils.replay_buffers`
    #     package. You can also provide the python class directly or the
    #     full location of your class (e.g.
    #     "ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer").
    #     "type": "ReplayBuffer",
    #     The capacity of units that can be stored in one ReplayBuffer
    #     instance before eviction.
    #     "capacity": 10000,
    #     Specifies how experiences are stored. Either 'sequences' or
    #     'timesteps'.
    #     "storage_unit": "timesteps",
    #     Add constructor kwargs here (if any).
    # },

    # Number of parallel workers to use for evaluation. Note that this is set
    # to zero by default, which means evaluation will be run in the trainer
    # process (only if evaluation_interval is not None). If you increase this,
    # it will increase the Ray resource usage of the trainer since evaluation
    # workers are created separately from rollout workers (used to sample data
    # for training).
    "evaluation_num_workers": 0,
    # Customize the evaluation method. This must be a function of signature
    # (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
    # Trainer.evaluate() method to see the default implementation.
    # The Trainer guarantees all eval workers have the latest policy state
    # before this function is called.
    "custom_eval_function": None,
    # Make sure the latest available evaluation results are always attached to
    # a step result dict.
    # This may be useful if Tune or some other meta controller needs access
    # to evaluation metrics all the time.
    "always_attach_evaluation_results": False,
    # Store raw custom metrics without calculating max, min, mean
    "keep_per_episode_custom_metrics": False,

    # === Advanced Rollout Settings ===
    # Use a background thread for sampling (slightly off-policy, usually not
    # advisable to turn on unless your env specifically requires it).
    "sample_async": False,

    # The SampleCollector class to be used to collect and retrieve
    # environment-, model-, and sampler data. Override the SampleCollector base
    # class to implement your own collection/buffering/retrieval logic.
    "sample_collector": SimpleListCollector,

    # Element-wise observation filter, either "NoFilter" or "MeanStdFilter".
    "observation_filter": "NoFilter",
    # Whether to synchronize the statistics of remote filters.
    "synchronize_filters": True,
    # Configures TF for single-process operation by default.
    "tf_session_args": {
        # note: overridden by `local_tf_session_args`
        "intra_op_parallelism_threads": 2,
        "inter_op_parallelism_threads": 2,
        "gpu_options": {
            "allow_growth": True,
        },
        "log_device_placement": False,
        "device_count": {
            "CPU": 1
        },
        # Required by multi-GPU (num_gpus > 1).
        "allow_soft_placement": True,
    },
    # Override the following tf session args on the local worker
    "local_tf_session_args": {
        # Allow a higher level of parallelism by default, but not unlimited
        # since that can cause crashes with many concurrent drivers.
        "intra_op_parallelism_threads": 8,
        "inter_op_parallelism_threads": 8,
    },
    # Whether to LZ4 compress individual observations.
    "compress_observations": False,
    # Wait for metric batches for at most this many seconds. Those that
    # have not returned in time will be collected in the next train iteration.
    "metrics_episode_collection_timeout_s": 180,
    # Smooth metrics over this many episodes.
    "metrics_num_episodes_for_smoothing": 100,
    # Minimum time interval to run one `train()` call for:
    # If - after one `step_attempt()`, this time limit has not been reached,
    # will perform n more `step_attempt()` calls until this minimum time has
    # been consumed. Set to None or 0 for no minimum time.
    "min_time_s_per_reporting": None,
    # Minimum train/sample timesteps to optimize for per `train()` call.
    # This value does not affect learning, only the length of train iterations.
    # If - after one `step_attempt()`, the timestep counts (sampling or
    # training) have not been reached, will perform n more `step_attempt()`
    # calls until the minimum timesteps have been executed.
    # Set to None or 0 for no minimum timesteps.
    "min_train_timesteps_per_reporting": None,
    "min_sample_timesteps_per_reporting": None,

    # This argument, in conjunction with worker_index, sets the random seed of
    # each worker, so that identically configured trials will have identical
    # results. This makes experiments reproducible.
    "seed": None,
    # Any extra python env vars to set in the trainer process, e.g.,
    # {"OMP_NUM_THREADS": "16"}
    "extra_python_environs_for_driver": {},
    # Any extra python env vars to set in the worker processes.
    "extra_python_environs_for_worker": {},

    # === Resource Settings ===
    # Number of GPUs to allocate to the trainer process. Note that not all
    # algorithms can take advantage of trainer GPUs. Support for multi-GPU
    # is currently only available for tf-[PPO/IMPALA/DQN/PG].
    # This can be fractional (e.g., 0.3 GPUs).
    "num_gpus": 0,
    # Set to True for debugging (multi-)?GPU functionality on a CPU machine.
    # GPU towers will be simulated by graphs located on CPUs in this case.
    # Use `num_gpus` to test for different numbers of fake GPUs.
    "_fake_gpus": False,
    # Number of CPUs to allocate per worker.
    "num_cpus_per_worker": 1,
    # Number of GPUs to allocate per worker. This can be fractional. This is
    # usually needed only if your env itself requires a GPU (i.e., it is a
    # GPU-intensive video game), or model inference is unusually expensive.
    "num_gpus_per_worker": 0,
    # Any custom Ray resources to allocate per worker.
    "custom_resources_per_worker": {},
    # Number of CPUs to allocate for the trainer. Note: this only takes effect
    # when running in Tune. Otherwise, the trainer runs in the main program.
    "num_cpus_for_driver": 1,
    # The strategy for the placement group factory returned by
    # `Trainer.default_resource_request()`. A PlacementGroup defines which
    # devices (resources) should always be co-located on the same node.
    # For example, a Trainer with 2 rollout workers, running with
    # num_gpus=1 will request a placement group with the bundles:
    # [{"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}], where the first bundle is
    # for the driver and the other 2 bundles are for the two workers.
    # These bundles can now be "placed" on the same or different
    # nodes depending on the value of `placement_strategy`:
    # "PACK": Packs bundles into as few nodes as possible.
    # "SPREAD": Places bundles across distinct nodes as evenly as possible.
    # "STRICT_PACK": Packs bundles into one node. The group is not allowed
    #   to span multiple nodes.
    # "STRICT_SPREAD": Places each bundle on a distinct node.
    "placement_strategy": "PACK",

    # TODO(jungong, sven): we can potentially unify all input types
    #     under input and input_config keys. E.g.
    #     input: sample
    #     input_config {
    #         env: CartPole-v0
    #     }
    #     or:
    #     input: json_reader
    #     input_config {
    #         path: /tmp/
    #     }
    #     or:
    #     input: dataset
    #     input_config {
    #         format: parquet
    #         path: /tmp/
    #     }
    # === Offline Datasets ===
    # Specify how to generate experiences:
    #  - "sampler": Generate experiences via online (env) simulation (default).
    #  - A local directory or file glob expression (e.g., "/tmp/*.json").
    #  - A list of individual file paths/URIs (e.g., ["/tmp/1.json",
    #    "s3://bucket/2.json"]).
    #  - A dict with string keys and sampling probabilities as values (e.g.,
    #    {"sampler": 0.4, "/tmp/*.json": 0.4, "s3://bucket/expert.json": 0.2}).
    #  - A callable that takes an `IOContext` object as only arg and returns a
    #    ray.rllib.offline.InputReader.
    #  - A string key that indexes a callable with tune.registry.register_input
    "input": "sampler",
    # Arguments accessible from the IOContext for configuring custom input
    "input_config": {},
    # True, if the actions in a given offline "input" are already normalized
    # (between -1.0 and 1.0). This is usually the case when the offline
    # file has been generated by another RLlib algorithm (e.g. PPO or SAC),
    # while "normalize_actions" was set to True.
    "actions_in_input_normalized": False,
    # Specify how to evaluate the current policy. This only has an effect when
    # reading offline experiences ("input" is not "sampler").
    # Available options:
    #  - "wis": the weighted step-wise importance sampling estimator.
    #  - "is": the step-wise importance sampling estimator.
    #  - "simulation": run the environment in the background, but use
    #    this data for evaluation only and not for learning.
    "input_evaluation": ["is", "wis"],
    # Whether to run postprocess_trajectory() on the trajectory fragments from
    # offline inputs. Note that postprocessing will be done using the *current*
    # policy, not the *behavior* policy, which is typically undesirable for
    # on-policy algorithms.
    "postprocess_inputs": False,
    # If positive, input batches will be shuffled via a sliding window buffer
    # of this number of batches. Use this if the input data is not in random
    # enough order. Input is delayed until the shuffle buffer is filled.
    "shuffle_buffer_size": 0,
    # Specify where experiences should be saved:
    #  - None: don't save any experiences
    #  - "logdir" to save to the agent log dir
    #  - a path/URI to save to a custom output directory (e.g., "s3://bucket/")
    #  - a function that returns a rllib.offline.OutputWriter
    "output": None,
    # Arguments accessible from the IOContext for configuring custom output
    "output_config": {},
    # What sample batch columns to LZ4 compress in the output data.
    "output_compress_columns": ["obs", "new_obs"],
    # Max output file size (in bytes) before rolling over to a new file.
    "output_max_file_size": 64 * 1024 * 1024,

    # === Settings for Multi-Agent Environments ===
    "multiagent": {
        # Map of type MultiAgentPolicyConfigDict from policy ids to tuples
        # of (policy_cls, obs_space, act_space, config). This defines the
        # observation and action spaces of the policies and any extra config.
        "policies": {},
        # Keep this many policies in the "policy_map" (before writing
        # least-recently used ones to disk/S3).
        "policy_map_capacity": 100,
        # Where to store overflowing (least-recently used) policies?
        # Could be a directory (str) or an S3 location. None for using
        # the default output dir.
        "policy_map_cache": None,
        # Function mapping agent ids to policy ids.
        "policy_mapping_fn": None,
        # Determines those policies that should be updated.
        # Options are:
        # - None, for all policies.
        # - An iterable of PolicyIDs that should be updated.
        # - A callable, taking a PolicyID and a SampleBatch or MultiAgentBatch
        #   and returning a bool (indicating whether the given policy is trainable
        #   or not, given the particular batch). This allows you to have a policy
        #   trained only on certain data (e.g. when playing against a certain
        #   opponent).
        "policies_to_train": None,
        # Optional function that can be used to enhance the local agent
        # observations to include more state.
        # See rllib/evaluation/observation_function.py for more info.
        "observation_fn": None,
        # When replay_mode=lockstep, RLlib will replay all the agent
        # transitions at a particular timestep together in a batch. This allows
        # the policy to implement differentiable shared computations between
        # agents it controls at that timestep. When replay_mode=independent,
        # transitions are replayed independently per policy.
        "replay_mode": "independent",
        # Which metric to use as the "batch size" when building a
        # MultiAgentBatch. The two supported values are:
        # env_steps: Count each time the env is "stepped" (no matter how many
        #   multi-agent actions are passed/how many multi-agent observations
        #   have been returned in the previous step).
        # agent_steps: Count each individual agent step as one step.
        "count_steps_by": "env_steps",
    },

    # === Logger ===
    # Define logger-specific configuration to be used inside Logger
    # Default value None allows overwriting with nested dicts
    "logger_config": None,

    # === API deprecations/simplifications/changes ===
    # Experimental flag.
    # If True, TFPolicy will handle more than one loss/optimizer.
    # Set this to True, if you would like to return more than
    # one loss term from your `loss_fn` and an equal number of optimizers
    # from your `optimizer_fn`.
    # In the future, the default for this will be True.
    "_tf_policy_handles_more_than_one_loss": False,
    # Experimental flag.
    # If True, no (observation) preprocessor will be created and
    # observations will arrive in model as they are returned by the env.
    # In the future, the default for this will be True.
    "_disable_preprocessor_api": False,
    # Experimental flag.
    # If True, RLlib will no longer flatten the policy-computed actions into
    # a single tensor (for storage in SampleCollectors/output files/etc..),
    # but leave (possibly nested) actions as-is. Disabling flattening affects:
    # - SampleCollectors: Have to store possibly nested action structs.
    # - Models that have the previous action(s) as part of their input.
    # - Algorithms reading from offline files (incl. action information).
    "_disable_action_flattening": False,
    # Experimental flag.
    # If True, the execution plan API will not be used. Instead,
    # a Trainer's `training_iteration` method will be called as-is each
    # training iteration.
    "_disable_execution_plan_api": False,

    # If True, disable the environment pre-checking module.
    "disable_env_checking": False,

    # === Deprecated keys ===
    # Uses the sync samples optimizer instead of the multi-gpu one. This is
    # usually slower, but you might want to try it if you run into issues with
    # the default optimizer.
    # This will be set automatically from now on.
    "simple_optimizer": DEPRECATED_VALUE,
    # Whether to write episode stats and videos to the agent log dir. This is
    # typically located in ~/ray_results.
    "monitor": DEPRECATED_VALUE,
    # Replaced by `evaluation_duration=10` and
    # `evaluation_duration_unit=episodes`.
    "evaluation_num_episodes": DEPRECATED_VALUE,
    # Use `metrics_num_episodes_for_smoothing` instead.
    "metrics_smoothing_episodes": DEPRECATED_VALUE,
    # Use `min_[train|sample]_timesteps_per_reporting` instead.
    "timesteps_per_iteration": 0,
    # Use `min_time_s_per_reporting` instead.
    "min_iter_time_s": DEPRECATED_VALUE,
    # Use `metrics_episode_collection_timeout_s` instead.
    "collect_metrics_timeout": DEPRECATED_VALUE,
}

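# Build a trainer (import path as in the Ray 1.x releases this config comes from)
from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "train_batch_size": 4000,
        "env_config": {},  # config to pass to the env class
    },
)
# There is one trainer class per algorithm (PPOTrainer here). When building a
# trainer, set the env and pass the parameters of the corresponding RL algorithm.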
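
The trainer call above passes only a couple of keys. The remaining settings come from the algorithm's default config, and nested dicts such as `evaluation_config` or `tf_session_args` are, in general, merged key by key instead of being replaced wholesale. The sketch below illustrates that idea in plain Python. `deep_merge` and the trimmed `defaults` dict are hypothetical names used only for illustration; this is not RLlib's actual merge code, which also applies special rules to some keys.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied recursively.

    Hypothetical helper for illustration only (not an RLlib function).
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or a new key: the override wins.
            merged[key] = copy.deepcopy(value)
    return merged


# A few entries copied from the default config listed above.
defaults = {
    "explore": True,
    "evaluation_interval": None,
    "evaluation_duration": 10,
    "evaluation_config": {},
    "tf_session_args": {
        "intra_op_parallelism_threads": 2,
        "inter_op_parallelism_threads": 2,
    },
}

# What a user might pass as `config=...` to a trainer.
overrides = {
    "evaluation_interval": 5,
    "evaluation_config": {"explore": False},
    "tf_session_args": {"intra_op_parallelism_threads": 4},
}

config = deep_merge(defaults, overrides)
print(config["evaluation_interval"])
print(config["tf_session_args"])
```

Only the overridden leaves change: `inter_op_parallelism_threads` keeps its default of 2, and the `defaults` dict itself is left untouched.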


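The `placement_strategy` comment in the config describes the resource bundles a trainer requests: one bundle for the driver, plus one per rollout worker. As a rough sketch of that arithmetic, the hypothetical helper `placement_bundles` below rebuilds the bundle list from the resource settings. It uses the lowercase keys from the comment, ignores evaluation workers and custom resources, and is not the code RLlib runs in `Trainer.default_resource_request()`.

```python
def placement_bundles(num_workers, num_gpus=0, num_cpus_for_driver=1,
                      num_cpus_per_worker=1, num_gpus_per_worker=0):
    """Hypothetical helper: list the bundles described in the config comment."""
    # First bundle: the driver (trainer) process.
    driver = {"cpu": num_cpus_for_driver}
    if num_gpus:
        driver["gpu"] = num_gpus
    # One identical bundle per rollout worker.
    worker = {"cpu": num_cpus_per_worker}
    if num_gpus_per_worker:
        worker["gpu"] = num_gpus_per_worker
    return [driver] + [dict(worker) for _ in range(num_workers)]


# The example from the comment: 2 rollout workers, num_gpus=1.
bundles = placement_bundles(num_workers=2, num_gpus=1)
print(bundles)  # [{'cpu': 1, 'gpu': 1}, {'cpu': 1}, {'cpu': 1}]
```

With `"PACK"` these three bundles would be placed on as few nodes as possible. With `"STRICT_PACK"` all three must fit on a single node.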


