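RLlib Study Notes - [4] - AlgorithmConfig in Detail [PyTorch version]

AlgorithmConfig groups its parameters by area, and each group drives the construction of a different module.
"env_config", # parameters passed when the environment is created; they depend on your own env and are not covered here
"model", # parameters used when building the model architecture; covered in detail
"optimizer", # optimizer parameters; covered briefly
"multiagent", # multi-agent settings; not covered here
"custom_resources_per_worker", # resource definitions; covered briefly
"evaluation_config", # evaluation parameters; covered briefly
"exploration_config", # exploration parameters; covered in detail
"replay_buffer_config", # replay buffer settings; covered in detail
"extra_python_environs_for_worker", #
"input_config",
"output_config",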

resources

Defines the resources the environments run on.

        self.num_gpus = 0 # number of GPUs; can be fractional
        self.num_cpus_per_worker = 1   # CPUs per worker
        self.num_gpus_per_worker = 0   # GPUs per worker
        self._fake_gpus = False
        self.num_cpus_for_local_worker = 1  # CPUs taken by the local worker
        self.custom_resources_per_worker = {}
        self.placement_strategy = "PACK"
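
As a rough illustration of how these fields add up, the sketch below totals the CPUs and GPUs a run would request. It assumes one resource bundle for the local worker plus one per remote worker and ignores evaluation workers. The helper `total_resources` is hypothetical (it is not an RLlib function), and `num_workers` comes from the rollouts settings.

```python
def total_resources(num_workers, num_cpus_for_local_worker=1, num_gpus=0,
                    num_cpus_per_worker=1, num_gpus_per_worker=0):
    """Hypothetical helper: sum the resources of the local and remote workers."""
    # The local worker takes num_cpus_for_local_worker CPUs and num_gpus GPUs.
    cpus = num_cpus_for_local_worker + num_workers * num_cpus_per_worker
    # Each remote worker adds its own per-worker CPU/GPU share.
    gpus = num_gpus + num_workers * num_gpus_per_worker
    return {"CPU": cpus, "GPU": gpus}


# Two remote workers, half a GPU for the local worker.
request = total_resources(num_workers=2, num_gpus=0.5)
print(request)
```

If the cluster has fewer resources than this total, the run will typically wait for them rather than start.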

framework

self.framework_str = "tf" # "tf", "tf2", or "torch"; the TensorFlow-specific tf_session settings are not covered here

env

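        self.env = None  # register the environment first, then refer to it by its string name
        self.env_config = {} # set according to the env_config your own env expects
        self.observation_space = None # observation space; can be left unset, it is then taken from env.observation_space
        self.action_space = None # action space; can be left unset, it is then taken from env.action_space
        self.env_task_fn = None # for multi-task training: defines how the env is switched to each task
        self.render_env = False # whether to render the env; True produces videos and ray_results, False only ray_results
        self.clip_rewards = None # whether to clip rewards
        self.normalize_actions = True # whether to normalize actions using the lower and upper bounds of action_space
        self.clip_actions = False # whether to clip actions to the lower and upper bounds of the action space
        self.disable_env_checking = False # whether to skip the env check at startup; True: do not check the env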
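
The difference between `normalize_actions` and `clip_actions` can be sketched in plain Python. This is a simplified illustration for a single scalar action, assuming the policy works in a normalized range of [-1, 1]; the function names `unsquash_action` and `clip_action` are made up for this example, and RLlib's own code also handles nested and multi-dimensional spaces.

```python
def unsquash_action(a, low, high):
    """normalize_actions: map a normalized action in [-1, 1] to [low, high]."""
    a = max(-1.0, min(1.0, a))  # guard against values outside [-1, 1]
    return low + (a + 1.0) * (high - low) / 2.0


def clip_action(a, low, high):
    """clip_actions: leave the action as-is, but cut it off at the bounds."""
    return max(low, min(high, a))


low, high = -2.0, 2.0  # bounds of a hypothetical Box action space
unsquashed = [unsquash_action(a, low, high) for a in (-1.0, 0.0, 0.5, 3.0)]
clipped = [clip_action(a, low, high) for a in (-5.0, 0.3, 2.5)]
print(unsquashed)
print(clipped)
```

With normalization the whole policy output range is rescaled onto the env's bounds; with clipping only out-of-range values change.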

rollouts

        self.num_workers = 2  # number of remote workers; set to 0 to use only the local worker
        self.num_envs_per_worker = 1 # number of envs each worker steps at the same time
        self.sample_collector = SimpleListCollector # class used to collect samples on the workers and assemble them into batches
        self.create_env_on_local_worker = False # whether to create an env on the local worker; can be left unset, RLlib sets it automatically
        self.sample_async = False # whether to sample asynchronously
        self.enable_connectors = False
        self.rollout_fragment_length = 200 # number of samples taken from one env of one worker per sampling call
        # When num_envs_per_worker > 1, each worker returns num_envs_per_worker * rollout_fragment_length samples.
        # The fragments are finally concatenated into a batch of train_batch_size.

        self.batch_mode = "truncate_episodes" # or "complete_episodes": whether sampling must collect whole episodes
        self.remote_worker_envs = False #
        self.remote_env_batch_wait_ms = 0
        self.validate_workers_after_construction = True
        self.ignore_worker_failures = False
        self.recreate_failed_workers = False # whether to recreate a worker after it fails
        self.restart_failed_sub_environments = False
        self.num_consecutive_worker_failures_tolerance = 100
        self.horizon = None # maximum episode length for each env
        self.soft_horizon = False # soft-horizon setting
        self.no_done_at_end = False
        self.preprocessor_pref = "deepmind" # preprocessor setting; can be None. This preprocessor only applies to Atari games
        self.observation_filter = "NoFilter" # can be set to a mean/std filter ("MeanStdFilter"); filtering is usually done inside your own env instead
        self.synchronize_filters = True
        self.compress_observations = False
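
The relation between `rollout_fragment_length`, `num_envs_per_worker`, and `train_batch_size` described in the comments above can be worked through with a small calculation. This is a simplified sketch that assumes synchronous sampling in "truncate_episodes" mode; `sampling_plan` is a hypothetical helper, not part of RLlib.

```python
import math


def sampling_plan(num_workers, num_envs_per_worker,
                  rollout_fragment_length, train_batch_size):
    """Return (samples per worker, samples per round, rounds per train batch)."""
    # One sampling call on a worker yields one fragment from each of its envs.
    per_worker = num_envs_per_worker * rollout_fragment_length
    # With num_workers == 0 the local worker does the sampling.
    samplers = max(num_workers, 1)
    per_round = per_worker * samplers
    # Rounds of sampling needed before train_batch_size samples are available.
    rounds = math.ceil(train_batch_size / per_round)
    return per_worker, per_round, rounds


plan = sampling_plan(num_workers=2, num_envs_per_worker=4,
                     rollout_fragment_length=200, train_batch_size=4000)
print(plan)
```

Choosing `train_batch_size` as a multiple of the per-round sample count avoids collecting a partly unused extra round.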

        # `self.training()`
        self.gamma = 0.99
        self.lr = 0.001
        self.train_batch_size = 32
        self.model = copy.deepcopy(MODEL_DEFAULTS)  # explained in detail in the next section
        self.optimizer = {}
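
The fields listed so far are normally set through the chained setter methods of a config object rather than assigned directly. The following is a minimal sketch using `PPOConfig`; it assumes a Ray 2.x release with RLlib and PyTorch installed, and the method and argument names (`rollouts`, `num_rollout_workers`, and so on) are those of the Ray 2.x API, which may differ in other versions. `CartPole-v1` and all values are placeholders chosen for illustration.

```python
# Assumption: Ray 2.x with RLlib and PyTorch installed.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # env: a registered name (or a Gym id) plus the env's own config dict
    .environment(env="CartPole-v1", env_config={})
    # framework: "tf", "tf2", or "torch"
    .framework("torch")
    # resources: GPUs for the local worker, CPUs per remote worker
    .resources(num_gpus=0, num_cpus_per_worker=1)
    # rollouts: num_rollout_workers sets the num_workers field shown above
    .rollouts(
        num_rollout_workers=2,
        num_envs_per_worker=1,
        rollout_fragment_length=200,
        batch_mode="truncate_episodes",
    )
    # training: discount, learning rate, batch size, and model overrides
    .training(
        gamma=0.99,
        lr=0.001,
        train_batch_size=4000,
        model={"fcnet_hiddens": [256, 256], "fcnet_activation": "tanh"},
    )
)

# Inspect the resulting settings as a plain dict.
print(config.to_dict()["train_batch_size"])
```

Only the keys passed in `model` are overridden; the remaining model settings keep their defaults.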

model_config

ModelConfigDict = {
    # Experimental flag.
    # If True, try to use a native (tf.keras.Model or torch.Module) default
    # model instead of our built-in ModelV2 defaults.
    # If False (default), use "classic" ModelV2 default models.
    # Note that this currently only works for:
    # 1) framework != torch AND
    # 2) fully connected and CNN default networks as well as
    # auto-wrapped LSTM- and attention nets.
    "_use_default_native_models": False,
    # Experimental flag.
    # If True, user specified no preprocessor to be created
    # (via config._disable_preprocessor_api=True). If True, observations
    # will arrive in model as they are returned by the env.
    "_disable_preprocessor_api": False, # can be set to True to disable the preprocessor
    # Experimental flag.
    # If True, RLlib will no longer flatten the policy-computed actions into
    # a single tensor (for storage in SampleCollectors/output files/etc..),
    # but leave (possibly nested) actions as-is. Disabling flattening affects:
    # - SampleCollectors: Have to store possibly nested action structs.
    # - Models that have the previous action(s) as part of their input.
    # - Algorithms reading from offline files (incl. action information).
    "_disable_action_flattening": False,

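# ===
# This post only covers CNN + MLP; RNN, LSTM, and transformer models are not discussed.
# There are therefore usually three kinds of input:
#   image only:  gym.spaces.Box with len(shape) == 3
#   vector only: gym.spaces.Discrete / MultiDiscrete / Box with len(shape) == 1
#   complex:     Tuple/Dict(Discrete, Box, ...) mixing images and vectors

# The parameters to set differ for each of the three.
# === Image-only input ===
# "conv_filters" + "conv_activation" build the CNN; "post_fcnet_hiddens" + "post_fcnet_activation" build the MLP after it.
# "fcnet_hiddens" is not set.

# === Vector-only input ===
# "fcnet_hiddens" + "fcnet_activation" build the MLP; "post_fcnet_hiddens" is left unset (setting it also works; it is treated as a continuation of "fcnet_hiddens").

# === Complex input ===
# RLlib automatically groups the image components together and the vector components together.
# Images go through the CNN ("conv_filters" + "conv_activation"); vectors go separately through "fcnet_hiddens" + "fcnet_activation" (if the activation is None, the vector output is purely linear).
# The flattened CNN output is then concatenated with the MLP output of the vectors
# and fed into "post_fcnet_hiddens" + "post_fcnet_activation".

# =====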

    # === Built-in options ===
    # FullyConnectedNetwork (tf and torch): rllib.models.tf|torch.fcnet.py
    # These are used if no custom model is specified and the input space is 1D.
    # Number of hidden layers to be used.
    "fcnet_hiddens": [256, 256], 
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "fcnet_activation": "tanh",

    # VisionNetwork (tf and torch): rllib.models.tf|torch.visionnet.py
    # These are used if no custom model is specified and the input space is 2D.
    # Filter config: List of [out_channels, kernel, stride] for each filter.
    # Example:
    # Use None for making RLlib try to find a default filter setup given the
    # observation space.
    "conv_filters": None,
    # Activation function descriptor.
    # Supported values are: "tanh", "relu", "swish" (or "silu"),
    # "linear" (or None).
    "conv_activation": "relu",

    # Some default models support a final FC stack of n Dense layers with given
    # activation:
    # - Complex observation spaces: Image components are fed through
    #   VisionNets, flat Boxes are left as-is, Discrete are one-hot'd, then
    #   everything is concated and pushed through this final FC stack.
    # - VisionNets (CNNs), e.g. after the CNN stack, there may be
    #   additional Dense layers.
    # - FullyConnectedNetworks will have this additional FCStack as well
    # (that's why it's empty by default).
    "post_fcnet_hiddens": [],
    "post_fcnet_activation": "relu",

    # For DiagGaussian action distributions, make the second half of the model
    # outputs floating bias variables instead of state-dependent. This only
    # has an effect if using the default fully connected net.
    # When False, the model outputs both mu and a state-dependent log_std; the action distribution then samples the action stochastically.
    # When True, the network does not produce a state-dependent log_std; a single log_std is shared by all states, and actions are sampled from mu and this global log_std.
    "free_log_std": False,
    # Whether to skip the final linear layer used to resize the hidden layer
    # outputs to size `num_outputs`. If True, then the last hidden layer
    # should already match num_outputs.
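    # When no_final_linear is True and num_outputs is also given, the last layer has num_outputs units and applies the activation.
    # Commonly used for custom models: build the first part with ModelV2, then build the rest yourself.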
    "no_final_linear": False,
    
    # Whether layers should be shared for the value function.
    # If shared, the policy and value networks share parameters. Otherwise, the value function gets its own copy of the same network with separate parameters.
    "vf_share_layers": True,

    # == LSTM ==
    # Whether to wrap the model with an LSTM.
    "use_lstm": False,
    # Max seq len for training the LSTM, defaults to 20.
    "max_seq_len": 20,
    # Size of the LSTM cell.
    "lstm_cell_size": 256,
    # Whether to feed a_{t-1} to LSTM (one-hot encoded if discrete).
    "lstm_use_prev_action": False,
    # Whether to feed r_{t-1} to LSTM.
    "lstm_use_prev_reward": False,
    # Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
    "_time_major": False,

    # == Attention Nets (experimental: torch-version is untested) ==
    # Whether to use a GTrXL ("Gru transformer XL"; attention net) as the
    # wrapper Model around the default Model.
    "use_attention": False,
    # The number of transformer units within GTrXL.
    # A transformer unit in GTrXL consists of a) MultiHeadAttention module and
    # b) a position-wise MLP.
    "attention_num_transformer_units": 1,
    # The input and output size of each transformer unit.
    "attention_dim": 64,
    # The number of attention heads within the MultiHeadAttention units.
    "attention_num_heads": 1,
    # The dim of a single head (within the MultiHeadAttention units).
    "attention_head_dim": 32,
    # The memory sizes for inference and training.
    "attention_memory_inference": 50,
    "attention_memory_training": 50,
    # The output dim of the position-wise MLP.
    "attention_position_wise_mlp_dim": 32,
    # The initial bias values for the 2 GRU gates within a transformer unit.
    "attention_init_gru_gate_bias": 2.0,
    # Whether to feed a_{t-n:t-1} to GTrXL (one-hot encoded if discrete).
    "attention_use_n_prev_actions": 0,
    # Whether to feed r_{t-n:t-1} to GTrXL.
    "attention_use_n_prev_rewards": 0,

    # == Atari ==
    # Set to True to enable 4x stacking behavior.
    "framestack": True,
    # Final resized frame dimension
    "dim": 84,
    # (deprecated) Converts ATARI frame to 1 Channel Grayscale image
    "grayscale": False,
    # (deprecated) Changes frame to range from [-1, 1] if true
    "zero_mean": True,

    # === Options for custom models ===
    # Name of a custom model to use
    "custom_model": None, # set this when you define your own model
    # Extra options to pass to the custom classes. These will be available to
    # the Model's constructor in the model_config field. Also, they will be
    # attempted to be passed as **kwargs to ModelV2 models. For an example,
    # see rllib/models/[tf|torch]/attention_net.py.
    "custom_model_config": {},
    # Name of a custom action distribution to use.
    "custom_action_dist": None, # set this to use your own action distribution; otherwise one is created from the exploration settings and the action space
    # Custom preprocessors are deprecated. Please use a wrapper class around
    # your environment instead to preprocess observations.
    "custom_preprocessor": None,
}
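
To make the three input cases from the notes above concrete, here is a sketch of a model dict for each, followed by a small helper that checks the spatial size a `conv_filters` stack produces. The filter list is an example for an 84x84 image. The helper `conv_output_size` is hypothetical, and it assumes "same" padding on every conv layer except the last, which uses "valid" padding; check this against the vision network of your RLlib version.

```python
import math

# Image-only input: CNN, then an optional MLP on top.
image_model = {
    "conv_filters": [[16, [8, 8], 4], [32, [4, 4], 2], [256, [11, 11], 1]],
    "conv_activation": "relu",
    "post_fcnet_hiddens": [256],
    "post_fcnet_activation": "relu",
}

# Vector-only input: a plain MLP.
vector_model = {"fcnet_hiddens": [256, 256], "fcnet_activation": "tanh"}

# Complex input: CNN settings for the images, fcnet settings for the
# vectors, post_fcnet settings for the stack after concatenation.
complex_model = {**image_model, **vector_model}


def conv_output_size(size, conv_filters):
    """Side length of a square image after the conv stack (see assumptions)."""
    last = len(conv_filters) - 1
    for i, (_, kernel, stride) in enumerate(conv_filters):
        k = kernel[0] if isinstance(kernel, (list, tuple)) else kernel
        if i < last:
            size = math.ceil(size / stride)   # "same" padding
        else:
            size = (size - k) // stride + 1   # "valid" padding
    return size


out = conv_output_size(84, image_model["conv_filters"])
print(out)
```

A final size of 1 means the last conv layer collapses the image into a single feature vector, which is what the following dense layers expect.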

callbacks

        # `self.callbacks()`
        self.callbacks_class = DefaultCallbacks

explore

        # `self.explore()`
        self.explore = True
        self.exploration_config = {
            # The Exploration class to use. In the simplest case, this is the name
            # (str) of any class present in the `rllib.utils.exploration` package.
            # You can also provide the python class directly or the full location
            # of your class (e.g. "ray.rllib.utils.exploration.epsilon_greedy.
            # EpsilonGreedy").
            "type": "StochasticSampling",
            # Add constructor kwargs here (if any).
        }
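
As an example of the "constructor kwargs" mentioned in the comment above, the sketch below shows an `exploration_config` for the `EpsilonGreedy` class (typically used with discrete-action, Q-learning style algorithms) and a plain-Python approximation of how epsilon would decay. The linear schedule in `epsilon_at` is an illustration written for this post, not RLlib's implementation, and the numeric values are placeholders.

```python
exploration_config = {
    "type": "EpsilonGreedy",
    # Constructor kwargs of the exploration class:
    "initial_epsilon": 1.0,      # epsilon at timestep 0
    "final_epsilon": 0.02,       # epsilon after the decay has finished
    "epsilon_timesteps": 10000,  # number of timesteps the decay takes
}


def epsilon_at(t, cfg):
    """Hypothetical helper: linearly interpolate epsilon at timestep t."""
    frac = min(max(t / cfg["epsilon_timesteps"], 0.0), 1.0)
    return cfg["initial_epsilon"] + frac * (
        cfg["final_epsilon"] - cfg["initial_epsilon"]
    )


for t in (0, 5000, 20000):
    print(t, round(epsilon_at(t, exploration_config), 3))
```

Setting `explore` to False switches exploration off regardless of this dict, which is the usual choice for evaluation runs.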


evaluation

        # `self.evaluation()`
        self.evaluation_interval = None # None disables evaluation; otherwise the evaluation interval, counted in training iterations
        self.evaluation_duration = 10 # in episodes or timesteps, depending on evaluation_duration_unit
        self.evaluation_duration_unit = "episodes" # can also be "timesteps"
        self.evaluation_parallel_to_training = False # whether evaluation runs in parallel with training
        self.evaluation_config = {} # overrides the algorithm config during evaluation
        self.off_policy_estimation_methods = {}
        self.evaluation_num_workers = 0
        self.custom_evaluation_function = None # user-defined evaluation function
        self.always_attach_evaluation_results = False
        # TODO: Set this flag still in the config or - much better - in the
        #  RolloutWorker as a property.
        self.in_evaluation = False
        self.sync_filters_on_rollout_workers_timeout_s = 60.0
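
How `evaluation_interval` and `evaluation_config` interact can be sketched with plain dicts. This is a simplified illustration: both helper functions are hypothetical, and the shallow `update` stands in for whatever merge RLlib performs internally.

```python
import copy


def should_evaluate(iteration, evaluation_interval):
    """Evaluate every evaluation_interval training iterations (None: never)."""
    return bool(evaluation_interval) and iteration % evaluation_interval == 0


def build_eval_config(main_config, evaluation_config):
    """Start from the training config, then apply the evaluation overrides."""
    merged = copy.deepcopy(main_config)
    merged.update(evaluation_config)
    merged["in_evaluation"] = True
    return merged


main = {"explore": True, "lr": 0.001, "in_evaluation": False}
eval_cfg = build_eval_config(main, {"explore": False})

print([i for i in range(1, 11) if should_evaluate(i, 5)])
print(eval_cfg)
```

A common override is `{"explore": False}`, so that evaluation measures the deterministic policy while training keeps exploring.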
