RLlib Training API

Author: 快乐地笑

Category: Learning. Tags: ray, rllib

Article link: https://blog.csdn.net/weixin_43255962/article/details/91358631

1. Getting Started (Command Line)

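At a high level, RLlib provides a Trainer class that holds a policy for interacting with the environment. Through the trainer interface, the policy can be trained, checkpointed, or asked to compute an action. In multi-agent training, the trainer manages the querying and optimization of multiple policies at once.
You can train a simple DQN trainer with the following command: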

rllib train --run DQN --env CartPole-v0

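By default, results are logged to a subdirectory of ~/ray_results. This subdirectory contains a params.json file with the hyperparameters, a result.json file with a training summary for each iteration, and a TensorBoard file that can be used to visualize the training process by running: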

tensorboard --logdir=~/ray_results
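Besides TensorBoard, the per-iteration summaries can also be read directly. The sketch below assumes that result.json holds one JSON object per line and that each record carries training_iteration and episode_reward_mean fields. Both are assumptions about the log layout rather than something stated in this article, so check them against your own files. The helper names summarize_results and best_iteration are illustrative only.

```python
import json
from pathlib import Path


def summarize_results(path):
    """Return (iteration, mean reward) pairs from a line-delimited JSON log.

    Assumes one JSON object per line with the keys `training_iteration`
    and `episode_reward_mean`; records missing either key are skipped.
    """
    rows = []
    for line in Path(path).expanduser().read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "training_iteration" in record and "episode_reward_mean" in record:
            rows.append((record["training_iteration"], record["episode_reward_mean"]))
    return rows


def best_iteration(rows):
    """Return the (iteration, reward) pair with the highest mean reward."""
    return max(rows, key=lambda row: row[1])
```

Pointing summarize_results at the result.json of a single trial directory gives a quick way to see which iteration scored best before choosing a checkpoint to evaluate.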

The rllib train command (the same as the train.py script in the repo) has a number of options, which you can list by running:

rllib train --help
# or
python ray/python/ray/rllib/train.py --help

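The most important options are --env, which selects the environment (any OpenAI gym environment can be used, including ones registered by the user), and --run, which selects the algorithm (available options are PPO, PG, A2C, A3C, IMPALA, ES, DDPG, DQN, MARWIL, APEX, and APEX_DDPG).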

Evaluating Trained Policies

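To save checkpoints from which a policy can later be evaluated, set --checkpoint-freq (the number of training iterations between checkpoints) when running rllib train.
An example of evaluating a previously trained DQN policy is as follows: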

rllib rollout \
    ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000

The rollout.py helper script reconstructs a DQN policy from the checkpoint located at ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 and renders its behavior in the environment specified by --env.
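To make the link between --checkpoint-freq and the path passed to rllib rollout concrete, here is a small sketch. It assumes that a checkpoint is written whenever the iteration number is a multiple of the frequency, and that checkpoints follow the checkpoint_N/checkpoint-N layout seen in the example path above. Both helper names are hypothetical and are not part of RLlib.

```python
from pathlib import PurePosixPath


def checkpoint_iterations(total_iterations, checkpoint_freq):
    """Iterations at which a checkpoint would be saved.

    Assumption of this sketch: one checkpoint at every multiple of the frequency.
    """
    if checkpoint_freq <= 0:
        return []  # a frequency of 0 is treated as "no checkpoints"
    return list(range(checkpoint_freq, total_iterations + 1, checkpoint_freq))


def checkpoint_path(trial_dir, iteration):
    """Build a checkpoint file path that mirrors the layout of the example path."""
    return str(PurePosixPath(trial_dir) / f"checkpoint_{iteration}" / f"checkpoint-{iteration}")
```

Under these assumptions, 10 iterations with a frequency of 3 would leave checkpoints at iterations 3, 6, and 9.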

2. Configuration

Specifying Parameters

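In addition to a few common hyperparameters, each algorithm has its own specific hyperparameters, which can be set with --config. See the algorithm documentation for more information.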

In the example below, we train A2C with 8 workers, specified through the config flag.

rllib train --env=PongDeterministic-v4 --run=A2C --config '{"num_workers": 8}'
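Hand-writing the JSON passed to --config is easy to get wrong (single versus double quotes, trailing commas). As a hedged sketch, the command can be assembled from a Python dict instead. The helper name build_train_command is hypothetical, and it only uses flags that appear in this article (--env, --run, --config, --checkpoint-freq).

```python
import json
import shlex


def build_train_command(env, run, config=None, checkpoint_freq=None):
    """Assemble an `rllib train` argument list from Python values."""
    args = ["rllib", "train", "--env", env, "--run", run]
    if checkpoint_freq is not None:
        args += ["--checkpoint-freq", str(checkpoint_freq)]
    if config:
        # json.dumps produces valid JSON with double-quoted keys.
        args += ["--config", json.dumps(config)]
    return args


args = build_train_command("PongDeterministic-v4", "A2C", {"num_workers": 8})
# shlex.quote adds the shell quoting needed to paste the command into a terminal.
print(" ".join(shlex.quote(a) for a in args))
```

The argument list can also be handed to subprocess.run as-is, in which case no shell quoting is needed at all.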

Specifying Resources

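For most algorithms, you control the degree of parallelism by setting the num_workers hyperparameter. The number of GPUs the driver should use can be set with the num_gpus option. Similarly, the resources allocated to workers can be controlled with num_cpus_per_worker, num_gpus_per_worker, and custom_resources_per_worker. The number of GPUs can be a fractional quantity, so that only part of a GPU is allocated. For example, with DQN you can pack five trainers onto one GPU by setting num_gpus: 0.2.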
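The arithmetic behind these settings can be sketched in a few lines. The helpers below are illustrative and not part of RLlib. The one CPU reserved for the driver and the fallback values of 1 CPU and 0 GPUs per worker are assumptions of this sketch, so confirm them against the defaults of your RLlib version.

```python
from fractions import Fraction


def trainers_per_gpu(num_gpus):
    """How many trainers fit on one GPU if each requests `num_gpus` of it."""
    share = Fraction(str(num_gpus))  # exact arithmetic, avoids float rounding
    if share <= 0:
        raise ValueError("num_gpus must be positive to pack trainers onto a GPU")
    return int(1 / share)  # int() truncates, which floors positive values


def total_resources(config, driver_cpus=1):
    """Rough CPU/GPU total requested by one trainer.

    `driver_cpus=1` and the per-worker fallbacks are assumptions of this sketch.
    """
    workers = config.get("num_workers", 0)
    cpus = driver_cpus + workers * config.get("num_cpus_per_worker", 1)
    gpus = config.get("num_gpus", 0) + workers * config.get("num_gpus_per_worker", 0)
    return {"CPU": cpus, "GPU": gpus}
```

With num_gpus set to 0.2, trainers_per_gpu returns 5, which matches the DQN example in the text.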

Common Parameters

The following is a list of the common algorithm hyperparameters:

COMMON_CONFIG = {
    # === Debugging ===
    # Whether to write episode stats and videos to the agent log dir
    "monitor": False,
    # Set the ray.rllib.* log level for the agent process and its workers.
    # Should be one of DEBUG, INFO, WARN, or ERROR. The DEBUG level will also
    # periodically print out summaries of relevant internal dataflow (this is
    # also printed out once at startup at the INFO level).
    "log_level": "INFO",
    # Callbacks that will be run during various phases of training. These all
    # take a single "info" dict as an argument. For episode callbacks, custom
    # metrics can be attached to the episode by updating the episode object's
    # custom metrics dict (see examples/custom_metrics_and_callbacks.py). You
    # may also mutate the passed in batch data in your callback.
    "callbacks": {
        "on_episode_start": None,     # arg: {"env": .., "episode": ...}
        "on_episode_step": None,      # arg: {"env": .., "episode": ...}
        "on_episode_end": None,       # arg: {"env": .., "episode": ...}
        "on_sample_end": None,        # arg: {"samples": .., "worker": ...}
        "on_train_result": None,      # arg: {"trainer": ..., "result": ...}
        "on_postprocess_traj": None,  # arg: {
                                      #   "agent_id": ..., "episode": ...,
                                      #   "pre_batch": (before processing),
                                      #   "post_batch": (after processing),
                                      #   "all_pre_batches": (other agent ids),
                                      # }
    },
    # Whether to attempt to continue training if a worker crashes.
    "ignore_worker_failures": False,
    # Execute TF loss functions in eager mode. This is currently experimental
    # and only really works with the basic PG algorithm.
    "use_eager": False,

    # === Policy ===
    # Arguments to pass to model. See models/catalog.py for a full list of the
    # available model options.
    "model": MODEL_DEFAULTS,
    # Arguments to pass to the policy optimizer. These vary by optimizer.
    "optimizer": {},

    # === Environment ===
    # Discount factor of the MDP
    "gamma": 0.99,
    # Number of steps after which the episode is forced to terminate. Defaults
    # to `env.spec.max_episode_steps` (if present) for Gym envs.
    "horizon": None,
    # Calculate rewards but don't reset the environment when the horizon is
    # hit. This allows value estimation and RNN state to span across logical
    # episodes denoted by horizon. This only has an effect if horizon != inf.
    "soft_horizon": False,
    # Arguments to pass to the env creator
    "env_config": {},
    # Environment name can also be passed via config
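
To illustrate how a user config overrides these defaults and how a callback receives its single "info" dict, here is a minimal, self-contained sketch. It does not use RLlib itself. DEFAULTS is a trimmed stand-in for the listing above, merge_config and FakeEpisode are hypothetical names, and custom_metrics is assumed to be the attribute behind the "custom metrics dict" mentioned in the comments.

```python
import copy

# A trimmed stand-in for the defaults listed above (only a few keys).
DEFAULTS = {
    "gamma": 0.99,
    "horizon": None,
    "env_config": {},
    "callbacks": {"on_episode_start": None, "on_episode_end": None},
}


def merge_config(defaults, overrides):
    """Return a new config: defaults updated by overrides, merging nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            raise KeyError(f"unknown config key: {key}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key].update(value)  # one level of nesting is enough here
        else:
            merged[key] = value
    return merged


class FakeEpisode:
    """Hypothetical stand-in for the episode object passed to callbacks."""

    def __init__(self, length):
        self.length = length
        self.custom_metrics = {}  # assumed attribute name


def on_episode_end(info):
    """Callback that takes a single "info" dict and records a custom metric."""
    episode = info["episode"]
    episode.custom_metrics["episode_length"] = episode.length


config = merge_config(
    DEFAULTS, {"gamma": 0.95, "callbacks": {"on_episode_end": on_episode_end}}
)

# Simulate what a trainer would do at the end of an episode.
episode = FakeEpisode(length=42)
config["callbacks"]["on_episode_end"]({"env": None, "episode": episode})
print(episode.custom_metrics)
```

Rejecting unknown top-level keys is a design choice of this sketch: it catches misspelled hyperparameter names early instead of silently ignoring them.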
