[Ray] [RLlib] Algorithms

RLlib provides a wide range of reinforcement learning algorithms, including the offline algorithms BC, CQL, and MARWIL, as well as on-policy algorithms such as APPO and PPO. These algorithms target different settings, such as behavior cloning, conservative Q-learning, and multi-agent learning. RLlib also supports model-based and model-free RL, as well as parameter sharing and fully independent learning for multi-agent setups.

Note

From Ray 2.6.0 onwards, RLlib is adopting a new stack for training and model customization, gradually replacing the ModelV2 API and some convoluted parts of the Policy API with the RLModule API. Click here for details.

Algorithms

Check out the environments page to learn more about different environment types.

1 Available Algorithms - Overview

| Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support | Multi-GPU |
| --- | --- | --- | --- | --- | --- | --- |
| APPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| BC | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| CQL | tf + torch | No | Yes | No |  | tf + torch |
| DreamerV3 | tf | Yes | Yes | No | +RNN (GRU-based by default) | tf |
| DQN, Rainbow | tf + torch | Yes +parametric | No | Yes |  | tf + torch |
| IMPALA | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| MARWIL | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| PPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| SAC | tf + torch | Yes | Yes | Yes |  | torch |
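Every algorithm in the table exposes a matching *Config class and follows the same configure-build-train pattern used in the BC and CQL examples further below. The following is a minimal sketch with PPO; CartPole-v1 is chosen here only as a convenient discrete-action environment and is not one of the tuned examples.

from ray.rllib.algorithms.ppo import PPOConfig

# Configure the algorithm: environment, framework (see the Frameworks column),
# and a couple of common training hyperparameters.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .training(lr=5e-5, gamma=0.99)
)

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build()
print(algo.train())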

Multi-Agent only Methods

| Algorithm | Frameworks / Discrete Actions / Continuous Actions / Multi-Agent / Model Support |
| --- | --- |
| Parameter Sharing | Depends on bootstrapped algorithm |
| Fully Independent Learning | Depends on bootstrapped algorithm |
| Shared Critic Methods | Depends on bootstrapped algorithm |
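Parameter sharing and fully independent learning are not separate algorithm classes; they fall out of RLlib's generic multi_agent() configuration applied to whichever algorithm you bootstrap them with (PPO in the sketch below). This is a minimal sketch that assumes a hypothetical registered multi-agent environment "my_multi_agent_env" with agent IDs "agent_0" and "agent_1".

from ray.rllib.algorithms.ppo import PPOConfig

# Parameter sharing: map every agent ID to one shared policy.
shared_config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # hypothetical env name
    .multi_agent(
        policies={"shared_policy"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
)

# Fully independent learning: each agent ID maps to its own policy.
independent_config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # hypothetical env name
    .multi_agent(
        policies={"agent_0", "agent_1"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id,
    )
)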

2 Offline

Behavior Cloning (BC; derived from MARWIL implementation)

[paper] [implementation]

Our behavioral cloning implementation is directly derived from our MARWIL implementation, with the only difference being that the beta parameter is force-set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. BC requires the offline datasets API to be used.

Tuned examples: CartPole-v1

BC-specific configs (see also common configs):

class ray.rllib.algorithms.bc.bc.BCConfig(algo_class=None)

Defines a configuration class from which a new BC Algorithm can be built.

from ray.rllib.algorithms.bc import BCConfig
# Run this from the ray directory root.
config = BCConfig().training(lr=0.00001, gamma=0.99)
config = config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json")

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build()
algo.train()

from ray.rllib.algorithms.bc import BCConfig
from ray import tune
config = BCConfig()
# Print out some default values.
print(config.beta)
# Update the config object.
config.training(
    lr=tune.grid_search([0.001, 0.0001]), beta=0.75
)
# Set the config object's data path.
# Run this from the ray directory root.
config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json"
)
# Set the config object's env, used for evaluation.
config.environment(env="CartPole-v1")
# Use to_dict() to get the old-style python config dict
# when running with tune.
tune.Tuner(
    "BC",
    param_space=config.to_dict(),
).fit()

training(*, beta: float | None = <ray.rllib.utils.from_config._NotProvided object>, bc_logstd_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_update_rate: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_start: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) → MARWILConfig

Sets the training-related configuration.

Parameters:

  • beta – Scaling of advantages in exponential terms. When beta is 0.0, MARWIL is reduced to behavior cloning (imitation learning); see the bc.py algorithm in this same directory (a rough sketch of this weighting appears below).

  • bc_logstd_coeff – A coefficient to encourage higher action distribution entropy for exploration.

  • moving_average_sqd_adv_norm_start – Starting value for the squared moving average advantage norm (c^2).

  • moving_average_sqd_adv_norm_update_rate – Update rate for the squared moving average advantage norm (c^2).

  • vf_coeff – Balancing value estimation loss and policy optimization loss.

  • grad_clip – If specified, clip the global norm of gradients by this amount.

Returns:

This updated AlgorithmConfig object.
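To make the role of beta and the c^2 terms above concrete, here is a rough sketch of the MARWIL objective as described in the MARWIL paper (a paraphrase for intuition, not the literal RLlib loss code): the log-likelihood of each dataset action is weighted exponentially by its estimated advantage, normalized by c, whose squared moving average is what the moving_average_sqd_adv_norm_* settings track.

$$
\max_{\theta}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\exp\!\Big(\beta\,\tfrac{\hat{A}(s,a)}{c}\Big)\,\log \pi_{\theta}(a\mid s)\Big]
$$

With beta = 0 the exponential weight becomes 1 for every sample, and the objective reduces to plain behavior cloning, which is exactly how the BC algorithm above is derived from MARWIL.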

Conservative Q-Learning (CQL)

In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly-optimistic Q-values. This conservative correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).
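As a sketch of what "adding a simple Q regularizer loss to the standard Bellman update loss" looks like, the CQL(H) critic objective from the CQL paper can be written as below. The notation follows the paper rather than RLlib internals: D is the offline dataset, π̂_β the behavior policy that generated it, B̂^π the empirical Bellman operator, and α the regularizer weight.

$$
\min_{Q}\;\alpha\,\mathbb{E}_{s\sim\mathcal{D}}\Big[\log\sum_{a}\exp Q(s,a)\;-\;\mathbb{E}_{a\sim\hat{\pi}_{\beta}(a\mid s)}\big[Q(s,a)\big]\Big]
\;+\;\tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]
$$

The first term pushes Q-values down on actions the current policy would pick while keeping them up on actions actually present in the dataset; the second term is the usual Bellman error, which is why the correction can sit on top of any off-policy Q-learning method such as SAC.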

RLlib’s CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC and CQL configs is the bc_iters parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the D4RL benchmark, which has pre-collected offline datasets for many types of environments.

Tuned examples: HalfCheetah Random, Hopper Random

CQL-specific configs (see also common configs):

class ray.rllib.algorithms.cql.cql.CQLConfig(algo_class=None)

Defines a configuration class from which a CQL Algorithm can be built.

from ray.rllib.algorithms.cql import CQLConfig
config = CQLConfig().training(gamma=0.9, lr=0.01)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=4)
print(config.to_dict())
# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build(env="CartPole-v1")
algo.train()
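The example above trains against a live CartPole-v1 environment purely for illustration; in the offline setting described earlier you would instead point the config at a pre-collected dataset via the offline-data API and, if desired, set bc_iters for the Behavior Cloning warm-start. A hedged sketch follows; the dataset path and the bc_iters value are illustrative placeholders, not tuned settings.

from ray.rllib.algorithms.cql import CQLConfig

config = (
    CQLConfig()
    .training(bc_iters=20000)  # illustrative number of BC warm-start gradient steps
    .offline_data(input_="/path/to/offline/dataset.json")  # hypothetical dataset path
)
# Then build and train as in the example above.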

training(*, bc_iters: int | None = <ray.rllib.utils.from_config._NotProvided object>, temperature: float | None = <ray.rllib.utils.from_config._NotProvided object>, num_actions: int | None = <ray.rllib.utils.from_config._NotProvided object>, lagrangian: bool | None = <ray.rllib.utils.from_config._NotProvided object>, lagrangian_thresh: float | None = <ray.rllib.utils.from_config._NotProvided object>, min_q_weight: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) → CQLConfig

Sets the training-related configuration.

Parameters:

  • bc_iters – Number of iterations with Behavior Cloning pretraining.

  • temperature – CQL loss temperature.

  • num_actions – Number of actions to sample for the CQL loss.

  • lagrangian – Whether to use the Lagrangian for Alpha Prime (in the CQL loss).

  • lagrangian_thresh – Lagrangian threshold.

  • min_q_weight – Multiplier applied to the conservative (min-Q) regularizer term in the CQL loss.

Returns:

This updated AlgorithmConfig object.

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
