[Ray] [RLlib] Algorithms

RLlib provides a wide range of reinforcement learning algorithms, including the offline algorithms BC, CQL, and MARWIL, as well as on-policy algorithms such as APPO and PPO. These algorithms target different settings, such as behavior cloning, conservative Q-learning, and multi-agent learning. RLlib also supports model-based and model-free RL, as well as parameter sharing and fully independent learning for multi-agent setups.

Note

From Ray 2.6.0 onwards, RLlib is adopting a new stack for training and model customization, gradually replacing the ModelV2 API and some convoluted parts of the Policy API with the RLModule API. Click here for details.

Algorithms

Check out the environments page to learn more about different environment types.

1 Available Algorithms - Overview

| Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support | Multi-GPU |
| --- | --- | --- | --- | --- | --- | --- |
| APPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| BC | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| CQL | tf + torch | No | Yes | No |  | tf + torch |
| DreamerV3 | tf | Yes | Yes | No | +RNN (GRU-based by default) | tf |
| DQN, Rainbow | tf + torch | Yes +parametric | No | Yes |  | tf + torch |
| IMPALA | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| MARWIL | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| PPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| SAC | tf + torch | Yes | Yes | Yes |  | torch |
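Every algorithm in the table exposes a matching *Config class and follows the same configure-build-train pattern used in the BC and CQL examples further below. The following is a minimal sketch with PPO; CartPole-v1 is chosen here only as a convenient discrete-action environment and is not one of the tuned examples.

from ray.rllib.algorithms.ppo import PPOConfig

# Configure the algorithm: environment, framework (see the Frameworks column),
# and a couple of common training hyperparameters.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .training(lr=5e-5, gamma=0.99)
)

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build()
print(algo.train())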

Multi-Agent only Methods

| Algorithm | Frameworks / Discrete Actions / Continuous Actions / Multi-Agent / Model Support |
| --- | --- |
| Parameter Sharing | Depends on bootstrapped algorithm |
| Fully Independent Learning | Depends on bootstrapped algorithm |
| Shared Critic Methods | Depends on bootstrapped algorithm |
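Parameter sharing and fully independent learning are not separate algorithm classes; they fall out of RLlib's generic multi_agent() configuration applied to whichever algorithm you bootstrap them with (PPO in the sketch below). This is a minimal sketch that assumes a hypothetical registered multi-agent environment "my_multi_agent_env" with agent IDs "agent_0" and "agent_1".

from ray.rllib.algorithms.ppo import PPOConfig

# Parameter sharing: map every agent ID to one shared policy.
shared_config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # hypothetical env name
    .multi_agent(
        policies={"shared_policy"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
)

# Fully independent learning: each agent ID maps to its own policy.
independent_config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # hypothetical env name
    .multi_agent(
        policies={"agent_0", "agent_1"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id,
    )
)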

2 Offline

Behavior Cloning (BC; derived from MARWIL implementation)

[paper] [implementation]

Our behavioral cloning implementation is directly derived from our MARWIL implementation, with the only difference being that the beta parameter is force-set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. BC requires the offline datasets API to be used.

Tuned examples: CartPole-v1

BC-specific configs (see also common configs):

class ray.rllib.algorithms.bc.bc.BCConfig(algo_class=None)

Defines a configuration class from which a new BC Algorithm can be built.

from ray.rllib.algorithms.bc import BCConfig
# Run this from the ray directory root.
config = BCConfig().training(lr=0.00001, gamma=0.99)
config = config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json")

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build()
algo.train()

from ray.rllib.algorithms.bc import BCConfig
from ray import tune
config = BCConfig()
# Print out some default values.
print(config.beta)
# Update the config object.
config.training(
    lr=tune.grid_search([0.001, 0.0001]), beta=0.75
)
# Set the config object's data path.
# Run this from the ray directory root.
config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json"
)
# Set the config object's env, used for evaluation.
config.environment(env="CartPole-v1")
# Use to_dict() to get the old-style python config dict
# when running with tune.
tune.Tuner(
    "BC",
    param_space=config.to_dict(),
).fit()

training(*, beta: float | None = <ray.rllib.utils.from_config._NotProvided object>, bc_logstd_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_update_rate: float | None = <ray.rllib.utils.from_config._NotProvided object>, moving_average_sqd_adv_norm_start: float | None = <ray.rllib.utils.from_config._NotProvided object>, vf_coeff: float | None = <ray.rllib.utils.from_config._NotProvided object>, grad_clip: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) → MARWILConfig

Sets the training-related configuration.

Parameters:

  • beta – Scaling of advantages in exponential terms. When beta is 0.0, MARWIL is reduced to behavior cloning (imitation learning); see the bc.py algorithm in this same directory (a rough sketch of this weighting appears below).

  • bc_logstd_coeff – A coefficient to encourage higher action distribution entropy for exploration.

  • moving_average_sqd_adv_norm_start – Starting value for the squared moving average advantage norm (c^2).

  • moving_average_sqd_adv_norm_update_rate – Update rate for the squared moving average advantage norm (c^2).

  • vf_coeff – Balancing value estimation loss and policy optimization loss.

  • grad_clip – If specified, clip the global norm of gradients by this amount.

Returns:

This updated AlgorithmConfig object.
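To make the role of beta and the c^2 terms above concrete, here is a rough sketch of the MARWIL objective as described in the MARWIL paper (a paraphrase for intuition, not the literal RLlib loss code): the log-likelihood of each dataset action is weighted exponentially by its estimated advantage, normalized by c, whose squared moving average is what the moving_average_sqd_adv_norm_* settings track.

$$
\max_{\theta}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\exp\!\Big(\beta\,\tfrac{\hat{A}(s,a)}{c}\Big)\,\log \pi_{\theta}(a\mid s)\Big]
$$

With beta = 0 the exponential weight becomes 1 for every sample, and the objective reduces to plain behavior cloning, which is exactly how the BC algorithm above is derived from MARWIL.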

Conservative Q-Learning (CQL)

In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly-optimistic Q-values. This conservative correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).
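As a sketch of what "adding a simple Q regularizer loss to the standard Bellman update loss" looks like, the CQL(H) critic objective from the CQL paper can be written as below. The notation follows the paper rather than RLlib internals: D is the offline dataset, π̂_β the behavior policy that generated it, B̂^π the empirical Bellman operator, and α the regularizer weight.

$$
\min_{Q}\;\alpha\,\mathbb{E}_{s\sim\mathcal{D}}\Big[\log\sum_{a}\exp Q(s,a)\;-\;\mathbb{E}_{a\sim\hat{\pi}_{\beta}(a\mid s)}\big[Q(s,a)\big]\Big]
\;+\;\tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]
$$

The first term pushes Q-values down on actions the current policy would pick while keeping them up on actions actually present in the dataset; the second term is the usual Bellman error, which is why the correction can sit on top of any off-policy Q-learning method such as SAC.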

RLlib’s CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC and CQL configs is the bc_iters parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the D4RL benchmark, which has pre-collected offline datasets for many types of environments.

Tuned examples: HalfCheetah Random, Hopper Random

CQL-specific configs (see also common configs):

class ray.rllib.algorithms.cql.cql.CQLConfig(algo_class=None)

Defines a configuration class from which a CQL Algorithm can be built.

from ray.rllib.algorithms.cql import CQLConfig
config = CQLConfig().training(gamma=0.9, lr=0.01)
config = config.resources(num_gpus=0)
config = config.rollouts(num_rollout_workers=4)
print(config.to_dict())
# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build(env="CartPole-v1")
algo.train()
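The example above trains against a live CartPole-v1 environment purely for illustration; in the offline setting described earlier you would instead point the config at a pre-collected dataset via the offline-data API and, if desired, set bc_iters for the Behavior Cloning warm-start. A hedged sketch follows; the dataset path and the bc_iters value are illustrative placeholders, not tuned settings.

from ray.rllib.algorithms.cql import CQLConfig

config = (
    CQLConfig()
    .training(bc_iters=20000)  # illustrative number of BC warm-start gradient steps
    .offline_data(input_="/path/to/offline/dataset.json")  # hypothetical dataset path
)
# Then build and train as in the example above.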

training(*, bc_iters: int | None = <ray.rllib.utils.from_config._NotProvided object>, temperature: float | None = <ray.rllib.utils.from_config._NotProvided object>, num_actions: int | None = <ray.rllib.utils.from_config._NotProvided object>, lagrangian: bool | None = <ray.rllib.utils.from_config._NotProvided object>, lagrangian_thresh: float | None = <ray.rllib.utils.from_config._NotProvided object>, min_q_weight: float | None = <ray.rllib.utils.from_config._NotProvided object>, **kwargs) → CQLConfig

Sets the training-related configuration.

Parameters:

  • bc_iters – Number of iterations with Behavior Cloning pretraining.

  • temperature – CQL loss temperature.

  • num_actions – Number of actions to sample for the CQL loss.

  • lagrangian – Whether to use the Lagrangian for Alpha Prime (in the CQL loss).

  • lagrangian_thresh – Lagrangian threshold.

  • min_q_weight – Multiplier applied to the conservative (min-Q) regularizer term in the CQL loss.

Returns:

This updated AlgorithmConfig object.

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
