RLlib Concepts and Building Custom Algorithms

This article describes the internal concepts used to implement algorithms in RLlib.

1. Policies

Policy classes encapsulate the core numerical components of RL algorithms. This typically includes the policy model that determines which actions to take, a trajectory postprocessor for experiences, and a loss function for improving the policy given the postprocessed experiences. For a simple example, see the policy gradient policy definition.
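As a reminder of what such a loss can look like, here is a one-line numpy sketch of the vanilla policy gradient (REINFORCE) objective; logp and returns are hypothetical arrays of per-step action log-probabilities and returns, whereas the real policy gradient policy builds the equivalent expression with framework tensors:

import numpy as np

def policy_gradient_loss(logp, returns):
    # maximize E[log pi(a|s) * R] by minimizing its negative
    return -np.mean(np.asarray(logp) * np.asarray(returns))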
Most interaction with deep learning frameworks is isolated to the Policy interface, which allows RLlib to support multiple frameworks. To simplify the definition of policies, RLlib includes Tensorflow- and PyTorch-specific templates. You can also write your own from scratch. Here is an example:

from ray.rllib.policy.policy import Policy


class CustomPolicy(Policy):
    """Example of a custom policy written from scratch.

    You might find it more convenient to use the `build_tf_policy` and
    `build_torch_policy` helpers instead for a real policy, which are
    described in the next sections.
    """

    def __init__(self, observation_space, action_space, config):
        Policy.__init__(self, observation_space, action_space, config)
        # example parameter
        self.w = 1.0

    def compute_actions(self,
                        obs_batch,
                        state_batches,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        **kwargs):
        # return action batch, RNN states, extra values to include in batch
        return [self.action_space.sample() for _ in obs_batch], [], {}

    def learn_on_batch(self, samples):
        # implement your learning code here
        return {}  # return stats

    def get_weights(self):
        return {"w": self.w}

    def set_weights(self, weights):
        self.w = weights["w"]
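To get a feel for the interface, here is a minimal usage sketch. It assumes gym is installed; the spaces, config, and batch below are illustrative only, not something RLlib requires:

import gym
import numpy as np

# illustrative observation and action spaces
obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = gym.spaces.Discrete(2)

policy = CustomPolicy(obs_space, act_space, config={})

# compute actions for a batch of two observations (no RNN state needed here)
obs_batch = np.stack([obs_space.sample(), obs_space.sample()])
actions, rnn_states, extra_info = policy.compute_actions(obs_batch, state_batches=[])

# weights can be exported and restored, e.g., to sync policy copies
policy.set_weights(policy.get_weights())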

When run, the basic policy above will produce batches of observations with the basic obs, new_obs, actions, rewards, dones, and infos columns.
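Continuing the sketch above, a training batch handed to learn_on_batch can be pictured as one array per column. The dict below is purely illustrative, with made-up values for a 3-step rollout (RLlib actually passes a SampleBatch object, which behaves like a dict):

import numpy as np

fake_batch = {
    "obs": np.zeros((3, 4)),
    "new_obs": np.zeros((3, 4)),
    "actions": np.array([0, 1, 0]),
    "rewards": np.array([1.0, 0.0, 1.0]),
    "dones": np.array([False, False, True]),
    "infos": np.array([{}, {}, {}]),
}

stats = policy.learn_on_batch(fake_batch)  # returns the stats dict, {} here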

There are also two mechanisms for passing along and emitting extra information:

Policy recurrent state: Suppose you wanted to compute actions based on the current timestep of the episode. While it is possible to have the environment provide this as part of the observation, we can instead compute and store it as part of the Policy recurrent state:

def get_initial_state(self):
    """Returns initial RNN state for the current policy."""
    return [0]  # list of single state element (t=0)
                # you could also return multiple values, e.g., [0, "foo"]

def compute_actions(self,
                    obs_batch,
                    state_batches,
                    prev_action_batch=None,
                    prev_reward_batch=None,
                    info_batch=None,
                    episodes=None,
                    **kwargs):
    assert len(state_batches) == len(self.get_initial_state())
    new_state_batches = [[
        t + 1 for t in state_batches[0]
    ]]
    return ..., new_state_batches, {}

def learn_on_batch(self, samples):
    # can access array of the state elements at each timestep
    # or state_in_1, 2, etc. if there are multiple state elements
    assert "state_in_0" in samples.keys()
    assert "state_out_0" in samples.keys()

Extra action info output: You can also emit extra outputs at each step that will be available for learning on. For example, you might want to output the behaviour policy logits as extra action info, which can be used for importance weighting, but in general arbitrary values can be stored here (as long as they are convertible to numpy arrays):

def compute_actions(self,
                    obs_batch,
                    state_batches,
                    prev_action_batch=None,
                    prev_reward_batch=None,
                    info_batch=None,
                    episodes=None,
                    **kwargs):
    action_info_batch = {
        "some_value": ["foo" for _ in obs_batch],
        "other_value": [12345 for _ in obs_batch],
    }
    return ..., [], action_info_batch

def learn_on_batch(self, samples):
    # can access array of the extra values at each timestep
    assert "some_value" in samples.keys()
    return {}  # return stats
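To make the importance-weighting use case concrete, here is a small numpy sketch; the importance_weights helper and the "behaviour_logp" key are hypothetical, not part of RLlib. If the sampling policy stores its action log-probabilities as extra action info, the learner can later reweight samples by the likelihood ratio between the current policy and the behaviour policy:

import numpy as np

def importance_weights(current_logp, behaviour_logp):
    # ratio pi_current(a|s) / pi_behaviour(a|s), computed from log-probabilities
    return np.exp(np.asarray(current_logp) - np.asarray(behaviour_logp))

# e.g., inside learn_on_batch, assuming "behaviour_logp" was emitted at sampling time:
# weights = importance_weights(new_logp, samples["behaviour_logp"])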