This page describes the internal concepts used to implement algorithms in RLlib.
1. Policies
Policy classes encapsulate the core numerical components of RL algorithms. This typically includes the policy model that determines which actions to take, a trajectory postprocessor for experiences, and a loss function for improving the policy given the postprocessed experiences. For a simple example, see the policy gradients policy definition.
Most interaction with deep learning frameworks is isolated to the Policy interface, which allows RLlib to support multiple frameworks. To simplify the definition of policies, RLlib includes Tensorflow- and PyTorch-specific templates. You can also write your own from scratch. Here is an example:
class CustomPolicy(Policy):
    """Example of a custom policy written from scratch.

    You might find it more convenient to use the `build_tf_policy` and
    `build_torch_policy` helpers instead for a real policy, which are
    described in the next sections.
    """

    def __init__(self, observation_space, action_space, config):
        Policy.__init__(self, observation_space, action_space, config)
        # example parameter
        self.w = 1.0

    def compute_actions(self,
                        obs_batch,
                        state_batches,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        **kwargs):
        # return action batch, RNN states, extra values to include in batch
        return [self.action_space.sample() for _ in obs_batch], [], {}

    def learn_on_batch(self, samples):
        # implement your learning code here
        return {}  # return stats

    def get_weights(self):
        return {"w": self.w}

    def set_weights(self, weights):
        self.w = weights["w"]
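To see how the pieces fit together, here is a minimal, self-contained sketch of using such a policy. Note that the `Policy` base class and the `DiscreteSpace` action space below are simplified stand-ins written for illustration, not RLlib's real classes:

```python
import random


class Policy:
    """Simplified stand-in for RLlib's Policy base class (illustration only)."""

    def __init__(self, observation_space, action_space, config):
        self.observation_space = observation_space
        self.action_space = action_space
        self.config = config


class DiscreteSpace:
    """Toy action space providing the `sample()` method the example relies on."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)


class CustomPolicy(Policy):
    def __init__(self, observation_space, action_space, config):
        Policy.__init__(self, observation_space, action_space, config)
        self.w = 1.0  # example parameter

    def compute_actions(self, obs_batch, state_batches, **kwargs):
        # one random action per observation; no RNN state, no extra info
        return [self.action_space.sample() for _ in obs_batch], [], {}

    def get_weights(self):
        return {"w": self.w}

    def set_weights(self, weights):
        self.w = weights["w"]


policy = CustomPolicy(None, DiscreteSpace(4), {})
actions, states, info = policy.compute_actions(
    obs_batch=[0, 1, 2], state_batches=[])
print(len(actions))  # one action per observation -> 3
```

The key contract is the return shape of `compute_actions`: a batch of actions (one per observation), a list of RNN state batches, and a dict of extra per-step outputs.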
When run, the basic policy above will produce batches of experiences with the basic `obs`, `new_obs`, `actions`, `rewards`, `dones`, and `infos` columns. There are two additional mechanisms for passing along and emitting extra information:
Policy recurrent state: Suppose you want to compute actions based on the current timestep of the episode. While it is possible to have the environment provide this as part of the observation, we can instead compute and store it as part of the Policy recurrent state:
def get_initial_state(self):
    """Returns initial RNN state for the current policy."""
    return [0]  # list of single state element (t=0)
    # you could also return multiple values, e.g., [0, "foo"]

def compute_actions(self,
                    obs_batch,
                    state_batches,
                    prev_action_batch=None,
                    prev_reward_batch=None,
                    info_batch=None,
                    episodes=None,
                    **kwargs):
    assert len(state_batches) == len(self.get_initial_state())
    new_state_batches = [[
        t + 1 for t in state_batches[0]
    ]]
    return ..., new_state_batches, {}

def learn_on_batch(self, samples):
    # can access array of the state elements at each timestep
    # or state_in_1, 2, etc. if there are multiple state elements
    assert "state_in_0" in samples.keys()
    assert "state_out_0" in samples.keys()
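To illustrate how this state is threaded through a rollout, here is a standalone sketch of what the sampler does with it. `TimestepPolicy` below is a plain class written for illustration, not a real RLlib Policy:

```python
class TimestepPolicy:
    """Illustration of the recurrent-state pattern: the policy counts
    timesteps itself instead of reading them from the observation."""

    def get_initial_state(self):
        return [0]  # single state element: the timestep t=0

    def compute_actions(self, obs_batch, state_batches, **kwargs):
        assert len(state_batches) == len(self.get_initial_state())
        # advance the timestep counter for every item in the batch
        new_state_batches = [[t + 1 for t in state_batches[0]]]
        actions = [0 for _ in obs_batch]  # dummy actions
        return actions, new_state_batches, {}


policy = TimestepPolicy()
# the sampler seeds each episode's state from get_initial_state() ...
state = [[s for _ in range(2)] for s in policy.get_initial_state()]  # batch of 2
for _ in range(3):
    # ... and feeds each step's state_out back in as the next state_in
    actions, state, _ = policy.compute_actions(
        obs_batch=["obs1", "obs2"], state_batches=state)
print(state)  # [[3, 3]] after three steps
```

The state returned from one call becomes the `state_batches` input of the next call, which is also how the `state_in_0` / `state_out_0` columns seen in `learn_on_batch` are produced.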
Extra action info output: You can also emit extra outputs at each step, to be used for learning later. For example, you might want to output the behavior policy logits as extra action info, which can be used for importance weighting; in general, arbitrary values can be stored here (as long as they are convertible to numpy arrays):
def compute_actions(self,
                    obs_batch,
                    state_batches,
                    prev_action_batch=None,
                    prev_reward_batch=None,
                    info_batch=None,
                    episodes=None,
                    **kwargs):
    action_info_batch = {
        "some_value": ["foo" for _ in obs_batch],
        "other_value": [12345 for _ in obs_batch],
    }
    return ..., [], action_info_batch
def learn_on_batch(self, samples):
    # can access array of the extra values at each timestep
    assert "some_value" in samples.keys()
    assert "other_value" in samples.keys()
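Here is a standalone sketch of how those extra outputs end up as columns in the batch seen by `learn_on_batch`. The dict-based `samples` below is an illustration, not RLlib's actual SampleBatch class:

```python
import numpy as np


def compute_actions(obs_batch):
    """Free-function stand-in for the policy method above."""
    action_info_batch = {
        "some_value": ["foo" for _ in obs_batch],
        "other_value": [12345 for _ in obs_batch],
    }
    actions = [0 for _ in obs_batch]  # dummy actions
    return actions, [], action_info_batch


obs_batch = [1, 2, 3]
actions, _, info = compute_actions(obs_batch)

# the sampler stacks each extra info key into its own column,
# alongside the standard obs/actions columns, before learning sees it
samples = {
    "obs": np.asarray(obs_batch),
    "actions": np.asarray(actions),
    "some_value": np.asarray(info["some_value"]),
    "other_value": np.asarray(info["other_value"]),
}
assert "some_value" in samples.keys()
assert "other_value" in samples.keys()
print(samples["other_value"])  # [12345 12345 12345]
```

This is why the values must be convertible to numpy arrays: each extra info key becomes one array column of the sample batch, with one entry per timestep.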