Reinforcement Learning Policy Gradients Review: SOTA (Part 2)
Policy Gradient SOTA
This part mainly follows the order of Lecture 9 of Bolei Zhou's course.
Main reference: Intro to Reinforcement Learning, Bolei Zhou
Code for this article: https://github.com/ThousandOfWind/RL-basic-alg.git
Also referenced pytorch a3c and another version of it.
Distributed Actor-Learner
A2C
A2C itself is not distributed yet; the distributed variants come later. Starting from QAC, we make two changes:
advantage & lambda return
Being a bit lazy here, I copied this directly from the PPO implementation.
# advantage, computed backwards over the trajectory
advantage = th.zeros_like(reward)
returns = th.zeros_like(reward)
deltas = th.zeros_like(reward)
pre_return = 0
pre_value = 0
pre_advantage = 0
for i in range(advantage.shape[0] - 1, -1, -1):
    # discounted return, used as the critic target
    returns[i] = reward[i] + self.gamma * pre_return
    # one-step TD error
    deltas[i] = reward[i] + self.gamma * pre_value - value[i]
    # lambda advantage (GAE): discounted, lambda-weighted sum of TD errors
    advantage[i] = deltas[i] + self.gamma * self.lamda * pre_advantage
    pre_return = returns[i]
    pre_value = value[i]
    pre_advantage = advantage[i]
Maximum entropy
The second change adds an entropy bonus to the objective to encourage exploration:
# entropy of the policy at each state: -sum_a pi(a|s) * log pi(a|s)
entropies = -(log_pi * log_pi.exp()).sum(dim=1, keepdim=True)
# maximise advantage-weighted log-probability plus the entropy bonus, hence the leading minus
J = - (advantage.detach() * log_pi + entropies).mean()
batched A2C
Multi-threaded with synchronous updates; the efficiency is still fairly low.
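A minimal sketch of the synchronous collection step is below, assuming a hypothetical list envs of environments whose step returns (next_obs, reward) and an agent.act that acts on a whole batch of observations; these names are illustrative and do not come from the repo above.

import torch as th

def collect_batch(envs, agent, rollout_len):
    # step all environments in lockstep; returns tensors shaped [rollout_len, num_envs, ...]
    obs = th.stack([th.as_tensor(env.reset(), dtype=th.float32) for env in envs])
    obs_buf, act_buf, rew_buf = [], [], []
    for _ in range(rollout_len):
        actions = agent.act(obs)  # one forward pass for the whole batch of environments
        outs = [env.step(a.item()) for env, a in zip(envs, actions)]
        obs_buf.append(obs)
        act_buf.append(actions)
        rew_buf.append(th.tensor([r for _, r in outs], dtype=th.float32))
        obs = th.stack([th.as_tensor(o, dtype=th.float32) for o, _ in outs])
    # every environment then waits for one synchronous gradient update before continuing
    return th.stack(obs_buf), th.stack(act_buf), th.stack(rew_buf)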
A3C (Asynchronous Advantage Actor-Critic)
The only thing A3C adds over A2C is the asynchronous part:
- Run multiple actors in parallel on CPU, each interacting with its own local environment
- Because the multiple actors provide diverse experience, no replay buffer is needed
- Each worker computes gradients locally and applies them to the shared global network (a sketch of one worker's loop follows)
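Putting the pieces together, a single worker's loop roughly looks like the sketch below. rollout and compute_loss are placeholders for illustration; the actual repo splits this logic across the worker methods shown in the following snippets.

def worker_loop(worker, share_model, env, max_episodes):
    for episode in range(max_episodes):
        # 1. pull the latest shared parameters into the local network
        worker.ac.load_state_dict(share_model.state_dict())
        # 2. interact with the local environment using the local network
        trajectory = rollout(worker.ac, env)        # placeholder
        # 3. actor-critic loss (advantage + entropy, as in A2C) on the local trajectory
        loss = compute_loss(worker.ac, trajectory)  # placeholder
        # 4. backprop locally, push gradients to the shared model, then step the shared optimiser
        worker.optimiser.zero_grad()
        loss.backward()
        worker.ensure_shared_grads()
        worker.optimiser.step()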
worker
First, when a worker is initialised, the shared network that is actually being updated also has to be passed in:
def __init__(self, param_set, writer, share_model: DNNAgent, optimizer):
    self.obs_shape = param_set['obs_shape'][0]
    self.gamma = param_set['gamma']
    self.learning_rate = param_set['learning_rate']
    self.clone_share_model = param_set['clone_share_model']
    self.id = param_set['worker_id']
    # local copy of the shared network; rollouts and backprop happen on this copy
    self.ac = copy.deepcopy(share_model)
    self.soft_clone = param_set['soft_clone']
    if self.soft_clone:
        self.tau = param_set['tau']
    # keep the shared parameters as a list so they can be iterated more than once
    self.params = list(share_model.parameters())
    self.optimiser = optimizer
When backpropagating, the local gradients have to be copied onto the global network:
self.optimiser.zero_grad()
loss.backward()
grad_norm = th.nn.utils.clip_grad_norm_(self.params, 10)
# copy the local gradients onto the shared parameters, then step the shared optimiser
self.ensure_shared_grads()
self.optimiser.step()
using the following helper:
def ensure_shared_grads(self):
    # copy each local gradient onto the corresponding shared parameter;
    # if the shared parameters already hold gradients, leave them untouched
    for param, shared_param in zip(self.ac.parameters(), self.params):
        if shared_param.grad is not None:
            return
        shared_param._grad = param.grad
optimiser
I have not fully understood this part yet; you could probably just use PyTorch's built-in optimizer directly. I will paste a short custom version here for reference:
import torch
from torch import optim

class GlobalAdam(optim.Adam):
    def __init__(self, params, lr):
        super(GlobalAdam, self).__init__(params, lr=lr)
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'] = 0
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)
                # put the Adam statistics into shared memory so that
                # every worker process updates the same buffers
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()
train
writer, param_set = init()
share_model = ac_sharenet(param_set)
processes = []
for i in range(param_set['num_processes']):
    # each process runs one worker together with its own local environment
    p = mp.Process(target=run, args=(writer, param_set, share_model))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
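One detail not visible in the snippet above, noted here based on pytorch a3c: the shared model is typically moved into shared memory with share_model.share_memory() before the worker processes are spawned, and mp refers to torch.multiprocessing; otherwise each worker may end up updating its own private copy of the parameters.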
IMPALA
- Distributed actors and distributed learners
- Actors do not compute gradients; they only collect trajectories and send them to the learners
- Learners exchange gradients with one another
- Importance sampling is used to correct for the policy lag and reuse trajectories, as sketched below
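The correction IMPALA uses is V-trace. The sketch below is a simplified version of the V-trace target computation for a single trajectory (termination masks omitted); the function name, arguments and shapes are assumptions for illustration, not the paper's or any library's API.

import torch as th

def vtrace_targets(rewards, values, bootstrap_value, log_pi, log_mu,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # truncated importance weights between the learner policy pi and the behaviour policy mu
    ratios = (log_pi - log_mu).exp()
    rhos = th.clamp(ratios, max=rho_bar)
    cs = th.clamp(ratios, max=c_bar)

    vs = th.zeros_like(values)
    next_vs = bootstrap_value      # v_{t+1}
    next_value = bootstrap_value   # V(x_{t+1})
    for t in range(rewards.shape[0] - 1, -1, -1):
        # importance-weighted one-step TD error
        delta = rhos[t] * (rewards[t] + gamma * next_value - values[t])
        # v_t = V(x_t) + delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))
        vs[t] = values[t] + delta + gamma * cs[t] * (next_vs - next_value)
        next_vs = vs[t]
        next_value = values[t]
    return vs

The learner then uses vs as the critic target, and the policy gradient is weighted by rho_t * (r_t + gamma * v_{t+1} - V(x_t)), as in the IMPALA paper.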