1. PPO Optimization
For an introduction to PPO and a first hands-on implementation, see my earlier article 强化学习_06_pytorch-PPO实践(Pendulum-v1).
Compared with that earlier PPO, the main optimizations are the following (a code sketch follows the list):

- batch_normalize: normalize the advantages inside the mini_batch function, which speeds up how quickly the model learns from the advantage signal.
- policyNet uses a Beta distribution (outputs in 0~1), with an added MaxMinScale step that maps the Beta samples into the environment's action range.
- Collect data from multiple episodes, compute the advantages episode by episode, then merge everything into a single dataloader to iterate over, which speeds up convergence.
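Each of these changes is only a few lines of code. The sketch below is a minimal, hypothetical illustration, not the repo's actual mini_batch/policyNet API (the names normalize_adv, BetaPolicyHead, max_min_scale and build_loader are mine): it shows per-mini-batch advantage normalization, a Beta policy head whose (0, 1) samples are rescaled to the action range, and merging several episodes into one DataLoader.

import torch
import torch.nn as nn
from torch.distributions import Beta
from torch.utils.data import TensorDataset, DataLoader


def normalize_adv(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize advantages within each mini-batch so the policy-gradient
    # scale stays comparable from batch to batch.
    return (adv - adv.mean()) / (adv.std() + eps)


class BetaPolicyHead(nn.Module):
    # Policy head that parameterizes a Beta(alpha, beta) distribution;
    # its samples lie in (0, 1) and are rescaled to the action range afterwards.
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.alpha_layer = nn.Linear(hidden_dim, action_dim)
        self.beta_layer = nn.Linear(hidden_dim, action_dim)

    def forward(self, h: torch.Tensor) -> Beta:
        # softplus(.) + 1 keeps both concentration parameters above 1,
        # giving a unimodal Beta and avoiding samples piled up at 0 or 1.
        alpha = nn.functional.softplus(self.alpha_layer(h)) + 1.0
        beta = nn.functional.softplus(self.beta_layer(h)) + 1.0
        return Beta(alpha, beta)


def max_min_scale(act01: torch.Tensor, low: float, high: float) -> torch.Tensor:
    # Map a Beta sample from (0, 1) onto [low, high] of the action space.
    return act01 * (high - low) + low


def build_loader(episodes, minibatch_size: int = 128) -> DataLoader:
    # Concatenate several episodes (each assumed to be a dict of tensors with
    # per-episode advantages already computed) into one DataLoader, so the
    # update loop iterates over all collected data at once.
    states = torch.cat([e['states'] for e in episodes])
    actions = torch.cat([e['actions'] for e in episodes])
    advs = torch.cat([e['adv'] for e in episodes])
    returns = torch.cat([e['returns'] for e in episodes])
    return DataLoader(TensorDataset(states, actions, advs, returns),
                      batch_size=minibatch_size, shuffle=True)

In the repo these three steps correspond to the mini_batch function (where the normalization happens), the Beta head inside policyNet, and the min_batch_collate_func created in PPO2.__init__ below.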
1.1 PPO2 Code
For the full code, see Github: PPO2_old.py
import typing as typ
from functools import partial

import torch

# policyNet, valueNet and mini_batch are defined elsewhere in the repo
# (see PPO2_old.py and its imports on Github).


class PPO2:
    """
    PPO2: the clipped-surrogate variant of PPO.
    """
    def __init__(self,
                 state_dim: int,
                 actor_hidden_layers_dim: typ.List,
                 critic_hidden_layers_dim: typ.List,
                 action_dim: int,
                 actor_lr: float,
                 critic_lr: float,
                 gamma: float,
                 PPO_kwargs: typ.Dict,
                 device: torch.device,
                 reward_func: typ.Optional[typ.Callable] = None
                 ):
        dist_type = PPO_kwargs.get('dist_type', 'beta')
        self.dist_type = dist_type
        self.actor = policyNet(state_dim, actor_hidden_layers_dim, action_dim, dist_type=dist_type).to(device)
        self.critic = valueNet(state_dim, critic_hidden_layers_dim).to(device)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        self.gamma = gamma
        self.lmbda = PPO_kwargs['lmbda']
        self.k_epochs = PPO_kwargs['k_epochs']  # number of epochs each collected batch of trajectories is reused for training
        self.eps = PPO_kwargs['eps']  # clipping range of the PPO surrogate objective
        self.sgd_batch_size = PPO_kwargs.get('sgd_batch_size', 512)
        self.minibatch_size = PPO_kwargs.get('minibatch_size', 128)
        self.action_bound = PPO_kwargs.get('action_bound', 1.0)
        self.action_low = -1 * self.action_bound
        self.action_high = self.action_bound
        if 'action_space' in PPO_kwargs:
            self.action_low = PPO_kwargs['action_space'].low
            self.action_high = PPO_kwargs['action_space'].high
        self.count = 0
        self.device = device
        self.reward_func = reward_func
        self.min_batch_collate_func = partial(mini_batch, mini_batch_size=self.minibatch_size)

    def _action_fix(self, act):
        if self.dist_type == 'beta':
            # Beta samples lie in (0, 1); rescale them to [action_low, action_high]
            return act * (self.action_high - self.action_low) + self.action_low
        return act

    def