Coding PPO from Scratch with PyTorch (Part 2/4)


Welcome to Part 2 of our series, where we shall start coding Proximal Policy Optimization (PPO) from scratch with PyTorch. If you haven’t read Part 1, please do so first.

Note that going forward, I will be posting code screenshots rather than GitHub gists because I don’t want you to just copy-paste code (you can just go to the main repository for that). Instead, you are encouraged to follow along this tutorial while coding manually in another window.

We will be following the PPO-clip variant with pseudocode found in OpenAI’s Spinning Up docs and an Actor-Critic Framework. Here’s a picture of the pseudocode:

[Figure: Pseudocode of PPO from OpenAI's Spinning Up docs.]

Initial Thoughts: Only 8 steps? Nice. Since this is pseudocode for a learning algorithm, it might be wise to first design the way our code will flow. This pseudocode looks like it can all fit in one function, which we'll call learn. It also appears that we'll need to write subroutines for many of the steps (e.g. Step 3 basically wants us to roll out a bunch of simulations, so we can define something like rollout later), so it's best to encapsulate everything into a class PPO. This way, to train on an environment, we can first create a PPO object, then simply call learn.

First, let’s set up our PPO class in a file called ppo.py:

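In text form, a minimal sketch of that starting skeleton might look like this (the parameters and bodies get filled in as we work through the steps):

class PPO:
    def __init__(self):
        # Step 1 (networks, hyperparameters) will go here
        pass

    def learn(self):
        # The main loop from the pseudocode (Steps 2 through 8) will go here
        pass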

Cool, pat on the back. Let’s look at Step 1:

[Figure: Step 1 of the pseudocode.]

Here’s where we’ll initialize our actor and critic networks. This means we’ll either need to import a neural network module or write our own. Let’s do the latter; we’ll do something similar to PyTorch’s tutorial on creating a neural network with torch.nn. We’ll create a very basic Feed Forward Neural Network. If you’re not comfortable with neural networks, watch this series.

Let’s set up our neural network module in a new file network.py:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
    def __init__(self):
        super(FeedForwardNN, self).__init__()

We’ll need to define our neural network layers now. We can use a few basic nn.Linear layers, nothing too fancy. We need to define the input and output dimensions, so let’s add some parameters to __init__ to capture that.

def __init__(self, in_dim, out_dim):
    super(FeedForwardNN, self).__init__()

    self.layer1 = nn.Linear(in_dim, 64)
    self.layer2 = nn.Linear(64, 64)
    self.layer3 = nn.Linear(64, out_dim)

Note that I chose 64 arbitrarily; it doesn't matter too much. Our __init__ is done; now we can define a forward function to do a forward pass on our network. We can use ReLU for activation (again, picked arbitrarily). Since we're planning on using this network module to define both our actor and critic, and they both will take in an observation and return either an action or a value, we'll set observation as a parameter. One thing to note is that the input to our network must be a tensor, so we should convert our observation to a tensor first in case it's passed in as a numpy array.

def forward(self, obs):
    # Convert observation to tensor if it's a numpy array
    if isinstance(obs, np.ndarray):
        obs = torch.tensor(obs, dtype=torch.float)

    activation1 = F.relu(self.layer1(obs))
    activation2 = F.relu(self.layer2(activation1))
    output = self.layer3(activation2)

    return output

We are now done with defining our network module; we are ready to define our actor and critic networks. Here's what network.py should look like:

[Figure: Complete network.py code.]
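In text form, the complete file is just the snippets above stitched together:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(FeedForwardNN, self).__init__()

        self.layer1 = nn.Linear(in_dim, 64)
        self.layer2 = nn.Linear(64, 64)
        self.layer3 = nn.Linear(64, out_dim)

    def forward(self, obs):
        # Convert observation to tensor if it's a numpy array
        if isinstance(obs, np.ndarray):
            obs = torch.tensor(obs, dtype=torch.float)

        activation1 = F.relu(self.layer1(obs))
        activation2 = F.relu(self.layer2(activation1))
        output = self.layer3(activation2)

        return output

As a quick sanity check (not in the original), FeedForwardNN(3, 1)(np.zeros(3)) should return a single-element tensor.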

Back to ppo.py; we should be ready to do Step 1 really easily now and define our initial policy (actor) parameters and value function (critic) parameters.

from network import FeedForwardNN

self.actor = FeedForwardNN(

Uh oh, roadblock. We don't have any information on the input or output size, which depends on the environment. Since we'll need access to that environment in many subroutines as well, let's just add it as an instance variable in our PPO __init__.

def __init__(self, env):
    # Extract environment information
    self.env = env
    self.obs_dim = env.observation_space.shape[0]
    self.act_dim = env.action_space.shape[0]

Eh, we’ll need our actor and critic networks later too, so let’s define them as instance variables in __init__ too.

# ALG STEP 1
# Initialize actor and critic networks
self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
self.critic = FeedForwardNN(self.obs_dim, 1)

And we’re done with step 1! Officially done with 1/8 of PPO. Here’s the code so far:

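In text form, ppo.py so far is roughly this (a sketch assembled from the snippets above):

from network import FeedForwardNN

class PPO:
    def __init__(self, env):
        # Extract environment information
        self.env = env
        self.obs_dim = env.observation_space.shape[0]
        self.act_dim = env.action_space.shape[0]

        # ALG STEP 1
        # Initialize actor and critic networks
        self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
        self.critic = FeedForwardNN(self.obs_dim, 1)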

Onto Step 2 now.

[Figure: Step 2 of the pseudocode.]

Easy. They want us to define a for loop to learn for some number of iterations. Now we could loop by iterations, but we also know that Stable Baselines PPO2 makes you specify how many timesteps to train in total when calling learn. Let’s follow that design. This way, instead of counting off to infinite iterations, we can specify how many timesteps to train before we stop.

def learn(self, total_timesteps):
    t_so_far = 0  # Timesteps simulated so far

    while t_so_far < total_timesteps:  # ALG STEP 2
        # Increment t_so_far somewhere below

Step 2, done. Here’s the code so far:


Step 3:

[Figure: Step 3 of the pseudocode.]

Our first mini-challenge. We need to collect data from a set of episodes by running our current actor policy. Sure, sounds like a rollout to me. We can call our data collected in each rollout a batch. Now what data do we need? Let’s take a little look ahead in our pseudocode.

[Figure: Pseudocode of PPO from OpenAI's Spinning Up docs, repeated for reference.]

Looks like we'll need observations per timestep, as I see sₜ in steps 6 and 7. We'll also need actions per timestep with aₜ in steps 6 and 7, action probabilities with π_θ (aₜ | sₜ) in step 6, and rewards-to-go with Rₜ in steps 4 and 7. Oh, and don't forget that in order to increment t_so_far in learn, we'll need to know how many timesteps are simulated per batch; let's return the lengths of each episode run in our batch (not summing yet, because the individual lengths can also be used to log average episodic length later; you could also just sum the episodic lengths before returning, it doesn't really matter).

We’ll also have to figure out how many timesteps to run per batch; sounds like a hyperparameter to me. We’ll first create a function _init_hyperparameters to define some default hyperparameters, and call the function from our __init__.

def __init__(self, env):
    ...
    self._init_hyperparameters()

def _init_hyperparameters(self):
    # Default values for hyperparameters, will need to change later.
    self.timesteps_per_batch = 4800        # timesteps per batch
    self.max_timesteps_per_episode = 1600  # timesteps per episode

Next, let’s create a rollout function to collect our data.

def rollout(self):
    # Batch data
    batch_obs = []        # batch observations
    batch_acts = []       # batch actions
    batch_log_probs = []  # log probs of each action
    batch_rews = []       # batch rewards
    batch_rtgs = []       # batch rewards-to-go
    batch_lens = []       # episodic lengths in batch

In our batch, we’ll be running episodes until we hit self.timesteps_per_batch timesteps; in the process, we shall collect observations, actions, log probabilities of those actions, rewards, rewards-to-go, and lengths of each episode. We’ll need these for our PPO algorithm later. The respective shapes of each list will be:

  • observations: (number of timesteps per batch, dimension of observation)
  • actions: (number of timesteps per batch, dimension of action)
  • log probabilities: (number of timesteps per batch)
  • rewards: (number of episodes, number of timesteps per episode)
  • rewards-to-go: (number of timesteps per batch)
  • batch lengths: (number of episodes)

As for why we keep track of log probabilities instead of raw action probabilities, here is a resource that explains it, and here is another. TL;DR: it makes gradient ascent easier behind the scenes. Let's write our generic gym rollout for one episode.

obs = self.env.reset()
done = False

for ep_t in range(self.max_timesteps_per_episode):
    action = self.env.action_space.sample()
    obs, rew, done, _ = self.env.step(action)

    if done:
        break

A few things we need to change. We're not sampling a random action, but querying our actor network. We need to collect observations, actions, log probs, episodic rewards, and episodic lengths. We need to stop once we hit self.timesteps_per_batch. Let's do that now, assuming we have some get_action function to help us query an action and its log prob.

# Number of timesteps run so far this batch
t = 0

while t < self.timesteps_per_batch:
    # Rewards this episode
    ep_rews = []

    obs = self.env.reset()
    done = False

    for ep_t in range(self.max_timesteps_per_episode):
        # Increment timesteps ran this batch so far
        t += 1

        # Collect observation
        batch_obs.append(obs)

        action, log_prob = self.get_action(obs)
        obs, rew, done, _ = self.env.step(action)

        # Collect reward, action, and log prob
        ep_rews.append(rew)
        batch_acts.append(action)
        batch_log_probs.append(log_prob)

        if done:
            break

    # Collect episodic length and rewards
    batch_lens.append(ep_t + 1)  # plus 1 because timestep starts at 0
    batch_rews.append(ep_rews)

Okay, so we need a get_action. Let’s go ahead and write that. During training, we’ll need a way to “explore” actions; we’ll use something called a “Multivariate Normal Distribution”. The idea is to have the actor network output a “mean” action on a forward pass, then create a covariance matrix with some standard deviation along the diagonals. Then, we can use this mean and stddev to generate a Multivariate Normal Distribution using PyTorch’s distributions, and then sample an action close to our mean. We’ll also extract the log probability of that action in the distribution. If you’re uncomfortable with Multivariate Normal Distributions, here’s a great lecture by Andrew Ng on it.

Note: actions will be deterministic when testing, meaning that the “mean” action will be our actual action during testing. However, during training we need an exploratory factor, which this distribution can help us with.

from torch.distributions import MultivariateNormal

def __init__(self, env):
    ...
    # Create our variable for the matrix.
    # Note that I chose 0.5 for stdev arbitrarily.
    self.cov_var = torch.full(size=(self.act_dim,), fill_value=0.5)

    # Create the covariance matrix
    self.cov_mat = torch.diag(self.cov_var)

def get_action(self, obs):
    # Query the actor network for a mean action.
    # Same thing as calling self.actor.forward(obs)
    mean = self.actor(obs)

    # Create our Multivariate Normal Distribution
    dist = MultivariateNormal(mean, self.cov_mat)

    # Sample an action from the distribution and get its log prob
    action = dist.sample()
    log_prob = dist.log_prob(action)

    # Return the sampled action and the log prob of that action
    # Note that I'm calling detach() since the action and log_prob
    # are tensors with computation graphs, so I want to get rid
    # of the graph and just convert the action to numpy array.
    # log prob as tensor is fine. Our computation graph will
    # start later down the line.
    return action.detach().numpy(), log_prob.detach()
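As a quick usage sketch (hypothetical, not from the original; it assumes the older Gym API where reset() returns just the observation, and a continuous-action environment such as Pendulum-v0):

import gym

env = gym.make("Pendulum-v0")
model = PPO(env)

obs = env.reset()
action, log_prob = model.get_action(obs)
print(action.shape)  # (act_dim,) numpy array
print(log_prob)      # scalar tensor, kept for the PPO update later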

Finally, back in our rollout function, we should convert our batch_obs, batch_acts, batch_log_probs, and batch_rtgs to tensors since we’ll need them in that form later to draw our computation graphs. Assume that we have a function compute_rtgs that will compute the rewards-to-go of the batch rewards. Funnily enough, finding the rewards-to-go is Step 4 in our algorithm:

[Figure: Step 4 of the pseudocode.]

# Reshape data as tensors in the shape specified before returning
batch_obs = torch.tensor(batch_obs, dtype=torch.float)
batch_acts = torch.tensor(batch_acts, dtype=torch.float)
batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float)

# ALG STEP 4
batch_rtgs = self.compute_rtgs(batch_rews)

# Return the batch data
return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens

Let's figure out now how to calculate rewards-to-go. Typically, when calculating the rewards-to-go for a set of rewards from a single episode, you start from the end, keep a variable that tracks the running sum of rewards, multiply that variable by a discount factor (gamma) at each timestep, add the immediate reward to it, and add the result to a rewards-to-go array. In case you're fuzzy on how to calculate the reward-to-go, or return, given some observation, here's the formula.

[Figure: Reward-to-go formula]
G(sₖ) = Σᵢ₌ₖᵀ γ^(i−k) R(sᵢ)

where G is the reward-to-go function, sₖ is our observation at timestep k, T is the number of timesteps per episode, γ is the discount factor, and R(sᵢ) is the reward given some observation sᵢ.

We’ll apply this exact same workflow, except on multiple episodes (to keep the order consistent, we’ll need to iterate the episodes backward too).

def compute_rtgs(self, batch_rews):
    # The rewards-to-go (rtg) per episode per batch to return.
    # The shape will be (num timesteps per batch)
    batch_rtgs = []

    # Iterate through each episode backwards to maintain same order
    # in batch_rtgs
    for ep_rews in reversed(batch_rews):
        discounted_reward = 0  # The discounted reward so far
        for rew in reversed(ep_rews):
            discounted_reward = rew + discounted_reward * self.gamma
            batch_rtgs.insert(0, discounted_reward)

    # Convert the rewards-to-go into a tensor
    batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float)

    return batch_rtgs

def _init_hyperparameters(self):
    ...
    self.gamma = 0.95
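As a quick sanity check on that backward accumulation (a made-up example, not from the original), running the same loop on a single episode of rewards [1, 2, 3] with gamma = 0.9 gives:

gamma = 0.9
ep_rews = [1, 2, 3]

rtgs = []
discounted_reward = 0
for rew in reversed(ep_rews):
    discounted_reward = rew + discounted_reward * gamma
    rtgs.insert(0, discounted_reward)

print(rtgs)  # [5.23, 4.7, 3] (approximately, due to floating point)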

Finally, let’s call our rollout function in learn.

def learn(self, total_timesteps):
    ...
    while t_so_far < total_timesteps:
        # ALG STEP 3
        batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens = self.rollout()

And there we go! We’re done with Steps 3 and 4, and halfway done with our PPO implementation. Here’s the code so far:

[Figures: __init__ and learn; rollout; get_action, compute_rtgs, and _init_hyperparameters.]
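In text form, here is a consolidated sketch of ppo.py so far, assembled from the snippets in this part. The t_so_far increment is the one we said we'd add somewhere in learn; everything else mirrors the code above:

import numpy as np
import torch
from torch.distributions import MultivariateNormal
from network import FeedForwardNN

class PPO:
    def __init__(self, env):
        # Extract environment information
        self.env = env
        self.obs_dim = env.observation_space.shape[0]
        self.act_dim = env.action_space.shape[0]

        self._init_hyperparameters()

        # ALG STEP 1: initialize actor and critic networks
        self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
        self.critic = FeedForwardNN(self.obs_dim, 1)

        # Covariance matrix used to sample actions in get_action
        self.cov_var = torch.full(size=(self.act_dim,), fill_value=0.5)
        self.cov_mat = torch.diag(self.cov_var)

    def learn(self, total_timesteps):
        t_so_far = 0  # Timesteps simulated so far
        while t_so_far < total_timesteps:  # ALG STEP 2
            # ALG STEP 3
            batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens = self.rollout()
            # Increment timesteps simulated so far (the increment promised above)
            t_so_far += np.sum(batch_lens)

    def rollout(self):
        # Batch data
        batch_obs, batch_acts, batch_log_probs = [], [], []
        batch_rews, batch_lens = [], []

        t = 0  # Timesteps run so far this batch
        while t < self.timesteps_per_batch:
            ep_rews = []  # Rewards this episode
            obs = self.env.reset()
            done = False
            for ep_t in range(self.max_timesteps_per_episode):
                t += 1
                # Collect observation, step the env, collect action/log prob/reward
                batch_obs.append(obs)
                action, log_prob = self.get_action(obs)
                obs, rew, done, _ = self.env.step(action)
                ep_rews.append(rew)
                batch_acts.append(action)
                batch_log_probs.append(log_prob)
                if done:
                    break
            # Collect episodic length and rewards
            batch_lens.append(ep_t + 1)
            batch_rews.append(ep_rews)

        # Reshape data as tensors before returning
        batch_obs = torch.tensor(batch_obs, dtype=torch.float)
        batch_acts = torch.tensor(batch_acts, dtype=torch.float)
        batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float)
        batch_rtgs = self.compute_rtgs(batch_rews)  # ALG STEP 4
        return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens

    def get_action(self, obs):
        # Query the actor for a mean action, then sample around that mean
        mean = self.actor(obs)
        dist = MultivariateNormal(mean, self.cov_mat)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.detach().numpy(), log_prob.detach()

    def compute_rtgs(self, batch_rews):
        # Rewards-to-go per timestep in the batch, episodes iterated backwards
        batch_rtgs = []
        for ep_rews in reversed(batch_rews):
            discounted_reward = 0
            for rew in reversed(ep_rews):
                discounted_reward = rew + discounted_reward * self.gamma
                batch_rtgs.insert(0, discounted_reward)
        return torch.tensor(batch_rtgs, dtype=torch.float)

    def _init_hyperparameters(self):
        # Default values for hyperparameters, will need tuning later
        self.timesteps_per_batch = 4800
        self.max_timesteps_per_episode = 1600
        self.gamma = 0.95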

Congratulations! We are already halfway through implementing a bare-bones PPO, and we've finished the majority of the code. In Part 3, we will finish up the PPO implementation.

If you have any questions up to this point, don’t hesitate to leave a comment or reach out to me at eyyu@ucsd.edu. Otherwise, see you in Part 3!

Translated from: https://medium.com/swlh/coding-ppo-from-scratch-with-pytorch-part-2-4-f9d8b8aa938a
