Multi-GPU PyTorch in practice: fixing RuntimeError: Expected hidden[0] size (1, 2500, 50), got (1, 10000, 50)

pyxiea

Article link: https://blog.csdn.net/xpy870663266/article/details/100002592

Before we begin

This post is for you if:

You ran into the error in the title while using PyTorch in multi-GPU mode.
Your custom network architecture uses the hidden state of an RNN.

Main text

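I have recently been adapting a PyTorch project to run on multiple GPUs. The project is a PyTorch implementation, hosted on GitHub, of the recurrent relational network (RRN) (GitHub link). The RRN architecture contains an RNN, and PyTorch's multi-GPU support has a few pitfalls once an RNN is involved.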

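There are already plenty of blog posts about single-machine multi-GPU training in PyTorch, but many of them copy from one another and skip over the details and pitfalls, so they are of limited use. Fortunately I came across one good post, a pitfall guide to PyTorch's multi-GPU data-parallel mode (quoted and linked below). Its pitfall no. 4 and the code at the end of that post showed me how the original code had to be changed.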

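Back to the title. The error occurs because the hidden state of the RNN is not handled properly, both when it is initialized and at every subsequent time step. (The RNN in this project is an LSTM; to avoid confusing RNN with RRN, I will say LSTM from here on.) In this particular project, the RRN uses an MLP as its message-passing network, and that MLP takes the LSTM's hidden state as input, so implementing the RRN inevitably means working with the LSTM's hidden state directly. That leads to the following pitfall: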

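For the input tensor you can put the batch size in either the first or the second dimension. In principle this is a matter of personal habit, as long as you stay consistent. However, PyTorch's built-in LSTM layer only accepts hidden-state tensors in batch-second layout. You can set batch_first=True on the LSTM layer or on the padding utilities, but that only affects the input tensor; the hidden state must still be batch-second.
----------------
Quoted from the original article by CSDN blogger "Edward Tivrusky IV", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original and this notice.
Original link: https://blog.csdn.net/yuuyuhaksho/article/details/87560640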
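
The quoted behaviour is easy to check directly. The snippet below is a minimal sketch rather than code from the project: the tensor sizes are arbitrary values chosen for illustration. It builds an LSTM with batch_first=True and prints the shapes it returns.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes (not taken from the RRN project).
batch, seq_len, dim_in, dim_hidden, num_layers = 4, 7, 3, 5, 2

lstm = nn.LSTM(dim_in, dim_hidden, num_layers=num_layers, batch_first=True)

x = torch.randn(batch, seq_len, dim_in)   # batch-first input
out, (h_n, c_n) = lstm(x)

print(out.shape)   # torch.Size([4, 7, 5]): the output follows batch_first
print(h_n.shape)   # torch.Size([2, 4, 5]): the hidden state stays (num_layers, batch, hidden)
print(c_n.shape)   # torch.Size([2, 4, 5]): same layout for the cell state
```

The output tensor follows batch_first, while the hidden and cell states keep the batch size in the second dimension.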

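The author of the GitHub project was aware of this pitfall when implementing the RRN. He therefore put the batch size in the second dimension when initializing the LSTM's hidden state (see reset_g() in the appendix at the end of this post), and the forward function likewise treats the incoming h as having the batch size in the second dimension. Note that the LSTM's hidden state is not the hidden argument of forward; hidden is the hidden state of the RRN nodes.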

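In multi-GPU mode, PyTorch splits the hidden state along its first dimension (data-parallel wrappers such as nn.DataParallel scatter tensor arguments along dim 0 by default). If the batch size sits in the second dimension at initialization and in later processing, the split cuts the wrong axis: the input is divided into per-GPU batches of 2500 while the hidden state still carries all 10000 samples, which is exactly the size mismatch in the title. I therefore changed reset_g() to the following code, which puts batch_size in the first dimension at initialization: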

    def reset_g(self, b):
        # hidden is composed by hidden and cell state vectors
        # self.batch_size = b  (unused, so it can simply be commented out)
        h = (
            torch.zeros(b, self.g_layers, self.dim_hidden, device=self.device, requires_grad=True),
            torch.zeros(b, self.g_layers, self.dim_hidden, device=self.device, requires_grad=True)
        )
        return h
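
To see why the layout matters, the split can be imitated on the CPU. The sketch below is an illustration under stated assumptions, not the project's code: torch.chunk stands in for the dim-0 scatter that a data-parallel wrapper performs, n_replicas plays the role of the number of GPUs, and all sizes are small made-up values.

```python
import torch
import torch.nn as nn

# Made-up sizes; n_replicas stands in for the number of GPUs.
n_replicas, batch, n_nodes, dim = 4, 8, 3, 5
lstm = nn.LSTM(dim, dim, num_layers=1, batch_first=True)

x = torch.randn(batch, n_nodes, dim)
h_batch_second = torch.zeros(1, batch, dim)   # layout of the original reset_g()
h_batch_first = torch.zeros(batch, 1, dim)    # layout of the modified reset_g()

# torch.chunk is used here as a stand-in for the dim-0 scatter step.
x_chunks = x.chunk(n_replicas, dim=0)
bad_chunks = h_batch_second.chunk(n_replicas, dim=0)
good_chunks = h_batch_first.chunk(n_replicas, dim=0)

print(len(x_chunks), x_chunks[0].shape)        # input: one slice per replica
print(len(bad_chunks), bad_chunks[0].shape)    # batch-second state: a single chunk, full batch
print(len(good_chunks), good_chunks[0].shape)  # batch-first state: one slice per replica

# Feeding a sliced input together with the unsliced state triggers the error.
error_message = None
try:
    lstm(x_chunks[0], (bad_chunks[0], bad_chunks[0].clone()))
except RuntimeError as err:
    error_message = str(err)
print(error_message)
```

With the batch-second layout, dim 0 has size 1 and cannot be divided, so the state keeps the full batch while the input is sliced; with the batch-first layout, input and state are sliced consistently.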

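Moving the batch size to the first dimension brings back the pitfall described earlier: the LSTM itself only accepts a hidden state with the batch size in the second dimension. This is where the code from the blog post mentioned above comes in: tensor.permute() reorders the dimensions of a tensor. Concretely, I changed lines 73 to 77 of the appendix (from the line input_g = self.g_mlp(input_g_mlp) through the line hidden = out.clone()) to the following: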

        input_g = self.g_mlp(input_g_mlp)
        # Before the LSTM step: swap to (num_layers, batch, hidden)
        hidden0 = [x.permute(1, 0, 2).contiguous() for x in h]

        # out, h = self.g(input_g, h)
        out, hidden0 = self.g(input_g, hidden0)
        # After the LSTM step: swap back to (batch, num_layers, hidden)
        h = [x.permute(1, 0, 2).contiguous() for x in hidden0]

        hidden = out.clone()

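To sum up: in multi-GPU mode, if batch_first=True is set, the LSTM's hidden state should also be initialized with the batch size in the first dimension. If a custom architecture (such as the RRN) works with the LSTM's hidden state directly, and therefore has to feed an LSTM that only accepts the batch size in the second dimension, add code that swaps the first and second dimensions of the hidden state immediately before and immediately after the LSTM time step.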
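
The pattern can also be packaged as a small module and checked without any GPU. The following is a sketch under assumptions: BatchFirstStateLSTM is a hypothetical wrapper name that does not appear in the RRN project, the sizes are arbitrary, and torch.chunk again imitates the per-GPU split along dim 0.

```python
import torch
import torch.nn as nn


class BatchFirstStateLSTM(nn.Module):
    """Hypothetical wrapper: takes and returns (h, c) as (batch, num_layers, hidden)."""

    def __init__(self, dim_in, dim_hidden, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(dim_in, dim_hidden, num_layers=num_layers, batch_first=True)

    def forward(self, x, state):
        # Swap to the (num_layers, batch, hidden) layout that nn.LSTM requires.
        state = tuple(s.permute(1, 0, 2).contiguous() for s in state)
        out, state = self.lstm(x, state)
        # Swap back so the caller (and any dim-0 split) sees the batch first.
        state = tuple(s.permute(1, 0, 2).contiguous() for s in state)
        return out, state


torch.manual_seed(0)
batch, steps, dim_in, dim_hidden, n_replicas = 8, 3, 4, 5, 4
model = BatchFirstStateLSTM(dim_in, dim_hidden)
x = torch.randn(batch, steps, dim_in)
state = (torch.randn(batch, 1, dim_hidden), torch.randn(batch, 1, dim_hidden))

# Reference: the whole batch in one call.
full_out, full_state = model(x, state)

# Imitate the per-replica split along dim 0, then concatenate the results.
parts = [
    model(xc, (hc, cc))
    for xc, hc, cc in zip(x.chunk(n_replicas), state[0].chunk(n_replicas), state[1].chunk(n_replicas))
]
split_out = torch.cat([p[0] for p in parts], dim=0)
split_h = torch.cat([p[1][0] for p in parts], dim=0)

print(torch.allclose(full_out, split_out, atol=1e-5))
print(full_state[0].shape, split_h.shape)
```

Because every tensor crossing the module boundary has the batch size in dim 0, slicing the arguments per replica and concatenating the results gives the same shapes as a single full-batch call.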

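Aside: an efficiency check, multi-GPU vs. single GPU

In my experiments, multi-GPU training did not improve efficiency; it was slower than a single GPU. The table below records the training time of this project under each setup (CPU / single GPU / multi-GPU) for different numbers of epochs.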

epochs	100	1000	10000
CPU	18s	165s	n/a
Single GPU	7s	60s	699s
Multi-GPU	17s	156s	1515s

Judging from this experiment alone, multi-GPU training actually increased the training time.

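In my view, using multiple GPUs has:

A benefit: the forward and backward passes take less time, because each GPU processes its own slice of the batch in parallel.

A cost: copying data between the GPUs and the CPU adds to the training time. With nn.DataParallel, inputs are scattered, the module is replicated, and outputs are gathered on every forward pass.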

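So whether multiple GPUs are worth using depends on the specific network:

If training is compute-bound rather than IO-bound, that is, the forward and backward passes take a long time while each batch holds relatively little data, multiple GPUs should in theory speed up training.
Conversely, if training is IO-bound rather than compute-bound, that is, each batch holds a lot of data (so communication between GPU and CPU is expensive) while the forward and backward passes are quick, a single GPU should in theory train faster.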
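
One way to check which regime a given model is in is to time it directly. The sketch below is only an assumption-laden illustration: time_training_steps is a hypothetical helper, and the small MLP, batch shape, and step count are placeholders, not the RRN setup behind the table above. It falls back to the CPU when no GPU is present and only times nn.DataParallel when more than one GPU is visible.

```python
import time

import torch
import torch.nn as nn


def time_training_steps(model, batch, target, n_steps=20):
    """Hypothetical helper: average wall-clock seconds per optimisation step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    use_cuda = batch.is_cuda
    if use_cuda:
        torch.cuda.synchronize()   # make sure earlier GPU work has finished
    start = time.perf_counter()
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()
        optimizer.step()
    if use_cuda:
        torch.cuda.synchronize()   # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_steps


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Placeholder model and data, chosen only to keep the example small.
net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1)).to(device)
x = torch.randn(256, 64, device=device)
y = torch.randn(256, 1, device=device)

single = time_training_steps(net, x, y)
print(f"single device: {single:.6f} s/step")

if torch.cuda.device_count() > 1:
    parallel = time_training_steps(nn.DataParallel(net), x, y)
    print(f"DataParallel:  {parallel:.6f} s/step")
```

For a model this small, the per-step transfer and replication overhead will likely dominate, which matches the pattern in the table; the comparison only becomes meaningful with the real model and batch size.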

Appendix: RRN core code

import torch
import torch.nn as nn
from torch.nn import LSTM
from src.models.MLP import MLP

class RRN(nn.Module):

    def __init__(self, dim_hidden, message_dim, output_dim, f_dims, o_dims, device,  g_layers=1, edge_attribute_dim=0, single_output=False):
        '''
        :param n_units: number of nodes in the graph
        :param edge_attribute_dim: 0 if edges have no attributes, else an integer. Default 0.
        :param single_output: True if RRN emits only one output at a time, False if it emits as many outputs as units. Default False.
        '''

        super(RRN, self).__init__()

        self.dim_hidden = dim_hidden
        self.dim_input = dim_hidden
        self.message_dim = message_dim
        self.output_dim = output_dim

        self.device = device

        self.f_dims = f_dims
        self.o_dims = o_dims
        self.g_layers = g_layers

        self.edge_attribute_dim = edge_attribute_dim
        self.single_output = single_output

        input_f_dim = 2 * self.dim_hidden + self.edge_attribute_dim
        self.f = MLP(input_f_dim, self.f_dims, self.message_dim)

        input_gmlp_dim = self.dim_input + self.message_dim
        output_gmlp_dim = 128
        self.g_mlp = MLP(input_gmlp_dim, self.f_dims, output_gmlp_dim)
        self.g = LSTM(output_gmlp_dim, self.dim_hidden, num_layers=self.g_layers, batch_first=True)

        input_o_dim = self.dim_hidden
        self.o = MLP(input_o_dim, self.o_dims, self.output_dim, dropout=True)

    def forward(self, x, hidden, h, edge_attribute=None):
        '''
        This can be called repeatedly after hidden states are set.
        :param x: inputs to the RRN nodes
        :param hidden: hidden states of RRN nodes (B, N_facts, H)
        :param h: hidden and cell states of g
        :param edge_attribute: (B, Q_dim) tensor containing edge attribute or None if edges have no attributes. Default None.
        '''

        n_facts = hidden.size(1)

        hi = hidden.repeat(1, n_facts, 1)
        hj = hidden.unsqueeze(2)
        hj = hj.repeat(1,1,n_facts,1).view(hidden.size(0),-1,hidden.size(2))
        if edge_attribute is not None:
            ea = edge_attribute.unsqueeze(1)
            ea = ea.repeat(1,hi.size(1),1)
            input_f = torch.cat((hj,hi,ea), dim=2)
        else:
            input_f = torch.cat((hi,hj), dim=2)


        messages = self.f(input_f)

        messages = messages.view(hidden.size(0),hidden.size(1),hidden.size(1), self.message_dim)

        # sum_messages[i] contains the sum of the messages incoming to node i
        sum_messages = torch.sum(messages, dim=2) # B, N_facts, Message_dim

        input_g_mlp = torch.cat((x, sum_messages), dim=2)

        input_g = self.g_mlp(input_g_mlp)

        out, h = self.g(input_g, h)

        hidden = out.clone()

        if self.single_output:
            sum_hidden = torch.sum(hidden, dim=1)
            out = self.o(sum_hidden)
        else:
            out = self.o(hidden)

        return out, hidden, h

    def reset_g(self, b):
        # hidden is composed by hidden and cell state vectors
        self.batch_size = b
        h = (
            torch.zeros(self.g_layers, b, self.dim_hidden, device=self.device, requires_grad=True),
            torch.zeros(self.g_layers, b, self.dim_hidden, device=self.device, requires_grad=True)
            )
        return h