Multi-GPU LSTM training: a pitfall diary for PyTorch multi-GPU data-parallel mode (LSTM, nn.DataParallel())

Offer.harvester

Last modified: 2022-05-12 14:10:41


First published: 2022-05-12 14:10:03

Article link: https://blog.csdn.net/qq_39072627/article/details/124729887


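This post records common errors encountered when training an LSTM model on multiple GPUs with PyTorch, including an AttributeError, device mismatches, and a batch size that changes between batches, and explains the cause and fix for each. It mainly covers the use of DataParallel, LSTM hidden-state initialization, and keeping the inputs and the state on the same device. It is aimed at beginners.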


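Keywords:

PyTorch multi-GPU data-parallel pitfalls

Multi-GPU LSTM training

Fixing the bugs encountered during multi-GPU training

Below are the errors I ran into while training an LSTM on multiple GPUs, with their fixes. (These are a beginner's explorations. Thanks to everyone who shared the knowledge that solved a problem that had me stuck for a whole evening. If anything here is wrong, please point it out.)

My model definition and training function are as follows: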

import torch
import torch.nn as nn

# `model` (an object exposing index_to_key, presumably the pretrained word vectors)
# and `device` (a torch.device) are globals defined elsewhere in the script.
class Classfication_Model(nn.Module):
    def __init__(self):
        super(Classfication_Model, self).__init__()
        self.hidden_size = 128
        self.embedding_dim = 200
        self.number_layer = 4
        self.bidirectional = True
        self.bi_number = 2 if self.bidirectional else 1
        self.dropout = 0.5
        self.embedding = nn.Embedding(num_embeddings=len(model.index_to_key)+200
                                       , embedding_dim=self.embedding_dim)

        self.lstm = nn.LSTM(input_size=self.embedding_dim
                            , hidden_size=self.hidden_size
                            , num_layers=self.number_layer
                            , dropout=self.dropout
                            , bidirectional=self.bidirectional)
        self.fc = nn.Sequential(
            nn.Linear(self.hidden_size*self.bi_number,20)
            , nn.ReLU()
            , nn.Linear(20,2)
        )

    def init_hidden_state(self, batch_size):
        # Batch-first layout so that DataParallel can split the state along dim 0.
        h_0 = torch.rand(batch_size, self.number_layer * self.bi_number,  self.hidden_size).to(device)
        c_0 = torch.rand(batch_size, self.number_layer * self.bi_number, self.hidden_size).to(device)
        return (h_0, c_0)

    def forward(self, input, hidden):
        input_embeded = self.embedding(input)
        input_embeded = input_embeded.permute(1, 0, 2) # reshape to [seq_len, batch_size, embedding_dim]
        # Back to the [layers * directions, batch_size, hidden_size] layout nn.LSTM expects.
        hidden = [x.permute(1,0,2).contiguous() for x in hidden]
        _, (h_n, c_n) = self.lstm(input_embeded, hidden)
        # Concatenate the last layer's two directions: [batch_size, 2 * hidden_size] = [batch_size, 256]
        out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1)
        out = self.fc(out)
        return out
 

def train(epoch):
    ds = corpus_dataset(train_model=True, max_sentence_length=50,train_set=train_set,test_set=test_set)
    train_dataloader = DataLoader(ds, batch, shuffle=True,num_workers=5)
    total_loss = 0
    classfication_model.train()
    # hidden = classfication_model.init_hidden_state(batch)  # fails under DataParallel
    # hidden = classfication_model.module.init_hidden_state(batch)  # batch size is hard-coded
    for idx, (input, target) in enumerate(train_dataloader):
        target = target.to(device)
        input = input.to(device)
        optimizer.zero_grad()
        hidden = classfication_model.module.init_hidden_state(len(input))  # batch size follows the actual batch
        output = classfication_model(input, hidden)
        loss = criterion(output, target)  # target must hold 0-based class indices, e.g. [0, 9] rather than [1, 10]
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch:{epoch}  ######  total_loss:{total_loss:.6f}")

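1. AttributeError: 'DataParallel' object has no attribute 'init_hidden_state'

Cause: after wrapping with nn.DataParallel(), my model is stored under the module attribute of the new wrapper model, so the wrapper itself does not expose my custom methods.

Fix: go through module when accessing attributes of the original model. For example, change hidden = classfication_model.init_hidden_state(batch) to hidden = classfication_model.module.init_hidden_state(batch), and the error goes away.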
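The wrapping behaviour described above can be reproduced in a few lines. This is a minimal sketch rather than the code from this post: TinyModel and unwrap are hypothetical names introduced for illustration, and only the init_hidden_state method name mirrors the original model.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the classifier in this post."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def init_hidden_state(self, batch_size):
        return torch.zeros(batch_size, 4)

    def forward(self, x):
        return self.fc(x)


def unwrap(model):
    """Return the underlying module whether or not it is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


wrapped = nn.DataParallel(TinyModel())

# The wrapper does not forward custom methods of the wrapped model ...
print(hasattr(wrapped, "init_hidden_state"))        # False
# ... but the original model is still reachable through `.module`.
print(unwrap(wrapped).init_hidden_state(3).shape)   # torch.Size([3, 4])
```

A helper like unwrap lets the same training script run with and without nn.DataParallel, since it falls back to the model itself when there is no wrapper.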

2. input and hidden tensors are not at the same device, found input tensor at GPU and hidden at cpu

Cause: part of the data is on the GPU and part is on the CPU. In my case the LSTM's initial hidden states were on the CPU. Creating h_0 and c_0 and moving them to the GPU fixes it.

Fix:

def init_hidden_state(self, batch_size):
    h_0 = torch.rand(batch_size, self.number_layer * self.bi_number,  self.hidden_size).to(device)
    c_0 = torch.rand(batch_size, self.number_layer * self.bi_number, self.hidden_size).to(device)
    return (h_0, c_0)

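3. input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0

This bug reads almost like the second one, but it is much more troublesome, and it is the one that kept me stuck the longest.

First, the context: the problem appeared when I took an LSTM model that already ran fine on a single GPU and tried to train it on multiple GPUs.

Analysis: taken literally, the message says that DataParallel split the input data across different GPUs, while the hidden states were not split across them, so in the end the input and hidden tensors are not on the same device.

Cause: when DataParallel is used, PyTorch splits the data of one batch that is passed as arguments to forward across the GPUs. It does not split class attributes of the form self.xx.

My forward function originally looked like the one below. h_0 and c_0 were initialized by calling the initialization function directly inside forward, and were not passed in as arguments of forward. Because init_hidden_state moves them to the global device (presumably cuda:0 here), every replica, including the one running on cuda:1, received hidden states on cuda:0. The hidden states were therefore never distributed across the GPUs, which caused the error.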

def forward(self, input):
    input_embeded = self.embedding(input)

    input_embeded = input_embeded.permute(1, 0, 2)
    h_0, c_0 = self.init_hidden_state(input_embeded.shape[1])
    _, (h_n, c_n) = self.lstm(input_embeded, (h_0, c_0))
    out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1)
    out = self.fc(out)
    return out

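Fix: everything that must be split across GPUs should go through forward, as arguments and return values. There are a few more things to watch with RNNs: the hidden state must be initialized at the start of every epoch or every batch, and must not be created once alongside the optimizer.

I now pass (h_0, c_0) as hidden through a forward argument, so DataParallel splits h_0 and c_0 across the GPUs. Once this bug is fixed, though, the next one is likely to show up right away.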

def forward(self, input, hidden):
    input_embeded = self.embedding(input)
    input_embeded = input_embeded.permute(1, 0, 2) # reshape to [seq_len, batch_size, embedding_dim]
    hidden = [x.permute(1,0,2).contiguous() for x in hidden]
    _, (h_n, c_n) = self.lstm(input_embeded, hidden)
    out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1) # [batch_size, 256]
    out = self.fc(out)
    return out
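For comparison, here is a sketch of a different way to avoid the cuda:1 / cuda:0 mismatch. It is not the fix used in this post: the hidden state stays inside forward, but it is allocated on the device of the input chunk that the replica actually received, instead of on a module-level device. DeviceAwareLSTM and its small sizes are hypothetical, and the example runs on CPU only to check shapes.

```python
import torch
import torch.nn as nn


class DeviceAwareLSTM(nn.Module):
    """Hypothetical reduced model: the hidden state is built inside forward."""

    def __init__(self, input_size=8, hidden_size=16, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, bidirectional=True)

    def forward(self, x):
        # x: [batch_size, seq_len, input_size]; under DataParallel each replica
        # only sees its own chunk of the batch.
        x = x.permute(1, 0, 2)  # -> [seq_len, batch_size, input_size]
        shape = (self.num_layers * 2, x.shape[1], self.hidden_size)
        # Allocate on the device (and dtype) of this replica's input chunk,
        # not on a global `device`.
        h_0 = torch.rand(shape, device=x.device, dtype=x.dtype)
        c_0 = torch.rand(shape, device=x.device, dtype=x.dtype)
        _, (h_n, _) = self.lstm(x, (h_0, c_0))
        # Last layer, both directions: [batch_size, 2 * hidden_size]
        return torch.cat((h_n[-2], h_n[-1]), dim=-1)


net = DeviceAwareLSTM()
out = net(torch.randn(5, 7, 8))
print(out.shape)  # torch.Size([5, 32])
```

With this layout the batch size is also taken from the chunk itself, so the state always matches the input. If zero initial states are acceptable, the hidden argument can be omitted entirely, since nn.LSTM defaults to zeros when (h_0, c_0) is not provided.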

4. RuntimeError: Expected hidden[0] size (x, x, x), got (x, x, x)

This error can have two causes.

Cause 1: the layout of the LSTM input and of the initial states h_0, c_0

The parameters of the nn.LSTM layer are documented as follows:

Args:
        input_size: The number of expected features in the input `x`
        hidden_size: The number of features in the hidden state `h`
        num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
            would mean stacking two LSTMs together to form a `stacked LSTM`,
            with the second LSTM taking in outputs of the first LSTM and
            computing the final results. Default: 1
        bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as `(batch, seq, feature)` instead of `(seq, batch, feature)`.
            Note that this does not apply to hidden or cell states. See the
            Inputs/Outputs sections below for details.  Default: ``False``
        dropout: If non-zero, introduces a `Dropout` layer on the outputs of each
            LSTM layer except the last layer, with dropout probability equal to
            :attr:`dropout`. Default: 0
        bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``
        proj_size: If ``> 0``, will use LSTM with projections of corresponding size. Default: 0

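Pay particular attention to the batch_first setting here.

Input data: if batch_first is set to True, the input shape is (batch_size, sequence_length, embedding_size). The default is False, in which case the input shape is (sequence_length, batch_size, embedding_size).

Initial states h_0, c_0: whether batch_first is False or True, the shape of h_0 and c_0 always follows the batch_first=False layout, that is

(number_layers * num_directions, batch_size, hidden_size)

Cause of the problem: once the model is wrapped in nn.DataParallel, each call to model.forward() splits its arguments into chunks of the batch and sends each chunk to a different GPU for parallel computation. The split is along the first dimension (dim=0) by default, but another dimension can be chosen (for example, if you keep all tensors batch-second, you can set dim=1). My understanding was that this requires every input tensor to be a CUDA tensor, with CPU inputs copied unchanged to each replica instead of being split. This may depend on the PyTorch version, since recent versions appear to split CPU tensors as well.

If the first dimension of the input data, or of the hidden input (h_0, c_0), is not batch_size, you run into this error. My init_hidden_state originally put batch_size in the second dimension of h_0 and c_0, which is what triggered it.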

The original init_hidden_state function was:

def init_hidden_state(self, batch_size):
    h_0 = torch.rand(self.number_layer * self.bi_number, batch_size, self.hidden_size).to(device)
    c_0 = torch.rand(self.number_layer * self.bi_number, batch_size, self.hidden_size).to(device)
    return h_0, c_0
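To see why a batch-second hidden state breaks under a dim=0 split, the shapes each replica would receive can be worked out on the CPU. This is only an illustration: torch.chunk stands in for DataParallel's scatter step, and the two GPUs, the batch size of 16 and the helper per_replica_shapes are assumptions made for the example (the layer count and hidden size match the model in this post).

```python
import torch

num_gpus = 2                           # assumed number of replicas
batch_size, seq_len = 16, 50           # assumed batch size and sentence length
layers_x_dirs, hidden_size = 8, 128    # 4 layers * 2 directions, as in the model above

tokens = torch.zeros(batch_size, seq_len, dtype=torch.long)
h_batch_second = torch.zeros(layers_x_dirs, batch_size, hidden_size)
h_batch_first = torch.zeros(batch_size, layers_x_dirs, hidden_size)


def per_replica_shapes(t, n=num_gpus, dim=0):
    """Shapes of the chunks obtained by splitting `t` into `n` pieces along `dim`."""
    return [tuple(c.shape) for c in torch.chunk(t, n, dim=dim)]


print(per_replica_shapes(tokens))          # [(8, 50), (8, 50)]
# Batch-second state: the split cuts the layer dimension, not the batch.
print(per_replica_shapes(h_batch_second))  # [(4, 16, 128), (4, 16, 128)]
# Batch-first state: each replica gets the state for its own 8 samples.
print(per_replica_shapes(h_batch_first))   # [(8, 8, 128), (8, 8, 128)]
```

In the batch-second case each replica holds 8 samples but a state of shape (4, 16, 128), while the LSTM expects (8, 8, 128). That is exactly the kind of mismatch the error message reports.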

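Fix: the hidden state (h_0, c_0) has to be batch-first when it is passed to forward. Inside forward I then swap the first and second dimensions to get the batch-second layout that h_0 and c_0 require. Likewise, the first dimension of the input data must be batch_size (unless you have chosen a custom split dimension, in which case the rule changes accordingly).

The modified code:

The line added to forward is: hidden = [x.permute(1,0,2).contiguous() for x in hidden]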

def init_hidden_state(self, batch_size):
    h_0 = torch.rand(batch_size, self.number_layer * self.bi_number,  self.hidden_size).to(device)
    c_0 = torch.rand(batch_size, self.number_layer * self.bi_number, self.hidden_size).to(device)
    return (h_0, c_0)

def forward(self, input, hidden):
    input_embeded = self.embedding(input)
    input_embeded = input_embeded.permute(1, 0, 2) # reshape to [seq_len, batch_size, embedding_dim]
    hidden = [x.permute(1,0,2).contiguous() for x in hidden]
    _, (h_n, c_n) = self.lstm(input_embeded, hidden)
    out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1) # [batch_size, 256]
    out = self.fc(out)
    return out
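The permute step can be sanity-checked on the CPU with a small LSTM before moving to several GPUs. The sketch below uses made-up sizes and a hypothetical helper, run_with_batch_first_hidden. It also shows that passing the batch-first state straight to nn.LSTM raises the RuntimeError discussed in this section.

```python
import torch
import torch.nn as nn

# Small bidirectional LSTM with assumed sizes, batch_first left at its default (False).
lstm = nn.LSTM(input_size=6, hidden_size=5, num_layers=2, bidirectional=True)


def run_with_batch_first_hidden(lstm, x, hidden):
    """x: [batch, seq, feat]; hidden: (h_0, c_0), each [batch, layers * dirs, hidden]."""
    x = x.permute(1, 0, 2)  # -> [seq, batch, feat]
    # Swap back to the [layers * dirs, batch, hidden] layout nn.LSTM expects.
    hidden = tuple(t.permute(1, 0, 2).contiguous() for t in hidden)
    _, (h_n, _) = lstm(x, hidden)
    return torch.cat((h_n[-2], h_n[-1]), dim=-1)


batch = 3
x = torch.randn(batch, 7, 6)
hidden = (torch.rand(batch, 4, 5), torch.rand(batch, 4, 5))

out = run_with_batch_first_hidden(lstm, x, hidden)
print(out.shape)  # torch.Size([3, 10])

# Without the permute, the state layout does not match and nn.LSTM refuses it.
try:
    lstm(x.permute(1, 0, 2), hidden)
    mismatch_detected = False
except RuntimeError as err:
    mismatch_detected = True
    print(err)
```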

Cause 2: the batch_size of h_0, c_0 must follow the batch size of the input

Cause of the problem: if the DataLoader is built with drop_last=False (the default), the final len(datasets) % batch_size samples are not discarded. If the batch_size of h_0 and c_0 is not changed to len(datasets) % batch_size for that last batch, the same RuntimeError: Expected hidden[0] size (x, x, x), got (x, x, x) is raised.

Fix: either build the DataLoader with drop_last=True, or set the batch_size of h_0 and c_0 dynamically during training, as follows:

for idx, (input, target) in enumerate(train_dataloader):
    hidden = classfication_model.module.init_hidden_state(len(input))  # batch size follows the actual batch
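The effect of drop_last on the final batch is easy to check on a toy dataset. A minimal sketch, assuming 10 samples and a batch size of 4 (the dataset and the batch_sizes helper are illustrative, not from the original code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 dummy samples with dummy labels; 10 % 4 leaves a final batch of 2.
dataset = TensorDataset(torch.arange(10).unsqueeze(1), torch.zeros(10, dtype=torch.long))


def batch_sizes(drop_last):
    """Sizes of the batches the DataLoader yields over one pass."""
    loader = DataLoader(dataset, batch_size=4, shuffle=False, drop_last=drop_last)
    return [len(inputs) for inputs, _ in loader]


print(batch_sizes(drop_last=False))  # [4, 4, 2]
print(batch_sizes(drop_last=True))   # [4, 4]
```

Sizing the hidden state with len(input) keeps the last two samples in training, while drop_last=True discards them. With a batch-first hidden state, the input and the state are split along the same dimension with the same length, so their per-GPU chunks stay aligned even for the short batch.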

References:

DataParallel LSTM/GRU wrong hidden batch size (8 GPUs) - PyTorch Forums

RuntimeError: Input and hidden tensors are not at the same device, found

PyTorch multi-GPU data-parallel mode pitfall guide (pytorch多GPU数据并行模式踩坑指南), Edward Tivrusky IV's blog, CSDN [strongly recommended reading]

PyTorch multi-GPU practice: solving RuntimeError: Expected hidden[0] size (1, 2500, 50), got (1, 10000, 50), pyxiea's blog, CSDN

RuntimeError: Expected hidden[0] size (x, x, x), got(x, x, x), 带鱼工作室's blog, CSDN