A bug encountered when training on multiple GPUs with PyTorch's torch.nn.DataParallel
Problem description:
In addition to its original parameters, my network model defines an extra set of custom parameters that must also be trained. When I wrap this model in torch.nn.DataParallel for multi-GPU training, I hit a bug; a sketch of the setup is shown below, followed by the traceback.
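As a rough illustration only (the names SearchModel, alphas, features, etc. are hypothetical and not taken from my actual code), the setup looks roughly like this: a module registers an extra trainable parameter tensor alongside its ordinary layers, uses it in forward, and is then wrapped in DataParallel.

```python
import torch
import torch.nn as nn

class SearchModel(nn.Module):
    """Toy stand-in for the real model: ordinary layers plus an extra
    set of custom trainable parameters (here, `alphas`)."""

    def __init__(self, num_classes=10, num_ops=4):
        super().__init__()
        # "Original" parameters: ordinary layers.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)
        # Custom parameters that must also be trained,
        # e.g. architecture weights in a search-style model.
        self.alphas = nn.Parameter(1e-3 * torch.randn(num_ops))

    def forward(self, x):
        x = self.features(x).flatten(1)
        logits = self.classifier(x)
        # The custom parameters are also used in the forward pass.
        weights = torch.softmax(self.alphas, dim=-1)
        return logits, weights

model = nn.DataParallel(SearchModel().cuda())   # multi-GPU wrapper
logits, weights = model(torch.randn(8, 3, 32, 32).cuda())
```

Calling the wrapped model's forward, the equivalent of self.model(input) in my training loop, then produces the following traceback: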
Traceback (most recent call last):
  File "train_search.py", line 741, in <module>
    architecture.main()
  File "train_search.py", line 358, in main
    train_acc,loss, error_loss, resource_loss, trainable_filter_number,model_performance = self.train(self.all_epochs, logging)
  File "train_search.py", line 527, in train
    logits, model_property, _ = self.model(input)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_