PyTorch错误收集与解决方法

PyTorch错误收集与解决方法

不定时更新,主要收集自己使用pytorch过程中遇到的错误。

RNN的初始状态在多GPU训练报错

init_state = torch.zeros(2,200,32) # 这个我是使用register_buffer注册到模型中了
self.rnn(x,init_state)
>>>但gpu这样做事没问题的,但是多gpu会出现问题
init_state  = torch.zeros(4,100,32) #相当于 2*2,200/2 ,32
self.rnn(x,init_state)

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

解决方法就是让,parameters(模块)或者buffers .to('cuda') 就可以了

RNN module weights are not part of single contiguous chunk of memory

每次要调用了一下flatten_parameters()

def forward(self,x):
    self.gru.flatten_parameters()
    output,hidden = self.gru(x)

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

 [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [57,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [123,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [123,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [123,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [123,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "train_multilabel.py", line 248, in <module>
    app.run(main)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train_multilabel.py", line 237, in main
    train_loader, writer, logging.info)
  File "train_multilabel.py", line 67, in train
    output, _, _ = model.forward(data)
  File "/home/ubuntu/guang/Personality_Analysis/deeplearning_base/model.py", line 287, in forward
    out = torch.cat([self.conv_and_pool(out,conv) for conv in self.convs], dim=1)
  File "/home/ubuntu/guang/Personality_Analysis/deeplearning_base/model.py", line 287, in <listcomp>
    out = torch.cat([self.conv_and_pool(out,conv) for conv in self.convs], dim=1)
  File "/home/ubuntu/guang/Personality_Analysis/deeplearning_base/model.py", line 277, in conv_and_pool
    x = torch.relu(conv(x).squeeze(dim=3)) # [batch,max_doc_len,dim]
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

问题出在embedding上的数量上,初始化是Embedding(400000,50),结果输入数据中居然包含有400001,就导致embedding出现错误。

  • 3
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
PyTorch Geometric是一个基于PyTorch的几何深度学习扩展库,它提供了许多用于处理图形、点云和其他几何数据的函数和类。如果你在安装PyTorch Geometric时遇到了问题,可以尝试以下几种解决方法: 1. 检查Python版本:PyTorch Geometric只支持Python 3.6、3.7和3.8版本。如果你的Python版本不在这个范围内,建议升级或降级到支持的版本。 2. 检查PyTorch版本:PyTorch Geometric要求PyTorch版本大于等于1.6。可以通过以下命令检查PyTorch版本:`import torch; print(torch.__version__)`。如果版本低于1.6,可以通过以下命令升级:`pip install torch -U`。 3. 安装依赖库:PyTorch Geometric有一些依赖库需要安装。可以尝试通过以下命令安装所有依赖库:`pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html`、`pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html`、`pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html`、`pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-${TORCH}+${CUDA}.html`、`pip install torch-geometric`。 4. 检查CUDA版本:如果你使用的是GPU版的PyTorchPyTorch Geometric,需要检查CUDA版本是否与PyTorch版本匹配。可以通过以下命令检查CUDA版本:`nvcc --version`。如果CUDA版本不匹配,可以尝试升级或降级到与PyTorch版本匹配的CUDA版本。 如果以上方法都不能解决问题,可以尝试在GitHub上提出问题,或者参考PyTorch Geometric的官方文档:https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值