RuntimeError: cuda runtime error (59) : device-side assert triggered

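While training a Transformer model with PyTorch, I hit RuntimeError: CUDA runtime error 59, which was caused by an out-of-range index. The error was raised from a source file in the THC library, and the investigation traced it to incorrect vocabulary indices. Switching the device to CPU exposed the actual error: an attempt to access index 103 in a table with only 99 rows. Fixing the vocabulary indices resolved the error.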

Problem

While training a Transformer, PyTorch raised the following error: RuntimeError: cuda runtime error (59) : device-side assert triggered at C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src\THC/THCReduceAll.cuh:327

The full error output is as follows:

C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src/THC/THCTensorIndex.cu:361: block: [80,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src\THC/THCReduceAll.cuh line=327 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "C:\Users\AppData\Local\conda\conda\envs\yuanbo_pytorch\lib\site-packages\torch\nn\functional.py", line 3105, in multi_head_attention_forward
    qkv_same = torch.equal(query, key) and torch.equal(key, value)
RuntimeError: cuda runtime error (59) : device-side assert triggered at C:/w/1/s/tmp_conda_3.6_155139/conda/conda-bld/pytorch_1565366019852/work/aten/src\THC/THCReduceAll.cuh:327

Solution

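I debugged for a long time without finding the cause. It turned out that on the GPU the traceback does not point at the operation that actually failed: the failed assertions come from the indexing kernel in THCTensorIndex.cu, yet the Python traceback blames torch.equal in multi_head_attention_forward. This is consistent with CUDA kernels running asynchronously, so a device-side assert is typically reported at a later call. After switching the device to CPU, the real error showed up: RuntimeError: index out of range: Tried to access index 103 out of table with 99 rows. at C:\w\1\s\tmp_conda_3.6_155139\conda\conda-bld\pytorch_1565366019852\work\aten\src\TH/generic/THTensorEvenMoreMath.cpp:237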
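The CPU check described above can be sketched roughly as follows. This is not the original training code: the embedding size, the batch contents, and the helper name lookup_on_cpu are assumptions chosen to mirror the numbers in the error message. The exception type and wording vary between PyTorch versions, so the sketch catches both IndexError and RuntimeError.

```python
import torch
import torch.nn as nn

# Hypothetical sizes mirroring the error message: a table with 99 rows
# and a batch that contains index 103.
embedding = nn.Embedding(num_embeddings=99, embedding_dim=8)
token_ids = torch.tensor([[1, 5, 103]])


def lookup_on_cpu(module, ids):
    """Run the lookup on the CPU and return the error message, or None if it succeeds."""
    try:
        module.cpu()(ids.cpu())
    except (IndexError, RuntimeError) as err:
        # Exception type and message text differ between PyTorch versions.
        return str(err)
    return None


message = lookup_on_cpu(embedding, token_ids)
print(message)
```

On the CPU the lookup fails synchronously, so the exception is raised at the embedding call itself rather than at some later operation.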

So the cause was a bad index. On inspection, when the Transformer decoder computed its position embedding, incorrect indices in the vocabulary led to "RuntimeError: cuda runtime error (59) : device-side assert triggered". Rebuilding the vocabulary fixed it.
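One way to catch this kind of mismatch before it reaches the GPU is to validate the encoded ids against the number of rows in the embedding table. The sketch below is only an illustration: find_out_of_range and the sample sequence are hypothetical, and 103 and 99 are simply the values from the CPU error message.

```python
def find_out_of_range(token_ids, num_rows):
    """Return (position, index) pairs that a table with num_rows rows cannot serve."""
    return [
        (pos, idx)
        for pos, idx in enumerate(token_ids)
        if not 0 <= idx < num_rows
    ]


# Hypothetical encoded sequence; valid indices for 99 rows are 0 through 98.
bad = find_out_of_range([2, 17, 103, 98], num_rows=99)
print(bad)  # [(2, 103)]
```

Running a check like this over the dataset once, right after the vocabulary is built, shows which entries were assigned indices beyond the table size.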
