Observations on the following Assertion failed error: ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: ...

When writing PyTorch code, you may run into the following error:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [17,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [18,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [19,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

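However, this error is printed directly instead of being raised as an exception. The program keeps running until, several lines later, something bizarre suddenly pops up: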

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "....py", line 340, in batched_fusion
    ref_intrinsics.inverse(),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgetrfBatched( handle, n, dA_array, ldda, ipiv_array, info_array, batchsize)`

What is going on here?


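Since the message says index out of bounds, this is presumably the classic out-of-bounds array access. Starting from the line that raised the exception (CUBLAS_STATUS_EXECUTION_FAILED), work upward and look for the operation that indexed out of bounds. But shouldn't Python raise an exception on an out-of-bounds access? Why does it not raise one here?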
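
One way to narrow that upward search is sketched below. It relies on an assumption that the experiments in this post do not verify directly: that the delay comes from CUDA kernels being launched asynchronously, so setting the environment variable CUDA_LAUNCH_BLOCKING=1 (documented by PyTorch for debugging GPU errors) makes the error surface at the offending line. The helper name index_rows is hypothetical and only for illustration.

```python
import os

# Assumption: this has to be set before the first CUDA call in the process,
# so it is placed before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def index_rows(x, idx):
    """Hypothetical helper: index x with idx, then synchronize so that a
    device-side error is reported here rather than at a later CUDA call."""
    out = x[idx]
    if x.is_cuda:
        torch.cuda.synchronize()
    return out


# Falls back to the CPU so the sketch also runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(5, 5, device=device)
rows = index_rows(x, torch.tensor([0, 4], device=device))
print(rows.shape)  # torch.Size([2, 5])
```

The same variable can also be set on the command line when launching the script, which avoids touching the code at all.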

I ran a few experiments:

>>> import torch
>>> x = torch.rand(5, 5, device='cuda')
>>> x[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
>>> x[5, 5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
>>> x[torch.tensor(5)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
>>> x[torch.tensor([5])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor.py", line 427, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 637, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 568, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 328, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 116, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: numel: integer multiplication overflow
>>> ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

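I created a tensor x on CUDA. Indexing it with a single number (a plain integer or a 0-dimensional tensor) raises an exception, but as soon as the index is a non-scalar tensor, the aten Assertion failed messages appear instead. (The RuntimeError in that last case comes from the REPL trying to print the result, as the __repr__ frames in the traceback show, not from the indexing line itself.)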

After that, I put x on the CPU, and there everything raised an exception normally:

>>> x = torch.rand(5, 5)
>>> x[torch.tensor([5])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
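
Because the CPU path raises an ordinary IndexError, one debugging trick that follows from this experiment is to replay a suspicious indexing operation on CPU copies of the tensors. The sketch below is only an illustration; replay_index_on_cpu is a hypothetical helper name, and it assumes the tensors are small enough to copy to the host.

```python
import torch


def replay_index_on_cpu(x, idx):
    """Hypothetical helper: repeat x[idx] on CPU copies and return the
    IndexError message, or None if the index is valid."""
    try:
        x.cpu()[idx.cpu()]
    except IndexError as err:
        return str(err)
    return None


x = torch.rand(5, 5)
bad_message = replay_index_on_cpu(x, torch.tensor([5]))
ok_message = replay_index_on_cpu(x, torch.tensor([4]))
print(bad_message)
print(ok_message)
```

The returned message names the offending index and dimension, which the CUDA assertion output does not.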

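So my preliminary conclusion is: when the tensor being accessed is on CUDA and the index is a non-scalar tensor, no exception is raised at the indexing line, and the debug message is printed directly instead. (Perhaps it is designed this way because validating a tensor index up front is too much work and would hurt efficiency, so that check is simply skipped? The assertion itself comes from IndexKernel.cu, which suggests the bounds check does run, but inside the CUDA kernel.) So the lines to inspect are the ones that use a non-scalar tensor as an index.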
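
While hunting for the offending line, an explicit bounds check can be put in front of each tensor-indexed access. This is a minimal sketch, not how PyTorch does it internally: checked_index is a hypothetical helper, it only handles indexing along dimension 0 with an integer tensor, and the .item() call forces the host to wait for the device, so it is meant for debugging rather than for production code.

```python
import torch


def checked_index(x, idx):
    """Hypothetical helper: validate an integer index tensor against
    x.shape[0] before running x[idx]."""
    size = x.shape[0]
    # Same valid range as the assertion: -size <= index < size.
    bad = (idx < -size) | (idx >= size)
    if bad.any().item():  # forces a host-device synchronization
        first_bad = idx[bad][0].item()
        raise IndexError(
            f"index {first_bad} is out of bounds for dimension 0 with size {size}"
        )
    return x[idx]


x = torch.rand(5, 5)
picked = checked_index(x, torch.tensor([0, -1]))
print(picked.shape)  # torch.Size([2, 5])
```

Once the faulty index has been found and fixed, the helper can be removed again so the extra synchronization does not slow the program down.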
