Observations on the following "Assertion failed" error: ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: ...

seh_sjlj

Last modified 2024-09-03 10:56:47

First published 2024-09-03 10:54:30

Article link: https://blog.csdn.net/qaqwqaqwq/article/details/141855451

When writing PyTorch code, you may run into errors like the following:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [17,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [18,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [19,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

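However, this error is simply printed to the console. No exception is raised and the program keeps running, until some lines later something strange suddenly pops up: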

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "....py", line 340, in batched_fusion
    ref_intrinsics.inverse(),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgetrfBatched( handle, n, dA_array, ldda, ipiv_array, info_array, batchsize)`

What is going on here?

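Since the message says "index out of bounds", this is presumably the classic out-of-bounds indexing problem. You need to work upward from the line that raised the exception (CUBLAS_STATUS_EXECUTION_FAILED) and find the operation whose index went out of bounds. But shouldn't Python raise an exception on an out-of-bounds index? Why doesn't it here?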
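One way to narrow down the search, sketched here as a suggestion rather than something the experiments below rely on: CUDA kernels are launched asynchronously, so a device-side assertion is usually reported by some later, unrelated CUDA call. Setting the `CUDA_LAUNCH_BLOCKING` environment variable to `1` makes launches synchronous, which should make the failure surface at (or much closer to) the offending line. The assumption in this sketch is that it runs before the first CUDA call in the process, most safely before `torch` is imported.

```python
import os

# Assumption: this executes before CUDA is initialized in this process
# (safest: before `import torch`), because the variable is read at CUDA start-up.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch
# ... then import torch and run the failing code as usual ...
```

The same effect can be had from the shell by prefixing the command, e.g. `CUDA_LAUNCH_BLOCKING=1 python your_script.py` (the script name is a placeholder). Synchronous launches slow the program down, so this is meant for debugging only.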

I ran a few experiments:

>>> import torch
>>> x = torch.rand(5, 5, device='cuda')
>>> x[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
>>> x[5, 5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
>>> x[torch.tensor(5)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
>>> x[torch.tensor([5])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor.py", line 427, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 637, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 568, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 328, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/zry/anaconda3/envs/mvsgs2/lib/python3.7/site-packages/torch/_tensor_str.py", line 116, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
RuntimeError: numel: integer multiplication overflow
>>> ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

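I created a tensor x on cuda. Indexing it with a single number (a plain integer, or a 0-dim tensor such as torch.tensor(5)) raises an IndexError. But once the index is a non-scalar tensor such as torch.tensor([5]), the aten "Assertion failed" message appears instead.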
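The condition quoted in the assertion message, `index >= -sizes[i] && index < sizes[i]`, is the same rule that the IndexError messages describe: for a dimension of size 5, the valid indices are -5 through 4. A minimal pure-Python sketch of that check (the function name `index_in_bounds` is made up for illustration and is not part of PyTorch):

```python
def index_in_bounds(index: int, size: int) -> bool:
    """Mirror of the asserted condition: index >= -size and index < size."""
    return -size <= index < size


size = 5  # size of dimension 0 of x in the session above
valid = [i for i in range(-7, 8) if index_in_bounds(i, size)]
print(valid)                      # [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]
print(index_in_bounds(5, size))   # False: the index used in the experiments
```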

After that, I put x on the CPU instead, and every case raised an exception as expected:

>>> x = torch.rand(5, 5)
>>> x[torch.tensor([5])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 5 is out of bounds for dimension 0 with size 5
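Given the behavior observed above, one possible workaround while debugging is to validate an index tensor on the host before using it, so that a bad index becomes an ordinary Python exception on both devices. The sketch below is only an illustration under assumptions: `checked_index` is a hypothetical helper (not a PyTorch API), it only handles indexing along dimension 0, and the `min`/`max` reads force a device-to-host synchronization, so it is better suited to debugging than to hot loops.

```python
import torch


def checked_index(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: check idx against x.size(0), then return x[idx]."""
    size = x.size(0)
    if idx.numel() > 0:
        # int(...) copies the scalar result to the host (synchronizes on CUDA).
        lo, hi = int(idx.min()), int(idx.max())
        if lo < -size or hi >= size:
            raise IndexError(
                f"index range [{lo}, {hi}] is out of bounds "
                f"for dimension 0 with size {size}"
            )
    return x[idx]


# Falls back to the CPU when no GPU is present, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(5, 5, device=device)

good = checked_index(x, torch.tensor([0, -1, 4], device=device))

try:
    checked_index(x, torch.tensor([5], device=device))
    raised = False
except IndexError as exc:
    raised = True
    print(exc)
```

With this guard in place, the out-of-bounds case is reported at the indexing call itself instead of through a later, unrelated CUDA error.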