[Deep Learning Debugging Tip] RuntimeError: CUDA error: device-side assert triggered (debug on the CPU first, then switch back to the GPU)

多恩Stone

Last modified on 2024-06-06 11:58:08


Categories: Programming Study, Model Deployment, AIGC. Tags: deep learning, artificial intelligence, pytorch, python, AIGC

First published on 2024-06-06 11:57:49

Article link: https://blog.csdn.net/weixin_44212848/article/details/139496301


While converting a PyTorch model to ONNX, the following error came up.

/path/model/bin2onnx.py:157: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sign != 0:
/path/model/bin2onnx.py:172: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sign != 0:
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [2,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [2,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [2,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [2,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [2,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [2,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
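
The last line of the log suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting it from inside the script is shown here, on the assumption that the variable has to be in place before CUDA is initialised (so it is set before torch is imported). It makes kernel launches synchronous, so the stack trace points at the failing call, but the message itself stays the same device-side assert.

```python
import os

# Assumption: set this before the first CUDA call (safest: before importing
# torch); a value set after CUDA has been initialised may have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```

The same effect can be had from the shell by prefixing the command with CUDA_LAUNCH_BLOCKING=1.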

1. Switch the device to CPU

# device = torch.device("cuda:0")
# ⬆️ original code, ⬇️ after the change
device = torch.device("cpu")

Once every tensor and variable has been moved to the CPU, the error message becomes far more readable:

IndexError: index out of range in self
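
The behaviour can be reproduced in isolation. The sketch below is not the article's model: vocab_size and the index values are made-up numbers, chosen only to show that nn.Embedding on the CPU raises a Python IndexError when an index is at or above the table size.

```python
import torch
import torch.nn as nn

vocab_size = 10                       # hypothetical table size
embed = nn.Embedding(vocab_size, 4)   # valid indices are 0 .. vocab_size - 1

# The last index equals vocab_size, so it is one past the end of the table.
bad_idx = torch.tensor([0, 3, vocab_size])

try:
    embed(bad_idx)
    error_message = None
except IndexError as exc:
    # On the CPU the failure is an ordinary Python exception with a usable traceback.
    error_message = str(exc)

print(error_message)
```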

2. Inspect each input variable to locate the problem

class Embedder(nn.Module):
    def __init__(self, vocab_size, d_model, padding_idx=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx)
    def forward(self, x):
        return self.embed(x)

self.coord_embed_x = Embedder(BBOX+COORD_PAD+SVG_END, self.embed_dim, padding_idx=MASK)
self.coord_embed_y = Embedder(BBOX+COORD_PAD+SVG_END, self.embed_dim, padding_idx=MASK)

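In the code used for this article, Embedder fixes vocab_size = BBOX+COORD_PAD+SVG_END at construction time.

The inputs therefore have to match it: every index fed to the embedding must lie in the range [0, vocab_size). Sampling the test inputs with the same upper bound guarantees this: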

pixel_seq = torch.randint(0, BBOX+COORD_PAD+SVG_END, (n_samples, 2), device=device)
xy_seq = torch.randint(0, BBOX+COORD_PAD+SVG_END, (n_samples, 2, 2), device=device)
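
To inspect each input as step 2 describes, a small range check can be run before the forward pass. This is only a sketch: check_index_range is a hypothetical helper, and vocab_size = 16 stands in for BBOX+COORD_PAD+SVG_END, whose real values are not given here.

```python
import torch
import torch.nn as nn

def check_index_range(name, indices, embedding):
    """Print the min/max of an index tensor and return True if it fits the table."""
    lo, hi = int(indices.min()), int(indices.max())
    ok = 0 <= lo and hi < embedding.num_embeddings
    print(f"{name}: min={lo}, max={hi}, "
          f"num_embeddings={embedding.num_embeddings}, ok={ok}")
    return ok

vocab_size = 16                          # hypothetical stand-in value
embed = nn.Embedding(vocab_size, 8)

good = torch.randint(0, vocab_size, (4, 2))   # upper bound is exclusive
bad = good.clone()
bad[0, 0] = vocab_size                        # one past the last valid index

check_index_range("good", good, embed)
check_index_range("bad", bad, embed)
```

Calling the helper on every tensor that reaches an nn.Embedding narrows the failure down to a single input before moving back to the GPU.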

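🔥 Conclusion

When you run into a similar error, switch to the CPU first to find out what the real problem is, then track it down step by step. This works much better than debugging on the GPU directly!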

