RuntimeError: CUDA error: device-side assert triggered

gy-7

Last modified: 2024-08-07 11:45:24


Article tags: deep learning, artificial intelligence

First published: 2024-08-07 00:00:00

Article link: https://blog.csdn.net/qq_39435411/article/details/140964305


1. Error message:

block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Solution (see the image for details):
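The assertion text ("index out of bounds") suggests that an index tensor holds a value outside the valid range. A common case is a class label or an embedding id that is negative or not smaller than the number of classes or embeddings. The following is a minimal sketch of a check that can be run on CPU before the tensors reach the GPU; the helper name `check_index_range` is hypothetical and not part of PyTorch.

```python
import torch

def check_index_range(indices: torch.Tensor, size: int, name: str = "indices") -> None:
    """Raise a readable error if any index falls outside [0, size)."""
    if indices.numel() == 0:
        return
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= size:
        raise ValueError(f"{name} must lie in [0, {size}), but got min={lo}, max={hi}")

# Example: with 5 classes, the valid labels are 0..4.
labels = torch.tensor([0, 3, 4])
check_index_range(labels, size=5, name="labels")  # passes silently
```

Calling the same check with the embedding table size (or the size of the indexed dimension) in place of the class count covers embedding lookups and gather-style indexing as well.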

2. Error message:

/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
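The hint in the message can be followed from inside a script. `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the Python stack trace points at the call that actually failed. A minimal sketch, assuming the variable is set before CUDA is initialized (setting it before `import torch` is the safe choice):

```python
import os

# Must be set before the first CUDA call; doing it before importing torch is safest.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import the framework only after the variable is set
```

The same effect can be had from the shell by prefixing the command with `CUDA_LAUNCH_BLOCKING=1`. Running the failing step on CPU is another option, since CPU kernels usually raise an ordinary exception with a readable message.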

Solution (see the image for details):

This error usually comes from one of the following situations, so check your code for each of them.

NaN values appear.
The predictions and the labels have different lengths.
The predictions or the labels fall outside the range [0, 1].
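The three cases above can be checked in one place before the loss is computed. This is a minimal sketch, assuming a binary cross-entropy style loss that expects probabilities; the helper name `check_bce_inputs` is hypothetical.

```python
import torch

def check_bce_inputs(pred: torch.Tensor, target: torch.Tensor) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if pred.shape != target.shape:
        problems.append(
            f"shape mismatch: pred {tuple(pred.shape)} vs target {tuple(target.shape)}"
        )
    for name, t in (("pred", pred), ("target", target)):
        t = t.detach().float()
        if torch.isnan(t).any():
            problems.append(f"{name} contains NaN")
        elif t.numel() > 0 and (t.min() < 0 or t.max() > 1):
            problems.append(f"{name} has values outside [0, 1]")
    return problems

print(check_bce_inputs(torch.tensor([0.2, 1.5]), torch.tensor([0.0, 1.0])))
```

Calling it on the CPU copies of a failing batch (for example `pred.cpu()`) avoids triggering the device-side assert again while debugging.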

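After printing the variables, I found that my predictions contained NaN, even though the same code ran fine on other data. That made me suspect the data itself, so I went through it and did find some abnormal samples. In addition, ChatGPT offered several ways to deal with NaN: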

When NaN (Not a Number) values appear while training a PyTorch model, the cause is usually numerical instability or overflow. Here are some possible causes and their fixes:

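Learning rate too high: a learning rate that is too high can make gradients explode, which in turn produces NaN. Try lowering it.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lower the learning rate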
Exploding gradients: use gradient clipping to keep gradients from exploding.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
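Where the clipping call sits in the training step matters: it has to run after `backward()` has filled the gradients and before `step()` consumes them. A minimal sketch with a toy model and deliberately large inputs (both are assumptions for illustration only):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # toy model, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4) * 1e3  # large inputs to provoke large gradients
y = torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step(); the return value is the norm before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Logging `total_norm` over time is a cheap way to see whether gradients are growing before they turn into NaN.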
Data preprocessing issues: check whether the input data contains NaN or Inf, and make sure standardization or normalization is done correctly.
assert not torch.isnan(input_tensor).any(), "Input contains NaN"
assert not torch.isinf(input_tensor).any(), "Input contains Inf"
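Since the problem in this post turned out to be a few abnormal samples, it can help to scan the whole dataset once instead of asserting on each batch. A minimal sketch, assuming a map-style dataset whose items are tensors or tuples of tensors; the helper name `find_bad_samples` is hypothetical.

```python
import torch
from torch.utils.data import TensorDataset

def find_bad_samples(dataset) -> list:
    """Return indices of samples whose floating-point tensors contain NaN or Inf."""
    bad = []
    for i in range(len(dataset)):
        sample = dataset[i]
        tensors = sample if isinstance(sample, (tuple, list)) else (sample,)
        for t in tensors:
            if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                bad.append(i)
                break
    return bad

features = torch.tensor([[1.0, 2.0], [float("nan"), 0.0], [3.0, float("inf")]])
labels = torch.tensor([0.0, 1.0, 1.0])
print(find_bad_samples(TensorDataset(features, labels)))  # [1, 2]
```

The returned indices can then be inspected by hand or filtered out before building the `DataLoader`.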
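Loss function issues: make sure the loss function does not introduce NaN. For example, the input to log must not be zero or negative.
import torch.nn.functional as F
loss = F.cross_entropy(pred, target)
assert not torch.isnan(loss).any(), "Loss contains NaN"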
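To make the log-of-zero case concrete, the sketch below compares a hand-written binary cross-entropy on sigmoid outputs with `F.binary_cross_entropy_with_logits`, which works on logits directly. The tensors are made-up values chosen so that the sigmoid saturates in float32.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([100.0, -100.0])
target = torch.tensor([1.0, 0.0])  # both predictions are correct

# Naive version: sigmoid(100) is exactly 1.0 in float32, so log(1 - p) is -inf
# and the product 0 * -inf becomes NaN.
p = torch.sigmoid(logits)
naive = -(target * torch.log(p) + (1 - target) * torch.log(1 - p))

# Stable version: computed from the logits, never forms log(0).
stable = F.binary_cross_entropy_with_logits(logits, target, reduction="none")

print(naive)
print(stable)
```

If a model ends with a sigmoid followed by a BCE loss, one option is to drop the sigmoid from the model and switch to the logits variant of the loss.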
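Weight initialization: use a suitable weight initialization method to avoid instability early in training.
import torch.nn as nn

def weights_init(m):
    # Xavier-initialize every linear layer and zero its bias
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

model.apply(weights_init)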
Check every step of the computation graph: insert assertions into the training loop to check for NaN.
output = model(input)
assert not torch.isnan(output).any(), "Model output contains NaN"
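An assertion on the final output shows that NaN appeared, but not where. One way to narrow it down is a forward hook on every submodule that raises at the first non-finite output. A minimal sketch; the helper name `add_nan_hooks` is hypothetical, while `register_forward_hook` and `named_modules` are standard `nn.Module` methods.

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module) -> list:
    """Register forward hooks that raise as soon as a submodule outputs NaN or Inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output) and not torch.isfinite(output).all():
                raise RuntimeError(
                    f"non-finite output in module '{name}' ({module.__class__.__name__})"
                )
        return hook
    # Skip the root module (empty name) so the error names the innermost layer.
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU())
handles = add_nan_hooks(model)
```

Each returned handle has a `remove()` method, so the hooks can be taken off once the faulty layer is found. For NaN that first appears in the backward pass, `torch.autograd.set_detect_anomaly(True)` is the corresponding built-in tool.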
Adjust the batch size: a batch size that is too large may lead to numerical instability. Try reducing it.
train_loader = DataLoader(train_dataset, batch_size=32)  # adjust the batch size
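Keep the model numerically stable: prefer numerically stable operations in the model design. For example, use log1p instead of log(1+x), and use log_softmax instead of applying log to the output of softmax.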
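The difference between the naive and the stable forms shows up with very small or very large values. A minimal sketch in float32, with made-up inputs chosen to trigger rounding and underflow:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1e-10])
print(torch.log(1 + x))  # 1 + 1e-10 rounds to 1.0 in float32, so the result is 0
print(torch.log1p(x))    # keeps the small value, about 1e-10

logits = torch.tensor([[0.0, 200.0]])
naive = torch.log(torch.softmax(logits, dim=1))  # softmax underflows to 0, log gives -inf
stable = F.log_softmax(logits, dim=1)            # finite, about [-200, 0]
print(naive, stable)
```

`F.cross_entropy` already takes raw logits and applies the log-softmax internally, so a softmax layer in front of it is not needed.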

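Enable mixed-precision training: mixed precision mainly improves speed and memory use. float16 has a narrower range than float32, so on its own it can make overflow more likely; GradScaler compensates by scaling the loss to avoid gradient underflow and by skipping optimizer steps whose gradients contain Inf or NaN.
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
for input, target in train_loader:
    optimizer.zero_grad()
    with autocast():  # run the forward pass and the loss in mixed precision
        output = model(input)
        loss = criterion(output, target)
    scaler.scale(loss).backward()  # backpropagate the scaled loss
    scaler.step(optimizer)         # unscales gradients; skips the step on Inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration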
