CUDA error: `srcIndex < srcSelectDimSize` failed

袋鼠奥特曼

Published 2024-09-17 20:45:29

Tags: deep learning, artificial intelligence, python, pytorch, machine learning

Article link: https://blog.csdn.net/weixin_43268247/article/details/142318600

Error summary

I ran into this problem on September 8, 2024, in a model I was developing.
The CUDA-side error was:

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [224,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

The additional error caught on the Python side was:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

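Background

I built a partial_CLIP model for multimodal data. It takes the outputs of a pretrained model as input and feeds them to a simple classifier (only the classifier is trained). Because the use case requires simulating missing modalities, I also built a missing-modality scenario for each of the two modalities, and I handle a missing modality by reconstructing it. That is the rough setup.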

Modality reconstruction

The modality reconstruction code used during training is as follows:

import torch

# `device` is assumed to be defined elsewhere, e.g. torch.device('cuda:0')

@torch.no_grad()
def gen_missing_modality(i2tmodel, x_modalities, index, modal_missing):
    '''
    x_modalities: the batch inputs (images, texts, attention_masks).
    index: index of the sample within this batch.
    modal_missing: which modality is missing for this batch.
    '''
    if x_modalities is None or modal_missing is None:
        return
    if modal_missing == 1:
        images, texts, attention_masks = x_modalities[0], x_modalities[1], x_modalities[2]
        img = images[index].clone().detach().to(device)
        img = img.unsqueeze(0)
        i2tmodel.eval()
        # Generate as many tokens as the text sequence is long.
        gentext = i2tmodel(img, generate_lengths=texts.size(1)).squeeze(0)
        texts[index].copy_(gentext)
        # Force the CLIP start token at position 0.
        if texts[index][0] != 49406:
            texts[index][0] = 49406
        # From just before the first masked position onward, fill with the end token.
        for j in range(texts.size(1)):
            if attention_masks[index][j] == 0:
                texts[index, j-1:] = 49407
                break
        texts[index][-1] = 49407

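CLIP's token inputs use 49406 as the start token and 49407 as the end token, so a few small conversions are needed along the way. The generation output all looked normal, but feeding the completed modality into CLIP triggered the errors shown at the top.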

Solution

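(1) At first I thought it was a CUDA problem, so I moved the model to another machine (an A100 40G), but it still failed. So the bug is reproducible every time.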

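(2) Next I suspected that the model's eval() and train() switching was wrong and started checking. There was indeed an incorrect switch, which counts as a separate bug, but after fixing it the error remained.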
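The post does not say what exactly was wrong with the mode switching. One common way to avoid that class of bug is to restore the previous mode automatically after a temporary eval() call. The following is a minimal sketch of that idea; the helper name `eval_mode` is hypothetical and not from the original code.

```python
from contextlib import contextmanager

import torch.nn as nn


@contextmanager
def eval_mode(model: nn.Module):
    """Temporarily put `model` in eval mode, then restore its previous mode."""
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        # Restore whatever mode the model was in before entering the block.
        model.train(was_training)


# Toy module standing in for the generator model.
toy = nn.Linear(2, 2)
toy.train()
with eval_mode(toy):
    inside_is_training = toy.training
after_is_training = toy.training
```

With a helper like this, a generation call placed inside the `with` block cannot leave the model stuck in eval mode when training resumes.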

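(3) With that fixed, I suspected that something went wrong while moving tensors between devices. Originally, the collate function passed to torch.utils.data.DataLoader handled it like this: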

img = copy.deepcopy(xa[i]).to(device)
img = img.unsqueeze(0)
gentext = self.i2tmodelgen(img,max_length = xb.size(1)).squeeze(0)
xb[i].copy_(gentext)

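As you can see, I never moved gentext back to the CPU after generating it. I confidently changed the line as follows and reran, and it still failed...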

gentext = self.i2tmodelgen(img,max_length = xb.size(1)).squeeze(0).cpu()

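(4) I also noticed that img was produced with Python's built-in copy module, and I vaguely remembered that torch tensors have their own copy method, so I changed that part of the code to: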

img = xa[i].clone().detach().to(self.device)

Unsurprisingly, it still failed.
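For reference, both ways of copying a tensor yield an independent copy holding the same values, which is consistent with this change making no difference to the error. A minimal sketch on a toy CPU tensor (the contents of `xa` below are made up for illustration):

```python
import copy

import torch

# Toy stand-in for a batch of images.
xa = torch.arange(6, dtype=torch.float32).reshape(2, 3)

img_deepcopy = copy.deepcopy(xa[0])   # Python's generic deep copy
img_clone = xa[0].clone().detach()    # tensor-native copy, cut from autograd

# Mutate the source after copying; neither copy should change.
xa[0, 0] = 100.0

same_values = torch.equal(img_deepcopy, img_clone)
untouched = img_deepcopy[0].item() == 0.0 and img_clone[0].item() == 0.0
```

`clone().detach()` is the more idiomatic choice for tensors, but it changes how the copy is made, not which values end up in it.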

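(5) By this point I had changed most of the code, and the generated tokens shown in the debugger did not look wrong; they were all ordinary numbers. I started an ablation, masking out each part of the code to see which one raised the error. I expected text-to-image generation to be the culprit, but it was again the image-to-text part.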

(6) I went back and reread the error, and noticed a few keywords:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_73865/1587976768.py in <module>
     23 #     )[2]
     24     print(inputs[1].max(),inputs[1].min())
---> 25     res = partial_clip(inputs,preclip)
     26     loss = criterion(res, label)
     27     optimizer.zero_grad()

~/anaconda3/envs/py377/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

~/workspace/FedAdvan/models/multimodal/pretrain_clip.py in forward(self, inputs, basemodel)
    137         # self.pretrained_forward(inputs,basemodel)
    138         # self.forward_input.clear()
--> 139         visual_layer_output,text_layer_output,attention_mask,casual_attention_mask = self.pretrained_forward(inputs,basemodel)
    140         # text_attention_mask,text_casual_attention_mask = self.forward_input[self.text_attention_mask_key],self.forward_input[self.text_casual_attention_mask_key]
    141         # img_attention_mask,img_casual_attention_mask = self.forward_input[self.img_attention_mask_key],self.forward_input[self.img_casual_attention_mask_key]

~/anaconda3/envs/py377/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25         def decorate_context(*args, **kwargs):
...
--> 114         return F.linear(input, self.weight, self.bias)
    115 
    116     def extra_repr(self) -> str:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

Then I noticed something crucial. Earlier, CUDA had reported:

Assertion `srcIndex < srcSelectDimSize`

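Could it be that the inputs are perfectly valid numbers, but their magnitude is the problem? I searched online and found people suggesting that an index might be out of bounds. Before CLIP turns tokens into learnable vectors, they pass through an nn.Embedding layer. An embedding is essentially a lookup table that maps each token id to a learnable dense vector, which is equivalent to multiplying a one-hot vector by the weight matrix. That lookup is one-to-one, so could one of my input tokens be larger than the table allows?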
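This hypothesis can be checked in isolation on the CPU, where an out-of-range lookup raises an ordinary Python exception at the lookup itself instead of a device-side assertion. The sketch below is illustrative only: the embedding width of 4 is arbitrary, and the ids are the ones quoted in this post.

```python
import torch
import torch.nn as nn

# A table with 49408 rows covers ids 0..49407 (the width 4 is arbitrary).
emb = nn.Embedding(49408, 4)

ok_ids = torch.tensor([49406, 5, 49407])
bad_ids = torch.tensor([49406, 49735, 49407])  # 49735 is past the last row

ok_shape = tuple(emb(ok_ids).shape)

try:
    emb(bad_ids)
    cpu_error = None
except (IndexError, RuntimeError) as exc:
    # On CPU the failure is reported synchronously, at the lookup itself.
    cpu_error = str(exc)
```

Running the failing batch through the embedding layer on the CPU is therefore a quick way to confirm or rule out an index problem.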

I added a print statement just before the forward call:

print(inputs[1].max(),inputs[1].min())
res = partial_clip(inputs,preclip)

The output up to the failing iteration was:

tensor(49407, device='cuda:0') tensor(5, device='cuda:0')
tensor(49407, device='cuda:0') tensor(64, device='cuda:0')
tensor(49735, device='cuda:0') tensor(5, device='cuda:0')

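As mentioned above, CLIP's end token is 49407, but my generated tokens included ids larger than that (49735 here). My model really could generate such tokens, because it was created with a vocabulary of 50000: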

i2tmodel = ImageToTextModel(vocab_size=50000)

So I changed it to:

i2tmodel = ImageToTextModel(vocab_size=49407)

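This time training failed right away, before anything had even been generated, presumably because a vocabulary of 49407 only covers ids 0 to 49406, so the end token 49407 itself is out of range.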

After changing it to 49408, all training and testing ran normally.
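The off-by-one is easy to miss: a table indexed by ids 0 through 49407 needs 49407 + 1 = 49408 rows. A small guard before the forward pass would have caught both the 50000 and the 49407 settings. This is a sketch under assumptions, with `check_token_ids` being a hypothetical helper rather than part of the original code:

```python
import torch

CLIP_END_TOKEN = 49407  # end-token id quoted in this post


def check_token_ids(token_ids: torch.Tensor, num_embeddings: int):
    """Raise ValueError if any id falls outside [0, num_embeddings)."""
    lo, hi = int(token_ids.min()), int(token_ids.max())
    if lo < 0 or hi >= num_embeddings:
        raise ValueError(
            f"token ids span [{lo}, {hi}] but the embedding table "
            f"only covers [0, {num_embeddings - 1}]"
        )
    return lo, hi


# The largest id must be a valid row index, so the table needs max id + 1 rows.
min_vocab_size = CLIP_END_TOKEN + 1
```

Calling `check_token_ids(inputs[1], 49408)` in place of the bare print would turn the out-of-range id into a readable error that names the offending range.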

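Summary

You have to work backward from the result to find the cause rather than changing things blindly. Reason about which part could plausibly trigger the problem: a size mismatch, for example, must come from the input, so look for the part of the input that could produce it. Of course, my initial code had plenty of other problems too. Live and learn...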
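One reason the Python-side message pointed at a matmul rather than at the indexing kernel is that CUDA kernels are launched asynchronously, so a device-side failure can surface at a later, unrelated call. The CUDA_LAUNCH_BLOCKING environment variable forces synchronous launches while debugging, which usually makes the traceback land closer to the failing operation. A minimal sketch, assuming the variable is set before any CUDA work starts:

```python
import os

# Debugging aid only: synchronous launches slow training down considerably.
# Set this before the first CUDA call, ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

The same effect can be had by exporting the variable in the shell before launching the script or notebook kernel.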