1. Problem description:
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
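I ran into a strange problem today. With the same virtual environment, a PyTorch program runs fine when launched from PyCharm, but fails with the error above when launched from the command line.
On another server with the same environment and the same PyTorch version, the program runs normally from the command line as well, which makes the problem even more puzzling.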
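When the same virtual environment behaves differently depending on how it is launched, comparing the two launch environments is a reasonable first check. The following is a minimal sketch of such a diagnostic, not something taken from the original investigation. It records the interpreter path and a few environment variables that commonly influence which CUDA libraries get loaded. The `ENV_KEYS` list and the helper names are illustrative assumptions.

```python
import os
import sys

# Variables that commonly influence which CUDA libraries are picked up.
# This list is an assumption for illustration; it is not exhaustive.
ENV_KEYS = (
    "LD_LIBRARY_PATH",
    "CUDA_HOME",
    "CUDA_VISIBLE_DEVICES",
    "CONDA_PREFIX",
    "PATH",
)


def collect_env_report(environ=None):
    """Return a dict with the interpreter path and selected variables."""
    environ = os.environ if environ is None else environ
    report = {"python": sys.executable}
    for key in ENV_KEYS:
        report[key] = environ.get(key, "<unset>")
    return report


def diff_reports(a, b):
    """Return {key: (a_value, b_value)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run once from PyCharm and once from the shell, then compare the output.
    for key, value in collect_env_report().items():
        print(f"{key}={value}")
```

Running this script under both launch methods and comparing the printed lines shows whether, for example, `LD_LIBRARY_PATH` is set in one case and not in the other.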
2. Full error details
Traceback (most recent call last):
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/aigc/workspace/CFM/code_mp_pipeline.py", line 292, in code_former_core
    num_det_faces = face_helper.get_face_landmarks_5(
  File "/data/aigc/workspace/CFM/facelib/utils/face_restoration_helper.py", line 155, in get_face_landmarks_5
    bboxes = self.face_det.detect_faces(input_img)
  File "/data/aigc/workspace/CFM/facelib/detection/retinaface/retinaface.py", line 211, in detect_faces
    loc, conf, landmarks, priors = self.__detect_faces(image)
  File "/data/aigc/workspace/CFM/facelib/detection/retinaface/retinaface.py", line 158, in __detect_faces
    loc, conf, landmarks = self(inputs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/aigc/workspace/CFM/facelib/detection/retinaface/retinaface.py", line 123, in forward
    out = self.body(inputs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torchvision/models/_utils.py", line 69, in forward
    x = module(x)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
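3. Cause analysis:
My first step was to ask ChatGPT, but the answer it gave did not solve the problem.
I then asked a colleague, who pointed out that since the error comes from cuBLAS, I could try uninstalling the cuBLAS-related packages from the virtual environment. That worked.
Concretely, I uninstalled nvidia-cublas-cu11 (version 11.10.3.66). I still have not figured out the underlying cause.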
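One way to narrow down the cause is to check which cuBLAS shared library the process actually loads under each launch method. The sketch below assumes Linux, where `/proc/self/maps` lists every file mapped into the current process. The function names are hypothetical helpers, and the idea that two different `libcublas` copies are involved is only a hypothesis, not something the investigation above confirmed.

```python
import re


def cublas_paths_from_maps(maps_text):
    """Extract unique libcublas / libcublasLt paths from /proc/<pid>/maps text."""
    paths = []
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if re.search(r"libcublas(Lt)?\.so", path) and path not in paths:
            paths.append(path)
    return paths


def loaded_cublas_paths():
    """Return the cuBLAS libraries mapped into this process (Linux only)."""
    with open("/proc/self/maps") as f:
        return cublas_paths_from_maps(f.read())


def report_after_gemm():
    """Run a small GPU matmul so cuBLAS gets loaded, then list its paths.

    Assumes PyTorch with a working CUDA device; call this manually.
    """
    import torch

    x = torch.randn(8, 8, device="cuda")
    (x @ x).sum().item()  # forces a GEMM call on the GPU
    return loaded_cublas_paths()
```

Calling `report_after_gemm()` from PyCharm and from the command line, before uninstalling anything, would show whether the two runs pick up `libcublas` from different locations, for example from the environment's `site-packages` in one case and from a system-wide CUDA install in the other.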
4. Solution:
pip uninstall nvidia-cublas-cu11
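To confirm that the package is really gone from the active environment, and to notice if a later install brings it back, a small check with the standard library can help. This is a minimal sketch: `installed_version` is a hypothetical helper name, and it relies only on `importlib.metadata`, which is available from Python 3.8 onward.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("nvidia-cublas-cu11")
    if version is None:
        print("nvidia-cublas-cu11 is not installed in this environment")
    else:
        print(f"nvidia-cublas-cu11 {version} is still installed")
```

Run it with the same interpreter that launches the failing program, so that it inspects the same environment.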