[ERROR] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.292.697 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/flatten_gpu_kernel.h:44] Launch] cudaMemcpyAsync error in FlattenFwdGpuKernelMod::Launch, error code is 700
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.292.710 [mindspore/ccsrc/plugin/device/gpu/hal/hardware/gpu_device_context.cc:540] LaunchKernel] Launch kernel failed, kernel full name: Gradients/Default/gradAdd/Reshape-op10082
[CRITICAL] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.145 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_input_gpu_kernel.h:117] Launch] cuDNN Error: ConvolutionBackwardData failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED
The function call stack:
In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(65)/ dx = input_grad(dout, w, x_shape)/
[CRITICAL] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.288 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_input_gpu_kernel.h:117] Launch] cuDNN Error: ConvolutionBackwardData failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED
The function call stack:
In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(65)/ dx = input_grad(dout, w, x_shape)/
[CRITICAL] KERNEL(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.592 [mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_filter_gpu_kernel.h:118] Launch] cuDNN Error: ConvolutionBackwardFilter failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED
The function call stack:
In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(67)/ dw = filter_grad(dout, x, w_shape)/
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.293.700 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:167] SyncStream] cudaStreamSynchronize failed, ret[700], an illegal memory access was encountered
Traceback (most recent call last):
File "/home/luoxuewei/shelei/PFST-LSTM-source-4567_x2ms/experiment/CIKM/dec_PFST_ConvLSTM_dataloader_Gan_SA_mindspore.py", line 526, in <module>
model.train()
File "/home/luoxuewei/shelei/PFST-LSTM-source-4567_x2ms/experiment/CIKM/dec_PFST_ConvLSTM_dataloader_Gan_SA_mindspore.py", line 249, in train
output_G = trainer1(in_frame_dat, group_truth)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/cell.py", line 601, in __call__
raise err
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/cell.py", line 597, in __call__
output = self._run_construct(cast_inputs, kwargs)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/cell.py", line 416, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/nn/wrap/cell_wrapper.py", line 375, in construct
grads = self.grad(self.network, self.weights)(*inputs, sens)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 399, in after_grad
return grad_(fn, weights)(*args, **kwargs)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/common/api.py", line 93, in wrapper
results = fn(*arg, **kwargs)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/composite/base.py", line 391, in after_grad
out = _pynative_executor(fn, grad_.sens_param, *args, **kwargs)
File "/home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/common/api.py", line 951, in __call__
return self._executor(sens_param, obj, args)
RuntimeError: mindspore/ccsrc/plugin/device/gpu/kernel/nn/conv2d_grad_filter_gpu_kernel.h:118 Launch] cuDNN Error: ConvolutionBackwardFilter failed | Error Number: 8 CUDNN_STATUS_EXECUTION_FAILED
The function call stack:
In file /home/luoxuewei/miniconda3/lib/python3.9/site-packages/mindspore/ops/_grad/grad_nn_ops.py(67)/ dw = filter_grad(dout, x, w_shape)/
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.818.362 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:167] SyncStream] cudaStreamSynchronize failed, ret[700], an illegal memory access was encountered
[ERROR] ME(5124,7f9d16bf9240,python):2022-09-17-15:58:13.818.390 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:81] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.692 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:158] DestroyStream] cudaStreamDestroy failed, ret[700], an illegal memory access was encountered
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.710 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:61] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.724 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:158] DestroyStream] cudaStreamDestroy failed, ret[700], an illegal memory access was encountered
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.829.733 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:61] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.830.140 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:67] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(5124,7f9d16bf9240,python):2022-09-17-15:58:13.831.354 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:48] FreeDeviceMem] cudaFree failed, ret[700], an illegal memory access was encountered
[CRITICAL] PRE_ACT(5124,7f9d16bf9240,python):2022-09-17-15:58:13.831.371 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:428] operator()] Free device memory[0x7f992e000000] error.
Error in atexit._run_exitfuncs:
RuntimeError: mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:428 operator()] Free device memory[0x7f992e000000] error.
****************************************************Answer*****************************************************
This looks like a memory-related problem in one of the operators. Could you follow the post below to help narrow down which operator is failing?
https://bbs.huaweicloud.com/forum/thread-169762-1-1.html
Alternatively, if that is hard to determine, set the following two environment variables, rerun, and share the resulting log with us:
export CUDA_LAUNCH_BLOCKING=1
export GLOG_v=1
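Putting the two variables together, a rerun might look like the sketch below (the script name is taken from the traceback above; adjust the path for your setup). `CUDA_LAUNCH_BLOCKING=1` makes every CUDA kernel launch synchronous, so the error-700 "illegal memory access" is reported at the kernel that actually caused it rather than at a later `cudaMemcpyAsync` or `cudaStreamSynchronize`; `GLOG_v=1` raises the MindSpore log level to INFO:

```shell
# Launch kernels synchronously so the failing operator is reported
# at its true call site instead of a later async sync point.
export CUDA_LAUNCH_BLOCKING=1

# MindSpore log verbosity: 0=DEBUG, 1=INFO, 2=WARNING, 3=ERROR.
export GLOG_v=1

# Rerun the training script and capture the full log to a file.
python dec_PFST_ConvLSTM_dataloader_Gan_SA_mindspore.py 2>&1 | tee debug.log
```

With these set, the last kernel named in `debug.log` before the error 700 is usually the operator that performed the out-of-bounds access.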