[Notes] TensorFlow 2.x GPU error: fixing CUDNN_STATUS_INTERNAL_ERROR

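Today I set out to train a model on the company GPU server, but only the first run succeeded; every run after that failed with CUDNN_STATUS_INTERNAL_ERROR.
I knew the server was also running a PyTorch detection task (judging from how things turned out, that Python process had probably reserved a large share of GPU memory in advance).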

Environment

CUDA: 10.1
cuDNN: 7.6
tensorflow: 2.1.0
python: 3.6

These versions meet the requirements listed on the official site.

Checking GPU memory usage:

(base) jzg@xxx:~$ nvidia-smi
Tue Sep 21 11:05:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:26:00.0  On |                  N/A |
| 34%   38C    P8    15W / 257W |   1517MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1290      G   /usr/lib/xorg/Xorg                            14MiB |
|    0      9028      G   /usr/lib/xorg/Xorg                           141MiB |
|    0     16944      C   python                                      1349MiB |
+-----------------------------------------------------------------------------+
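
The same figures can be read from a script, which is handy for checking how much memory is free before launching a job. This is a minimal sketch that assumes nvidia-smi is on the PATH; the helper names parse_gpu_memory and query_gpu_memory are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (needs an NVIDIA driver installed)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_memory(out)


# Values copied from the nvidia-smi output above: 1517 MiB used of 10986 MiB.
print(parse_gpu_memory("1517, 10986"))
```

On the server itself, query_gpu_memory() would return one entry per GPU; for the snapshot above that works out to 9469 MiB free.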

Error output

...
Mode Run:train
Epoch 1/100
2020-09-21 09:31:31.270788: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-09-21 09:31:31.274736: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train_direction.py", line 155, in <module>
    max_queue_size=30)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1d_1/convolution (defined at /home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
	 [[dense_1/BiasAdd/_20]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1d_1/convolution (defined at /home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_keras_scratch_graph_3127]

Function call stack:
keras_scratch_graph -> keras_scratch_graph

Solutions

Clearing the cache

Run sudo rm -rf ~/.nv/
That did not solve it, though.

Controlling GPU memory allocation

tfconfig = tf.ConfigProto(allow_soft_placement=True)
tfconfig.gpu_options.allow_growth = True
sess = tf.Session(config=tfconfig)

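These are still TensorFlow 1 functions. They can be run in compatibility mode by adding the v1… prefix, but that did not solve the problem either.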
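
For reference, the compatibility form would look roughly like this under TensorFlow 2.x. This is only a sketch, assuming the compatibility prefix meant here is tf.compat.v1; the function name make_growth_session_config is hypothetical, and this approach did not fix the error in my case.

```python
import tensorflow as tf


def make_growth_session_config():
    """Build a TF1-style session config that allocates GPU memory on demand."""
    config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
    config.gpu_options.allow_growth = True
    return config


config = make_growth_session_config()
# A TF1-style session would then be created from this config:
# sess = tf.compat.v1.Session(config=config)
```

The config only affects sessions created from it, so code that never uses that session will not pick up the setting.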

Allocating GPU memory on demand

Based on the article: 解決 TensorFlow 2.0 程式出現 cuDNN failed to initialize 錯誤問題

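By default, TensorFlow maps nearly all of the GPU memory of all GPUs visible to the process (subject to CUDA_VISIBLE_DEVICES). This is done to use the relatively precious GPU memory on the devices more efficiently by reducing memory fragmentation. With another process already holding GPU memory, that up-front grab is the likely reason cuDNN could not initialize here.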
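
Since CUDA_VISIBLE_DEVICES decides which GPUs TensorFlow sees at all, here is a small sketch of setting it from Python. The helper name restrict_visible_gpus is made up for illustration, and on a single-GPU server like this one it changes nothing; it only matters on multi-GPU machines.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA in this process and its children."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# This has to happen before CUDA is initialized, so call it before
# importing tensorflow.
restrict_visible_gpus([0])
```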

See the TensorFlow documentation: Limiting GPU memory growth

import tensorflow as tf

def solve_cudnn_error():
    # Ask TensorFlow to grow GPU memory on demand instead of mapping it all up front
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before GPUs have been initialized
            print(e)

Call it at the start of the script, before the GPUs have been initialized:

solve_cudnn_error()

That solved the problem perfectly, and then left the PyTorch detection process short of GPU memory =_=
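
To avoid starving the other process, the same TensorFlow guide also describes putting a hard cap on how much memory TensorFlow may take. The following is a sketch I have not run on this server: the function name cap_gpu_memory and the 4096 MiB default are arbitrary choices for illustration. As far as I know the cap cannot be combined with memory growth on the same device, so treat it as an alternative to solve_cudnn_error() rather than an addition.

```python
import tensorflow as tf


def cap_gpu_memory(limit_mib=4096):
    """Limit TensorFlow to limit_mib MiB on the first GPU; return True if applied."""
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if not gpus:
        # No GPU visible to TensorFlow, nothing to configure
        return False
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=limit_mib)])
        return True
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
        return False
```

Like solve_cudnn_error(), it would be called once at the top of the script, for example cap_gpu_memory(4096).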

References

解決 TensorFlow 2.0 程式出現 cuDNN failed to initialize 錯誤問題
tensorflow 2.0 GPU错误| 当GPU内存不足时
Limiting GPU memory growth
