I planned to train on the company's GPU server today, but only the first run succeeded; every subsequent run failed with CUDNN_STATUS_INTERNAL_ERROR.
The server was already running a PyTorch detection task (judging from what follows, that Python process had pre-allocated a large amount of GPU memory).
Environment
CUDA: 10.1
cuDNN: 7.6
TensorFlow: 2.1.0
Python: 3.6
These versions match the tested combinations listed on the official TensorFlow site.
Checking GPU memory usage:
(base) jzg@xxx:~$ nvidia-smi
Tue Sep 21 11:05:41 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88 Driver Version: 418.88 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:26:00.0 On | N/A |
| 34% 38C P8 15W / 257W | 1517MiB / 10986MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1290 G /usr/lib/xorg/Xorg 14MiB |
| 0 9028 G /usr/lib/xorg/Xorg 141MiB |
| 0 16944 C python 1349MiB |
+-----------------------------------------------------------------------------+
Error output
...
Mode Run:train
Epoch 1/100
2020-09-21 09:31:31.270788: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-09-21 09:31:31.274736: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "train_direction.py", line 155, in <module>
max_queue_size=30)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
initial_epoch=initial_epoch)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
reset_metrics=False)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch
outputs = self.train_function(ins)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
return self._call_impl(args, kwargs)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1d_1/convolution (defined at /home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
[[dense_1/BiasAdd/_20]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1d_1/convolution (defined at /home/jzg/miniconda3/envs/DirectionDetection/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_keras_scratch_graph_3127]
Function call stack:
keras_scratch_graph -> keras_scratch_graph
Solutions
Delete the cache
Run sudo rm -rf ~/.nv/
This did not solve it.
Control GPU memory allocation
tfconfig = tf.ConfigProto(allow_soft_placement=True)
tfconfig.gpu_options.allow_growth = True
sess = tf.Session(config=tfconfig)
This is still the TensorFlow 1 API; it runs if you route it through the tf.compat.v1 shim, but it did not solve the problem either.
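For completeness, a minimal sketch of the same idea routed through TF2's compat layer (assuming TensorFlow 2.x is installed; as noted above, this approach did not fix the error in my case):

```python
import tensorflow as tf

# TF1-style session config routed through the TF2 compat layer.
# allow_growth makes TensorFlow claim GPU memory incrementally
# instead of grabbing almost all of it up front.
tfconfig = tf.compat.v1.ConfigProto(allow_soft_placement=True)
tfconfig.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=tfconfig)
```

Note that this must run before any GPU operation, otherwise the allocator is already initialized and the setting has no effect.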
Allocate GPU memory on demand
By default, TensorFlow maps nearly all of the GPU memory of every GPU visible to the process (subject to CUDA_VISIBLE_DEVICES). It does this to reduce memory fragmentation and so use the relatively precious GPU memory on the device more efficiently.
See the TensorFlow documentation: Limiting GPU memory growth.
import tensorflow as tf

def solve_cudnn_error():
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before GPUs have been initialized
            print(e)
Call it at the top of the script, before any other TensorFlow code runs:
solve_cudnn_error()
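If you prefer not to touch the code, TensorFlow also honours the TF_FORCE_GPU_ALLOW_GROWTH environment variable, which has the same effect as calling set_memory_growth for every visible GPU:

```shell
# Same effect as solve_cudnn_error(), but set from the shell
export TF_FORCE_GPU_ALLOW_GROWTH=true
python train_direction.py
```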
This solved the problem perfectly... and then my training job starved that PyTorch detection process of GPU memory =_=
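To keep both jobs alive on one card, an alternative sketch is to cap TensorFlow at a fixed memory budget instead of letting it grow freely. The 4096 MiB limit below is an illustrative value, not a measurement; you would tune it to what your model actually needs:

```python
import tensorflow as tf

# Cap TensorFlow at a fixed slice of the GPU so a co-tenant process
# (like the PyTorch detector here) keeps the rest of the card.
# memory_limit=4096 (MiB) is an assumed, illustrative budget.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    except RuntimeError as e:
        # Virtual devices must be configured before the GPU is first used
        print(e)
```

The trade-off versus allow_growth: the limit is enforced, so the training job fails fast if it exceeds its budget rather than silently crowding out the neighbour.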
References
Fixing the "cuDNN failed to initialize" error in TensorFlow 2.0 programs
TensorFlow 2.0 GPU error | when GPU memory is insufficient
Limiting GPU memory growth