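While training a model I hit the error below (the full output is too long, so I kept only the useful parts). From searching online, I learned that this error usually means the GPU has run out of memory, which can happen when batch_size is too large or when other services are occupying the GPU. I then looked through the source code and happened to notice that n_gpu defaults to 4, while the log shows only one device, a Tesla P4 with 7.43 GiB of memory. After I changed n_gpu to 1 and reran the code, it ran successfully.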
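Combining what I found online with this experiment, the possible causes of this problem are:
- batch_size is too large;
- another model is occupying GPU resources;
- the configured number of GPUs does not match the actual hardware (it is set too high).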
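To guard against the third cause, a script could clamp the requested GPU count to what is actually present before building the model. The sketch below is only an illustration: `clamp_n_gpu` and `visible_gpus` are hypothetical names, not part of the original train.py, and the device list is passed in as plain strings so the snippet runs without TensorFlow. In TensorFlow 1.x such a list could be built from `device_lib.list_local_devices()` in `tensorflow.python.client`, keeping the entries whose `device_type` is `"GPU"`.

```python
import warnings


def clamp_n_gpu(requested, visible_gpus):
    """Return a GPU count that never exceeds the GPUs actually visible.

    requested    -- the value from the config or command line (e.g. n_gpu)
    visible_gpus -- list of GPU device names found on this machine (assumed input)
    """
    available = len(visible_gpus)
    if available == 0:
        raise RuntimeError("no GPU is visible to this process")
    if requested < 1:
        raise ValueError("n_gpu must be at least 1")
    if requested > available:
        warnings.warn(
            "n_gpu=%d requested but only %d GPU(s) found; using %d"
            % (requested, available, available)
        )
    return min(requested, available)


# The log in this post reports a single device, so a default of 4 is reduced to 1.
n_gpu = clamp_n_gpu(4, ["/gpu:0"])
print(n_gpu)  # 1
```

For the second cause, running `nvidia-smi` before training shows whether other processes already hold memory on the card.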
2019-03-16 18:59:38.881528: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881535: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881540: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:39.005554: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-16 18:59:39.005820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla P4
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:00:07.0
Total memory: 7.43GiB
Free memory: 7.32GiB
2019-03-16 18:59:39.005851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2019-03-16 18:59:39.005858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2019-03-16 18:59:39.005868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P4, pci bus id: 0000:00:07.0)
0%| | 0/46 [00:00<?, ?it/s]2019-03-16 19:00:05.441385: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB. Current allocation summary follows.
2019-03-16 19:00:05.441859: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:05.462553: I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
2019-03-16 19:00:05.462905: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.462917: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 7463944192
InUse: 7462481920
MaxInUse: 7462915328
NumAllocs: 3978
MaxAllocSize: 197274112
2019-03-16 19:00:05.463019: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.463075: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:05.463170: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB. Current allocation summary follows.
2019-03-16 19:00:05.463596: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:05.484133: I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
2019-03-16 19:00:05.484464: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.484475: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 7463944192
InUse: 7462481920
MaxInUse: 7462915328
NumAllocs: 3978
MaxAllocSize: 197274112
2019-03-16 19:00:05.484576: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.484592: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:05.530899: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.61MiB. Current allocation summary follows.
2019-03-16 19:00:05.531407: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 3.61MiB was 2.00MiB, Chunk State:
2019-03-16 19:00:05.553057: I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
2019-03-16 19:00:05.553394: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.553404: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 7463944192
InUse: 7462481920
MaxInUse: 7462915328
NumAllocs: 3978
MaxAllocSize: 197274112
2019-03-16 19:00:05.553505: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.553531: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,768]
2019-03-16 19:00:05.553668: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.61MiB. Current allocation summary follows.
2019-03-16 19:00:05.554103: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 3.61MiB was 2.00MiB, Chunk State:
2019-03-16 19:00:05.574314: I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
2019-03-16 19:00:05.574638: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.574666: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 7463944192
InUse: 7462481920
MaxInUse: 7462915328
NumAllocs: 3978
MaxAllocSize: 197274112
2019-03-16 19:00:05.574770: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.574786: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[1232,768]
2019-03-16 19:00:15.484765: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB. Current allocation summary follows.
2019-03-16 19:00:15.485248: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:15.506609: I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
2019-03-16 19:00:15.506956: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:15.506968: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 7463944192
InUse: 7462422528
MaxInUse: 7462915328
NumAllocs: 3978
MaxAllocSize: 197274112
2019-03-16 19:00:15.507082: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:15.507112: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:25.507333: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB. Current allocation summary follows.
2019-03-16 19:00:25.507912: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State:
2019-03-16 19:00:25.527807: I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
2019-03-16 19:00:25.528034: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:25.528044: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 7463944192
InUse: 7462422528
MaxInUse: 7462915328
NumAllocs: 3978
MaxAllocSize: 197274112
2019-03-16 19:00:25.528124: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:25.528148: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
Traceback (most recent call last):
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,77,3072]
[[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
[[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 433, in <module>
cost, _ = sess.run([clf_loss, train], {X_train:xmb, M_train:mmb, Y_train:ymb})
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,77,3072]
[[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
[[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'model_2/h2/mlp/Pow', defined at:
File "train.py", line 397, in <module>
train, logits, clf_losses, lm_losses = mgpu_train(X_train, M_train, Y_train)
File "train.py", line 203, in mgpu_train
clf_logits, clf_losses, lm_losses = model(*xs, train=True, reuse=do_reuse)
File "train.py", line 172, in model
h = block(h, 'h%d'%layer, train=train, scale=True)
File "train.py", line 145, in block
m = mlp(n, 'mlp', nx*4, train=train)
File "train.py", line 135, in mlp
h = act(conv1d(x, 'c_fc', n_state, 1, train=train))
File "train.py", line 23, in gelu
return 0.5*x*(1+tf.tanh(math.sqrt(2/math.pi)*(x+0.044715*tf.pow(x, 3))))
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 544, in pow
return gen_math_ops._pow(x, y, name=name)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1533, in _pow
result = _op_def_lib.apply_op("Pow", x=x, y=y, name=name)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,77,3072]
[[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
[[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
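The numbers in the log can be checked by hand, which helps confirm that the GPU was simply full rather than that one tensor was unusually large. The sketch below assumes 4 bytes per element, since the failing node reports `T=DT_FLOAT` (32-bit floats). `tensor_bytes` is a hypothetical helper, and the `Limit` and `InUse` values are copied from the first allocator `Stats` block in the log.

```python
from functools import reduce
from operator import mul

MIB = 1024 ** 2


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shapes taken from the "OOM when allocating tensor with shape[...]" lines.
big = tensor_bytes([16, 77, 3072])
small = tensor_bytes([16, 77, 768])

# Allocator statistics copied from the log (bytes).
limit = 7463944192
in_use = 7462481920
headroom = limit - in_use

print("%.2f MiB" % (big / MIB))       # 14.44 MiB
print("%.2f MiB" % (small / MIB))     # 3.61 MiB
print("%.2f MiB" % (headroom / MIB))  # 1.39 MiB
```

Each failed request is small, but the allocator had only about 1.4 MiB left out of its limit of roughly 6.95 GiB, so even the 3.61 MiB tensor could not fit. This is consistent with the memory being consumed by the overall configuration (batch size, number of model copies, or other processes) rather than by any single tensor.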