ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,77,3072]

tomeasure

Categories: tensorflow, Deep Learning, NLP. Tags: ResourceExhaustedError, OOM, Chunk, tensorflow

Article link: https://blog.csdn.net/qq_29695701/article/details/88603837

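While training a model I hit the error shown in the log below (the full output was too long, so only the relevant parts are kept). From what I found online, this error usually means the GPU has run out of memory, typically because batch_size is too large or because the GPU is already in use by another service. I then looked through the source code and happened to notice that n_gpu defaulted to 4. After changing it to 1 and rerunning, the code ran successfully.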

Combining what I found online with this experiment, the likely causes of this error are:

batch_size is too large;
another model is occupying GPU resources;
the configured number of GPUs does not match the hardware (it is set too high).
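
The third cause can be guarded against before the graph is built. The following is only a minimal sketch, not the training script's own code: count_physical_gpus and clamp_n_gpu are hypothetical helper names, and it assumes the nvidia-smi command is on the PATH (its -L flag prints one line per GPU). Note that nvidia-smi reports every physical GPU, so it does not account for restrictions such as CUDA_VISIBLE_DEVICES.

```python
import subprocess


def count_physical_gpus():
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return 0
    # Each GPU is reported on its own line, e.g. "GPU 0: Tesla P4 (UUID: ...)".
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))


def clamp_n_gpu(requested, available):
    """Hypothetical helper: never request more GPUs than exist (minimum of 1)."""
    return max(1, min(requested, available))


# 4 was the default n_gpu mentioned above; on a single-GPU machine this yields 1.
n_gpu = clamp_n_gpu(4, count_physical_gpus())
print("using n_gpu =", n_gpu)
```

The result could then be passed to the script in place of the hard-coded default, so that a machine with one card is never asked to host four model replicas.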

2019-03-16 18:59:38.881528: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881535: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881540: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:38.881550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2019-03-16 18:59:39.005554: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-16 18:59:39.005820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: Tesla P4
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:00:07.0
Total memory: 7.43GiB
Free memory: 7.32GiB
2019-03-16 18:59:39.005851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2019-03-16 18:59:39.005858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2019-03-16 18:59:39.005868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P4, pci bus id: 0000:00:07.0)


  0%|                                                    | 0/46 [00:00<?, ?it/s]2019-03-16 19:00:05.441385: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:05.441859: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State: 
2019-03-16 19:00:05.462553: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2019-03-16 19:00:05.462905: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.462917: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112

2019-03-16 19:00:05.463019: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.463075: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:05.463170: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:05.463596: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State: 
2019-03-16 19:00:05.484133: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2019-03-16 19:00:05.484464: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.484475: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112

2019-03-16 19:00:05.484576: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.484592: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:05.530899: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.61MiB.  Current allocation summary follows.
2019-03-16 19:00:05.531407: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 3.61MiB was 2.00MiB, Chunk State: 
2019-03-16 19:00:05.553057: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2019-03-16 19:00:05.553394: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.553404: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112

2019-03-16 19:00:05.553505: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.553531: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,768]
2019-03-16 19:00:05.553668: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.61MiB.  Current allocation summary follows.
2019-03-16 19:00:05.554103: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 3.61MiB was 2.00MiB, Chunk State: 
2019-03-16 19:00:05.574314: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2019-03-16 19:00:05.574638: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:05.574666: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                  7463944192
InUse:                  7462481920
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112

2019-03-16 19:00:05.574770: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:05.574786: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[1232,768]
2019-03-16 19:00:15.484765: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:15.485248: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State: 
2019-03-16 19:00:15.506609: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2019-03-16 19:00:15.506956: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:15.506968: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                  7463944192
InUse:                  7462422528
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112

2019-03-16 19:00:15.507082: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:15.507112: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
2019-03-16 19:00:25.507333: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 14.44MiB.  Current allocation summary follows.
2019-03-16 19:00:25.507912: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 14.44MiB was 8.00MiB, Chunk State: 
2019-03-16 19:00:25.527807: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2019-03-16 19:00:25.528034: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.95GiB
2019-03-16 19:00:25.528044: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                  7463944192
InUse:                  7462422528
MaxInUse:               7462915328
NumAllocs:                    3978
MaxAllocSize:            197274112

2019-03-16 19:00:25.528124: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2019-03-16 19:00:25.528148: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,77,3072]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,77,3072]
	 [[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
	 [[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 433, in <module>
    cost, _ = sess.run([clf_loss, train], {X_train:xmb, M_train:mmb, Y_train:ymb})
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,77,3072]
	 [[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
	 [[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'model_2/h2/mlp/Pow', defined at:
  File "train.py", line 397, in <module>
    train, logits, clf_losses, lm_losses = mgpu_train(X_train, M_train, Y_train)
  File "train.py", line 203, in mgpu_train
    clf_logits, clf_losses, lm_losses = model(*xs, train=True, reuse=do_reuse)
  File "train.py", line 172, in model
    h = block(h, 'h%d'%layer, train=train, scale=True)
  File "train.py", line 145, in block
    m = mlp(n, 'mlp', nx*4, train=train)
  File "train.py", line 135, in mlp
    h = act(conv1d(x, 'c_fc', n_state, 1, train=train))
  File "train.py", line 23, in gelu
    return 0.5*x*(1+tf.tanh(math.sqrt(2/math.pi)*(x+0.044715*tf.pow(x, 3))))
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 544, in pow
    return gen_math_ops._pow(x, y, name=name)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1533, in _pow
    result = _op_def_lib.apply_op("Pow", x=x, y=y, name=name)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,77,3072]
	 [[Node: model_2/h2/mlp/Pow = Pow[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](model_2/h2/mlp/c_fc/Reshape_2, model_2/h2/mlp/Pow/y)]]
	 [[Node: Mean_8/_2963 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_62814_Mean_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
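
The numbers in the log can be checked with a little arithmetic. The sketch below assumes 4 bytes per element, which follows from T=DT_FLOAT in the node definition. The Limit and InUse values are copied from the first allocator dump, and tensor_bytes is a helper name introduced here for illustration. The device report at startup shows 7.32GiB of 7.43GiB free, so in this run the memory appears to have been consumed by the training process itself rather than by another service.

```python
from functools import reduce
from operator import mul

MIB = 1024 ** 2
GIB = 1024 ** 3


def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor in bytes; 4 bytes per element matches float32."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shapes reported in the OOM messages of the log.
big = tensor_bytes([16, 77, 3072])
small = tensor_bytes([16, 77, 768])

# Allocator statistics copied from the first "Stats:" dump.
limit = 7463944192
in_use = 7462481920
headroom = limit - in_use

print("%.2f MiB" % (big / MIB))       # 14.44 MiB, the failed allocation size
print("%.2f MiB" % (small / MIB))     # 3.61 MiB, the second failed size
print("%.2f GiB" % (limit / GIB))     # 6.95 GiB, the allocator limit
print("%.2f MiB" % (headroom / MIB))  # 1.39 MiB left when the request failed
```

With roughly 1.39 MiB of headroom, neither the 14.44 MiB nor the 3.61 MiB request could be satisfied, which is consistent with the allocator being full before the failing op ran.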
