While using tensor2tensor to train on data that had already been generated, I got the following error:
INFO:tensorflow:Done running local_init_op.
I0518 10:08:16.107131 139741801011008 session_manager.py:534] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 298000...
I0518 10:08:21.952861 139741801011008 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 298000...
INFO:tensorflow:Saving checkpoints for 298000 into ./tf2_out/model.ckpt.
I0518 10:08:21.961663 139741801011008 basic_session_run_hooks.py:618] Saving checkpoints for 298000 into ./tf2_out/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 298000...
I0518 10:08:24.544301 139741801011008 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 298000...
2021-05-18 10:08:24.769320: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-05-18 10:08:30.209993: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-05-18 10:08:30.493887: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-05-18 10:08:30.493932: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
2021-05-18 10:08:30.509157: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-05-18 10:08:30.509205: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
[[{{node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul}}]]
[[transformer/parallel_0_5/transformer/transformer/body/decoder/layer_5/self_attention/multihead_attention/dot_product_attention/Max/_6345]]
(1) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
[[{{node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/t2t-trainer", line 35, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 418, in main
execute_schedule(exp)
File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 371, in execute_schedule
getattr(exp, FLAGS.schedule)()
File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py", line 468, in continuous_train_and_eval
self._eval_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 505, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 646, in run
return self.run_local()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 747, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1514, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 779, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1284, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1385, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1370, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1443, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 968, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
[[node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul (defined at /lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py:3031) ]]
[[transformer/parallel_0_5/transformer/transformer/body/decoder/layer_5/self_attention/multihead_attention/dot_product_attention/Max/_6345]]
(1) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
[[node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul (defined at /lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py:3031) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul:
Identity_35 (defined at /lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py:186)
Input Source operations connected to node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul:
Identity_35 (defined at /lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py:186)
Original stack trace for 'transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul':
File "/bin/t2t-trainer", line 35, in <module>
tf.app.run(main)
File "/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 418, in main
execute_schedule(exp)
File "/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 371, in execute_schedule
getattr(exp, FLAGS.schedule)()
File "/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py", line 468, in continuous_train_and_eval
self._eval_spec)
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 505, in train_and_evaluate
return executor.run()
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 646, in run
return self.run_local()
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 747, in run_local
saving_listeners=saving_listeners)
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
self.config)
File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 1421, in wrapping_model_fn
use_tpu=use_tpu)
File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 1486, in estimator_model_fn
logits, losses_dict = model(features) # pylint: disable=not-callable
File "/lib/python3.6/dist-packages/tensorflow/python/keras/legacy_tf_layers/base.py", line 561, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 783, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 325, in call
sharded_logits, losses = self.model_fn_sharded(sharded_features)
File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 365, in model_fn_sharded
if self.use_body_sharded():
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
_py_if_stmt(cond, body, orelse)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
return body() if cond else orelse()
File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 402, in model_fn_sharded
sharded_logits, sharded_losses = dp(self.model_fn, datashard_to_features)
File "/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 171, in __call__
for i in range(self.n):
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 443, in for_stmt
_py_for_stmt(iter_, extra_test, body, None, None)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 472, in _py_for_stmt
body(target)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 458, in protected_body
original_body(protected_iter)
File "/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 229, in __call__
if self._devices[i] != DEFAULT_DEV_STRING:
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
_py_if_stmt(cond, body, orelse)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
return body() if cond else orelse()
File "/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 231, in __call__
outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 429, in model_fn
body_out = self.body(transformed_features)
File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 243, in body
if self.has_input:
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
_py_if_stmt(cond, body, orelse)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
return body() if cond else orelse()
File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 246, in body
encoder_output, encoder_decoder_attention_bias = self.encode(
File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 201, in encode
self._encoder_function, inputs, target_space, hparams,
File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 103, in transformer_encode
encoder_output = encoder_function(
File "/lib/python3.6/dist-packages/tensor2tensor/layers/transformer_layers.py", line 201, in transformer_encoder
for layer in range(hparams.num_encoder_layers or hparams.num_hidden_layers):
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 443, in for_stmt
_py_for_stmt(iter_, extra_test, body, None, None)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 472, in _py_for_stmt
body(target)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 458, in protected_body
original_body(protected_iter)
File "/lib/python3.6/dist-packages/tensor2tensor/layers/transformer_layers.py", line 212, in transformer_encoder
y = common_attention.multihead_attention(
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4649, in multihead_attention
if cache is None or memory_antecedent is None:
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
_py_if_stmt(cond, body, orelse)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
return body() if cond else orelse()
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4650, in multihead_attention
q, k, v = compute_qkv(query_antecedent, memory_antecedent,
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4454, in compute_qkv
q = compute_attention_component(
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4399, in compute_attention_component
if vars_3d_num_heads is not None and vars_3d_num_heads > 0:
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
_py_if_stmt(cond, body, orelse)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
return body() if cond else orelse()
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4414, in compute_attention_component
if filter_width == 1:
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
_py_if_stmt(cond, body, orelse)
File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
return body() if cond else orelse()
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4416, in compute_attention_component
antecedent, total_depth, use_bias=False, name=name,
File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py", line 3031, in dense
activations = layers().Dense(units, **kwargs)(x)
File "/lib/python3.6/dist-packages/tensorflow/python/keras/legacy_tf_layers/base.py", line 561, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 783, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py", line 1245, in call
outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
File "/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 4802, in tensordot
ab_matmul = matmul(a_reshape, b_reshape)
File "/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 3490, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5718, in mat_mul
name=name)
File "/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
op_def=op_def)
File "/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
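After a lot of searching on Baidu and Google, I was fairly sure the cause was GPU occupancy. I tried shutting down several TensorFlow Docker containers, restarting the Docker container, and restarting the Docker process, none of which helped, so I rebooted the Ubuntu host.
After the reboot, t2t-trainer ran normally, but when I ran it again later the problem came back.
I confirmed that nothing had changed between the two runs and that I had done nothing to the system, except open the Chrome browser!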
Suddenly it hit me!
I found the "Use hardware acceleration" option in Chrome's settings, turned it off, and ran t2t-trainer again.
Problem solved!
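Analysis: Chrome's hardware acceleration feature uses the computer's GPU to speed up its own work, freeing up more valuable CPU time. But the Chrome process on the host was taking GPU resources that the training job needed, which caused the error. This is consistent with the `failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED` lines in the log above, which appear right before the `Blas xGEMM launch failed` error.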
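To catch this kind of contention before starting a run, one option is to check how much GPU memory is already in use on the host. The following is only a minimal sketch, not something from my original debugging: it assumes `nvidia-smi` is on the PATH, and the helper names (`parse_gpu_memory`, `query_gpu_memory`, `busy_gpus`) and the 10% threshold are hypothetical choices for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse lines of 'used, total' (in MiB) into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used/total memory per GPU (assumes nvidia-smi is installed)."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,  # decode output to str; works on Python 3.6
    )
    return parse_gpu_memory(out)


def busy_gpus(rows, max_used_fraction=0.1):
    """Return indices of GPUs whose used memory exceeds the given fraction.

    The 0.1 default is an arbitrary illustrative threshold, not a recommendation.
    """
    return [
        index
        for index, (used, total) in enumerate(rows)
        if total > 0 and used / total > max_used_fraction
    ]


# Example usage on a machine with an NVIDIA GPU (not executed here):
#   occupied = busy_gpus(query_gpu_memory())
#   if occupied:
#       print("GPUs already in use before training:", occupied)
```

If a GPU shows up as busy before training starts, running plain `nvidia-smi` and looking at its process table may show which host processes (a browser, for example) are holding it.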
That was quite a hunt.