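Fixing "Blas xGEMM launch failed" when training or decoding with tensor2tensor: the cause turned out to be...!!!

While training with tensor2tensor on data that had already been generated, I got the following error: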

INFO:tensorflow:Done running local_init_op.
I0518 10:08:16.107131 139741801011008 session_manager.py:534] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 298000...
I0518 10:08:21.952861 139741801011008 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 298000...
INFO:tensorflow:Saving checkpoints for 298000 into ./tf2_out/model.ckpt.
I0518 10:08:21.961663 139741801011008 basic_session_run_hooks.py:618] Saving checkpoints for 298000 into ./tf2_out/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 298000...
I0518 10:08:24.544301 139741801011008 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 298000...
2021-05-18 10:08:24.769320: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-05-18 10:08:30.209993: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-05-18 10:08:30.493887: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-05-18 10:08:30.493932: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
2021-05-18 10:08:30.509157: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-05-18 10:08:30.509205: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
	 [[{{node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul}}]]
	 [[transformer/parallel_0_5/transformer/transformer/body/decoder/layer_5/self_attention/multihead_attention/dot_product_attention/Max/_6345]]
  (1) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
	 [[{{node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 35, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 418, in main
    execute_schedule(exp)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 371, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py", line 468, in continuous_train_and_eval
    self._eval_spec)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 505, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 646, in run
    return self.run_local()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 747, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1514, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 779, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1284, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1385, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1370, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1443, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
	 [[node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul (defined at /lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py:3031) ]]
	 [[transformer/parallel_0_5/transformer/transformer/body/decoder/layer_5/self_attention/multihead_attention/dot_product_attention/Max/_6345]]
  (1) Internal: Blas xGEMM launch failed : a.shape=[1,840,512], b.shape=[1,512,512], m=840, n=512, k=512
	 [[node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul (defined at /lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py:3031) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul:
 Identity_35 (defined at /lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py:186)

Input Source operations connected to node transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul:
 Identity_35 (defined at /lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py:186)

Original stack trace for 'transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/q/Tensordot/MatMul':
  File "/bin/t2t-trainer", line 35, in <module>
    tf.app.run(main)
  File "/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 418, in main
    execute_schedule(exp)
  File "/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 371, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py", line 468, in continuous_train_and_eval
    self._eval_spec)
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 505, in train_and_evaluate
    return executor.run()
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 646, in run
    return self.run_local()
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 747, in run_local
    saving_listeners=saving_listeners)
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 1421, in wrapping_model_fn
    use_tpu=use_tpu)
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 1486, in estimator_model_fn
    logits, losses_dict = model(features)  # pylint: disable=not-callable
  File "/lib/python3.6/dist-packages/tensorflow/python/keras/legacy_tf_layers/base.py", line 561, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 783, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 325, in call
    sharded_logits, losses = self.model_fn_sharded(sharded_features)
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 365, in model_fn_sharded
    if self.use_body_sharded():
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
    _py_if_stmt(cond, body, orelse)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
    return body() if cond else orelse()
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 402, in model_fn_sharded
    sharded_logits, sharded_losses = dp(self.model_fn, datashard_to_features)
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 171, in __call__
    for i in range(self.n):
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 443, in for_stmt
    _py_for_stmt(iter_, extra_test, body, None, None)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 472, in _py_for_stmt
    body(target)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 458, in protected_body
    original_body(protected_iter)
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 229, in __call__
    if self._devices[i] != DEFAULT_DEV_STRING:
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
    _py_if_stmt(cond, body, orelse)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
    return body() if cond else orelse()
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 231, in __call__
    outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
  File "/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 429, in model_fn
    body_out = self.body(transformed_features)
  File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 243, in body
    if self.has_input:
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
    _py_if_stmt(cond, body, orelse)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
    return body() if cond else orelse()
  File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 246, in body
    encoder_output, encoder_decoder_attention_bias = self.encode(
  File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 201, in encode
    self._encoder_function, inputs, target_space, hparams,
  File "/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py", line 103, in transformer_encode
    encoder_output = encoder_function(
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/transformer_layers.py", line 201, in transformer_encoder
    for layer in range(hparams.num_encoder_layers or hparams.num_hidden_layers):
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 443, in for_stmt
    _py_for_stmt(iter_, extra_test, body, None, None)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 472, in _py_for_stmt
    body(target)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 458, in protected_body
    original_body(protected_iter)
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/transformer_layers.py", line 212, in transformer_encoder
    y = common_attention.multihead_attention(
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4649, in multihead_attention
    if cache is None or memory_antecedent is None:
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
    _py_if_stmt(cond, body, orelse)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
    return body() if cond else orelse()
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4650, in multihead_attention
    q, k, v = compute_qkv(query_antecedent, memory_antecedent,
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4454, in compute_qkv
    q = compute_attention_component(
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4399, in compute_attention_component
    if vars_3d_num_heads is not None and vars_3d_num_heads > 0:
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
    _py_if_stmt(cond, body, orelse)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
    return body() if cond else orelse()
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4414, in compute_attention_component
    if filter_width == 1:
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1172, in if_stmt
    _py_if_stmt(cond, body, orelse)
  File "/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 1225, in _py_if_stmt
    return body() if cond else orelse()
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py", line 4416, in compute_attention_component
    antecedent, total_depth, use_bias=False, name=name,
  File "/lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py", line 3031, in dense
    activations = layers().Dense(units, **kwargs)(x)
  File "/lib/python3.6/dist-packages/tensorflow/python/keras/legacy_tf_layers/base.py", line 561, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 783, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py", line 1245, in call
    outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
  File "/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 4802, in tensordot
    ab_matmul = matmul(a_reshape, b_reshape)
  File "/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 3490, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5718, in mat_mul
    name=name)
  File "/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
    op_def=op_def)
  File "/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

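After a lot of searching on Baidu and Google, I was fairly sure the error was caused by GPU occupancy. Shutting down the other TensorFlow Docker containers, restarting the Docker container, and restarting the Docker daemon all had no effect, so I rebooted the Ubuntu host.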
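One way to check the GPU-occupancy suspicion is to look at how much GPU memory is already in use on the host before starting t2t-trainer. The sketch below is not from the original troubleshooting session; it assumes an NVIDIA driver with `nvidia-smi` on the PATH, and the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (in MiB) from nvidia-smi CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return a list of (used_mib, total_mib) per GPU, or [] if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            check=True,
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} MiB used of {total} MiB")
```

Running this once with the suspected program closed and once with it open gives a rough before/after comparison. Plain `nvidia-smi` on the host also lists the processes currently using the GPU.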

 

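After the reboot, t2t-trainer ran normally. Then, on a later run, the problem came back.

I confirmed that nothing had changed between the two runs and that I had done nothing to the system, except open the Chrome browser!

Then it suddenly hit me!!!

I went into Chrome's settings, found the "Use hardware acceleration" option, turned it off, and ran t2t-trainer again.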

 

Problem solved!!!

 

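Analysis: Chrome's "Use hardware acceleration" feature uses the computer's GPU to speed up the browser's work, freeing up more valuable CPU time. But the Chrome process on the host was taking GPU resources that the training job needed, and that caused the error.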
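If closing the competing GPU program is not an option, a commonly suggested workaround for `CUBLAS_STATUS_NOT_INITIALIZED` is to let TensorFlow allocate GPU memory on demand instead of reserving most of it at startup. This is not what I did (the fix above was turning off Chrome's hardware acceleration), and whether it would have avoided the error in this case is untested. The sketch below relies on TensorFlow's `TF_FORCE_GPU_ALLOW_GROWTH` environment variable; the helper name `training_env` is made up for illustration.

```python
import os


def training_env(base_env=None):
    """Return a copy of the environment with TensorFlow GPU memory growth enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this variable set to "true", TensorFlow grows its GPU memory
    # allocation as needed rather than claiming nearly all of it up front.
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching t2t-trainer with your usual arguments, or the same variable can be exported in the shell before running the command.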

 

That one took me a long time to track down.

 

 
