Errors I ran into while writing TensorFlow and PyTorch code, and how I fixed them (when I managed to fix them at the time)

2022

  • RuntimeError('cuDNN error: CUDNN_STATUS_EXECUTION_FAILED',)
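    In an NLP model, one token id in the data exceeded the input size of the embedding layer. The `srcIndex < srcSelectDimSize` assertions in the log below point at exactly this out-of-range lookup.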
Traceback (most recent call last):
  File "/tmp/pycharm_project_327/main.py", line 946, in train
    outputs = model(inputs, labels=labels)
  File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/pycharm_project_327/main.py", line 359, in forward
    lstm_outputs, _ = self.lstm(inputs_embeds)
  File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 692, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
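
A check like the following, run on the CPU before a batch reaches the model, would report the offending id directly instead of a cuDNN error. It is only a minimal sketch: `find_out_of_range_ids` and its arguments are hypothetical names, and `num_embeddings` stands for the size of the embedding table.

```python
def find_out_of_range_ids(batch, num_embeddings):
    """Return (row, col, token_id) for every id the embedding table cannot index."""
    bad = []
    for row, sequence in enumerate(batch):
        for col, token_id in enumerate(sequence):
            # Valid embedding indices are 0 .. num_embeddings - 1.
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((row, col, token_id))
    return bad


# Hypothetical batch: the vocabulary has 100 entries, so id 100 is out of range.
batch = [[1, 5, 99], [7, 100, 3]]
print(find_out_of_range_ids(batch, num_embeddings=100))
```

An empty result means every id fits the table; anything else gives the row and column to inspect in the data.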

January

  • Fatal Python error: Segmentation fault
Please switch to tf.train.MonitoredTrainingSession
Fatal Python error: Segmentation fault

Current thread 0x00007fe0a5162700 (most recent call first):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166 in _as_graph_def
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238 in as_graph_def
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 323 in __init__
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324 in new_func
  File "attention.py", line 439 in main
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 303 in run
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "attention.py", line 475 in <module>
[1]    5145 segmentation fault (core dumped)  python attention.py --save_path lstm --gpu 1

I don't know what the problem is; I'll try a smaller batch_size.

Please switch to tf.train.MonitoredTrainingSession
Traceback (most recent call last):
  File "attention.py", line 475, in <module>
    tf.app.run()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "attention.py", line 439, in main
    sv = tf.train.Supervisor(logdir=None, summary_op=None)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 323, in __init__
    graph_def=graph.as_graph_def(add_shapes=True),
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166, in _as_graph_def
    graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message

After changing some parameters:

2021-01-16 19:33:38.647179: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.648679: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.649929: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.718605: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.718984: F tensorflow/core/common_runtime/gpu/gpu_util.cc:293] GPU->CPU Memcpy failed
Fatal Python error: Aborted

Thread 0x00007ef9d4ff9700 (most recent call first):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 295 in wait
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 551 in wait
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 293 in _close_on_stop
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 864 in run
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 884 in _bootstrap

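In the end this was most likely still a lack of memory, or someone else was running a job at the same time as me and we were competing for the GPU.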
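
When a machine is shared, one way to avoid this is to pick the least-loaded GPU before the framework starts. The sketch below parses the output of `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`; `pick_freest_gpu` is a hypothetical helper and the sample text is made up for illustration.

```python
import os


def pick_freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory.

    csv_text is the output of:
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
    """
    best_index, best_free = None, -1
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        free = total - used
        if free > best_free:
            best_index, best_free = index, free
    return best_index


# Hypothetical output: GPU 0 is nearly full (someone else's job), GPU 1 is idle.
sample = "0, 10890, 11178\n1, 12, 11178\n"
gpu = pick_freest_gpu(sample)
# Must be set before TensorFlow or PyTorch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
```

This only avoids a GPU that is already busy when the script starts; it does not stop another job from arriving later.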

December

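  • TensorFlow: Can't parse serialized Example.
    I had built a TFRecord dataset and could not read it back; this error was raised on reading.
    The cause turned out to be that I had specified the wrong shape.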

path_feature_description = {
    'path': tf.io.FixedLenFeature((50 * 5 * 10,), tf.int64),
}

The (50 * 5 * 10,) here was my own mistake, and it took me a long time to track it down.
While debugging, I could use

tf.train.Example.FromString(example)

to read the data back without any problem, yet parsing still failed. Once I corrected the shape, parsing worked.

See this for reference:
https://stackoverflow.com/questions/45427637/numpy-to-tfrecords-is-there-a-more-simple-way-to-handle-batch-inputs-from-tfrec/45428167#45428167
It shows how to store a NumPy array in a TFRecord; the key API is flatten().
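
A quick way to catch this kind of mismatch is to compare the number of values actually stored in a record with the product of the shape passed to `tf.io.FixedLenFeature`. The sketch below is plain Python; `flatten` and `check_fixed_len` are hypothetical helpers, and in practice the value list would come from something like `tf.train.Example.FromString(example).features.feature['path'].int64_list.value`.

```python
from functools import reduce
from operator import mul


def flatten(nested):
    """Flatten an arbitrarily nested list, similar to numpy's flatten()."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def check_fixed_len(values, shape):
    """Return (ok, stored, expected) for a FixedLenFeature-style shape."""
    expected = reduce(mul, shape, 1)
    return len(values) == expected, len(values), expected


# Hypothetical record: a 2 x 3 x 4 array flattened before being written.
array = [[[0] * 4 for _ in range(3)] for _ in range(2)]
values = flatten(array)
print(check_fixed_len(values, (2, 3, 4)))     # shape matches what was written
print(check_fixed_len(values, (2 * 3 * 5,)))  # wrong shape: parsing would fail
```

If the stored and expected counts differ, the shape in the feature description is the first thing to fix.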

November

November 17

  • ModuleNotFoundError: No module named '_lzma'
Traceback (most recent call last):
  File "/tmp/code/ct/pyl.py", line 6, in <module>
    from torchvision.datasets import MNIST
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/__init__.py", line 7, in <module>
    from torchvision import datasets
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/datasets/__init__.py", line 6, in <module>
    from .mnist import MNIST, EMNIST, FashionMNIST, KMNIST, QMNIST
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 11, in <module>
    import lzma
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/lzma.py", line 27, in <module>
    from _lzma import *
ModuleNotFoundError: No module named '_lzma'

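Solution (the usual cause is that Python was built, here through pyenv, without the liblzma development headers; installing them and rebuilding Python restores the _lzma module):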
https://github.com/JaidedAI/EasyOCR/issues/84

October

October 31

  • NaN problem
    This happened because I computed log(-1e7) where it should have been log(1e-7).
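
A small guard makes this kind of typo harder to write: clamp the argument to a positive floor before taking the log. This is a sketch in plain Python with a hypothetical helper `safe_log`; note that `math.log` raises a ValueError on a negative argument, whereas the tensor libraries silently return NaN.

```python
import math


def safe_log(x, eps=1e-7):
    """Natural log with the argument clamped to at least eps."""
    # Clamping keeps the argument strictly positive, so math.log never
    # sees 0 or a negative number.
    return math.log(max(x, eps))


print(safe_log(0.0))   # same as math.log(1e-7)
print(safe_log(0.5))
```

The same idea applies to tensors by clipping the input to the range [eps, 1] before the log.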

October 30

  • RuntimeError: implement_array_function method already has a docstring
    This was because I had written a file named copy.py, which presumably conflicted with the standard library's copy module.
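
To spot such name clashes early, the file names in a project can be compared against the standard library's module names. This is a rough sketch: `shadowing_files` is a hypothetical helper, and the fallback name set is deliberately incomplete and only used on Python versions older than 3.10.

```python
import sys

# sys.stdlib_module_names exists on Python 3.10+; the fallback set is a
# small, incomplete sample used only for illustration on older versions.
STDLIB_NAMES = getattr(
    sys, "stdlib_module_names",
    frozenset({"copy", "random", "math", "json", "types", "logging", "queue"}),
)


def shadowing_files(filenames):
    """Return the .py files whose module name collides with the standard library."""
    return [
        name for name in filenames
        if name.endswith(".py") and name[:-3] in STDLIB_NAMES
    ]


# Hypothetical project directory listing.
print(shadowing_files(["copy.py", "lstm.py", "attention.py", "README.md"]))
```

Renaming any file it reports (and deleting the stale __pycache__ entry) removes the conflict.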

October 25

  • Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2625849871bytes) would be larger than the limit (2147483647 bytes)
    Unresolved.

October 24

  • Found Inf or NaN global norm. : Tensor had NaN values
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
	 [[{{node VerifyFinite/CheckNumerics}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/code/cc1020/lstm.py", line 846, in <module>
    tf.app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tmp/code/cc1020/lstm.py", line 837, in main
    run_epoch(epoch, sess, train_data)
  File "/tmp/code/cc1020/lstm.py", line 799, in run_epoch
    fetched = sess.run(fetches, feed_dict)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
	 [[node VerifyFinite/CheckNumerics (defined at /tmp/code/cc1020/lstm.py:707) ]]

Caused by op 'VerifyFinite/CheckNumerics', defined at:
  File "/tmp/code/cc1020/lstm.py", line 846, in <module>
    tf.app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tmp/code/cc1020/lstm.py", line 707, in main
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), FLAGS.max_grad_norm)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
	 [[node VerifyFinite/CheckNumerics (defined at /tmp/code/cc1020/lstm.py:707) ]]

This happened after I changed some special labels to -1, which then produced this baffling error.

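My debugging approach: run on the CPU first, because the error messages on the CPU are more specific. Once the problem is solved on the CPU, run on the GPU again.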
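
One plausible explanation is that -1 is not a valid class index, so the loss for those positions is undefined. A common way to keep such special labels is to mask them out of the loss. The sketch below shows the idea in plain Python; `masked_mean_nll` and `ignore_label` are hypothetical names, and this is not necessarily how the original script was fixed.

```python
import math


def masked_mean_nll(probs, labels, ignore_label=-1):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(probs, labels):
        if label == ignore_label:
            continue  # skip special positions instead of indexing with -1
        if not 0 <= label < len(row):
            raise ValueError(f"label {label} outside [0, {len(row)})")
        total += -math.log(row[label])
        count += 1
    # Guard against an all-ignored batch, which would otherwise divide by zero.
    return total / count if count else 0.0


# Hypothetical batch of 3 positions over 2 classes; the last label is special.
probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
labels = [0, 1, -1]
loss = masked_mean_nll(probs, labels)
```

The explicit range check plays the same role as running on the CPU first: a bad label fails loudly with its value instead of turning into NaN.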

October 20

  • UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
2020-10-20 12:36:29.138645: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cudnn_rnn_ops.cc:1217 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Train loss:       , accuracy      Test loss:      , accuracy        :   0%| | 0/
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
	 [[{{node rnn/cudnn_lstm/CudnnRNN}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/tf_learn/mnist_culstm_debug.py", line 60, in <module>
    _, loss_np, accuracy_np = sess.run([train, loss, accuracy], feed_dict={_inputs: batch_x, y: batch_y})
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
	 [[node rnn/cudnn_lstm/CudnnRNN (defined at /tmp/tf_learn/mnist_culstm_debug.py:34) ]]

Caused by op 'rnn/cudnn_lstm/CudnnRNN', defined at:
  File "/tmp/tf_learn/mnist_culstm_debug.py", line 34, in <module>
    outputs, _ = lstm(rnn_input, training=is_training)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 414, in call
    training)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 489, in _forward
    seed=self._seed)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1014, in _cudnn_rnn
    outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 142, in cudnn_rnn
    seed2=seed2, is_training=is_training, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
	 [[node rnn/cudnn_lstm/CudnnRNN (defined at /tmp/tf_learn/mnist_culstm_debug.py:34) ]]

I don't know how to properly fix this one; dropout simply could not be used here, and removing dropout made the error go away.

  • ValueError: could not convert string to float: 'Five One Nine Five Three PAD'
Traceback (most recent call last):
  File "/tmp/tf_learn/text_rnn_example.py", line 129, in <module>
    _seqlens: seqlen_test
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'Five One Nine Five Three PAD'

This was a data mistake again: I had fed the input data as the labels.

  • ValueError: setting an array element with a sequence.
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tf_learn/text_rnn_example.py", line 115, in <module>
    _seqlens: seqlen_batch
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

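This was actually caused by getting the padding wrong: the sequences ended up with different lengths and could not be converted to a NumPy array.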
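
A padding step that guarantees equal lengths might look like the following. It is a minimal sketch: `pad_sequences` here is a hypothetical helper written for illustration (not a library function), and the "PAD" token is taken from the error message earlier in this entry.

```python
def pad_sequences(sequences, pad_token="PAD", max_len=None):
    """Right-pad (and truncate) every sequence to the same length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]
        padded.append(seq + [pad_token] * (max_len - len(seq)))
    return padded


# Hypothetical batch in the style of the digit-word data above.
batch = [["Five", "One", "Nine"], ["Three"], ["One", "Two"]]
padded = pad_sequences(batch)
# Every row now has the same length, so the batch forms a rectangular array.
assert len({len(row) for row in padded}) == 1
```

Checking the set of row lengths before calling sess.run catches a ragged batch at the point where it is created.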

  • A warning at runtime
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

A PR I found on GitHub:
https://github.com/tensorflow/tensorboard/pull/2482
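
As far as I know, these warnings come from an older TensorFlow being imported with a newer NumPy, so pinning an older NumPy or upgrading TensorFlow removes them at the source. If neither is possible, the warning category can be filtered before the import. The sketch below uses only the standard `warnings` module; `silence_future_warnings` is a hypothetical helper name.

```python
import warnings


def silence_future_warnings():
    """Ignore FutureWarning; call this before `import tensorflow`."""
    warnings.filterwarnings("ignore", category=FutureWarning)


silence_future_warnings()
# import tensorflow as tf  # the dtype FutureWarnings would no longer be printed
```

This hides every FutureWarning in the process, not only the dtype ones, so it is better suited to experiment scripts than to library code.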

October 19

  • Initialization problem: very poor accuracy
    In the final MLP layer, initializing the weights to zeros gave very poor results
    # w = tf.Variable(tf.zeros([hidden_layer_size, num_classes]), dtype=tf.float32)
    w = tf.Variable(tf.truncated_normal([hidden_layer_size, num_classes], mean=0, stddev=0.01), dtype=tf.float32)
    # _b = tf.Variable(tf.zeros([num_classes]), dtype=tf.float32)
    b = tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=0.01), dtype=tf.float32)
  • Shape problem
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,128] vs. [128,128]
	 [[{{node rnn/scan/while/add_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 51, in <module>
    _, loss_np, accuracy_np = sess.run([train, loss, accuracy], feed_dict={_inputs:batch_x,y:batch_y})
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,128] vs. [128,128]
	 [[node rnn/scan/while/add_1 (defined at /tmp/tf_learn/mnist_rnn_example_rewrite.py:28) ]]

Caused by op 'rnn/scan/while/add_1', defined at:
  File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 31, in <module>
    output = tf.scan(run_rnn, rnn_input, tf.zeros([batch_size, hidden_layer_size]))[-1]
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 724, in scan
    maximum_iterations=n)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 703, in compute
    a_out = fn(packed_a, packed_elems)
  File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 28, in run_rnn
    + b
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 812, in binary_op_wrapper
    return func(x, y, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 365, in add
    "Add", x=x, y=y, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [32,128] vs. [128,128]
	 [[node rnn/scan/while/add_1 (defined at /tmp/tf_learn/mnist_rnn_example_rewrite.py:28) ]]


Process finished with exit code 1

The problem here was that I had got batch_size wrong; correcting it fixed the error.
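
The message "Incompatible shapes: [32,128] vs. [128,128]" follows from the broadcasting rule: trailing dimensions must be equal or 1. Presumably one operand was built with the real batch of 32 rows and the other with a hard-coded batch_size of 128. The rule itself can be sketched in a few lines of plain Python (`broadcastable` is a hypothetical helper):

```python
def broadcastable(shape_a, shape_b):
    """True if two shapes are compatible under NumPy/TensorFlow broadcasting."""
    # Compare dimensions from the trailing end; each pair must be equal
    # or contain a 1. Missing leading dimensions are treated as 1.
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


print(broadcastable((32, 128), (128, 128)))  # the failing pair from the log
print(broadcastable((32, 128), (128,)))      # a bias vector broadcasts fine
```

Deriving the initial state's first dimension from the input tensor instead of a constant avoids the mismatch altogether.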

  • Trained parameters become NaN
0 [[1221726.2  2054588.9   783324.44]] -961194.375
1 [[-4.9709313e+12 -8.4502328e+12 -3.0732030e+12]] 4141717651456.0
2 [[2.0209081e+19 3.4784079e+19 1.2077052e+19]] -1.7788167394997305e+19
3 [[-8.2097530e+25 -1.4329765e+26 -4.7536997e+25]] 7.618131755874552e+25
4 [[3.3328387e+32 5.9077980e+32 1.8740524e+32]] -3.254525020468151e+32
5 [[-inf -inf -inf]] inf
6 [[nan nan nan]] nan
7 [[nan nan nan]] nan
8 [[nan nan nan]] nan
9 [[nan nan nan]] nan

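In the end, taking the mean of the loss fixed it: loss = tf.reduce_mean(loss). Without the mean, the gradient is summed over every element of the loss tensor, so with learning_rate=0.5 each update overshoots and the parameters diverge. The script that reproduces the problem (with the fix commented out):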

import tensorflow as tf
import numpy as np
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# Generate 2000 samples, each with 3 features
x_data = np.random.randn(2000, 3)
w_real = [0.3, 0.5, 0.2]
b_real = -0.2
noise = np.random.randn(2000, 1)
y_data = np.matmul(w_real, x_data.T) + b_real + noise
# TensorFlow placeholders and variables
x=tf.placeholder(tf.float32, shape=(None, 3))
y=tf.placeholder(tf.float32)
w=tf.Variable([[0,0,0]], dtype=tf.float32)
b=tf.Variable(0, dtype=tf.float32)
y_pred = tf.matmul(w, tf.transpose(x)) + b
loss = tf.square(y_pred-y)
# loss = tf.reduce_mean(loss)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)
train = optimizer.minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        _, w_np, b_np = sess.run([train, w, b], feed_dict={x: x_data, y: y_data})
        print(f'{i} {w_np} {b_np}')
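
The difference between a summed and an averaged loss can also be reproduced without TensorFlow. The sketch below is a simplified one-feature version with made-up, noise-free data (true w = 0.3, b = -0.2); `fit` and its `reduce` argument are hypothetical names. With 2000 samples and a learning rate of 0.5, the summed gradient is 2000 times larger and every step overshoots.

```python
def fit(xs, ys, lr, steps, reduce="mean"):
    """Gradient descent on squared error for y = w * x + b."""
    w, b = 0.0, 0.0
    # "mean" divides the summed gradient by the number of samples.
    scale = 1.0 / len(xs) if reduce == "mean" else 1.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = w * x + b - y
            grad_w += 2.0 * err * x
            grad_b += 2.0 * err
        w -= lr * scale * grad_w
        b -= lr * scale * grad_b
    return w, b


# Hypothetical data: 2000 points with x cycling over -1, -0.5, 0, 0.5, 1.
xs = [(i % 5 - 2) / 2.0 for i in range(2000)]
ys = [0.3 * x - 0.2 for x in xs]

w_mean, b_mean = fit(xs, ys, lr=0.5, steps=10, reduce="mean")
w_sum, b_sum = fit(xs, ys, lr=0.5, steps=10, reduce="sum")
print("mean:", w_mean, b_mean)  # approaches 0.3 and -0.2
print("sum: ", w_sum, b_sum)    # magnitudes explode, as in the log above
```

Lowering the learning rate by a factor of the sample count would have the same effect as averaging, but averaging keeps the learning rate independent of the batch size.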

Problems from earlier on

Traceback (most recent call last):
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 569, in <module>
    GnnTrainer(config).start()
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 346, in start
    self._run_epoch(train=True, model=model, dataloader=train_dataloader)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 368, in _run_epoch
    loss, cur_correct_prediction, cur_total_prediction, cur_total_unk = self._cal(data)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 447, in _cal
    pred = self.model(N, T, edge_index, edge_attr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 66, in forward
    h = self.gnn(gnn_input, edge_index.to(config.gpu2), edge_attr.to(config.gpu2)).to(config.gpu0)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/userfolder/code/cc/CodeCompletion/graph_encoder.py", line 84, in forward
    m = self.propagate(edge_index, x=m, edge_type=edge_type, layer=i)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
    out = self.message(**msg_kwargs)
  File "/root/userfolder/code/cc/CodeCompletion/graph_encoder.py", line 92, in message
    w = torch.index_select(edge_w, 0, edge_type)
RuntimeError: invalid argument 3: Index is supposed to be an empty tensor or a vector at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:415

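Probably a dimension problem: as the message says, the index passed to index_select has to be empty or a 1-D vector.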

Traceback (most recent call last):
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 596, in <module>
    GnnTrainer(config).start()
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 406, in start
    self._run_epoch(train=True, model=model, dataloader=train_dataloader)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 449, in _run_epoch
    loss.backward()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: index_put_ with accumulation is not supported on large tensors, number of source elements =2524650492file a support request on github

The data kept growing until it overflowed.

Traceback (most recent call last):
  File "CodeCompletionGNN.py", line 597, in <module>
    GnnTrainer(config).start()
  File "CodeCompletionGNN.py", line 406, in start
    self._run_epoch(train=True, model=model, dataloader=train_dataloader)
  File "CodeCompletionGNN.py", line 428, in _run_epoch
    loss, cur_correct_prediction, cur_total_prediction, cur_total_unk = self._cal(data)
  File "CodeCompletionGNN.py", line 508, in _cal
    pred = self.model(N, T, edge_index, edge_attr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "CodeCompletionGNN.py", line 125, in forward
    h = self.gnn(gnn_input, edge_index)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/gated_graph_conv.py", line 71, in forward
    m = self.propagate(edge_index, x=m, edge_weight=edge_weight)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 246, in propagate
    assert isinstance(size, list)
AssertionError
Target is multiclass but average='binary'. Please choose another average setting.

This is probably because the default settings differ between versions.

Traceback (most recent call last):
  File "/root/userfolder/code/hope/my_hin2vec.py", line 42, in <module>
    train(epoch)
  File "/root/userfolder/code/hope/my_hin2vec.py", line 21, in train
    loss = model.loss(data.to(device))
  File "/root/userfolder/code/hope/my_model.py", line 367, in loss
    loss = self.loss_fn(out, data[:, -1])
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 498, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 2077, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'target' in call to _thnn_binary_cross_entropy_forward
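A minimal sketch of one way around the error above, assuming the label column is stored as integers inside the same tensor as the features. The data and out values are invented; the point is only that the target passed to the binary cross-entropy loss must be a float tensor.

```python
import torch
import torch.nn as nn

loss_fn = nn.BCELoss()

# Hypothetical batch: three feature columns plus a 0/1 label in the last
# column, all stored as integers (dtype torch.long).
data = torch.tensor([[3, 1, 4, 1],
                     [1, 5, 9, 0],
                     [2, 6, 5, 1]])

# Stand-in for the model's predicted probabilities in (0, 1).
out = torch.tensor([0.9, 0.2, 0.7])

# Cast the label column to float so it matches the dtype of `out`.
target = data[:, -1].float()
loss = loss_fn(out, target)
print(loss.item())
```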


Traceback (most recent call last):
  File "train_inst2vec.py", line 74, in <module>
    app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train_inst2vec.py", line 54, in main
    data_folders = i2v_prep.construct_xfg(data_folder)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 2955, in construct_xfg
    pool.map(_partial_func, enumerate(folders_raw), chunksize=1)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
AttributeError: 'MultiDiGraph' object has no attribute 'node'

The networkx API changed: node -> nodes
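A small sketch of the rename, using a made-up graph. The node names and the label attribute are illustrative and do not come from the inst2vec code.

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("a", label="entry")
G.add_edge("a", "b")

# Old attribute access, which fails on newer networkx releases:
#     G.node["a"]["label"]
# Replacement that uses the current API:
label = G.nodes["a"]["label"]
print(label)
```

In code that cannot be edited, pinning an older networkx release is the other option, at the cost of staying on the old API.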

Traceback (most recent call last):
  File "/root/userfolder/code/ncc/train_inst2vec.py", line 74, in <module>
    app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/root/userfolder/code/ncc/train_inst2vec.py", line 57, in main
    i2v_vocab.construct_vocabulary(data_folder, data_folders)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_vocabulary.py", line 634, in construct_vocabulary
    H_dic = build_H_dictionary(D, context_width, folder_mat, base_filename, dictionary, stmts_cut_off)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_vocabulary.py", line 333, in build_H_dictionary
    A1 = nx.adjacency_matrix(D)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/networkx/linalg/graphmatrix.py", line 163, in adjacency_matrix
    return nx.to_scipy_sparse_matrix(G, nodelist=nodelist, weight=weight)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/networkx/convert_matrix.py", line 775, in to_scipy_sparse_matrix
    raise nx.NetworkXError("Graph has no nodes or edges")
networkx.exception.NetworkXError: Graph has no nodes or edges

Installation

    ERROR: Command errored out with exit status 1:
     command: /root/.pyenv/versions/3.6.8/bin/python3.6 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-tjiu83h7/mysqlclient/setup.py'"'"'; __file__='"'"'/tmp/pip-install-tjiu83h7/mysqlclient/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-or8qo7_5
         cwd: /tmp/pip-install-tjiu83h7/mysqlclient/
    Complete output (10 lines):
    /bin/sh: 1: mysql_config: not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-tjiu83h7/mysqlclient/setup.py", line 16, in <module>
        metadata, options = get_config()
      File "/tmp/pip-install-tjiu83h7/mysqlclient/setup_posix.py", line 51, in get_config
        libs = mysql_config("libs")
      File "/tmp/pip-install-tjiu83h7/mysqlclient/setup_posix.py", line 29, in mysql_config
        raise EnvironmentError("%s not found" % (_mysql_config_path,))
    OSError: mysql_config not found
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Fix: mysql_config is missing, so install the MySQL client development package:
    sudo apt-get install libmysqlclient-dev

WARNING:tensorflow:From pointer_parent.py:422: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
Traceback (most recent call last):
  File "pointer_parent.py", line 458, in <module>
    tf.app.run()
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "pointer_parent.py", line 422, in main
    sv = tf.train.Supervisor(logdir=None, summary_op=None)
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 326, in __init__
    graph_def=graph.as_graph_def(add_shapes=True),
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3005, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2963, in _as_graph_def
    c_api.TF_GraphToGraphDef(self._c_graph, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2625933684bytes) would be larger than the limit (2147483647 bytes)
functools.partial(func[,*args][, **kwargs])

Pre-fills some of a function's arguments, returning a new callable with those values fixed.
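A short illustration of functools.partial. The scale function is invented for this example; the pattern is the same one used when a pre-filled function is handed to map or pool.map.

```python
from functools import partial


def scale(x, factor, offset=0):
    """Multiply x by factor and add offset."""
    return x * factor + offset


# Fix `factor`; the resulting callable only needs `x`.
double = partial(scale, factor=2)

# Fix both keyword arguments at once.
double_plus_one = partial(scale, factor=2, offset=1)

print(double(5))
print(list(map(double_plus_one, [1, 2, 3])))
```

Arguments fixed by keyword can still be overridden at call time, for example double(5, factor=3).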

config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)

When allow_growth is set to True, the allocator does not reserve all GPU memory up front; it grows the allocation as needed.

tf.expand_dims  # adds a dimension
# 't' is a tensor of shape [2]
tf.shape(tf.expand_dims(t, 0))  # [1, 2]
# tf.squeeze removes all dimensions of size 1
# 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
tf.shape(tf.squeeze(t))  # [2, 3]

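torch.arange(N)
# produces 0 to N-1, e.g. tensor([0, 1, 2, 3]) for N = 4
torch.stack
# adds a new dimension
torch.cat
# the size of an existing dimension grows
# * multiplies element by element (with broadcasting)
(A, 1, B) * (A, n, B) -> (A, n, B)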
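The shape rules above can be checked with a quick sketch; the sizes below are arbitrary.

```python
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)

# stack adds a new dimension: two (2, 3) tensors become (2, 2, 3).
stacked = torch.stack([a, b], dim=0)

# cat grows an existing dimension: two (2, 3) tensors become (4, 3).
joined = torch.cat([a, b], dim=0)

# Element-wise * with broadcasting: (A, 1, B) * (A, n, B) -> (A, n, B).
x = torch.rand(4, 1, 5)
y = torch.rand(4, 7, 5)
z = x * y

print(stacked.shape, joined.shape, z.shape)
```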
AttributeError: 'SparseTensor' object has no attribute 'sample'

This requires PyTorch 1.5.0.

from typing import Optional

import torch
from torch_sparse import SparseTensor


def sample(src: SparseTensor, num_neighbors: int,
           subset: Optional[torch.Tensor] = None) -> torch.Tensor:

    rowptr, col, _ = src.csr()
    rowcount = src.storage.rowcount()

    if subset is not None:
        rowcount = rowcount[subset]
        rowptr = rowptr[subset]

    rand = torch.rand((rowcount.size(0), num_neighbors), device=col.device)
    rand.mul_(rowcount.to(rand.dtype).view(-1, 1))
    rand = rand.to(torch.long)
    rand.add_(rowptr.view(-1, 1))

    return col[rand]

Searching turns up nothing for this one:

torch.ops.torch_sparse.ind2ptr

https://www.cnblogs.com/xbinworld/p/4273506.html

UnicodeEncodeError: 'ascii' codec can't encode character '\xa7' in position 29: ordinal not in range(128)

Add the environment variable PYTHONIOENCODING=utf-8 at runtime.
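A small sketch of why this helps, using only the standard library. It launches a child interpreter so the variable is in place before Python starts, which is when the standard streams pick their encoding; the child command is a made-up one-liner.

```python
import os
import subprocess
import sys

text = "\xa7"  # the character from the error message

# Encoding to ASCII fails, which is what happens when stdout uses ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Run a child interpreter with PYTHONIOENCODING=utf-8 so print() can emit it.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
result = subprocess.run(
    [sys.executable, "-c", "print('\\xa7')"],
    env=env,
    stdout=subprocess.PIPE,
)
output = result.stdout.decode("utf-8").strip()
print(ascii_ok, result.returncode)
```

From a shell, the equivalent is to prefix the command, for example PYTHONIOENCODING=utf-8 python script.py, where script.py stands for whatever is being run.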
