Errors I ran into while writing TensorFlow and PyTorch code, and how I fixed them (when I managed to fix them at the time)

Pinned post by 大黄老鼠

Tags: pytorch, deep learning, python

Article link: https://blog.csdn.net/qq_32768743/article/details/107416901

2022

RuntimeError('cuDNN error: CUDNN_STATUS_EXECUTION_FAILED',)
In an NLP task, one token id in the data was larger than the input size of the embedding layer.

Traceback (most recent call last):
  File "/tmp/pycharm_project_327/main.py", line 946, in train
    outputs = model(inputs, labels=labels)
  File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/pycharm_project_327/main.py", line 359, in forward
    lstm_outputs, _ = self.lstm(inputs_embeds)
  File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 692, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
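The device-side assertion `srcIndex < srcSelectDimSize` is how an out-of-range lookup index shows up on the GPU. A minimal sketch of a check to run before training, assuming the token ids are available as plain nested Python lists (for example via `tensor.tolist()`); the function name `find_bad_ids` and the sample batch are hypothetical:

```python
def find_bad_ids(batch, num_embeddings):
    """Return (row, col, token_id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, seq in enumerate(batch):
        for col, token_id in enumerate(seq):
            if not 0 <= token_id < num_embeddings:
                bad.append((row, col, token_id))
    return bad


# Hypothetical batch: the vocabulary has 10 entries, so valid ids are 0..9.
batch = [[1, 4, 9], [3, 10, 2]]
print(find_bad_ids(batch, num_embeddings=10))  # [(1, 1, 10)]
```

Running a check like this over the dataset once is usually cheaper than decoding the CUDA assertion after the fact.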

January

Fatal Python error: Segmentation fault

Please switch to tf.train.MonitoredTrainingSession
Fatal Python error: Segmentation fault

Current thread 0x00007fe0a5162700 (most recent call first):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166 in _as_graph_def
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238 in as_graph_def
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 323 in __init__
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324 in new_func
  File "attention.py", line 439 in main
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 303 in run
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "attention.py", line 475 in <module>
[1]    5145 segmentation fault (core dumped)  python attention.py --save_path lstm --gpu 1

I don't know what the problem is; I'll try a smaller batch_size.

Please switch to tf.train.MonitoredTrainingSession
Traceback (most recent call last):
  File "attention.py", line 475, in <module>
    tf.app.run()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "attention.py", line 439, in main
    sv = tf.train.Supervisor(logdir=None, summary_op=None)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 323, in __init__
    graph_def=graph.as_graph_def(add_shapes=True),
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166, in _as_graph_def
    graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message

After changing some parameters:

2021-01-16 19:33:38.647179: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.648679: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.649929: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.718605: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.718984: F tensorflow/core/common_runtime/gpu/gpu_util.cc:293] GPU->CPU Memcpy failed
Fatal Python error: Aborted

Thread 0x00007ef9d4ff9700 (most recent call first):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 295 in wait
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 551 in wait
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 293 in _close_on_stop
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 864 in run
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 884 in _bootstrap

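In the end this was most likely still a case of not enough memory, or someone else was running a job at the same time and we were competing for the GPU.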

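December

TensorFlow: Can't parse serialized Example.
I had built a TFRecord dataset and then could not read it back; reading it raised this error.
The cause turned out to be that I had specified the wrong dimensions.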


path_feature_description = {
    'path': tf.io.FixedLenFeature((50 * 5 * 10,), tf.int64),
}

The (50 * 5 * 10,) here was my mistake, and it took a long time to track down.
While debugging, I could use

tf.train.Example.FromString(example)

to read the data without any problem, yet parsing still failed. Once I corrected the dimensions, parsing worked.

For reference:
https://stackoverflow.com/questions/45427637/numpy-to-tfrecords-is-there-a-more-simple-way-to-handle-batch-inputs-from-tfrec/45428167#45428167
It explains how to store NumPy arrays in a TFRecord; the key API is flatten().
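Since the root cause was a mismatch between the declared shape and the number of values actually written, a small check at write time would have caught it early. A minimal sketch in plain Python, assuming the array has already been flattened into a list; `expected_length` and `check_feature` are hypothetical helper names, not TensorFlow APIs:

```python
from functools import reduce
from operator import mul


def expected_length(shape):
    """Number of scalars a fixed-length feature of this shape holds."""
    return reduce(mul, shape, 1)


def check_feature(values, shape):
    """Raise a readable error if the value count does not match the declared shape."""
    n = expected_length(shape)
    if len(values) != n:
        raise ValueError(
            f"feature has {len(values)} values but shape {shape} expects {n}"
        )


# Hypothetical record: a path array that was flattened before writing.
flat_path = [0] * (50 * 5 * 10)
check_feature(flat_path, (50 * 5 * 10,))  # passes silently
```

Calling the same check with the shape used in the parsing description keeps the writer and the reader in agreement.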

November

November 17

ModuleNotFoundError: No module named ‘_lzma’

Traceback (most recent call last):
  File "/tmp/code/ct/pyl.py", line 6, in <module>
    from torchvision.datasets import MNIST
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/__init__.py", line 7, in <module>
    from torchvision import datasets
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/datasets/__init__.py", line 6, in <module>
    from .mnist import MNIST, EMNIST, FashionMNIST, KMNIST, QMNIST
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 11, in <module>
    import lzma
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/lzma.py", line 27, in <module>
    from _lzma import *
ModuleNotFoundError: No module named '_lzma'

Solution:
https://github.com/JaidedAI/EasyOCR/issues/84

October

October 31

NaN problem
This happened because my log computation ended up as log(-1e7) when it should have been log(1e-7).
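One way to make this kind of slip harder is to route every log through a helper that clamps its argument to a small positive floor. A minimal sketch using only the standard library; `safe_log` and `EPS` are hypothetical names, and the floor simply reuses the 1e-7 from the note above. Python's `math.log` raises `ValueError` on a negative argument, whereas array libraries such as NumPy return nan, which is why the bug surfaced as nan during training:

```python
import math

EPS = 1e-7  # small positive floor; mirrors the 1e-7 mentioned above


def safe_log(p, eps=EPS):
    """Natural log with the argument clamped to at least eps."""
    return math.log(max(p, eps))


print(safe_log(0.5))
print(safe_log(0.0))  # same value as math.log(1e-7), no domain error
```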

October 30

RuntimeError: implement_array_function method already has a docstring
This was because I had written a file named copy.py, which presumably conflicted with the standard-library copy module.
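A quick way to spot this kind of name clash is to compare the project's file names against the standard-library module names. A minimal sketch, assuming Python 3.10+ for `sys.stdlib_module_names` (with a small, incomplete fallback set otherwise); `find_shadowing` and the file list are hypothetical:

```python
import sys

# sys.stdlib_module_names exists on Python 3.10+; otherwise fall back to a
# small hand-picked set (an assumption, far from complete).
STDLIB_NAMES = set(
    getattr(
        sys,
        "stdlib_module_names",
        {"copy", "random", "math", "json", "logging", "types"},
    )
)


def find_shadowing(filenames):
    """Return the .py files whose module name collides with a stdlib module."""
    return sorted(
        f for f in filenames if f.endswith(".py") and f[:-3] in STDLIB_NAMES
    )


# Hypothetical project listing.
print(find_shadowing(["copy.py", "my_lstm_model.py", "random.py"]))
# ['copy.py', 'random.py']
```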

October 25

Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2625849871bytes) would be larger than the limit (2147483647 bytes)
Unsolved.

October 24

Found Inf or NaN global norm. : Tensor had NaN values

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
	 [[{{node VerifyFinite/CheckNumerics}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/code/cc1020/lstm.py", line 846, in <module>
    tf.app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tmp/code/cc1020/lstm.py", line 837, in main
    run_epoch(epoch, sess, train_data)
  File "/tmp/code/cc1020/lstm.py", line 799, in run_epoch
    fetched = sess.run(fetches, feed_dict)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
	 [[node VerifyFinite/CheckNumerics (defined at /tmp/code/cc1020/lstm.py:707) ]]

Caused by op 'VerifyFinite/CheckNumerics', defined at:
  File "/tmp/code/cc1020/lstm.py", line 846, in <module>
    tf.app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tmp/code/cc1020/lstm.py", line 707, in main
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), FLAGS.max_grad_norm)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
	 [[node VerifyFinite/CheckNumerics (defined at /tmp/code/cc1020/lstm.py:707) ]]

This happened after I changed some special labels to -1, which produced this baffling error.

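My debugging approach: run on the CPU first, because the CPU error messages are more specific. Once the problem is fixed on the CPU, run on the GPU again.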
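Forcing a CPU run does not require code changes in the model; hiding the GPUs through an environment variable is enough. A minimal sketch, assuming the variable is set before the framework initializes CUDA (doing it before the import is the simplest way to guarantee that):

```python
import os

# Hide all GPUs from CUDA so the framework falls back to the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Optional, for later GPU runs: make CUDA kernel launches synchronous so the
# Python traceback points at the call that actually failed.
# os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import tensorflow as tf   # import the framework only after the lines above
```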

October 20

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED

2020-10-20 12:36:29.138645: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cudnn_rnn_ops.cc:1217 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Train loss:       , accuracy      Test loss:      , accuracy        :   0%| | 0/
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
	 [[{{node rnn/cudnn_lstm/CudnnRNN}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/tf_learn/mnist_culstm_debug.py", line 60, in <module>
    _, loss_np, accuracy_np = sess.run([train, loss, accuracy], feed_dict={_inputs: batch_x, y: batch_y})
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
	 [[node rnn/cudnn_lstm/CudnnRNN (defined at /tmp/tf_learn/mnist_culstm_debug.py:34) ]]

Caused by op 'rnn/cudnn_lstm/CudnnRNN', defined at:
  File "/tmp/tf_learn/mnist_culstm_debug.py", line 34, in <module>
    outputs, _ = lstm(rnn_input, training=is_training)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 414, in call
    training)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 489, in _forward
    seed=self._seed)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1014, in _cudnn_rnn
    outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 142, in cudnn_rnn
    seed2=seed2, is_training=is_training, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
	 [[node rnn/cudnn_lstm/CudnnRNN (defined at /tmp/tf_learn/mnist_culstm_debug.py:34) ]]

I don't know how to fix this properly. Dropout simply cannot be used here; removing dropout made the error go away.

ValueError: could not convert string to float: ‘Five One Nine Five Three PAD’

Traceback (most recent call last):
  File "/tmp/tf_learn/text_rnn_example.py", line 129, in <module>
    _seqlens: seqlen_test
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'Five One Nine Five Three PAD'

This was another data error: I had passed the input data as the labels.

ValueError: setting an array element with a sequence.

TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tf_learn/text_rnn_example.py", line 115, in <module>
    _seqlens: seqlen_batch
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

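This was actually caused by a padding mistake: the sequences ended up with different lengths, so they could not be converted into a NumPy array.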
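A minimal padding sketch in plain Python that guarantees every sequence has the same length before it is handed to `np.asarray` or a `feed_dict`; `pad_batch`, the pad id 0, and the sample data are hypothetical choices for illustration:

```python
def pad_batch(sequences, pad_id, max_len):
    """Truncate or right-pad every sequence to exactly max_len items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]
        padded.append(seq + [pad_id] * (max_len - len(seq)))
    return padded


batch = pad_batch([[5, 1, 9], [5], [3, 3, 3, 3, 3, 3, 3]], pad_id=0, max_len=6)
assert len({len(seq) for seq in batch}) == 1  # every row has the same length
print(batch)
# [[5, 1, 9, 0, 0, 0], [5, 0, 0, 0, 0, 0], [3, 3, 3, 3, 3, 3]]
```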

A warning at runtime

/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

A related PR I found on GitHub:
https://github.com/tensorflow/tensorboard/pull/2482

October 19

Initialization problem: very poor accuracy
For the final MLP layer, initializing the weights to 0 gave very poor results.

    # w = tf.Variable(tf.zeros([hidden_layer_size, num_classes]), dtype=tf.float32)
    w = tf.Variable(tf.truncated_normal([hidden_layer_size, num_classes], mean=0, stddev=0.01), dtype=tf.float32)
    # _b = tf.Variable(tf.zeros([num_classes]), dtype=tf.float32)
    b = tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=0.01), dtype=tf.float32)

Shape problem

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,128] vs. [128,128]
	 [[{{node rnn/scan/while/add_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 51, in <module>
    _, loss_np, accuracy_np = sess.run([train, loss, accuracy], feed_dict={_inputs:batch_x,y:batch_y})
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,128] vs. [128,128]
	 [[node rnn/scan/while/add_1 (defined at /tmp/tf_learn/mnist_rnn_example_rewrite.py:28) ]]

Caused by op 'rnn/scan/while/add_1', defined at:
  File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 31, in <module>
    output = tf.scan(run_rnn, rnn_input, tf.zeros([batch_size, hidden_layer_size]))[-1]
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 724, in scan
    maximum_iterations=n)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 703, in compute
    a_out = fn(packed_a, packed_elems)
  File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 28, in run_rnn
    + b
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 812, in binary_op_wrapper
    return func(x, y, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 365, in add
    "Add", x=x, y=y, name=name)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [32,128] vs. [128,128]
	 [[node rnn/scan/while/add_1 (defined at /tmp/tf_learn/mnist_rnn_example_rewrite.py:28) ]]


Process finished with exit code 1

The problem here was that I had set batch_size incorrectly; fixing it resolved the error.

Trained parameters turn into nan

0 [[1221726.2  2054588.9   783324.44]] -961194.375
1 [[-4.9709313e+12 -8.4502328e+12 -3.0732030e+12]] 4141717651456.0
2 [[2.0209081e+19 3.4784079e+19 1.2077052e+19]] -1.7788167394997305e+19
3 [[-8.2097530e+25 -1.4329765e+26 -4.7536997e+25]] 7.618131755874552e+25
4 [[3.3328387e+32 5.9077980e+32 1.8740524e+32]] -3.254525020468151e+32
5 [[-inf -inf -inf]] inf
6 [[nan nan nan]] nan
7 [[nan nan nan]] nan
8 [[nan nan nan]] nan
9 [[nan nan nan]] nan

Taking the mean of the loss finally fixed it: loss = tf.reduce_mean(loss)

import tensorflow as tf
import numpy as np
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
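# Generate 2000 samples with 3 features each
x_data = np.random.randn(2000, 3)
w_real = [0.3, 0.5, 0.2]
b_real = -0.2
# Note: noise has shape (2000, 1) while the matmul result has shape (2000,),
# so the sum broadcasts and y_data ends up with shape (2000, 2000)
noise = np.random.randn(2000, 1)
y_data = np.matmul(w_real, x_data.T) + b_real + noise
# TensorFlow placeholders and variables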
x=tf.placeholder(tf.float32, shape=(None, 3))
y=tf.placeholder(tf.float32)
w=tf.Variable([[0,0,0]], dtype=tf.float32)
b=tf.Variable(0, dtype=tf.float32)
y_pred = tf.matmul(w, tf.transpose(x)) + b
loss = tf.square(y_pred-y)
# loss = tf.reduce_mean(loss)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)
train = optimizer.minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        _, w_np, b_np = sess.run([train, w, b], feed_dict={x: x_data, y: y_data})
        print(f'{i} {w_np} {b_np}')
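Why the mean helps: the gradient of a summed loss is N times the gradient of the mean loss, so with thousands of samples a learning rate of 0.5 acts like a far larger step and every update overshoots. A minimal pure-Python sketch of the same effect on a 1-D linear model; the data, the function name `fit`, and the step count are illustrative assumptions, not the script above:

```python
def fit(xs, ys, lr, steps, average):
    """Plain gradient descent on y = w * x with squared error."""
    w = 0.0
    for _ in range(steps):
        # Gradient of sum_i (w * x_i - y_i)^2 with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))
        if average:
            grad /= len(xs)  # mean loss: gradient no longer grows with N
        w -= lr * grad
    return w


# Synthetic, noise-free data with true weight 0.3 (values chosen for illustration).
xs = [((i % 20) - 10) / 5.0 for i in range(2000)]
ys = [0.3 * x for x in xs]

print(fit(xs, ys, lr=0.5, steps=100, average=True))   # settles near 0.3
print(fit(xs, ys, lr=0.5, steps=100, average=False))  # overflows; not a finite number
```

Lowering the learning rate by a factor of N while keeping the summed loss would have the same effect; averaging is simply the more common convention because the step size then stays independent of the batch size.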

Problems from earlier on

Traceback (most recent call last):
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 569, in <module>
    GnnTrainer(config).start()
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 346, in start
    self._run_epoch(train=True, model=model, dataloader=train_dataloader)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 368, in _run_epoch
    loss, cur_correct_prediction, cur_total_prediction, cur_total_unk = self._cal(data)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 447, in _cal
    pred = self.model(N, T, edge_index, edge_attr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 66, in forward
    h = self.gnn(gnn_input, edge_index.to(config.gpu2), edge_attr.to(config.gpu2)).to(config.gpu0)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/userfolder/code/cc/CodeCompletion/graph_encoder.py", line 84, in forward
    m = self.propagate(edge_index, x=m, edge_type=edge_type, layer=i)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
    out = self.message(**msg_kwargs)
  File "/root/userfolder/code/cc/CodeCompletion/graph_encoder.py", line 92, in message
    w = torch.index_select(edge_w, 0, edge_type)
RuntimeError: invalid argument 3: Index is supposed to be an empty tensor or a vector at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:415

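The dimensions are probably wrong: according to the message, the index passed to index_select must be empty or a vector.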

Traceback (most recent call last):
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 596, in <module>
    GnnTrainer(config).start()
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 406, in start
    self._run_epoch(train=True, model=model, dataloader=train_dataloader)
  File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 449, in _run_epoch
    loss.backward()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: index_put_ with accumulation is not supported on large tensors, number of source elements =2524650492file a support request on github

The data kept growing until it overflowed.

Traceback (most recent call last):
  File "CodeCompletionGNN.py", line 597, in <module>
    GnnTrainer(config).start()
  File "CodeCompletionGNN.py", line 406, in start
    self._run_epoch(train=True, model=model, dataloader=train_dataloader)
  File "CodeCompletionGNN.py", line 428, in _run_epoch
    loss, cur_correct_prediction, cur_total_prediction, cur_total_unk = self._cal(data)
  File "CodeCompletionGNN.py", line 508, in _cal
    pred = self.model(N, T, edge_index, edge_attr)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "CodeCompletionGNN.py", line 125, in forward
    h = self.gnn(gnn_input, edge_index)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/gated_graph_conv.py", line 71, in forward
    m = self.propagate(edge_index, x=m, edge_weight=edge_weight)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 246, in propagate
    assert isinstance(size, list)
AssertionError

Target is multiclass but average='binary'. Please choose another average setting.

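This is probably because the default setting differs between versions. Passing the average argument explicitly avoids relying on the default.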
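The message matches the one raised by scikit-learn's metric functions when the labels have more than two classes and average is left at "binary". A minimal sketch, assuming the metric was f1_score (the original call is not shown) and using made-up labels:

```python
from sklearn.metrics import f1_score

# Hypothetical multiclass labels with three classes.
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]

# average="binary" only works for two classes.
# "macro" averages the per-class F1 scores with equal weight;
# "micro" pools all decisions before computing the score.
macro_f1 = f1_score(y_true, y_pred, average="macro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
```

The same average argument applies to precision_score and recall_score.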

Traceback (most recent call last):
  File "/root/userfolder/code/hope/my_hin2vec.py", line 42, in <module>
    train(epoch)
  File "/root/userfolder/code/hope/my_hin2vec.py", line 21, in train
    loss = model.loss(data.to(device))
  File "/root/userfolder/code/hope/my_model.py", line 367, in loss
    loss = self.loss_fn(out, data[:, -1])
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 498, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 2077, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'target' in call to _thnn_binary_cross_entropy_forward
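The message says that the target passed to binary_cross_entropy is a Long tensor while a Float tensor is required. A minimal sketch of the cast, assuming the label sits in the last column of an integer tensor as in data[:, -1] from the traceback; the values themselves are made up:

```python
import torch
import torch.nn as nn

loss_fn = nn.BCELoss()

# Hypothetical predictions, already squashed into (0, 1).
out = torch.tensor([0.9, 0.2, 0.7])
# Hypothetical integer batch; the last column holds the 0/1 label.
data = torch.tensor([[3, 5, 1], [4, 6, 0], [7, 8, 1]])

# Cast the Long labels to Float so they match the dtype of the predictions.
target = data[:, -1].float()
loss = loss_fn(out, target)
```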


Traceback (most recent call last):
  File "train_inst2vec.py", line 74, in <module>
    app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train_inst2vec.py", line 54, in main
    data_folders = i2v_prep.construct_xfg(data_folder)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 2955, in construct_xfg
    pool.map(_partial_func, enumerate(folders_raw), chunksize=1)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
AttributeError: 'MultiDiGraph' object has no attribute 'node'

The networkx API changed: the node attribute was replaced by nodes.
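A small sketch of the rename, using a made-up graph. The node names and the label attribute are hypothetical; only G.nodes is the point here.

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("a", label="entry")  # hypothetical node and attribute
G.add_edge("a", "b")

# Old style, which raises AttributeError on newer networkx:
#     G.node["a"]["label"]
# New style: go through the `nodes` view instead.
label = G.nodes["a"]["label"]
node_ids = sorted(G.nodes)
```

In third-party code such as the preprocessing script in the traceback, the alternative to patching every G.node occurrence is pinning an older networkx release that still provides it.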

Traceback (most recent call last):
  File "/root/userfolder/code/ncc/train_inst2vec.py", line 74, in <module>
    app.run(main)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/root/userfolder/code/ncc/train_inst2vec.py", line 57, in main
    i2v_vocab.construct_vocabulary(data_folder, data_folders)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_vocabulary.py", line 634, in construct_vocabulary
    H_dic = build_H_dictionary(D, context_width, folder_mat, base_filename, dictionary, stmts_cut_off)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_vocabulary.py", line 333, in build_H_dictionary
    A1 = nx.adjacency_matrix(D)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/networkx/linalg/graphmatrix.py", line 163, in adjacency_matrix
    return nx.to_scipy_sparse_matrix(G, nodelist=nodelist, weight=weight)
  File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/networkx/convert_matrix.py", line 775, in to_scipy_sparse_matrix
    raise nx.NetworkXError("Graph has no nodes or edges")
networkx.exception.NetworkXError: Graph has no nodes or edges

Installation

    ERROR: Command errored out with exit status 1:
     command: /root/.pyenv/versions/3.6.8/bin/python3.6 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-tjiu83h7/mysqlclient/setup.py'"'"'; __file__='"'"'/tmp/pip-install-tjiu83h7/mysqlclient/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-or8qo7_5
         cwd: /tmp/pip-install-tjiu83h7/mysqlclient/
    Complete output (10 lines):
    /bin/sh: 1: mysql_config: not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-tjiu83h7/mysqlclient/setup.py", line 16, in <module>
        metadata, options = get_config()
      File "/tmp/pip-install-tjiu83h7/mysqlclient/setup_posix.py", line 51, in get_config
        libs = mysql_config("libs")
      File "/tmp/pip-install-tjiu83h7/mysqlclient/setup_posix.py", line 29, in mysql_config
        raise EnvironmentError("%s not found" % (_mysql_config_path,))
    OSError: mysql_config not found
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Fix: install the MySQL client development package, which provides mysql_config:

 sudo apt-get install libmysqlclient-dev


WARNING:tensorflow:From pointer_parent.py:422: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
Traceback (most recent call last):
  File "pointer_parent.py", line 458, in <module>
    tf.app.run()
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "pointer_parent.py", line 422, in main
    sv = tf.train.Supervisor(logdir=None, summary_op=None)
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 326, in __init__
    graph_def=graph.as_graph_def(add_shapes=True),
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3005, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2963, in _as_graph_def
    c_api.TF_GraphToGraphDef(self._c_graph, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2625933684bytes) would be larger than the limit (2147483647 bytes)

functools.partial(func[, *args][, **kwargs])

Pre-binds some of a function's arguments, so that they behave like default values.
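A minimal sketch of functools.partial; scale and times_ten are hypothetical names used only for illustration.

```python
from functools import partial


def scale(value, factor, offset=0):
    """Return value * factor + offset."""
    return value * factor + offset


# Bind factor=10 ahead of time; the result only needs `value`.
times_ten = partial(scale, factor=10)

results = [times_ten(v) for v in (1, 2, 3)]

# Keyword arguments bound by partial can still be overridden at call time.
overridden = times_ten(1, factor=2)
```

This is the same pattern as the pool.map(_partial_func, ...) call in the traceback above: pool.map passes a single argument per item, so any extra arguments have to be bound beforehand.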

import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

When allow_growth is set to True, the allocator does not reserve all GPU memory up front; it grows the allocation as needed.

tf.expand_dims  # adds a dimension
# 't' is a tensor of shape [2]
tf.shape(tf.expand_dims(t, 0))  # [1, 2]

# tf.squeeze removes all dimensions of size 1
# 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
tf.shape(tf.squeeze(t))  # [2, 3]

torch.arange(N)
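# produces 0 to N-1, e.g. for N = 4: tensor([0, 1, 2, 3])
torch.stack
# adds a new dimension
torch.cat
# the size of an existing dimension grows
# * multiplies element-wise (with broadcasting)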
(A, 1, B) * (A, n, B) -> (A, n, B)
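A short sketch that checks the shape notes above for torch.stack, torch.cat and broadcasting with *. The sizes are arbitrary.

```python
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)

# stack adds a new dimension: two (2, 3) tensors become (2, 2, 3).
stacked = torch.stack([a, b], dim=0)
# cat grows an existing dimension: two (2, 3) tensors become (4, 3).
joined = torch.cat([a, b], dim=0)

# Broadcasting: (A, 1, B) * (A, n, B) -> (A, n, B), here A=2, n=4, B=3.
x = torch.ones(2, 1, 3)
y = torch.full((2, 4, 3), 2.0)
prod = x * y
```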

AttributeError: 'SparseTensor' object has no attribute 'sample'

This requires PyTorch 1.5.0.

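from typing import Optional

import torch
from torch_sparse import SparseTensor


def sample(src: SparseTensor, num_neighbors: int,
           subset: Optional[torch.Tensor] = None) -> torch.Tensor:
    # CSR layout: rowptr holds each row's start offset into col.
    rowptr, col, _ = src.csr()
    rowcount = src.storage.rowcount()  # number of stored entries per row

    if subset is not None:
        rowcount = rowcount[subset]
        rowptr = rowptr[subset]
    else:
        # Drop the trailing end offset so rowptr lines up with rowcount.
        rowptr = rowptr[:-1]

    # Draw num_neighbors uniform offsets in [0, rowcount) for every row.
    rand = torch.rand((rowcount.size(0), num_neighbors), device=col.device)
    rand.mul_(rowcount.to(rand.dtype).view(-1, 1))
    rand = rand.to(torch.long)
    # Shift by each row's start to get positions in col.
    rand.add_(rowptr.view(-1, 1))

    return col[rand]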

Searching turns up nothing for this one:

torch.ops.torch_sparse.ind2ptr

https://www.cnblogs.com/xbinworld/p/4273506.html

UnicodeEncodeError: 'ascii' codec can't encode character '\xa7' in position 29: ordinal not in range(128)

Set the environment variable PYTHONIOENCODING=utf-8 at run time.
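A minimal sketch that reproduces the effect of PYTHONIOENCODING by running a child interpreter twice. The helper run_child and the one-line child program are made up for illustration.

```python
import os
import subprocess
import sys

# Child program that prints U+00A7, the character named in the error.
child_code = "print('\\xa7')"


def run_child(io_encoding):
    """Run the child with PYTHONIOENCODING set; return (returncode, stdout bytes)."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.run([sys.executable, "-c", child_code], env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, proc.stdout


# With an ASCII stdout the child fails with UnicodeEncodeError.
ascii_code, _ = run_child("ascii")
# With UTF-8 the same program succeeds.
utf8_code, utf8_out = run_child("utf-8")
```

On the command line the equivalent is prefixing the command, for example PYTHONIOENCODING=utf-8 python script.py, where script.py stands for whatever script raised the error.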