2022
RuntimeError('cuDNN error: CUDNN_STATUS_EXECUTION_FAILED',)
In the NLP data, one token id exceeded the embedding layer's input size.
Traceback (most recent call last):
File "/tmp/pycharm_project_327/main.py", line 946, in train
outputs = model(inputs, labels=labels)
File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/tmp/pycharm_project_327/main.py", line 359, in forward
lstm_outputs, _ = self.lstm(inputs_embeds)
File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/roberta/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 692, in forward
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272155627/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [329,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
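The device-side assertion `srcIndex < srcSelectDimSize` fits the note above: an index passed to the embedding lookup was not smaller than the table size. A minimal sketch of a check that could run on the CPU before the forward pass; the helper name `find_out_of_range_ids` and the sample numbers are hypothetical, not taken from the original main.py.

```python
def find_out_of_range_ids(batch, num_embeddings):
    """Return (row, col, id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            if not 0 <= token_id < num_embeddings:
                bad.append((row, col, token_id))
    return bad


# Hypothetical example: an embedding table with 10 rows and one id (10) that is too large.
batch = [[1, 2, 3], [4, 10, 0]]
problems = find_out_of_range_ids(batch, num_embeddings=10)
print(problems)
```

With PyTorch tensors the same idea should work by passing `input_ids.tolist()` to the helper, or by comparing `input_ids.max()` against `embedding.num_embeddings`.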
January
- Fatal Python error: Segmentation fault
Please switch to tf.train.MonitoredTrainingSession
Fatal Python error: Segmentation fault
Current thread 0x00007fe0a5162700 (most recent call first):
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166 in _as_graph_def
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238 in as_graph_def
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 323 in __init__
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324 in new_func
File "attention.py", line 439 in main
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 303 in run
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
File "attention.py", line 475 in <module>
[1] 5145 segmentation fault (core dumped) python attention.py --save_path lstm --gpu 1
Not sure what the problem is; trying a smaller batch_size.
Please switch to tf.train.MonitoredTrainingSession
Traceback (most recent call last):
File "attention.py", line 475, in <module>
tf.app.run()
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "attention.py", line 439, in main
sv = tf.train.Supervisor(logdir=None, summary_op=None)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 323, in __init__
graph_def=graph.as_graph_def(add_shapes=True),
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238, in as_graph_def
result, _ = self._as_graph_def(from_version, add_shapes)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166, in _as_graph_def
graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message
After changing some parameters:
2021-01-16 19:33:38.647179: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.648679: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.649929: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.718605: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-01-16 19:33:38.718984: F tensorflow/core/common_runtime/gpu/gpu_util.cc:293] GPU->CPU Memcpy failed
Fatal Python error: Aborted
Thread 0x00007ef9d4ff9700 (most recent call first):
File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 295 in wait
File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 551 in wait
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 293 in _close_on_stop
File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 864 in run
File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/root/.pyenv/versions/3.6.8/lib/python3.6/threading.py", line 884 in _bootstrap
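In the end this was most likely still insufficient GPU memory, or someone else was running jobs at the same time and we were competing for the GPU.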
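To see whether someone else is already using a GPU before launching, the memory usage reported by `nvidia-smi` can be checked first. A rough sketch, assuming the NVIDIA driver's `nvidia-smi` is on the PATH; the helper names and the sample output are made up for illustration.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus.append({"index": index, "used": used, "free": total - used})
    return gpus


def least_busy_gpu(csv_text):
    """Return the index of the GPU with the most free memory."""
    return max(parse_gpu_memory(csv_text), key=lambda g: g["free"])["index"]


def query_gpus():
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    return subprocess.check_output(QUERY, universal_newlines=True)


# Made-up sample output, used here so the sketch runs without a GPU.
sample = "0, 10240, 11178\n1, 512, 11178\n"
print(least_busy_gpu(sample))
```

On a real machine, `least_busy_gpu(query_gpus())` would give an index that could be passed to the script's `--gpu` flag or set in `CUDA_VISIBLE_DEVICES`.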
December
- TensorFlow: Can't parse serialized Example.
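This happened when I built a tfrecord dataset and then could not read it back; reading it raised this error.
The cause turned out to be that I had specified the wrong shape.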
path_feature_description = {
    'path': tf.io.FixedLenFeature((50 * 5 * 10,), tf.int64),
}
I got the shape (50 * 5 * 10,) here wrong, and it took me a long time to track down.
While debugging, I could read the data with
tf.train.Example.FromString(example)
without any problem, yet parsing still failed. Once I corrected the shape, parsing worked.
For reference:
https://stackoverflow.com/questions/45427637/numpy-to-tfrecords-is-there-a-more-simple-way-to-handle-batch-inputs-from-tfrec/45428167#45428167
It shows how to store a numpy array in a tfrecord, mainly by using flatten().
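Since the array is flattened before it is written, the shape given to `FixedLenFeature` has to multiply out to exactly the number of stored values. A small sketch of that consistency check; the helper names and the 2 x 3 x 4 numbers are hypothetical.

```python
from functools import reduce
from operator import mul


def num_elements(shape):
    """Number of scalars a fixed-length feature with this shape expects."""
    return reduce(mul, shape, 1)


def check_fixed_len(values, declared_shape):
    """Raise if the stored value count differs from the declared shape."""
    expected = num_elements(declared_shape)
    if len(values) != expected:
        raise ValueError(
            "stored %d values but the feature description declares %s (= %d)"
            % (len(values), declared_shape, expected))
    return True


# Hypothetical case: a 2 x 3 x 4 array was flattened and written.
stored = [0] * (2 * 3 * 4)
check_fixed_len(stored, (2 * 3 * 4,))  # matches
try:
    check_fixed_len(stored, (2 * 3,))  # a wrong declaration is caught here
except ValueError as err:
    print(err)
```

In practice `values` could come from the decoded proto, for example `tf.train.Example.FromString(example).features.feature['path'].int64_list.value`.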
November
November 17
- ModuleNotFoundError: No module named ‘_lzma’
Traceback (most recent call last):
File "/tmp/code/ct/pyl.py", line 6, in <module>
from torchvision.datasets import MNIST
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/__init__.py", line 7, in <module>
from torchvision import datasets
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/datasets/__init__.py", line 6, in <module>
from .mnist import MNIST, EMNIST, FashionMNIST, KMNIST, QMNIST
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 11, in <module>
import lzma
File "/root/.pyenv/versions/3.6.8/lib/python3.6/lzma.py", line 27, in <module>
from _lzma import *
ModuleNotFoundError: No module named '_lzma'
Solution:
https://github.com/JaidedAI/EasyOCR/issues/84
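A commonly reported cause of this error is that Python was compiled (here via pyenv) without the liblzma development headers, in which case installing them (for example `liblzma-dev` on Debian/Ubuntu or `xz-devel` on CentOS) and rebuilding Python is the usual fix. That is an assumption about this machine, not something confirmed above. A quick way to check whether the extension module is present:

```python
import importlib.util


def has_module(name):
    """Return True if the import system can locate the named module."""
    return importlib.util.find_spec(name) is not None


if has_module("_lzma"):
    print("_lzma is available; `import lzma` should work")
else:
    print("_lzma is missing; this Python was probably built without liblzma headers")
```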
October
October 31
- NaN problem
This happened because I ended up computing log(-1e7) when it should have been log(1e-7).
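One way to guard against this kind of mistake is to clamp the argument to a small positive floor before taking the log. A minimal sketch in plain Python; the name `safe_log` is hypothetical.

```python
import math

EPS = 1e-7  # small positive floor, matching the log(1e-7) mentioned above


def safe_log(x, eps=EPS):
    """Natural log with the argument clamped to at least eps."""
    return math.log(max(x, eps))


print(safe_log(0.0))
print(safe_log(0.5))
```

In TensorFlow the same idea could be written as `tf.math.log(tf.maximum(x, 1e-7))`.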
October 30
- RuntimeError: implement_array_function method already has a docstring
This was because I had written a file named copy.py, which probably conflicted with the standard-library copy module.
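A local file that shares its name with a standard-library module gets imported in its place, so it can help to scan the project directory for such names. A sketch under stated assumptions: `find_shadowing_files` is a hypothetical helper, and the fallback name set is only a small hand-picked sample for Python versions without `sys.stdlib_module_names`.

```python
import os
import sys
import tempfile

# sys.stdlib_module_names exists on Python 3.10+; the fallback set is a small
# hand-picked sample and is an assumption, not a complete list.
STDLIB_NAMES = set(getattr(sys, "stdlib_module_names",
                           {"copy", "random", "string", "types", "json"}))


def find_shadowing_files(directory, names=STDLIB_NAMES):
    """List .py files in `directory` named like a standard-library module."""
    hits = []
    for filename in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(filename)
        if ext == ".py" and stem in names:
            hits.append(filename)
    return hits


# Demonstration in a temporary directory with a hypothetical project layout.
with tempfile.TemporaryDirectory() as project:
    for name in ("copy.py", "train.py"):
        open(os.path.join(project, name), "w").close()
    print(find_shadowing_files(project))
```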
October 25
- Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2625849871bytes) would be larger than the limit (2147483647 bytes)
Unresolved.
October 24
- Found Inf or NaN global norm. : Tensor had NaN values
Traceback (most recent call last):
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node VerifyFinite/CheckNumerics}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/code/cc1020/lstm.py", line 846, in <module>
tf.app.run(main)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/tmp/code/cc1020/lstm.py", line 837, in main
run_epoch(epoch, sess, train_data)
File "/tmp/code/cc1020/lstm.py", line 799, in run_epoch
fetched = sess.run(fetches, feed_dict)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at /tmp/code/cc1020/lstm.py:707) ]]
Caused by op 'VerifyFinite/CheckNumerics', defined at:
File "/tmp/code/cc1020/lstm.py", line 846, in <module>
tf.app.run(main)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/tmp/code/cc1020/lstm.py", line 707, in main
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), FLAGS.max_grad_norm)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
return verify_tensor_all_finite_v2(t, msg, name)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
verify_input = array_ops.check_numerics(x, message=message)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at /tmp/code/cc1020/lstm.py:707) ]]
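This happened after I changed some special labels to -1, which produced this baffling error.
The way to debug it: run on the CPU first, because the CPU error messages are more specific. Once the problem is fixed on the CPU, run on the GPU again.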
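Both steps can be sketched as follows: hide the GPUs so the run falls back to the CPU, and check that every label lies in the valid class range before training. The helper `invalid_labels` and the sample labels are hypothetical.

```python
import os

# Hide all GPUs so the framework falls back to the CPU; this has to be set
# before TensorFlow (or PyTorch) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


def invalid_labels(labels, num_classes):
    """Return the sorted set of label values outside [0, num_classes)."""
    return sorted({y for y in labels if not 0 <= y < num_classes})


# Hypothetical labels where a special class was remapped to -1.
labels = [0, 2, -1, 1, -1]
print(invalid_labels(labels, num_classes=3))
```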
October 20
- UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
2020-10-20 12:36:29.138645: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cudnn_rnn_ops.cc:1217 : Unknown: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Train loss: , accuracy Test loss: , accuracy : 0%| | 0/
Traceback (most recent call last):
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[{{node rnn/cudnn_lstm/CudnnRNN}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/tf_learn/mnist_culstm_debug.py", line 60, in <module>
_, loss_np, accuracy_np = sess.run([train, loss, accuracy], feed_dict={_inputs: batch_x, y: batch_y})
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[node rnn/cudnn_lstm/CudnnRNN (defined at /tmp/tf_learn/mnist_culstm_debug.py:34) ]]
Caused by op 'rnn/cudnn_lstm/CudnnRNN', defined at:
File "/tmp/tf_learn/mnist_culstm_debug.py", line 34, in <module>
outputs, _ = lstm(rnn_input, training=is_training)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 530, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 414, in call
training)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 489, in _forward
seed=self._seed)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1014, in _cudnn_rnn
outputs, output_h, output_c, _ = gen_cudnn_rnn_ops.cudnn_rnn(**args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 142, in cudnn_rnn
seed2=seed2, is_training=is_training, name=name)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(914): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[node rnn/cudnn_lstm/CudnnRNN (defined at /tmp/tf_learn/mnist_culstm_debug.py:34) ]]
I don't know how to fix this properly; in any case dropout cannot be used here, and removing dropout made the error go away.
- ValueError: could not convert string to float: ‘Five One Nine Five Three PAD’
Traceback (most recent call last):
File "/tmp/tf_learn/text_rnn_example.py", line 129, in <module>
_seqlens: seqlen_test
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'Five One Nine Five Three PAD'
This was again a data error: I had fed the input data as the labels.
- ValueError: setting an array element with a sequence.
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/tf_learn/text_rnn_example.py", line 115, in <module>
_seqlens: seqlen_batch
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
This was actually caused by wrong padding: the sequences ended up with different lengths and could not be converted to a numpy array.
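A minimal sketch of padding every sequence to one length and checking the result before it is handed to numpy; the function names are hypothetical, and the `PAD` token follows the one visible in the error message of the previous entry.

```python
def pad_sequences(sequences, pad_token="PAD", max_len=None):
    """Right-pad (or truncate) every sequence to the same length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    return [list(seq[:max_len]) + [pad_token] * (max_len - len(seq))
            for seq in sequences]


def is_rectangular(sequences):
    """True if all sequences have the same length."""
    return len({len(seq) for seq in sequences}) <= 1


batch = [["Five", "One"], ["Nine", "Five", "Three"]]
padded = pad_sequences(batch)
print(is_rectangular(batch), is_rectangular(padded))
```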
- A warning at runtime
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
A PR I found on GitHub:
https://github.com/tensorflow/tensorboard/pull/2482
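These are numpy FutureWarnings raised while TensorFlow defines its dtypes, and they do not stop the program. If changing the installed TensorFlow or numpy version is not an option, one workaround (an assumption, not taken from the PR above) is to install a warning filter before importing TensorFlow:

```python
import warnings


def silence_future_warnings():
    """Ignore FutureWarning; call this before `import tensorflow`."""
    warnings.filterwarnings("ignore", category=FutureWarning)


silence_future_warnings()
# import tensorflow as tf  # should now import without the dtype FutureWarnings
```

Note that this hides every FutureWarning in the process, not only the ones from dtypes.py.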
October 19
- Initialization problem: very poor accuracy
In the final MLP layer, initializing the weights to 0 gave very poor results.
# w = tf.Variable(tf.zeros([hidden_layer_size, num_classes]), dtype=tf.float32)
w = tf.Variable(tf.truncated_normal([hidden_layer_size, num_classes], mean=0, stddev=0.01), dtype=tf.float32)
# _b = tf.Variable(tf.zeros([num_classes]), dtype=tf.float32)
b = tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=0.01), dtype=tf.float32)
- Shape problem
Traceback (most recent call last):
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,128] vs. [128,128]
[[{{node rnn/scan/while/add_1}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 51, in <module>
_, loss_np, accuracy_np = sess.run([train, loss, accuracy], feed_dict={_inputs:batch_x,y:batch_y})
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,128] vs. [128,128]
[[node rnn/scan/while/add_1 (defined at /tmp/tf_learn/mnist_rnn_example_rewrite.py:28) ]]
Caused by op 'rnn/scan/while/add_1', defined at:
File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 31, in <module>
output = tf.scan(run_rnn, rnn_input, tf.zeros([batch_size, hidden_layer_size]))[-1]
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 724, in scan
maximum_iterations=n)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
body = lambda i, lv: (i + 1, orig_body(*lv))
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 703, in compute
a_out = fn(packed_a, packed_elems)
File "/tmp/tf_learn/mnist_rnn_example_rewrite.py", line 28, in run_rnn
+ b
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 812, in binary_op_wrapper
return func(x, y, name=name)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 365, in add
"Add", x=x, y=y, name=name)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Incompatible shapes: [32,128] vs. [128,128]
[[node rnn/scan/while/add_1 (defined at /tmp/tf_learn/mnist_rnn_example_rewrite.py:28) ]]
Process finished with exit code 1
The problem here was that I got batch_size wrong; fixing it solved the error.
- Trained parameters become NaN
0 [[1221726.2 2054588.9 783324.44]] -961194.375
1 [[-4.9709313e+12 -8.4502328e+12 -3.0732030e+12]] 4141717651456.0
2 [[2.0209081e+19 3.4784079e+19 1.2077052e+19]] -1.7788167394997305e+19
3 [[-8.2097530e+25 -1.4329765e+26 -4.7536997e+25]] 7.618131755874552e+25
4 [[3.3328387e+32 5.9077980e+32 1.8740524e+32]] -3.254525020468151e+32
5 [[-inf -inf -inf]] inf
6 [[nan nan nan]] nan
7 [[nan nan nan]] nan
8 [[nan nan nan]] nan
9 [[nan nan nan]] nan
In the end, averaging the loss fixed it: loss = tf.reduce_mean(loss)
import tensorflow as tf
import numpy as np
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
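# Generate 2000 samples, each with 3 features
x_data = np.random.randn(2000, 3)
w_real = [0.3, 0.5, 0.2]
b_real = -0.2
# Note: noise has shape (2000, 1), so y_data below broadcasts to shape (2000, 2000)
noise = np.random.randn(2000, 1)
y_data = np.matmul(w_real, x_data.T) + b_real + noise
# TensorFlow placeholders and variables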
x=tf.placeholder(tf.float32, shape=(None, 3))
y=tf.placeholder(tf.float32)
w=tf.Variable([[0,0,0]], dtype=tf.float32)
b=tf.Variable(0, dtype=tf.float32)
y_pred = tf.matmul(w, tf.transpose(x)) + b
loss = tf.square(y_pred-y)
# loss = tf.reduce_mean(loss)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)
train = optimizer.minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        _, w_np, b_np = sess.run([train, w, b], feed_dict={x: x_data, y: y_data})
        print(f'{i} {w_np} {b_np}')
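Why averaging helps: when the squared errors are summed, the gradient grows with the number of samples, so a learning rate of 0.5 overshoots further on every step; averaging divides the gradient by the sample count. The effect can be reproduced without TensorFlow. This is a plain-Python analogue with one feature, and the sample size, noise level and seed are assumptions chosen for illustration.

```python
import random


def fit(reduce_mean, steps=10, lr=0.5, n=200, seed=0):
    """Gradient descent on y = w*x + b with a summed or averaged squared loss."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [0.3 * x - 0.2 + rng.gauss(0.0, 0.1) for x in xs]
    w, b = 0.0, 0.0
    # Averaging scales the gradient by 1/n; summing leaves it unscaled.
    scale = 1.0 / n if reduce_mean else 1.0
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = scale * sum(2.0 * e * x for e, x in zip(errors, xs))
        grad_b = scale * sum(2.0 * e for e in errors)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


print("sum :", fit(reduce_mean=False))
print("mean:", fit(reduce_mean=True))
```

With the summed loss the parameters grow by orders of magnitude each step, as in the log above; with the averaged loss they settle near the true values 0.3 and -0.2.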
Problems encountered earlier
Traceback (most recent call last):
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 569, in <module>
GnnTrainer(config).start()
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 346, in start
self._run_epoch(train=True, model=model, dataloader=train_dataloader)
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 368, in _run_epoch
loss, cur_correct_prediction, cur_total_prediction, cur_total_unk = self._cal(data)
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 447, in _cal
pred = self.model(N, T, edge_index, edge_attr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 66, in forward
h = self.gnn(gnn_input, edge_index.to(config.gpu2), edge_attr.to(config.gpu2)).to(config.gpu0)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/userfolder/code/cc/CodeCompletion/graph_encoder.py", line 84, in forward
m = self.propagate(edge_index, x=m, edge_type=edge_type, layer=i)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 263, in propagate
out = self.message(**msg_kwargs)
File "/root/userfolder/code/cc/CodeCompletion/graph_encoder.py", line 92, in message
w = torch.index_select(edge_w, 0, edge_type)
RuntimeError: invalid argument 3: Index is supposed to be an empty tensor or a vector at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:415
The dimensions are probably wrong.
Traceback (most recent call last):
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 596, in <module>
GnnTrainer(config).start()
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 406, in start
self._run_epoch(train=True, model=model, dataloader=train_dataloader)
File "/root/userfolder/code/cc/CodeCompletion/CodeCompletionGNN.py", line 449, in _run_epoch
loss.backward()
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: index_put_ with accumulation is not supported on large tensors, number of source elements =2524650492file a support request on github
The values kept growing until they overflowed.
Traceback (most recent call last):
File "CodeCompletionGNN.py", line 597, in <module>
GnnTrainer(config).start()
File "CodeCompletionGNN.py", line 406, in start
self._run_epoch(train=True, model=model, dataloader=train_dataloader)
File "CodeCompletionGNN.py", line 428, in _run_epoch
loss, cur_correct_prediction, cur_total_prediction, cur_total_unk = self._cal(data)
File "CodeCompletionGNN.py", line 508, in _cal
pred = self.model(N, T, edge_index, edge_attr)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "CodeCompletionGNN.py", line 125, in forward
h = self.gnn(gnn_input, edge_index)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/gated_graph_conv.py", line 71, in forward
m = self.propagate(edge_index, x=m, edge_weight=edge_weight)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch_geometric/nn/conv/message_passing.py", line 246, in propagate
assert isinstance(size, list)
AssertionError
Target is multiclass but average='binary'. Please choose another average setting.
This is probably because the default settings differ between versions.
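The message comes from scikit-learn's metric functions such as `f1_score`, whose `average` parameter defaults to `'binary'`; for more than two classes, another setting such as `average='macro'` has to be passed explicitly. As an illustration of what the macro setting computes, here is a plain-Python sketch (the function name `macro_f1` and the sample labels are hypothetical):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]
print(macro_f1(y_true, y_pred))
```

With scikit-learn installed, the corresponding call would be `f1_score(y_true, y_pred, average='macro')`.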
Traceback (most recent call last):
File "/root/userfolder/code/hope/my_hin2vec.py", line 42, in <module>
train(epoch)
File "/root/userfolder/code/hope/my_hin2vec.py", line 21, in train
loss = model.loss(data.to(device))
File "/root/userfolder/code/hope/my_model.py", line 367, in loss
loss = self.loss_fn(out, data[:, -1])
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 498, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 2077, in binary_cross_entropy
input, target, weight, reduction_enum)
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'target' in call to _thnn_binary_cross_entropy_forward
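The error above is raised because binary cross-entropy expects a floating-point target, while the label column here is a Long tensor. A minimal sketch of the usual fix, using made-up tensors (the values and the names out and labels are only for illustration, not taken from my_model.py):

```python
import torch
import torch.nn as nn

loss_fn = nn.BCELoss()

# Hypothetical data: predicted probabilities and integer (Long) 0/1 labels.
out = torch.tensor([0.9, 0.2, 0.7])
labels = torch.tensor([1, 0, 1])  # dtype is torch.int64

# Cast the target to the same floating dtype as the prediction
# before passing it to the loss.
loss = loss_fn(out, labels.to(out.dtype))
```

In the traceback above, the equivalent change would presumably be to cast data[:, -1] with .float() before calling self.loss_fn.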
Traceback (most recent call last):
File "train_inst2vec.py", line 74, in <module>
app.run(main)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "train_inst2vec.py", line 54, in main
data_folders = i2v_prep.construct_xfg(data_folder)
File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 2955, in construct_xfg
pool.map(_partial_func, enumerate(folders_raw), chunksize=1)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/root/.pyenv/versions/3.6.8/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
AttributeError: 'MultiDiGraph' object has no attribute 'node'
networkx API change: node -> nodes (use G.nodes instead of G.node).
Traceback (most recent call last):
File "/root/userfolder/code/ncc/train_inst2vec.py", line 74, in <module>
app.run(main)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/root/userfolder/code/ncc/train_inst2vec.py", line 57, in main
i2v_vocab.construct_vocabulary(data_folder, data_folders)
File "/root/userfolder/code/ncc/inst2vec/inst2vec_vocabulary.py", line 634, in construct_vocabulary
H_dic = build_H_dictionary(D, context_width, folder_mat, base_filename, dictionary, stmts_cut_off)
File "/root/userfolder/code/ncc/inst2vec/inst2vec_vocabulary.py", line 333, in build_H_dictionary
A1 = nx.adjacency_matrix(D)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/networkx/linalg/graphmatrix.py", line 163, in adjacency_matrix
return nx.to_scipy_sparse_matrix(G, nodelist=nodelist, weight=weight)
File "/root/.pyenv/versions/3.6.8/lib/python3.6/site-packages/networkx/convert_matrix.py", line 775, in to_scipy_sparse_matrix
raise nx.NetworkXError("Graph has no nodes or edges")
networkx.exception.NetworkXError: Graph has no nodes or edges
Installation
ERROR: Command errored out with exit status 1:
command: /root/.pyenv/versions/3.6.8/bin/python3.6 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-tjiu83h7/mysqlclient/setup.py'"'"'; __file__='"'"'/tmp/pip-install-tjiu83h7/mysqlclient/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-or8qo7_5
cwd: /tmp/pip-install-tjiu83h7/mysqlclient/
Complete output (10 lines):
/bin/sh: 1: mysql_config: not found
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-tjiu83h7/mysqlclient/setup.py", line 16, in <module>
metadata, options = get_config()
File "/tmp/pip-install-tjiu83h7/mysqlclient/setup_posix.py", line 51, in get_config
libs = mysql_config("libs")
File "/tmp/pip-install-tjiu83h7/mysqlclient/setup_posix.py", line 29, in mysql_config
raise EnvironmentError("%s not found" % (_mysql_config_path,))
OSError: mysql_config not found
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
sudo apt-get install libmysqlclient-dev
WARNING:tensorflow:From pointer_parent.py:422: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
Traceback (most recent call last):
File "pointer_parent.py", line 458, in <module>
tf.app.run()
File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "pointer_parent.py", line 422, in main
sv = tf.train.Supervisor(logdir=None, summary_op=None)
File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 326, in __init__
graph_def=graph.as_graph_def(add_shapes=True),
File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3005, in as_graph_def
result, _ = self._as_graph_def(from_version, add_shapes)
File "/root/anaconda3/envs/my-cc/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2963, in _as_graph_def
c_api.TF_GraphToGraphDef(self._c_graph, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2625933684bytes) would be larger than the limit (2147483647 bytes)
functools.partial(func[,*args][, **kwargs])
Fixes (pre-fills) some of a function's arguments and returns a new callable.
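A small sketch of how functools.partial behaves. The function power and the names square and cube are made up for illustration:

```python
from functools import partial


def power(base, exponent):
    """Return base raised to exponent."""
    return base ** exponent


# Fix the exponent; the resulting callables only need the base.
square = partial(power, exponent=2)
cube = partial(power, exponent=3)

print(square(5))  # 25
print(cube(2))    # 8
```

This is the same pattern as the _partial_func passed to pool.map in the inst2vec traceback above: the extra arguments are bound once, so map only has to supply the remaining one.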
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
When allow_growth is set to True, the allocator does not reserve all GPU memory up front; it grows the allocation on demand.
tf.expand_dims # adds a dimension of size 1
# 't' is a tensor of shape [2]
tf.shape(tf.expand_dims(t, 0)) # [1, 2]
# tf.squeeze removes all dimensions of size 1
# 't' is a tensor of shape [1, 2, 1, 3, 1, 1]
tf.shape(tf.squeeze(t)) # [2, 3]
torch.arange(N)
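# produces the integers 0 to N-1, e.g. N=4 gives tensor([0, 1, 2, 3])
torch.stack
# adds a new dimension
torch.cat
# the size of the concatenated dimension grows; no new dimension is added
# * multiplies element-wise (with broadcasting)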
(A, 1, B) * (A, n, B) -> (A, n, B)
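The shape rules noted above can be checked with a short sketch. The sizes A, n, B and the tensor names are arbitrary choices for illustration:

```python
import torch

a = torch.arange(4)  # tensor([0, 1, 2, 3])
b = torch.arange(4)

# stack creates a new leading dimension: two (4,) tensors -> (2, 4)
stacked = torch.stack([a, b])

# cat keeps the number of dimensions and lengthens dim 0: -> (8,)
catted = torch.cat([a, b])

# Broadcasting: the size-1 middle dimension of x is expanded to n.
A, n, B = 2, 3, 5
x = torch.ones(A, 1, B)
y = torch.rand(A, n, B)
z = x * y  # element-wise product, shape (A, n, B)

print(stacked.shape, catted.shape, z.shape)
```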
AttributeError: 'SparseTensor' object has no attribute 'sample'
This requires PyTorch 1.5.0.
def sample(src: SparseTensor, num_neighbors: int,
           subset: Optional[torch.Tensor] = None) -> torch.Tensor:
    rowptr, col, _ = src.csr()
    rowcount = src.storage.rowcount()
    if subset is not None:
        rowcount = rowcount[subset]
        rowptr = rowptr[subset]
    rand = torch.rand((rowcount.size(0), num_neighbors), device=col.device)
    rand.mul_(rowcount.to(rand.dtype).view(-1, 1))
    rand = rand.to(torch.long)
    rand.add_(rowptr.view(-1, 1))
    return col[rand]
Searching turns up nothing for this one:
torch.ops.torch_sparse.ind2ptr
https://www.cnblogs.com/xbinworld/p/4273506.html
UnicodeEncodeError: 'ascii' codec can't encode character '\xa7' in position 29: ordinal not in range(128)
Fix: set the environment variable PYTHONIOENCODING=utf-8 when running the program.
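A minimal sketch that reproduces the effect of PYTHONIOENCODING by running a child interpreter twice. The helper run_child and the one-line child script are made up for illustration:

```python
import os
import subprocess
import sys

# Child script that prints the section sign '\xa7',
# the character named in the error above.
child_code = "print('\\xa7')"


def run_child(io_encoding):
    """Run the child script with PYTHONIOENCODING set to io_encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run([sys.executable, "-c", child_code], env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)


ascii_run = run_child("ascii")  # stdout cannot encode '\xa7'
utf8_run = run_child("utf-8")   # stdout encodes '\xa7' as UTF-8

print(ascii_run.returncode, utf8_run.returncode)
```

On the command line the same setting is usually given as a prefix, for example PYTHONIOENCODING=utf-8 python script.py (script.py being a placeholder name).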