Index exceeds tensor dimension range

This error typically occurs during indexing operations in PyTorch on CUDA. It means an index is outside the valid range of the tensor dimension it addresses: either the index value is greater than or equal to the size of that dimension, or the tensor's shape is not what the code expects.
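As a concrete illustration, the following minimal sketch (the tensor sizes and names are made up for this example, and it needs a CUDA-capable GPU) triggers exactly this kind of device-side assert, because index 12 is out of range for a dimension of size 10:

import torch

x = torch.arange(10, device="cuda")           # dimension 0 has size 10
idx = torch.tensor([3, 12], device="cuda")    # 12 >= 10, so it is out of range

# On the CPU this would raise a clean IndexError; on CUDA it fires a
# device-side assert like "Assertion `srcIndex < srcSelectDimSize` failed."
y = torch.index_select(x, 0, idx)
torch.cuda.synchronize()                      # the assert may only surface here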

To resolve it, check the following:

Make sure the index values do not exceed the tensor's size. Inspect the indices used in the indexing operation and confirm that each one lies within the valid range, i.e. is smaller than the size of the corresponding dimension.

Make sure the tensor's dimensions are set up correctly. Verify the tensor's shape and confirm that the indexing operation is consistent with that shape. A defensive check along these lines is sketched just below.
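The helper below is a minimal sketch of such a check; the function name and the call in the trailing comment are illustrative, not part of any existing API.

import torch

def check_indices(t: torch.Tensor, indices: torch.Tensor, dim: int = 0) -> None:
    # Validate on the host side so a bad index raises a readable IndexError
    # instead of an opaque device-side assert inside a CUDA kernel.
    size = t.size(dim)
    bad = (indices < 0) | (indices >= size)
    if bad.any():
        raise IndexError(
            f"indices {indices[bad].tolist()} are out of range for "
            f"dim {dim} of size {size}"
        )

# Example (names are illustrative): check_indices(weight, ids.view(-1), dim=0)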

The full error message is as follows:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [183,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "pretraining_myjob.py", line 256, in <module>
    fire.Fire(train)
  File "/usr/local/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.7/site-packages/fire/core.py", line 480, in _Fire
    target=component.__name__)
  File "/usr/local/lib/python3.7/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "pretraining_myjob.py", line 206, in train
    loss = ddp_model(**data).loss
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 1605, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 942, in forward
    attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3
Process Group destroyed on rank 0
Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f8b60d0ed62 in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f8b60d0b68b in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x32ac75e (0x7f8b6446075e in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x113 (0x7f8b64449443 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f8b64449669 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xe96c16 (0x7f8bbe2c9c16 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xe7c745 (0x7f8bbe2af745 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x2a2ca8 (0x7f8bbd6d5ca8 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2a3fae (0x7f8bbd6d6fae in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xaa7a5 (0x55570e3dc7a5 in /usr/local/bin/python3)
frame #10: <unknown function> + 0xab224 (0x55570e3dd224 in /usr/local/bin/python3)
frame #11: <unknown function> + 0xaa7bb (0x55570e3dc7bb in /usr/local/bin/python3)
frame #12: <unknown function> + 0xace80 (0x55570e3dee80 in /usr/local/bin/python3)
frame #13: <unknown function> + 0x193d5c (0x55570e4c5d5c in /usr/local/bin/python3)
frame #14: _PyGC_CollectNoFail + 0x31 (0x55570e4c7031 in /usr/local/bin/python3)
frame #15: PyImport_Cleanup + 0x62a (0x55570e483eaa in /usr/local/bin/python3)
frame #16: <unknown function> + 0x16310d (0x55570e49510d in /usr/local/bin/python3)
frame #17: <unknown function> + 0x69b03 (0x55570e39bb03 in /usr/local/bin/python3)
frame #18: _Py_UnixMain + 0x49 (0x55570e39c2a9 in /usr/local/bin/python3)
frame #19: __libc_start_main + 0xe7 (0x7f8bc972dc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2a (0x55570e39615a in /usr/local/bin/python3)

terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181, unhandled cuda error, NCCL version 21.0.3
Process Group destroyed on rank 1
Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:181 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f9749ab8d62 in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f9749ab568b in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x32ac75e (0x7f974d20a75e in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x113 (0x7f974d1f3443 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x9 (0x7f974d1f3669 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xe96c16 (0x7f97a7073c16 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xe7c745 (0x7f97a7059745 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x2a2ca8 (0x7f97a647fca8 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2a3fae (0x7f97a6480fae in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xaa7a5 (0x560d5f21e7a5 in /usr/local/bin/python3)
frame #10: <unknown function> + 0xab224 (0x560d5f21f224 in /usr/local/bin/python3)
frame #11: <unknown function> + 0xaa7bb (0x560d5f21e7bb in /usr/local/bin/python3)
frame #12: <unknown function> + 0xace80 (0x560d5f220e80 in /usr/local/bin/python3)
frame #13: <unknown function> + 0x193d5c (0x560d5f307d5c in /usr/local/bin/python3)
frame #14: _PyGC_CollectNoFail + 0x31 (0x560d5f309031 in /usr/local/bin/python3)
frame #15: PyImport_Cleanup + 0x62a (0x560d5f2c5eaa in /usr/local/bin/python3)
frame #16: <unknown function> + 0x16310d (0x560d5f2d710d in /usr/local/bin/python3)
frame #17: <unknown function> + 0x69b03 (0x560d5f1ddb03 in /usr/local/bin/python3)
frame #18: _Py_UnixMain + 0x49 (0x560d5f1de2a9 in /usr/local/bin/python3)
frame #19: __libc_start_main + 0xe7 (0x7f97b24d7c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2a (0x560d5f1d815a in /usr/local/bin/python3)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1289 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 1290) of binary: /usr/local/bin/python3
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretraining_myjob.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-02_12:35:23
  host      : dev-xiaoying-zuo-mt-pretrain-coav-665f688c7d-hcbwj
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1290)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1290
============================================================
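In this trace the assertion comes from indexSelectLargeIndex during the T5 forward pass, which is typically the embedding lookup; a common trigger in practice is input_ids containing token IDs greater than or equal to the embedding's vocabulary size (for example when the tokenizer does not match the model). The sketch below is a minimal pre-flight check of that kind; the helper name and the data layout in the trailing comment are assumptions, not part of the training script above.

import torch

def check_input_ids(input_ids: torch.Tensor, vocab_size: int) -> None:
    # Fail fast on the host with a readable message instead of a CUDA assert.
    max_id = int(input_ids.max())
    min_id = int(input_ids.min())
    if min_id < 0 or max_id >= vocab_size:
        raise ValueError(
            f"input_ids range [{min_id}, {max_id}] is incompatible with an "
            f"embedding of vocab_size={vocab_size}; check that the tokenizer "
            f"matches the model."
        )

# Example (illustrative): check_input_ids(data["input_ids"], model.config.vocab_size)

As the log itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 set in the environment makes CUDA kernels launch synchronously, so the Python stack trace points at the operation that actually failed rather than at a later call such as the torch.ones(...) line shown above.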