pytorch loss.backward() error RuntimeError: select(): index 0 out of range for tensor of size [0, 1]: how I fixed it

I ran into a very strange bug and only solved it after trying many things. I am writing it down here for anyone who hits the same problem.

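Problem description:
I am training a generative adversarial model with a discriminator and a generator, then backpropagating through both. The basic code structure is as follows: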

import torch.optim as optim

G = Generator(3, 3, 32, norm='bn').apply(weights_init)
D = MS_Discriminator(input_nc=6).apply(weights_init)
optimizer_G = optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_D = optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))

def train(...):
    ...
    ...
    ...
    loss_D = (loss_D_fake + loss_D_real) / 2
    loss_G = loss_G_GAN + loss_G_ReID + loss_G_ssim

    ############## Backward #############
    # update generator weights
    optimizer_G.zero_grad()
    loss_G.backward()
    optimizer_G.step()
    # update discriminator weights
    optimizer_D.zero_grad()
    loss_D.backward()
    optimizer_D.step()

# train() must be defined before the loop that calls it
for epoch in range(EPOCH):
    train(...)

The error is odd. It is raised at loss.backward(), and the message is RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0:

  File "./xxx/utils/attack_patch/attack_algorithm/MIS_RANKING/MIS_RANKING.py", line 161, in train
    loss_G.backward()
  File "/home/xxx/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xxx/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: select(): index 0 out of range for tensor of size [0, 1] at dimension 0
Exception raised from select at /opt/conda/conda-bld/pytorch_1595629427478/work/aten/src/ATen/native/TensorShape.cpp:889 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x2b527cfbb77d in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::select(at::Tensor const&, long, long) + 0x347 (0x2b5240b88ff7 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0xfe3789 (0x2b5240f6d789 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xfd6a83 (0x2b5240f60a83 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::select(at::Tensor const&, long, long) + 0xe0 (0x2b5240e930f0 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2b62186 (0x2b5242aec186 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xfd6a83 (0x2b5240f60a83 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::select(long, long) const + 0xe0 (0x2b524101e240 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2a6d69d (0x2b52429f769d in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::generated::MaxBackward1::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x188 (0x2b5242a110d8 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x30d1017 (0x2b524305b017 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x2b5243056860 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x2b5243057401 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x2b524304f579 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x2b523efbc99a in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0xc9039 (0x2b523cb4c039 in /home/luzhixing/project/anaconda3/envs/fastreid/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #16: <unknown function> + 0x7e65 (0x2b52126b9e65 in /lib64/libpthread.so.0)
frame #17: clone + 0x6d (0x2b52129cc88d in /lib64/libc.so.6)
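The top native frames of the trace, `MaxBackward1` and `at::native::select`, hint at where the failing index comes from. What follows is a hypothesis that I have not verified on this model: the backward of a full `torch.max(x)` reduction in the PyTorch version used here appears to find the element that produced the maximum by comparing `x` with the saved result and selecting the first match. If `x` contains a NaN, the maximum is NaN (assuming max propagates NaN), and since NaN compares unequal to everything, itself included, there is no match. For a 1-D `x`, the empty list of matching indices has shape `[0, 1]`, and taking entry 0 of it fails with exactly this message. The pure-Python sketch below only mimics that lookup; `nan_propagating_max` and `first_max_index` are hypothetical helpers, not PyTorch functions.

```python
import math


def nan_propagating_max(values):
    """Return max(values), or NaN if any element is NaN (assumed torch.max behaviour)."""
    if any(math.isnan(v) for v in values):
        return math.nan
    return max(values)


def first_max_index(values):
    """Return the first index whose value equals the saved maximum.

    Mimics the assumed backward lookup: compare every element with the
    saved result and take the first match.
    """
    result = nan_propagating_max(values)
    matches = [i for i, v in enumerate(values) if v == result]  # NaN == NaN is False
    if not matches:
        # Analogue of "select(): index 0 out of range for tensor of size [0, 1]"
        raise IndexError(f"index 0 out of range for {len(matches)} matches")
    return matches[0]


print(first_max_index([0.2, 0.9, 0.4]))  # 1
try:
    first_max_index([0.2, math.nan, 0.4])
except IndexError as err:
    print("IndexError:", err)
```

If the hypothesis holds, it would also fit the fact that training runs for a while and only fails once some value feeding a max() has turned into NaN.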

What puzzled me most is that it does not fail right away but partway through the run, after a few epochs. The run output is as follows:

start training epoch 0 for Mis-ranking model G and D
ssim = 0.22805009749679203
ssim = 0.2545814767895862
ssim = 0.2610446151667491
ssim = 0.3381733775738129
ssim = 0.39292161640214335
ssim = 0.4152196025258168
loss_G = 308529.04647636414
loss_D = 184930.34714967012
average ssim in epoch 0 is 0.31499846432581674
successfully save attack model weights of G and D
start training epoch 1 for Mis-ranking model G and D
ssim = 0.41696580195656113
ssim = 0.49091494214729137
ssim = 0.5351251625051616
ssim = 0.5997415484053679
ssim = 0.6003826963236703
ssim = 0.6587150843015662
loss_G = 134753.96519470215
loss_D = 9467.809808760881
average ssim in epoch 1 is 0.5503075392732698
successfully save attack model weights of G and D
start training epoch 2 for Mis-ranking model G and D
ssim = 0.6423501039873163
Traceback (most recent call last):
  File "tools/train_net.py", line 111, in <module>
    args=(args,),
  File "./fastreid/engine/launch.py", line 71, in launch
    main_func(*args)
    ...
    ...
    (the error message shown above)

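In other words, the code runs fine at first and each individual call to train() executes correctly. The problem only shows up inside the loop: after a few epochs the run suddenly stops with this error. Searching for it, I found a post describing a problem quite similar to mine, but it offered no solution. There is also a suggested fix for GAN training that moves optimizer.step() later, but the error in that case had a different cause from mine, so I could not be sure it was the same problem.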

The blog post cited there did, however, fix my error.
The fix is as follows:

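  • Swap the order of the backward() calls for loss_D and loss_G
  • Change the discriminator (D) call loss_D.backward() to loss_D.backward(retain_graph=True)
  • Move optimizer_D.step() and optimizer_G.step() to the end and update both networks together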
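def train(...):
    ...
    ...
    ...
    loss_D = (loss_D_fake + loss_D_real) / 2
    loss_G = loss_G_GAN + loss_G_ReID + loss_G_ssim

    ############## Backward #############
    # discriminator gradients; retain_graph=True keeps the graph alive
    # for the later loss_G.backward()
    optimizer_D.zero_grad()
    loss_D.backward(retain_graph=True)

    # generator gradients; zero_grad() first clears any G gradients
    # that loss_D.backward() may have accumulated
    optimizer_G.zero_grad()
    loss_G.backward()

    # update both networks only after all backward passes are done
    optimizer_D.step()
    optimizer_G.step()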

After that, training ran correctly without any error.
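The reordered update can be tried end to end on toy models. This is a sketch under stated assumptions: G and D are replaced by tiny `nn.Linear` layers, every loss is plain BCE on logits, and a single `loss_G` term stands in for `loss_G_GAN + loss_G_ReID + loss_G_ssim`. It also makes one side effect of this ordering visible: if `loss_G_GAN` is computed through D (as it normally is in a GAN), `loss_G.backward()` runs after `optimizer_D.zero_grad()` and before `optimizer_D.step()`, so D's update uses the gradients of `loss_D` plus those of `loss_G`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Toy stand-ins (assumption): the real G and D are image networks.
G = nn.Linear(4, 4)
D = nn.Linear(4, 1)
optimizer_G = optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
optimizer_D = optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))


def train_step(real, noise):
    fake = G(noise)
    pred_real = D(real)
    pred_fake = D(fake)  # not detached: shared by loss_D and loss_G
    ones = torch.ones_like(pred_fake)
    zeros = torch.zeros_like(pred_fake)

    loss_D_real = F.binary_cross_entropy_with_logits(pred_real, ones)
    loss_D_fake = F.binary_cross_entropy_with_logits(pred_fake, zeros)
    loss_D = (loss_D_fake + loss_D_real) / 2
    # stands in for loss_G_GAN + loss_G_ReID + loss_G_ssim
    loss_G = F.binary_cross_entropy_with_logits(pred_fake, ones)

    # update order from the fix above
    optimizer_D.zero_grad()
    loss_D.backward(retain_graph=True)  # keep the shared graph for loss_G
    d_grad_from_loss_D = D.weight.grad.clone()

    optimizer_G.zero_grad()  # clear the G gradients produced by loss_D
    loss_G.backward()
    # D.weight.grad now holds grad(loss_D) + grad(loss_G) with respect to D
    d_grad_after_loss_G = D.weight.grad.clone()

    optimizer_D.step()
    optimizer_G.step()
    return d_grad_from_loss_D, d_grad_after_loss_G


before, after = train_step(torch.randn(8, 4), torch.randn(8, 4))
print(torch.allclose(before, after))  # False: loss_G also added gradients to D
```

Whether that extra contribution to D's update matters for the original model depends on what the elided code computes.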


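2022.1.10: My machine-learning skills are still limited, and I do not yet have a sound theoretical explanation for this problem. If I figure it out later, I will come back and add it.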
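Because the failure only appears after a couple of epochs, while the logged losses are still very large, one cheap check (an assumption about where to look, not a confirmed diagnosis) is to stop at the first iteration in which any loss term becomes NaN or infinite, before backward() runs. The helper `check_losses_finite` below is hypothetical; inside train() it would be called with plain floats, e.g. `check_losses_finite(epoch, step, loss_G_GAN=loss_G_GAN.item(), loss_G_ssim=loss_G_ssim.item())`. PyTorch's own `torch.autograd.set_detect_anomaly(True)` is a heavier built-in alternative that also reports the forward operation behind a failing backward.

```python
import math


def check_losses_finite(epoch, step, **losses):
    """Raise as soon as any named loss value is NaN or +/-inf.

    `losses` maps a name to a plain Python float (for example loss.item()).
    """
    bad = {name: value for name, value in losses.items() if not math.isfinite(value)}
    if bad:
        raise FloatingPointError(f"epoch {epoch}, step {step}: non-finite losses {bad}")


# Made-up values for illustration: the second call trips the check.
check_losses_finite(0, 0, loss_G_GAN=1.2, loss_G_ReID=0.7, loss_G_ssim=0.3)
try:
    check_losses_finite(2, 1, loss_G_GAN=float("nan"), loss_G_ReID=0.7, loss_G_ssim=0.3)
except FloatingPointError as err:
    print(err)
```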
