PyTorch Parallel KeyError Bug

Error:

Traceback (most recent call last):                                                                                                                                                                                          
  File "train_point_corr.py", line 122, in <module>                                                                                                                                                                         
    main()                                                                                                                                                                                                                  
  File "train_point_corr.py", line 44, in main                                                                                                                                                                              
    return main_train(model_class_pointer, hparams, parser)                                                                                                                                                                 
  File "train_point_corr.py", line 111, in main_train                                                                                                                                                                       
    trainer.fit(model)                                                                                                                                                                                                      
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit                                                                                                   
    self.dispatch()                                                                                                                                                                                                         
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch                                                                                              
    self.accelerator.start_training(self)                                                                                                                                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training                                                                                
    self.training_type_plugin.start_training(trainer)                                                                                                                                                                       
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training                                                             
    self._results = trainer.run_train()                                                                                                                                                                                     
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train                                                                                             
    self.train_loop.run_training_epoch()                                                                                                                                                                                    
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch                                                                              
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)                                                                                                                                                
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 631, in run_training_batch                                                                              
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens                                                                                                                                                        
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 742, in training_step_and_backward                                                                      
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)                                                                                                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)                                       
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)                                                     
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/dp.py", line 94, in training_step
    return self.model(*args, **kwargs)                 
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)                           
KeyError: Caught KeyError in replica 0 on device 0.                                                           
Original Traceback (most recent call last):                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)                  
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 74, in forward
    output = super().forward(*inputs, **kwargs)                                                               
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 48, in forward
    output = self.module.training_step(*inputs, **kwargs)                                                     
  File "/home2/djc/DPC/models/shape_corr_trainer.py", line 60, in training_step
    batch = self(batch)                                
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 230, in forward
    # dense features, similarity, and cross reconstruction                                                    
  File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 130, in forward_source_target
    ###transformers                                    
  File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 118, in compute_cross_features
    src_pos=source_pe.transpose(0,1) if self.hparams.transformer_encoder_has_pos_emb else None,
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 41, in forward
    src_pos=src_pos, tgt_pos=tgt_pos)                  
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 256, in forward
    src_key_padding_mask, tgt_key_padding_mask, src_pos, tgt_pos)                                             
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 214, in forward_pre
    src_w_pos = self.with_pos_embed(src2, src_pos)                                                            
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 119, in with_pos_embed
    return tensor if pos is None else tensor + pos                                                            
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 197, in format_stack
    return format_list(extract_stack(f, limit=limit))                                                         
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 211, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)                                                  
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 360, in extract
    linecache.checkcache(filename)                     
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/linecache.py", line 79, in checkcache
    del cache[filename]                                
KeyError: '/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py'

Key error lines:

KeyError: Caught KeyError in replica 0 on device 0. 
...
...
return format_list(extract_stack(f, limit=limit))
stack = StackSummary.extract(walk_stack(f), limit=limit) 
linecache.checkcache(filename)
del cache[filename]
KeyError: '/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py'

When the bug appears:

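  1. The bug does not occur when the model runs without Parallel mode.
  2. The model's code was modified in trivial ways while the model was running, e.g. adding hyperparameters, changing the training schedule options, or testing the model.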
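
Both conditions fit how the standard-library linecache module behaves. The traceback ends in linecache.checkcache, which compares the cached file size and mtime against the file on disk and deletes the entry when they differ. A likely explanation, stated here as an assumption rather than something confirmed above, is that the DataParallel replicas run in separate threads, so two threads can both try to delete the same stale entry and the second `del cache[filename]` raises KeyError. The sketch below only demonstrates the single-threaded part (an edit on disk makes checkcache drop the entry); the file name model_stub.py is a hypothetical stand-in for the model file.

```python
import linecache
import os
import tempfile


def cache_entry_dropped_after_edit():
    """Return (cached_before, cached_after) for a file edited after it was cached."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for the model source file.
        path = os.path.join(tmp, "model_stub.py")
        with open(path, "w") as f:
            f.write("x = 1\n")

        # Populate the cache, as traceback formatting does for every frame's file.
        linecache.getlines(path)
        cached_before = path in linecache.cache

        # Edit the file "while training is running"; appending changes its size.
        with open(path, "a") as f:
            f.write("y = 2\n")

        # checkcache stat()s the file, sees the mismatch, and drops the entry.
        linecache.checkcache(path)
        cached_after = path in linecache.cache
    return cached_before, cached_after


print(cache_entry_dropped_after_edit())
```

In a single thread the deletion is harmless. It only becomes a problem if another thread removes the same entry between the check and the delete.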

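Solution:
Don't modify or test the model while it is running. In theory, the running code should stick with the previous version instead of the updated one, but with this bug in the way, what can you do? Enjoy life while the program runs~ lol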
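
If leaving the code alone during a long run is impractical, one possible workaround (an assumption, not something tested in this post) is to copy the project to a separate directory and launch training from the copy, so that later edits to the working copy never touch the files the running process was loaded from. The helper name snapshot_source and the paths in the usage comment are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_source(src_dir, dst_root=None):
    """Copy src_dir into a fresh, uniquely named directory and return the copy's path."""
    src = Path(src_dir).resolve()
    # mkdtemp guarantees a new directory, so repeated snapshots never collide.
    run_root = Path(tempfile.mkdtemp(prefix=src.name + "_run_", dir=dst_root))
    dst = run_root / src.name
    shutil.copytree(
        str(src),
        str(dst),
        ignore=shutil.ignore_patterns("__pycache__", ".git"),
    )
    return dst


# Usage sketch (placeholder paths; requires `import subprocess, sys`):
#   run_dir = snapshot_source("/path/to/project")
#   subprocess.run([sys.executable, "train_point_corr.py"], cwd=str(run_dir), check=True)
```

Large datasets or checkpoints inside the project directory would be copied too, so they are better kept outside it or added to the ignore patterns.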

Reference:
When modified the model python file, the pytorch will raise the KeyError of this file #43120
