mmdet3d + Waymo: pitfalls, and a workflow for verifying that the environment is correct

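Processing the new version of the Waymo dataset was already laborious, and then the eval results and the training loss were consistently poor. I originally assumed this was only a model problem, and later found that the environment had serious pitfalls too. Once that was fixed, an error started appearing whenever evaluate finished: the results were all correct, yet an error was still raised. The whole chain of problems took 20 days to sort out, with basically no rest in between. Exhausting.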

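Earlier, when setting up mmdet3d, I used the latest release, mmdet3d v1.0.0rc2. With the official config and model, both the eval and the train results on the nuScenes dataset were wrong. Things only worked after I switched to the versions from a classmate's environment, but at that point testing on Waymo raised an error. After a long bug hunt, I found that the newer CUDA, the older torch, and tensorflow conflict to some degree, so problems appear when they use the GPU together.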

Reproducing the waymo evaluate error

I'll file an issue when I have time.
Environment:

mmcv-full                 1.4.0            
mmdet                     2.19.1     
mmdet3d                   0.17.3 
tensorflow                2.6.0
torch                     1.10.2+cu113
waymo-open-dataset-tf-2-6-0 1.4.7
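
Because the behaviour described here depended on exact package versions, it can help to print what is actually installed before comparing two environments. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`). The helper name `collect_versions` is made up for illustration, and the package names are simply copied from the list above.

```python
from importlib import metadata

# Package names copied from the environment list above.
PACKAGES = [
    "mmcv-full",
    "mmdet",
    "mmdet3d",
    "tensorflow",
    "torch",
    "waymo-open-dataset-tf-2-6-0",
]


def collect_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name:30s} {version or 'not installed'}")
```

Running this in both the working and the broken environment and diffing the output is one way to spot a version mismatch.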

The problem occurs with bash tools/dist_train.sh or bash tools/dist_test.sh.
Once waymo_dataset.evaluate has been called, the program always reports an error when it exits:

terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: unspecified launch failure
Exception raised from create_event_internal at …/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27dd18ed62 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c5f3 (0x7f282083b5f3 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f282083c002 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f27dd178314 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29eb89 (0x7f28a3b62b89 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xadfbe1 (0x7f28a43a3be1 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f28a43a3ee2 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #61: PyRun_SimpleFileExFlags + 0x1bf (0x56231eaba54f in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #62: Py_RunMain + 0x3a9 (0x56231eabaa29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #63: Py_BytesMain + 0x39 (0x56231eabac29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57875 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 57874) of binary: /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
    main()
  File "/home/zhengliangta
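
If torch and tensorflow really do clash when both touch the GPU in the same process, one thing worth trying is to keep TensorFlow on the CPU. The sketch below is not a confirmed fix and is not taken from this setup. It assumes that TensorFlow is pulled into the process by the Waymo evaluation path and that it does not need the GPU there. The helper name `hide_gpus_from_tensorflow` is hypothetical; `tf.config.set_visible_devices` is the TensorFlow 2.x call it relies on.

```python
def hide_gpus_from_tensorflow():
    """Try to make TensorFlow CPU-only in the current process.

    Returns True if the GPUs were hidden, and False if TensorFlow is
    not installed or its devices were already initialized.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # No TensorFlow in this environment, nothing to hide.
        return False
    try:
        # This has to run before TensorFlow uses a GPU for the first time.
        tf.config.set_visible_devices([], "GPU")
    except RuntimeError:
        # Raised when the visible devices can no longer be changed.
        return False
    return True


if __name__ == "__main__":
    print("TensorFlow GPUs hidden:", hide_gpus_from_tensorflow())
```

To have any effect, the call would need to happen early, for example near the top of the training or test script (an assumption about where to place it), before waymo_dataset.evaluate runs.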
