Learning goals:
1. Reproduce the code of the PointCT paper.
2. Record the problems encountered along the way.
3. Briefly summarize the process.
Reproducing the PointCT code:
1. Code source: https://github.com/anhthuan1999/PointCT
The first step is the preparation work: setting up the environment and its dependencies. The first problem I hit was a CUDA version mismatch. I was running on the department server, where I believed users were not allowed to install their own CUDA and could only use the preinstalled latest version, CUDA 12.2, so training failed partway through. Fortunately I was able to install a suitable version from the official CUDA download site; I had assumed a user-level install on the server was impossible, but it worked.
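The mismatch was between the toolkit on the server and the CUDA version the PyTorch wheel was built against. The sketch below is my own illustration of the check, not repo code; in practice the two version strings come from `torch.version.cuda` and the output of `nvcc --version`.

```python
# Hypothetical helper, not from the PointCT repo: compare the CUDA version a
# PyTorch wheel was built against with the toolkit version nvcc reports.
# A major-version mismatch (e.g. an 11.x wheel against a 12.2 toolkit) is
# what typically breaks compiled point-cloud CUDA extensions.
def cuda_major(version: str) -> int:
    return int(version.split(".")[0])

def toolchains_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    return cuda_major(torch_cuda) == cuda_major(nvcc_cuda)

print(toolchains_match("11.3", "12.2"))  # False: wheel and toolkit disagree
print(toolchains_match("12.1", "12.2"))  # True: same major version
```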
2. Next, download the dataset. I assumed this simply meant downloading S3DIS, and since I had already downloaded it earlier, I skipped this step. Training then failed with missing .npy files, which made me realize this step also involves an extra preprocessing pass over the dataset.
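The missing .npy files come from that preprocessing pass. Below is a rough sketch of the kind of conversion many S3DIS pipelines perform (merging each room's per-object txt files into one labeled array); the class list, directory layout, and function name are my assumptions, not necessarily what PointCT's scripts do.

```python
# Hypothetical sketch: concatenate a room's per-object txt files
# (columns: x y z r g b) plus a derived label column into one (N, 7) .npy.
import numpy as np
from pathlib import Path

# Assumed S3DIS class order; the repo's preprocessing may differ.
CLASSES = ["ceiling", "floor", "wall", "beam", "column", "window",
           "door", "table", "chair", "sofa", "bookcase", "board", "clutter"]

def room_to_npy(room_dir: Path, out_path: Path) -> None:
    points = []
    for txt in sorted(room_dir.glob("Annotations/*.txt")):
        cls = txt.stem.split("_")[0]  # e.g. "chair_1.txt" -> "chair"
        label = CLASSES.index(cls) if cls in CLASSES else CLASSES.index("clutter")
        xyzrgb = np.loadtxt(txt)                       # (n, 6) point block
        labels = np.full((xyzrgb.shape[0], 1), label)  # label column
        points.append(np.hstack([xyzrgb, labels]))
    np.save(out_path, np.vstack(points).astype(np.float32))
```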
Here is the output of that step.
Here I followed the steps one by one to process the data.
After that comes running train.py.
While running it, I hit a long list of problems in both train.py and s3dis.yaml, and kept debugging and modifying the code. To get the script to run at all I made some changes, but the data was not actually being passed in,
so the reported metrics for many classes were all 0.
Since everything was 0, I stopped the run and tried to locate the problem.
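A quick way to confirm that hypothesis (my own debugging sketch, not repo code) is to histogram the labels the data loader actually yields: if all mass sits in a single class, the label column was never passed through.

```python
import numpy as np

def label_histogram(labels, num_classes: int = 13) -> np.ndarray:
    """Count how many points fall into each of the 13 S3DIS semantic classes."""
    return np.bincount(np.asarray(labels, dtype=np.int64).ravel(),
                       minlength=num_classes)

# Every point landing in class 0 is the signature of labels that
# defaulted to zero because the real labels were never passed in.
print(label_histogram([0, 0, 0, 0]))
```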
The best result so far was a partial run that reported non-zero accuracy, no longer all zeros; but when I continued training, it crashed again:
Traceback (most recent call last):
  File "train.py", line 415, in <module>
    main()
  File "train.py", line 87, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/share/home/ncu_418000220012/project/PointCT/train.py", line 229, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch)
  File "/share/home/ncu_418000220012/project/PointCT/train.py", line 283, in train
    output = model([coord, feat, offset])
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 42 43 85 86 131 132 174 175 220 221 263 264 306 307 349 350 392 393 438 439 481 482
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
  len(cache))
This is the error I am currently stuck on; it blocks the whole run, and I am still investigating it. I hope to resolve it soon and reproduce the metrics reported in the paper.
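The message points at parameters that never contribute to the loss. Before reaching for DDP's `find_unused_parameters=True`, one debugging approach (my own sketch, not PointCT code) is to run a single forward/backward pass in one process and list parameters whose `.grad` is still None afterwards; those are the ones DDP is complaining about.

```python
# Single-process check for parameters that never receive a gradient.
import torch
import torch.nn as nn

class Toy(nn.Module):
    """Hypothetical model with a branch that is deliberately never called."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never used in forward -> no grad

    def forward(self, x):
        return self.used(x)

def unused_parameters(model: nn.Module, x: torch.Tensor) -> list:
    model.zero_grad()
    model(x).sum().backward()
    # Parameters with .grad still None did not participate in the loss.
    return [name for name, p in model.named_parameters() if p.grad is None]

print(unused_parameters(Toy(), torch.randn(2, 4)))  # ['unused.weight', 'unused.bias']
```

Once the offenders are identified, you can either route them into the loss or, as the error message itself suggests, pass `find_unused_parameters=True` when constructing `DistributedDataParallel`.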