Learning goals:
1. Reproduce the code of the PointCT paper.
2. Record the problems encountered along the way.
3. Briefly summarize the process.
Reproducing the PointCT code:
1. Code source: https://github.com/anhthuan1999/PointCT
The first step is the preparation work: setting up the environment and its dependencies. The first problem I hit was a CUDA version mismatch. I was running on the department server, where I believed users were not allowed to install their own CUDA and could only use the preinstalled latest version, CUDA 12.2, so training failed partway through. Fortunately I was able to install a suitable version from the official CUDA download site; I had assumed a user-level install on the server was impossible, but it worked.
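The mismatch was between the toolkit on the server and the CUDA version the PyTorch wheel was built against. The sketch below is my own illustration of the check, not repo code; in practice the two version strings come from `torch.version.cuda` and the output of `nvcc --version`.

```python
# Hypothetical helper, not from the PointCT repo: compare the CUDA version a
# PyTorch wheel was built against with the toolkit version nvcc reports.
# A major-version mismatch (e.g. an 11.x wheel against a 12.2 toolkit) is
# what typically breaks compiled point-cloud CUDA extensions.
def cuda_major(version: str) -> int:
    return int(version.split(".")[0])

def toolchains_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    return cuda_major(torch_cuda) == cuda_major(nvcc_cuda)

print(toolchains_match("11.3", "12.2"))  # False: wheel and toolkit disagree
print(toolchains_match("12.1", "12.2"))  # True: same major version
```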
2. Next, download the dataset. I assumed this simply meant downloading S3DIS, and since I had already downloaded it earlier, I skipped this step. Training then failed with missing .npy files, which made me realize this step also involves an extra preprocessing pass over the dataset.
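The missing .npy files come from that preprocessing pass. Below is a rough sketch of the kind of conversion many S3DIS pipelines perform (merging each room's per-object txt files into one labeled array); the class list, directory layout, and function name are my assumptions, not necessarily what PointCT's scripts do.

```python
# Hypothetical sketch: concatenate a room's per-object txt files
# (columns: x y z r g b) plus a derived label column into one (N, 7) .npy.
import numpy as np
from pathlib import Path

# Assumed S3DIS class order; the repo's preprocessing may differ.
CLASSES = ["ceiling", "floor", "wall", "beam", "column", "window",
           "door", "table", "chair", "sofa", "bookcase", "board", "clutter"]

def room_to_npy(room_dir: Path, out_path: Path) -> None:
    points = []
    for txt in sorted(room_dir.glob("Annotations/*.txt")):
        cls = txt.stem.split("_")[0]  # e.g. "chair_1.txt" -> "chair"
        label = CLASSES.index(cls) if cls in CLASSES else CLASSES.index("clutter")
        xyzrgb = np.loadtxt(txt)                       # (n, 6) point block
        labels = np.full((xyzrgb.shape[0], 1), label)  # label column
        points.append(np.hstack([xyzrgb, labels]))
    np.save(out_path, np.vstack(points).astype(np.float32))
```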
Here is the output of that step.
Here I followed the steps one by one to process the data.
After that comes running train.py.
While running it, I hit a long list of problems in both train.py and s3dis.yaml, and kept debugging and modifying the code. To get the script to run at all I made some changes, but the data was not actually being passed in,
so the reported metrics for many classes were all 0.
Since everything was 0, I stopped the run and tried to locate the problem.
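A quick way to confirm that hypothesis (my own debugging sketch, not repo code) is to histogram the labels the data loader actually yields: if all mass sits in a single class, the label column was never passed through.

```python
import numpy as np

def label_histogram(labels, num_classes: int = 13) -> np.ndarray:
    """Count how many points fall into each of the 13 S3DIS semantic classes."""
    return np.bincount(np.asarray(labels, dtype=np.int64).ravel(),
                       minlength=num_classes)

# Every point landing in class 0 is the signature of labels that
# defaulted to zero because the real labels were never passed in.
print(label_histogram([0, 0, 0, 0]))
```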
The best result so far was a partial run that reported non-zero accuracy, no longer all zeros; but when I continued training, it crashed again:
Traceback (most recent call last):
  File "train.py", line 415, in <module>
    main()
  File "train.py", line 87, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/share/home/ncu_418000220012/project/PointCT/train.py", line 229, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch)
  File "/share/home/ncu_418000220012/project/PointCT/train.py", line 283, in train
    output = model([coord, feat, offset])
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 42 43 85 86 131 132 174 175 220 221 263 264 306 307 349 350 392 393 438 439 481 482
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
/public/home/ncu_418000220012/anaconda3/envs/pointct1/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
  len(cache))
This is the error I am currently stuck on; it blocks the whole run, and I am still investigating it. I hope to resolve it soon and reproduce the metrics reported in the paper.
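The message points at parameters that never contribute to the loss. Before reaching for DDP's `find_unused_parameters=True`, one debugging approach (my own sketch, not PointCT code) is to run a single forward/backward pass in one process and list parameters whose `.grad` is still None afterwards; those are the ones DDP is complaining about.

```python
# Single-process check for parameters that never receive a gradient.
import torch
import torch.nn as nn

class Toy(nn.Module):
    """Hypothetical model with a branch that is deliberately never called."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never used in forward -> no grad

    def forward(self, x):
        return self.used(x)

def unused_parameters(model: nn.Module, x: torch.Tensor) -> list:
    model.zero_grad()
    model(x).sum().backward()
    # Parameters with .grad still None did not participate in the loss.
    return [name for name, p in model.named_parameters() if p.grad is None]

print(unused_parameters(Toy(), torch.randn(2, 4)))  # ['unused.weight', 'unused.bias']
```

Once the offenders are identified, you can either route them into the loss or, as the error message itself suggests, pass `find_unused_parameters=True` when constructing `DistributedDataParallel`.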