1. Symptoms
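After I modified the project, a trial run on my local computer (single GPU) worked fine. After I uploaded it to the server, multi-GPU training failed with an error, while single-GPU training on the same server ran normally. The multi-GPU error is as follows: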
Traceback (most recent call last):
  File "/raid/shen_nn/projects/ultralytics_/train.py", line 28, in <module>
    main_DGX2()
  File "/raid/shen_nn/projects/ultralytics_/train.py", line 18, in main_DGX2
    model.train(**{'data': 'ultralytics/cfg/datasets/WiderPerson_DGX2.yaml',
  File "/raid/shen_nn/projects/ultralytics_/ultralytics/engine/model.py", line 654, in train
    self.trainer.train()
  File "/raid/shen_nn/projects/ultralytics_/ultralytics/engine/trainer.py", line 208, in train
    raise e
  File "/raid/shen_nn/projects/ultralytics_/ultralytics/engine/trainer.py", line 206, in train
    subprocess.run(cmd, check=True)
  File "/raid/shen_nn/anaconda3/envs/yolov8/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
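2. Solution
After modifying the network structure, reinstall the project from source in editable mode. Run the following from the project root (the directory that contains the modified ultralytics package):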
pip install -e .
The installation succeeded.
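Why reinstalling helps is not explained above. A likely reason (an assumption, not verified in the original post) is that multi-GPU training does not run in the current process: the traceback shows trainer.py launching a separate command with subprocess.run. That child process imports ultralytics from whatever is installed in the conda environment, not necessarily from the modified project folder, whereas single-GPU training started from the project folder picks up the local code. The sketch below checks which file a fresh Python process started outside the project folder would import a module from; the helper name resolve_in_fresh_process is hypothetical and not part of ultralytics.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def resolve_in_fresh_process(module_name: str, cwd: str) -> Optional[str]:
    """Return the file a new Python process started in `cwd` would import the
    top-level module `module_name` from, or None if it cannot be found there.

    Hypothetical helper: starting the child outside the project folder mimics a
    launched script that does not live next to the modified source tree.
    """
    code = (
        "import importlib.util\n"
        f"spec = importlib.util.find_spec({module_name!r})\n"
        "print(spec.origin if spec is not None and spec.origin else '')\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    # Assumption: run this from the project root (the folder with the modified ultralytics/).
    project_dir = Path.cwd().resolve()
    with tempfile.TemporaryDirectory() as elsewhere:
        origin = resolve_in_fresh_process("ultralytics", elsewhere)
    print("A fresh process imports ultralytics from:", origin)
    if origin is None or project_dir not in Path(origin).resolve().parents:
        print("This is not the local copy; consider running `pip install -e .` here.")
```

After pip install -e . the reported path should point into the project folder, because an editable install puts the project source on the environment's import path.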
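3. Further notes
While searching for a solution, I also found that the source install can fail with an error such as: "error: package directory 'ultralytics/ipynb_checkpoints' does not exist"
In that case, delete all '.ipynb_checkpoints' directories from the project root with: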
find . -type d -name '.ipynb_checkpoints' -exec rm -rf {} +
Then run pip install -e . again.
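The find command above needs a Unix shell. Where that is not available (for example on Windows), a small Python script can do the same cleanup; this is an alternative sketch, not part of the original fix, and remove_ipynb_checkpoints is a hypothetical helper name. It defaults to a dry run that only lists what it would delete.

```python
import shutil
import sys
from pathlib import Path
from typing import List


def remove_ipynb_checkpoints(root: str = ".", dry_run: bool = True) -> List[Path]:
    """List every `.ipynb_checkpoints` directory under `root`; delete them unless dry_run."""
    # Collect the matches first so the tree is not modified while it is being walked.
    found = sorted(p for p in Path(root).rglob(".ipynb_checkpoints") if p.is_dir())
    if not dry_run:
        for path in found:
            # A nested match may already be gone if its parent match was removed first.
            if path.exists():
                shutil.rmtree(path)
    return found


if __name__ == "__main__":
    # Pass --delete to actually remove the directories; otherwise only print them.
    delete = "--delete" in sys.argv[1:]
    for path in remove_ipynb_checkpoints(".", dry_run=not delete):
        print("removed:" if delete else "would remove:", path)
```

Defaulting to a dry run is a deliberate choice: like rm -rf, the deletion cannot be undone, so it is worth reviewing the list once before passing --delete.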