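When training YOLOv5, the losses come out as nan and the validation metrics as 0.
Training on the CPU works fine; only the GPU produces nan and 0. The cause is usually the CUDA version or the graphics card.
GTX 16XX series cards hit this problem with newer CUDA versions.
For example, my own setup: 飞行堡垒7 (Ryzen edition) laptop, GPU: GTX 1650, PyTorch 1.11.0. The problem occurs with CUDA 11.3, and I also tried CUDA 11.5 with the same result.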
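To see whether a machine matches the setup described above, a small script along these lines can print the PyTorch version, the CUDA version it was built for, and the GPU name. This is only a sketch: the helper names `is_gtx_16xx` and `describe_environment` are made up for illustration, and the name check is a plain pattern match on the device string.

```python
import re


def is_gtx_16xx(device_name: str) -> bool:
    """Return True if the GPU name looks like a GTX 16XX card (e.g. GTX 1650)."""
    return re.search(r"GTX\s*16\d{2}", device_name, flags=re.IGNORECASE) is not None


def describe_environment() -> str:
    """Summarise the PyTorch / CUDA / GPU combination on this machine."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.cuda is None for CPU-only builds.
    lines = [f"torch {torch.__version__}, built for CUDA {torch.version.cuda}"]
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        lines.append(f"GPU: {name}")
        if is_gtx_16xx(name):
            lines.append("GTX 16XX card detected; check the CUDA version if losses are nan")
    else:
        lines.append("CUDA is not available; training would run on the CPU")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_environment())
```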
AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...
Epoch gpu_mem box obj cls labels img_size
0/99 1.88G nan nan nan 10 640: 100%|██████████| 14/14 [00:35<00:00, 2.52s/it]
D:\19837\anaconda3\envs\pytorch\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:07<00:00, 1.09s/it]
all 106 0 0 0 0 0
Epoch gpu_mem box obj cls labels img_size
1/99 1.96G nan nan nan 104 640: 7%|▋ | 1/14 [00:02<00:38, 2.92s/it]
Process finished with exit code -1
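The symptom is easy to spot by eye, but when scanning longer logs it can help to check the loss columns automatically. The sketch below assumes the progress line has the same layout as in the log above (epoch, gpu_mem, box, obj, cls, labels, img_size); `parse_losses` and `has_nan_loss` are hypothetical helper names, not part of YOLOv5.

```python
import math


def parse_losses(progress_line: str):
    """Extract (box, obj, cls) from a training progress line.

    Assumed layout: 'epoch/total gpu_mem box obj cls labels img_size: ...'
    """
    fields = progress_line.split()
    # fields[0] is e.g. '0/99', fields[1] is e.g. '1.88G', fields[2:5] are the losses.
    return tuple(float(value) for value in fields[2:5])


def has_nan_loss(progress_line: str) -> bool:
    """Return True if any of the three loss values is nan."""
    return any(math.isnan(value) for value in parse_losses(progress_line))


if __name__ == "__main__":
    line = "0/99 1.88G nan nan nan 10 640: 100%"
    print(has_nan_loss(line))
```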
The fix is to switch to CUDA 10.2. Download and install CUDA 10.2 directly.
Then download the matching cuDNN package.
After installing CUDA, copy the contents of the cuDNN package into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2. Adjust this path to match your own install location.
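The copy step can be done by hand in Explorer, or scripted roughly as follows. This sketch assumes the extracted cuDNN package contains `bin`, `include` and `lib` folders that mirror the CUDA install folder; `copy_cudnn` is a hypothetical helper, and the extraction path in the usage comment is a placeholder. It needs Python 3.8 or later for `dirs_exist_ok`, and writing under Program Files normally requires an administrator prompt.

```python
import shutil
from pathlib import Path


def copy_cudnn(cudnn_root, cuda_root):
    """Merge cuDNN's bin, include and lib folders into the CUDA install folder.

    Returns the list of subfolders that were copied.
    """
    cudnn_root, cuda_root = Path(cudnn_root), Path(cuda_root)
    copied = []
    for sub in ("bin", "include", "lib"):
        src = cudnn_root / sub
        if src.is_dir():
            # dirs_exist_ok=True merges into the existing CUDA folders
            # without deleting the files that are already there.
            shutil.copytree(src, cuda_root / sub, dirs_exist_ok=True)
            copied.append(sub)
    return copied


# Example call (both paths are placeholders; adjust them to your machine):
# copy_cudnn(r"D:\downloads\cudnn\cuda",
#            r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2")
```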
Then install the cu102 build of PyTorch:
pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
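Before rerunning training, it may be worth confirming that the cu102 build is the one actually being imported and that the GPU returns finite numbers. The following is a rough sketch; `build_tag` and `check_install` are hypothetical names. The half-precision check is my own addition, on the assumption that mixed-precision arithmetic is where the nan values appear; the steps above do not state this, and passing the check does not guarantee that training will be free of nan.

```python
from typing import Optional


def build_tag(version: str) -> Optional[str]:
    """Return the local build tag of a version string, e.g. '1.10.1+cu102' -> 'cu102'."""
    _, sep, tag = version.partition("+")
    return tag if sep else None


def check_install(expected: str = "cu102") -> str:
    """Report whether the expected PyTorch build is installed and the GPU computes finite values."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    tag = build_tag(torch.__version__)
    if tag != expected:
        return f"Expected a {expected} build, found torch {torch.__version__}"
    if not torch.cuda.is_available():
        return "The build tag matches, but CUDA is not available"

    # Small matrix product in float32 and float16; both sums should be finite.
    x = torch.randn(64, 64, device="cuda")
    ok_fp32 = bool(torch.isfinite((x @ x).sum()))
    ok_fp16 = bool(torch.isfinite((x.half() @ x.half()).sum()))
    return f"torch {torch.__version__}: float32 finite={ok_fp32}, float16 finite={ok_fp16}"


if __name__ == "__main__":
    print(check_install())
```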
Now go back and run the training again:
AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp10
Starting training for 100 epochs...
Epoch gpu_mem box obj cls labels img_size
0/99 1.85G 0.1244 0.0515 0.06827 10 640: 100%|██████████| 14/14 [01:59<00:00, 8.53s/it]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:11<00:00, 1.62s/it]
all 106 433 0.00107 0.00842 0.000487 0.000122
Epoch gpu_mem box obj cls labels img_size
1/99 1.96G 0.1171 0.06178 0.06603 63 640: 50%|█████ | 7/14 [00:30<00:30, 4.37s/it]
The losses are now real numbers and the validation metrics are no longer all 0, so the problem is solved.