Fixing all-nan and all-zero results when training YOLOv5

Latest recommended article published 2025-02-13 11:34:29

诶哟喂


Article link: https://blog.csdn.net/u014093296/article/details/135738316


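When training YOLOv5, the training losses (box, obj, cls) come out as nan and the validation metrics as 0. Training on the CPU works fine; the problem only appears on the GPU. The usual cause is the CUDA version in combination with the graphics card.

GTX 16XX-series cards show this problem when a newer CUDA version is used. A failing run looks like this: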

AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...
 
     Epoch   gpu_mem       box       obj       cls    labels  img_size
      0/99     1.88G       nan       nan       nan        10       640: 100%|██████████| 14/14 [00:35<00:00,  2.52s/it]
D:\19837\anaconda3\envs\pytorch\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:07<00:00,  1.09s/it]
                 all        106          0          0          0          0          0
 
     Epoch   gpu_mem       box       obj       cls    labels  img_size
      1/99     1.96G       nan       nan       nan       104       640:   7%|▋         | 1/14 [00:02<00:38,  2.92s/it]
Process finished with exit code -1
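To check whether a machine matches the setup described above, one can print the GPU name and the CUDA version that PyTorch was built against. The following is a minimal sketch, not part of the original workflow: the helper names `is_gtx_16xx` and `describe_gpu_setup` are hypothetical, and the name pattern is an assumption about how the driver reports these cards (for example "NVIDIA GeForce GTX 1650").

```python
import re


def is_gtx_16xx(device_name: str) -> bool:
    """Return True if the name looks like a GTX 16XX card (hypothetical helper).

    Assumes the reported name contains "GTX 16" followed by two digits,
    e.g. "NVIDIA GeForce GTX 1660 Ti".
    """
    return re.search(r"GTX\s*16\d{2}", device_name, flags=re.IGNORECASE) is not None


def describe_gpu_setup() -> str:
    """Summarise the PyTorch / CUDA / GPU combination of the current environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available, training would run on the CPU"
    name = torch.cuda.get_device_name(0)
    note = " (GTX 16XX series)" if is_gtx_16xx(name) else ""
    return f"torch {torch.__version__}, CUDA {torch.version.cuda}, GPU {name}{note}"


if __name__ == "__main__":
    print(describe_gpu_setup())
```

If the output reports a GTX 16XX card together with a CUDA version newer than 10.2, the environment matches the failing configuration described in this article.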

The fix is to switch CUDA to version 10.2. I have prepared the matching CUDA and cuDNN packages (CUDA_10.2.zip). Download link: https://www.123pan.com/s/lgZzVv-eTQk3.html (extraction code: AMDZ)

Then install the cu102 build of PyTorch:

pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
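After the install, it may be worth confirming that the active environment really picked up the cu102 build before starting a long training run. A small sketch under stated assumptions: `is_cu102_build` and `check_install` are hypothetical helper names, and the check relies on the `+cu102` suffix that the wheel in the command above carries in its version string.

```python
def is_cu102_build(torch_version: str) -> bool:
    """Return True if a version string such as "1.10.1+cu102" names a CUDA 10.2 wheel."""
    return torch_version.endswith("+cu102")


def check_install() -> str:
    """Report the installed PyTorch build and whether the GPU is visible to it."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    lines = [
        f"torch version : {torch.__version__}",
        f"built for CUDA: {torch.version.cuda}",
        f"cu102 build   : {is_cu102_build(torch.__version__)}",
        f"GPU available : {torch.cuda.is_available()}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(check_install())
```

If "cu102 build" prints False, pip most likely installed into a different environment or kept a previously installed wheel.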

Now run the training script again:

AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp10
Starting training for 100 epochs...
 
     Epoch   gpu_mem       box       obj       cls    labels  img_size
      0/99     1.85G    0.1244    0.0515   0.06827        10       640: 100%|██████████| 14/14 [01:59<00:00,  8.53s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:11<00:00,  1.62s/it]
                 all        106        433    0.00107    0.00842   0.000487   0.000122
 
     Epoch   gpu_mem       box       obj       cls    labels  img_size
      1/99     1.96G    0.1171   0.06178   0.06603        63       640:  50%|█████     | 7/14 [00:30<00:30,  4.37s/it]
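The difference between the two runs is visible in the box, obj and cls columns of each progress line. As an illustration only, the sketch below parses such a line and reports whether the three losses are finite. It assumes the column layout shown in the logs of this article (epoch, gpu_mem, box, obj, cls, ...); `parse_losses` and `losses_are_finite` are hypothetical helpers, not part of YOLOv5.

```python
import math


def parse_losses(progress_line: str) -> tuple:
    """Extract (box, obj, cls) from a progress line.

    Assumes whitespace-separated columns in the order
    epoch, gpu_mem, box, obj, cls, labels, img_size.
    """
    tokens = progress_line.split()
    return tuple(float(t) for t in tokens[2:5])


def losses_are_finite(progress_line: str) -> bool:
    """Return False if any of the three losses is nan or inf."""
    return all(math.isfinite(v) for v in parse_losses(progress_line))


if __name__ == "__main__":
    before = "0/99     1.88G       nan       nan       nan        10       640:"
    after = "0/99     1.85G    0.1244    0.0515   0.06827        10       640:"
    print(losses_are_finite(before))  # False: the failing run
    print(losses_are_finite(after))   # True: the run after the fix
```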

The losses are now finite and the validation metrics are no longer all zero, which completes the fix.