Fixing YOLOv5 training output that is all nan and 0

xiaoxu_飞

Last modified 2022-04-20 16:00:31

Tags: deep learning, pytorch

First published 2022-04-18 23:28:42

Article link: https://blog.csdn.net/qq_52902342/article/details/124261371

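This article describes how NaN and 0 values appeared while training a YOLOv5 model on a GTX 1650 with CUDA 11.3, and how downgrading CUDA to 10.2 with a matching cuDNN resolved the problem so that training ran normally.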

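When training YOLOv5, the loss values are reported as nan and the validation metrics as 0.

Training on the CPU works fine; the nan and 0 values appear only on the GPU. The usual cause is the combination of the CUDA version and the graphics card.

GTX 16XX series cards show this problem when a newer CUDA version is used.

For example, my own setup: an ASUS 飞行堡垒7 (Ryzen edition) laptop with a GTX 1650. The problem occurred with CUDA 11.3 (I also tried CUDA 11.5) and PyTorch 1.11.0. The training output looked like this: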
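To see whether a machine falls into this category, a small check along the following lines may help. It is only a sketch: the helper names `is_gtx_16xx` and `report_environment` are made up for this example, and the name matching is a simple pattern test on the device string, not an official compatibility list.

```python
import re


def is_gtx_16xx(device_name: str) -> bool:
    """Return True if the device name looks like a GTX 16XX card (e.g. GTX 1650)."""
    return re.search(r"GTX\s*16\d{2}", device_name, flags=re.IGNORECASE) is not None


def report_environment() -> None:
    """Print the PyTorch build, the CUDA version it was built with, and the GPU name."""
    import torch  # imported lazily so is_gtx_16xx works without PyTorch installed

    print("torch:", torch.__version__)
    print("CUDA used by this build:", torch.version.cuda)
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        print("GPU:", name, "| looks like GTX 16XX:", is_gtx_16xx(name))
    else:
        print("No CUDA device visible; training would run on the CPU")
```

Calling `report_environment()` inside the training environment prints the three values that matter here: the PyTorch version, its CUDA version, and the card.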

AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      0/99     1.88G       nan       nan       nan        10       640: 100%|██████████| 14/14 [00:35<00:00,  2.52s/it]
D:\19837\anaconda3\envs\pytorch\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:07<00:00,  1.09s/it]
                 all        106          0          0          0          0          0

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      1/99     1.96G       nan       nan       nan       104       640:   7%|▋         | 1/14 [00:02<00:38,  2.92s/it]
Process finished with exit code -1
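A log like the one above can also be checked automatically. The sketch below scans saved console output for progress rows whose box, obj, or cls loss is nan. The function name `nan_epochs` and the regular expression are assumptions based on the column layout shown in this log; other YOLOv5 versions print different columns and would need a different pattern.

```python
import math
import re
from typing import List

# Matches progress rows such as:
#   "      0/99     1.88G       nan       nan       nan        10       640: 100%|..."
# Groups: epoch, box loss, obj loss, cls loss.
ROW = re.compile(r"^\s*(\d+)/\d+\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)\s+\d+\s+\d+:")


def nan_epochs(log_text: str) -> List[int]:
    """Return the epochs (in order of appearance) whose losses contain nan."""
    bad: List[int] = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header rows, warnings, and validation rows are skipped
        epoch = int(match.group(1))
        try:
            losses = [float(value) for value in match.groups()[1:]]
        except ValueError:
            continue  # not a numeric row after all
        if any(math.isnan(value) for value in losses) and epoch not in bad:
            bad.append(epoch)
    return bad
```

If the returned list is non-empty from the very first epoch, as it would be for the log above, the problem is more likely in the environment than in the dataset or hyperparameters.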

The fix is to switch CUDA to version 10.2. It can be downloaded directly from the link below.

CUDA Toolkit Archive | NVIDIA Developer: https://developer.nvidia.com/cuda-toolkit-archive

cuDNN download:

cuDNN Archive | NVIDIA Developer: https://developer.nvidia.com/rdp/cudnn-archive#a-collapse51b
Select the version that matches your CUDA version.

After installing CUDA, copy the contents of the cuDNN package into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2. Adjust the path to match your own installation.
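The copy step can be done by hand in the file explorer, or scripted roughly as follows. This is a sketch under assumptions: `copy_cudnn` is a hypothetical helper, the extracted cuDNN folder is assumed to contain subfolders that mirror the CUDA install directory, and the source path in the usage note is a placeholder.

```python
import shutil
from pathlib import Path
from typing import List


def copy_cudnn(cudnn_dir: str, cuda_dir: str) -> List[Path]:
    """Merge every file under cudnn_dir into cuda_dir, keeping subfolders.

    Files already present in cuda_dir are left alone unless a cuDNN file
    has the same relative path. Returns the relative paths that were copied.
    """
    src, dst = Path(cudnn_dir), Path(cuda_dir)
    copied: List[Path] = []
    for item in sorted(src.rglob("*")):
        if not item.is_file():
            continue
        relative = item.relative_to(src)
        target = dst / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(item, target)  # copy contents and timestamps
        copied.append(relative)
    return copied
```

A call might look like `copy_cudnn(r"D:\Downloads\cudnn\cuda", r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2")`, where the first path is wherever the archive was extracted. Writing under `C:\Program Files` normally requires an elevated (administrator) prompt.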

Then install the cu102 build of PyTorch:

pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
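After the install it is worth confirming that the cu102 build is the one actually being imported, since an older wheel left in the environment can mask the new one. The following is a minimal sketch; `cuda_tag` and `check_install` are names invented for this example, and it relies on the `+cu102` suffix that wheels from download.pytorch.org carry in `torch.__version__`.

```python
def cuda_tag(torch_version: str) -> str:
    """Return the build tag after '+', e.g. 'cu102' for '1.10.1+cu102' ('' if absent)."""
    _, _, tag = torch_version.partition("+")
    return tag


def check_install(expected_tag: str = "cu102") -> bool:
    """Return True if the imported PyTorch build carries the expected CUDA tag."""
    import torch  # imported lazily so cuda_tag works without PyTorch installed

    tag = cuda_tag(torch.__version__)
    print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
    print("GPU visible:", torch.cuda.is_available())
    return tag == expected_tag
```

If `check_install()` returns False, the environment is probably still importing the previous PyTorch build, and uninstalling it before repeating the pip command above is a reasonable next step.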

Now go back and run the training script again:

AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp10
Starting training for 100 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      0/99     1.85G    0.1244    0.0515   0.06827        10       640: 100%|██████████| 14/14 [01:59<00:00,  8.53s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:11<00:00,  1.62s/it]
                 all        106        433    0.00107    0.00842   0.000487   0.000122

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      1/99     1.96G    0.1171   0.06178   0.06603        63       640:  50%|█████     | 7/14 [00:30<00:30,  4.37s/it]

With that, the problem is solved and training runs normally.