nnU-Net (Part 2): Common Issues and Their Solutions

Latest recommended article published on 2024-07-21 13:09:21

shchojj

Original source: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/common_problems_and_solutions.md

Common Issues and their Solutions

RuntimeError: Expected scalar type half but found float

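This happens when mixed precision is enabled on an older GPU that cannot run some operation in half precision. Both nnUNet_predict and nnUNet_train have flags that force fp32; run nnUNet_predict -h and nnUNet_train -h to find them. Using those flags should resolve the error.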

nnU-Net gets 'stuck' during preprocessing, training or inference

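The cause is on the CPU side: a background worker has stalled or died, and the main process is left waiting for it.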

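There is almost always an error message. It is usually not at the very bottom of the text output but somewhat further up. If you run on a GPU cluster, as the authors do, the message may be WAYYYY off in the log file, sometimes right at the start of training/inference. To locate it, copy stdout into a text editor and search for "error".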
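
The search for "error" can also be scripted. The following is a minimal sketch, not part of nnU-Net; the function name `find_error_lines` and the sample log lines are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple


def find_error_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for every line mentioning 'error'."""
    pattern = re.compile(r"error", re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


# Made-up log excerpt, for illustration only.
sample_log = [
    "epoch 0 started",
    "RuntimeError: worker died",
    "epoch 0 still waiting ...",
]
hits = find_error_lines(sample_log)
for number, text in hits:
    print(f"{number}: {text}")
```

For a real log, pass an open file object instead of the list, for example `find_error_lines(open(path, errors="replace"))`, where `path` points at your saved stdout.
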
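If there is no error message, the process may have been killed by the OS, probably because system RAM was about to run out. In that case, rerun the job and keep an eye on RAM usage (not VRAM). If RAM is full or nearly full, try the following:
1. Reduce the number of background workers. For nnUNet_plan_and_preprocess use -tl and -tf; you may need to go all the way down to 1. For nnUNet_predict, reduce the workers with --num_threads_preprocessing and --num_threads_nifti_save.
2. If even -tf 1 is too much, try setting up swap space on an SSD.
3. Buy more RAM.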
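
To watch RAM while the job runs, a short script can read the kernel's memory statistics. This is a sketch under the assumption of a Linux system that exposes `/proc/meminfo` with a `MemAvailable` field; the helper name `available_ram_fraction` and the sample numbers are invented for illustration.

```python
def available_ram_fraction(meminfo_text: str) -> float:
    """Return MemAvailable / MemTotal parsed from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # first field is the value in kB
    return values["MemAvailable"] / values["MemTotal"]


# Made-up numbers in the /proc/meminfo layout, for illustration only.
sample = (
    "MemTotal:       16000000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:    2000000 kB\n"
)
fraction = available_ram_fraction(sample)
print(f"{fraction:.1%} of RAM still available")
```

On a live system the same function can be fed `open("/proc/meminfo").read()` every few seconds; a fraction that keeps shrinking toward zero before the job dies points to RAM exhaustion.
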

nnU-Net training: RuntimeError: CUDA out of memory

RuntimeError: CUDA out of memory. Tried to allocate 4.16 GiB (GPU 0; 10.76 GiB total capacity; 2.82 GiB already allocated; 4.18 GiB free; 4.33 GiB reserved in total by PyTorch)

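Clearly the GPU has run out of memory. For most datasets nnU-Net uses roughly 8 GB of VRAM; to be sure training runs, the GPU should have at least 11 GB. If other programs also occupy the GPU, for example the display, even less VRAM is left for nnU-Net. Either close those unnecessary programs or move them to another GPU, for example by driving the display from a cheap GPU and training on the expensive one.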
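
The figures in the error message above can be pulled out programmatically when comparing several failed runs. This is only a sketch: the helper `parse_cuda_oom` is hypothetical, and it assumes the message reports every amount in GiB, as in the example shown here (other messages may use MiB).

```python
import re


def parse_cuda_oom(message: str) -> dict:
    """Extract the GiB figures from a PyTorch 'CUDA out of memory' message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            result[name] = float(match.group(1))
    return result


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 4.16 GiB "
    "(GPU 0; 10.76 GiB total capacity; 2.82 GiB already allocated; "
    "4.18 GiB free; 4.33 GiB reserved in total by PyTorch)"
)
info = parse_cuda_oom(message)
print(info)
```
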

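At the start of each training run, cuDNN runs benchmarks to find the fastest convolution algorithm for the current network architecture (the authors use torch.backends.cudnn.benchmark=True). While these benchmarks run, VRAM consumption can spike and briefly exceed 8 GB. If you keep hitting RuntimeError: CUDA out of memory, consider disabling the benchmark by passing the --deterministic flag to nnUNet_train. Only set this flag when necessary, because it slows training down.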
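
For readers who want to see what the benchmark switch looks like in plain PyTorch, here is a small sketch. It is not nnU-Net's own code; the helper name `set_cudnn_benchmark` is made up, and pairing a disabled benchmark with `cudnn.deterministic = True` is an assumption about what a deterministic mode typically does. The import is guarded so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # PyTorch may not be installed where this sketch is run
    torch = None


def set_cudnn_benchmark(enabled: bool) -> bool:
    """Toggle cuDNN autotuning; return False if PyTorch is unavailable."""
    if torch is None:
        return False
    # benchmark=True lets cuDNN try several convolution algorithms up front,
    # which is the step that can briefly inflate VRAM usage.
    torch.backends.cudnn.benchmark = enabled
    # Assumption: deterministic algorithms accompany a disabled benchmark.
    torch.backends.cudnn.deterministic = not enabled
    return True


applied = set_cudnn_benchmark(False)
print("cuDNN benchmark disabled" if applied else "PyTorch not available")
```
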

nnU-Net training in Docker container: RuntimeError: unable to write to file </torch_781_2606105346>

This is a Docker issue. Pass the --ipc=host flag when starting the container.
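
If containers are launched from a script, the flag can be baked into the command that the script builds. A minimal sketch: the image name `my-nnunet-image` and the helper `docker_run_command` are placeholders, and `--gpus all` is just one example of an extra argument.

```python
import shlex


def docker_run_command(image: str, *extra_args: str) -> list:
    """Build a 'docker run' argument list that always includes --ipc=host."""
    # --ipc=host makes the container use the host's IPC namespace,
    # including its shared memory.
    return ["docker", "run", "--ipc=host", *extra_args, image]


cmd = docker_run_command("my-nnunet-image", "--gpus", "all")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` as is.
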

Downloading pretrained models: unzip: cannot find zipfile directory in one of /home/isensee/.nnunetdownload_16031094034174126

Some of the larger zip files can trigger this problem. Download the model manually from zenodo (https://zenodo.org/record/4003545) and install it with nnUNet_install_pretrained_model_from_zip.

Downloading pre-trained models: `unzip: 'unzip' is not recognized as an internal or external command` OR `Command 'unzip' not found`

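Windows and WSL2 systems may not have the unzip command. Either download the pretrained model from zenodo and unzip it yourself, or update nnU-Net to a newer version that uses Python's built-in zipfile module (https://docs.python.org/3/library/zipfile.html) instead.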
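
When unzip is missing, a downloaded archive can also be unpacked by hand with the same standard-library module. The sketch below is not nnU-Net's implementation; `extract_zip` is a hypothetical helper, and the demo archive is a throwaway file, not a real pretrained model.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path: Path, target_dir: Path) -> list:
    """Extract zip_path into target_dir and return the archive's member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Self-contained demo: build a tiny archive, then extract it again.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("model/plans.txt", "placeholder")
    members = extract_zip(demo_zip, tmp_dir / "out")
    extracted_text = (tmp_dir / "out" / "model" / "plans.txt").read_text()

print(members)
```
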

nnU-Net training (2D U-Net): High (and increasing) system RAM usage, OOM

This is a system RAM leak caused by mixed precision. Make sure your cuDNN version matches your PyTorch installation.

nnU-Net training of cascade: Error `seg from prev stage missing`

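You need to run all five folds of 3d_lowres first. The segmentations of the previous stage can only be generated from the validation set of each fold; otherwise the cascade would overfit.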
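
Running all five folds means five separate training commands. The sketch below only generates the command strings; it assumes the nnU-Net v1 argument order (configuration, trainer class, task, fold) and fold numbers 0 to 4, and the task name `Task999_Example` is a placeholder. Check nnUNet_train -h for your installed version before relying on it.

```python
def lowres_fold_commands(task: str, trainer: str = "nnUNetTrainerV2", folds: int = 5) -> list:
    """Return one nnUNet_train command string per 3d_lowres fold."""
    return [f"nnUNet_train 3d_lowres {trainer} {task} {fold}" for fold in range(folds)]


commands = lowres_fold_commands("Task999_Example")  # placeholder task name
for command in commands:
    print(command)
```
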

nnU-Net training: `RuntimeError: CUDA error: device-side assert triggered`

This is usually accompanied by an error message like the following:

void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [4770,0,0], thread: [374,0,0] Assertion indexValue >= 0 && indexValue < tensor.sizes[dim] failed.

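This means the segmentations in your dataset contain unexpected values. nnU-Net expects all labels to be consecutive integers. If your dataset has 4 classes (background and three foreground labels), the labels must be 0, 1, 2, 3 (where 0 must be background!), and the ground truth segmentations must not contain any other value. In other words, the label values must be consecutive and start at 0.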

Run nnUNet_plan_and_preprocess with the --verify_dataset_integrity flag; it checks the labels for invalid values.
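
The rule behind that check can be illustrated in a few lines. This is a sketch of the idea, not the check nnU-Net performs internally; `unexpected_labels` and the sample value lists are made up.

```python
from typing import Iterable, Set


def unexpected_labels(values: Iterable[int], num_classes: int) -> Set[int]:
    """Return label values outside the expected range 0 .. num_classes - 1."""
    expected = set(range(num_classes))
    return set(values) - expected


# Four classes: background (0) plus three foreground labels.
good = [0, 1, 2, 3, 3, 0]
bad = [0, 1, 2, 4, 255]  # 4 and 255 would trip the index assertion

print(unexpected_labels(good, 4))
print(unexpected_labels(bad, 4))
```

With real data, pass the unique values of each segmentation array, for example `numpy.unique(seg).tolist()`. Any value reported here is an index that falls outside the range the assertion in the error message allows.
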