Bug source:
The error message is as follows:
cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Solution:
(1) Reduce batch_size (to 4 or smaller), although this did not help much in my case.
(2) If, after step (1), the following error appears instead:
RuntimeError: CUDA out of memory. Tried to allocate 3.03 GiB (GPU 0; 8.00 GiB total capacity; 409.61 MiB already allocated; 5.88 GiB free; 654.00 MiB reserved in total by PyTorch)
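This does not look like a genuine out-of-memory problem, since the message reports 5.88 GiB free while only 3.03 GiB was requested. A likely cause is that the DataLoaders use too many workers. In my code below, for example, the two loaders add up to 16 workers.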
from torch.utils.data import DataLoader
# Two loaders with num_workers=8 each: 16 worker processes in total.
dataloader_train = DataLoader(train_set, batch_size=args.batch_size, shuffle=True, num_workers=8)
dataloader_val = DataLoader(val_set, batch_size=args.batch_size, shuffle=False, num_workers=8)
trainer.fit(model, dataloader_train, dataloader_val)
Reducing the number of workers fixed it (perhaps the total number of workers should stay below the number of CPU threads?).
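
The guess about keeping the total worker count below the number of CPU threads can be sketched as a small helper. This is only a heuristic under stated assumptions, not an official PyTorch rule: the function name pick_num_workers, its parameters, and the choice to reserve one thread for the main process are all hypothetical and introduced here for illustration.

```python
import os


def pick_num_workers(num_loaders, requested=8, cpu_count=None):
    """Cap per-loader workers so all loaders together fit in the CPU threads.

    Hypothetical helper: the names and the "reserve one thread" policy are
    assumptions for illustration, not part of PyTorch.
    """
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms, so fall back to 1.
        cpu_count = os.cpu_count() or 1
    # Keep one thread for the main training process (an assumption).
    budget = max(cpu_count - 1, 0)
    # Split the remaining threads evenly across the loaders.
    per_loader = budget // max(num_loaders, 1)
    # Never exceed what was requested; 0 means "load in the main process".
    return max(0, min(requested, per_loader))


# Example usage with the two loaders from this post:
# workers = pick_num_workers(num_loaders=2)
# dataloader_train = DataLoader(train_set, batch_size=args.batch_size,
#                               shuffle=True, num_workers=workers)
# dataloader_val = DataLoader(val_set, batch_size=args.batch_size,
#                             shuffle=False, num_workers=workers)
```

On a machine with 8 CPU threads this gives 3 workers per loader instead of 8. A result of 0 is still a valid num_workers value for DataLoader and means the data is loaded in the main process.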