1. Error: ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
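2. Cause: PyTorch's DataLoader worker processes hand batches to the main process over IPC backed by shared memory. When the training code runs inside a Docker container on a server, the container's shared memory is capped (Docker's default /dev/shm is 64 MB), so a large batch size can exhaust it. The environment running the code must therefore have enough shared memory.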
3. Solutions:
(1) Start the Docker container with a larger --shm-size
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1 --shm-size 8G -it ******* env LANG=C.UTF-8 /bin/bash
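(2) Lower the num_workers parameter of the DataLoader. With num_workers=0, data is loaded in the main process, so no worker shared memory is needed, at the cost of slower loading.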
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,  # 0 = load in the main process, no worker IPC over shm
    pin_memory=True,
    collate_fn=dataset.collate_fn
)
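To decide between the two fixes above, it can help to check how much shared memory the container actually has. The following is a minimal sketch using only the Python standard library. It assumes the shared memory mount is /dev/shm (the usual location on Linux). The helper names shm_free_bytes and pick_num_workers and the 1 GiB threshold are hypothetical, chosen only for illustration, and are not part of PyTorch or Docker.

```python
import os
import shutil

SHM_PATH = "/dev/shm"  # assumption: standard shared-memory mount on Linux


def shm_free_bytes(path=SHM_PATH):
    """Return the free bytes on the given mount, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def pick_num_workers(desired, min_free_bytes=1 << 30, path=SHM_PATH):
    """Hypothetical helper: fall back to 0 workers when shared memory looks too small.

    The 1 GiB default threshold is an arbitrary illustration, not a PyTorch rule.
    """
    free = shm_free_bytes(path)
    if free is None or free < min_free_bytes:
        return 0
    return desired


if __name__ == "__main__":
    print("free shm bytes:", shm_free_bytes())
    print("num_workers to use:", pick_num_workers(desired=4))
```

The returned value could then be passed as num_workers when constructing the DataLoader. If the reported free space is close to the 64 MB Docker default, restarting the container with a larger --shm-size is probably the better fix, since num_workers=0 slows down data loading.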