Fixing insufficient shared memory (shm) when running PyTorch models in a Docker container

This post describes an error that occurs when training a model with PyTorch: the DataLoader worker processes crash because the container's shared memory (shm) is too small. There are two fixes: set the DataLoader's num_workers to 0 in the code, or increase the Docker container's shared memory. Each has trade-offs: the former is a simple code change but can slow down training, while the latter generally requires restarting the Docker service or recreating the container.

Error message

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "train_raf-db.py", line 214, in <module>
    run_training()
  File "train_raf-db.py", line 158, in run_training
    outputs, alpha = model(imgs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/nanfang-pytorch1.7/Amend-Representation-Module/src/Networks.py", line 39, in forward
    x = self.features(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
  File "/opt/conda/lib/python3.8/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 103) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Solution

Related posts:
  • ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  • The DataLoader num_workers setting and Docker's shared memory

In short, the problem is caused by the Docker container's default shm_size of 64 MB being too small. There are two ways around it:
1. Set num_workers=0 in the DataLoader. This only requires a small code change (see the sketch after this list), but the downside is slower training, especially for large inputs such as video frames.
2. Increase the container's shared memory. Most methods require restarting the Docker service or recreating the container; restarting the Docker daemon on a shared server is usually not an option, so recreating the container with a larger shm size is the more practical choice (see the docker example after this list).
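A minimal sketch of option 1 (the dataset class and tensor shapes below are placeholders, not from the original training script): the only change that matters is num_workers=0, which keeps data loading in the main process so no worker process needs to pass tensors through /dev/shm.

import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    # Hypothetical in-memory dataset standing in for the real image dataset
    def __init__(self):
        self.data = torch.randn(100, 3, 224, 224)
        self.labels = torch.zeros(100, dtype=torch.long)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# num_workers=0: data is loaded in the main process, so no shared-memory
# segments are needed for inter-process tensor transfer
train_loader = DataLoader(MyDataset(), batch_size=64, shuffle=True, num_workers=0)

for imgs, labels in train_loader:
    pass  # training step goes here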
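For option 2, a sketch of how the container could be recreated with a larger /dev/shm (the image name and the 8 GB size are placeholders; adjust them to your setup). --shm-size and --ipc=host are standard docker run flags, and df -h /dev/shm can confirm the effective size inside the container.

# Check the current shm size inside the container (the default is 64M)
df -h /dev/shm

# Recreate the container with a larger /dev/shm, e.g. 8 GB
docker run --gpus all --shm-size=8g -it pytorch/pytorch:latest /bin/bash

# Alternative: share the host's IPC namespace so /dev/shm is not capped at 64M
docker run --gpus all --ipc=host -it pytorch/pytorch:latest /bin/bash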
