Fixing insufficient shared memory (shm) when running PyTorch models in a Docker container

This post describes an error that occurs when training a model with PyTorch: the DataLoader worker processes crash because the container's shared memory (shm) is too small. There are two fixes: set the DataLoader's num_workers to 0 in the code, or increase the Docker container's shared memory. Each has trade-offs: the former is a simple code change but can slow down training, while the latter generally requires restarting the Docker service or recreating the container.

Error message

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "train_raf-db.py", line 214, in <module>
    run_training()
  File "train_raf-db.py", line 158, in run_training
    outputs, alpha = model(imgs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/nanfang-pytorch1.7/Amend-Representation-Module/src/Networks.py", line 39, in forward
    x = self.features(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
  File "/opt/conda/lib/python3.8/site-packages/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 103) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Solution

Related posts:
  • ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  • The DataLoader num_workers setting and Docker's shared memory

In short, the problem is caused by the Docker container's default shm_size of 64 MB being too small. There are two ways around it:
1. Set num_workers=0 in the DataLoader. This only requires a small code change (see the sketch after this list), but the downside is slower training, especially for large inputs such as video frames.
2. Increase the container's shared memory. Most methods require restarting the Docker service or recreating the container; restarting the Docker daemon on a shared server is usually not an option, so recreating the container with a larger shm size is the more practical choice (see the docker example after this list).
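A minimal sketch of option 1 (the dataset class and tensor shapes below are placeholders, not from the original training script): the only change that matters is num_workers=0, which keeps data loading in the main process so no worker process needs to pass tensors through /dev/shm.

import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    # Hypothetical in-memory dataset standing in for the real image dataset
    def __init__(self):
        self.data = torch.randn(100, 3, 224, 224)
        self.labels = torch.zeros(100, dtype=torch.long)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# num_workers=0: data is loaded in the main process, so no shared-memory
# segments are needed for inter-process tensor transfer
train_loader = DataLoader(MyDataset(), batch_size=64, shuffle=True, num_workers=0)

for imgs, labels in train_loader:
    pass  # training step goes here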
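For option 2, a sketch of how the container could be recreated with a larger /dev/shm (the image name and the 8 GB size are placeholders; adjust them to your setup). --shm-size and --ipc=host are standard docker run flags, and df -h /dev/shm can confirm the effective size inside the container.

# Check the current shm size inside the container (the default is 64M)
df -h /dev/shm

# Recreate the container with a larger /dev/shm, e.g. 8 GB
docker run --gpus all --shm-size=8g -it pytorch/pytorch:latest /bin/bash

# Alternative: share the host's IPC namespace so /dev/shm is not capped at 64M
docker run --gpus all --ipc=host -it pytorch/pytorch:latest /bin/bash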
