A PyTorch DataLoader pitfall: RuntimeError: DataLoader worker (pid(s) 6700, 10620) exited unexpectedly

小镇拾光

Article link: https://blog.csdn.net/qq_38662733/article/details/108549461

A PyTorch DataLoader pitfall

Environment
Problem background
Analyzing the error message
Attempted fixes based on the analysis
Summary

Environment:

OS: Windows 10
PyTorch version: 1.5.1+cu101
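Because this failure depends on how the operating system starts processes and not only on the PyTorch version, it is worth recording the default multiprocessing start method alongside the versions. A minimal sketch (describe_environment is a hypothetical helper of my own, not part of PyTorch):

```python
import multiprocessing
import platform


def describe_environment():
    """Hypothetical helper: collect the details that matter for this error."""
    try:
        import torch
        torch_version = torch.__version__
    except ImportError:  # PyTorch is not installed in this interpreter
        torch_version = None
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch": torch_version,
        # Typically "spawn" on Windows (and on macOS since Python 3.8),
        # "fork" on Linux.
        "start_method": multiprocessing.get_start_method(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```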

Problem background:

Here is the code (taken from the 莫烦Python / Morvan Python tutorials):

import torch
import torch.utils.data as Data

print(torch.__version__)
BATCH_SIZE = 5

x = torch.linspace(1, 10, 10)       # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = torch.linspace(10, 1, 10)       # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,       # reshuffle the samples every epoch
    num_workers=2,      # load batches in 2 worker processes
)

# if __name__ == '__main__':   # guard commented out, so the loop below runs at module level
for epoch in range(3):
    for step, (batch_x, batch_y) in enumerate(loader):
        print('Epoch:', epoch, '|Step', step, '|batch x:', batch_x.numpy(), '|batch_y:', batch_y.numpy())

Error message:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "C:\Softwares\Proware\Deeplearn\Anaconda3\envs\YOLOV5\lib\site-packages\torch\utils\data\dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "C:\Softwares\Proware\Deeplearn\Anaconda3\envs\YOLOV5\lib\multiprocessing\queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/WorkSpace/Codes/Pycharm/2020/NPyTorch/Minibatchtraing.py", line 24, in <module>
    for step, (batch_x, batch_y) in enumerate(loader):
  File "C:\Softwares\Proware\Deeplearn\Anaconda3\envs\YOLOV5\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
    data = self._next_data()
  File "C:\Softwares\Proware\Deeplearn\Anaconda3\envs\YOLOV5\lib\site-packages\torch\utils\data\dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "C:\Softwares\Proware\Deeplearn\Anaconda3\envs\YOLOV5\lib\site-packages\torch\utils\data\dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "C:\Softwares\Proware\Deeplearn\Anaconda3\envs\YOLOV5\lib\site-packages\torch\utils\data\dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 8528, 8488) exited unexpectedly

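Analyzing the error message

The error is raised at: for step, (batch_x, batch_y) in enumerate(loader):
Two things stand out in the error message:
1. The DataLoader worker processes listed under pid(s) exited abnormally, which points to the num_workers=2 setting.
2. The message says: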

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

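It mentions two multiprocessing concepts: the freeze_support() function and the fork start method (the DataLoader workers are processes, not threads).
So the error is almost certainly caused by how the worker processes are started.
The message also spells out the proper idiom: the main module should put the code that starts processes under if __name__ == '__main__':, calling freeze_support() first (that line can be omitted unless the program is frozen into an executable).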
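For reference, here is the idiom the message asks for in a plain multiprocessing script. This is a minimal sketch using multiprocessing.Pool instead of a DataLoader; square and main are illustrative names, not from the original post.

```python
import multiprocessing as mp


def square(n):
    # Worker functions live at module level so that child processes can import them.
    return n * n


def main():
    # Everything that starts processes goes in here, never at module level.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))


if __name__ == "__main__":
    mp.freeze_support()  # no effect unless this is a frozen Windows executable
    print(main())  # [0, 1, 4, 9, 16]
```

Because the pool is created only inside the guard, a spawned child that re-imports this file just defines square and main and stops, instead of trying to start a pool of its own.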

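Attempted fixes based on the analysis

1. Reduce the number of worker processes: change num_workers=2 to num_workers=0 in the DataLoader so that all data is loaded in the main process. It runs successfully!
2. Use the multiprocessing idiom: add if __name__ == '__main__': before the for loop (uncomment the guard and indent the loop under it). It also runs successfully!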
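Both fixes applied to the script above, as a sketch: the loop is wrapped in hypothetical make_loader and run helpers so that nothing starts worker processes at import time.

```python
import torch
import torch.utils.data as Data

BATCH_SIZE = 5


def make_loader(num_workers):
    # Same toy data as the original script: x and y run in opposite
    # directions, so every (x, y) pair sums to 11.
    x = torch.linspace(1, 10, 10)
    y = torch.linspace(10, 1, 10)
    dataset = Data.TensorDataset(x, y)
    return Data.DataLoader(
        dataset=dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        num_workers=num_workers,
    )


def run(loader, epochs=3, verbose=False):
    # Iterate over the loader and collect the batches, printing them if asked.
    batches = []
    for epoch in range(epochs):
        for step, (batch_x, batch_y) in enumerate(loader):
            if verbose:
                print('Epoch:', epoch, '|Step', step,
                      '|batch x:', batch_x.numpy(), '|batch_y:', batch_y.numpy())
            batches.append((batch_x, batch_y))
    return batches


if __name__ == '__main__':
    # Fix 2: worker processes are started only from inside the guard, so a
    # spawned worker that re-imports this file does not start workers itself.
    run(make_loader(num_workers=2), verbose=True)

    # Fix 1: with num_workers=0 everything is loaded in the main process and
    # no worker processes are started (can be slower on large datasets).
    run(make_loader(num_workers=0), epochs=1, verbose=True)
```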

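Summary

After asking classmates who know processes well and reading up afterwards, here is what causes the problem:

With num_workers > 0, the DataLoader loads data in separate worker processes (not threads) created through Python's multiprocessing module. On Windows, multiprocessing can only start processes with the spawn method: each worker is a fresh Python interpreter that re-imports the main script before running anything. Any code at the top level of the script, including the for loop over loader, therefore runs again inside every worker; when that loop tries to start workers of its own during this bootstrapping phase, multiprocessing raises the first RuntimeError above, the worker dies, and the DataLoader in the parent reports that its workers exited unexpectedly.
On Linux the default start method is fork, which copies the already-running parent process instead of re-importing the script, so the same code works there without the guard (macOS has defaulted to spawn since Python 3.8, so it needs the guard too). Putting the loop under if __name__ == '__main__': keeps it from running in the workers. freeze_support() itself is only needed when the script is frozen into a Windows executable; otherwise that line can be omitted, as the error message says.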
