PyTorch DataLoader source code walkthrough

Contents

Overview:

DistributedSampler:

BatchSampler

Multi-process iterator:

Summary:


Overview:

DataLoader: the main class; it ties everything together and holds the main control logic.

Sampler: an interface that defines how indices are generated and shuffled.

Dataset: an interface that the user's data class implements.

_BaseDataLoaderIter: the iterator, with single-process and multi-process (main-process) variants.

Design pattern: ?

DataLoader:

Initialization:

        self.batch_size = batch_size        # batch size
        self.drop_last = drop_last          # drop the last batch if it is smaller than batch_size
        self.sampler = sampler              # index-order strategy
        self.batch_sampler = batch_sampler  # batching strategy
        self.generator = generator          # RNG used for shuffling
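
For context, a minimal sketch of how these fields get populated when a DataLoader is constructed (the dataset and values here are only illustrative):

import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Toy dataset; any Dataset with __len__/__getitem__ works here.
ds = TensorDataset(torch.arange(10).float())

# These keyword arguments end up in the fields listed above.
loader = DataLoader(
    ds,
    batch_size=4,                 # -> self.batch_size
    drop_last=True,               # -> self.drop_last
    sampler=RandomSampler(ds),    # -> self.sampler (index-order strategy)
    generator=torch.Generator(),  # -> self.generator
)
for (batch,) in loader:
    print(batch.shape)            # torch.Size([4]) twice; the last 2 samples are dropped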

DistributedSampler:

    r"""Sampler that restricts data loading to a subset of the dataset.

    It is especially useful in conjunction with
    :class:`torch.nn.parallel.DistributedDataParallel`. In such a case, each
    process can pass a :class:`~torch.utils.data.DistributedSampler` instance as a
    :class:`~torch.utils.data.DataLoader` sampler, and load a subset of the
    original dataset that is exclusive to it.

 __iter__:

The indices are shuffled with a random permutation. Note that len(self.dataset) is used, so any Dataset subclass used with this sampler must implement __len__.

 indices = torch.randperm(len(self.dataset), generator=g).tolist()
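
Right after the shuffle, the sampler strides this list by num_replicas starting at the process's rank (after padding so the length divides evenly), so every rank gets a disjoint subset of indices. A toy illustration of that partitioning (not the sampler's actual code):

import torch

# Toy illustration of the rank split: 8 samples, world size 2, no padding needed.
g = torch.Generator()
g.manual_seed(0)                          # the real sampler seeds with seed + epoch
indices = torch.randperm(8, generator=g).tolist()

num_replicas = 2                          # world size
for rank in range(num_replicas):
    # equivalent to indices[rank:total_size:num_replicas] in the source
    print(rank, indices[rank::num_replicas])

Because the permutation is seeded with seed + epoch, DistributedSampler.set_epoch(epoch) has to be called at the start of every epoch, otherwise each epoch reuses the same shuffle.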

BatchSampler

class BatchSampler(Sampler[List[int]]):
    r"""Wraps another sampler to yield a mini-batch of indices.

    Args:
        sampler (Sampler or Iterable): Base sampler. Can be any iterable object
        batch_size (int): Size of mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``

    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    """

sampler: the underlying sampler that yields the (possibly shuffled) indices.

batch_size: the mini-batch size.

drop_last: whether to drop the final batch when it is smaller than batch_size.

        if self.drop_last:  # only yield full batches
            sampler_iter = iter(self.sampler)
            while True:
                try:
                    # with batch_size == 3, pull the next 3 indices from the sampler
                    batch = [next(sampler_iter) for _ in range(self.batch_size)]
                    yield batch
                except StopIteration:
                    # if only 2 indices remain but batch_size is 3, next() raises
                    # StopIteration, so the partial batch is silently dropped
                    break
        else:
            batch = [0] * self.batch_size
            idx_in_batch = 0
            for idx in self.sampler:
                batch[idx_in_batch] = idx
                idx_in_batch += 1
                if idx_in_batch == self.batch_size:
                    yield batch
                    idx_in_batch = 0
                    batch = [0] * self.batch_size
            if idx_in_batch > 0:
                # yield the remaining, smaller-than-batch_size batch
                yield batch[:idx_in_batch]
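
For reference, a BatchSampler can be handed straight to DataLoader through the batch_sampler argument, in which case batch_size, shuffle, sampler and drop_last must be left at their defaults; a minimal sketch:

import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

ds = TensorDataset(torch.arange(10).float())
bs = BatchSampler(SequentialSampler(ds), batch_size=3, drop_last=False)

# Each list of indices produced by the BatchSampler is fetched as one batch.
loader = DataLoader(ds, batch_sampler=bs)
for (batch,) in loader:
    print(batch)   # batches of size 3, 3, 3 and a final batch of size 1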

Multi-process iterator:

Trace order when debugging:

1. DataLoader.__iter__:

    def __iter__(self) -> '_BaseDataLoaderIter':
        # When using a single worker the returned iterator should be
        # created every time to avoid resetting its state.
        # However, in the case of a multiple workers iterator
        # the iterator is only created once in the lifetime of the
        # DataLoader object so that workers can be reused.
        if self.persistent_workers and self.num_workers > 0:
            # taken when persistent_workers is set and num_workers > 0
            if self._iterator is None:
                self._iterator = self._get_iterator()
            else:
                self._iterator._reset(self)
            return self._iterator
        else:
            return self._get_iterator()
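
An illustration of the branch above: with persistent_workers=True (and num_workers > 0) the multi-process iterator is built once and only _reset between epochs, so the same worker processes serve every epoch; without it, each `for batch in loader` loop spawns fresh workers. A minimal sketch:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    ds = TensorDataset(torch.arange(100).float())

    # The iterator (and its two workers) is created on the first epoch and
    # reused afterwards; __iter__ only calls _reset on the cached iterator.
    loader = DataLoader(ds, batch_size=10, num_workers=2, persistent_workers=True)

    for epoch in range(3):
        for (batch,) in loader:
            pass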

 2. DataLoader._get_iterator:

    def _get_iterator(self) -> '_BaseDataLoaderIter':
        if self.num_workers == 0:
            return _SingleProcessDataLoaderIter(self)
        else:
            # multi-process path: the multi-process iterator is created here
            self.check_worker_number_rationality()
            return _MultiProcessingDataLoaderIter(self)

3. _MultiProcessingDataLoaderIter:

class _MultiProcessingDataLoaderIter(_BaseDataLoaderIter):

    # Our data model looks like this (queues are indicated with curly brackets):
    #
    #                main process                              ||
    #                     |                                    ||
    #               {index_queue}                              ||
    #                     |                                    ||
    #              worker processes                            ||     DATA
    #                     |                                    ||
    #            {worker_result_queue}                         ||     FLOW
    #                     |                                    ||
    #      pin_memory_thread of main process                   ||   DIRECTION
    #                     |                                    ||
    #               {data_queue}                               ||
    #                     |                                    ||
    #                data output                               \/
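
One detail worth noting: the pin_memory_thread stage in the diagram only exists when the DataLoader was created with pin_memory=True (and CUDA is available); otherwise worker_result_queue is used directly as data_queue. A small sketch with illustrative values:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    ds = TensorDataset(torch.arange(100).float())
    # With pin_memory=True a thread in the main process copies each batch into
    # page-locked memory, which makes later .to('cuda', non_blocking=True) faster.
    loader = DataLoader(ds, batch_size=10, num_workers=2, pin_memory=True)
    for (batch,) in loader:
        # pinning only happens when a CUDA device is available
        assert batch.is_pinned() or not torch.cuda.is_available()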

 index_queue:

In __init__:

        self._index_queues = []
        self._workers = []
        for i in range(self._num_workers):
            # No certainty which module multiprocessing_context is
            index_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
            # Need to `cancel_join_thread` here!
            # See sections (2) and (3b) above.
            index_queue.cancel_join_thread()
            w = multiprocessing_context.Process(
                target=_utils.worker._worker_loop,
                args=(self._dataset_kind, self._dataset, index_queue,
                      self._worker_result_queue, self._workers_done_event,
                      self._auto_collation, self._collate_fn, self._drop_last,
                      self._base_seed, self._worker_init_fn, i, self._num_workers,
                      self._persistent_workers, self._shared_seed))
            w.daemon = True
            # NB: Process.start() actually take some time as it needs to
            #     start a process and pass the arguments over via a pipe.
            #     Therefore, we only add a worker to self._workers list after
            #     it started, so that we do not call .join() if program dies
            #     before it starts, and __del__ tries to join but will get:
            #     AssertionError: can only join a started process.
            w.start()
            self._index_queues.append(index_queue)
            self._workers.append(w)

This loop starts num_workers worker processes.

        self._reset(loader, first_iter=True)

The last line of __init__ calls _reset:

    def _reset(self, loader, first_iter=False):
        super()._reset(loader, first_iter)
        self._send_idx = 0  # idx of the next task to be sent to workers
        self._rcvd_idx = 0  # idx of the next task to be returned in __next__
        # information about data not yet yielded, i.e., tasks w/ indices in range [rcvd_idx, send_idx).
        # map: task idx => - (worker_id,)        if data isn't fetched (outstanding)
        #                  \ (worker_id, data)   if data is already fetched (out-of-order)
        self._task_info = {}
        self._tasks_outstanding = 0  # always equal to count(v for v in task_info.values() if len(v) == 1)
        # A list of booleans representing whether each worker still has work to
        # do, i.e., not having exhausted its iterable dataset object. It always
        # contains all `True`s if not using an iterable-style dataset
        # (i.e., if kind != Iterable).
        # Note that this indicates that a worker still has work to do *for this epoch*.
        # It does not mean that a worker is dead. In case of `_persistent_workers`,
        # the worker will be reset to available in the next epoch.
        self._workers_status = [True for i in range(self._num_workers)]
        # Reset the worker queue cycle so it resumes next epoch at worker 0
        self._worker_queue_idx_cycle = itertools.cycle(range(self._num_workers))
        # We resume the prefetching in case it was enabled
        if not first_iter:
            for idx in range(self._num_workers):
                self._index_queues[idx].put(_utils.worker._ResumeIteration(self._shared_seed))
            resume_iteration_cnt = self._num_workers
            while resume_iteration_cnt > 0:
                return_idx, return_data = self._get_data()
                if isinstance(return_idx, _utils.worker._ResumeIteration):
                    assert return_data is None
                    resume_iteration_cnt -= 1
        # prime the prefetch loop
        for _ in range(self._prefetch_factor * self._num_workers):
            self._try_put_index()
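
A consequence of the prime loop at the end: before the first batch is ever returned, prefetch_factor * num_workers index batches have already been queued. With hypothetical values:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    ds = TensorDataset(torch.arange(1000).float())
    # With num_workers=4 and prefetch_factor=2 (illustrative values), _reset
    # queues 4 * 2 = 8 index batches up front, so up to 8 batches can be
    # in flight before the training loop consumes the first one.
    loader = DataLoader(ds, batch_size=32, num_workers=4, prefetch_factor=2)
    first_batch = next(iter(loader))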

_reset then calls _try_put_index:

    def _try_put_index(self):
        assert self._tasks_outstanding < self._prefetch_factor * self._num_workers

        try:
            index = self._next_index()
        except StopIteration:
            return
        for _ in range(self._num_workers):  # find the next active worker, if any
            worker_queue_idx = next(self._worker_queue_idx_cycle)
            if self._workers_status[worker_queue_idx]:
                break
        else:
            # not found (i.e., didn't break)
            return

        self._index_queues[worker_queue_idx].put((self._send_idx, index))
        self._task_info[self._send_idx] = (worker_queue_idx,)
        self._tasks_outstanding += 1
        self._send_idx += 1

    def _next_index(self):
        return next(self._sampler_iter)  # may raise StopIteration

_next_index ultimately pulls the next batch of indices from the sampler iterator.

What the worker processes do:

def _worker_loop(dataset_kind, dataset, index_queue, data_queue, done_event,
                 auto_collation, collate_fn, drop_last, base_seed, init_fn, worker_id,
                 num_workers, persistent_workers, shared_seed):



    In the end each worker simply puts the task index and the fetched data into the result queue:
    data_queue.put((idx, data))
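
A much-simplified sketch of what _worker_loop boils down to (seeding, error propagation, shutdown and the persistent-worker resume logic are all omitted; this is not the real implementation):

def simplified_worker_loop(dataset, index_queue, data_queue, collate_fn, done_event):
    # Each worker blocks on its own index_queue, fetches the samples for the
    # received indices via Dataset.__getitem__, collates them, and puts the
    # result (keyed by the task index) into the shared result queue.
    while not done_event.is_set():
        idx, batch_indices = index_queue.get()
        samples = [dataset[i] for i in batch_indices]
        data_queue.put((idx, collate_fn(samples)))

In the real code the fetch goes through a _DatasetFetcher chosen by dataset_kind (map-style vs. iterable-style), and any exception raised in a worker is wrapped and re-raised in the main process.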

On the main-process side, __next__ eventually calls:

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        data = self._data_queue.get(timeout=timeout)

Summary:

Execution flow: initialize the DataLoader -> build the index strategy (sampler) -> iterate via __next__ -> pick the single-process or multi-process iterator -> worker processes read indices, fetch the data and put it into a queue -> the main process takes the data out of the queue -> iteration completes.

Sampler: owns the shuffling strategy; it builds the index sequence from the dataset's __len__.

BatchSampler: slices the index stream into lists of batch_size indices.

_MultiProcessingDataLoaderIter: drives the iteration; because batches are prefetched into queues, memory pressure can be significant.

Application-facing interface (a combined sketch follows at the end of this section):

Implement a Dataset with:

__len__: needed to build the index/sampling strategy.

__getitem__: called for each index when a batch is fetched.

collate_fn: an optional user callback for extra processing of a batch.

DDP mode: use DistributedSampler.

The DDP synchronization process is quite involved; I have not fully worked it out.
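
Putting the application-facing pieces together, a minimal end-to-end sketch with a made-up dataset, a custom collate_fn, and the DDP swap noted in a comment:

import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset: item i is the pair (i, i*i)."""

    def __len__(self):            # used by the sampler to build the index list
        return 100

    def __getitem__(self, idx):   # called once per index when a batch is fetched
        return idx, idx * idx


def collate(samples):
    """Custom collate_fn: turn a list of (x, y) pairs into two tensors."""
    xs, ys = zip(*samples)
    return torch.tensor(xs), torch.tensor(ys)


if __name__ == "__main__":
    loader = DataLoader(SquaresDataset(), batch_size=8, shuffle=True,
                        num_workers=2, collate_fn=collate)
    # Under DDP, replace shuffle=True with sampler=DistributedSampler(dataset)
    # and call sampler.set_epoch(epoch) at the start of every epoch.
    for xs, ys in loader:
        assert torch.equal(ys, xs * xs)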
