On DataLoader's default_collate(batch): Exploring Why a Batch Is Automatically Converted to Tensors


Offer.harvester


Category: NLP入门 (Introduction to NLP). Tags: batch, deep learning, pytorch

Article link: https://blog.csdn.net/qq_39072627/article/details/121926553


1. The samples in the Dataset that I pass to my DataLoader look like this:

([101, 860, 7741, 9648, 2330, 2292, 953, 1921, 2248, 7987, 6381, 1282, 1920, 1158, 3173, 3519, 6229, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 17, 8)
([101, 8183, 2399, 7188, 3409, 2458, 5709, 2501, 4307, 849, 4373, 5101, 5708, 113, 5299, 1745, 114, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 17, 5)
([101, 1398, 3635, 143, 5500, 7674, 4899, 8038, 3949, 5500, 5367, 7030, 1726, 6444, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 14, 2)
......

Each sample has the form (token_ids, attention_mask, seq_length, label),
where

token_ids:list
attention_mask:list
seq_length:int
label:int
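
For illustration, here is a minimal sketch of how one such sample could be assembled. It assumes the two lists are padded with 0 up to a fixed length of 32, which appears to match the rows above. The helper name build_sample and its parameters are hypothetical; they are not taken from the original preprocessing code.

```python
def build_sample(token_ids, label, max_len=32, pad_id=0):
    """Pad token_ids to max_len and return
    (token_ids, attention_mask, seq_length, label)."""
    seq_length = len(token_ids)
    if seq_length > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - seq_length
    padded = list(token_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * seq_length + [0] * n_pad
    return padded, attention_mask, seq_length, label


# The 14 unpadded ids of the third sample shown above, with its label 2.
ids = [101, 1398, 3635, 143, 5500, 7674, 4899, 8038,
       3949, 5500, 5367, 7030, 1726, 6444]
sample = build_sample(ids, label=2)
print(sample[2], sample[3], len(sample[0]), len(sample[1]))
```

Note that every field is a plain Python list or int at this point; nothing is a Tensor yet.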

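from torch.utils.data import DataLoader

# train_data is a list of samples in the format described above
dataloader = DataLoader(train_data, batch_size=500, shuffle=True)

for data in dataloader:
    print(data)  # print the collated batch, not the whole dataset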

Each data here is one batch, and every element inside it has been converted from plain Python lists and ints into Tensors.
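
To see what actually comes out, the collate function can be called directly on a tiny hand-made batch. This is only a sketch: the two samples are made up and shortened to length 4 instead of 32, and default_collate is imported from torch.utils.data.dataloader, which is where the PyTorch versions I am aware of expose it.

```python
import torch
from torch.utils.data.dataloader import default_collate

# Two made-up samples in the (token_ids, attention_mask, seq_length, label) format.
batch = [
    ([101, 7, 8, 0], [1, 1, 1, 0], 3, 1),
    ([101, 9, 0, 0], [1, 1, 0, 0], 2, 0),
]

token_ids, attention_mask, seq_length, label = default_collate(batch)

# The int fields become 1-D tensors with one entry per sample.
print(seq_length, label)

# The list fields become a list with one tensor per token position,
# each holding that position's value across the whole batch.
print(type(token_ids), len(token_ids), token_ids[0])

# Stacking along dim=1 recovers the usual [batch_size, seq_len] layout.
token_matrix = torch.stack(token_ids, dim=1)
print(token_matrix.shape)
```

So a list field does not arrive as one [batch_size, seq_len] tensor. It arrives as seq_len separate tensors of length batch_size, which is easy to miss when only looking at the printed output.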

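The cause is the torch.tensor(batch) call in the default_collate source below. Each sample is a tuple, so it goes through the Sequence branch, which regroups the batch field by field with zip(*batch) and collates each field recursively; the int values finally land in the int branch: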

def default_collate(batch):
    r"""Puts each data field into a tensor with outer dimension batch size"""

    elem = batch[0]
    elem_type = type(elem)
    if isinstance(elem, torch.Tensor):
        out = None
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum(x.numel() for x in batch)
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))

            return default_collate([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, collections.abc.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate(samples) for samples in zip(*batch)))
    elif isinstance(elem, collections.abc.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            raise RuntimeError('each element in list of batch should be of equal size')
        transposed = zip(*batch)
        # [1,2,3]
        return [default_collate(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))
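
The path a sample takes through these branches can be traced without PyTorch. The sketch below is a deliberately reduced stand-in, not the library code: it keeps only the int and Sequence branches and returns a tagged tuple at the point where the real function would call torch.tensor(batch). The name collate_sketch and the sample values are hypothetical.

```python
from collections.abc import Sequence


def collate_sketch(batch):
    """Mimic only the int and Sequence branches of default_collate."""
    elem = batch[0]
    if isinstance(elem, int):
        # Stand-in for torch.tensor(batch): one value per sample.
        return ("tensor", list(batch))
    if isinstance(elem, Sequence) and not isinstance(elem, str):
        # zip(*batch) regroups the data: first field by field,
        # then, one level down, token position by token position.
        return [collate_sketch(samples) for samples in zip(*batch)]
    raise TypeError(f"unsupported type: {type(elem)}")


batch = [
    ([101, 7, 8, 0], [1, 1, 1, 0], 3, 1),
    ([101, 9, 0, 0], [1, 1, 0, 0], 2, 0),
]
token_ids, attention_mask, seq_length, label = collate_sketch(batch)

print(seq_length)    # int field: a single "tensor" for the whole batch
print(token_ids[1])  # list field: one "tensor" per token position
```

The recursion reaches the int branch once for each int field, and once per token position for each list field, which is why every leaf of the batch ends up as a Tensor.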
