For Easier Study: torchtext.data Documentation Translated, Continued: Iterators

chuanyang09

Published 2023-04-12 20:20:35

Tags: learning, pytorch, artificial intelligence

Article link: https://blog.csdn.net/u014474004/article/details/130115326

torchtext

The torchtext package consists of data processing utilities and popular datasets for natural language.

(1) Iterator

# Defines an iterator that loads batches of data from a Dataset.

class torchtext.data.Iterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

'''
Variables:	
dataset – The Dataset object to load Examples from.

batch_size – Batch size.

batch_size_fn – Function of three arguments (the new example to add, the current count of examples in the batch, and the current effective batch size) that returns the new effective batch size resulting from adding that example to the batch. This is useful for dynamic batching, where the function adds the number of tokens in the new example to the current effective batch size.

sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. The sort_key provided to the Iterator constructor overrides the sort_key attribute of the Dataset; if it is None, the Dataset's sort_key is used instead.

train – Whether the iterator represents a training set.

repeat – Whether to repeat the iterator for multiple epochs. Default: False.

shuffle – Whether to shuffle examples between epochs.

sort – Whether to sort examples according to self.sort_key. Note that shuffle defaults to train and sort defaults to (not train).

sort_within_batch – Whether to sort (in descending order according to self.sort_key) within each batch. If None, defaults to self.sort. If self.sort is True and this is False, the batch is left in the original (ascending) sorted order.

device (str or torch.device) – A string or instance of torch.device specifying which device the tensors will be created on. If left as default, the tensors are created on the CPU. Default: None.
'''
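The batch_size_fn description above is easier to follow with a small example. The following is a minimal pure-Python sketch of dynamic batching; it does not call torchtext, and the names dynamic_batches and token_count_fn are hypothetical, introduced only for illustration. It assumes the effective batch size is a token budget and that a batch is closed as soon as adding an example would exceed that budget.

```python
def dynamic_batches(examples, batch_size, batch_size_fn=None):
    """Group examples so that the effective batch size stays within batch_size.

    Illustrative sketch only, not torchtext's implementation.
    """
    if batch_size_fn is None:
        # Default: the effective size is simply the number of examples.
        def batch_size_fn(new, count, sofar):
            return count

    minibatch, size_so_far = [], 0
    for ex in examples:
        minibatch.append(ex)
        size_so_far = batch_size_fn(ex, len(minibatch), size_so_far)
        if size_so_far == batch_size:
            # Budget reached exactly: emit the batch and start fresh.
            yield minibatch
            minibatch, size_so_far = [], 0
        elif size_so_far > batch_size:
            # The newest example overflowed the budget: emit everything
            # before it and start a new batch holding only that example.
            if minibatch[:-1]:
                yield minibatch[:-1]
            minibatch = minibatch[-1:]
            size_so_far = batch_size_fn(ex, 1, 0)
    if minibatch:
        yield minibatch


def token_count_fn(new, count, sofar):
    # Hypothetical batch_size_fn: add the new example's token count
    # to the current effective batch size.
    return sofar + len(new)


# Five toy "sentences" with 3, 2, 4, 1 and 5 tokens, and a budget of 6 tokens.
sentences = [["tok"] * n for n in (3, 2, 4, 1, 5)]
batches = list(dynamic_batches(sentences, batch_size=6, batch_size_fn=token_count_fn))
batch_lengths = [[len(s) for s in b] for b in batches]
print(batch_lengths)  # [[3, 2], [4, 1], [5]]
```

With a token budget, short sentences are packed many to a batch while long sentences get smaller batches, which keeps the amount of work per batch roughly even.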

# Initialize self. See help(type(self)) for accurate signature.
__init__(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
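The way shuffle, sort and sort_within_batch fall back on train when left as None can be written out explicitly. This is a small sketch based only on the parameter descriptions above; resolve_flags is a hypothetical helper, not part of torchtext.

```python
def resolve_flags(train=True, shuffle=None, sort=None, sort_within_batch=None):
    """Resolve the None defaults as the parameter descriptions state them.

    Hypothetical helper for illustration; not a torchtext function.
    """
    # shuffle defaults to train.
    shuffle = train if shuffle is None else shuffle
    # sort defaults to (not train).
    sort = (not train) if sort is None else sort
    # sort_within_batch defaults to the resolved value of sort.
    if sort_within_batch is None:
        sort_within_batch = sort
    return {"shuffle": shuffle, "sort": sort, "sort_within_batch": sort_within_batch}


train_flags = resolve_flags(train=True)
eval_flags = resolve_flags(train=False)
print(train_flags)  # {'shuffle': True, 'sort': False, 'sort_within_batch': False}
print(eval_flags)   # {'shuffle': False, 'sort': True, 'sort_within_batch': True}
```

In other words, a training iterator shuffles and does not sort by default, while a validation or test iterator sorts and does not shuffle.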

# Return the examples in the dataset in order, sorted, or shuffled.
data()
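The three possible orderings returned by data() can be sketched in plain Python. This example assumes that sorting takes precedence over shuffling when both flags are set; ordered_examples and its rng parameter are hypothetical names, and the code does not use torchtext.

```python
import random


def ordered_examples(examples, sort=False, shuffle=False, sort_key=None, rng=None):
    """Return examples sorted, shuffled, or in their original order.

    Assumption for this sketch: sort wins over shuffle when both are True.
    """
    if sort:
        return sorted(examples, key=sort_key)
    if shuffle:
        rng = rng or random.Random()
        shuffled = list(examples)
        rng.shuffle(shuffled)  # in-place shuffle of the copy
        return shuffled
    return list(examples)


sentences = ["a b c", "a", "a b c d", "a b"]
by_length = ordered_examples(sentences, sort=True, sort_key=lambda s: len(s.split()))
mixed = ordered_examples(sentences, shuffle=True, rng=random.Random(0))
as_is = ordered_examples(sentences)
print(by_length)  # ['a', 'a b', 'a b c', 'a b c d']
```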

# Set up the batch generator for a new epoch.
init_epoch()

# Create Iterator objects for multiple splits of a dataset.

'''
Parameters =>:	
datasets – Tuple of Dataset objects corresponding to the splits. The first such object should be the train set.

batch_sizes – Tuple of batch sizes to use for the different splits, or None to use the same batch_size for all splits.

Remaining keyword arguments – Passed to the constructor of the iterator class being used.

'''
classmethod splits(datasets, batch_sizes=None, **kwargs)
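To make the splits parameters concrete, here is a sketch of how the per-split settings could be derived. It is an illustration under stated assumptions, not torchtext's code: plan_splits is a hypothetical function that returns plain dictionaries instead of constructing iterators, and it assumes only the first split is treated as the train set.

```python
def plan_splits(datasets, batch_sizes=None, **kwargs):
    """Compute constructor settings for one iterator per split.

    Hypothetical sketch: returns dicts rather than real Iterator objects.
    """
    if batch_sizes is None:
        # No per-split sizes given: reuse the single batch_size for every split.
        batch_sizes = [kwargs.pop("batch_size")] * len(datasets)
    plans = []
    for index, dataset in enumerate(datasets):
        plans.append({
            "dataset": dataset,
            "batch_size": batch_sizes[index],
            # Assumption: only the first dataset is the train set.
            "train": index == 0,
            **kwargs,  # remaining keyword arguments are passed through unchanged
        })
    return tuple(plans)


same_size = plan_splits(("train", "valid", "test"), batch_size=32, device="cpu")
per_split = plan_splits(("train", "valid", "test"), batch_sizes=(32, 64, 64), device="cpu")
print([p["batch_size"] for p in per_split])  # [32, 64, 64]
```

Larger batch sizes for the validation and test splits are a common choice because no gradients are kept during evaluation.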

(2) BucketIterator

'''
Defines an iterator that batches examples of similar lengths together.

Minimizes the amount of padding needed while producing freshly shuffled batches for each new epoch. See pool for the bucketing procedure used.

'''
class torchtext.data.BucketIterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
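The bucketing idea (similar lengths together, yet freshly shuffled batches) can be sketched as follows. This is a simplified illustration and not the library's pool function: bucket_batches, chunks and the chunk_factor parameter are hypothetical names, and the default factor of 100 is an assumption. The sketch cuts the data into large chunks, sorts each chunk by length, slices it into batches, and shuffles the order of the batches within the chunk.

```python
import random


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bucket_batches(examples, batch_size, key, chunk_factor=100, rng=None):
    """Batch examples of similar length together, in shuffled batch order.

    Simplified sketch; chunk_factor is an assumed tuning knob.
    """
    rng = rng or random.Random()
    for chunk in chunks(list(examples), batch_size * chunk_factor):
        # Sorting inside the chunk puts similar lengths next to each other.
        chunk_batches = list(chunks(sorted(chunk, key=key), batch_size))
        # Shuffling whole batches keeps each batch homogeneous in length.
        rng.shuffle(chunk_batches)
        yield from chunk_batches


sentences = [["tok"] * n for n in (5, 1, 4, 2, 3, 6, 8, 7)]
batches = list(bucket_batches(sentences, batch_size=2, key=len,
                              chunk_factor=2, rng=random.Random(0)))
length_pairs = sorted(tuple(len(s) for s in b) for b in batches)
print(length_pairs)  # [(1, 2), (3, 6), (4, 5), (7, 8)]
```

Because sorting happens only inside a chunk, examples still mix across epochs when the input order is shuffled, while each batch needs little padding.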

(3) BPTTIterator

'''
Defines an iterator for language modeling tasks that use BPTT.

Provides contiguous streams of examples together with targets that are one timestep further forward, for language modeling training with backpropagation through time (BPTT). Expects a Dataset with a single example and a single field called 'text', and produces Batches with text and target attributes.
'''

class torchtext.data.BPTTIterator(dataset, batch_size, bptt_len, **kwargs)

'''
Variables =>:	
dataset – The Dataset object to load Examples from.

batch_size – Batch size.

bptt_len – Length of sequences for backpropagation through time.

sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. The sort_key provided to the Iterator constructor overrides the sort_key attribute of the Dataset; if it is None, the Dataset's sort_key is used instead.

train – Whether the iterator represents a training set.

repeat – Whether to repeat the iterator for multiple epochs. Default: False.

shuffle – Whether to shuffle examples between epochs.

sort – Whether to sort examples according to self.sort_key. Note that shuffle defaults to train and sort defaults to (not train).

device (str or torch.device) – A string or instance of torch.device specifying which device the tensors will be created on. If left as default, the tensors are created on the CPU. Default: None.

'''

# Initialize self. See help(type(self)) for accurate signature.
__init__(dataset, batch_size, bptt_len, **kwargs)
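The relationship between text and target in BPTT batching can be shown on a toy token stream. The sketch below is pure Python and does not use torchtext; bptt_batches and pad_id are hypothetical names, and the layout (one contiguous stream per batch column, time steps as rows) is an assumption made for illustration. Each target is the text shifted one time step forward.

```python
def bptt_batches(token_ids, batch_size, bptt_len, pad_id=0):
    """Yield (text, target) pairs where target is text shifted by one step.

    Illustrative sketch; each pair is a list of time steps, and each time
    step holds one token id per batch column.
    """
    # Pad the stream so it divides evenly into batch_size contiguous rows.
    n_cols = -(-len(token_ids) // batch_size)  # ceiling division
    padded = list(token_ids) + [pad_id] * (n_cols * batch_size - len(token_ids))
    rows = [padded[r * n_cols:(r + 1) * n_cols] for r in range(batch_size)]
    # Transpose to time-major order: data[t] holds step t of every stream.
    data = [list(step) for step in zip(*rows)]
    for i in range(0, len(data) - 1, bptt_len):
        # The last window may be shorter; one step is reserved for the target.
        seq_len = min(bptt_len, len(data) - i - 1)
        yield data[i:i + seq_len], data[i + 1:i + 1 + seq_len]


pairs = list(bptt_batches(list(range(10)), batch_size=2, bptt_len=3))
first_text, first_target = pairs[0]
print(first_text)    # [[0, 5], [1, 6], [2, 7]]
print(first_target)  # [[1, 6], [2, 7], [3, 8]]
```

Here the stream 0..9 is split into two contiguous columns (0..4 and 5..9), so at every time step the model sees one token from each column and is asked to predict the token that follows it.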