Study Notes: A Translation of the torchtext.data Documentation

chuanyang09

Article link: https://blog.csdn.net/u014474004/article/details/130088534

torchtext

The torchtext package consists of data processing utilities and popular datasets for natural language.

The torchtext.data documentation is organized into: Dataset, Batch, and Example; Fields; Iterators; Pipeline; Functions.

(1) Dataset

# Defines a dataset composed of Examples along with its Fields.
'''
Variables =>:
sort_key (callable) – A key to use for sorting dataset examples so that examples with similar lengths are batched together, which minimizes padding.
examples (list(Example)) – The examples in this dataset.
fields (dict[str, Field]) – Contains the name of each column or field, together with the corresponding Field object. Two fields with the same Field object will have a shared vocabulary.
'''


class torchtext.data.Dataset(examples, fields, filter_pred=None)
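To make the purpose of sort_key concrete, here is a minimal pure-Python sketch. It does not import torchtext; the token lists, `make_batches`, and `padding_needed` are hypothetical helpers made up for illustration. It only shows why sorting by length before batching reduces padding.

```python
# Illustrative stand-in, not torchtext code: each "example" is a list of tokens.
examples = [
    ["a"],
    ["a", "b", "c", "d", "e", "f"],
    ["a", "b"],
    ["a", "b", "c", "d", "e"],
]


def sort_key(example):
    """Hypothetical sort key: the number of tokens in the example."""
    return len(example)


def make_batches(items, batch_size):
    """Cut a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def padding_needed(batches):
    """Count the pad tokens needed to bring every example up to its batch's longest."""
    total = 0
    for batch in batches:
        longest = max(len(ex) for ex in batch)
        total += sum(longest - len(ex) for ex in batch)
    return total


unsorted_batches = make_batches(examples, 2)
sorted_batches = make_batches(sorted(examples, key=sort_key), 2)

print(padding_needed(unsorted_batches))  # 8 pad tokens
print(padding_needed(sorted_batches))    # 2 pad tokens
```

In this toy data, batching in the original order pairs very short examples with very long ones, while sorting first places similar lengths side by side.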

# Create a dataset from a list of Examples and Fields.
'''
Parameters =>:
examples – List of Examples.
fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field.
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
'''

__init__(examples, fields, filter_pred=None)
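The filter_pred behavior described above can be sketched in a few lines of plain Python. This is an illustration of the documented semantics only, not the library's code; `build_examples` and the dict-shaped examples are assumptions made for the sketch.

```python
def build_examples(examples, filter_pred=None):
    """Keep only examples for which filter_pred(example) is true; keep all if None."""
    if filter_pred is None:
        return list(examples)
    return [ex for ex in examples if filter_pred(ex)]


# Hypothetical examples, modeled as dicts instead of torchtext Example objects.
raw = [
    {"text": ["good", "movie"], "label": "pos"},
    {"text": [], "label": "neg"},
    {"text": ["bad"], "label": "neg"},
]

# Drop examples whose text is empty.
kept = build_examples(raw, filter_pred=lambda ex: len(ex["text"]) > 0)
print(len(kept), len(build_examples(raw)))
```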

# Download and unzip an online archive (.zip, .gz, or .tgz).

'''
Parameters =>:
root (str) – Folder to download data to.
check (str or None) – Folder whose existence indicates that the dataset has already been downloaded, or None to check the existence of root/{cls.name}.

Returns: Path to extracted dataset.

Return type: str
'''

classmethod download(root, check=None)
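The role of the check parameter can be illustrated with a small sketch of the "has this already been downloaded?" test. This is an assumption-laden illustration, not torchtext's implementation: `already_downloaded` is a hypothetical helper, and its `name` argument stands in for cls.name.

```python
import os
import tempfile


def already_downloaded(root, name, check=None):
    """Return True if the folder that marks a finished download exists.

    If check is None, the marker folder is root/name (name stands in for cls.name).
    """
    marker = os.path.join(root, name) if check is None else check
    return os.path.isdir(marker)


with tempfile.TemporaryDirectory() as root:
    before = already_downloaded(root, "my_dataset")
    os.makedirs(os.path.join(root, "my_dataset"))
    after = already_downloaded(root, "my_dataset")
    # An explicit check folder overrides the default root/name location.
    explicit = already_downloaded(root, "my_dataset",
                                  check=os.path.join(root, "missing"))

print(before, after, explicit)
```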

# Remove unknown words from dataset examples with respect to the given fields.
'''
Parameters =>:
field_names (list(str)) – Within each example, only the parts whose field names are in field_names will have their unknown words deleted.
'''


filter_examples(field_names)
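As a rough illustration of what "removing unknown words" means, the sketch below filters tokens against a per-field vocabulary. It is a pure-Python stand-in: `filter_unknown`, the dict-shaped examples, and the `vocabs` mapping are hypothetical, whereas in torchtext the vocabulary would come from the Field itself.

```python
def filter_unknown(examples, vocabs, field_names):
    """Drop out-of-vocabulary tokens, but only in the fields listed in field_names."""
    for example in examples:
        for name in field_names:
            known = vocabs[name]
            example[name] = [word for word in example[name] if word in known]
    return examples


# Hypothetical vocabularies, one set of known words per field name.
vocabs = {"text": {"good", "movie"}, "title": {"review"}}
data = [{"text": ["good", "zzz", "movie"], "title": ["qqq", "review"]}]

filter_unknown(data, vocabs, field_names=["text"])
print(data[0])
```

Only the "text" field is cleaned here; "title" keeps its unknown word because it was not listed in field_names.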

# Create train-test(-valid?) splits from the instance’s examples.
'''
Parameters =>:
split_ratio (float or list of floats) – a number in [0, 1] denoting the amount of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).

stratified (bool) – whether the sampling should be stratified. Default is False.

strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.

random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().

Returns:
Datasets for train, validation, and test splits in that order, if the splits are provided.

Return type: Tuple[Dataset]

'''

split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
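The split_ratio rules above are easier to follow with a worked sketch. The code below is a simplified pure-Python reading of the description, not torchtext's implementation: `normalize_ratio` and `split_examples` are hypothetical names, an integer `seed` replaces the random.getstate() tuple, and stratified sampling is left out.

```python
import random


def normalize_ratio(split_ratio):
    """Return (train, test, valid) fractions that sum to 1."""
    if isinstance(split_ratio, float):
        if not 0.0 < split_ratio < 1.0:
            raise ValueError("split_ratio must be strictly between 0 and 1")
        return (split_ratio, 1.0 - split_ratio, 0.0)
    if len(split_ratio) not in (2, 3):
        raise ValueError("expected 2 or 3 relative sizes")
    total = float(sum(split_ratio))
    fractions = [r / total for r in split_ratio]
    if len(fractions) == 2:
        fractions.append(0.0)  # no valid split requested
    return tuple(fractions)


def split_examples(examples, split_ratio=0.7, seed=0):
    """Shuffle, then cut into (train, rest) or (train, valid, test)."""
    train_f, test_f, valid_f = normalize_ratio(split_ratio)
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(round(train_f * n))
    if valid_f == 0.0:
        # Two-way split: everything after the train part forms the second split.
        return shuffled[:train_end], shuffled[train_end:]
    test_end = train_end + int(round(test_f * n))
    train = shuffled[:train_end]
    test = shuffled[train_end:test_end]
    valid = shuffled[test_end:]
    # Sizes are given as train, test, valid; results come back as train, valid, test.
    return train, valid, test


data = list(range(10))
train, rest = split_examples(data, 0.7)
tr, va, te = split_examples(data, [7, 2, 1])
print(len(train), len(rest))
print(len(tr), len(va), len(te))
```

Note the asymmetry the documentation describes: the relative sizes are listed as train, test, valid, while the returned tuple is ordered train, validation, test.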

# Create Dataset objects for multiple splits of a dataset.

'''
Parameters =>:
path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
root (str) – Root dataset storage directory. Default is ‘.data’.
train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
keyword arguments (Remaining) – Passed to the constructor of the Dataset (sub)class being used.

Returns:
Datasets for train, validation, and test splits in that order, if provided.

Return type: Tuple[Dataset]
'''

classmethod splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs)
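The way a common path prefix and per-split suffixes combine can be sketched as follows. This is an illustration under assumptions: `split_paths` is a hypothetical helper, and joining prefix and suffix with a path separator is one reading of "suffix to add to path", not something the text above states.

```python
import posixpath


def split_paths(path, train=None, validation=None, test=None):
    """Return one file path per requested split, in train/validation/test order."""
    suffixes = (train, validation, test)
    # Splits whose suffix is None are skipped entirely.
    return tuple(posixpath.join(path, s) for s in suffixes if s is not None)


paths = split_paths("data/sst", train="train.tsv", test="test.tsv")
print(paths)
```

Because validation is None in this call, only two paths come back, mirroring how the method returns only the splits that were provided.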

(2) TabularDataset

# Defines a Dataset of columns stored in CSV, TSV, or JSON format.

class torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

# Create a TabularDataset given a path, file format, and field list.

'''
Parameters =>:
path (str) – Path to the data file.
format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).
fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) –
If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored.
If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to load.

skip_header (bool) – Whether to skip the first line of the input file.
csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.
'''


__init__(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
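The two shapes of the fields argument are easier to compare side by side. The sketch below mimics them with the standard csv and json modules only. It is not TabularDataset itself: `read_columns` and `read_json_lines` are hypothetical helpers, each "field" is a plain callable standing in for a Field object, and one JSON object per line is an assumed file layout.

```python
import csv
import io
import json


def read_columns(text, fields, delimiter=",", skip_header=False):
    """List form: fields is [(name, field)] in column order; field None skips a column."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    if skip_header:
        next(rows)  # discard the header line
    examples = []
    for row in rows:
        examples.append({name: field(value)
                         for (name, field), value in zip(fields, row)
                         if field is not None})
    return examples


def read_json_lines(text, fields):
    """Dict form: fields maps source key -> (new name, field); other keys are ignored."""
    examples = []
    for line in text.splitlines():
        record = json.loads(line)
        examples.append({name: field(record[key])
                         for key, (name, field) in fields.items()})
    return examples


csv_text = "id,text,label\n1,good movie,pos\n2,bad,neg\n"
list_fields = [("id", None), ("text", str.split), ("label", str)]
csv_examples = read_columns(csv_text, list_fields, skip_header=True)

json_text = '{"sentence": "good movie", "y": "pos", "extra": 1}'
dict_fields = {"sentence": ("text", str.split), "y": ("label", str)}
json_examples = read_json_lines(json_text, dict_fields)

print(csv_examples)
print(json_examples)
```

The list form depends on column order and uses (name, None) to skip the id column. The dict form selects by key, renames "sentence" to "text", and silently ignores "extra".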

(3) Batch

# Defines a batch of examples along with its Fields.

'''
Variables =>:
batch_size – Number of examples in the batch.
dataset – A reference to the dataset object the examples come from (which itself contains the dataset’s Field objects).
train – Deprecated: this attribute is left for backwards compatibility, however it is UNUSED as of the merger with pytorch 0.4.
input_fields – The names of the fields that are used as input for the model.
target_fields – The names of the fields that are used as targets during model training.
'''

# Also stores the Variable for each column in the batch as an attribute.

class torchtext.data.Batch(data=None, dataset=None, device=None)

# Create a Batch from a list of examples.
__init__(data=None, dataset=None, device=None)

# Create a Batch directly from a number of Variables.

classmethod fromvars(dataset, batch_size, train=None, **kwargs)
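The idea that a batch exposes each column as an attribute can be sketched with a tiny stand-in class. `MiniBatch` is hypothetical and keeps plain Python lists; a real Batch would hold padded, numericalized tensors, which this sketch deliberately leaves out.

```python
class MiniBatch:
    """Stand-in for a batch: one attribute per field, holding that column's values."""

    def __init__(self, examples, input_fields, target_fields):
        self.batch_size = len(examples)
        self.input_fields = list(input_fields)
        self.target_fields = list(target_fields)
        for name in self.input_fields + self.target_fields:
            # Gather the same field from every example into one column.
            setattr(self, name, [ex[name] for ex in examples])


rows = [
    {"text": ["good", "movie"], "label": "pos"},
    {"text": ["bad"], "label": "neg"},
]
batch = MiniBatch(rows, input_fields=["text"], target_fields=["label"])
print(batch.batch_size, batch.text, batch.label)
```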

(4) Example

# Defines a single training or test example.
# Stores each column of the example as an attribute.

class torchtext.data.Example

classmethod fromCSV(data, fields, field_to_index=None)

classmethod fromJSON(data, fields)

classmethod fromdict(data, fields)

classmethod fromlist(data, fields)

classmethod fromtree(data, fields, subtrees=False)
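As a rough picture of "each column stored as an attribute", here is a minimal stand-in with two of the constructors. `MiniExample` is hypothetical; each field is modeled as a plain callable applied to the raw value, which is an assumption made to keep the sketch free of torchtext imports.

```python
class MiniExample:
    """Stand-in for an example: one attribute per named column."""

    @classmethod
    def fromlist(cls, data, fields):
        """fields is [(name, field)] aligned with data; field None skips the column."""
        ex = cls()
        for (name, field), value in zip(fields, data):
            if field is not None:
                setattr(ex, name, field(value))
        return ex

    @classmethod
    def fromdict(cls, data, fields):
        """fields maps a source key to (attribute name, field)."""
        ex = cls()
        for key, (name, field) in fields.items():
            setattr(ex, name, field(data[key]))
        return ex


from_list = MiniExample.fromlist(
    ["1", "good movie", "pos"],
    [("id", None), ("text", str.split), ("label", str)],
)
from_dict = MiniExample.fromdict(
    {"sentence": "bad", "y": "neg"},
    {"sentence": ("text", str.split), "y": ("label", str)},
)
print(from_list.text, from_list.label, from_dict.text, from_dict.label)
```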

Translated from: torchtext.data — torchtext 0.4.0 documentation
