(1) Dataset

# Defines a dataset composed of Examples along with its Fields.
Variables =>:
sort_key (callable) – A key to use for sorting dataset examples, so that examples with similar lengths are batched together and padding is minimized.
examples (list(Example)) – The examples in this dataset.
fields (dict[str, Field]) – Contains the name of each column or field, together with the corresponding Field object. Two fields with the same Field object will have a shared vocabulary.

class torchtext.data.Dataset(examples, fields, filter_pred=None)
# Create a dataset from a list of Examples and Fields.
Parameters =>:
examples – List of Examples.
fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field. 
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None. 

__init__(examples, fields, filter_pred=None)
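The roles of filter_pred and sort_key can be illustrated without torchtext. The following is a minimal sketch, assuming each example exposes a `text` attribute holding a list of tokens; the `SimpleNamespace` objects are hypothetical stand-ins for Example, not the library's own class.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Example objects: each has a `text` attribute.
examples = [
    SimpleNamespace(text=["a", "short", "one"]),
    SimpleNamespace(text=[]),
    SimpleNamespace(text=["a", "much", "longer", "example", "here"]),
    SimpleNamespace(text=["tiny"]),
]

def filter_pred(example):
    # Keep only examples that contain at least one token.
    return len(example.text) > 0

def sort_key(example):
    # Sorting by length puts similarly sized examples next to each other,
    # so a batch built from neighbours needs little padding.
    return len(example.text)

kept = [ex for ex in examples if filter_pred(ex)]
ordered = sorted(kept, key=sort_key)
lengths = [len(ex.text) for ex in ordered]
print(lengths)  # [1, 3, 5]
```

Passing `filter_pred=None` corresponds to skipping the list comprehension and keeping every example.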
# Download and unzip an online archive (.zip, .gz, or .tgz).

Parameters => :	
root (str) – Folder to download data to.
check (str or None) – Folder whose existence indicates that the dataset has already been downloaded, or None to check the existence of root/{cls.name}. 

Returns: Path to extracted dataset.

Return type: str

classmethod download(root, check=None)
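The meaning of the check parameter can be sketched in isolation. This is an illustrative helper, not the library's code: `needs_download` and the dataset name "demo" are hypothetical, and the sketch assumes that "already downloaded" simply means the checked folder exists.

```python
import os
import tempfile

def needs_download(root, name, check=None):
    # With check=None, fall back to root/name, mirroring root/{cls.name}.
    folder = os.path.join(root, name) if check is None else check
    return not os.path.isdir(folder)

with tempfile.TemporaryDirectory() as root:
    first = needs_download(root, "demo")       # folder does not exist yet
    os.makedirs(os.path.join(root, "demo"))    # pretend the archive was extracted
    second = needs_download(root, "demo")      # folder now exists

print(first, second)  # True False
```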
# Remove unknown words from dataset examples with respect to given field.
Parameters =>:
field_names (list(str)) – Within each example, only the parts whose field names are in field_names will have their unknown words deleted.

filter_examples(field_names)
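The effect of removing unknown words can be sketched as follows. This is an assumption-laden illustration: `filter_unknown` is a hypothetical helper, the vocabulary is a plain Python set rather than a Field's vocab, and examples are stand-in objects with list-valued attributes.

```python
from types import SimpleNamespace

# Hypothetical known vocabulary; in torchtext this would come from a Field.
vocab = {"the", "cat", "sat"}

def filter_unknown(examples, field_names, vocab):
    # Only the attributes named in field_names are filtered; others are untouched.
    for ex in examples:
        for name in field_names:
            tokens = getattr(ex, name)
            setattr(ex, name, [tok for tok in tokens if tok in vocab])

examples = [SimpleNamespace(text=["the", "zebra", "sat"], label=["pos"])]
filter_unknown(examples, ["text"], vocab)

print(examples[0].text)   # ['the', 'sat']
print(examples[0].label)  # ['pos']
```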


# Create train-test(-valid?) splits from the instance's examples.
Parameters =>:	
split_ratio (float or list of floats) – a number in [0, 1] denoting the fraction of the data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).

stratified (bool) – whether the sampling should be stratified. Default is False.

strata_field (str) – name of the examples Field stratified over. Default is 'label' for the conventional label field.

random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().

Returns: Datasets for train, validation, and test splits in that order, if the splits are provided.

Return type: Tuple[Dataset]


split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
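How a split_ratio value turns into split sizes can be sketched in plain Python. This is not the library's implementation: `normalize_ratios` and `split_examples` are hypothetical helpers, the rounding rule and the use of an integer `seed` (instead of a random.getstate() tuple) are assumptions, and stratified sampling is left out.

```python
import random

def normalize_ratios(split_ratio):
    # A single float is the share of the first split; the remainder is the second.
    if isinstance(split_ratio, float):
        if not 0.0 < split_ratio < 1.0:
            raise ValueError("split_ratio must lie strictly between 0 and 1")
        return [split_ratio, 1.0 - split_ratio]
    # A list holds relative sizes for two or three splits.
    if len(split_ratio) not in (2, 3):
        raise ValueError("expected 2 or 3 relative sizes")
    total = sum(split_ratio)
    return [r / total for r in split_ratio]

def split_examples(examples, split_ratio=0.7, seed=0):
    ratios = normalize_ratios(split_ratio)
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    parts, start = [], 0
    for ratio in ratios[:-1]:
        end = start + int(round(ratio * len(shuffled)))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last split takes whatever is left
    return tuple(parts)

train, rest = split_examples(list(range(10)))
print(len(train), len(rest))  # 7 3

three = split_examples(list(range(10)), [6, 2, 2])
print([len(part) for part in three])  # [6, 2, 2]
```

Because the list form holds relative sizes, `[6, 2, 2]` and `[0.6, 0.2, 0.2]` describe the same split.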
# Create Dataset objects for multiple splits of a dataset.

Parameters =>:	
path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
root (str) – Root dataset storage directory. Default is '.data'.
train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
Remaining keyword arguments – Passed to the constructor of the Dataset (sub)class being used.

Returns: Datasets for train, validation, and test splits in that order, if provided.

Return type: Tuple[Dataset]

classmethod splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs)
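The way the common prefix and the per-split suffixes combine can be sketched as below. This is a hedged illustration: `split_paths` is a hypothetical helper, the file names are invented, and the sketch assumes each suffix is joined to the prefix as a path component (posixpath is used only to keep the output identical on every platform).

```python
import posixpath

def split_paths(path, train=None, validation=None, test=None):
    # Splits whose suffix is None are skipped; the order is train, validation, test.
    suffixes = (train, validation, test)
    return tuple(posixpath.join(path, s) for s in suffixes if s is not None)

paths = split_paths("data/reviews", train="train.tsv", test="test.tsv")
print(paths)  # ('data/reviews/train.tsv', 'data/reviews/test.tsv')
```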

(2) TabularDataset

# Defines a Dataset of columns stored in CSV, TSV, or JSON format.

class torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
# Create a TabularDataset given a path, file format, and field list.

Parameters =>:	
path (str) – Path to the data file.
format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).
fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) – If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored.
If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to load.

skip_header (bool) – Whether to skip the first line of the input file.
csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

__init__(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
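The list form of fields, together with skip_header and csv_reader_params, can be sketched with the standard csv module. This is only an illustration: `read_columns` is a hypothetical helper, the data is read from a string instead of a file path, and plain callables (`str.split`, `str`) stand in for Field objects.

```python
import csv
import io

def read_columns(text, fields, skip_header=False, **csv_reader_params):
    # Extra keyword arguments are forwarded to csv.reader, e.g. delimiter="\t".
    reader = csv.reader(io.StringIO(text), **csv_reader_params)
    if skip_header:
        next(reader)  # drop the first line of the input
    rows = []
    for line in reader:
        # Columns are matched to fields by position; (name, None) is ignored.
        rows.append({
            name: field(value)
            for (name, field), value in zip(fields, line)
            if field is not None
        })
    return rows

data = "id,text,label\n1,good movie,pos\n2,bad movie,neg\n"
fields = [("id", None), ("text", str.split), ("label", str)]

rows = read_columns(data, fields, skip_header=True)
print(rows[0])  # {'text': ['good', 'movie'], 'label': 'pos'}
```

For TSV input, the same helper would be called with `delimiter="\t"`.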

(3) Batch

# Defines a batch of examples along with its Fields.

Variables => :	
batch_size – Number of examples in the batch.
dataset – A reference to the dataset object the examples come from (which itself contains the dataset's Field objects).
train – Deprecated: this attribute is kept for backwards compatibility, but it has been unused since the merger with PyTorch 0.4.
input_fields – The names of the fields that are used as input for the model.
target_fields – The names of the fields that are used as targets during model training.

# Also stores the Variable for each column in the batch as an attribute.

class torchtext.data.Batch(data=None, dataset=None, device=None)
# Create a Batch from a list of examples.
__init__(data=None, dataset=None, device=None)
# Create a Batch directly from a number of Variables.

classmethod fromvars(dataset, batch_size, train=None, **kwargs)
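The idea that a batch stores one attribute per column can be sketched without tensors. This is a simplified illustration: `make_batch` is a hypothetical helper, and the columns stay plain Python lists, whereas a real Batch would hold a padded, numericalized tensor for each field.

```python
from types import SimpleNamespace

def make_batch(examples, field_names):
    # Collect each named column across all examples into one attribute.
    batch = SimpleNamespace(batch_size=len(examples))
    for name in field_names:
        setattr(batch, name, [getattr(ex, name) for ex in examples])
    return batch

examples = [
    SimpleNamespace(text=["good", "movie"], label="pos"),
    SimpleNamespace(text=["bad"], label="neg"),
]
batch = make_batch(examples, ["text", "label"])

print(batch.batch_size)  # 2
print(batch.label)       # ['pos', 'neg']
```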

(4) Example

# Defines a single training or test example.
# Stores each column of the example as an attribute.

class torchtext.data.Example
classmethod fromCSV(data, fields, field_to_index=None)
classmethod fromJSON(data, fields)
classmethod fromdict(data, fields)
classmethod fromlist(data, fields)
classmethod fromtree(data, fields, subtrees=False)
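What "stores each column as an attribute" means for a list-based constructor can be sketched as follows. `SimpleExample` is a hypothetical stand-in class, not torchtext's Example, and the sketch assumes each field is a plain callable applied to the raw value, with (name, None) marking a skipped column.

```python
class SimpleExample:
    @classmethod
    def fromlist(cls, data, fields):
        # Pair each raw value with its (name, field) tuple by position.
        ex = cls()
        for (name, field), value in zip(fields, data):
            if field is not None:
                setattr(ex, name, field(value))
        return ex

fields = [("id", None), ("text", str.split), ("label", str)]
ex = SimpleExample.fromlist(["7", "a fine film", "pos"], fields)

print(ex.text)            # ['a', 'fine', 'film']
print(ex.label)           # pos
print(hasattr(ex, "id"))  # False
```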

