torchtext.data.Field

我是一颗棒棒糖

Article link: https://blog.csdn.net/qq_42255269/article/details/112802555

Class interface

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)

Description

Defines a datatype together with instructions for converting it to a Tensor.

The Field class models common text-processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters that control how a datatype is numericalized, such as the tokenization method and the kind of Tensor to produce.
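
To make the vocabulary and numericalization steps concrete, here is a minimal pure-Python sketch. It does not import torchtext; `make_vocab` and `numericalize` are hypothetical helper names chosen for illustration, and the id ordering (special tokens first, then tokens by descending frequency) is an assumption of this sketch rather than a statement about the library's internals.

```python
from collections import Counter


def make_vocab(token_lists, unk_token="<unk>", pad_token="<pad>"):
    """Hypothetical helper: map every token to an integer id."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens take the lowest ids (an assumption of this sketch).
    itos = [unk_token, pad_token]
    # Most frequent tokens first; ties are broken alphabetically.
    itos += [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi, unk_token="<unk>"):
    """Replace each token by its id; unseen tokens map to the unk id."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


corpus = ["the cat sat", "the dog sat down"]
tokenized = [sentence.split() for sentence in corpus]
stoi = make_vocab(tokenized)

# "bird" never appears in the corpus, so it maps to the <unk> id.
ids = numericalize("the bird sat".split(), stoi)
print(ids)
```

In a real Field, the resulting list of ids would then be converted into a Tensor of the configured dtype.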

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), the two columns share one vocabulary.
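
The effect of a shared vocabulary can be illustrated without torchtext. In the sketch below, `build_shared_vocab` is a hypothetical helper and the toy question/answer data are invented for illustration; the point is only that a word occurring in both columns receives the same id.

```python
def build_shared_vocab(*columns, specials=("<unk>", "<pad>")):
    """Hypothetical helper: one token-to-id mapping built over several columns."""
    stoi = {tok: idx for idx, tok in enumerate(specials)}
    for column in columns:
        for text in column:
            for tok in text.split():
                # Assign the next free id the first time a token is seen.
                stoi.setdefault(tok, len(stoi))
    return stoi


# Toy QA data, invented for this example.
questions = ["who wrote hamlet", "where is paris"]
answers = ["shakespeare wrote hamlet", "paris is in france"]

shared = build_shared_vocab(questions, answers)

question_ids = [shared[tok] for tok in questions[0].split()]
answer_ids = [shared[tok] for tok in answers[0].split()]

# "wrote" and "hamlet" get identical ids in both columns.
print(question_ids, answer_ids)
```

With two separate Field objects, each column would instead get its own vocabulary, and the same word could map to different ids in the question and the answer.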

Parameters

sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long (an alias of torch.int64).
preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
lower – Whether to lowercase the text in this field. Default: False.
tokenize – The function used to tokenize strings using this field into sequential examples. If "spacy", the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: None, which falls back to string.split.
tokenizer_language – The language of the tokenizer to be constructed. Multiple languages are currently supported only with SpaCy. Default: 'en'.
include_lengths – Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
batch_first – Whether to produce tensors with the batch dimension first. Default: False.
pad_token – The string token used as padding. Default: "<pad>".
unk_token – The string token used to represent OOV words. Default: "<unk>".
pad_first – Do the padding of the sequence at the beginning. Default: False.
truncate_first – Do the truncating of the sequence at the beginning. Default: False.
stop_words – Tokens to discard during the preprocessing step. Default: None.
is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
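
How several of these parameters interact is easier to see in code. The following is a pure-Python sketch that does not import torchtext: `preprocess` and `pad` are written here for illustration only, and the sketch assumes that fix_length counts the init and eos tokens, so the room left for ordinary tokens shrinks when they are set.

```python
def preprocess(text, tokenize=str.split, lower=False, stop_words=None):
    """Sketch of the preprocessing step: tokenize, lowercase, drop stop words."""
    tokens = tokenize(text.rstrip("\n"))
    if lower:
        tokens = [tok.lower() for tok in tokens]
    if stop_words is not None:
        tokens = [tok for tok in tokens if tok not in stop_words]
    return tokens


def pad(batch, fix_length=None, init_token=None, eos_token=None,
        pad_token="<pad>", pad_first=False, truncate_first=False,
        include_lengths=False):
    """Sketch of padding a batch of token lists to a common length."""
    head = [] if init_token is None else [init_token]
    tail = [] if eos_token is None else [eos_token]
    if fix_length is None:
        max_len = max(len(x) for x in batch)
    else:
        # Assumption: fix_length includes the init/eos tokens.
        max_len = fix_length - len(head) - len(tail)
        if max_len < 0:
            raise ValueError("fix_length is too small for the special tokens")
    padded, lengths = [], []
    for x in batch:
        if len(x) > max_len:
            # Keep the last max_len tokens or the first max_len tokens.
            body = x[len(x) - max_len:] if truncate_first else x[:max_len]
        else:
            body = x
        seq = head + list(body) + tail
        fill = [pad_token] * (max_len - len(body))
        padded.append(fill + seq if pad_first else seq + fill)
        lengths.append(len(seq))  # length before padding
    return (padded, lengths) if include_lengths else padded


batch = [
    preprocess("The quick brown fox", lower=True, stop_words={"the"}),
    preprocess("Hi there", lower=True),
]
padded, lengths = pad(batch, fix_length=5, init_token="<sos>",
                      eos_token="<eos>", include_lengths=True)
print(padded)
print(lengths)

# Truncating and padding at the beginning instead of the end.
tail_kept = pad([["a", "b", "c", "d"]], fix_length=2, truncate_first=True)
left_padded = pad([["a"]], fix_length=3, pad_first=True)
```

The remaining parameters act on later steps: use_vocab decides whether the padded tokens are mapped to ids, and batch_first only changes the axis order of the final Tensor.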