torchtext.data.Field
Class interface
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Description
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
Parameters
- sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
- use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
- init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
- eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
- fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
- dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
- preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
- postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
- lower – Whether to lowercase the text in this field. Default: False.
- tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
- tokenizer_language – The language of the tokenizer to be constructed. Languages other than English are currently supported only via SpaCy. Default: 'en'.
- include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
- batch_first – Whether to produce tensors with the batch dimension first. Default: False.
- pad_token – The string token used as padding. Default: “&lt;pad&gt;”.
- unk_token – The string token used to represent OOV words. Default: “&lt;unk&gt;”.
- pad_first – If True, pad the sequence at the beginning instead of the end. Default: False.
- truncate_first – If True, truncate the sequence at the beginning instead of the end. Default: False.
- stop_words – Tokens to discard during the preprocessing step. Default: None.
- is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.