torchtext.data.Field

Class interface

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)

Description

Defines a datatype together with instructions for converting to Tensor.

The Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters that control how a datatype is numericalized, such as the tokenization method and the kind of Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
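
To make the shared-vocabulary idea concrete, here is a minimal pure-Python sketch. It is not torchtext's implementation: the helper names `build_shared_vocab` and `numericalize` are hypothetical, and the ordering of the vocabulary (special tokens first, then tokens by descending frequency) is an assumption made for illustration. It only shows that when two columns contribute to one vocabulary, the same token maps to the same index in both.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_shared_vocab(
    columns: Sequence[Sequence[Sequence[str]]],
    specials: Sequence[str] = ("<unk>", "<pad>"),
) -> Dict[str, int]:
    """Build one token -> index mapping from several tokenized columns.

    Hypothetical helper: counts tokens across every column, so all columns
    end up sharing a single vocabulary.
    """
    counts = Counter(
        token for column in columns for example in column for token in example
    )
    # Assumed ordering: specials first, then by frequency, ties alphabetically.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    itos = list(specials) + ordered
    return {token: index for index, token in enumerate(itos)}


def numericalize(
    tokens: Sequence[str], stoi: Dict[str, int], unk_token: str = "<unk>"
) -> List[int]:
    """Map tokens to indices, sending out-of-vocabulary tokens to unk_token."""
    return [stoi.get(token, stoi[unk_token]) for token in tokens]


# Two columns of a toy QA dataset, already tokenized.
questions = [["what", "is", "torch"]]
answers = [["torch", "is", "a", "library"]]

stoi = build_shared_vocab([questions, answers])
question_ids = numericalize(questions[0], stoi)
answer_ids = numericalize(answers[0], stoi)
```

Because both columns were counted together, "torch" and "is" receive the same indices in `question_ids` and `answer_ids`, which is the effect of sharing one Field between the two columns.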

Parameters

  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long (an alias of torch.int64).
  • preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
  • postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
  • lower – Whether to lowercase the text in this field. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If "spacy", the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field cannot be serialized. Default: string.split (used when the argument is left as None).
  • tokenizer_language – The language of the tokenizer to be constructed. Multiple languages are currently supported only with SpaCy. Default: 'en'.
  • include_lengths – Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
  • batch_first – Whether to produce tensors with the batch dimension first. Default: False.
  • pad_token – The string token used as padding. Default: "<pad>".
  • unk_token – The string token used to represent OOV (out-of-vocabulary) words. Default: "<unk>".
  • pad_first – Whether to pad the sequence at the beginning rather than at the end. Default: False.
  • truncate_first – Whether to truncate the sequence at the beginning rather than at the end. Default: False.
  • stop_words – Tokens to discard during the preprocessing step. Default: None.
  • is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
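
The interaction of fix_length, init_token, eos_token, pad_token, pad_first and truncate_first is easier to see in code. The following is a rough pure-Python sketch, not the library's own code: `pad_batch` is a hypothetical helper, and it assumes that fix_length counts the init and eos tokens as part of the final length.

```python
from typing import List, Optional, Sequence


def pad_batch(
    batch: Sequence[Sequence[str]],
    fix_length: Optional[int] = None,
    init_token: Optional[str] = None,
    eos_token: Optional[str] = None,
    pad_token: str = "<pad>",
    pad_first: bool = False,
    truncate_first: bool = False,
) -> List[List[str]]:
    """Pad (and, if needed, truncate) tokenized examples to a common length."""
    # Number of slots taken by the optional init/eos tokens.
    extra = (init_token is not None) + (eos_token is not None)
    if fix_length is None:
        # Flexible length: pad to the longest example in this batch.
        max_len = max(len(example) for example in batch)
    else:
        # Assumption: fix_length includes the init/eos tokens.
        max_len = fix_length - extra
        if max_len < 0:
            raise ValueError("fix_length is too small for init/eos tokens")

    padded = []
    for example in batch:
        tokens = list(example)
        if truncate_first:
            # Drop tokens from the beginning, keep the last max_len tokens.
            kept = tokens[max(0, len(tokens) - max_len):]
        else:
            # Drop tokens from the end, keep the first max_len tokens.
            kept = tokens[:max_len]
        body = (
            ([init_token] if init_token is not None else [])
            + kept
            + ([eos_token] if eos_token is not None else [])
        )
        padding = [pad_token] * (max_len - len(kept))
        padded.append(padding + body if pad_first else body + padding)
    return padded


batch = [["a", "b", "c"], ["d"]]

# Default behaviour: truncate and pad at the end.
default_padded = pad_batch(batch, fix_length=4, init_token="<s>", eos_token="</s>")

# Truncate and pad at the beginning instead.
front_padded = pad_batch(batch, fix_length=2, pad_first=True, truncate_first=True)
```

Under these assumptions, `default_padded` is `[["<s>", "a", "b", "</s>"], ["<s>", "d", "</s>", "<pad>"]]` and `front_padded` is `[["b", "c"], ["<pad>", "d"]]`. With include_lengths=True, the Field would additionally report the unpadded length of each example alongside the padded batch.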