torchtext: the dataset maze

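This article introduces how to use the torchtext library to process text data. It covers configuring Fields (for example, whether the data is sequential and how the vocabulary is built), and using Dataset together with the different iterator types such as Iterator, BucketIterator, and BPTTIterator. These tools make training on text more efficient.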

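Processing text

Core idea

How do we turn the training samples into batches? In other words, how do we build our own iterator so that training becomes more convenient?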
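
To make the question concrete, here is a minimal pure-Python sketch of a batching iterator. It is not torchtext's implementation; the function name `iterate_batches` and its parameters are made up for illustration, and it only shows the idea of slicing a list of examples into fixed-size groups.

```python
import random


def iterate_batches(examples, batch_size, shuffle=False, seed=None):
    """Yield lists of at most `batch_size` examples (hypothetical helper)."""
    indices = list(range(len(examples)))
    if shuffle:
        # Shuffle the order of the examples, not the examples themselves.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


sentences = ["a b c", "d e", "f", "g h i j", "k"]
batches = list(iterate_batches(sentences, batch_size=2))
for batch in batches:
    print(batch)
```

The last batch is simply shorter when the number of examples is not a multiple of `batch_size`. A real iterator would also pad and numericalize each batch, which this sketch leaves out.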

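Fields – what do you want from me

In a language model we usually predict the next word, so this kind of unsupervised learning comes with labels for free. In sentiment analysis and text classification the label has its own column, so the data has to be handled differently. Each Field tells the framework how one particular column should be processed.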
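
The difference between the two setups can be sketched in plain Python. The helper names `language_model_pairs` and `classification_pair`, and the column names `text` and `label`, are assumptions chosen for this example only.

```python
def language_model_pairs(tokens):
    """Pair each token with the token that follows it (the 'free' label)."""
    return list(zip(tokens[:-1], tokens[1:]))


def classification_pair(row, text_column="text", label_column="label"):
    """Split the text column into tokens and read the label from its own column."""
    return row[text_column].split(), row[label_column]


lm_pairs = language_model_pairs("the cat sat down".split())
clf_tokens, clf_label = classification_pair({"text": "great movie", "label": "pos"})

print(lm_pairs)
print(clf_tokens, clf_label)
```

In the first case the target is derived from the text itself, so one column is enough. In the second case the text and the label need different processing, which is why each column gets its own Field.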

Field API
  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • lower – Whether to lowercase the text in this field. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
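
To get a feel for what these options do to a single raw string, here is a rough pure-Python sketch. It borrows the option names from the list above, but the function `preprocess` is hypothetical and the order of the steps is a simplification, not the library's actual code path.

```python
def preprocess(text, sequential=True, lower=False, tokenize=str.split,
               init_token=None, eos_token=None):
    """Apply Field-like options to one raw string (illustrative only)."""
    if lower:
        text = text.lower()
    if not sequential:
        # Non-sequential data (for example a class label) is not tokenized.
        return text
    tokens = tokenize(text)
    if init_token is not None:
        tokens = [init_token] + tokens
    if eos_token is not None:
        tokens = tokens + [eos_token]
    return tokens


text_example = preprocess("Hello World", lower=True,
                          init_token="<sos>", eos_token="<eos>")
label_example = preprocess("pos", sequential=False)

print(text_example)
print(label_example)
```

A text column would typically use the sequential settings, while a label column would switch tokenization off.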
If you want to build your own vocabulary, there is an API for that as well:

build_vocab(*args, **kwargs)

  • Parameters:
    • arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
    • keyword arguments (Remaining) – Passed to the constructor of V
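
Conceptually, building a vocabulary means counting tokens across one or more data sources and assigning each token an index. The sketch below does this with `collections.Counter`. The function `build_vocab_sketch`, its `min_freq` and `specials` parameters, and the tie-breaking rule are assumptions made for this illustration and are not taken from the library.

```python
from collections import Counter


def build_vocab_sketch(*sources, min_freq=1, specials=("<unk>", "<pad>")):
    """Count tokens over several tokenized datasets and index them."""
    counts = Counter(
        token for source in sources for example in source for token in example
    )
    itos = list(specials)  # index -> token, special tokens first
    # Most frequent tokens first; ties are broken alphabetically.
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}  # token -> index
    return itos, stoi


train = [["the", "cat"], ["the", "dog"]]
valid = [["a", "cat"]]
itos, stoi = build_vocab_sketch(train, valid)

# Unknown tokens fall back to the index of "<unk>".
ids = [stoi.get(token, stoi["<unk>"]) for token in ["the", "bird"]]
print(itos)
print(ids)
```

Passing several sources mirrors the idea that the positional arguments can be more than one dataset.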
Data
Dataset
  • Defines a dataset composed of Examples along with its Fields. I have not used this one yet.
TabularDataset
  • Defines a Dataset of columns stored in CSV, TSV, or JSON format.

  • init

    • path (str) – Path to the data file.

    • format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).

    • fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) – If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored. If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to loa
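
The list form of `fields` can be illustrated with the standard `csv` module. This is a sketch of the idea only: `load_tabular` is a hypothetical helper, and plain callables stand in for Field objects.

```python
import csv
import io


def load_tabular(file_obj, fields, delimiter=","):
    """Read rows and apply one processor per column, in column order.

    `fields` is a list of (name, processor) tuples; a processor of None
    means the column is skipped.
    """
    examples = []
    for row in csv.reader(file_obj, delimiter=delimiter):
        example = {}
        for (name, processor), value in zip(fields, row):
            if processor is None:
                continue  # ignored column
            example[name] = processor(value)
        examples.append(example)
    return examples


# In-memory CSV with three columns: id, text, label.
data = io.StringIO("1,great movie,pos\n2,too long,neg\n")
fields = [("id", None), ("text", str.split), ("label", str)]
examples = load_tabular(data, fields)
print(examples)
```

The `id` column is dropped because its entry is `("id", None)`, while the text column is tokenized and the label column is kept as a plain string.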
