torchtext -- the dataset maze

Yif_Zhou

Category: study notes, torch

Article link: https://blog.csdn.net/weixin_40733475/article/details/103779987

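Summary (generated automatically by CSDN): This article describes how to process text data with the torchtext library. It covers Field configuration (for example, whether the data is sequential and how the vocabulary is built), Dataset, and the iterator types Iterator, BucketIterator, and BPTTIterator, which make training on text more efficient.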

How do we turn training samples into batches? In other words, how do we build our own iterator so that training becomes more convenient?
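
To make the idea of "an iterator over batches" concrete, here is a minimal pure-Python sketch. It does not use torchtext; `batch_iter` is a hypothetical helper written only for illustration, and it simply slices a list into fixed-size chunks without any padding or sorting.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_iter(examples: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `examples` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(examples), batch_size):
        # The last batch may be shorter than batch_size.
        yield list(examples[start:start + batch_size])


sentences = ["a b c", "d e", "f", "g h i j", "k"]
batches = list(batch_iter(sentences, batch_size=2))
for batch in batches:
    print(batch)
```

The iterators that torchtext provides do more than this (for example, grouping examples of similar length), but the loop a training script runs over them has the same shape.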

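In a language model we usually predict the next word. This is unsupervised learning, so the labels come for free from the text itself. In sentiment analysis and text classification, the label has its own column, so the data has to be handled differently. Each Field tells the framework how one particular column should be processed.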
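
The difference between the two kinds of label can be sketched in plain Python. This is an illustration only, not torchtext code: `lm_pairs` and the toy `rows` data are assumptions made up for the example. For language modeling the target is the input shifted by one token; for classification the label is read from its own column.

```python
from typing import List, Tuple


def lm_pairs(tokens: List[str], bptt_len: int) -> List[Tuple[List[str], List[str]]]:
    """Cut a token stream into (input, target) windows; target is input shifted by one."""
    pairs = []
    for i in range(0, len(tokens) - 1, bptt_len):
        # Leave room for the one-token shift at the end of the stream.
        seq_len = min(bptt_len, len(tokens) - 1 - i)
        pairs.append((tokens[i:i + seq_len], tokens[i + 1:i + 1 + seq_len]))
    return pairs


# Language modeling: labels come from the text itself.
tokens = "the cat sat on the mat".split()
pairs = lm_pairs(tokens, bptt_len=3)

# Classification: the label lives in a separate column.
rows = [("great movie", "pos"), ("boring plot", "neg")]
texts = [text for text, _ in rows]
labels = [label for _, label in rows]
```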

sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
lower – Whether to lowercase the text in this field. Default: False.
tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
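
As a rough mental model of what these options do, the sketch below imitates a few of them in plain Python. `SimpleField` is a hypothetical stand-in written for this note, not torchtext's Field class, and it collapses everything into a single `process` method; the real library is more involved.

```python
from typing import Callable, List, Optional, Union


class SimpleField:
    """Toy imitation of a few Field options: sequential, lower, tokenize, init/eos tokens."""

    def __init__(
        self,
        sequential: bool = True,
        lower: bool = False,
        tokenize: Callable[[str], List[str]] = str.split,
        init_token: Optional[str] = None,
        eos_token: Optional[str] = None,
    ) -> None:
        self.sequential = sequential
        self.lower = lower
        self.tokenize = tokenize
        self.init_token = init_token
        self.eos_token = eos_token

    def process(self, text: str) -> Union[str, List[str]]:
        if self.lower:
            text = text.lower()
        if not self.sequential:
            # Non-sequential data (e.g. a class label) is not tokenized.
            return text
        tokens = self.tokenize(text)
        if self.init_token is not None:
            tokens = [self.init_token] + tokens
        if self.eos_token is not None:
            tokens = tokens + [self.eos_token]
        return tokens


text_field = SimpleField(lower=True, init_token="<sos>", eos_token="<eos>")
label_field = SimpleField(sequential=False)

processed_text = text_field.process("Hello World")
processed_label = label_field.process("pos")
```

One field per column is the point: the text column gets tokenized and wrapped in boundary tokens, while the label column passes through untouched.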

build_vocab(*args, **kwargs)

Parameters:
- arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- keyword arguments (Remaining) – Passed to the constructor of V
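
Conceptually, building a vocabulary means counting tokens across one or more data sources and assigning each kept token an integer id. The sketch below shows that idea with `collections.Counter`. `build_vocab_sketch` and its `min_freq` and `specials` parameters are assumptions chosen for illustration, not the library's implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def build_vocab_sketch(
    *sources: Iterable[Sequence[str]],
    min_freq: int = 1,
    specials: Sequence[str] = ("<unk>", "<pad>"),
) -> Tuple[Dict[str, int], List[str]]:
    """Count tokens over all sources and map each kept token to an integer id."""
    counter: Counter = Counter()
    for source in sources:
        for tokens in source:
            counter.update(tokens)

    itos = list(specials)  # special tokens take the first ids
    # Most frequent first; ties broken alphabetically so the order is deterministic.
    for token, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in itos:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


train = [["the", "cat", "sat"], ["the", "dog"]]
valid = [["the", "cat"]]
stoi, itos = build_vocab_sketch(train, valid, min_freq=2)

# Tokens below min_freq fall back to the <unk> id.
ids = [stoi.get(token, stoi["<unk>"]) for token in ["the", "dog"]]
```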

Defines a Dataset of columns stored in CSV, TSV, or JSON format.
__init__
- path (str) – Path to the data file.
- format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).
- fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) – If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored. If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to loa