A Hands-On Guide to torchtext

Preface

This post records basic usage of torchtext. The rough workflow is:

  • import the required packages
  • define a custom Dataset (in the __init__ function)
  • create the iterators (in the iters function)

Imports

from torchtext.data import Example, Field, Dataset
import torchtext.data as data
import json          # used below to parse the JSON-lines data files
import numpy as np   # used in the data-augmentation snippet at the end

Defining a custom Dataset (written in the __init__ function)

Initializing the Fields

Parameters

  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • include_lengths – Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
  • batch_first – Whether to produce tensors with the batch dimension first. Default: False.
  • pad_token – The string token used as padding. Default: "<pad>".

Example:

# model (a pretrained-encoder wrapper) and label_field are created elsewhere
# and passed into __init__ (see the splits call further down)
text_field = Field(sequential=True, use_vocab=False, include_lengths=True,
                   batch_first=True, pad_token=model.tokenizer.pad_token_id)
fields = [('text', text_field),
          ('span', Field(sequential=False, use_vocab=False, batch_first=True)),
          ('orig_span', Field(sequential=False, use_vocab=False, batch_first=True)),
          ('label', label_field)]

Initializing the examples

Taking torchtext.data.Example.fromlist(data, fields) as the example (see the minimal sketch after this list)
Parameters

  • data – a list of raw values, one per field, in the same order as fields
  • fields – a list of (name, Field) tuples telling torchtext how to preprocess each value
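Before the full example, here is a minimal self-contained sketch of fromlist (the toy fields are purely illustrative, not the ones used below):

from torchtext.data import Example, Field

# Two toy fields: a tokenized text field and a raw numeric label field
toy_fields = [('text', Field(sequential=True)),
              ('label', Field(sequential=False, use_vocab=False))]

ex = Example.fromlist(['hello world', 1], toy_fields)
print(ex.text)   # ['hello', 'world'] -- the text Field's tokenizer was applied
print(ex.label)  # 1 -- non-sequential, so the value is kept as-is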

Example:

examples = []
with open(path, encoding=encoding) as f:
    for line in f:
        instance = json.loads(line)
        # Tokenize into subwords, keeping the subword-to-word mapping
        text, subword_to_word_idx = model.tokenize(
            instance["text"].split(), get_subword_indices=True)
        for target in instance["targets"]:
            # Map the word-level span onto subword token indices
            span_index = self.get_tokenized_span_indices(
                subword_to_word_idx, target["span1"])
            label = target["label"]
            examples.append(
                Example.fromlist([text, span_index, target["span1"], label], fields))

Printed out, each Example is an object whose attributes match the field names. (Screenshot omitted.)
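Since the original screenshot is gone, here is roughly what inspecting one Example looks like (attribute names follow the fields defined above):

ex = examples[0]
print(ex.text)       # subword token ids produced by model.tokenize
print(ex.span)       # the tokenized span indices
print(ex.orig_span)  # the original word-level span, e.g. [4, 5]
print(ex.label)      # the label string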

Building the Dataset from the Fields and examples

Example:

# Hand the Examples and Fields to torchtext's Dataset constructor
super(NERDataset, self).__init__(examples, fields)

And that completes the custom dataset.
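Putting the three steps together, a minimal sketch of the whole __init__ could look like this (model, label_field and the JSON-lines format are the same assumptions as above; get_tokenized_span_indices is the helper from the example-building loop):

class NERDataset(Dataset):
    def __init__(self, path, model, label_field, encoding='utf-8',
                 train_frac=1.0, **kwargs):
        # train_frac is accepted for compatibility with the splits call below;
        # subsampling is omitted in this sketch
        # Step 1: initialize the Fields
        text_field = Field(sequential=True, use_vocab=False, include_lengths=True,
                           batch_first=True, pad_token=model.tokenizer.pad_token_id)
        fields = [('text', text_field),
                  ('span', Field(sequential=False, use_vocab=False, batch_first=True)),
                  ('orig_span', Field(sequential=False, use_vocab=False, batch_first=True)),
                  ('label', label_field)]
        # Step 2: build the Examples from the JSON-lines file
        examples = []
        with open(path, encoding=encoding) as f:
            for line in f:
                instance = json.loads(line)
                text, subword_to_word_idx = model.tokenize(
                    instance["text"].split(), get_subword_indices=True)
                for target in instance["targets"]:
                    span_index = self.get_tokenized_span_indices(
                        subword_to_word_idx, target["span1"])
                    examples.append(Example.fromlist(
                        [text, span_index, target["span1"], target["label"]], fields))
        # Step 3: hand the Examples and Fields to the parent Dataset
        super(NERDataset, self).__init__(examples, fields, **kwargs)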

Of course, you can also use torchtext's built-in datasets directly instead of defining your own: see torchtext学习总结 in the references. (Screenshot omitted.)

Creating the iterators (written in the iters function)

Splitting the data with splits (instantiating the Dataset)

Parameters

  • path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
  • train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
  • validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  • Remaining keyword arguments – Passed to the constructor of the Dataset (sub)class being used.

A note on the remaining keyword arguments: if we define our own Dataset, then train, val, test = Dataset.splits(...) has to instantiate that Dataset first. Instantiation works exactly as described above: in our custom Dataset's __init__ we first initialize the Fields, then the Examples, and finally the Dataset itself. During initialization we usually need some extra parameters of our own; for instance, the rate used in the data-augmentation section below is one such keyword argument.

Example:

train, val, test = NERDataset.splits(
    path=path, train='train.json', validation='development.json', test='test.json',
    model=model, train_frac=train_frac, label_field=label_field)
# model, train_frac and label_field are all keyword arguments; your dataset is
# processed differently from mine, so you will likely not need these exact ones.

Instantiating the iterators

If the step above instantiates the Dataset, then the iterator is, the way I see it, torchtext's counterpart of a DataLoader.
Taking torchtext.data.BucketIterator.splits as the example
Parameters:

  • dataset – The Dataset object to load Examples from.
  • batch_size – Batch size.
  • sort_within_batch – Whether to sort (in descending order according to self.sort_key) within each batch. If None, defaults to self.sort. If self.sort is True and this is False, the batch is left in the original (ascending) sorted order.
  • shuffle – Whether to shuffle examples between epochs.
  • repeat – Whether to repeat the iterator for multiple epochs. Default: False.

(As I understand it, repeat controls whether the iterator keeps yielding batches across epochs instead of stopping after one pass; with repeat=False you simply loop over the iterator once per epoch.)
Example:

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=eval_batch_size,
    sort_within_batch=True, shuffle=False, repeat=False)
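Because text_field was created with include_lengths=True, each batch's text attribute is a (padded tensor, lengths) tuple, while the other fields are plain tensors. A quick sketch of consuming one batch:

for batch in train_iter:
    tokens, lengths = batch.text   # tokens: (batch, max_len); lengths: (batch,)
    spans = batch.span
    labels = batch.label
    # ... feed tokens/lengths/spans to the model ...
    break  # just peeking at the first batch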

Building the vocabulary

Example:

label_field.build_vocab(train)  # train is the train Dataset from above
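Once the vocabulary is built, the label Field can map between label strings and indices, which is also where num_labels below comes from:

num_labels = len(label_field.vocab)     # vocab size (includes <unk> unless unk_token=None)
idx = label_field.vocab.stoi['PERSON']  # string -> index ('PERSON' is just an illustrative label)
name = label_field.vocab.itos[idx]      # index -> string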

And that's about it. You can put everything from "Splitting the data with splits (instantiating the Dataset)" through "Building the vocabulary" into your custom Dataset's iters method, and then:

train_iter, val_iter, test_iter, num_labels = CustomDataset.iters(
    hp.data_dir, encoder, batch_size=hp.batch_size,
    eval_batch_size=hp.eval_batch_size, train_frac=hp.train_frac)
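For reference, a sketch of what such an iters classmethod might look like (the argument names follow the call above; the wiring is my own, not a torchtext API):

@classmethod
def iters(cls, data_dir, model, batch_size=32, eval_batch_size=32, train_frac=1.0):
    label_field = Field(sequential=False, batch_first=True)
    # Instantiate the three Datasets (the keyword arguments reach __init__)
    train, val, test = cls.splits(
        path=data_dir, train='train.json', validation='development.json',
        test='test.json', model=model, train_frac=train_frac,
        label_field=label_field)
    # Build the label vocabulary from the training split only
    label_field.build_vocab(train)
    # Bucketed iterators; assumes the Dataset defines a sort_key (e.g. text length)
    train_iter, val_iter, test_iter = data.BucketIterator.splits(
        (train, val, test),
        batch_sizes=(batch_size, eval_batch_size, eval_batch_size),
        sort_within_batch=True, shuffle=False, repeat=False)
    return train_iter, val_iter, test_iter, len(label_field.vocab)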

An aside: data augmentation

That blog post is really good; I also found its bit on "data augmentation for NLP" (shuffle or dropout):

# rate is a random number in [0, 1) drawn per example
if rate > 0.5:
    text = self.dropout(text)
else:
    text = self.shuffle(text)

def shuffle(self, text):
    # Randomly permute the word order
    text = np.random.permutation(text.strip().split())
    return ' '.join(text)

def dropout(self, text, p=0.5):
    # Blank out a random fraction p of the words
    text = text.strip().split()
    len_ = len(text)
    # replace=False avoids drawing the same index twice
    indices = np.random.choice(len_, int(len_ * p), replace=False)
    for i in indices:
        text[i] = ''
    return ' '.join(text)
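To see what the two augmentations actually do, here is a standalone version you can run directly (same bodies as above, just without self):

import numpy as np

def shuffle(text):
    return ' '.join(np.random.permutation(text.strip().split()))

def dropout(text, p=0.5):
    words = text.strip().split()
    drop = np.random.choice(len(words), int(len(words) * p), replace=False)
    for i in drop:
        words[i] = ''
    return ' '.join(words)

print(shuffle('the quick brown fox jumps'))  # e.g. 'fox the jumps quick brown'
print(dropout('the quick brown fox jumps'))  # e.g. 'the  brown  jumps' (dropped words leave gaps)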

References

torchtext official documentation
torchtext学习总结

Side notes

While I'm at it, let me also record what the input looks like for each task.
What's recorded above is the input for the NEL task.

Named entity labeling (NEL)

Description:

Named entity labeling (NEL) is the task of predicting the entity type of a given span corresponding to an entity. For example, does the span "German" in some context refer to people, organization, or language.

Example:
A passage contains many spans, and each span is assigned one label. (Screenshot omitted.)

Semantic role labeling (SRL)

Description

Semantic role labeling (SRL) is concerned with predicting the semantic roles of phrases in a sentence. In this probing task the locations of the predicate and its argument are given, and the goal is to classify the argument into its specific semantic roles (agent, patient, etc.).

Example:
{"info": {"document_id": "mz/sinorama/10/ectb_1029", "sentence_id": 0}, "text": "The enterovirus detection biochip [0-4) developed by [4-5) DR. Chip Biotechnology takes [5-8) takes (why is 'takes' counted in here too?) only six hours to give hospitals the answer to whether a sample contains enterovirus , and if it is the deadly strain Entero 71 .", "targets": [{"span1": [4, 5], "label": "ARG0", "span2": [5, 9]}, {"span1": [4, 5], "label": "ARG1", "span2": [0, 4]}]}

Mention detection

Description

Mention detection is the task of predicting whether a span represents a mention of an entity or not. For example, in the sentence "Mary goes to the market", the spans "Mary" and "the market" refer to mentions while all other spans are not mentions. The task is similar to named entity recognition (Tjong Kim Sang and De Meulder, 2003), but the mentions are not limited to named entities.

Example
Damn, there isn't one.

Coreference arc prediction

Description

Coreference arc prediction is the task of predicting whether a pair of spans refer to the same entity or not. For example, in "John is his own enemy", "John" and "his" refer to the same person.

Example
{"info": {"document_id": "mz/sinorama/10/ectb_1029", "sentence_id": 0}, "text": "Everyone will carry with them a biochip bearing a record and analysis of their own DNA .", "targets": [{"span1": [0, 1], "label": "1", "span2": [4, 5]}, {"span1": [0, 1], "label": "1", "span2": [13, 14]}, {"span1": [0, 1], "label": "0", "span2": [5, 16]}, {"span1": [4, 5], "label": "1", "span2": [13, 14]}, {"span1": [4, 5], "label": "0", "span2": [5, 16]}, {"span1": [13, 14], "label": "0", "span2": [5, 16]}]}

Constituent labeling

Description

Constituent labeling is the task of predicting the non-terminal label (e.g., noun phrase or verb phrase) for a span corresponding to a constituent.

Example:
{"text": "Powerful Tools for Biotechnology - Biochips", "targets": [{"span1": [0, 6], "label": "TOP", "info": {"height": 6, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [0, 6], "label": "NP", "info": {"height": 5, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [0, 4], "label": "NP", "info": {"height": 4, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [0, 2], "label": "NP", "info": {"height": 2, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [0, 1], "label": "JJ", "info": {"height": 1, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [1, 2], "label": "NNS", "info": {"height": 1, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [2, 4], "label": "PP", "info": {"height": 3, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [2, 3], "label": "IN", "info": {"height": 1, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [3, 4], "label": "NP", "info": {"height": 2, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [3, 4], "label": "NN", "info": {"height": 1, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [4, 5], "label": ":", "info": {"height": 1, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [5, 6], "label": "NP", "info": {"height": 2, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}},
{"span1": [5, 6], "label": "NNS", "info": {"height": 1, "form_function_discrepancies": [], "grammatical_rule": [], "adverbials": [], "miscellaneous": []}}], "info": {"document_id": "mz/sinorama/10/ectb_1029", "sentence_id": 0}}

Supplement: SNLI数据集预处理实战 (a hands-on SNLI dataset preprocessing post)
