Task09: Solving Extractive Question Answering with Transformers

数据闲逛人

Original source: https://github.com/datawhalechina/Learn-NLP-with-Transformers

Contents

1 Fine-tuning a transformer model on a question answering task
Summary
References

1 Fine-tuning a transformer model on a question answering task

# squad_v2 = True selects SQuAD v2; False selects SQuAD v1.
# With any other dataset, True means the model may treat some questions as unanswerable and return no answer; False means every question must be answered.
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

1.1 Loading the data

Use the Datasets library to download the data and to get the evaluation metric we need.

from datasets import load_dataset, load_metric

# Download the data (requires a network connection)
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

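# Offline alternative: load SQuAD v1.1 from local files.
# This replaces the `datasets` object created above.
import os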

data_path = './dataset/squad/'
path = os.path.join(data_path, 'squad.py')
cache_dir = os.path.join(data_path, 'cache')
data_files = {"train": os.path.join(data_path, "train-v1.1.json"), "validation": os.path.join(data_path, "dev-v1.1.json")}
datasets = load_dataset(path, data_files=data_files, cache_dir=cache_dir)

Whether it comes from the training, validation, or test split, every question answering sample has three keys: "context", "question", and "answers".

datasets["train"][0]
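# answers: the answer text(s) and their start positions
# context: the passage the answer is extracted from
# question: the question being asked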
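
To make the sample layout concrete, here is a minimal sketch built on a hand-written toy record. The record is made up for illustration and is not taken from SQuAD, and `answer_spans` is a hypothetical helper. It follows the SQuAD layout, where `answers` holds two parallel lists, `text` and `answer_start`, and each `answer_start` is a character offset into `context`.

```python
# Toy record in the SQuAD layout; the content is invented for illustration.
context = "The tokenizer splits text into tokens. The model then predicts a start and an end token."
answer_text = "a start and an end token"
sample = {
    "id": "toy-0",
    "title": "Toy",
    "context": context,
    "question": "What does the model predict?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.index(answer_text)],
    },
}


def answer_spans(example):
    """Return (start_char, end_char, text) for every annotated answer."""
    answers = example["answers"]
    return [
        (start, start + len(text), text)
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


for start, end, text in answer_spans(sample):
    # Slicing the context with the character span recovers the answer text.
    assert sample["context"][start:end] == text
    print(start, end, repr(text))
```

With SQuAD v2, unanswerable questions carry empty `text` and `answer_start` lists, so the helper would return an empty list for them.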

The helper below displays a few random examples from a dataset.

from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
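    display(HTML(df.to_html()))

# Show ten random training examples (meant for a notebook environment).
show_random_elements(datasets["train"])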

1.2 Preprocessing the training data

Preprocessing is done with the tokenizer.

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

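# To see how the tokenizer splits text, call its tokenize method. add_special_tokens=True adds the special tokens the pretrained model expects.
print("First text tokenized: {}".format(tokenizer.tokenize("What is your name?", add_special_tokens=True)))
print("Second text tokenized: {}".format(tokenizer.tokenize("My name is Sylvain.", add_special_tokens=True)))
# The pretrained model expects token IDs plus an attention mask as input. Calling the tokenizer directly produces both.

# Preprocess a single text
tokenizer("What is your name?")

# Preprocess a pair of texts. The tokenizer adds token ID 101 ([CLS]) at the start, uses token ID 102 ([SEP]) to separate the two texts, and ends with 102 again. These conventions come from how the model was pretrained.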
tokenizer("What is your name?", "My name is Sylvain.")
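
The Trainer call in section 1.3 relies on `tokenized_datasets`, which this post never builds. It would typically come from mapping a preprocessing function over `datasets`, where that function tokenizes each question/context pair and adds `start_positions` and `end_positions` labels. The core of that step is turning a character-level answer span into token positions. The sketch below shows one way to do it in plain Python. The helper name `char_span_to_token_span` is hypothetical, and the offsets are hand-written under the assumption of one token per word, whereas a real WordPiece tokenizer may split words further.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char,
                            context_id=1, cls_index=0):
    """Map a character-level answer span to (start_token, end_token).

    offsets[i] is the (start, end) character range of token i in its own text.
    sequence_ids[i] is None for special tokens, 0 for the first text, 1 for the second.
    Returns (cls_index, cls_index) when the answer is not fully inside this window.
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == context_id]
    if not context_tokens:
        return cls_index, cls_index
    first, last = context_tokens[0], context_tokens[-1]
    # The answer must lie entirely within the context part of this window.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index
    # First context token that ends after the answer starts.
    start_token = next(i for i in context_tokens if offsets[i][1] > start_char)
    # Last context token that starts before the answer ends.
    end_token = next(i for i in reversed(context_tokens) if offsets[i][0] < end_char)
    return start_token, end_token


context = "My name is Sylvain."
# Layout: [CLS] what is your name ? [SEP] my name is sylvain . [SEP]
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (13, 17), (17, 18), (0, 0),
           (0, 2), (3, 7), (8, 10), (11, 18), (18, 19), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, None]

start_char = context.index("Sylvain")
end_char = start_char + len("Sylvain")
print(char_span_to_token_span(offsets, sequence_ids, start_char, end_char))
```

Falling back to the [CLS] position when the answer lies outside the window gives long contexts that are split into several features a well-defined label for the windows that do not contain the answer.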

1.3 Fine-tuning the model

Load the model with a question answering head and set the training arguments.

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

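args = TrainingArguments(
    "test-squad",                    # output directory
    evaluation_strategy="epoch",     # evaluate after every epoch (recent transformers releases name this eval_strategy)
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,              # number of training epochs
    weight_decay=0.01,
)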

Use default_data_collator to batch the preprocessed data and feed it to the model.

from transformers import default_data_collator

data_collator = default_data_collator

Pass the default_data_collator to the Trainer.

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Start fine-tuning
trainer.train()

1.4 Evaluation

The model output has to be post-processed into the text format we need.

import torch

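# Take the first batch from the evaluation dataloader
batch = next(iter(trainer.get_eval_dataloader()))
# Move every tensor to the device the model is on
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}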
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
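# Assumed value: n_best_size is not defined anywhere else in this post.
n_best_size = 20
# Collect the positions of the best start and end logits: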
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # a span is valid only if start does not come after end
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # the answer text still has to be recovered from the token indexes
                }
            )
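
The slice `np.argsort(x)[-1 : -n_best_size - 1 : -1]` picks the indexes of the `n_best_size` largest values, best first. The sketch below reproduces that selection in plain Python on hand-written toy logits and adds a length filter. `top_k_indexes`, `best_spans`, and `max_answer_length` are assumptions introduced for illustration and do not appear in this post.

```python
def top_k_indexes(logits, k):
    """Indexes of the k largest values, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def best_spans(start_logits, end_logits, k, max_answer_length):
    """Enumerate (score, start, end) over the top-k starts and ends, best score first."""
    candidates = []
    for s in top_k_indexes(start_logits, k):
        for e in top_k_indexes(end_logits, k):
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append((start_logits[s] + end_logits[e], s, e))
    return sorted(candidates, reverse=True)


# Toy logits, invented for illustration.
start_logits = [0.1, 2.0, -1.0, 1.5, 0.0]
end_logits = [0.0, 0.5, 3.0, -0.5, 1.0]

print(top_k_indexes(start_logits, 2))
print(best_spans(start_logits, end_logits, k=2, max_answer_length=3))
```

Restricting the search to the top candidates keeps the number of start/end pairs at most `k * k` instead of growing with the square of the sequence length. When two logits are exactly equal, this version and the NumPy slice may order them differently.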

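The validation set is then re-processed so that each feature keeps its offset mapping and the id of the example it came from.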
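
The function that follows refers to `max_length`, `doc_stride`, and `pad_on_right`, none of which are defined in this post. The sketch below sets them to assumed values and illustrates, on a plain list of integers standing in for tokens, how a long context is cut into windows that overlap by `stride` tokens. `split_with_stride` is a hypothetical helper. In the real tokenizer the space available for the context is smaller than `max_length`, because the question and the special tokens share the same budget.

```python
# Assumed values; none of these are defined in the post itself.
max_length = 384     # maximum number of tokens per feature
doc_stride = 128     # number of tokens shared by two consecutive windows
pad_on_right = True  # assumes tokenizer.padding_side == "right"


def split_with_stride(tokens, window, stride):
    """Split tokens into windows of at most `window` tokens.

    Consecutive windows overlap by `stride` tokens.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks


chunks = split_with_stride(list(range(10)), window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

The overlap means an answer that straddles a window boundary still appears whole in at least one window, as long as it is no longer than the stride.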


def prepare_validation_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. When a context is
    # long, one example can therefore produce several features, each with a context that overlaps the context of
    # the previous feature. max_length, doc_stride and pad_on_right must be defined before this function is called.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
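
The offset mappings kept above are what allow a predicted token span to be turned back into answer text, which is the step left open by the empty `"text"` field earlier. A minimal sketch, assuming hand-written offsets with one token per word and a hypothetical helper named `span_to_text`:

```python
def span_to_text(context, offsets, start_index, end_index):
    """Return the context substring for a token span, or None if the span is unusable.

    offsets[i] is a (start_char, end_char) pair for context tokens and None otherwise.
    """
    if start_index > end_index:
        return None
    start, end = offsets[start_index], offsets[end_index]
    # A span touching the question or a special token cannot be an answer.
    if start is None or end is None:
        return None
    return context[start[0]:end[1]]


context = "My name is Sylvain."
# Layout: [CLS] what is your name ? [SEP] my name is sylvain . [SEP]
# Question and special tokens are already set to None, as in the function above.
offsets = [None] * 7 + [(0, 2), (3, 7), (8, 10), (11, 18), (18, 19)] + [None]

print(span_to_text(context, offsets, 10, 10))
print(span_to_text(context, offsets, 8, 10))
print(span_to_text(context, offsets, 0, 10))
```

Because non-context positions are `None`, the same check that recovers the text also rejects candidate spans that start in the question or on a special token.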

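Summary

My head hurts. It all still feels muddled, and I have only a faint grasp of the concepts. As the saying goes, what you learn on paper always feels shallow; to truly understand something you have to practice it yourself.
Diligent study is like a seedling in spring: you do not see it grow, yet it gains a little each day. Neglecting study is like a whetstone: you do not see it wear down, yet it loses a little each day.

References

Datawhale: Natural Language Processing with transformers (NLP beginner course)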
