Task08: Solving Sequence Labeling Tasks with Transformers

数据闲逛人

Tags: transformer, python

Article link: https://blog.csdn.net/jcjic/article/details/120518853

Contents

1 Sequence labeling (token-level classification)
Summary
References

1 Sequence labeling (token-level classification)

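NER (named-entity recognition): identify the named entities in a text, such as persons, organizations, and locations.
POS (part-of-speech tagging): label each token with its grammatical category (noun, verb, adjective, …).
Chunk (chunking): group the tokens that belong to the same phrase.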
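To make "token-level classification" concrete, here is a minimal sketch of how per-token NER tags turn back into entities. It assumes BIO-style tags (B- opens an entity, I- continues it, O is outside any entity), which is the scheme CoNLL-2003 uses; the helper name extract_entities and the example sentence are made up for illustration.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical sentence with one tag per token.
tokens = ["Angela", "Merkel", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(extract_entities(tokens, tags))
```

The model only ever predicts one tag per token; grouping tags into spans like this is a post-processing step.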

1.1 Loading the data

from datasets import load_dataset, load_metric

datasets = load_dataset("conll2003")

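Given the key of a data split (train, validation, or test) and an index, you can inspect a single example.
All labels are already encoded as integers, so a pretrained transformer model can use them directly. The class names that these integers stand for are stored in features.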
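As a small illustration of the integer encoding, the sketch below maps tag ids back to names with a plain list. The label order is the one I believe the conll2003 NER tags use, but treat it as an assumption and check datasets["train"].features["ner_tags"].feature.names in a real run. The example row is hand-written in the shape of a dataset row, and decode_tags is a hypothetical helper.

```python
# Assumed label order for the CoNLL-2003 NER tags; verify against
# datasets["train"].features["ner_tags"].feature.names in a real run.
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# A hand-written example in the same shape as one dataset row.
example = {
    "tokens": ["EU", "rejects", "German", "call", "."],
    "ner_tags": [3, 0, 7, 0, 0],
}


def decode_tags(tag_ids, names):
    """Map integer class ids back to their string names."""
    return [names[i] for i in tag_ids]


decoded = decode_tags(example["ner_tags"], label_list)
for token, tag in zip(example["tokens"], decoded):
    print(f"{token:10s} {tag}")
```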

To get a feel for what the data looks like, the following function displays a few randomly chosen examples:

from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

1.2 Preprocessing the data

The preprocessing tool is the Tokenizer.

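from transformers import AutoTokenizer

# model_checkpoint is not defined in this post; set it to the name of a
# pretrained checkpoint that has a fast tokenizer before running this line.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)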

import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

The tokenizer can preprocess either a single text or a pair of texts, and its output matches the input format the pretrained model expects.

tokenizer("Hello, this is one sentence!")

tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)

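Note that the tokenizer may split a word further into subword tokens, so the tokenized sequence can be longer than the original list of words.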

example = datasets["train"][4]
print(example["tokens"])
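Because words can be split into several subwords, the word-level labels no longer line up one-to-one with the model's input tokens. A minimal sketch of one common way to realign them follows. It assumes you already have the per-token word indices that a fast tokenizer exposes through word_ids() on its output; the function name align_labels, the label_all_tokens flag, and the sample values are illustrative, not taken from this post.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to subword tokens.

    word_ids: one entry per subword token, giving the index of the word it
    came from, or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens get -100 so the loss function ignores them.
            aligned.append(-100)
        elif word_idx != previous_word:
            # The first subword of a word carries the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later subwords either repeat the label or are ignored.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word = word_idx
    return aligned


# Hypothetical tokenization: word 1 was split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]
print(align_labels(word_ids, word_labels))
print(align_labels(word_ids, word_labels, label_all_tokens=False))
```

Whether to label every subword or only the first one is a design choice: labeling all of them gives the model more supervised positions, while labeling only the first keeps one prediction per word.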

1.3 Fine-tuning the pretrained model

As with the tokenizer, the from_pretrained method downloads and loads the model for us.

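from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# label_list (the list of tag names) is not defined in this post; it must hold
# one entry per class so that num_labels matches the dataset.
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))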

TrainingArguments holds all the attributes that configure training:

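# task and batch_size are not defined in this post and must be set beforehand.
args = TrainingArguments(
    f"test-{task}",                  # output directory
    evaluation_strategy = "epoch",   # evaluate after every epoch (recent transformers releases rename this to eval_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)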

A data collator is needed to batch the processed examples and pad them before they are fed to the model.

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
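To show roughly what the collator has to do for token classification, here is a simplified pure-Python stand-in. It is a sketch, not the library's implementation: the real collator returns tensors and takes the padding id from the tokenizer. The function name pad_batch and the token ids below are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids and labels in a batch to the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded label positions use -100 so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 3, 0, 7, -100]},
    {"input_ids": [101, 5, 102], "labels": [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch["labels"])
```

The key point is that labels must be padded along with the inputs, which a generic padding collator would not do.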

The compute_metrics function post-processes the predictions and aggregates the evaluation results:

import numpy as np

# metric is not defined in this post; it must be loaded beforehand (load_metric
# is imported in section 1.1) and return the overall_* keys used below.
def compute_metrics(p):
    predictions, labels = p
    # Pick the highest-scoring class for every token.
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
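The argmax and filtering steps are easier to follow on a toy input. The sketch below reproduces them for a single sequence with plain lists instead of NumPy arrays, so it runs without dependencies. The three-class label set, the logits, and the labels are invented for illustration, and it computes only token-level accuracy; the precision, recall, and F1 in the function above come from whatever object is loaded into metric.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


label_list = ["O", "B-PER", "I-PER"]  # hypothetical three-class label set

# Toy logits for one sequence of four tokens: shape (tokens, classes).
logits = [
    [2.0, 0.1, 0.3],  # special token, its label is -100
    [0.2, 3.1, 0.4],
    [0.1, 0.2, 1.9],
    [1.5, 0.3, 0.2],
]
labels = [-100, 1, 2, 1]

pred_ids = [argmax(row) for row in logits]
# Keep only positions with a real label, mirroring the -100 filter above.
kept = [(label_list[p], label_list[l]) for p, l in zip(pred_ids, labels) if l != -100]
true_predictions = [p for p, _ in kept]
true_labels = [l for _, l in kept]

accuracy = sum(p == l for p, l in kept) / len(kept)
print(true_predictions, true_labels, accuracy)
```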

Summary

I feel like I only skimmed the surface this time [facepalm]

References

Datawhale, 基于transformers的自然语言处理(NLP入门) (Natural Language Processing with transformers: NLP Introduction)
