1 序列标注 (token级的分类问题)
NER (Named-entity recognition 名词 实体识别) 分辩出文本中的名词和实体
POS (Part-of-speech tagging 词性标注) 根据语法对token 进行词表型标注(noun名词, verb 动词adjective形容词 …)
Chunk (Chunking短语组块) 将同一个短语的tokens 组块 放在一起
1.1 加载数据
from datasets import load_dataset, load_metric
datasets = load_dataset("conll2003")
给定一个数据切分的key(train、validation或者test)和下标即可查看数据
所有的数据标签labels都已经被编码成了整数,可以直接被预训练transformer模型使用。这些整数的编码所对应的实际类别储存在features中。
理解数据长什么样子 举几个例子
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
picks = []
for _ in range(num_examples):
pick = random.randint(0, len(dataset)-1)
while pick in picks:
pick = random.randint(0, len(dataset)-1)
picks.append(pick)
df = pd.DataFrame(dataset[picks])
for column, typ in dataset.features.items():
if isinstance(typ, ClassLabel):
df[column] = df[column].transform(lambda i: typ.names[i])
elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
display(HTML(df.to_html()))
1.2 预处理数据
预处理的工具叫Tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
tokenizer既可以对单个文本进行预处理,也可以对一对文本进行预处理,tokenizer预处理后得到的数据满足预训练模型输入格式
tokenizer("Hello, this is one sentence!")
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)
word 被 tokenizer 继续切分
example = datasets["train"][4]
print(example["tokens"])
1.3 微调预训练模型
与tokenizer 相似 from_pretrained 方法同样可以帮助我们下载并加载模型
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
所有属性
args = TrainingArguments(
f"test-{task}",
evaluation_strategy = "epoch",
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=3,
weight_decay=0.01,
)
需要一个数据收集器 data collator 处理好喂给模型
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
用函数聚合并起来
import numpy as np
def compute_metrics(p):
predictions, labels = p
predictions = np.argmax(predictions, axis=2)
# Remove ignored index (special tokens)
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
true_labels = [
[label_list[l] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
results = metric.compute(predictions=true_predictions, references=true_labels)
return {
"precision": results["overall_precision"],
"recall": results["overall_recall"],
"f1": results["overall_f1"],
"accuracy": results["overall_accuracy"],
}
总结
感觉自己走马观花了一篇 [捂脸]