Deep Learning Series 35: Getting Started with the transformers Library

IE06

Last modified: 2024-02-25 15:49:46

Category: Deep Learning Series. Tags: deep learning, transformer, artificial intelligence

First published: 2022-05-31 15:25:48

Article link: https://blog.csdn.net/kittyzc/article/details/125010178

1. Introduction

First, install the library: pip install transformers
A list of the pretrained models available for each language can be found at: https://huggingface.co/languages

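2. Pipeline examples

The simplest way to use the library is a ready-made pipeline. Behind the scenes it chains preprocessing (tokenization), the model forward pass, and postprocessing.
Models can be found on the Hugging Face Hub. Take sentiment analysis as an example: the default pipeline handles English, so what do we do for Chinese text?
First, search the model hub for a suitable model (the Tasks and Languages filters on the left narrow the results), then load it directly: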

from transformers import BertForSequenceClassification
from transformers import BertTokenizer
import torch

tokenizer=BertTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment')
model=BertForSequenceClassification.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment')

text='今天心情不好'

output=model(torch.tensor([tokenizer.encode(text)]))
print(torch.nn.functional.softmax(output.logits,dim=-1))

The tokenizer and model can be saved as follows:

pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

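2.1 Overview

The available pipeline tasks include:
"audio-classification": audio classification
"automatic-speech-recognition": speech recognition
"conversational": dialogue
"feature-extraction": feature extraction
"fill-mask": masked-token filling
"image-classification": image classification
"question-answering": question answering
"table-question-answering": table question answering
"text2text-generation": text-to-text generation
"text-classification" (alias "sentiment-analysis"): text classification
"text-generation": text generation
"token-classification" (alias "ner"): token classification
"translation": translation
"translation_xx_to_yy": translation from language xx to language yy
"summarization": summarization
"zero-shot-classification": zero-shot classification

(Figure: the components a pipeline loads.)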

2.2 Sentiment analysis

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to introduce pipeline to the transformers repository.')

2.3 Question answering

from transformers import pipeline
question_answerer = pipeline('question-answering')
question_answerer({ 'question': 'What is the name of the repository ?', 'context': 'Pipeline has been included in the huggingface/transformers repository'})

2.4 Speech recognition

from transformers import pipeline
from datasets import load_dataset, Audio

# Create the pipeline first: its feature extractor defines the expected sampling rate
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# Resample the audio to the rate the model expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
result = speech_recognizer([a["array"] for a in dataset[:4]["audio"]])

2.5 Text generation

from transformers import pipeline
generator = pipeline(task="text-generation")
generator("Eight people were kill at party in California.")

2.6 Image classification

from transformers import pipeline
vision_classifier = pipeline(task="image-classification")
vision_classifier(images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")

3. Common methods

3.1 Encoding (tokenizer or feature extractor)

Text input requires a tokenizer, which converts text into a dictionary of model inputs, for example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
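print(encoding)
# Sample output; the IDs shown come from a larger multilingual vocabulary, so bert-base-uncased (30522 tokens) prints different IDs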
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Audio and images are handled in the same way, using a feature extractor:

from transformers import AutoFeatureExtractor
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

AutoProcessor can also be used; it handles both text and images:

from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

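A tokenizer can convert between strings, tokens, and IDs, for example:

tokens = tokenizer.tokenize("We are very happy")  # string -> tokens
ids = tokenizer.convert_tokens_to_ids(tokens)     # tokens -> IDs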

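Besides converting a string to IDs, encode() also adds the special tokens the model needs; the BERT tokenizer, for example, adds [CLS] at the start of the sequence and [SEP] at the end.
In practice, the most common approach is to call the tokenizer directly: it returns not only the token IDs but also the other inputs the model needs. The BERT tokenizer, for example, also returns token_type_ids and attention_mask.
There are three ways to handle long sequences:
1) Use a Transformer model that supports long inputs, such as Longformer (up to 4096 tokens) or LED (up to 16384 tokens);
2) Set a maximum length max_sequence_length and truncate the input: sequence = sequence[:max_sequence_length];
3) Split the long text into shorter chunks and encode each chunk separately.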
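
Methods 2) and 3) can be sketched in plain Python. This is a minimal illustration on a list of token IDs, not the library's implementation: chunk_token_ids and its stride argument (the overlap between neighbouring chunks) are hypothetical names introduced here. Fast tokenizers offer comparable behaviour through return_overflowing_tokens=True together with stride.

```python
def chunk_token_ids(token_ids, max_length, stride=0):
    """Split token_ids into windows of at most max_length that overlap by stride."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks


ids = list(range(10))          # stand-in for real token IDs
truncated = ids[:4]            # method 2: plain truncation
chunks = chunk_token_ids(ids, max_length=4, stride=1)  # method 3: overlapping chunks
print(truncated)
print(chunks)
```

The overlap keeps some shared context between neighbouring chunks, which matters when an answer or entity straddles a chunk boundary.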

New tokens can be added with the following code:

new_tokens = ["new_token1", "my_new-token2"]
new_tokens = set(new_tokens) - set(tokenizer.vocab.keys())
tokenizer.add_tokens(list(new_tokens))

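After adding new tokens to the vocabulary, the model's embedding matrix must be resized, i.e. embeddings for the new tokens must be appended to the matrix, before the model can work correctly. This is done with the resize_token_embeddings() function: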

model.resize_token_embeddings(len(tokenizer))

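New tokens can be initialized with the embedding of an existing token. For example, if the added tokens are the entity markers [ENT_START] and [ENT_END], both can be initialized with the embedding of the token "entity":

import torch

token_id = tokenizer.convert_tokens_to_ids('entity')
token_embedding = model.embeddings.word_embeddings.weight[token_id]
print(token_id)

num_added_toks = len(new_tokens)  # number of tokens actually added to the vocabulary

with torch.no_grad():
    for i in range(1, num_added_toks+1):
        # The new rows sit at the end of the embedding matrix
        model.embeddings.word_embeddings.weight[-i, :] = token_embedding.clone().detach().requires_grad_(True)
print(model.embeddings.word_embeddings.weight[-2:, :])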

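A more refined approach initializes each new token according to its meaning. For the example above, we can write one description for [ENT_START] and one for [ENT_END], then initialize each new embedding with the mean embedding of the tokens in its description: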

descriptions = ['start of entity', 'end of entity']

with torch.no_grad():
    for i, token in enumerate(reversed(descriptions), start=1):
        tokenized = tokenizer.tokenize(token)
        print(tokenized)
        tokenized_ids = tokenizer.convert_tokens_to_ids(tokenized)
        new_embedding = model.embeddings.word_embeddings.weight[tokenized_ids].mean(axis=0)
        model.embeddings.word_embeddings.weight[-i, :] = new_embedding.clone().detach().requires_grad_(True)
print(model.embeddings.word_embeddings.weight[-2:, :])

3.2 Loading models

Use AutoModel or one of the task-specific classes with the following suffixes:
*ForCausalLM
*ForMaskedLM
*ForMultipleChoice
*ForQuestionAnswering
*ForSequenceClassification
*ForTokenClassification
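...
from_pretrained() first looks for the model in the local cache; if it is not found, the model is downloaded from the Hugging Face Hub and cached under ~/.cache/huggingface/transformers (~/.cache/huggingface/hub in recent versions). The cache directory can be changed through the HF_HOME environment variable:

import torch
from transformers import AutoModelForCausalLM

# device_map='mps' targets Apple Silicon GPUs; trust_remote_code=True runs the model code shipped with the checkpoint
model = AutoModelForCausalLM.from_pretrained('openbmb/MiniCPM-2B-dpo-fp16', torch_dtype=torch.float16, device_map='mps', trust_remote_code=True)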

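Sometimes you also need to work with the config directly:

from transformers import BertConfig, BertModel
config = BertConfig()
# Built from a config only, so the weights are randomly initialized rather than pretrained
model = BertModel(config)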

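If the Hugging Face Hub is unreachable, download the files to a local directory by other means. Usually only the following are needed: the model's config.json (model architecture) and pytorch_model.bin (model weights), plus the tokenizer's special_tokens_map.json (mapping of special tokens such as the unknown token), tokenizer_config.json (parameters needed to build the tokenizer), and vocab.txt (the vocabulary, one token per line, where the line number is the token ID).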
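
Before pointing from_pretrained() at a local directory, it can help to check that the files listed above are actually there. The helper below is a small sketch using only the standard library; missing_model_files and the directory name are hypothetical, and the file list is simply the one from the paragraph above (newer checkpoints may ship other files instead, such as model.safetensors or tokenizer.json).

```python
from pathlib import Path

# File names taken from the list above; adjust for checkpoints that use other formats
REQUIRED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.txt",
]


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


# Hypothetical local directory holding a manually downloaded checkpoint
missing = missing_model_files("./bert-base-chinese")
print("missing files:", missing)
```

If the list is empty, the same directory path can be passed to from_pretrained() in place of the model name.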

3.3 Model inference

The model's direct output is the logits, which can be obtained as follows:

outputs = model(**inputs)

3.4 Streaming output

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model = AutoModelForCausalLM.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
input_text = "昨天已经过去，"
inputs = tokenizer([input_text], return_tensors="pt", add_special_tokens=False)
streamer = TextStreamer(tokenizer)

# Despite returning the usual output, the streamer will also print the generated text to stdout.
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=86)

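4. Postprocessing

Postprocessing converts the logits into the desired result.

4.1 Sequence labeling

For ordinary classification problems, a softmax layer is applied to the logits:

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

This step turns the output of the last layer into a probability distribution. If outputs has no logits attribute (a model loaded with plain AutoModel has no task head), it exposes outputs.last_hidden_state instead; those are hidden features, so class probabilities require a model with a classification head.
Take "dbmdz/bert-large-cased-finetuned-conll03-english" as an example: the input is a sequence of N tokens and the output has shape 1 × N × 9, i.e. the model produces a vector of 9 logits for every token (9-way classification). argmax gives the predicted label: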

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)

results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)

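Fast tokenizers keep track of the mapping between text and tokens. Passing return_offsets_mapping=True to the tokenizer returns, for each token, its character span in the original text (special tokens map to (0, 0)):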

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
offset_mapping = inputs_with_offsets["offset_mapping"]
print(offset_mapping)
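
The offsets make it possible to merge word pieces back into whole entities by slicing the original text. The sketch below is a simplified illustration, not the pipeline's own aggregation logic: group_entities is a hypothetical helper, and the offsets and labels are hand-written stand-ins for what a tokenizer and a token classification model might return.

```python
def group_entities(text, offsets, labels):
    """Merge consecutive tokens of the same entity type into character spans."""
    entities = []
    current = None
    for (start, end), label in zip(offsets, labels):
        is_special = start == end  # special tokens carry the offset (0, 0)
        if label == "O" or is_special:
            current = None
            continue
        entity_type = label.split("-")[-1]  # "I-ORG" -> "ORG"
        same_entity = (
            current is not None
            and current["entity_group"] == entity_type
            and not label.startswith("B-")
        )
        if same_entity:
            current["end"] = end  # extend the running span
        else:
            current = {"entity_group": entity_type, "start": start, "end": end}
            entities.append(current)
    for entity in entities:
        entity["word"] = text[entity["start"]:entity["end"]]
    return entities


# Hypothetical tokenization of the text: [CLS] Hu ##gging Face is in Brooklyn [SEP]
text = "Hugging Face is in Brooklyn"
offsets = [(0, 0), (0, 2), (2, 7), (8, 12), (13, 15), (16, 18), (19, 27), (0, 0)]
labels = ["O", "I-ORG", "I-ORG", "I-ORG", "O", "O", "I-LOC", "O"]

entities = group_entities(text, offsets, labels)
print(entities)
```

Slicing the original text with the merged offsets recovers the exact surface form, including spaces and casing, which joining decoded word pieces does not always do.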

4.2 Extractive question answering

First, we complete the QA task with the question-answering pipeline:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back Transformers?"
results = question_answerer(question=question, context=context)
print(results)

Output:
{'score': 0.9741130471229553, 'start': 76, 'end': 104, 'answer': 'Jax, PyTorch, and TensorFlow'}

The same result can be reproduced manually:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

context = """
Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back Transformers?"

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

We need to build a mask that hides the question tokens and [SEP] (while keeping [CLS]):

start_logits = outputs.start_logits
end_logits = outputs.end_logits

import torch

sequence_ids = inputs.sequence_ids()
mask = [i != 1 for i in sequence_ids]
mask[0] = False # Unmask the [CLS] token
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

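Next, compute the score of every candidate span as the product of its start and end probabilities, keeping only spans with start_index <= end_index (the upper triangle of the matrix), and take the highest-scoring one:

scores = start_probabilities[:, None] * end_probabilities[None, :]
scores = torch.triu(scores)

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]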

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start, _ = offsets[start_index]
_, end = offsets[end_index]

result = {
    "answer": context[start:end],
    "start": start,
    "end": end,
    "score": float(scores[start_index, end_index]),
}
print(result)
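
The span selection step can also be written without tensors, which makes the start <= end constraint explicit. This is a plain-Python sketch with made-up probabilities; best_span and max_answer_len are hypothetical names, and the length limit is an optional extra that the code above does not use.

```python
def best_span(start_probs, end_probs, max_answer_len=None):
    """Return (start, end, score) maximizing start_probs[i] * end_probs[j] with i <= j."""
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(end_probs)):
            if max_answer_len is not None and j - i + 1 > max_answer_len:
                break  # longer spans starting at i are not allowed
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up probabilities over three token positions
start_probs = [0.1, 0.6, 0.3]
end_probs = [0.7, 0.1, 0.2]

start_index, end_index, score = best_span(start_probs, end_probs)
print(start_index, end_index, score)
```

Without the constraint, the largest product would pair start position 1 with end position 0, which is not a valid span; restricting the search to i <= j plays the same role as torch.triu.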

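4.3 Sequence-to-sequence

Reference: https://www.likecs.com/show-308663700.html
Models built with AutoModelForSeq2SeqLM also wrap the decoder's decoding loop: calling the model's generate() function produces the predicted tokens one by one automatically. For example, a pretrained Marian model can be called directly for translation: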

import torch
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model = model.to(device)

sentence = '我叫张三，我住在苏州。'

sentence_inputs = tokenizer(sentence, return_tensors="pt").to(device)
sentence_generated_tokens = model.generate(
    sentence_inputs["input_ids"],
    attention_mask=sentence_inputs["attention_mask"],
    max_length=128
)
sentence_decoded_pred = tokenizer.decode(sentence_generated_tokens[0], skip_special_tokens=True)
print(sentence_decoded_pred)

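In this article, we use the generate() function built into AutoModelForSeq2SeqLM to decode the translation with beam search (using the model's default decoding parameters). In fact, every generative model in the Transformers library decodes through generate(); only the arguments passed to it differ.

Below we briefly introduce the most common decoding strategies. For convenience, all of them are demonstrated with GPT-2.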

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

4.3.1 Greedy search

Greedy search selects the most probable word at every time step.
Below we use GPT-2 with greedy search to continue the context ("I", "enjoy", "walking", "with", "my", "cute", "dog"):

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll

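The model does generate a short text, but it quickly starts repeating itself. This is a common problem in language generation, especially with greedy search and beam search.

The main weakness of greedy search is that it always picks the currently most probable word, which is only a locally optimal choice, so the generated sequence is often not globally optimal.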
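
A toy example makes the gap between local and global optima concrete. The next-word table below is invented for illustration (it is not produced by GPT-2), and greedy_decode and exhaustive_decode are hypothetical helpers.

```python
# Invented next-word probabilities, keyed by the sequence generated so far
NEXT_WORD_PROBS = {
    ("The",): {"dog": 0.4, "nice": 0.5, "car": 0.1},
    ("The", "dog"): {"and": 0.05, "runs": 0.05, "has": 0.9},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_decode(prefix, steps):
    """Pick the most probable next word at every step."""
    sequence, prob = tuple(prefix), 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        word = max(candidates, key=candidates.get)
        prob *= candidates[word]
        sequence += (word,)
    return sequence, prob


def exhaustive_decode(prefix, steps):
    """Score every possible continuation and return the most probable one."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        beams = [
            (seq + (word,), prob * p)
            for seq, prob in beams
            for word, p in NEXT_WORD_PROBS[seq].items()
        ]
    return max(beams, key=lambda item: item[1])


greedy_seq, greedy_prob = greedy_decode(["The"], steps=2)
best_seq, best_prob = exhaustive_decode(["The"], steps=2)
print(greedy_seq, greedy_prob)
print(best_seq, best_prob)
```

Greedy search commits to "nice" because it is the best first word, and so never reaches "The dog has", whose overall probability is higher. Exhaustive search is only feasible on toy tables like this one.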

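4.3.2 Beam search

By keeping several branches at every time step, beam search mitigates the local-optimum problem of greedy search, but it still cannot guarantee finding the globally optimal sequence.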
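
The idea fits in a few lines of plain Python. This is a simplified sketch over an invented next-word table, not the implementation behind generate(): beam_search is a hypothetical helper, and details such as length penalties and EOS handling are left out.

```python
import math

# Invented next-word probabilities, keyed by the sequence generated so far
NEXT_WORD_PROBS = {
    ("The",): {"dog": 0.4, "nice": 0.5, "car": 0.1},
    ("The", "dog"): {"and": 0.05, "runs": 0.05, "has": 0.9},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [(tuple(prefix), 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for word, p in NEXT_WORD_PROBS[seq].items():
                candidates.append((seq + (word,), score + math.log(p)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


one_beam = beam_search(["The"], steps=2, num_beams=1)   # same as greedy search
two_beams = beam_search(["The"], steps=2, num_beams=2)
print(one_beam[0][0])
print(two_beams[0][0])
```

With a single beam the search reduces to greedy decoding. With two beams, "The dog" survives the first step and the more probable "The dog has" is found, although sequences pruned early can still be missed in general.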

Below we again use GPT-2, this time with beam search. Setting num_beams > 1 enables it, and early_stopping=True ends generation as soon as all kept beams have reached the EOS token.

# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll

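Although beam search produces a more fluent sequence, the output still contains repeated fragments. The simplest fix is an n-gram penalty, which at every time step sets to 0 the probability of any word that would create an already-seen n-gram. For example, adding no_repeat_ngram_size=2 guarantees that no 2-gram appears twice in the generated sequence.
The n-gram penalty mitigates repetition but should be used with care: for an article about "New York", a 2-gram penalty is inappropriate, because "New York" could then appear only once in the whole text.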
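
The penalty amounts to computing, at each step, which next tokens would complete an n-gram that has already been generated. The function below is an illustrative sketch on word lists; banned_next_tokens is a hypothetical name, not the library's internal routine.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already present in generated."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["I", "love", "New", "York", "and", "New"]
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)
```

Here "York" is banned after the second "New", which is exactly the "New York" problem described above: the penalty cannot tell harmful repetition from a name that legitimately recurs.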

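4.3.3 Random sampling

Setting do_sample=True in generate() enables random sampling, where the next word is drawn from the model's conditional distribution instead of being chosen deterministically.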

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

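The output looks fine at first glance, but on closer reading it is not very coherent. This is a common problem of sampled text: the model often produces fragments that do not fit together. One remedy is to lower the softmax temperature so that the distribution becomes sharper, i.e. high-probability words become more likely and low-probability words less likely. Applying a lower temperature to the example above gives:

(Figure: the next-word distribution after lowering the temperature.)
With the sharper conditional distribution at the first time step, the word "car" is almost never selected. Setting temperature in generate() is all that is needed to cool the distribution.
With temperature=0.6 the generated text is more coherent. Lowering the temperature reduces the randomness of the distribution; as temperature approaches 0, sampling becomes equivalent to greedy decoding.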
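
The effect of the temperature is easy to verify numerically. The sketch below divides the logits by the temperature before the softmax; the three-word distribution is invented for illustration and softmax_with_temperature is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; lower temperatures sharpen the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented next-word distribution for the words "nice", "dog", "car"
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]

base = softmax_with_temperature(logits, temperature=1.0)
cooled = softmax_with_temperature(logits, temperature=0.6)
print([round(p, 3) for p in base])
print([round(p, 3) for p in cooled])
```

At temperature 0.6 the most likely word gains probability mass while the least likely one loses most of its share, which is the sharpening described above.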

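Top-K sampling is enabled by setting top_k=10 in generate(): at each step only the K most likely words are kept and the probability mass is redistributed among them. Top-p (nucleus) sampling refines Top-K: at each step it samples only from the smallest set of words whose cumulative probability exceeds p, and redistributes the probability mass among them. The size of the candidate set therefore grows and shrinks dynamically with the conditional distribution of the next word. Setting 0 < top_p < 1 in generate() activates Top-p sampling.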
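
Both filters can be sketched on a small dictionary of word probabilities. The numbers are invented, top_k_filter and top_p_filter are hypothetical helpers, and the library applies the same idea to logits over the whole vocabulary rather than to a dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[word] for word in keep)
    return {word: probs[word] / total for word in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for word in sorted(probs, key=probs.get, reverse=True):
        kept.append(word)
        cumulative += probs[word]
        if cumulative >= p:
            break
    return {word: probs[word] / cumulative for word in kept}


# Invented next-word distribution
probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "house": 0.05}

top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)
print(top2)
print(nucleus)
```

With a flatter distribution the nucleus would contain more words and with a peaked one fewer, whereas Top-K always keeps exactly K candidates.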

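5. Fine-tuning models

5.1 Fine-tuning a classification model

Below is an example of a custom model:

import torch
from torch import nn
from transformers import AutoModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

# Assumption: the original does not name the checkpoint; any BERT-base checkpoint with hidden size 768 works here
checkpoint = "bert-base-chinese"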

class BertForPairwiseCLS(nn.Module):
    def __init__(self):
        super(BertForPairwiseCLS, self).__init__()
        self.bert_encoder = AutoModel.from_pretrained(checkpoint)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(768, 2)

    def forward(self, x):
        bert_output = self.bert_encoder(**x)
        cls_vectors = bert_output.last_hidden_state[:, 0, :]
        cls_vectors = self.dropout(cls_vectors)
        logits = self.classifier(cls_vectors)
        return logits

model = BertForPairwiseCLS().to(device)
print(model)

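Alternatively, subclass the library's predefined BertPreTrainedModel:

import torch
from torch import nn
from transformers import AutoConfig
from transformers import BertPreTrainedModel, BertModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

# Assumption: the original does not name the checkpoint; bert-base-chinese is used for illustration
checkpoint = "bert-base-chinese"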

class BertForPairwiseCLS(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, 2)
        self.post_init()
    
    def forward(self, x):
        bert_output = self.bert(**x)
        cls_vectors = bert_output.last_hidden_state[:, 0, :]
        cls_vectors = self.dropout(cls_vectors)
        logits = self.classifier(cls_vectors)
        return logits

config = AutoConfig.from_pretrained(checkpoint)
model = BertForPairwiseCLS.from_pretrained(checkpoint, config=config).to(device)
print(model)

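5.2 Fine-tuning a seq2seq model

By default the tokenizer encodes text with the source-language settings; to encode the target language, use the as_target_tokenizer() context manager (recent versions of transformers deprecate it in favor of passing text_target=... to the tokenizer):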

zh_sentence = train_data[0]["chinese"]
en_sentence = train_data[0]["english"]

inputs = tokenizer(zh_sentence)
with tokenizer.as_target_tokenizer():
    targets = tokenizer(en_sentence)

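For translation, the label sequence is the sequence of target-language token IDs. As in sequence labeling, the loss is computed between the predicted label sequence and the reference label sequence to update the model parameters, so the padding positions must again be set to -100 so that the cross-entropy loss ignores them.
Unlike the encoder-only models used in the earlier tasks, seq2seq models use an encoder-decoder architecture: the encoder encodes the input sequence and the decoder generates the output tokens one by one. Each sample therefore also needs decoder input IDs as the input to the decoder. The decoder input IDs are the label sequence shifted by one position, with a special start-of-sequence token added at the beginning.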
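
The relationship between labels and decoder input IDs can be shown on a single padded target. This is a plain-Python sketch with invented token IDs; mask_padding and shift_right are hypothetical helpers that only mirror the idea behind prepare_decoder_input_ids_from_labels().

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding(label_ids, pad_token_id):
    """Replace padding positions in the labels with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def shift_right(label_ids, pad_token_id, decoder_start_token_id):
    """Build decoder input IDs: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + list(label_ids[:-1])
    # The decoder must receive real token IDs, so ignored positions revert to padding
    return [pad_token_id if t == IGNORE_INDEX else t for t in shifted]


# Invented IDs for illustration: 0 = padding and decoder start, 1 = end of sequence
padded_target = [5, 6, 7, 1, 0, 0]

labels = mask_padding(padded_target, pad_token_id=0)
decoder_input_ids = shift_right(labels, pad_token_id=0, decoder_start_token_id=0)
print(labels)
print(decoder_input_ids)
```

At position i the decoder sees the tokens before i and is trained to predict labels[i]. Masking by pad ID alone would also hide a genuine token that shares that ID, which is one reason to mask everything after the EOS token instead.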
The complete code is as follows:

import random
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from torch.optim import AdamW  # transformers.AdamW is deprecated and removed in recent releases
from transformers import get_scheduler
from sacrebleu.metrics import BLEU
from tqdm.auto import tqdm
import json

def seed_everything(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')
seed_everything(42)

max_dataset_size = 220000
train_set_size = 200000
valid_set_size = 20000

max_input_length = 128
max_target_length = 128

batch_size = 32
learning_rate = 1e-5
epoch_num = 3

class TRANS(Dataset):
    def __init__(self, data_file):
        self.data = self.load_data(data_file)
    
    def load_data(self, data_file):
        Data = {}
        with open(data_file, 'rt', encoding='utf-8') as f:
            for idx, line in enumerate(f):
                if idx >= max_dataset_size:
                    break
                sample = json.loads(line.strip())
                Data[idx] = sample
        return Data
    
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

data = TRANS('data/translation2019zh/translation2019zh_train.json')
train_data, valid_data = random_split(data, [train_set_size, valid_set_size])
test_data = TRANS('data/translation2019zh/translation2019zh_valid.json')

model_checkpoint = "Helsinki-NLP/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model = model.to(device)

def collote_fn(batch_samples):
    batch_inputs, batch_targets = [], []
    for sample in batch_samples:
        batch_inputs.append(sample['chinese'])
        batch_targets.append(sample['english'])
    batch_data = tokenizer(
        batch_inputs, 
        padding=True, 
        max_length=max_input_length,
        truncation=True, 
        return_tensors="pt"
    )
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            batch_targets, 
            padding=True, 
            max_length=max_target_length,
            truncation=True, 
            return_tensors="pt"
        )["input_ids"]
        batch_data['decoder_input_ids'] = model.prepare_decoder_input_ids_from_labels(labels)
        end_token_index = torch.where(labels == tokenizer.eos_token_id)[1]
        for idx, end_idx in enumerate(end_token_index):
            labels[idx][end_idx+1:] = -100
        batch_data['labels'] = labels
    return batch_data

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collote_fn)
valid_dataloader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=collote_fn)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False, collate_fn=collote_fn)

def train_loop(dataloader, model, optimizer, lr_scheduler, epoch, total_loss):
    progress_bar = tqdm(range(len(dataloader)))
    progress_bar.set_description(f'loss: {0:>7f}')
    finish_batch_num = (epoch-1) * len(dataloader)
    
    model.train()
    for batch, batch_data in enumerate(dataloader, start=1):
        batch_data = batch_data.to(device)
        outputs = model(**batch_data)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        total_loss += loss.item()
        progress_bar.set_description(f'loss: {total_loss/(finish_batch_num + batch):>7f}')
        progress_bar.update(1)
    return total_loss

bleu = BLEU()

def test_loop(dataloader, model):
    preds, labels = [], []
    
    model.eval()
    for batch_data in tqdm(dataloader):
        batch_data = batch_data.to(device)
        with torch.no_grad():
            generated_tokens = model.generate(
                batch_data["input_ids"],
                attention_mask=batch_data["attention_mask"],
                max_length=max_target_length,
            ).cpu().numpy()
        label_tokens = batch_data["labels"].cpu().numpy()
        
        decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        label_tokens = np.where(label_tokens != -100, label_tokens, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(label_tokens, skip_special_tokens=True)

        preds += [pred.strip() for pred in decoded_preds]
        labels += [label.strip() for label in decoded_labels]
    # sacrebleu expects a list of reference streams, each aligned with the predictions
    bleu_score = bleu.corpus_score(preds, [labels]).score
    print(f"BLEU: {bleu_score:>0.2f}\n")
    return bleu_score

optimizer = AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=epoch_num*len(train_dataloader),
)

total_loss = 0.
best_bleu = 0.
for t in range(epoch_num):
    print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
    total_loss = train_loop(train_dataloader, model, optimizer, lr_scheduler, t+1, total_loss)
    valid_bleu = test_loop(valid_dataloader, model)
    if valid_bleu > best_bleu:
        best_bleu = valid_bleu
        print('saving new weights...\n')
        torch.save(
            model.state_dict(), 
            f'epoch_{t+1}_valid_bleu_{valid_bleu:0.2f}_model_weights.bin'
        )
print("Done!")

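5.3 Prompt learning

The core idea of prompting is to use a template to recast a problem into a form similar to the pretraining task.

For example, to classify the news headline "American Duo Wins Opening Beach Volleyball Match", we can apply the template "This is a [MASK] News: " to turn it into "This is a [MASK] News: American Duo Wins Opening Beach Volleyball Match", feed it to a model pretrained with MLM (masked language modeling) to predict the masked word, and finally map that word to a news category (for example, "Sports" maps to the sports category).
Below we take sentiment analysis as an example and use the Transformers library to build a prompt-based model by hand.
The dataset is the Chinese sentiment analysis corpus ChnSentiCorp, which contains close to ten thousand online reviews of various kinds and can be downloaded from Baidu's ERNIE example repository.
The corpus is already split into training, validation, and test sets (9600, 1200, and 1200 reviews respectively). Each line is one sample, with the review and its label separated by a TAB; "0" means negative and "1" means positive.
The core pieces are the functions for the template and the verbalizer: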

def get_prompt(x):
    prompt = f'总体上来说很[MASK]。{x}'
    return {
        'prompt': prompt, 
        'mask_offset': prompt.find('[MASK]')
    }

def get_verbalizer(tokenizer):
    return {
        'pos': {'token': '好', 'id': tokenizer.convert_tokens_to_ids("好")}, 
        'neg': {'token': '差', 'id': tokenizer.convert_tokens_to_ids("差")}
    }

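This approach, however, requires finding a suitable label word in the vocabulary for every class, and each label word must be a single token, which is often impossible. Another common approach is therefore to create a learnable virtual token (also called a pseudo token) for each class, initialize its representation from a description of the class, and extend the model's MLM head with these virtual tokens.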

def get_verbalizer(tokenizer):
    return {
        'pos': {
            'token': '[POS]', 'id': tokenizer.convert_tokens_to_ids("[POS]"), 
            'description': '好的、优秀的、正面的评价、积极的态度'
        }, 
        'neg': {
            'token': '[NEG]', 'id': tokenizer.convert_tokens_to_ids("[NEG]"), 
            'description': '差的、糟糕的、负面的评价、消极的态度'
        }
    }

tokenizer.add_special_tokens({'additional_special_tokens': ['[POS]', '[NEG]']})

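For the MLM task, we can use the AutoModelForMaskedLM class from the Transformers library directly. Because BERT was pretrained on MLM, the template even lets us predict sentiment polarity without any fine-tuning (zero-shot). For example, for the first sample:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)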

text = "总体上来说很[MASK]。这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般。"
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
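
To turn the [MASK] prediction into a class, the verbalizer only needs the logits of its label words at the mask position. The sketch below shows that last step in plain Python; classify_with_verbalizer is a hypothetical helper, and the token IDs and logits are invented stand-ins for tokenizer.convert_tokens_to_ids() results and one row of the model's logits.

```python
def classify_with_verbalizer(mask_logits, verbalizer):
    """Compare the logits of the label words at the [MASK] position."""
    scores = {label: mask_logits[info["id"]] for label, info in verbalizer.items()}
    prediction = max(scores, key=scores.get)
    return prediction, scores


# Invented token IDs for the label words
verbalizer = {
    "pos": {"token": "好", "id": 3},
    "neg": {"token": "差", "id": 5},
}
# Invented logits over a six-token vocabulary at the [MASK] position
mask_logits = [0.1, -1.2, 0.3, 2.0, 0.0, 4.5]

prediction, scores = classify_with_verbalizer(mask_logits, verbalizer)
print(prediction, scores)
```

Only the label-word logits are compared, so a word outside the verbalizer can have the highest logit overall without affecting the predicted class.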