Getting Started with the Transformers Library for NLP Tasks

0 The library

The relevant resources are listed below:

  • Library GitHub repository: https://github.com/huggingface/transformers
  • Official documentation: https://huggingface.co/docs/transformers/index
  • Pretrained model downloads: https://huggingface.co/models

Both PyTorch and TensorFlow are supported, but PyTorch is recommended; all of the code below is based on PyTorch.

Libraries to install:

  • pytorch
  • transformers
  • datasets
  • evaluate

1 pipeline

1.1 Introduction

Existing models and checkpoints can be used directly for a given task, such as sentiment classification, text generation, named entity recognition, or question answering.

Supported tasks

Task | Description | Modality | Pipeline identifier
--- | --- | --- | ---
Text classification | assign a label to a given sequence of text | NLP | pipeline(task="sentiment-analysis")
Text generation | generate text that follows a given prompt | NLP | pipeline(task="text-generation")
Named entity recognition | assign a label to each token in a sequence (people, organization, location, etc.) | NLP | pipeline(task="ner")
Question answering | extract an answer from the text given some context and a question | NLP | pipeline(task="question-answering")
Fill-mask | predict the correct masked token in a sequence | NLP | pipeline(task="fill-mask")
Summarization | generate a summary of a sequence of text or document | NLP | pipeline(task="summarization")
Translation | translate text from one language into another | NLP | pipeline(task="translation")
Image classification | assign a label to an image | Computer vision | pipeline(task="image-classification")
Image segmentation | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")
Object detection | predict the bounding boxes and classes of objects in an image | Computer vision | pipeline(task="object-detection")
Audio classification | assign a label to an audio file | Audio | pipeline(task="audio-classification")
Automatic speech recognition | extract speech from an audio file into text | Audio | pipeline(task="automatic-speech-recognition")
Visual question answering | given an image and a question, correctly answer a question about the image | Multimodal | pipeline(task="vqa")
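
The other tasks in the table are used in exactly the same way; only the task identifier changes. A minimal sketch (the example sentences here are made up, and the library downloads a default checkpoint for each task):

from transformers import pipeline

# named entity recognition: labels tokens as person, organization, location, ...
ner = pipeline(task="ner")
print(ner("Hugging Face is based in New York City."))

# fill-mask: predicts the masked token; the mask token string depends on the default checkpoint
unmasker = pipeline(task="fill-mask")
print(unmasker(f"Paris is the capital of {unmasker.tokenizer.mask_token}."))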

1.2 Using a pipeline for a task

1.2.1 Sentiment analysis

Default model
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

Test the sentiment of the sentence "We are very happy to show you the 🤗 Transformers library.":

classifier("We are very happy to show you the 🤗 Transformers library.")

Output:

[{'label': 'POSITIVE', 'score': 0.9998}]

Test the sentiment of a batch of sentences:

results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Output:

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309

Custom tokenizer and model

See Section 2, "Loading pretrained models", for details.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# the model will be downloaded and cached
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Test the sentiment of the sentence "We are very happy to show you the 🤗 Transformers library.":

classifier("We are very happy to show you the 🤗 Transformers library.")

Output:

[{'label': '5 stars', 'score': 0.772534966468811}]

Test the sentiment of a batch of sentences:

results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Output:

label: 5 stars, with score: 0.7725
label: 5 stars, with score: 0.2365

1.3 Question answering (QA)

Given a context passage, ask the model questions about it:

from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

The model answers the questions below.

Question: question="What is extractive question answering?"

result = question_answerer(question="What is extractive question answering?", context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)},  start: {result['start']},  end: {result['end']}")

Output:

Answer: 'the task of extracting an answer from a text given a question', score: 0.6177,  start: 34,  end: 95

Question: question="What is a good example of a question answering dataset?"

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)

print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Output:

Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160

2 Loading pretrained models

BERT refers to the model architecture; model_name refers to a set of pretrained weights (a checkpoint) loaded into that architecture.

from transformers import BertTokenizer
from transformers import BertModel
from transformers import BertConfig

  
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
config = BertConfig.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

Normally, the checkpoint is downloaded and cached locally. To specify where models are cached, set:

import os
os.environ['TRANSFORMERS_CACHE'] = './cache'  # cache directory
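
Note that this environment variable is normally read when transformers is first imported, so set it before the import. Alternatively, the cache location can be passed per call through the cache_dir argument of from_pretrained; a small sketch:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased", cache_dir="./cache")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", cache_dir="./cache")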

The .from_pretrained() method loads a pretrained model, so there is no need to train from scratch.

The library currently supports a large number of model architectures; a small excerpt:

albert — AlbertConfig (ALBERT model)
bart — BartConfig (BART model)
beit — BeitConfig (BEiT model)
bert — BertConfig (BERT model)
bert-generation — BertGenerationConfig (Bert Generation model)
big_bird — BigBirdConfig (BigBird model)
bigbird_pegasus — BigBirdPegasusConfig (BigBird-Pegasus model)
blenderbot — BlenderbotConfig (Blenderbot model)
blenderbot-small — BlenderbotSmallConfig (BlenderbotSmall model)
bloom — BloomConfig (BLOOM model)
camembert — CamembertConfig (CamemBERT model)
canine — CanineConfig (CANINE model)
clip — CLIPConfig (CLIP model)
codegen — CodeGenConfig (CodeGen model)
conditional_detr — ConditionalDetrConfig (Conditional DETR model)
convbert — ConvBertConfig (ConvBERT model)
convnext — ConvNextConfig (ConvNeXT model)
...

The Auto classes infer the model architecture automatically from the checkpoint you provide.
The most commonly used ones are AutoTokenizer, AutoModel, and AutoConfig.

from transformers import AutoTokenizer, AutoModel, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
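
The config object holds the architecture hyperparameters. As an illustrative sketch, it can be inspected, or used to build a model with the same architecture but randomly initialized weights (no pretraining):

from transformers import AutoConfig, BertModel

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_hidden_layers)  # 768 12 for bert-base

model = BertModel(config)  # same architecture, but randomly initialized weights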

The examples above load a model by its name on the Hub; you can also download a model to a local directory first and load it from that path:

model_path = "../pretrained_model/distilbert-base-uncased"
model = AutoModel.from_pretrained(model_path)
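
One way to produce such a local copy is with save_pretrained; the directory below reuses the example path above:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# writes the config, weights, and tokenizer files into a local directory
model.save_pretrained("../pretrained_model/distilbert-base-uncased")
tokenizer.save_pretrained("../pretrained_model/distilbert-base-uncased")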

The AutoModelForXXXX classes load a pretrained model with a head for a given task:

# for a sequence classification task
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# for a token classification task
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")

3 Preprocessing

Before training, your data must be preprocessed into the model's input format:

  • Text: use a tokenizer to convert it into a sequence of tokens, represent the tokens as numbers (indices), and assemble them into tensors

3.1 tokenizer

Define a tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

3.1.1 Encoding

sentence = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."

encoded_input = tokenizer(sentence)
# or equivalently
encoded_input = tokenizer.encode_plus(sentence)

print(encoded_input)

Output:

{'input_ids': [101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 117, 1111, 1152, 1132, 11515, 1105, 3613, 1106, 4470, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The difference between encode and encode_plus:

  1. encode returns only input_ids
  2. encode_plus returns all of the encoding information:
    • input_ids: the indices of the tokens in the vocabulary
    • token_type_ids: marks which sentence each token comes from
      • with a sentence pair, tokens from the first sentence are all 0 and tokens from the second are all 1
    • attention_mask: marks which tokens should be attended to in self-attention

sentence = "Hello, my son is laughing."
print(tokenizer.encode(sentence))
print(tokenizer.encode_plus(sentence))
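
To see where the input_ids come from, the tokenization can be broken into its individual steps (a sketch, continuing with the tokenizer and sentence above; the exact subword split depends on the checkpoint's vocabulary):

tokens = tokenizer.tokenize(sentence)          # split into (sub)word tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary indices
print(tokens)
print(ids)
# note: the special tokens [CLS]/[SEP] are added by tokenizer()/encode(), not by tokenize()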

3.1.2 Decoding

tokenizer.decode(encoded_input["input_ids"])

Output:

'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

The tokenizer automatically adds a [CLS] token at the start of the sentence and a [SEP] token at the end.
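
If the special tokens are not wanted in the decoded string, decode accepts a skip_special_tokens flag:

tokenizer.decode(encoded_input["input_ids"], skip_special_tokens=True)
# roughly recovers the original sentence without [CLS] and [SEP]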

3.1.3 Batching

Multiple sentences can be encoded at once:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}

3.1.4 Other parameter settings

Padding

Sentences in the same batch are padded to the length of the longest sentence with the [PAD] token, whose index is 0:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)  # enable padding
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

[PAD] tokens do not need to be attended to, so their attention_mask is 0.
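
The padding token and its index (along with the other special tokens) can be checked directly on the tokenizer:

print(tokenizer.pad_token, tokenizer.pad_token_id)  # '[PAD]' 0
print(tokenizer.cls_token, tokenizer.cls_token_id)  # '[CLS]' 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # '[SEP]' 102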

Truncation

If a sentence exceeds the model's maximum length max_length (for example, more than 512 tokens), it will be truncated:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True)  # enable truncation
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Because the sentences here are all fairly short, nothing is truncated.

You can also set max_length yourself:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True,  # enable truncation
                          max_length=8)     # maximum length
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
			  [101, 1790, 112, 189, 1341, 1119, 3520, 102], 
			  [101, 1327, 1164, 5450, 23434, 136, 102, 0]], 
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
		 		  [0, 0, 0, 0, 0, 0, 0, 0], 
				  [0, 0, 0, 0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
				  [1, 1, 1, 1, 1, 1, 1, 1], 
				  [1, 1, 1, 1, 1, 1, 1, 0]]}

Sentences longer than max_length=8 are truncated; sentences shorter than max_length=8 are padded.
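
Note that padding=True pads to the longest sentence in the batch, while padding="max_length" pads every sentence up to max_length; the latter strategy is used in the fine-tuning section below. A quick sketch:

encoded_input = tokenizer(batch_sentences,
                          padding="max_length",  # pad every sentence to max_length
                          truncation=True,
                          max_length=12)
print([len(ids) for ids in encoded_input["input_ids"]])  # every sequence now has length 12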

Build tensors

Setting return_tensors="pt" returns PyTorch tensors instead of the plain Python lists shown above:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True,      # enable truncation
                          max_length=8,         # maximum length
                          return_tensors="pt")  # return PyTorch tensors
print(encoded_input)

Output:

{'input_ids': 
tensor([[  101,  1252,  1184,  1164,  1248,  6462,   136,   102],
        [  101,  1790,   112,   189,  1341,  1119,  3520,   102],
        [  101,  1327,  1164,  5450, 23434,   136,   102,     0]]), 'token_type_ids': 
tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': 
tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0]])}

4 Fine-tuning a pretrained model

Continuing to train a pretrained model for a specific task is called fine-tuning.

There are two ways to do this:

  1. Train with the Transformers Trainer
  2. Train with native PyTorch

This section also shares the author's Jupyter notebook; the two links below are identical, pick either:

  1. Baidu Pan link: https://pan.baidu.com/s/1KcVR8CKBbHIWVvsXLt3lyw (password: lk6u)
  2. GitHub link: https://github.com/iteapoy/NLP-tutorial/blob/main/transformer_tutorial.ipynb

4.1 Preparing the dataset

Load the built-in yelp_review_full dataset:

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset

Output:

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

There are two splits, train and test, with 650,000 and 50,000 examples respectively.

The features are label and text.

dataset["train"][100]

Output:

{'label': 0, 
'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}

Preprocess the text with the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets

Output:

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

The columns input_ids, token_type_ids, and attention_mask have been added.

tokenized_datasets["train"][100]

Output:

{'label': 0, 
'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!', 
'input_ids': [101, 1422, 11471, 1111, 9092, 1116, 1132, 189, 6034, 1344, 119, 1252, 1111, 1141, 1106, 1253, 8693, 1177, 14449, 1193, 119, 119, 119, 1115, 2274, 1380, 1957, 106, 165, 183, 1942, 4638, 5948, 2852, 1261, 1139, 2053, 112, 188, 1546, 117, 1173, 13796, 5794, 1143, 119, 146, 1125, 1106, 2049, 1991, 1107, 1524, 1104, 170, 5948, 2852, 1150, 1533, 1117, 8077, 1106, 3074, 1113, 1103, 1825, 139, 2036, 3048, 11607, 2137, 1143, 119, 146, 3932, 1166, 1421, 1904, 1111, 170, 23275, 1546, 1115, 1529, 11228, 1141, 5102, 112, 188, 7696, 119, 1258, 2903, 1160, 1234, 1150, 2802, 1170, 1143, 1129, 3541, 1147, 2094, 117, 146, 1455, 1187, 2317, 1108, 119, 1109, 2618, 1408, 13732, 1120, 1103, 5948, 11528, 1111, 165, 107, 2688, 1228, 1147, 3791, 165, 107, 1165, 1152, 1238, 112, 189, 1138, 1147, 2094, 119, 1252, 4534, 5948, 2852, 1108, 5456, 1485, 1343, 7451, 117, 1105, 1103, 2618, 1108, 1103, 1141, 2688, 2094, 1106, 5793, 1105, 8650, 1103, 8190, 119, 165, 183, 1942, 4638, 2618, 1108, 14708, 1165, 2368, 1143, 1139, 1546, 119, 1153, 1238, 112, 189, 1294, 1612, 1115, 146, 1125, 1917, 21748, 150, 3663, 155, 8231, 27514, 2101, 1942, 117, 1105, 1309, 1256, 1125, 1103, 1260, 2093, 7232, 1106, 12529, 1115, 146, 1464, 146, 1108, 2033, 2869, 1555, 119, 165, 183, 2240, 112, 1396, 8527, 1120, 1672, 9092, 1116, 7724, 1111, 1166, 1476, 1201, 119, 146, 112, 1396, 1589, 1120, 1167, 1190, 1141, 2450, 119, 146, 5363, 2213, 1552, 117, 2213, 6601, 1116, 117, 1105, 1103, 7957, 6223, 119, 1252, 146, 1138, 1870, 1106, 1138, 170, 11858, 2541, 1120, 1142, 2984, 119, 1135, 1209, 3118, 170, 1282, 146, 3644, 4895, 1800, 1107, 1139, 1710, 2993, 1106, 3644, 6946, 1121, 1822, 1892, 6656, 119, 5203, 146, 1431, 1301, 1171, 1106, 1103, 5209, 1193, 15069, 1174, 1555, 1104, 1457, 23783, 183, 25775, 1939, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

You can select just a small subset of the dataset:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

4.2 Training

4.2.1 Training with Trainer

The Transformers library provides a ready-made Trainer class.

The labels in the yelp_review_full dataset take five values (star ratings 1-5), so set num_labels=5.

The model is therefore a five-class classifier.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Hyperparameters are collected in a TrainingArguments object; here the defaults are used:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer")

You define the training metric yourself; the Evaluate library provides a simple accuracy implementation:

import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Define a compute_metrics function that converts the model's output logits into predicted classes (by taking the argmax) and passes them to the metric:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

To monitor the metrics during training, set evaluation_strategy="epoch"; the metric score is then reported at the end of every epoch:

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Create a Trainer instance, passing in the model, the training arguments, the training and evaluation sets, and the metric function:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Train the model:

trainer.train()

Output:

TrainOutput(global_step=39, 
			training_loss=1.516761681972406, 
		metrics={
			'train_runtime': 109.446, 
			'train_samples_per_second': 27.411, 
			'train_steps_per_second': 0.356, 
			'total_flos': 789354427392000.0, 
			'train_loss': 1.516761681972406, 
			'epoch': 3.0
			})
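
After training, the same Trainer can evaluate on the eval set and save the fine-tuned model; a sketch (the exact numbers differ from run to run, and the output path is just an example):

metrics = trainer.evaluate()  # runs evaluation on eval_dataset and applies compute_metrics
print(metrics)

trainer.save_model("test_trainer/final_model")  # saves the model weights and config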

4.2.2 Training with native PyTorch

First, process tokenized_datasets a little further: the text column is not needed; keep only input_ids, token_type_ids, attention_mask, and labels.

tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

For this run, again select only a small subset of the dataset:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Wrap the datasets in DataLoaders:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

Load the model:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Define the optimizer and the learning rate scheduler.

Use the AdamW optimizer:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", 
    optimizer=optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_training_steps
)

If a GPU is available, move the model onto it:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

Training:

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Evaluation:

import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Output:

{'accuracy': 0.587}
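
To keep the fine-tuned weights from the native PyTorch loop, the model and tokenizer can be saved with save_pretrained and reloaded later; the directory name below is just an example:

save_dir = "./finetuned-bert-yelp"  # example path
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# reload later for inference
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)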

5 Distributed acceleration

On a machine with multiple GPUs, or across several multi-GPU machines, training can be accelerated with the accelerate library.

Install the library:

pip install accelerate

Create an Accelerator object:

from accelerate import Accelerator

accelerator = Accelerator()

Wrap the dataloaders, model, and optimizer from before with accelerator.prepare:

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

Replace loss.backward() in the training loop with accelerator.backward(loss):

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # batch = {k: v.to(device) for k, v in batch.items()}  # deleted: accelerator.prepare() handles device placement
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss) # !!! here

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Launching the training

From a script

First create a configuration file:

accelerate config

Then launch the accelerated training:

accelerate launch train.py

From a notebook

If you are working directly in a notebook, first put all of the training-related code into a single training_function, then call notebook_launcher:

from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=8)  # number of processes (typically one per GPU)

Note: this comes with fairly strict requirements:

  1. Accelerator() must not be instantiated outside training_function
  2. No CUDA function may be run outside training_function, including torch.cuda.is_available()

Complete accelerated training code for a notebook:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from torch.utils.data import DataLoader
from accelerate import Accelerator
from torch.optim import AdamW
from tqdm.auto import tqdm

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

def training_function():
    accelerator = Accelerator()

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=5e-5)


    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

    train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
    eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

    train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        name="linear", 
        optimizer=optimizer, 
        num_warmup_steps=0, 
        num_training_steps=num_training_steps
    )

    progress_bar = tqdm(range(num_training_steps))

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=8)

References

  1. The Tutorials section of the official documentation
  2. "Transformers 库的基本使用" (a Chinese article on basic usage of the Transformers library)