Getting Started with the Transformers Library for NLP Tasks

0 The library

The relevant resources are listed below:

  • Library GitHub repository: https://github.com/huggingface/transformers
  • Official documentation: https://huggingface.co/docs/transformers/index
  • Pretrained model downloads: https://huggingface.co/models

Both PyTorch and TensorFlow are supported, but PyTorch is recommended; all of the code below is based on PyTorch.

Libraries to install:

  • pytorch
  • transformers
  • datasets
  • evaluate

1 pipeline

1.1 Introduction

Existing models and checkpoints can be used directly for a given task, such as sentiment classification, text generation, named entity recognition, or question answering.

Supported tasks

Task | Description | Modality | Pipeline identifier
--- | --- | --- | ---
Text classification | assign a label to a given sequence of text | NLP | pipeline(task="sentiment-analysis")
Text generation | generate text that follows a given prompt | NLP | pipeline(task="text-generation")
Named entity recognition | assign a label to each token in a sequence (people, organization, location, etc.) | NLP | pipeline(task="ner")
Question answering | extract an answer from the text given some context and a question | NLP | pipeline(task="question-answering")
Fill-mask | predict the correct masked token in a sequence | NLP | pipeline(task="fill-mask")
Summarization | generate a summary of a sequence of text or document | NLP | pipeline(task="summarization")
Translation | translate text from one language into another | NLP | pipeline(task="translation")
Image classification | assign a label to an image | Computer vision | pipeline(task="image-classification")
Image segmentation | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")
Object detection | predict the bounding boxes and classes of objects in an image | Computer vision | pipeline(task="object-detection")
Audio classification | assign a label to an audio file | Audio | pipeline(task="audio-classification")
Automatic speech recognition | extract speech from an audio file into text | Audio | pipeline(task="automatic-speech-recognition")
Visual question answering | given an image and a question, correctly answer a question about the image | Multimodal | pipeline(task="vqa")
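
The other tasks in the table are used in exactly the same way; only the task identifier changes. A minimal sketch (the example sentences here are made up, and the library downloads a default checkpoint for each task):

from transformers import pipeline

# named entity recognition: labels tokens as person, organization, location, ...
ner = pipeline(task="ner")
print(ner("Hugging Face is based in New York City."))

# fill-mask: predicts the masked token; the mask token string depends on the default checkpoint
unmasker = pipeline(task="fill-mask")
print(unmasker(f"Paris is the capital of {unmasker.tokenizer.mask_token}."))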

1.2 Using a pipeline for a task

1.2.1 Sentiment analysis

Default model
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

Test the sentiment of the sentence "We are very happy to show you the 🤗 Transformers library.":

classifier("We are very happy to show you the 🤗 Transformers library.")

Output:

[{'label': 'POSITIVE', 'score': 0.9998}]

Test the sentiment of a batch of sentences:

results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Output:

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309

Custom tokenizer and model

See Section 2, "Loading pretrained models", for details.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# the model will be downloaded and cached
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Test the sentiment of the sentence "We are very happy to show you the 🤗 Transformers library.":

classifier("We are very happy to show you the 🤗 Transformers library.")

Output:

[{'label': '5 stars', 'score': 0.772534966468811}]

Test the sentiment of a batch of sentences:

results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Output:

label: 5 stars, with score: 0.7725
label: 5 stars, with score: 0.2365

1.3 Question answering (QA)

Given a context passage, ask the model questions about it:

from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

The model answers the questions below.

Question: question="What is extractive question answering?"

result = question_answerer(question="What is extractive question answering?", context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)},  start: {result['start']},  end: {result['end']}")

Output:

Answer: 'the task of extracting an answer from a text given a question', score: 0.6177,  start: 34,  end: 95

Question: question="What is a good example of a question answering dataset?"

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)

print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Output:

Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160

2 Loading pretrained models

BERT refers to the model architecture; model_name refers to a set of pretrained weights (a checkpoint) loaded into that architecture.

from transformers import BertTokenizer
from transformers import BertModel
from transformers import BertConfig

  
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
config = BertConfig.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

Normally, the checkpoint is downloaded and cached locally. To specify where models are cached, set:

import os
os.environ['TRANSFORMERS_CACHE'] = './cache'  # cache directory
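
Note that this environment variable is normally read when transformers is first imported, so set it before the import. Alternatively, the cache location can be passed per call through the cache_dir argument of from_pretrained; a small sketch:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased", cache_dir="./cache")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", cache_dir="./cache")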

The .from_pretrained() method loads a pretrained model, so there is no need to train from scratch.

The library currently supports a large number of model architectures; a small excerpt:

albert — AlbertConfig (ALBERT model)
bart — BartConfig (BART model)
beit — BeitConfig (BEiT model)
bert — BertConfig (BERT model)
bert-generation — BertGenerationConfig (Bert Generation model)
big_bird — BigBirdConfig (BigBird model)
bigbird_pegasus — BigBirdPegasusConfig (BigBird-Pegasus model)
blenderbot — BlenderbotConfig (Blenderbot model)
blenderbot-small — BlenderbotSmallConfig (BlenderbotSmall model)
bloom — BloomConfig (BLOOM model)
camembert — CamembertConfig (CamemBERT model)
canine — CanineConfig (CANINE model)
clip — CLIPConfig (CLIP model)
codegen — CodeGenConfig (CodeGen model)
conditional_detr — ConditionalDetrConfig (Conditional DETR model)
convbert — ConvBertConfig (ConvBERT model)
convnext — ConvNextConfig (ConvNeXT model)
...

The Auto classes infer the model architecture automatically from the checkpoint you provide.
The most commonly used ones are AutoTokenizer, AutoModel, and AutoConfig.

from transformers import AutoTokenizer, AutoModel, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
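
The config object holds the architecture hyperparameters. As an illustrative sketch, it can be inspected, or used to build a model with the same architecture but randomly initialized weights (no pretraining):

from transformers import AutoConfig, BertModel

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_hidden_layers)  # 768 12 for bert-base

model = BertModel(config)  # same architecture, but randomly initialized weights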

The examples above load a model by its name on the Hub; you can also download a model to a local directory first and load it from that path:

model_path = "../pretrained_model/distilbert-base-uncased"
model = AutoModel.from_pretrained(model_path)
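
One way to produce such a local copy is with save_pretrained; the directory below reuses the example path above:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# writes the config, weights, and tokenizer files into a local directory
model.save_pretrained("../pretrained_model/distilbert-base-uncased")
tokenizer.save_pretrained("../pretrained_model/distilbert-base-uncased")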

The AutoModelForXXXX classes load a pretrained model with a head for a given task:

# for a sequence classification task
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# for a token classification task
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")

3 Preprocessing

Before training, your data must be preprocessed into the model's input format:

  • Text: use a tokenizer to convert it into a sequence of tokens, represent the tokens as numbers (indices), and assemble them into tensors

3.1 tokenizer

Define a tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

3.1.1 Encoding

sentence = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."

encoded_input = tokenizer(sentence)
# or equivalently
encoded_input = tokenizer.encode_plus(sentence)

print(encoded_input)

Output:

{'input_ids': [101, 2091, 1136, 1143, 13002, 1107, 1103, 5707, 1104, 16678, 1116, 117, 1111, 1152, 1132, 11515, 1105, 3613, 1106, 4470, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The difference between encode and encode_plus:

  1. encode returns only input_ids
  2. encode_plus returns all of the encoding information:
    • input_ids: the indices of the tokens in the vocabulary
    • token_type_ids: marks which sentence each token comes from
      • with a sentence pair, tokens from the first sentence are all 0 and tokens from the second are all 1
    • attention_mask: marks which tokens should be attended to in self-attention

sentence = "Hello, my son is laughing."
print(tokenizer.encode(sentence))
print(tokenizer.encode_plus(sentence))
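
To see where the input_ids come from, the tokenization can be broken into its individual steps (a sketch, continuing with the tokenizer and sentence above; the exact subword split depends on the checkpoint's vocabulary):

tokens = tokenizer.tokenize(sentence)          # split into (sub)word tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary indices
print(tokens)
print(ids)
# note: the special tokens [CLS]/[SEP] are added by tokenizer()/encode(), not by tokenize()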

3.1.2 Decoding

tokenizer.decode(encoded_input["input_ids"])

Output:

'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

The tokenizer automatically adds a [CLS] token at the start of the sentence and a [SEP] token at the end.
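
If the special tokens are not wanted in the decoded string, decode accepts a skip_special_tokens flag:

tokenizer.decode(encoded_input["input_ids"], skip_special_tokens=True)
# roughly recovers the original sentence without [CLS] and [SEP]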

3.1.3 Batching

Multiple sentences can be encoded at once:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1]]}

3.1.4 Other parameter settings

Padding

Sentences in the same batch are padded to the length of the longest sentence with the [PAD] token, whose index is 0:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)  # enable padding
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

[PAD] tokens do not need to be attended to, so their attention_mask is 0.
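
The padding token and its index (along with the other special tokens) can be checked directly on the tokenizer:

print(tokenizer.pad_token, tokenizer.pad_token_id)  # '[PAD]' 0
print(tokenizer.cls_token, tokenizer.cls_token_id)  # '[CLS]' 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # '[SEP]' 102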

Truncation

If a sentence exceeds the model's maximum length max_length (for example, more than 512 tokens), it will be truncated:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True)  # enable truncation
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], 
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], 
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

Because the sentences here are all fairly short, nothing is truncated.

You can also set max_length yourself:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True,  # enable truncation
                          max_length=8)     # maximum length
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], 
			  [101, 1790, 112, 189, 1341, 1119, 3520, 102], 
			  [101, 1327, 1164, 5450, 23434, 136, 102, 0]], 
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], 
		 		  [0, 0, 0, 0, 0, 0, 0, 0], 
				  [0, 0, 0, 0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
				  [1, 1, 1, 1, 1, 1, 1, 1], 
				  [1, 1, 1, 1, 1, 1, 1, 0]]}

Sentences longer than max_length=8 are truncated; sentences shorter than max_length=8 are padded.
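
Note that padding=True pads to the longest sentence in the batch, while padding="max_length" pads every sentence up to max_length; the latter strategy is used in the fine-tuning section below. A quick sketch:

encoded_input = tokenizer(batch_sentences,
                          padding="max_length",  # pad every sentence to max_length
                          truncation=True,
                          max_length=12)
print([len(ids) for ids in encoded_input["input_ids"]])  # every sequence now has length 12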

Build tensors

Setting return_tensors="pt" returns PyTorch tensors instead of the plain Python lists shown above:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]

encoded_input = tokenizer(batch_sentences,
                          padding=True,
                          truncation=True,      # enable truncation
                          max_length=8,         # maximum length
                          return_tensors="pt")  # return PyTorch tensors
print(encoded_input)

Output:

{'input_ids': 
tensor([[  101,  1252,  1184,  1164,  1248,  6462,   136,   102],
        [  101,  1790,   112,   189,  1341,  1119,  3520,   102],
        [  101,  1327,  1164,  5450, 23434,   136,   102,     0]]), 'token_type_ids': 
tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': 
tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0]])}

4 Fine-tuning a pretrained model

Continuing to train a pretrained model for a specific task is called fine-tuning.

There are two ways to do this:

  1. Train with the Transformers Trainer
  2. Train with native PyTorch

This section also shares the author's Jupyter notebook; the two links below are identical, pick either:

  1. Baidu Pan link: https://pan.baidu.com/s/1KcVR8CKBbHIWVvsXLt3lyw (password: lk6u)
  2. GitHub link: https://github.com/iteapoy/NLP-tutorial/blob/main/transformer_tutorial.ipynb

4.1 Preparing the dataset

Load the built-in yelp_review_full dataset:

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset

Output:

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

There are two splits, train and test, with 650,000 and 50,000 examples respectively.

The features are label and text.

dataset["train"][100]

Output:

{'label': 0, 
'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}

Preprocess the text with the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets

Output:

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

The columns input_ids, token_type_ids, and attention_mask have been added.

tokenized_datasets["train"][100]

Output:

{'label': 0, 
'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!', 
'input_ids': [101, 1422, 11471, 1111, 9092, 1116, 1132, 189, 6034, 1344, 119, 1252, 1111, 1141, 1106, 1253, 8693, 1177, 14449, 1193, 119, 119, 119, 1115, 2274, 1380, 1957, 106, 165, 183, 1942, 4638, 5948, 2852, 1261, 1139, 2053, 112, 188, 1546, 117, 1173, 13796, 5794, 1143, 119, 146, 1125, 1106, 2049, 1991, 1107, 1524, 1104, 170, 5948, 2852, 1150, 1533, 1117, 8077, 1106, 3074, 1113, 1103, 1825, 139, 2036, 3048, 11607, 2137, 1143, 119, 146, 3932, 1166, 1421, 1904, 1111, 170, 23275, 1546, 1115, 1529, 11228, 1141, 5102, 112, 188, 7696, 119, 1258, 2903, 1160, 1234, 1150, 2802, 1170, 1143, 1129, 3541, 1147, 2094, 117, 146, 1455, 1187, 2317, 1108, 119, 1109, 2618, 1408, 13732, 1120, 1103, 5948, 11528, 1111, 165, 107, 2688, 1228, 1147, 3791, 165, 107, 1165, 1152, 1238, 112, 189, 1138, 1147, 2094, 119, 1252, 4534, 5948, 2852, 1108, 5456, 1485, 1343, 7451, 117, 1105, 1103, 2618, 1108, 1103, 1141, 2688, 2094, 1106, 5793, 1105, 8650, 1103, 8190, 119, 165, 183, 1942, 4638, 2618, 1108, 14708, 1165, 2368, 1143, 1139, 1546, 119, 1153, 1238, 112, 189, 1294, 1612, 1115, 146, 1125, 1917, 21748, 150, 3663, 155, 8231, 27514, 2101, 1942, 117, 1105, 1309, 1256, 1125, 1103, 1260, 2093, 7232, 1106, 12529, 1115, 146, 1464, 146, 1108, 2033, 2869, 1555, 119, 165, 183, 2240, 112, 1396, 8527, 1120, 1672, 9092, 1116, 7724, 1111, 1166, 1476, 1201, 119, 146, 112, 1396, 1589, 1120, 1167, 1190, 1141, 2450, 119, 146, 5363, 2213, 1552, 117, 2213, 6601, 1116, 117, 1105, 1103, 7957, 6223, 119, 1252, 146, 1138, 1870, 1106, 1138, 170, 11858, 2541, 1120, 1142, 2984, 119, 1135, 1209, 3118, 170, 1282, 146, 3644, 4895, 1800, 1107, 1139, 1710, 2993, 1106, 3644, 6946, 1121, 1822, 1892, 6656, 119, 5203, 146, 1431, 1301, 1171, 1106, 1103, 5209, 1193, 15069, 1174, 1555, 1104, 1457, 23783, 183, 25775, 1939, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

You can select just a small subset of the dataset:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

4.2 Training

4.2.1 Training with Trainer

The Transformers library provides a ready-made Trainer class.

The labels in the yelp_review_full dataset take five values (star ratings 1-5), so set num_labels=5.

The model is therefore a five-class classifier.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Hyperparameters are collected in a TrainingArguments object; here the defaults are used:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer")

You define the training metric yourself; the Evaluate library provides a simple accuracy implementation:

import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Define a compute_metrics function that converts the model's output logits into predicted classes (by taking the argmax) and passes them to the metric:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

To monitor the metrics during training, set evaluation_strategy="epoch"; the metric score is then reported at the end of every epoch:

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Create a Trainer instance, passing in the model, the training arguments, the training and evaluation sets, and the metric function:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Train the model:

trainer.train()

Output:

TrainOutput(global_step=39, 
			training_loss=1.516761681972406, 
		metrics={
			'train_runtime': 109.446, 
			'train_samples_per_second': 27.411, 
			'train_steps_per_second': 0.356, 
			'total_flos': 789354427392000.0, 
			'train_loss': 1.516761681972406, 
			'epoch': 3.0
			})
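
After training, the same Trainer can evaluate on the eval set and save the fine-tuned model; a sketch (the exact numbers differ from run to run, and the output path is just an example):

metrics = trainer.evaluate()  # runs evaluation on eval_dataset and applies compute_metrics
print(metrics)

trainer.save_model("test_trainer/final_model")  # saves the model weights and config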

4.2.2 Training with native PyTorch

First, process tokenized_datasets a little further: the text column is not needed; keep only input_ids, token_type_ids, attention_mask, and labels.

tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

For this run, again select only a small subset of the dataset:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Wrap the datasets in DataLoaders:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

Load the model:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Define the optimizer and the learning rate scheduler.

Use the AdamW optimizer:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", 
    optimizer=optimizer, 
    num_warmup_steps=0, 
    num_training_steps=num_training_steps
)

If a GPU is available, move the model onto it:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

Training:

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Evaluation:

import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Output:

{'accuracy': 0.587}
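
To keep the fine-tuned weights from the native PyTorch loop, the model and tokenizer can be saved with save_pretrained and reloaded later; the directory name below is just an example:

save_dir = "./finetuned-bert-yelp"  # example path
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# reload later for inference
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)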

5 Distributed acceleration

On a machine with multiple GPUs, or across several multi-GPU machines, training can be accelerated with the accelerate library.

Install the library:

pip install accelerate

Create an Accelerator object:

from accelerate import Accelerator

accelerator = Accelerator()

Wrap the dataloaders, model, and optimizer from before with accelerator.prepare:

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

Replace loss.backward() in the training loop with accelerator.backward(loss):

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # batch = {k: v.to(device) for k, v in batch.items()}  # deleted: accelerator.prepare() handles device placement
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss) # !!! here

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Launching the training

From a script

First create a configuration file:

accelerate config

Then launch the accelerated training:

accelerate launch train.py

From a notebook

If you are working directly in a notebook, first put all of the training-related code into a single training_function, then call notebook_launcher:

from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=8)  # number of processes (typically one per GPU)

Note: this comes with fairly strict requirements:

  1. Accelerator() must not be instantiated outside training_function
  2. No CUDA function may be run outside training_function, including torch.cuda.is_available()

Complete accelerated training code for a notebook:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler
from torch.utils.data import DataLoader
from accelerate import Accelerator
from torch.optim import AdamW
from tqdm.auto import tqdm

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

def training_function():
    accelerator = Accelerator()

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=5e-5)


    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

    train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
    eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

    train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        name="linear", 
        optimizer=optimizer, 
        num_warmup_steps=0, 
        num_training_steps=num_training_steps
    )

    progress_bar = tqdm(range(num_training_steps))

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=8)

References

  1. The Tutorials section of the official documentation
  2. "Transformers 库的基本使用" (a Chinese article on basic usage of the Transformers library)