Transformers:
Developed by Hugging Face, the Transformers library ships a large collection of pretrained models that can be deployed easily through its pipeline API. I ran everything in a virtual environment inside VSCode; you first need to pip install transformers.
A small example:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
res = classifier("Spring is coming to an end.")
print(res)
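By default the sentiment-analysis pipeline picks a checkpoint for you (an English SST-2 model) and downloads it on first use. If you want reproducible behavior you can name the model explicitly; a minimal sketch (the model id below is the commonly used default, so treat it as an assumption):
from transformers import pipeline
# pin an explicit checkpoint instead of relying on the pipeline default
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")  # assumed model id
res = classifier("Spring is coming to an end.")
print(res)  # a list like [{'label': 'POSITIVE' or 'NEGATIVE', 'score': <float>}]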
Fine-tune:
Goal: fine-tune a pretrained model on your own dataset so that it performs better in a specific domain.
Tokenizing the dataset:
from datasets import load_dataset, load_metric
dataset = load_dataset("glue", 'sst2')
metric = load_metric('glue', 'sst2')
The datasets library bundles many commonly used datasets: load_dataset downloads and loads a dataset in one call, and load_metric loads the evaluation metric that goes with it. Here glue stands for General Language Understanding Evaluation, a benchmark for evaluating and comparing natural language understanding (NLU) systems. GLUE covers a variety of tasks such as textual entailment, sentiment analysis, and sentence similarity matching, and is designed to span a broad range of language-understanding abilities. sst2 is the GLUE dataset for sentiment analysis.
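To get a feel for what load_dataset returns, you can print the splits and one sample. The field names shown in the comment (sentence, label, idx) are what GLUE/SST-2 provides, but it is worth confirming them on the version you download:
print(dataset)              # a DatasetDict with train / validation / test splits
print(dataset["train"][0])  # e.g. {'sentence': '...', 'label': 0 or 1, 'idx': ...}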
A quick demonstration of how the metric is used:
import numpy as np
fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)  # pass in randomly generated predictions and reference labels to compute accuracy
A quick aside on NLP tokenization (I knew nothing about NLP before this): tokenization is how raw text is split into tokens. Simple pipelines (for example, classic RNN setups) often just split on whitespace, which performs poorly on words that are not in the vocabulary. Transformer models instead use subword methods such as BPE, which work a bit like word roots in English: by counting how frequently subword units occur, they generalize much better to unseen words, as the small sketch below shows.
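To see subword tokenization in action you can call tokenizer.tokenize directly. Note that bert-base-uncased actually uses WordPiece, a close relative of BPE, and the exact pieces below are illustrative rather than guaranteed:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# a word missing from the vocabulary is split into known subword pieces, marked with "##"
print(tokenizer.tokenize("tokenization handles unseen words"))
# e.g. ['token', '##ization', 'handles', 'unseen', 'words']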
Fine-tuning normally requires writing your own code for model loading, data processing, and so on; transformers unifies and simplifies this workflow.
Tokenization
from transformers import AutoTokenizer
# use the tokenizer to preprocess the dataset
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # other model names can be found on the Hugging Face Hub
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True, max_length=512)  # BERT cannot handle sequences longer than 512 tokens
# apply preprocess_function to the whole dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)
Let's take a look at what the tokenizer actually produces:
tokenizer("Tsinghua University is located in Beijing.")
The result has three parts:
{'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}
For padded positions the attention_mask is set to 0 so those tokens are ignored; the tokenizer handles this for you.
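To see the attention_mask at work, you can tokenize a small batch with padding enabled (a minimal sketch; the exact ids depend on the vocabulary):
batch = tokenizer(
    ["Short sentence.", "A somewhat longer sentence with quite a few more tokens."],
    padding=True,      # pad the shorter sequence up to the longest one in the batch
    truncation=True,
    max_length=512,
)
print(batch["attention_mask"])  # trailing 0s mark the padded positions of the shorter sentence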
Fine-tuning the model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
The target task is sentiment classification with two classes, so num_labels=2; the number of labels determines the output dimension of the final fully connected classification layer.
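You can inspect the classification head that gets stacked on top of BERT. For this particular model class it is exposed as model.classifier; that attribute name is specific to the BERT sequence-classification implementation, so treat it as an assumption if you swap in a different architecture:
print(model.classifier)
# expected output: Linear(in_features=768, out_features=2, bias=True); out_features follows num_labels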
Next, we implement the training with the Hugging Face Trainer class mentioned earlier:
from transformers import TrainingArguments
batch_size=16
args = TrainingArguments(
    "bert-base-uncased-finetuned-sst2",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)
def compute_metrics(eval_pred):
    logits, labels = eval_pred  # logits: [batch_size, num_labels], labels: [batch_size]
    predictions = np.argmax(logits, axis=1)  # take the highest-scoring class as the prediction
    return metric.compute(predictions=predictions, references=labels)
from transformers import Trainer
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
What each argument means:
First positional argument: the name of this training run, which also serves as the output directory for checkpoints
evaluation_strategy="epoch": evaluate on the validation set at the end of every epoch
save_strategy="epoch": save a checkpoint at the end of every epoch
learning_rate=2e-5: learning rate for the optimizer
per_device_train_batch_size=batch_size: per-GPU batch size during training
per_device_eval_batch_size=batch_size: per-GPU batch size during evaluation
num_train_epochs=5: train for 5 epochs
weight_decay=0.01: weight decay used by the optimizer
load_best_model_at_end=True: after training, load the best checkpoint found during training
metric_for_best_model="accuracy": use accuracy to decide which checkpoint is best
Start training!
trainer.train()
At this point the Trainer acts as an end-to-end pipeline.
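After training you can evaluate or run predictions with the same Trainer object. A quick sketch (the exact keys in the returned dict, such as eval_accuracy, depend on compute_metrics and the transformers version):
eval_results = trainer.evaluate()   # runs the fine-tuned model over the validation set
print(eval_results)                 # e.g. contains eval_loss and eval_accuracy
preds = trainer.predict(encoded_dataset["validation"])
print(preds.predictions.shape)      # logits of shape [num_validation_examples, num_labels]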
A side note: when I ran this demo in my own VSCode environment, I hit the following error:
packages/torch/utils/data/dataloader.py", line 183, in __init__
assert prefetch_factor > 0
TypeError: '>' not supported between instances of 'NoneType' and 'int'
From the traceback you can roughly tell this is a problem with prefetch_factor in the DataLoader. After digging through GitHub issues and a CSDN post, I finally found a fix:
""" if num_workers == 0 and prefetch_factor != 2:
raise ValueError('prefetch_factor option could only be specified in multiprocessing.'
'let num_workers > 0 to enable multiprocessing.')
assert prefetch_factor > 0 """
if num_workers > 0:
if prefetch_factor is None:
prefetch_factor = 2 # default value
else:
if prefetch_factor is not None:
raise ValueError('prefetch_factor option could only be specified in multiprocessing.'
'let num_workers > 0 to enable multiprocessing, otherwise set prefetch_factor to None.')
As you can see, the fix is to replace the original snippet in dataloader.py with this new code:
if num_workers > 0:
    if prefetch_factor is None:
        prefetch_factor = 2  # default value
else:
    if prefetch_factor is not None:
        raise ValueError('prefetch_factor option could only be specified in multiprocessing.'
                         'let num_workers > 0 to enable multiprocessing, otherwise set prefetch_factor to None.')
With that change in place, the error is resolved and the demo runs perfectly.
Full code
To wrap up, here is the complete code for this demo:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_dataset, load_metric
import numpy as np

# load the dataset and its matching metric
dataset = load_dataset("glue", 'sst2', cache_dir="./datasets_cache")
metric = load_metric('glue', 'sst2')

# use the tokenizer to preprocess the dataset
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # other model names can be found on the Hugging Face Hub
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True, max_length=512)  # BERT cannot handle sequences longer than 512 tokens

# apply preprocess_function to the whole dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)
print(preprocess_function(dataset['train'][:5]))

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

batch_size = 16
args = TrainingArguments(
    "bert-base-uncased-finetuned-sst2",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # logits: [batch_size, num_labels], labels: [batch_size]
    predictions = np.argmax(logits, axis=1)  # take the highest-scoring class as the prediction
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()
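Once training finishes, you could save the fine-tuned model and reuse it through a pipeline. A rough sketch; the save directory name is my own choice and not part of the original demo:
from transformers import pipeline
save_dir = "./sst2-finetuned-bert"   # hypothetical output directory
trainer.save_model(save_dir)         # save the fine-tuned weights
tokenizer.save_pretrained(save_dir)  # save the tokenizer files alongside them
clf = pipeline("sentiment-analysis", model=save_dir, tokenizer=save_dir)
print(clf("this movie was surprisingly good"))  # labels show up as LABEL_0 / LABEL_1 unless id2label is configured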