An ACL 2020 paper, Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, proposed the idea of domain-adaptive pretraining. In short: the usual practice is to fine-tune a pretrained model directly on the downstream task, whereas domain-adaptive pretraining takes the pretrained model, runs one more round of pretraining on the task's own data, and then fine-tunes the resulting model on the downstream task. This extra pretraining step simply lets the pretrained model adapt better to the current dataset.
- Core code
Domain pretraining means first running MLM on your own data (like BERT's MLM task: mask some tokens, then predict them). The transformers library ships ready-made components for this, which can be used directly, as in the code below:
```python
# Continue MLM pretraining of the model on domain data
import math

from transformers import (
    BertTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    NezhaForMaskedLM,
    Trainer,
    TrainingArguments,
)

model_name = "pretrained/Nezha"
# training set
train_file = "data/Sentencetrain.txt"
# validation set
eval_file = "data/Sentencevalid.txt"
# maximum number of tokens taken from each line
max_seq_length = 128
# where the adapted model is saved
out_model_path = "pretrain"
train_epochs = 50
batch_size = 64

# We are not training from scratch but continuing pretraining on new data,
# so the model is loaded from the checkpoint instead of being built from a config.
model = NezhaForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# LineByLineTextDataset treats every line of the file as one sample.
# (It is deprecated in recent transformers releases in favour of the datasets
# library, but still works for this purpose.)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=train_file,
    block_size=max_seq_length,
)
eval_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=eval_file,
    block_size=max_seq_length,
)

# Randomly mask 15% of tokens, as in BERT's MLM objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    output_dir=out_model_path,
    overwrite_output_dir=True,
    num_train_epochs=train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_strategy="epoch",
    save_total_limit=4,
    learning_rate=1e-4,
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model(out_model_path)

# Perplexity is exp(cross-entropy loss) on the validation set.
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```
Prepare the dataset Sentencetrain.txt in the following form: one sample per line.
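For illustration only (made-up lines), the file is just raw domain sentences, with no labels:

```text
这家酒店的位置很方便,离地铁站很近。
房间隔音效果一般,晚上有点吵。
服务人员态度热情,入住办理很快。
```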
After training, copy vocab.txt from the directory of the originally downloaded pretrained model into the new model directory; the new model can then be loaded directly when fine-tuning on the downstream task (a sketch of both steps follows below). If you need to change any parameters, look up the documentation of these classes.
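A minimal sketch of both steps, reusing the paths from the code above; the sequence-classification head and `num_labels` are assumptions for illustration, not fixed by the method:

```python
import shutil

from transformers import BertTokenizer, NezhaForSequenceClassification

# 1. Copy the vocabulary from the original checkpoint into the new model directory.
shutil.copy("pretrained/Nezha/vocab.txt", "pretrain/vocab.txt")

# 2. The adapted checkpoint now loads like any other pretrained model and can be
#    fine-tuned on the downstream task as usual.
tokenizer = BertTokenizer.from_pretrained("pretrain")
model = NezhaForSequenceClassification.from_pretrained("pretrain", num_labels=2)
```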
PS: I ran this pretraining myself; it improved results on the validation set, but I did not evaluate on the test set.