I hadn't really used GPT-2 in Transformers before. Today I tried training one myself and got: ValueError: You are attempting to pad samples but the tokenizer you are using (GPT2Tokenizer) does not have one. A quick search shows I'm not the only one hitting this, for example here:
https://github.com/huggingface/transformers/issues/4122
Based on the discussion there, the fix turns out to be simple; someone points it out here:
https://stackoverflow.com/questions/63377135/training-gpt2-and-reformer-from-scratch
"You can’t use the LineByLineTextDataset class with GPT2 as mentioned here. Use TextDataset instead."
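As an aside, another workaround that comes up in threads like the issue above is to give GPT2Tokenizer an explicit pad token, typically by reusing the eos token. A minimal sketch of that alternative (I went with TextDataset instead and haven't tested this against my exact setup):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token by default; reusing the EOS token is a common convention
tokenizer.pad_token = tokenizer.eos_token
print(tokenizer.pad_token)  # <|endoftext|>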
So swapping the dataset class is all it takes. I had been following the RoBERTa pretraining tutorial:
https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
I swapped the model for GPT-2 and, as suggested above, replaced LineByLineTextDataset with TextDataset. Some write-ups online also cover this in good detail, e.g.:
https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface
Finally, here is my code; you can pretrain your own GPT-2 model directly on top of it:
import torch
import os
import shutil

# Sanity check: make sure CUDA is visible before training
print(torch.cuda.is_available())
############################################################
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
repo_name = "GPT-Corpus"
model_dir = repo_name + '-GPTModel'

# Start from a clean output directory
if os.path.exists(model_dir):
    shutil.rmtree(model_dir)
os.mkdir(model_dir)

# Collect every .txt file under the corpus folder
paths = [str(x) for x in Path(repo_name).glob("**/*.txt")]
# Train a byte-level BPE tokenizer from scratch on the corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=15_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
tokenizer.save_model(model_dir)  # writes vocab.json and merges.txt
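A quick sanity check you can run at this point (not part of the original script; the sample sentence is arbitrary):

# Inspect what the freshly trained tokenizer does with a sample sentence
enc = tokenizer.encode("Hello, this is a test sentence.")
print(enc.tokens)
print(enc.ids)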
############################################################
from transformers import GPT2Config

# A deliberately tiny GPT-2: the corpus is small, so the model is scaled down too
config = GPT2Config(
    vocab_size=15_000,
    n_positions=512,
    n_head=2,
    n_layer=2,
    n_embd=256,
)
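As a back-of-the-envelope check on how big this configuration is (my own arithmetic, not from any tutorial):

# Rough parameter count: embeddings plus ~12*d^2 weights per transformer block,
# ignoring biases and LayerNorms; the LM head is tied to the token embeddings
d, V, P, n = 256, 15_000, 512, 2
approx = V * d + P * d + n * 12 * d * d
print(f"~{approx / 1e6:.1f}M parameters")  # about 5.5M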
############################################################
from transformers import GPT2Tokenizer

# Reload the trained tokenizer in GPT2Tokenizer form
tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
############################################################
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel(config=config)
print(model.num_parameters())  # should be close to the estimate above
############################################################
# The original (failing) version:
# from transformers import LineByLineTextDataset
# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path=repo_name + "/GPT_Corpus.txt",
#     block_size=128,
# )
from transformers import TextDataset

# TextDataset concatenates the whole file and slices it into block_size chunks,
# so the collator never has to pad; it's the variable-length line-by-line
# examples that trigger the pad-token error for GPT-2
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="GPT_Corpus_whole.txt",
    block_size=128,
)
############################################################
from transformers import DataCollatorForLanguageModeling

# mlm=False means causal language modeling: the collator copies input_ids into labels
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)
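If you're curious what the collator actually produces, a small inspection sketch (not in the original; the two indices are arbitrary):

# Collate two blocks by hand: for causal LM the labels are a copy of input_ids
batch = data_collator([dataset[0], dataset[1]])
print(batch["input_ids"].shape)  # torch.Size([2, 128])
print(batch["labels"].shape)     # torch.Size([2, 128])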
############################################################
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./" + model_dir,
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=64,  # per_gpu_train_batch_size is deprecated
    save_steps=20_000,
    save_total_limit=2,
    logging_steps=100,
    prediction_loss_only=True,  # newer versions take this here, not in Trainer()
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./"+model_dir)
A few things to note: this example uses a very small corpus, so the settings above were scaled down accordingly. Also, as mentioned earlier, the corpus files in the folder the tokenizer is trained on have one sentence per line, while the file fed to TextDataset has the whole corpus on a single line. I'm not sure whether that makes any difference to the model; in any case, training runs fine this way.
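For what it's worth, my understanding (a simplified sketch, not taken from the library source) is that the line structure shouldn't matter much, because TextDataset tokenizes the file as one long stream and cuts it into fixed 128-token blocks regardless of newlines:

# Conceptually what TextDataset does with the corpus (simplified):
token_ids = tokenizer.encode(open("GPT_Corpus_whole.txt", encoding="utf-8").read())
blocks = [token_ids[i:i + 128] for i in range(0, len(token_ids) - 127, 128)]
# every block is exactly 128 tokens long, so the collator never needs to pad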
Using the trained model is also simple:
from transformers import pipeline

pred = pipeline(
    "text-generation",
    model="./GPT-Corpus-GPTModel",
    tokenizer="./GPT-Corpus-GPTModel"
)
result = pred('%input string%')[0]['generated_text']
print(result)
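Equivalently, you can load the model and tokenizer yourself and call generate() directly; a sketch with arbitrary sampling settings:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("./GPT-Corpus-GPTModel")
tokenizer = GPT2Tokenizer.from_pretrained("./GPT-Corpus-GPTModel")

input_ids = tokenizer.encode("%input string%", return_tensors="pt")
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))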
That's all for this quick note.