Next, we will fine-tune GPT-2 on dramatic text, using Shakespeare's play Romeo and Juliet as the training sample.
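The steps below assume that a pretrained GPT-2 model and its tokenizer have already been loaded in an earlier section. As a minimal sketch of such a setup, assuming the Hugging Face transformers library is used (the checkpoint name and the return_dict=False flag are assumptions made here so that the model returns a (loss, logits, ...) tuple, matching how the training loop below unpacks its outputs), the loading code might look like this:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# load the pretrained tokenizer and the language-modeling head
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# return_dict=False makes the model return a plain tuple (loss, logits, ...)
# instead of a ModelOutput object
model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)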
5.1 Reading the file
Read the text file containing the Shakespeare play.
with open('./data/romeo_and_juliet.txt', 'r') as f:
    dataset = f.read()  # read the whole play as a single string
len(dataset)
Output:
138150
The file contains more than 130,000 characters.
5.2 Tokenizing the file
Tokenize the file and split the resulting token sequence into segments of 512 tokens each.
indexed_text = tokenizer.encode(dataset)
del dataset

dataset_cut = []
for i in range(len(indexed_text)//512):
    # split the token sequence into segments of 512 tokens each
    dataset_cut.append(indexed_text[i*512:i*512+512])
del indexed_text

dataset_tensor = torch.tensor(dataset_cut)
dataset_tensor.shape
5.3 Converting the dataset into an iterable object
Use DataLoader to wrap the dataset in an iterable object that yields batches.
from torch.utils.data import DataLoader, TensorDataset

# build the dataset and the data loader, with a batch size of 1
train_set = TensorDataset(dataset_tensor,
                          dataset_tensor)  # the labels are identical to the input samples
train_loader = DataLoader(dataset=train_set,
                          batch_size=1,
                          shuffle=False)
train_loader
5.4 Training the model
Train the model for 30 epochs with the Adam optimizer and a learning rate of 1e-5.
from torch import nn
import time

pre = time.time()
epoch = 30  # train for 30 epochs
# model.to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # define the optimizer

for i in range(epoch):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        # data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # with labels provided, the model returns (loss, logits, ...)
        loss, logits, _ = model(data, labels=target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        if batch_idx == len(train_loader)-1:
            # print the average loss at the end of each epoch
            print('average loss:', total_loss/len(train_loader))
print('training time:', time.time()-pre)
5.5 Generating text with the model
Use the fine-tuned GPT-2 model to generate a new passage of text from a given prompt.
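The generation loop below relies on a select_top_k helper defined earlier in the tutorial. For reference, a minimal sketch of such a top-k sampling helper (an assumption about its behavior, not the original definition) could be:

import random
import torch

def select_top_k(predictions, k=10):
    # predictions: logits of shape [batch, seq_len, vocab_size];
    # keep the k highest-scoring token ids at the last position
    # and pick one of them uniformly at random
    top_k_ids = torch.topk(predictions[0, -1, :], k).indices
    return random.choice(top_k_ids.tolist())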
text = "From fairest creatures we desire" # 这里也可以输入不同的英文文本
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])
model.eval()
total_predicted_text = text
# 使训练后的模型进行 500 次训练
for _ in range(500):
tokens_tensor = tokens_tensor
with torch.no_grad():
outputs = model(tokens_tensor)
predictions = outputs[0]
predicted_index = select_top_k(predictions, k=10)
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
total_predicted_text += tokenizer.decode(predicted_index)
if '<|endoftext|>' in total_predicted_text:
# 如果出现文本结束标志,就结束文本生成
break
indexed_tokens += [predicted_index]
if len(indexed_tokens) > 1023:
# 模型最长输入长度为1024个标识符,如果长度过长则截断。
indexed_tokens = indexed_tokens[-1023:]
tokens_tensor = torch.tensor([indexed_tokens])
print(total_predicted_text)
Output (excerpt):
From fairest creatures we desire to be friends
Our lives may not stand but to be separated; we may call
This holy division death: this holy marriage must cease in our death. Thus saying farewell we besan our dead, that we should be as friends in death. Thus says he;
Death may not withdraw his holy order: 'tis but our common grief
In that part where thou wossiest death, and where death withdraw'd the blessings that we should Dainst us by parting ways:'So sayeth he our hearts will live on this vow for eon life. This last part of our vow may be quiesolved and our marmarry be as dear a vow ours are; we do wish our death, but God give it not; let death remove this part which
The generated text shows that the model has picked up the stylistic features of a play script. On closer reading, however, it lacks logic and coherence, because the number of training epochs and the amount of training data are still insufficient. With more data and longer training, the model should perform better.
The data and code are available at: https://github.com/Wumg3000/feiguyunai/tree/main/Embedding-Transformer