Learning to use GPT-2 for text generation (torch + transformers)

GPT-2 is a pre-trained language model released by OpenAI; see the paper "Language Models are Unsupervised Multitask Learners". GPT-2 takes advantage of a unidirectional Transformer to do something the bidirectional Transformer used by BERT cannot: generate the text that follows from the text that came before.
There are plenty of articles covering the theory, so we will not dig into it here. Let's go straight to the code.

Import the required packages

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

Load the tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Encode the input

Encode the given text and convert it into a tensor:

# Encode the prompt into token ids
indexed_tokens = tokenizer.encode("Xiao Ming is a primary school student. He likes playing games")

# Decode back to check the encoding
print(tokenizer.decode(indexed_tokens))

# Add a batch dimension and convert to a tensor
tokens_tensor = torch.tensor([indexed_tokens])

Xiao Ming is a primary school student. He likes playing games
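
The encode call runs the text through GPT-2's BPE tokenizer. If you want to see the sub-word pieces behind those ids, here is a small sketch (an assumption, not part of the original post):

print(tokenizer.tokenize("Xiao Ming is a primary school student. He likes playing games"))
# prints the BPE pieces; a leading 'Ġ' marks a token that begins with a space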

Load the pre-trained model (weights)

model = GPT2LMHeadModel.from_pretrained('gpt2')

Set the model to evaluation mode

model.eval()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move the input tensor and the model to the same device (GPU if available)
tokens_tensor = tokens_tensor.to(device)
model.to(device)

Predict all tokens

with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]  # logits with shape [batch_size, seq_len, vocab_size]

Get the predicted next word

# Take the highest-scoring token at the last position as the predicted next word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print(predicted_text)

As you can see, the next word predicted by GPT-2 is "and":

Xiao Ming is a primary school student. He likes playing games and
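
Greedy decoding keeps only the single most probable token. If you are curious what else the model considered, here is a minimal sketch (not part of the original code) that lists the top-5 candidate next tokens with torch.topk:

top_values, top_indices = torch.topk(predictions[0, -1, :], k=5)
for value, index in zip(top_values.tolist(), top_indices.tolist()):
    print(repr(tokenizer.decode([index])), value)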

Generate a complete sentence

# id of the "." token, used as the stop condition
stopids = tokenizer.convert_tokens_to_ids(["."])[0]
past = None
for i in range(100):
    with torch.no_grad():
        # Reuse the cached key/value states (past) so only the newest token
        # has to be run through the model at each step
        output, past = model(tokens_tensor, past_key_values=past, return_dict=False)

    # Greedy decoding: pick the most probable next token
    token = torch.argmax(output[..., -1, :])

    indexed_tokens += [token.tolist()]

    # Stop once the model emits a period
    if stopids == token.tolist():
        break
    # On the next step, feed only the new token; the rest lives in `past`
    tokens_tensor = token.unsqueeze(0)

sequence = tokenizer.decode(indexed_tokens)

print(sequence)

The generated text is "and playing with his friends.", which together with the original prompt forms a complete sentence:

Xiao Ming is a primary school student. He likes playing games and playing with his friends.
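
As an aside, transformers also provides a higher-level model.generate() method that can perform the same greedy decoding without a manual loop. A minimal sketch, assuming greedy settings; the exact continuation may differ slightly from the loop above:

input_ids = tokenizer.encode(
    "Xiao Ming is a primary school student. He likes playing games",
    return_tensors="pt",
).to(device)
output_ids = model.generate(
    input_ids,
    max_length=50,
    do_sample=False,                      # greedy decoding, like the loop above
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0]))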

Try other sentences

If we add a period to the end of the sentence above, GPT-2 generates a different continuation.
Before: Xiao Ming is a primary school student. He likes playing games
After: Xiao Ming is a primary school student. He likes playing games.
The generated result is:

Xiao Ming is a primary school student. He likes playing games. He is also a member of the team that won the World Cup in 2010.
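
For reference, a minimal sketch (assumed, not from the original post) of re-running greedy decoding on the prompt with the period added; the exact continuation may vary across transformers versions:

new_ids = tokenizer.encode(
    "Xiao Ming is a primary school student. He likes playing games.",
    return_tensors="pt",
).to(device)
with torch.no_grad():
    new_output = model.generate(
        new_ids,
        max_length=60,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(new_output[0]))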

