Using Transformers Pretrained Models: Language Modeling

Language modeling is the task of fitting a model to a corpus, which can be domain-specific or general. All mainstream transformer-based models (not to be confused with the transformers package itself) are trained with some variant of language modeling: BERT uses masked language modeling, and GPT-2 uses causal language modeling.

Besides pretraining, language modeling is also useful when transferring a model to a new domain, for example fine-tuning a model that was pretrained on a very large corpus onto a new dataset, as sketched below.
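
Such fine-tuning of a masked language model on a new corpus can be done with the Trainer API. The following is only a minimal sketch: the corpus file train.txt, the output directory and the hyperparameters are illustrative assumptions, not part of the original example.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

# Assumed corpus: one sentence per line in train.txt
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)

# Randomly masks 15% of the tokens in every batch (the standard MLM objective)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(output_dir="./mlm-finetuned", num_train_epochs=1,
                                  per_device_train_batch_size=8)

Trainer(model=model, args=training_args, data_collator=data_collator,
        train_dataset=dataset).train()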

Masked Language Modeling

In masked language modeling, the model is given a sequence containing the special [MASK] token and has to predict the word that originally stood at the masked position. For example, given "我[MASK]你", the model should predict a word such as "爱" (love), "喜欢" (like) or "恨" (hate) for the [MASK] slot. This task lets the model attend to the context on both sides of the [MASK] (some tasks only allow looking at one side).
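
To make the "我[MASK]你" example concrete, here is a minimal sketch using the fill-mask pipeline with a Chinese checkpoint; bert-base-chinese is an illustrative choice, any Chinese masked language model would do.

from transformers import pipeline

# bert-base-chinese is a Chinese masked language model whose mask token is [MASK]
nlp = pipeline("fill-mask", model="bert-base-chinese")

# Print the top candidates for the masked position in "我[MASK]你"
for prediction in nlp(f"我{nlp.tokenizer.mask_token}你"):
    print(prediction["token_str"], prediction["score"])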

This kind of training builds a solid foundation for downstream tasks that need a bidirectional view of the context, such as question answering (e.g. the SQuAD dataset).

Using pipeline

Of course, you can apply this kind of model very quickly with pipeline.

Example code:

from transformers import pipeline
from pprint import pprint

nlp = pipeline("fill-mask")
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

Output:

[{'score': 0.1792745739221573,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to '
              'solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'tool'},
 {'score': 0.11349421739578247,
  'sequence': '<s>HuggingFace is creating a framework that the community uses '
              'to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'framework'},
 {'score': 0.05243554711341858,
  'sequence': '<s>HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'library'},
 {'score': 0.03493533283472061,
  'sequence': '<s>HuggingFace is creating a database that the community uses '
              'to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'database'},
 {'score': 0.02860250137746334,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses '
              'to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'prototype'}]

Using a model and tokenizer

You can also do the same thing with a model and a tokenizer directly, following these steps:

  1. Instantiate a DistilBERT model and its tokenizer.
  2. Create a sequence and replace the word you want to predict with tokenizer.mask_token.
  3. Encode the sequence and find the position of the [MASK] token.
  4. Feed the sequence to the model and get the predictions. The result is a tensor of shape [1, sequence_length, vocab_size], giving a score for every word in the vocabulary at every position; the model assigns higher scores to words that fit the context.
  5. Use PyTorch's topk method to get the indices of the highest-scoring words.
  6. Replace the [MASK] token with the words those indices point to.

Example code:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased", cache_dir="./transformersModels/MLM")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased", cache_dir="./transformersModels/MLM", return_dict=True)

sequence = f"Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

# Encode the sequence into token ids
input = tokenizer.encode(sequence, return_tensors="pt")

# Find the position of the [MASK] token (there is exactly one mask and the batch size is 1)
mask_token_index = torch.where(input == tokenizer.convert_tokens_to_ids(tokenizer.mask_token))[1].item()

token_logits = model(input).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=0).indices

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Output:

Using them instead of the large versions would help reduce our carbon footprint.
Using them instead of the large versions would help increase our carbon footprint.
Using them instead of the large versions would help decrease our carbon footprint.
Using them instead of the large versions would help improve our carbon footprint.
Using them instead of the large versions would help offset our carbon footprint.

Of course, it can also be used to predict several [MASK] tokens at once. Example code:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch
import random

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased", cache_dir="./transformersModels/MLM")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased", cache_dir="./transformersModels/MLM", return_dict=True)

# Original sentence before masking:
# Last year, I went to the countryside to get my internship, my duty was to be a teacher, teaching the middle school students English.

sequence = f"Last year, I went to the countryside to get my {tokenizer.mask_token}, my duty was to be a teacher, teaching the middle {tokenizer.mask_token} students English. "

# Encode the sequence into token ids
input = tokenizer.encode(sequence, return_tensors="pt")

# Boolean mask marking the positions of the [MASK] tokens
mask_token_index = (input == tokenizer.convert_tokens_to_ids(tokenizer.mask_token))

token_logits = model(input).logits
mask_token_logits = token_logits[0, mask_token_index[0], :]

top_5_tokens = []
for mask_token_logit in mask_token_logits:
    top_5_tokens.append(torch.topk(mask_token_logit, 5, dim=0).indices)

"""
由于预测结果可以相互组合,因此有多种结果
这里只输出 n 种结果
每次从 top_5_token 种随机抽取一个词
"""
n = 2
for i in range(n):
    seq = sequence
    for top_5_token in top_5_tokens:
        random_token = random.choice(top_5_token)
        seq = seq.replace(tokenizer.mask_token, tokenizer.decode([random_token]), 1)
    print(seq)

Output:

Last year, I went to the countryside to get my education, my duty was to be a teacher, teaching the middle age students English. 
Last year, I went to the countryside to get my education, my duty was to be a teacher, teaching the middle class students English.

Causal Language Modeling

Causal language modeling is the task of predicting the token that follows a given piece of text. In this task the model only attends to the context on the left, which makes it a natural fit for tasks like text generation.

Usually, the next token is predicted from the last hidden state the model produces for the preceding text.

Example:

from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2", cache_dir="./transformersModels/CLM")
model = AutoModelWithLMHead.from_pretrained("gpt2", cache_dir="./transformersModels/CLM", return_dict=True)

sequence = f"I am Student"

input_ids = tokenizer.encode(sequence, return_tensors="pt")

# get logits of last hidden state
next_token_logits = model(input_ids).logits[:, -1, :]

# Filter the distribution with top-k / top-p (nucleus) filtering
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# Sample the next token from the filtered distribution
probs = F.softmax(filtered_next_token_logits, dim=-1)

"""
torch.multinomial把输入的值看作是索引的权重,然后进行随机取样
"""
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])

print(resulting_string)

Output:

I am Student of
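
To generate more than one token, the sampling step above can simply be repeated, feeding each sampled token back into the model before the next forward pass; in practice model.generate performs this loop for you. The following is only a minimal self-contained sketch under the same setup (GPT-2 with top-k/top-p filtering); generating 10 new tokens is an arbitrary choice.

from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2", return_dict=True)

generated = tokenizer.encode("I am Student", return_tensors="pt")

# Repeat the single-step sampling above, appending each new token to the input
for _ in range(10):  # 10 new tokens, purely for illustration
    next_token_logits = model(generated).logits[:, -1, :]
    filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
    probs = F.softmax(filtered_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated.tolist()[0]))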

Text Generation

The goal of text generation (also called open-ended text generation) is to produce a coherent continuation of a given text. The examples below show how to generate text with GPT-2.

Using pipeline

By default, all models created with pipeline apply Top-K sampling; the example below passes do_sample=False to disable sampling and get deterministic output.

Code:

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

Output:

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

The model generates a text of at most 50 tokens in total (both words and punctuation count) that continues the prompt "As far as I am concerned, I will".
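
The sampling behavior can also be controlled explicitly when calling the pipeline. This is just a small illustrative sketch; the values do_sample=True, top_k=50 and top_p=0.95 are arbitrary choices, not recommended settings.

from transformers import pipeline

text_generator = pipeline("text-generation")

# Enable sampling and set top-k / top-p explicitly instead of greedy decoding
print(text_generator("As far as I am concerned, I will",
                     max_length=50, do_sample=True, top_k=50, top_p=0.95))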

Using a model and tokenizer

The following text generation example uses the XLNet model and its tokenizer.

Example code:

cache_dir="./transformersModels/text-generation"
"""
,cache_dir = cache_dir
"""
from transformers import AutoModelWithLMHead, AutoTokenizer

model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased",cache_dir = cache_dir, return_dict=True)
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased",cache_dir = cache_dir)

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
# The extra padding text helps the model generate better text for short prompts
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

# Prompt for the generated text
prompt = "Today the weather is really nice and I am planning on "

inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")

prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
# max_length is the maximum length of padding text + prompt + generated text
# The output also contains the tokens of padding text + prompt + generated text
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)

print("完整输出:", tokenizer.decode(outputs[0]))

# Keep only the prompt and the generated text.
# Note: outputs[0] is decoded without skip_special_tokens here, so the slice at
# prompt_length can land a few characters inside the prompt (visible in the output below).
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
print("Prompt + generated text:", generated)

Output:

Full output: In 1991, the remains of Russian Tsar Nicholas II and his family (except for Alexei and Maria) are discovered. The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the remainder of the story. 1883 Western Siberia, a young Grigori Rasputin is asked by his father and a group of men to perform magic. Rasputin has a vision and denounces one of the men as a horse thief. Although his father initially slaps him for making such an accusation, Rasputin watches as the man is chased outside and beaten. Twenty years later, Rasputin sees a vision of the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, with people, even a bishop, begging for his blessing.<eod></s> <eos>Today the weather is really nice and I am planning on baking for the whole week. I know that the day of the next will be a "great day." That day is Friday 1 July (10 days ago), so I will go back to the week. I think I’ll get some sleep in on Sunday at 6:30 am. If I can get some sleep that night, I’ll
Prompt + generated text: Today the weather is really nice and I am planning on anning on baking for the whole week. I know that the day of the next will be a "great day." That day is Friday 1 July (10 days ago), so I will go back to the week. I think I’ll get some sleep in on Sunday at 6:30 am. If I can get some sleep that night, I’ll

The models currently usable for text generation are GPT-2, OpenAI GPT, CTRL, XLNet, Transfo-XL and Reformer, with implementations in both PyTorch and TensorFlow. As the example above shows, XLNet and Transfo-XL usually need padding text to work well; try removing PADDING_TEXT and compare the results. GPT-2 is usually a good choice for open-ended text generation because it is a causal language model trained on millions of web pages.
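
For comparison, GPT-2 needs no padding text for short prompts. Below is only a minimal sketch with model.generate; the prompt, the cache_dir and the sampling settings are illustrative choices.

from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", cache_dir="./transformersModels/CLM")
model = AutoModelWithLMHead.from_pretrained("gpt2", cache_dir="./transformersModels/CLM", return_dict=True)

# No PADDING_TEXT is needed: GPT-2 handles short prompts directly
inputs = tokenizer.encode("Today the weather is really nice and I am planning on", return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, top_p=0.95, top_k=60)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))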
