昇思25天学习打卡营第6天
文本解码原理--以MindNLP为例
回顾:自回归语言模型
根据前文预测下一个单词
一个文本序列的概率分布可以分解为每个词基于其上文的条件概率的乘积
- 𝑊_0:初始上下文单词序列
- 𝑇: 时间步
- 当生成EOS标签时,停止生成。
MindNLP/huggingface Transformers提供的文本生成方法
Greedy search
在每个时间步𝑡都简单地选择概率最高的词作为当前输出词:
𝑤_𝑡=𝑎𝑟𝑔𝑚𝑎𝑥_𝑤 𝑃(𝑤|𝑤_(1:𝑡−1))
按照贪心搜索输出序列(“The”,“nice”,“woman”) 的条件概率为:0.5 x 0.4 = 0.2
缺点: 错过了隐藏在低概率词后面的高概率词,如:dog=0.5, has=0.9
环境准备
%%capture captured_output
# 实验环境已经预装了mindspore==2.2.14,如需更换mindspore版本,可更改下面mindspore的版本号
!pip uninstall mindspore -y
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindspore==2.2.14
!pip uninstall mindvision -y
!pip uninstall mindinsight -y
# 该案例在 mindnlp 0.3.1 版本完成适配,如果发现案例跑不通,可以指定mindnlp版本,执行`!pip install mindnlp==0.3.1`
!pip install mindnlp
#greedy_search
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
结果输出:
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.
I'm not sure if I'll
Beam search
Beam search通过在每个时间步保留最可能的 num_beams 个词,并从中最终选择出概率最高的序列来降低丢失潜在的高概率序列的风险。如图以 num_beams=2 为例:
(“The”,“dog”,“has”) : 0.4 * 0.9 = 0.36
(“The”,“nice”,“woman”) : 0.5 * 0.4 = 0.20
优点:一定程度保留最优路径
缺点:1. 无法解决重复问题;2. 开放域生成效果差
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# activate beam search and early_stopping
beam_output = model.generate(
input_ids,
max_length=50,
num_beams=5,
early_stopping=True
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')
# set no_repeat_ngram_size to 2
beam_output = model.generate(
input_ids,
max_length=50,
num_beams=5,
no_repeat_ngram_size=2,
early_stopping=True
)
print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')
# set return_num_sequences > 1
beam_outputs = model.generate(
input_ids,
max_length=50,
num_beams=5,
no_repeat_ngram_size=2,
num_return_sequences=5,
early_stopping=True
)
# now we have 3 output sequences
print("return_num_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')
结果输出:
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I don't think I'll ever be able to walk with her again."
"I don't think I
----------------------------------------------------------------------------------------------------
Beam search with ngram, Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I'm not sure what to say to that," she said. "I mean, it's not like I'm
----------------------------------------------------------------------------------------------------
return_num_sequences, Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I'm not sure what to say to that," she said. "I mean, it's not like I'm
1: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I'm not sure what to say to that," she said. "I mean, it's not like she's
2: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I'm not sure what to say to that," she said. "I mean, it's not like we're
3: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I'm not sure what to say to that," she said. "I mean, it's not like I've
4: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."
"I'm not sure what to say to that," she said. "I mean, it's not like I can
----------------------------------------------------------------------------------------------------
Beam search issues
缺点:1. 无法解决重复问题;2. 开放域生成效果差
Repeat problem
n-gram 惩罚:
将出现过的候选词的概率设置为 0
设置no_repeat_ngram_size=2 ,任意 2-gram 不会出现两次
Notice: 实际文本生成需要重复出现
Sample
根据当前条件概率分布随机选择输出词𝑤_𝑡
(“car”) ~P(w∣"The")
(“drives”) ~P(w∣"The",“car”)
优点:文本生成多样性高
缺点:生成文本不连续
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
mindspore.set_seed(0)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
input_ids,
do_sample=True,
max_length=50,
top_k=0
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
结果输出:
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"
I realized what Neddy meant when he first launched the website. "Thank you so much for joining."
I
Temperature
降低softmax 的temperature使 P(w∣w1:t−1)分布更陡峭
增加高概率单词的似然并降低低概率单词的似然
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
mindspore.set_seed(1234)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
input_ids,
do_sample=True,
max_length=50,
top_k=0,
temperature=0.7
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
结果输出:
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and have never had a problem with her until now.
A large dog named Chucky managed to get a few long stretches of grass on her back and ran around with it for about 5 minutes, ran around
TopK sample
选出概率最大的 K 个词,重新归一化,最后在归一化后的 K 个词中采样
TopK sample problems
将采样池限制为固定大小 K :
- 在分布比较尖锐的时候产生胡言乱语
- 在分布比较平坦的时候限制模型的创造力
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
mindspore.set_seed(0)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
input_ids,
do_sample=True,
max_length=50,
top_k=50
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
结果输出:
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog.
She's always up for some action, so I have seen her do some stuff with it.
Then there's the two of us.
The two of us I'm talking about were
Top-P sample
在累积概率超过概率 p 的最小单词集中进行采样,重新归一化
采样池可以根据下一个词的概率分布动态增加和减少
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
mindspore.set_seed(0)
# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
input_ids,
do_sample=True,
max_length=50,
top_p=0.92,
top_k=0
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
结果输出:
Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"
I realized what Neddy meant when he first launched the website. "Thank you so much for joining."
I
top_k_top_p
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
mindspore.set_seed(0)
# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
input_ids,
do_sample=True,
max_length=50,
top_k=5,
top_p=0.95,
num_return_sequences=3
)
print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
结果输出:
Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog.
"My dog loves the smell of the dog. I'm so happy that she's happy with me.
"I love to walk with my dog. I'm so happy that she's happy
1: I enjoy walking with my cute dog. I'm a big fan of my cat and her dog, but I don't have the same enthusiasm for her. It's hard not to like her because it is my dog.
My husband, who
2: I enjoy walking with my cute dog, but I'm also not sure I would want my dog to walk alone with me."
She also told The Daily Beast that the dog is very protective.
"I think she's very protective of
MindNLP ChatGLM-6B StreamChat
本案例基于MindNLP和ChatGLM-6B实现一个聊天应用。
1 环境配置
%%capture captured_output
# 实验环境已经预装了mindspore==2.2.14,如需更换mindspore版本,可更改下面mindspore的版本号
!pip uninstall mindspore -y
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindspore==2.2.14
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindnlp
!pip install -i https://pypi.mirrors.ustc.edu.cn/simple mdtex2html
配置网络线路
!export HF_ENDPOINT=https://hf-mirror.com
2 代码开发
下载权重大约需要10分钟
from mindnlp.transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import gradio as gr
import mdtex2html
model = AutoModelForSeq2SeqLM.from_pretrained('ZhipuAI/ChatGLM-6B', mirror="modelscope").half()
model.set_train(False)
tokenizer = AutoTokenizer.from_pretrained('ZhipuAI/ChatGLM-6B', mirror="modelscope")
可以修改下列参数和prompt体验模型
prompt = '大佬,今天你吃了嘛'
history = []
response, _ = model.chat(tokenizer, prompt, history=history, max_length=1000)
response
结果输出:
'抱歉,作为一个人工智能语言模型,我没有身体,所以我无法吃东西。但我可以回答你关于饮食和营养的问题,如果你有任何需要帮助的,请随时问我。'
打卡记录