Taking gpt2 as an example
Load the model and run inference.
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
import torch

# load the config, model, and tokenizer from a local directory
config = GPT2Config.from_pretrained("../model/gpt2")
model = GPT2LMHeadModel.from_pretrained("../model/gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("../model/gpt2")

prompt = "I thought this movie was glorious, I appreciated it. Conclusion: This movie is"
inputs = tokenizer(prompt, return_tensors="pt")
# forward pass; output_hidden_states=True also returns every layer's hidden states
output = model(inputs.input_ids, output_hidden_states=True)
What does output contain?
Looking at the source code of modeling_gpt2.py, in the import section:
from ...modeling_outputs import (
BaseModelOutputWithPastAndCrossAttentions,
CausalLMOutputWithCrossAttentions,
QuestionAnsweringModelOutput,
SequenceClassifierOutputWithPast,
TokenClassifierOutput,
)
Looking further into the modeling_outputs.py file, we can see the class that output is an instance of:
class CausalLMOutputWithCrossAttentions(ModelOutput):
"""
Base class for causal language model (or autoregressive) outputs.
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Cross attentions weights after the attention softmax, used to compute the weighted average in the
cross-attention heads.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `torch.FloatTensor` tuples of length `config.n_layers`, with each tuple containing the cached key,
value states of the self-attention and the cross-attention layers if model is used in encoder-decoder
setting. Only relevant if `config.is_decoder = True`.
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
"""
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
Therefore, output gives access to loss, logits, hidden_states (which requires either passing output_hidden_states=True to model() or setting config.output_hidden_states=True in the config), and so on.
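For the forward pass above, these fields can be inspected directly (a small sketch; the sequence length of 18 matches the prompt used in this post, loss stays None because no labels were passed, and attentions stays None because output_attentions=True was not set):

# output is a CausalLMOutputWithCrossAttentions instance
print(type(output).__name__)   # CausalLMOutputWithCrossAttentions
print(output.loss)             # None, since no labels were provided
print(output.logits.shape)     # torch.Size([1, 18, 50257])
print(output.attentions)       # None, since output_attentions=True was not passed
print(output.past_key_values is not None)  # True, use_cache defaults to True for gpt2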
What is the relationship between hidden_states and logits?
hidden_states contains the output of the embedding layer plus the output of every transformer block, so hidden_states[-1] is the last layer's output; a single linear transformation on top of it yields the logits.
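This is easy to check on the output from the forward pass above (a quick sketch; 13 = 1 embedding output + 12 blocks for the base gpt2 model):

# one entry for the embedding output plus one per GPT2Block
print(len(output.hidden_states))       # 13
print(output.hidden_states[0].shape)   # torch.Size([1, 18, 768]) -- embedding output
print(output.hidden_states[-1].shape)  # torch.Size([1, 18, 768]) -- last layer output (after ln_f)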
Taking gpt2 (d_model = 768) and the prompt above (18 tokens) as an example, here is the structure of gpt2 (for details, see the gpt2结构-CSDN博客 post):
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
The linear layer on the last line, (lm_head): Linear(in_features=768, out_features=50257, bias=False), is exactly the transformation that maps hidden_states[-1] to the logits.
hidden_states[-1] has shape [1, 18, 768] and the weight of this linear layer, model.lm_head.weight, has shape [50257, 768]. Multiplying the two matrices:
logits2 = torch.matmul(output.hidden_states[-1], model.lm_head.weight.transpose(0, 1))
The resulting logits2 is identical to output.logits; its shape is [1, 18, 50257], where 50257 is the vocabulary size.
To summarize: getting from hidden_states[-1] to the logits takes only a single linear transformation.
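This can be verified numerically right after the matmul above (a small check; atol is loosened slightly to allow for floating-point rounding):

print(logits2.shape)                                      # torch.Size([1, 18, 50257])
print(torch.allclose(logits2, output.logits, atol=1e-5))  # True
# applying the lm_head module itself gives the same result
print(torch.allclose(model.lm_head(output.hidden_states[-1]), output.logits, atol=1e-5))  # True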
From logits to the next token
Once we have the logits, we take the entry with the highest score; the corresponding word in the vocabulary is the next token:
# after obtaining the logits, keep only the scores for the last position
logits = output.logits[0, -1, :]
probs = torch.softmax(logits, dim=-1)
print(probs.size())  # torch.Size([50257])
# index of the most likely next token
next_token_index = torch.argmax(probs, dim=-1)
# use the tokenizer to turn the index back into a word
next_token = tokenizer.decode([next_token_index.item()])
print(next_token)  # "a"
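Putting the pieces together, the same argmax step can be repeated to continue the text greedily (a minimal sketch, not the library's generate() API; the exact continuation depends on the model weights):

# greedy decoding: append the argmax token and run the model again
generated = inputs.input_ids
with torch.no_grad():
    for _ in range(5):
        out = model(generated)
        next_id = torch.argmax(out.logits[0, -1, :])  # most likely next token
        generated = torch.cat([generated, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(generated[0].tolist()))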