Taking gpt2 as an example
Load the model and run inference.
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
import torch

# load the config, model, and tokenizer from a local directory
config = GPT2Config.from_pretrained("../model/gpt2")
model = GPT2LMHeadModel.from_pretrained("../model/gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("../model/gpt2")

prompt = "I thought this movie was glorious, I appreciated it. Conclusion: This movie is"
inputs = tokenizer(prompt, return_tensors="pt")
# forward pass; output_hidden_states=True also returns every layer's hidden states
output = model(inputs.input_ids, output_hidden_states=True)
What does output contain?
Looking at the source code of modeling_gpt2.py, in the import section:
from ...modeling_outputs import (
BaseModelOutputWithPastAndCrossAttentions,
CausalLMOutputWithCrossAttentions,
QuestionAnsweringModelOutput,
SequenceClassifierOutputWithPast,
TokenClassifierOutput,
)
Looking further into the modeling_outputs.py file, we can see the class that output is an instance of:
class CausalLMOutputWithCrossAttentions(ModelOutput):
"""
Base class for causal language model (or autoregressive) outputs.
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Cross attentions weights after the attention softmax, used to compute the weighted average in the
cross-attention heads.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `torch.FloatTensor` tuples of length `config.n_layers`, with each tuple containing the cached key,
value states of the self-attention and the cross-attention layers if model is used in encoder-decoder
setting. Only relevant if `config.is_decoder = True`.
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
"""
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
cross_attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
Therefore, output gives access to loss, logits, hidden_states (which requires either passing output_hidden_states=True to model() or setting config.output_hidden_states=True in the config), and so on.
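For the forward pass above, these fields can be inspected directly (a small sketch; the sequence length of 18 matches the prompt used in this post, loss stays None because no labels were passed, and attentions stays None because output_attentions=True was not set):

# output is a CausalLMOutputWithCrossAttentions instance
print(type(output).__name__)   # CausalLMOutputWithCrossAttentions
print(output.loss)             # None, since no labels were provided
print(output.logits.shape)     # torch.Size([1, 18, 50257])
print(output.attentions)       # None, since output_attentions=True was not passed
print(output.past_key_values is not None)  # True, use_cache defaults to True for gpt2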
What is the relationship between hidden_states and logits?
hidden_states contains the output of the embedding layer plus the output of every transformer block, so hidden_states[-1] is the last layer's output; a single linear transformation on top of it yields the logits.
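This is easy to check on the output from the forward pass above (a quick sketch; 13 = 1 embedding output + 12 blocks for the base gpt2 model):

# one entry for the embedding output plus one per GPT2Block
print(len(output.hidden_states))       # 13
print(output.hidden_states[0].shape)   # torch.Size([1, 18, 768]) -- embedding output
print(output.hidden_states[-1].shape)  # torch.Size([1, 18, 768]) -- last layer output (after ln_f)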
Taking gpt2 (d_model = 768) and the prompt above (18 tokens) as an example, here is the structure of gpt2 (for details, see the gpt2结构-CSDN博客 post):
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
The linear layer on the last line, (lm_head): Linear(in_features=768, out_features=50257, bias=False), is exactly the transformation that maps hidden_states[-1] to the logits.
hidden_states[-1] has shape [1, 18, 768] and the weight of this linear layer, model.lm_head.weight, has shape [50257, 768]. Multiplying the two matrices:
logits2 = torch.matmul(output.hidden_states[-1], model.lm_head.weight.transpose(0, 1))
The resulting logits2 is identical to output.logits; its shape is [1, 18, 50257], where 50257 is the vocabulary size.
To summarize: getting from hidden_states[-1] to the logits takes only a single linear transformation.
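This can be verified numerically right after the matmul above (a small check; atol is loosened slightly to allow for floating-point rounding):

print(logits2.shape)                                      # torch.Size([1, 18, 50257])
print(torch.allclose(logits2, output.logits, atol=1e-5))  # True
# applying the lm_head module itself gives the same result
print(torch.allclose(model.lm_head(output.hidden_states[-1]), output.logits, atol=1e-5))  # True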
From logits to the next token
Once we have the logits, we take the entry with the highest score; the corresponding word in the vocabulary is the next token:
# after obtaining the logits, keep only the scores for the last position
logits = output.logits[0, -1, :]
probs = torch.softmax(logits, dim=-1)
print(probs.size())  # torch.Size([50257])
# index of the most likely next token
next_token_index = torch.argmax(probs, dim=-1)
# use the tokenizer to turn the index back into a word
next_token = tokenizer.decode([next_token_index.item()])
print(next_token)  # "a"
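Putting the pieces together, the same argmax step can be repeated to continue the text greedily (a minimal sketch, not the library's generate() API; the exact continuation depends on the model weights):

# greedy decoding: append the argmax token and run the model again
generated = inputs.input_ids
with torch.no_grad():
    for _ in range(5):
        out = model(generated)
        next_id = torch.argmax(out.logits[0, -1, :])  # most likely next token
        generated = torch.cat([generated, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(generated[0].tolist()))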