From a language model's hidden_states to the logits: what transformation happens in between?

Take GPT-2 as an example.

First, load the model and run a forward pass.

from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
import torch

# local path to the GPT-2 checkpoint; adjust to your own setup
config = GPT2Config.from_pretrained("../model/gpt2")
model = GPT2LMHeadModel.from_pretrained("../model/gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("../model/gpt2")

prompt = "I thought this movie was glorious, I appreciated it. Conclusion: This movie is"
inputs = tokenizer(prompt, return_tensors="pt")
# output_hidden_states=True tells the model to also return every layer's hidden states
output = model(inputs.input_ids, output_hidden_states=True)

What does output contain?

Looking at the import section of the modeling_gpt2 source code:

from ...modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
    QuestionAnsweringModelOutput,
    SequenceClassifierOutputWithPast,
    TokenClassifierOutput,
)

Digging one level deeper into modeling_outputs.py, we can see the class that output is an instance of:

class CausalLMOutputWithCrossAttentions(ModelOutput):
    """
    Base class for causal language model (or autoregressive) outputs.

    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Cross attentions weights after the attention softmax, used to compute the weighted average in the
            cross-attention heads.
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `torch.FloatTensor` tuples of length `config.n_layers`, with each tuple containing the cached key,
            value states of the self-attention and the cross-attention layers if model is used in encoder-decoder
            setting. Only relevant if `config.is_decoder = True`.

            Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
    """

    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
    cross_attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

So output exposes loss, logits, hidden_states (which requires either passing output_hidden_states=True to model() or setting config.output_hidden_states=True), and so on.
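A quick way to see what actually came back is to print the fields (the shapes below assume the 18-token prompt above; loss stays None because no labels were passed):

print(type(output).__name__)  # CausalLMOutputWithCrossAttentions
print(output.loss)            # None, since no labels were provided
print(output.logits.shape)    # torch.Size([1, 18, 50257])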

What is the relationship between hidden_states and logits?

hidden_states is a tuple holding the embedding output plus the output of each transformer block, so hidden_states[-1] is the final hidden state (for GPT-2 it is taken after the final LayerNorm ln_f). One linear transformation on top of it produces the logits.
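A quick look at the tuple (shapes again assume the 18-token prompt):

print(len(output.hidden_states))  # 13: the embedding output plus the 12 block outputs
for i, h in enumerate(output.hidden_states):
    print(i, h.shape)             # each one is torch.Size([1, 18, 768])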

Take GPT-2 (d_model = 768) and the prompt above (18 tokens) as an example. Here is the GPT-2 structure first (see the gpt2结构 post on CSDN for a detailed walkthrough):

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
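The module tree above is simply what print(model) returns; the head at the bottom can be inspected on its own as well:

print(model.lm_head)               # Linear(in_features=768, out_features=50257, bias=False)
print(model.lm_head.weight.shape)  # torch.Size([50257, 768])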

The last line, (lm_head): Linear(in_features=768, out_features=50257, bias=False), is exactly the linear layer that maps hidden_states[-1] to the logits.

hidden_states[-1] has shape [1, 18, 768], and the weight of the linear layer, model.lm_head.weight, has shape [50257, 768]. Multiplying the two matrices:

# x @ W^T: [1, 18, 768] @ [768, 50257] -> [1, 18, 50257]
logits2 = torch.matmul(output.hidden_states[-1], model.lm_head.weight.transpose(0, 1))

gives a logits2 that is identical to output.logits, with shape [1, 18, 50257], where 50257 is the vocabulary size.
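Comparing the two tensors confirms it (the small tolerance only covers floating-point noise between the two matmul code paths):

print(logits2.shape)                                      # torch.Size([1, 18, 50257])
print(torch.allclose(logits2, output.logits, atol=1e-5))  # True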

To sum up: getting from hidden_states[-1] to the logits takes nothing more than a single linear transformation.
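Equivalently, calling the lm_head module on the final hidden state reproduces output.logits, since that is exactly the call GPT2LMHeadModel performs internally:

logits3 = model.lm_head(output.hidden_states[-1])
print(torch.allclose(logits3, output.logits))  # True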

From logits to the next token

Once we have the logits, the highest-scoring entry at the last position corresponds to the word in the vocabulary that the model predicts as the next token.

# starting from the logits obtained above, take the last position of the sequence
logits = output.logits[0, -1, :]
probs = torch.softmax(logits, dim=-1)
print(probs.size())  # torch.Size([50257])
# index of the most probable token (softmax keeps the ordering, so this equals argmax over the logits)
next_token_index = torch.argmax(probs, dim=-1)

# use the tokenizer to turn the index back into a word
next_token = tokenizer.decode(next_token_index.item())
print(next_token)  # "a"
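As a cross-check, greedy decoding with model.generate picks the same token (pad_token_id is passed only to silence the warning GPT-2 gives because it has no pad token):

gen_ids = model.generate(inputs.input_ids, max_new_tokens=1, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(gen_ids[0, -1].item()))  # "a", the same next token as the argmax above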
