How to Get a Sentence Representation from a Pretrained Transformers Model

This post follows directly from my earlier one: https://blog.csdn.net/qysh123/article/details/109666416

After we have pretrained a Transformers model (such as RoBERTa), how do we use the trained model to generate a representation for a given sentence? The process is actually quite simple.

I saw someone asking exactly this question here: https://github.com/huggingface/transformers/issues/2986

The first answer mentions a notebook the author wrote to explain the process; I went through it carefully and it really is a thorough walkthrough: https://github.com/BramVanroy/bert-for-inference/blob/master/introduction-to-bert.ipynb

I will just add a small supplement. As that author puts it: "Be careful, though, because the differences between model APIs, however small, are incredibly important. For instance, the position of the classification token is not the same for all models." So each pretrained model does have its own quirks. In my previous post I pretrained RoBERTa, so everything here needs to match that choice. transformers defines quite a few RoBERTa model classes: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_roberta.py
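For instance, BERT's classification token is [CLS] at position 0, while RoBERTa uses <s> (with id 0). A quick sketch to compare the two, assuming the public bert-base-uncased and roberta-base checkpoints (this comparison is my own addition, not from the notebook):

from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

print(bert_tok.cls_token, bert_tok.sep_token)        # [CLS] [SEP]
print(roberta_tok.cls_token, roberta_tok.sep_token)  # <s> </s>
print(bert_tok.encode("hello world"))     # starts with the [CLS] id, 101
print(roberta_tok.encode("hello world"))  # starts with the <s> id, 0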

When training I used RobertaForMaskedLM, which, obviously, uses MLM as the training objective. One more thing worth pointing out, following the explanation here: https://github.com/huggingface/transformers/issues/6414

"RobertaModel is not something you can use directly with Trainer as it doesn't have any objective (it's the base model without head). You should pick a model with head relevant to your task." So the RobertaModel class cannot be trained directly with Trainer. That raises a question: if I trained with RobertaForMaskedLM, can I load the saved model directly as RobertaModel? I tried it, and a saved checkpoint can be loaded as either RobertaModel or RobertaForMaskedLM. However, the two classes differ in how you obtain a sentence representation from them, which is the point to watch out for (see the sketch below).
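A minimal sketch of what I mean, using the "./Linux" checkpoint directory that appears later in this post:

from transformers import RobertaModel, RobertaForMaskedLM

# The same saved checkpoint can be loaded with either class.
mlm_model = RobertaForMaskedLM.from_pretrained("./Linux")  # encoder + MLM head
base_model = RobertaModel.from_pretrained("./Linux")       # base encoder only (LM head weights are dropped)
# The encoder weights are shared, but the structure of the forward outputs differs,
# so the way to pull out a sentence representation differs too.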

I followed that tutorial and tried it myself. One more thing to note: by "sentence embedding" here I mean the hidden state produced by the encoder, not the embedding that is fed into the encoder. There is a discussion about this here: https://github.com/huggingface/transformers/issues/2072

The two methods in the first answer there both return the embedding that goes into the encoder. They are:

from transformers import RobertaTokenizer, RobertaModel
import torch

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

sentence = torch.tensor([tok.encode("Alright, let's do this")])
embedding_output = model.embeddings(sentence)

and:

from transformers import RobertaTokenizer, RobertaModel, RobertaConfig
import torch

config = RobertaConfig.from_pretrained("roberta-base")
config.output_hidden_states = True

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", config=config)

sentence = torch.tensor([tok.encode("Alright, let's do this")])

output = model(sentence)  # returns a tuple(sequence_output, pooled_output, hidden_states)
hidden_states = output[-1]

embedding_output = hidden_states[0]
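A small sanity check (my own sketch, assuming the same roberta-base setup as above): hidden_states[0] is exactly the output of model.embeddings, so the two methods should give the same tensor.

import torch
from transformers import RobertaTokenizer, RobertaModel, RobertaConfig

config = RobertaConfig.from_pretrained("roberta-base")
config.output_hidden_states = True

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", config=config)
model.eval()

sentence = torch.tensor([tok.encode("Alright, let's do this")])
with torch.no_grad():
    direct = model.embeddings(sentence)  # method 1
    output = model(sentence)
    from_hidden = output[-1][0]          # method 2: hidden_states[0]

print(torch.allclose(direct, from_hidden))  # expected: True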

Both snippets are very simple, but why do I say this embedding_output is the embedding fed into the encoder? It is easy to see once we look at how RobertaModel's embeddings module is defined (see this link: https://github.com/huggingface/transformers/blob/0c9bae09340dd8c6fdf6aa2ea5637e956efe0f7c/src/transformers/modeling_roberta.py#L587):

self.embeddings = RobertaEmbeddings(config)

The last few lines of the forward method of the RobertaEmbeddings class look like this:

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

Clearly, this is the embedding that is fed into the encoder.

With that settled, let's go back to the approach from the notebook above (https://github.com/BramVanroy/bert-for-inference/blob/master/introduction-to-bert.ipynb). As I said earlier, loading the model as RobertaModel versus RobertaForMaskedLM leads to different usage. Let's look at RobertaForMaskedLM first; here is my code:

import torch
from transformers import RobertaTokenizerFast

# Load the tokenizer saved together with the pretrained model.
tokenizer = RobertaTokenizerFast.from_pretrained("./Linux")

from transformers import RobertaForMaskedLM

# Ask the model to return all hidden states, as a dict-like output.
model = RobertaForMaskedLM.from_pretrained("./Linux", return_dict=True, output_hidden_states=True)

# Encode one "sentence" (here, a line of code from my dataset).
line_ids = tokenizer.encode('( <str> , addr ) <num>')
print('line_ids', line_ids)
print('line_tokens', tokenizer.convert_ids_to_tokens(line_ids))
line_ids = torch.LongTensor(line_ids)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
line_ids = line_ids.to(device)
model.eval()

print(line_ids.size())
# Add a batch dimension: (seq_len,) -> (1, seq_len).
line_ids = line_ids.unsqueeze(0)
print(line_ids.size())
with torch.no_grad():
    out = model(input_ids=line_ids)

print(out)
# With output_hidden_states=True and no labels, out contains (logits, hidden_states).
hidden_states = out[1]
print(len(hidden_states))
# hidden_states[-1] is the output of the last encoder layer.
embedding_output = hidden_states[-1]
print(embedding_output)
print(embedding_output.size())

# Mean-pool over the token dimension to get a single sentence vector.
sentence_embedding = torch.mean(embedding_output, dim=1).squeeze()
print(sentence_embedding)
print(sentence_embedding.size())

There are quite a few prints here; that is also how I learned what is going on. With the line model = RobertaForMaskedLM.from_pretrained("./Linux", return_dict=True, output_hidden_states=True) we ask RobertaForMaskedLM to return all of its hidden states; return_dict=True makes the output of RobertaForMaskedLM's forward function easier to understand: https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaForMaskedLM.forward
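Since return_dict=True is set, the output is a dict-like object, so besides integer indexing you can also access the fields by name, which I find clearer. A small sketch, continuing from the code above:

with torch.no_grad():
    out = model(input_ids=line_ids)

print(out.keys())                  # expected: odict_keys(['logits', 'hidden_states'])
hidden_states = out.hidden_states  # same as out[1] here (no labels, so no loss)
last_layer = hidden_states[-1]     # (batch_size, sequence_length, hidden_size)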

From that page (or from the source code), it is clear that RobertaForMaskedLM's forward function returns:

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

When we only specify output_hidden_states, the output effectively contains just two elements: logits and hidden_states, which is why the code above uses hidden_states = out[1] to grab all the hidden states. Printing len(hidden_states) gives 7. Why 7? It is the input embedding plus the 6 encoder layers (num_hidden_layers), which I defined when pretraining the model.
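This can be double-checked against the model config (a quick sketch; the 6 is simply the num_hidden_layers I chose when pretraining):

# len(hidden_states) should be num_hidden_layers + 1; the extra entry is the
# embedding output that feeds the first encoder layer.
print(model.config.num_hidden_layers)  # 6 in my case
assert len(hidden_states) == model.config.num_hidden_layers + 1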

Correspondingly, when using RobertaModel, the code looks like this:

import torch
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./Linux")

from transformers import RobertaModel

# Load the same checkpoint, but as the base model without the MLM head.
model = RobertaModel.from_pretrained("./Linux", return_dict=True, output_hidden_states=True)

line_ids = tokenizer.encode('( <str> , addr ) <num>')
print('line_ids', line_ids)
print('line_tokens', tokenizer.convert_ids_to_tokens(line_ids))
line_ids = torch.LongTensor(line_ids)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
line_ids = line_ids.to(device)
model.eval()

print(line_ids.size())
line_ids = line_ids.unsqueeze(0)
print(line_ids.size())
with torch.no_grad():
    # The embeddings module directly gives the encoder-input embedding.
    embedding_output = model.embeddings(line_ids)
    print(embedding_output)
    out = model(input_ids=line_ids)

print(out)
# out is (last_hidden_state, pooler_output, hidden_states) here, so index 2.
hidden_states = out[2]
print(len(hidden_states))
embedding_output = hidden_states[0]   # same as model.embeddings(line_ids)
print(embedding_output)
embedding_output = hidden_states[-1]  # output of the last encoder layer

sentence_embedding = torch.mean(embedding_output, dim=1).squeeze()

The differences here are: 1. RobertaModel exposes the embeddings module, which returns the embedding fed into the encoder (see the discussion above). 2. Here we use hidden_states = out[2], because RobertaModel's forward function returns (https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaModel.forward):

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

When we only specify output_hidden_states, the output effectively contains three elements: last_hidden_state, pooler_output and hidden_states, where hidden_states[0] == model.embeddings(line_ids) (print them out if you are curious) and hidden_states[-1] == last_hidden_state == out[0]. So in this case we can simply take out[0] to get the last layer's hidden state.
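These equalities can be checked directly (a quick sketch continuing the RobertaModel code above):

with torch.no_grad():
    emb = model.embeddings(line_ids)
    out = model(input_ids=line_ids)

print(torch.allclose(out[2][0], emb))      # hidden_states[0] == embeddings output
print(torch.allclose(out[2][-1], out[0]))  # hidden_states[-1] == last_hidden_state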

To sum up: the way to get the last layer's hidden state differs depending on which model class you load. That is all for this brief summary; I hope it helps anyone with a similar need.
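For my own use I wrapped the RobertaForMaskedLM path into a small helper. This is just a sketch based on the code above (the "./Linux" path and the example line are the ones used in this post), not any official API:

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

def sentence_embedding(model_dir, sentence):
    """Mean-pool the last hidden layer of a pretrained RobertaForMaskedLM."""
    tokenizer = RobertaTokenizerFast.from_pretrained(model_dir)
    model = RobertaForMaskedLM.from_pretrained(
        model_dir, return_dict=True, output_hidden_states=True
    )
    model.eval()
    ids = torch.LongTensor(tokenizer.encode(sentence)).unsqueeze(0)
    with torch.no_grad():
        out = model(input_ids=ids)
    last_hidden = out.hidden_states[-1]  # (1, seq_len, hidden_size)
    return torch.mean(last_hidden, dim=1).squeeze()

emb = sentence_embedding("./Linux", "( <str> , addr ) <num>")
print(emb.size())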

 
