How to get a sentence representation from a pretrained Transformers model

Article link: https://blog.csdn.net/qysh123/article/details/109673272

This post follows directly on from the previous one: https://blog.csdn.net/qysh123/article/details/109666416

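Once we have pretrained a Transformers model (RoBERTa, for example), how do we use the trained model to generate a representation of a given sentence? The process is actually quite simple: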

I found someone asking this very question here: https://github.com/huggingface/transformers/issues/2986

The first reply there says its author wrote an article walking through the process. I read it carefully and it is indeed very detailed: https://github.com/BramVanroy/bert-for-inference/blob/master/introduction-to-bert.ipynb

I will just add a few points. As that author puts it: Be careful, though, because the differences between model APIs, however small, are incredibly important. For instance, the position of the classification token is not the same for all models. So each pretrained model does differ somewhat. In the previous post I trained RoBERTa, so everything here has to match that. transformers defines quite a few RoBERTa model classes: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_roberta.py

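For training I used RobertaForMaskedLM, which, as the name makes obvious, uses MLM as its training task. One more thing worth pointing out in passing, according to the explanation here: https://github.com/huggingface/transformers/issues/6414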

RobertaModel is not something you can use directly with Trainer as it doesn't have any objective (it's the base model without head). You should pick a model with head relevant to your task. In other words, we cannot train the RobertaModel class directly with Trainer. That raises a question: if I trained with RobertaForMaskedLM, can I load the model as RobertaModel? I tried it, and a saved model can be loaded as either RobertaModel or RobertaForMaskedLM! However, the two differ in how you get a sentence representation out of them, which needs particular attention:
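The claim that one saved checkpoint loads as either class can be checked without my real ./Linux checkpoint. The sketch below is only an illustration under assumptions: it uses a tiny, randomly initialized RobertaForMaskedLM (the config sizes are made up, not the ones from my training run) and a temporary directory. As far as I understand, the RobertaModel loaded this way shares the encoder weights but has no lm_head, and transformers warns that its pooler weights are newly initialized, so pooler_output should not be relied on in that case.

```python
import tempfile

import torch
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaModel

torch.manual_seed(0)

# Hypothetical tiny config, chosen only so the example runs quickly offline.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
mlm_model = RobertaForMaskedLM(config)  # stands in for the model trained with MLM

with tempfile.TemporaryDirectory() as save_dir:
    mlm_model.save_pretrained(save_dir)
    # The same checkpoint directory is loaded as two different classes.
    as_mlm = RobertaForMaskedLM.from_pretrained(save_dir)
    as_base = RobertaModel.from_pretrained(save_dir)

# The encoder weights are shared between the two views of the checkpoint.
same_word_embeddings = torch.equal(
    as_mlm.roberta.embeddings.word_embeddings.weight,
    as_base.embeddings.word_embeddings.weight,
)
same_last_layer = torch.equal(
    as_mlm.roberta.encoder.layer[-1].output.dense.weight,
    as_base.encoder.layer[-1].output.dense.weight,
)
# Only the MLM class carries the language modeling head.
mlm_has_head = hasattr(as_mlm, "lm_head")
base_has_head = hasattr(as_base, "lm_head")

print(same_word_embeddings, same_last_layer, mlm_has_head, base_has_head)
```

Note that the encoder lives at as_mlm.roberta in the MLM class but is the model itself in the base class, which is one reason the two are used differently.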

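I followed the tutorial above and tried it out. One more thing to note: by sentence embedding we mean the hidden state after processing by the encoder, not the embedding that is fed into the encoder. There is a discussion here: https://github.com/huggingface/transformers/issues/2072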

The two methods given in the first reply there both produce the embedding that is fed into the encoder. They are:

from transformers import RobertaTokenizer, RobertaModel
import torch

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

sentence = torch.tensor([tok.encode("Alright, let's do this")])
embedding_output = model.embeddings(sentence)

and:

from transformers import RobertaTokenizer, RobertaModel, RobertaConfig
import torch

config = RobertaConfig.from_pretrained("roberta-base")
config.output_hidden_states = True

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", config=config)

sentence = torch.tensor([tok.encode("Alright, let's do this")])

output = model(sentence)  # returns a tuple(sequence_output, pooled_output, hidden_states)
hidden_states = output[-1]

embedding_output = hidden_states[0]

Both snippets are simple, but why do I say that embedding_output here is the embedding fed into the encoder? We only need to look at how the embeddings attribute of RobertaModel is defined (see this link: https://github.com/huggingface/transformers/blob/0c9bae09340dd8c6fdf6aa2ea5637e956efe0f7c/src/transformers/modeling_roberta.py#L587):

self.embeddings = RobertaEmbeddings(config)

The last few lines of the forward method of the RobertaEmbeddings class look like this:

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

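Clearly this is the embedding that is fed into the encoder: the sum of the word, position, and token type embeddings, followed by LayerNorm and dropout, before any encoder layer has run.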
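One way to see the difference between the input embedding and the encoder output is that the input embedding of a token does not depend on the other tokens in the sentence, while the encoder output does. The following is a minimal sketch, assuming a tiny randomly initialized RobertaModel (the config sizes and token ids are made up for illustration and are not from my trained model):

```python
import torch
from transformers import RobertaConfig, RobertaModel

torch.manual_seed(0)

# Hypothetical tiny config so the example runs offline in a moment.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = RobertaModel(config).eval()  # eval() switches dropout off

# Two "sentences" that differ only in the token at position 3.
ids_a = torch.tensor([[0, 5, 6, 7, 2]])
ids_b = torch.tensor([[0, 5, 6, 8, 2]])

with torch.no_grad():
    out_a = model(input_ids=ids_a, output_hidden_states=True, return_dict=True)
    out_b = model(input_ids=ids_b, output_hidden_states=True, return_dict=True)

# Compare the vectors of the unchanged token at position 1.
same_input_embedding = torch.allclose(
    out_a.hidden_states[0][0, 1], out_b.hidden_states[0][0, 1]
)
same_encoder_output = torch.allclose(
    out_a.hidden_states[-1][0, 1], out_b.hidden_states[-1][0, 1]
)

print(same_input_embedding)  # the input embedding ignores the other tokens
print(same_encoder_output)   # the encoder output is contextual
```

This is why a sentence representation should be built from the encoder's hidden states rather than from hidden_states[0].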

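With that settled, we go back to the method from the page mentioned above (https://github.com/BramVanroy/bert-for-inference/blob/master/introduction-to-bert.ipynb). As I said earlier, RobertaModel and RobertaForMaskedLM are used differently once the model is loaded. Let's look at RobertaForMaskedLM first; here is the code: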

import torch
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./Linux")

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("./Linux",return_dict=True, output_hidden_states = True)

line_ids = tokenizer.encode('( <str> , addr ) <num>')
print('line_ids', line_ids)
print('line_tokens', tokenizer.convert_ids_to_tokens(line_ids))
line_ids = torch.LongTensor(line_ids)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
line_ids = line_ids.to(device)
model.eval()

print(line_ids.size())
line_ids = line_ids.unsqueeze(0)
print(line_ids.size())
with torch.no_grad():
    out = model(input_ids=line_ids)
     
print(out)
hidden_states = out[1]
print(len(hidden_states))
embedding_output=hidden_states[-1]
print(embedding_output)
print(embedding_output.size())

sentence_embedding = torch.mean(embedding_output, dim=1).squeeze()
print(sentence_embedding)
print(sentence_embedding.size())
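The torch.mean(embedding_output, dim=1) step above works for a single sentence. If several sentences are padded into one batch, a plain mean would also average over the padding positions. The sketch below shows one common alternative, a mean weighted by the attention mask; it is not part of the tutorial I followed, and the helper name masked_mean_pool and the toy tensors are my own assumptions for illustration.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real tokens only (hypothetical helper).

    last_hidden_state: (batch_size, sequence_length, hidden_size)
    attention_mask:    (batch_size, sequence_length), 1 = real token, 0 = padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts


# Toy batch: the first sequence has one padding position holding junk values.
hidden = torch.tensor(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
        [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    ]
)
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(dim=1)  # what torch.mean alone would give

print(pooled)  # rows are [2., 3.] and [2., 2.]
print(plain)   # the first row is distorted by the padding position
```

With a tokenizer call that returns attention_mask, the same helper could be applied to hidden_states[-1] from the model output.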

There are a lot of prints; I was using them to learn as well. With this statement:
model = RobertaForMaskedLM.from_pretrained("./Linux",return_dict=True, output_hidden_states = True)
we ask RobertaForMaskedLM to return all hidden states. return_dict=True is there to make the output of RobertaForMaskedLM's forward function easier to understand: https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaForMaskedLM.forward

From that page (or from the source code) we can see clearly that the forward function of RobertaForMaskedLM returns:

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

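When we specify only output_hidden_states, the output has just two fields, logits and hidden_states (loss is absent because no labels are passed), which is why the code above uses hidden_states = out[1] to get all the hidden states. print(len(hidden_states)) gives 7. Why 7? It is the input embedding plus the 6 layers set by num_hidden_layers, which I defined when I trained the model earlier.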
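Because the position of hidden_states depends on which optional fields are present, indexing by number is fragile. The sketch below, assuming a tiny randomly initialized RobertaForMaskedLM with 6 layers (the other config sizes and the token ids are made up), shows the count of 7 and what happens to index 1 once labels are passed; accessing the field by name avoids the problem.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

torch.manual_seed(0)

# Hypothetical tiny config; only num_hidden_layers=6 mirrors the post.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = RobertaForMaskedLM(config).eval()
input_ids = torch.tensor([[0, 5, 6, 7, 2]])

with torch.no_grad():
    out = model(input_ids=input_ids, output_hidden_states=True, return_dict=True)
    # Passing labels adds a loss field in front of logits.
    out_with_loss = model(
        input_ids=input_ids,
        labels=input_ids,
        output_hidden_states=True,
        return_dict=True,
    )

num_states = len(out.hidden_states)  # embedding output + one per layer
index_one_is_hidden_states = out[1] is out.hidden_states
index_one_is_logits = out_with_loss[1] is out_with_loss.logits

print(num_states)                  # 7
print(index_one_is_hidden_states)  # True: fields are (logits, hidden_states)
print(index_one_is_logits)         # True: fields are (loss, logits, hidden_states)
```

So out.hidden_states[-1] gives the last layer in both cases, whereas out[1] only does so when no labels are supplied.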

Correspondingly, with RobertaModel the code looks like this:

import torch
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./Linux")

from transformers import RobertaModel
 
model = RobertaModel.from_pretrained("./Linux",return_dict=True, output_hidden_states = True)

line_ids = tokenizer.encode('( <str> , addr ) <num>')
print('line_ids', line_ids)
print('line_tokens', tokenizer.convert_ids_to_tokens(line_ids))
line_ids = torch.LongTensor(line_ids)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
line_ids = line_ids.to(device)
model.eval()

print(line_ids.size())
line_ids = line_ids.unsqueeze(0)
print(line_ids.size())
with torch.no_grad():
    embedding_output=model.embeddings(line_ids)
    print(embedding_output)
    out = model(input_ids=line_ids)
     
print(out)
hidden_states = out[2]
print(len(hidden_states))
embedding_output=hidden_states[0]
print(embedding_output)
embedding_output=hidden_states[-1]

sentence_embedding = torch.mean(embedding_output, dim=1).squeeze()

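The differences here are: 1. RobertaModel exposes an embeddings module that can be called directly as model.embeddings, and it returns the embedding fed into the encoder (see the discussion above). 2. We use hidden_states = out[2] because the forward function of RobertaModel returns (https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaModel.forward):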

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

When we specify only output_hidden_states, the output has three fields: last_hidden_state, pooler_output, and hidden_states. Here hidden_states[0]==model.embeddings(line_ids) (print them out if you are curious), and hidden_states[-1]==last_hidden_state==out[0], so in this case taking out[0] alone is enough to get the last layer's hidden state.
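Instead of comparing printed tensors by eye, the two equalities can be checked programmatically. This is a minimal sketch, assuming a tiny randomly initialized RobertaModel in eval mode (so dropout is off); the config sizes and token ids are made up and do not come from my trained model.

```python
import torch
from transformers import RobertaConfig, RobertaModel

torch.manual_seed(0)

# Hypothetical tiny config so the check runs offline.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = RobertaModel(config).eval()
line_ids = torch.tensor([[0, 5, 6, 7, 2]])

with torch.no_grad():
    embedding_output = model.embeddings(input_ids=line_ids)
    out = model(input_ids=line_ids, output_hidden_states=True, return_dict=True)

# hidden_states[0] is the embedding fed into the encoder.
first_is_embedding = torch.allclose(out.hidden_states[0], embedding_output)
# hidden_states[-1] is the same as last_hidden_state, i.e. out[0].
last_is_output = torch.equal(out.hidden_states[-1], out.last_hidden_state)
index_zero_is_last = out[0] is out.last_hidden_state
# With only output_hidden_states set, index 2 is hidden_states.
index_two_is_hidden_states = out[2] is out.hidden_states

print(first_is_embedding, last_is_output, index_zero_is_last, index_two_is_hidden_states)
```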

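To sum up: the way to get the last layer's hidden state differs depending on which model class we use. That is all for this short summary; I hope it helps anyone with a similar need.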