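This post follows on from the previous one: https://blog.csdn.net/qysh123/article/details/109666416
After pretraining a Transformers model (RoBERTa, for example), how do we use the trained model to produce a representation of a sentence? The process is fairly simple.
Someone asked exactly this question here: https://github.com/huggingface/transformers/issues/2986
The first reply says its author wrote an article that walks through the process. I read it carefully and it is indeed very detailed: https://github.com/BramVanroy/bert-for-inference/blob/master/introduction-to-bert.ipynb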
I will just add a few points. As that author puts it: "Be careful, though, because the differences between model APIs, however small, are incredibly important. For instance, the position of the classification token is not the same for all models." So each pretrained model differs in some details. In the previous post I trained RoBERTa, so the code here has to match that. transformers defines quite a few RoBERTa model classes: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_roberta.py
For training I used RobertaForMaskedLM, which, as the name says, uses MLM as the training task. One more thing worth pointing out, according to this issue: https://github.com/huggingface/transformers/issues/6414
"RobertaModel is not something you can use directly with Trainer as it doesn't have any objective (it's the base model without head). You should pick a model with head relevant to your task." In other words, we cannot train the class RobertaModel directly with Trainer. That raises a question: if I trained with RobertaForMaskedLM, can I load the saved model directly as RobertaModel? I tried it, and a saved model can be loaded as either RobertaModel or RobertaForMaskedLM! However, the two are used differently when producing a sentence representation, which needs special attention:
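I tried this out following the tutorial above. One more thing to note: the sentence embedding we are after is the hidden state produced by the encoder, not the embedding that is fed into the encoder. There is a discussion here: https://github.com/huggingface/transformers/issues/2072
The two methods in the first reply there both return the embedding that is fed into the encoder. They are: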
from transformers import RobertaTokenizer, RobertaModel
import torch
tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
sentence = torch.tensor([tok.encode("Alright, let's do this")])
embedding_output = model.embeddings(sentence)
and:
from transformers import RobertaTokenizer, RobertaModel, RobertaConfig
import torch
config = RobertaConfig.from_pretrained("roberta-base")
config.output_hidden_states = True
tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", config=config)
sentence = torch.tensor([tok.encode("Alright, let's do this")])
output = model(sentence) # returns a tuple(sequence_output, pooled_output, hidden_states)
hidden_states = output[-1]
embedding_output = hidden_states[0]
Both snippets are simple, but why is embedding_output here the embedding fed into the encoder? We only need to look at how the embeddings attribute of RobertaModel is defined (see https://github.com/huggingface/transformers/blob/0c9bae09340dd8c6fdf6aa2ea5637e956efe0f7c/src/transformers/modeling_roberta.py#L587):
self.embeddings = RobertaEmbeddings(config)
The last few lines of the forward method of the RobertaEmbeddings class look like this:
if inputs_embeds is None:
    inputs_embeds = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + position_embeddings + token_type_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
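Clearly this is the embedding fed into the encoder: the sum of the word, position and token-type embeddings, followed by LayerNorm and dropout, before any Transformer layer has been applied.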
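To make that composition concrete, here is a minimal sketch of the same sum-then-normalize pattern in plain PyTorch. The class name TinyEmbeddings and all sizes are made up for illustration, and position ids are simply 0..n-1 (the real RobertaEmbeddings derives them differently), so this is a simplified stand-in rather than the library's implementation.

```python
import torch
from torch import nn


class TinyEmbeddings(nn.Module):
    """Hypothetical, simplified stand-in for RobertaEmbeddings."""

    def __init__(self, vocab_size=100, max_positions=64, type_vocab_size=1,
                 hidden_size=16, dropout_prob=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, input_ids):
        batch_size, seq_len = input_ids.shape
        # Simplification: positions are 0..seq_len-1 for every sequence.
        position_ids = torch.arange(seq_len, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand(batch_size, seq_len)
        # A single segment, so every token type id is 0.
        token_type_ids = torch.zeros_like(input_ids)

        embeddings = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
        )
        embeddings = self.LayerNorm(embeddings)
        return self.dropout(embeddings)


torch.manual_seed(0)
layer = TinyEmbeddings().eval()  # eval() turns dropout off
ids = torch.tensor([[0, 11, 12, 13, 2]])  # made-up token ids
with torch.no_grad():
    encoder_input = layer(ids)
print(encoder_input.shape)
```

No attention or feed-forward layer is involved anywhere in this computation, which is why its output cannot serve as a contextual sentence representation.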
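With that settled, we again follow the approach from the notebook mentioned above (https://github.com/BramVanroy/bert-for-inference/blob/master/introduction-to-bert.ipynb). As I said earlier, the usage differs depending on whether the model is loaded as RobertaModel or as RobertaForMaskedLM. Let's look at RobertaForMaskedLM first. Here is the code: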
import torch
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./Linux")
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM.from_pretrained("./Linux",return_dict=True, output_hidden_states = True)
line_ids = tokenizer.encode('( <str> , addr ) <num>')
print('line_ids', line_ids)
print('line_tokens', tokenizer.convert_ids_to_tokens(line_ids))
line_ids = torch.LongTensor(line_ids)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
line_ids = line_ids.to(device)
model.eval()
print(line_ids.size())
line_ids = line_ids.unsqueeze(0)
print(line_ids.size())
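with torch.no_grad():
    out = model(input_ids=line_ids)
print(out)
# With no labels passed, the output holds (logits, hidden_states); out.hidden_states is equivalent.
hidden_states = out[1]
print(len(hidden_states))
# Output of the last encoder layer, shape (1, sequence_length, hidden_size).
embedding_output = hidden_states[-1]
print(embedding_output)
print(embedding_output.size())
# Averaging over the token dimension gives one vector for the whole sentence.
sentence_embedding = torch.mean(embedding_output, dim=1).squeeze()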
print(sentence_embedding)
print(sentence_embedding.size())
There are a lot of prints because I am using them to learn as well. With this statement:
model = RobertaForMaskedLM.from_pretrained("./Linux",return_dict=True, output_hidden_states = True)
we ask RobertaForMaskedLM to return all the hidden states. return_dict=True is there to make the output of RobertaForMaskedLM's forward function easier to understand: https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaForMaskedLM.forward
From that page (or from the source code) it is clear that the forward function of RobertaForMaskedLM returns:
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
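When we only specify output_hidden_states (and pass no labels), the result is a 2-tuple of logits and hidden_states, which is why the code above uses hidden_states = out[1] to get all the hidden states. print(len(hidden_states)) gives 7. Why 7? It is the input embedding plus one entry for each of the 6 layers set by num_hidden_layers, which I defined when training the model earlier.
Correspondingly, when using RobertaModel the code looks like this: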
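The count can be checked without the trained checkpoint. The sketch below builds a small, randomly initialized RobertaForMaskedLM from a config instead of loading "./Linux"; the sizes and the token ids are made up for illustration, and only num_hidden_layers=6 mirrors the setup described above.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Hypothetical tiny configuration; only num_hidden_layers matters here.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
mlm_model = RobertaForMaskedLM(config)  # random weights, nothing is downloaded
mlm_model.eval()

# Made-up ids: <s>=0 and </s>=2 are the RobertaConfig defaults.
input_ids = torch.tensor([[0, 5, 6, 7, 2]])
with torch.no_grad():
    mlm_out = mlm_model(input_ids=input_ids, output_hidden_states=True, return_dict=True)

mlm_hidden_states = mlm_out.hidden_states
print(len(mlm_hidden_states))       # embedding output + one entry per layer
print(mlm_hidden_states[-1].shape)  # (batch_size, sequence_length, hidden_size)
```

Accessing the field by name (mlm_out.hidden_states) avoids depending on its position in the tuple, which shifts by one as soon as labels are passed and a loss is returned.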
import torch
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./Linux")
from transformers import RobertaModel
model = RobertaModel.from_pretrained("./Linux",return_dict=True, output_hidden_states = True)
line_ids = tokenizer.encode('( <str> , addr ) <num>')
print('line_ids', line_ids)
print('line_tokens', tokenizer.convert_ids_to_tokens(line_ids))
line_ids = torch.LongTensor(line_ids)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
line_ids = line_ids.to(device)
model.eval()
print(line_ids.size())
line_ids = line_ids.unsqueeze(0)
print(line_ids.size())
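with torch.no_grad():
    # Embedding fed into the encoder (word + position + token type).
    embedding_output = model.embeddings(line_ids)
    print(embedding_output)
    out = model(input_ids=line_ids)
print(out)
# The output holds (last_hidden_state, pooler_output, hidden_states); out.hidden_states is equivalent.
hidden_states = out[2]
print(len(hidden_states))
# hidden_states[0] is the same embedding that is fed into the encoder.
embedding_output = hidden_states[0]
print(embedding_output)
# hidden_states[-1] is the output of the last encoder layer.
embedding_output = hidden_states[-1]
sentence_embedding = torch.mean(embedding_output, dim=1).squeeze()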
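The differences are: 1. RobertaModel exposes the embeddings module directly as model.embeddings, and it returns the embedding fed into the encoder (see the discussion above); on RobertaForMaskedLM the same module sits one level down, under model.roberta.embeddings. 2. Here hidden_states = out[2], because the forward function of RobertaModel returns the following (https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaModel.forward):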
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
When we only specify output_hidden_states, the result is a 3-tuple: last_hidden_state, pooler_output and hidden_states. Here hidden_states[0] == model.embeddings(line_ids) (print them out if you are curious), and hidden_states[-1] == last_hidden_state == out[0], so in this case taking out[0] alone is enough to get the hidden state of the last layer.
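Both equalities can be checked numerically. The sketch below uses a small, randomly initialized RobertaModel built from a config rather than the trained "./Linux" checkpoint; the sizes and token ids are made up for illustration. The model has to be in eval mode, otherwise dropout makes the two embedding computations differ.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Hypothetical tiny configuration, chosen only to keep the check fast.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
base_model = RobertaModel(config)  # random weights, nothing is downloaded
base_model.eval()                  # disable dropout so results are deterministic

input_ids = torch.tensor([[0, 5, 6, 7, 2]])  # made-up token ids
with torch.no_grad():
    encoder_input = base_model.embeddings(input_ids)
    base_out = base_model(input_ids=input_ids, output_hidden_states=True, return_dict=True)

# hidden_states[0] is the embedding fed into the encoder.
same_input = torch.allclose(base_out.hidden_states[0], encoder_input, atol=1e-6)
# hidden_states[-1] is the last layer's output, i.e. last_hidden_state (out[0]).
same_last = torch.allclose(base_out.hidden_states[-1], base_out.last_hidden_state, atol=1e-6)
print(same_input, same_last)

# Mean over the token dimension, as in the scripts above.
sentence_embedding = base_out.last_hidden_state.mean(dim=1).squeeze()
print(sentence_embedding.shape)
```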
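To sum up: with different model classes, the way to get the last layer's hidden state is different. That is all for this short summary. I hope it helps anyone with similar needs.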
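One practical note on the mean pooling used in both scripts: they encode a single sentence, so torch.mean over dim=1 averages real tokens only. If several sentences are batched and padded, a plain mean would also average the padding positions. The helper below is a sketch of one common alternative and is not part of the original code; masked_mean_pool is a hypothetical name, and attention_mask is assumed to be the 0/1 mask a tokenizer returns when padding.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: (batch_size, sequence_length, hidden_size)
    attention_mask:    (batch_size, sequence_length), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # clamp avoids division by zero for a row that is entirely padding
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy values: two real tokens followed by one padding position.
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean_pool(toy_hidden, toy_mask)
plain = toy_hidden.mean(dim=1)
print(pooled)  # tensor([[2., 3.]])
print(plain)   # the padding row pulls this average far away from [2., 3.]
```

For a single unpadded sentence the mask is all ones and the result matches torch.mean(embedding_output, dim=1).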