Articles on XLNet worth reading:
A few notes on the mask-related parameters of the forward method in the Transformers PyTorch implementation of XLNet:
forward(input_ids=None, attention_mask=None, mems=None, perm_mask=None, target_mapping=None, token_type_ids=None, input_mask=None, head_mask=None, inputs_embeds=None, use_cache=True)
attention_mask: a torch.FloatTensor of shape (batch_size, sequence_length). Its purpose is to keep attention away from padding tokens: 1 marks a position that is not masked (a real, non-padding token), 0 marks a position that is masked (padding).
input_mask: kept only for compatibility with earlier versions; it serves the same purpose as attention_mask, but with the values inverted (1 marks padding, 0 marks real tokens). Supply only one of attention_mask and input_mask.
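To make the attention_mask semantics concrete, here is a minimal sketch of my own (not from the official docs): two sequences are padded by hand and the padding positions get 0 in the mask. The checkpoint 'xlnet-base-cased' is just a convenient choice for illustration.
import torch
from transformers import XLNetTokenizer, XLNetModel
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')
# Two sequences of different length, padded by hand to the same length.
seqs = [tokenizer.encode("Hello, my dog is cute", add_special_tokens=False),
        tokenizer.encode("Hello", add_special_tokens=False)]
max_len = max(len(s) for s in seqs)
pad_id = tokenizer.pad_token_id
input_ids = torch.tensor([s + [pad_id] * (max_len - len(s)) for s in seqs])
# 1.0 for real tokens, 0.0 for padding, exactly as described above.
attention_mask = torch.tensor([[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in seqs])
outputs = model(input_ids, attention_mask=attention_mask)
last_hidden_state = outputs[0]  # (batch_size, max_len, hidden_size)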
perm_mask: a torch.FloatTensor of shape (batch_size, sequence_length, sequence_length). It specifies the attention pattern of each input token: perm_mask[k, i, j] = 0 means that in batch k, token i attends to token j, and perm_mask[k, i, j] = 1 means it does not. For the details, see the figures in the original paper.
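As an illustration of the rule above, the following toy sketch (my own, with a made-up factorization order) fills a perm_mask so that token i may only attend to tokens that come strictly earlier in that order:
import torch
batch_size, seq_len = 1, 5
order = torch.tensor([2, 0, 4, 1, 3])  # a made-up factorization order over positions 0..4
# rank[j] = where position j appears in the factorization order
rank = torch.empty(seq_len, dtype=torch.long)
rank[order] = torch.arange(seq_len)
# perm_mask[k, i, j] = 0 -> token i may attend to token j (j comes strictly earlier in the order)
# perm_mask[k, i, j] = 1 -> token i must not attend to token j
perm_mask = torch.zeros((batch_size, seq_len, seq_len), dtype=torch.float)
for i in range(seq_len):
    for j in range(seq_len):
        if rank[j] >= rank[i]:
            perm_mask[:, i, j] = 1.0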
target_mapping: a torch.FloatTensor of shape (batch_size, num_predict, sequence_length). It marks the tokens to be predicted: target_mapping[k, i, j] = 1 means that the i-th prediction in batch k is the token at position j of the sequence. This mask is only used for the partial-prediction objective during pretraining, or for sequential decoding (generation).
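The official snippet below only predicts a single token, so as a complementary toy sketch (my own, with made-up sizes), a target_mapping that asks for the tokens at positions 3 and 4 of a length-5 sequence would look like this:
import torch
batch_size, seq_len, num_predict = 1, 5, 2
target_mapping = torch.zeros((batch_size, num_predict, seq_len), dtype=torch.float)
target_mapping[0, 0, 3] = 1.0  # the first prediction is the token at position 3
target_mapping[0, 1, 4] = 1.0  # the second prediction is the token at position 4
# With a target_mapping like this, the logits come back with shape
# (batch_size, num_predict, vocab_size) instead of one row per input position.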
Below is the official example: the first half predicts a masked token from a bidirectional context (generation), and the second half trains the same setup with standard auto-regressive language modeling.
from transformers import XLNetTokenizer, XLNetLMHeadModel
import torch
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
# We show how to setup inputs to predict a next token using a bi-directional context.
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>", add_special_tokens=False)).unsqueeze(0) # We will predict the masked token
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
# In the same way, XLNetLMHeadModel can be trained with standard auto-regressive language modeling.
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>", add_special_tokens=False)).unsqueeze(0) # We will predict the masked token
labels = torch.tensor(tokenizer.encode("cute", add_special_tokens=False)).unsqueeze(0)
assert labels.shape[1] == 1, 'only one token will be predicted'
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token as is done in standard auto-regressive lm training
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)
loss, next_token_logits = outputs[:2] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
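As a small follow-up that is not part of the official snippet, one way to inspect the prediction is to decode the most likely token id:
# Take the most likely token id for the single prediction position and decode it.
# (Real generation would usually sample or apply top-k / top-p filtering rather than argmax.)
predicted_id = next_token_logits[0, 0].argmax().item()
print(tokenizer.decode([predicted_id]))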