pytorch_transformers
pytorch_transformers includes many pretrained models such as BERT, GPT, GPT-2, Transfo-XL, XLNet, and XLM.
Below are some quick notes from my learning process (a bit scattered).
Let's start with an example:
The input: a whole batch of sentences at once; we want to convert the entire batch into numbers and feed it into the model.
sentences=["We are very happy to show you the Transformers library",
"We hope you don't hate it"]
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sentences = ["We are very happy to show you the Transformers library",
             "We hope you don't hate it"]

pt_batch = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
padding=True: enables padding. The tokenizer automatically pads the right side of input_ids and attention_mask so that every sequence in the batch reaches the same length.
truncation=True: truncates each text sequence to the maximum length the model can accept.
return_tensors="pt": return PyTorch tensors rather than Python lists.
The result (note that special tokens such as [CLS] and [SEP] are added automatically in this step):
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
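To complete the picture, here is a minimal sketch of pushing pt_batch through the classification model; the softmax post-processing is my addition and not part of the original notes:

import torch
import torch.nn.functional as F

with torch.no_grad():
    outputs = model(**pt_batch)  # unpack input_ids and attention_mask as keyword arguments

# The first element of the output holds the logits: (batch_size, num_labels)
probs = F.softmax(outputs[0], dim=-1)
print(probs)  # one positive/negative probability pair per sentence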
For the BERT model:
import torch
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
texts = ["[CLS] Who was Jim Henson ? [SEP]",
         "[CLS] Jim Henson was a puppeteer [SEP]"]
tokens, segments, input_masks = [], [], []
for text in texts:
    tokenized_text = tokenizer.tokenize(text)  # split the sentence into tokens
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)  # list of token ids
    tokens.append(indexed_tokens)
    segments.append([0] * len(indexed_tokens))
    input_masks.append([1] * len(indexed_tokens))

max_len = max([len(single) for single in tokens])  # length of the longest sentence
# Pad every sequence with zeros up to max_len
for j in range(len(tokens)):
    padding = [0] * (max_len - len(tokens[j]))
    tokens[j] += padding
    segments[j] += padding
    input_masks[j] += padding
# The segments lists are all zeros because there is only sentence A, no sentence B.
# In input_masks, the 1s mark real tokens and the trailing 0s mark padding;
# the padding only keeps the inputs rectangular and carries no meaning.
# It effectively tells BertModel not to use the padded positions.
# Convert to PyTorch tensors
tokens_tensor = torch.tensor(tokens)
segments_tensors = torch.tensor(segments)
input_masks_tensors = torch.tensor(input_masks)
tokens_tensor, segments_tensors, and input_masks_tensors will serve as the inputs to BertModel.
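Before wrapping BertModel in a custom module, this minimal sketch (my addition, using the pytorch_transformers API) shows what the raw outputs look like; the exact shape in the comment depends on the two example sentences above:

from pytorch_transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased')
bert.eval()
with torch.no_grad():
    output = bert(tokens_tensor, token_type_ids=segments_tensors,
                  attention_mask=input_masks_tensors)

# output[0] holds the per-token hidden states: (batch_size, max_len, hidden_size)
print(output[0].shape)  # e.g. torch.Size([2, 8, 768]) for the sentences above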
Then build the model: attach a fully connected layer after BertModel to map the output features to the desired dimension. Of the output returned by BertModel, we generally use element 0:
import torch
import torch.nn as nn
from pytorch_transformers import BertModel, BertConfig

class TextNet(nn.Module):
    def __init__(self, code_length):  # code_length: dimension the fc layer maps to
        super(TextNet, self).__init__()
        # Config and weights loaded from locally downloaded files
        modelConfig = BertConfig.from_pretrained('bert-base-uncased-config.json')
        self.textExtractor = BertModel.from_pretrained(
            'bert-base-uncased-pytorch_model.bin', config=modelConfig)
        embedding_dim = self.textExtractor.config.hidden_size

        self.fc = nn.Linear(embedding_dim, code_length)
        self.tanh = torch.nn.Tanh()

    def forward(self, tokens, segments, input_masks):
        output = self.textExtractor(tokens, token_type_ids=segments,
                                    attention_mask=input_masks)
        text_embeddings = output[0][:, 0, :]
        # output[0]: (batch_size, sequence_length, hidden_size)

        features = self.fc(text_embeddings)
        features = self.tanh(features)
        return features
Here output[0][:, 0, :] is the output vector at the [CLS] position, which serves as a representation of the whole sentence.
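As a closing usage sketch (my addition; code_length=32 is an arbitrary choice), the tensors built earlier can be pushed through TextNet like this:

textNet = TextNet(code_length=32)
textNet.eval()
with torch.no_grad():
    features = textNet(tokens_tensor, segments_tensors, input_masks_tensors)
print(features.shape)  # torch.Size([2, 32]): one 32-dim feature per sentence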