This article explains how to use the pytorch_transformers package to extract features for a sentence.
pytorch_transformers Quickstart
pytorch_transformers includes BERT, GPT, GPT-2, Transfo-XL, XLNet, XLM and other models, and provides 27 pretrained models.
For each model, the pytorch_transformers library provides three corresponding kinds of classes:
- model classes which are PyTorch models (torch.nn.Module subclasses) for the 6 model architectures currently provided in the library, e.g. BertModel
- configuration classes which store all the parameters required to build a model, e.g. BertConfig. You don't always need to instantiate these yourself; in particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
- tokenizer classes which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token indices to be fed to a model, e.g. BertTokenizer
In short, the model classes define the network architecture, the configuration classes hold the model's hyperparameters, and the tokenizer classes handle tokenization. In general, it is recommended to load pretrained models and parameters directly with the from_pretrained() method.
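As a toy illustration of what the tokenizer classes do, encoding is essentially a vocabulary lookup and decoding is the reverse lookup. The mini vocabulary below is invented for illustration; the real vocabulary is the ~30k-entry bert-base-uncased-vocab.txt file mentioned later:

```python
# Toy sketch: a tokenizer maps tokens to integer ids via its vocabulary.
# toy_vocab is made up for illustration and is NOT the real BERT vocabulary.
toy_vocab = {"[CLS]": 0, "[SEP]": 1, "who": 2, "was": 3, "jim": 4}

tokens = ["[CLS]", "who", "was", "jim", "[SEP]"]
ids = [toy_vocab[t] for t in tokens]            # encode: token -> id
print(ids)                                      # [0, 2, 3, 4, 1]

inv_vocab = {i: t for t, i in toy_vocab.items()}
print([inv_vocab[i] for i in ids])              # decode: id -> token
```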
from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version, either provided by the library itself (currently 27 models) or stored locally (or on a server) by the user.
Bert
Documentation for Bert in pytorch_transformers
Next, let's look at how to use the Bert model in pytorch_transformers.
First, install the pytorch_transformers library:
pip install pytorch_transformers
Then import the three Bert classes described above from the pytorch_transformers library:
from pytorch_transformers import BertModel, BertConfig, BertTokenizer
1. Input processing
First, use BertTokenizer to preprocess the input text, loading the tokenizer from a pretrained model:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
If you cannot access the internet from your machine, download bert-base-uncased-vocab.txt in advance and load it from a local path; a download link is given at the end of this article.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased-vocab.txt')
Input text consisting of two sentences
The two-sentence case is mainly used for question-answering style tasks, where the question and the answer form a sentence pair. A "sentence" here can be an actual sentence or a short paragraph.
You need to add '[CLS]' at the beginning of the text and '[SEP]' after each sentence so that BertModel can parse the input correctly.
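Before calling the real tokenizer, the [CLS]/[SEP] formatting and the matching segment ids can be sketched in pure Python. format_pair below is a hypothetical helper, not part of pytorch_transformers, and it simply splits on whitespace:

```python
def format_pair(sentence_a, sentence_b):
    # Hypothetical helper: wrap two whitespace-tokenized sentences with the
    # special tokens BERT expects and build the matching segment ids
    # (0 up to and including the first [SEP], 1 afterwards).
    tokens_a = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    tokens_b = sentence_b.split() + ["[SEP]"]
    tokens = tokens_a + tokens_b
    segment_ids = [0] * len(tokens_a) + [1] * len(tokens_b)
    return tokens, segment_ids

tokens, segment_ids = format_pair("who was jim henson ?",
                                  "jim henson was a puppeteer")
print(tokens)       # ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
                    #  'jim', 'henson', 'was', 'a', 'puppeteer', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Note that the real WordPiece tokenizer may split a word into several sub-tokens (with bert-base-uncased, "puppeteer" becomes "puppet" and "##eer"), which is why the hand-written segments_ids in the code below has 14 entries rather than 13.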
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)  # split the text into tokens
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)  # map each token to its index in the pretrained vocabulary
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# marks which sentence each token belongs to: 0 for the first sentence, 1 for the second
import torch  # PyTorch is needed from here on

# convert the inputs to PyTorch tensors
tokens_tensor = torch.tensor(