Extracting Sentence Features with BERT (pytorch_transformers)

正则化

Category: Deep Learning Notes

Article link: https://blog.csdn.net/weixin_41519463/article/details/100863313

This article shows how to use the pytorch_transformers package to extract features from a sentence.

pytorch_transformers

pytorch_transformers Quickstart

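pytorch_transformers includes several model architectures, among them BERT, GPT, GPT-2, Transfo-XL, XLNet, and XLM, and ships 27 pretrained models.

For each model, the library provides three corresponding kinds of classes: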

model classes, which are PyTorch models (torch.nn.Modules) of the 6 model architectures currently provided in the library, e.g. BertModel
configuration classes, which store all the parameters required to build a model, e.g. BertConfig. You don't always need to instantiate these yourself; in particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
tokenizer classes, which store the vocabulary for each model and provide methods for encoding/decoding strings into lists of token embedding indices to be fed to a model, e.g. BertTokenizer

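In short, the model classes define the network architecture, the configuration classes hold the model's parameters, and the tokenizer classes handle tokenization. It is generally recommended to load an already pretrained model or its parameters directly with the from_pretrained() method.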

from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided) or stored locally (or on a server) by the user.

Bert

Documentation for Bert in pytorch_transformers

The following explains how to use the Bert model in pytorch_transformers.

First, install the pytorch_transformers library:

pip install pytorch_transformers

Then import the three Bert classes mentioned above from pytorch_transformers:

import torch  # needed later to build the input tensors
from pytorch_transformers import BertModel, BertConfig, BertTokenizer

1. Input processing

The input text is first processed with BertTokenizer, which is loaded from the pretrained model:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
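To give a rough feel for what the loaded tokenizer does when it splits text and maps tokens to indices, here is a minimal pure-Python sketch of greedy longest-match WordPiece splitting. The vocabulary `TOY_VOCAB` and the helpers `toy_tokenize` and `toy_convert_tokens_to_ids` are hypothetical and exist only for illustration; the real BertTokenizer also splits punctuation and uses the full bert-base-uncased vocabulary, so its indices will differ.

```python
# Toy illustration only: a tiny hypothetical vocabulary, NOT the real
# bert-base-uncased vocabulary, so the ids below are not BERT's ids.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "who", "was", "jim",
             "henson", "a", "puppet", "##eer", "?"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text):
    """Whitespace split, lowercase, then WordPiece; special tokens pass through."""
    tokens = []
    for word in text.split():
        if word in ("[CLS]", "[SEP]"):
            tokens.append(word)
        else:
            tokens.extend(wordpiece(word.lower(), TOKEN_TO_ID))
    return tokens


def toy_convert_tokens_to_ids(tokens):
    """Look up each token's index in the toy vocabulary."""
    return [TOKEN_TO_ID.get(tok, TOKEN_TO_ID["[UNK]"]) for tok in tokens]
```

In this sketch a word that is not in the vocabulary as a whole, such as "puppeteer", is split into the pieces "puppet" and "##eer", which is why the token count can exceed the word count.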

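If you cannot reach the external network, you can download bert-base-uncased-vocab.txt first and load it from the local file. A download link is provided at the end of this article.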

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased-vocab.txt')
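The two loading options can be combined into a small fallback. The sketch below is only one possible arrangement: the helper name `resolve_tokenizer_source` is hypothetical, and it simply returns the string that would then be passed to BertTokenizer.from_pretrained().

```python
import os


def resolve_tokenizer_source(local_vocab_path, model_name="bert-base-uncased"):
    """Return the local vocab file path if it exists, else the pretrained model name.

    Hypothetical helper: it only chooses the argument, it does not load anything.
    """
    if os.path.isfile(local_vocab_path):
        return local_vocab_path
    return model_name


# Intended use (assumes pytorch_transformers is installed):
# tokenizer = BertTokenizer.from_pretrained(
#     resolve_tokenizer_source("bert-base-uncased-vocab.txt"))
```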

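Input text consisting of two sentences

Two-sentence input is mainly used for tasks such as question answering, where one sentence is the question and the other is the answer. A "sentence" here can be a single sentence or a short paragraph.

'[CLS]' must be added at the start of the text and '[SEP]' after each sentence so that BertModel can interpret the input correctly.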
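The marker placement described above can be sketched as a small helper. The function name `build_pair_text` is hypothetical and is not part of pytorch_transformers; it only assembles the string by hand in the same way.

```python
def build_pair_text(sentence_a, sentence_b):
    """Wrap two sentences with the special markers BERT expects.

    Layout: [CLS] <sentence A> [SEP] <sentence B> [SEP]
    """
    return "[CLS] " + sentence_a.strip() + " [SEP] " + sentence_b.strip() + " [SEP]"


text = build_pair_text("Who was Jim Henson ?", "Jim Henson was a puppeteer")
```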

text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)  # split the text into tokens with the tokenizer
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)  # indices of the tokens in the pretrained vocabulary
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Marks which sentence each token belongs to: 0 for the first sentence, 1 for the second

# Convert to PyTorch tensors
tokens_tensor = torch.tensor(
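The segment ids in the snippet above are written out by hand. As a minimal sketch, they could instead be derived from the token list by switching from 0 to 1 after the first '[SEP]'. The helper `build_segment_ids` is hypothetical, and the token list is what bert-base-uncased is expected to produce for the example text, with "puppeteer" split into two pieces.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 to tokens up to and including the first [SEP], 1 to the rest."""
    segment_ids, current = [], 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == sep_token:
            current = 1  # everything after the first separator is sentence two
    return segment_ids


# Assumed tokenization of the example text (14 tokens)
example_tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]",
                  "jim", "henson", "was", "a", "puppet", "##eer", "[SEP]"]
example_segment_ids = build_segment_ids(example_tokens)
```

With PyTorch available, lists like these are commonly wrapped with an extra batch dimension, for example torch.tensor([example_segment_ids]), before being passed to the model; treat that step as an assumption rather than the article's own code.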
