Python/PyTorch basics: loading a BERT model to get token vectors

Install transformers

# !pip install transformers
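After installing, a quick sanity check (added here as an illustration, not part of the original workflow) is to import the package and print its version:

# sanity check: confirm the package imports and show its version
import transformers
print(transformers.__version__)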

Instantiate the tokenizer and model

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModel.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
Some weights of the model checkpoint at ./distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
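The warning is expected here: the checkpoint was fine-tuned for SST-2 sentiment classification, so its classifier head has no place in a plain DistilBertModel and is dropped. As a minimal sketch (not part of the original article), if you actually wanted to keep that head you would load the same checkpoint with the task-specific class instead:

# sketch: keep the classifier head by loading the task-specific class
from transformers import AutoModelForSequenceClassification
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "./distilbert-base-uncased-finetuned-sst-2-english"
)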

Convert text to ids (string -> ids)

en = tokenizer.encode("how are you and you")
en, type(en)
([101, 2129, 2024, 2017, 1998, 2017, 102], list)
import torch

torch.tensor(en)
tensor([ 101, 2129, 2024, 2017, 1998, 2017,  102])
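As a side note (a small sketch, not from the original post): calling the tokenizer directly instead of encode returns a dictionary with input_ids plus attention_mask, and return_tensors="pt" builds the batched PyTorch tensor in one step.

# sketch: tokenizer(...) returns a dict and can build PyTorch tensors directly
enc = tokenizer("how are you and you", return_tensors="pt")
print(enc["input_ids"])        # tensor([[ 101, 2129, 2024, 2017, 1998, 2017,  102]])
print(enc["attention_mask"])   # tensor([[1, 1, 1, 1, 1, 1, 1]])
print(tokenizer.decode(enc["input_ids"][0]))   # [CLS] how are you and you [SEP]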

Feed the ids into the model

# Method 1
# out = model(torch.tensor(en).unsqueeze(0))

# Method 2
out = model(torch.tensor([en]))
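For pure feature extraction the forward pass does not need gradients. A short sketch (using a separate variable so the output printed below stays as in the original run) wraps the call in torch.no_grad():

# sketch: disable autograd when only extracting features
with torch.no_grad():
    features = model(torch.tensor([en]))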

Inspect the output

print(out)
BaseModelOutput(last_hidden_state=tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)
type(out)
transformers.modeling_outputs.BaseModelOutput
out[0]
tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>)
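out[0] is the same tensor as out.last_hidden_state, with shape (batch_size, sequence_length, hidden_size); for this sentence that is (1, 7, 768), one 768-dimensional vector per token. A short sketch of pulling out individual token vectors:

# sketch: last_hidden_state holds one vector per input token
print(out.last_hidden_state.shape)        # torch.Size([1, 7, 768])
token_vectors = out.last_hidden_state[0]  # shape (7, 768)
cls_vector = token_vectors[0]             # vector for the [CLS] token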

Summary

  1. AutoTokenizer automatically picks the concrete tokenizer class for the checkpoint (here a DistilBERT tokenizer). Calling the tokenizer directly (rather than encode) also generates attention_mask and, for models that use them, token_type_ids; see https://blog.csdn.net/m0_45478865/article/details/118219919
  2. The input passed to the model must be a tensor; when converting from a list, the list needs to be two-dimensional (a batch dimension plus the sequence), which is why torch.tensor([en]) or unsqueeze(0) is used above.
  3. Different pretrained models return different output fields: with distilbert-base-uncased-finetuned-sst-2-english, out has no pooler_output, but with a model such as chinese-roberta-wwm-ext-large it does.
  4. pooler_output is the sentence-level vector, while out[0] holds the per-token (character) vectors; see the sketch after this list.
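The sketch below illustrates points 3 and 4. It assumes the hfl/chinese-roberta-wwm-ext-large checkpoint is available (the model name here is an example, not a path from the original post): BERT-style models expose pooler_output as a sentence vector alongside the per-token last_hidden_state.

# sketch, assuming the hfl/chinese-roberta-wwm-ext-large checkpoint is available
from transformers import AutoModel, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
mdl = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
inputs = tok("你好世界", return_tensors="pt")
with torch.no_grad():
    output = mdl(**inputs)
print(output.last_hidden_state.shape)  # (1, seq_len, 1024): token (character) vectors
print(output.pooler_output.shape)      # (1, 1024): sentence vector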