Python/PyTorch basics: loading a BERT model to get token vectors

Install transformers

# !pip install transformers
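After installing, a quick sanity check (added here as an illustration, not part of the original workflow) is to import the package and print its version:

# sanity check: confirm the package imports and show its version
import transformers
print(transformers.__version__)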

Instantiate the tokenizer and model

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModel.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english")
Some weights of the model checkpoint at ./distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
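The warning is expected here: the checkpoint was fine-tuned for SST-2 sentiment classification, so its classifier head has no place in a plain DistilBertModel and is dropped. As a minimal sketch (not part of the original article), if you actually wanted to keep that head you would load the same checkpoint with the task-specific class instead:

# sketch: keep the classifier head by loading the task-specific class
from transformers import AutoModelForSequenceClassification
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "./distilbert-base-uncased-finetuned-sst-2-english"
)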

Convert text to ids (string -> ids)

en = tokenizer.encode("how are you and you")
en, type(en)
([101, 2129, 2024, 2017, 1998, 2017, 102], list)
import torch

torch.tensor(en)
tensor([ 101, 2129, 2024, 2017, 1998, 2017,  102])
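As a side note (a small sketch, not from the original post): calling the tokenizer directly instead of encode returns a dictionary with input_ids plus attention_mask, and return_tensors="pt" builds the batched PyTorch tensor in one step.

# sketch: tokenizer(...) returns a dict and can build PyTorch tensors directly
enc = tokenizer("how are you and you", return_tensors="pt")
print(enc["input_ids"])        # tensor([[ 101, 2129, 2024, 2017, 1998, 2017,  102]])
print(enc["attention_mask"])   # tensor([[1, 1, 1, 1, 1, 1, 1]])
print(tokenizer.decode(enc["input_ids"][0]))   # [CLS] how are you and you [SEP]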

Feed the ids into the model

# Method 1
# out = model(torch.tensor(en).unsqueeze(0))

# Method 2
out = model(torch.tensor([en]))
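For pure feature extraction the forward pass does not need gradients. A short sketch (using a separate variable so the output printed below stays as in the original run) wraps the call in torch.no_grad():

# sketch: disable autograd when only extracting features
with torch.no_grad():
    features = model(torch.tensor([en]))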

Inspect the output

print(out)
BaseModelOutput(last_hidden_state=tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>), hidden_states=None, attentions=None)
type(out)
transformers.modeling_outputs.BaseModelOutput
out[0]
tensor([[[ 0.4692,  0.5402,  0.2137,  ..., -0.1891,  1.0371, -0.8645],
         [ 0.9280,  0.8054, -0.0353,  ..., -0.0706,  1.0147, -0.9412],
         [ 1.1769,  0.4334, -0.4291,  ..., -0.3780,  0.6734, -0.5759],
         ...,
         [ 1.0213,  0.6273,  0.5482,  ..., -0.2374,  1.0714, -0.5215],
         [ 0.4576,  0.2577,  0.3044,  ..., -0.1127,  1.1128, -0.9350],
         [ 1.2613,  0.2868,  0.2176,  ...,  0.7057,  0.1919, -0.7504]]],
       grad_fn=<NativeLayerNormBackward>)
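out[0] is the same tensor as out.last_hidden_state, with shape (batch_size, sequence_length, hidden_size); for this sentence that is (1, 7, 768), one 768-dimensional vector per token. A short sketch of pulling out individual token vectors:

# sketch: last_hidden_state holds one vector per input token
print(out.last_hidden_state.shape)        # torch.Size([1, 7, 768])
token_vectors = out.last_hidden_state[0]  # shape (7, 768)
cls_vector = token_vectors[0]             # vector for the [CLS] token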

Summary

  1. AutoTokenizer automatically picks the concrete tokenizer class for the checkpoint (here a DistilBERT tokenizer). Calling the tokenizer directly (rather than encode) also generates attention_mask and, for models that use them, token_type_ids; see https://blog.csdn.net/m0_45478865/article/details/118219919
  2. The input passed to the model must be a tensor; when converting from a list, the list needs to be two-dimensional (a batch dimension plus the sequence), which is why torch.tensor([en]) or unsqueeze(0) is used above.
  3. Different pretrained models return different output fields: with distilbert-base-uncased-finetuned-sst-2-english, out has no pooler_output, but with a model such as chinese-roberta-wwm-ext-large it does.
  4. pooler_output is the sentence-level vector, while out[0] holds the per-token (character) vectors; see the sketch after this list.
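The sketch below illustrates points 3 and 4. It assumes the hfl/chinese-roberta-wwm-ext-large checkpoint is available (the model name here is an example, not a path from the original post): BERT-style models expose pooler_output as a sentence vector alongside the per-token last_hidden_state.

# sketch, assuming the hfl/chinese-roberta-wwm-ext-large checkpoint is available
from transformers import AutoModel, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
mdl = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
inputs = tok("你好世界", return_tensors="pt")
with torch.no_grad():
    output = mdl(**inputs)
print(output.last_hidden_state.shape)  # (1, seq_len, 1024): token (character) vectors
print(output.pooler_output.shape)      # (1, 1024): sentence vector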