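Notes on some details of calling Hugging Face
A few small details written down for my own reference: I don't write code every day, so I keep forgetting them.
Read the docs more!!!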
Model Input
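Use the tokenizer to split the text into tokens, add special tokens such as [CLS] and [SEP], build the attention mask, and so on.
Mind the difference between calling the class instance directly and calling a method on it, okay?? I keep bouncing back and forth between tokenizer() and tokenizer.encode() and never learn.
(Every example in this post uses BERT; other pretrained models work the same way.)
Calling the instance directly, tokenizer():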
encoded_dict = tokenizer("你是谁?", "我是你妈。")
print(encoded_dict)
Output:
{'input_ids': [101, 872, 3221, 6443, 8043, 102, 2769, 3221, 872, 1968, 511, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Note: when the result feeds a forward pass, just always pass return_tensors="pt" so you don't have to convert to tensors yourself.
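To make the effect of return_tensors="pt" concrete, here is a minimal sketch. It assumes transformers 4.x (as in the linked docs) and PyTorch are installed, and it builds a tokenizer from a made-up toy vocabulary so that nothing needs to be downloaded; the vocabulary and sentences are hypothetical, not from a real checkpoint.

```python
import os
import tempfile

import torch
from transformers import BertTokenizer

# Hypothetical toy vocabulary (assumption: only for illustration, not a real checkpoint).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "who", "are", "you", "?", "i", "am", "fine", "."]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    # The vocabulary is read into memory here, so the file can be removed afterwards.
    tokenizer = BertTokenizer(vocab_file=vocab_path)

# Without return_tensors the values are plain Python lists.
as_lists = tokenizer("who are you?", "i am fine.")
# With return_tensors="pt" the values are torch tensors with a leading batch dimension.
as_tensors = tokenizer("who are you?", "i am fine.", return_tensors="pt")

print(type(as_lists["input_ids"]))
print(type(as_tensors["input_ids"]), as_tensors["input_ids"].shape)
```

The tensor version can be passed straight to a model with model(**as_tensors).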
Calling a method on the instance, tokenizer.encode(), returns input_ids directly:
input_ids = tokenizer.encode("你是谁?", "我是你妈。")
print(input_ids)
Output:
[101, 872, 3221, 6443, 8043, 102, 2769, 3221, 872, 1968, 511, 102]
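The two call styles can be compared side by side. The sketch below is an illustration under assumptions, not the official example: it assumes transformers 4.x and uses a hypothetical toy vocabulary written to a temporary file so it runs offline.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary (assumption: for illustration only).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "who", "are", "you", "?", "i", "am", "fine", "."]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    tokenizer = BertTokenizer(vocab_file=vocab_path)

# tokenizer(...) returns a dict-like object with three fields.
encoded_dict = tokenizer("who are you?", "i am fine.")
# tokenizer.encode(...) returns only the list of input ids.
input_ids = tokenizer.encode("who are you?", "i am fine.")

print(sorted(encoded_dict.keys()))
print(encoded_dict["input_ids"] == input_ids)
# token_type_ids mark which tokens belong to the first and second sentence.
print(encoded_dict["token_type_ids"])
```

So tokenizer.encode() is the right choice only when the ids alone are needed; the model call usually wants the full dict from tokenizer().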
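For details on model input (single vs. paired inputs, padding/alignment, and so on), see the official docs. They are exhaustive, as if afraid I wouldn't know how to use the API. Thanks a lot.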
https://huggingface.co/docs/transformers/v4.23.1/en/glossary#feed-forward-chunking
The documentation for the tokenizer base class is just as exhaustive.
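As a quick reminder of what padding and truncation do to a batch, here is a small sketch. It assumes transformers 4.x and again uses a hypothetical toy vocabulary so that it runs without downloading a checkpoint; the padding and truncation arguments are the ones described in the official docs.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary (assumption: for illustration only).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "who", "are", "you", "?", "i", "am", "fine", "."]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    tokenizer = BertTokenizer(vocab_file=vocab_path)

# A batch of two single sentences with different lengths.
# padding=True pads every sequence to the longest one in the batch.
batch = tokenizer(["who are you?", "fine."], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])  # 0 marks padded positions

# truncation=True together with max_length cuts sequences that are too long.
short = tokenizer("who are you?", truncation=True, max_length=4)
print(len(short["input_ids"]))
```

The attention mask is what tells the model to ignore the padded positions, which is why it should be passed along with input_ids.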
Model Forward
BertModel
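Function parameters:
Return values:
Usually only last_hidden_state is needed, with shape (batch_size, sequence_length, hidden_size); see the figure above.
The official example is clear enough, so I'll just copy it: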
https://huggingface.co/docs/transformers/v4.23.1/en/model_doc/bert#transformers.BertModel
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") # converted to tensors automatically
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state # take last_hidden_state from the returned object
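Pay attention to pooler_output: it is the [CLS] representation after an extra linear layer and tanh, not the raw last_hidden_state of the first token.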
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
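The relationship between last_hidden_state and pooler_output can be checked directly. The sketch below is only a shape check under assumptions: it assumes transformers 4.x and PyTorch, and it uses a hypothetical tiny, randomly initialised BertModel (made-up config and input ids) so that no checkpoint is downloaded; the numeric values are meaningless.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny config (assumption: chosen only to keep the example fast).
config = BertConfig(vocab_size=30, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)  # random weights, not a pretrained model
model.eval()  # disable dropout so the two computations below agree

# Made-up token ids: batch_size=1, sequence_length=5.
input_ids = torch.tensor([[2, 5, 6, 7, 3]])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Hidden state of the first token ([CLS] position) from the last layer.
    cls_hidden = outputs.last_hidden_state[:, 0]
    # Reproduce the pooler by hand: linear layer followed by tanh.
    manual_pooled = torch.tanh(model.pooler.dense(cls_hidden))

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
print(outputs.pooler_output.shape)      # (batch_size, hidden_size)
print(torch.allclose(outputs.pooler_output, manual_pooled, atol=1e-6))
```

If the untransformed [CLS] vector is what you want, take outputs.last_hidden_state[:, 0] instead of pooler_output.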
Once more: read the docs carefully!