A Detailed Look at BERT's Output Format

uan_cs

Last modified on 2022-07-04 21:28:35


First published on 2021-11-03 16:53:29

Article link: https://blog.csdn.net/qq_41971355/article/details/121124868


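The model output can be indexed like a tuple (in recent versions of transformers it is a ModelOutput object, which still supports integer indexing). With both output_hidden_states and output_attentions enabled, it has four parts: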

last_hidden_state: shape (batch_size, sequence_length, hidden_size), with hidden_size=768 for the base model. This is the hidden state output by the last layer of the model.

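pooler_output: shape (batch_size, hidden_size). This is the last-layer hidden state of the first token of the sequence ([CLS]), further processed by a linear layer and a Tanh activation. This output is usually not a good summary of the semantic content of the input; averaging or pooling the hidden states over the whole input sequence often represents a sentence better.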

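hidden_states: optional; returned only when config.output_hidden_states=True is set. It is a tuple of 13 elements for a 12-layer model: the first element is the output of the embedding layer, and the other 12 are the hidden states output by each layer. Every element has shape (batch_size, sequence_length, hidden_size).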

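attentions: also optional; returned only when config.output_attentions=True is set. It is a tuple of 12 elements, one per layer, holding that layer's attention weights after the softmax. These are the weights used to compute the weighted average in the self-attention heads. Every element has shape (batch_size, num_heads, sequence_length, sequence_length).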
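The remark above about averaging hidden states instead of relying on pooler_output can be sketched as follows. This is a minimal illustration, not the author's code: the helper name `masked_mean_pool` is hypothetical, and a random tensor stands in for the real last_hidden_state so that no model download is needed.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, ignoring padded positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over hidden_size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # (batch, 1), avoids division by zero
    return summed / counts

torch.manual_seed(0)
# Random stand-in for output[0]: batch of 2, 9 tokens, hidden size 768
last_hidden_state = torch.randn(2, 9, 768)
# The second sequence has 3 padded positions at the end
attention_mask = torch.tensor([[1] * 9, [1] * 6 + [0] * 3])

sentence_vectors = masked_mean_pool(last_hidden_state, attention_mask)
print(sentence_vectors.shape)  # torch.Size([2, 768])
```

Weighting by attention_mask matters once sentences of different lengths are batched together; a plain `.mean(dim=1)` would mix padding vectors into the sentence representation.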

import torch
from torch import tensor
from transformers import BertConfig, BertTokenizer, BertModel

model_path = 'model/chinese-roberta-wwm-ext/'  # path to the downloaded pretrained model files
config = BertConfig.from_pretrained(model_path, output_hidden_states = True, output_attentions=True)
assert config.output_hidden_states == True
assert config.output_attentions == True
model = BertModel.from_pretrained(model_path, config = config)
tokenizer = BertTokenizer.from_pretrained(model_path)

text = '我热爱这个世界'

# input = tokenizer(text)
# {'input_ids': [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

# input = tokenizer.encode(text)
# [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102]

# input = tokenizer.encode_plus(text)
# {'input_ids': [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

input_ids = torch.tensor([tokenizer.encode(text)], dtype=torch.long)  # even a single input must be wrapped into a batch
print(input_ids.shape)
#torch.Size([1, 9])

model.eval()
output = model(input_ids)
print(len(output))
print(output[0].shape)  # hidden state of the last layer: (batch_size, sequence_length, hidden_size)
print(output[1].shape)  # pooled last-layer hidden state of the first token ([CLS]): (batch_size, hidden_size)
print(len(output[2]))  # requires output_hidden_states=True; all hidden states: the first element is the embedding output, the rest are the per-layer outputs, each (batch_size, sequence_length, hidden_size)
print(len(output[3]))  # requires output_attentions=True; attention weights of every layer, used to compute the weighted average in the self-attention heads, each (batch_size, num_heads, sequence_length, sequence_length)
# 4
# torch.Size([1, 9, 768])
# torch.Size([1, 768])
# 13
# 12

all_hidden_state = output[2]
print(all_hidden_state[0].shape)
print(all_hidden_state[1].shape)
print(all_hidden_state[2].shape)
# torch.Size([1, 9, 768])
# torch.Size([1, 9, 768])
# torch.Size([1, 9, 768])

attentions = output[3]
print(attentions[0].shape)
print(attentions[1].shape)
print(attentions[2].shape)
# torch.Size([1, 12, 9, 9])
# torch.Size([1, 12, 9, 9])
# torch.Size([1, 12, 9, 9])
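To make the [1, 12, 9, 9] shape concrete, here is a rough sketch of how one layer's attention weights arise and how they act as a weighted average. It assumes standard scaled dot-product attention with 12 heads of size 64 (12 * 64 = 768), and uses random tensors in place of the real query, key and value projections, so the numbers carry no meaning.

```python
import math
import torch

torch.manual_seed(0)
batch_size, num_heads, seq_len, head_dim = 1, 12, 9, 64

# Random stand-ins for the per-head query, key and value projections
q = torch.randn(batch_size, num_heads, seq_len, head_dim)
k = torch.randn(batch_size, num_heads, seq_len, head_dim)
v = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Scaled dot-product scores between every pair of tokens, per head
scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim)
# Softmax over the key dimension: each row becomes a set of weights summing to 1
weights = torch.softmax(scores, dim=-1)
# Each token's new vector is the weighted average of all value vectors
context = weights @ v

print(weights.shape)  # torch.Size([1, 12, 9, 9])
print(context.shape)  # torch.Size([1, 12, 9, 64])
```

Here `weights` has the same layout as one element of the attentions tuple: entry [0, h, i, j] is how much token i attends to token j in head h.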

Follow-up addition:

text = '我热爱这个世界'

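encoded = tokenizer(text)
# The tokenizer returns a dict-like object:
# {'input_ids': [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

# input_ids = torch.tensor([tokenizer.encode(text)], dtype=torch.long)
# Even a single input must be wrapped into a batch

input_ids = torch.tensor([encoded["input_ids"]])
token_type_ids = torch.tensor([encoded["token_type_ids"]])
attention_mask = torch.tensor([encoded["attention_mask"]])

# input_ids, token_type_ids and attention_mask can be passed together.
# Use keyword arguments: positionally, the second parameter of
# BertModel.forward is attention_mask, not token_type_ids.
output = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

# Alternative: have the tokenizer return tensors directly
input_tensor = tokenizer(text, return_tensors="pt")
output = model(**input_tensor)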
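The output parts can also be read by name instead of by index. The sketch below checks two of the statements made earlier: that hidden_states has one more element than the number of layers, and that pooler_output is a linear layer plus Tanh applied to the first token. It assumes a transformers version whose models return ModelOutput objects by default, and it uses a hypothetical tiny, randomly initialised configuration (vocab_size=8000, hidden_size=32, 2 layers) purely so that nothing has to be downloaded; the values are meaningless, only the structure matters.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)
# Hypothetical tiny configuration, chosen only to keep the example light
config = BertConfig(vocab_size=8000, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64,
                    output_hidden_states=True)
model = BertModel(config)  # random weights, not a pretrained checkpoint
model.eval()

input_ids = torch.tensor([[101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102]])
token_type_ids = torch.zeros_like(input_ids)
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                token_type_ids=token_type_ids)
    # Recompute the pooler by hand: dense layer + tanh on the first token
    manual_pooled = torch.tanh(model.pooler.dense(out.last_hidden_state[:, 0]))

print(out.last_hidden_state.shape)  # torch.Size([1, 9, 32])
print(out.pooler_output.shape)      # torch.Size([1, 32])
print(len(out.hidden_states))       # 3: embedding output + 2 layers
print(torch.allclose(out.hidden_states[-1], out.last_hidden_state))
print(torch.allclose(manual_pooled, out.pooler_output, atol=1e-6))
```

Named access (`out.pooler_output`) is less fragile than `output[1]`, because the integer positions shift when an optional part such as hidden_states is not requested.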
