A Simple Guide to Using Hugging Face Models

花花少年

Last modified: 2024-08-08 10:30:06

Category: Deep Learning. Tags: Hugging Face

First published: 2024-07-30 13:20:25

Article link: https://blog.csdn.net/m0_37605642/article/details/140794119

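I. References

Hugging Face Quick Start (focusing on the models (Transformers) and datasets (Datasets) parts)

[Computer Vision | Natural Language Processing] A Very Detailed Introduction and Tutorial for Hugging Face

An Introduction to the HuggingFace transformers Series and Its Use in Downstream Tasks

Two Steps to Fix Slow Downloads, Connection Timeouts, and Failed Downloads of Hugging Face Models

The Complete Hugging Face Guide: Easily Download the Llama 3 Model and Explore the Possibilities of NLP [Hands-on]

How to Download Hugging Face Models Quickly: A Summary of All Methods

[Practical Tutorial] Downloading Large Hugging Face Models on Linux

How to Request Access to and Download the Llama 2/3 Models on Hugging Face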

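II. Introduction to Hugging Face Models

1. Finding Hugging Face models

Official model hub

Models documentation

2. Using Hugging Face models

The Transformers project provides a few simple APIs for working with Hugging Face models. Collectively they are known as AutoClass (see the official documentation) and include:

AutoTokenizer: tokenizes text.
AutoFeatureExtractor: performs feature extraction (for example, on audio or image inputs).
AutoProcessor: preprocesses data.
AutoModel: loads the model.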

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer("I'm learning deep learning.")

Output

{'input_ids': [101, 1045, 1005, 1049, 4083, 2784, 4083, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

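3. Text classification tasks

text-classification

Common text classification tasks:

Sentiment analysis: typically framed as a binary classification problem; given a text, decide whether it is positive (POS) or negative (NEG).
Quora Question Pairs: given two questions, decide whether they have the same meaning. This is a binary classification problem; its dataset consists of Quora question pairs and is also included in GLUE.
Grammatical correctness: assess the grammatical acceptability of a sentence. This is a binary classification task whose result is either acceptable or unacceptable; a commonly used dataset is nyu-mll/glue (the CoLA task).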

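III. Transformers Models

Hugging Face Transformers is Hugging Face's core project. You can use it to:

Run inference directly with pretrained models.
Choose from a large collection of pretrained models.
Use pretrained models for transfer learning.

1. Model components (4 parts)

A complete transformer model consists of four main parts: Config, Tokenizer, Model, and Post processing. Config holds the model's configuration. Tokenizer splits the input text into tokens and maps them to integer IDs. Model extracts semantic information from those inputs and outputs logits. Finally, Post processing uses the model output to carry out a specific NLP task, such as sentiment analysis or automatic text tagging.

The components are described below using the google-bert/bert-base-chinese model as an example.

1.1 Config

Config controls settings such as the model name, the form of the final output, the width and depth of the hidden layers, and the type of activation function. A Config object is exported as a JSON file, and a Config object can also be instantiated from config.json; the two operations are inverses of each other.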

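{
  "architectures": [
    "BertForMaskedLM"                      # model class name
  ],
  "attention_probs_dropout_prob": 0.1,     # dropout on the attention probabilities, default 0.1
  "directionality": "bidi",                # likely denotes bidirectional; a field carried over from the original checkpoint config
  "hidden_act": "gelu",                    # activation function in the encoder, default "gelu"; "relu", "swish", or "gelu_new" are also possible
  "hidden_dropout_prob": 0.1,              # dropout for the embedding and encoder layers, default 0.1
  "hidden_size": 768,                      # hidden size of the encoder layers, default 768
  "initializer_range": 0.02,               # standard deviation used to initialize the weights, default 0.02
  "intermediate_size": 3072,               # size of the feed-forward (intermediate) layer in the encoder, default 3072
  "layer_norm_eps": 1e-12,                 # epsilon used by layer normalization, default 1e-12
  "max_position_embeddings": 512,          # maximum sequence length the model supports, default 512
  "model_type": "bert",                    # the model type is bert
  "num_attention_heads": 12,               # number of attention heads in the encoder, default 12
  "num_hidden_layers": 12,                 # number of hidden layers in the encoder, default 12
  "pad_token_id": 0,
  "pooler_fc_size": 768,                   # the pooler_* fields below appear to be pooler-layer parameters; the pooler is essentially a fully connected layer used as a classifier for sequence-level NLP tasks
  "pooler_num_attention_heads": 12,        # number of pooler attention heads, default 12
  "pooler_num_fc_layers": 3,               # number of pooler fully connected layers, default 3
  "pooler_size_per_head": 128,             # size of each attention head
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,                    # number of token types (segment IDs), default 2
  "vocab_size": 21128                      # vocabulary size; the BERT default is 30522, but this model takes Chinese input one character at a time
}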
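The round trip between a Config object and its JSON file can be sketched with the standard library alone. `ToyConfig` below is a hypothetical stand-in, not the real `BertConfig`; it only illustrates that exporting and re-loading are inverse operations.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config; not the real BertConfig."""
    model_type: str = "bert"
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    vocab_size: int = 21128

    def to_json_string(self) -> str:
        # Export: object -> JSON text (what would be written to config.json).
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json_string(cls, text: str) -> "ToyConfig":
        # Import: JSON text -> object.
        return cls(**json.loads(text))


config = ToyConfig(hidden_size=768, vocab_size=21128)
restored = ToyConfig.from_json_string(config.to_json_string())
print(restored == config)
```

With the real library, `save_pretrained` and `from_pretrained` on the config class play the same two roles.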

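1.2 Tokenizer

The Tokenizer converts plain text into encoded IDs. Note that it does not convert tokens into word vectors: it only splits the plain text into tokens, adds special tokens such as [CLS] and [SEP] (and recognizes [MASK]), and maps each token to its index in the vocabulary.

When exported, a Tokenizer is saved as three files:

vocab.txt is the vocabulary file. Each line holds a word or word piece, and the zero-based line number is its index. Because the vocabulary is indexed in order, an ID can later be turned into a one-hot vector and multiplied by the trained embedding matrix to obtain the vector for that token.
tokenizer.json and tokenizer_config.json are the tokenizer configuration files.

(Optional) special_tokens_map.json defines the special tokens.

{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

These files are generated or consumed by the Tokenizer class. They only deal with text and involve no vector operations.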
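To make the vocab.txt indexing concrete, here is a minimal sketch showing that multiplying a one-hot vector by an embedding matrix gives the same result as reading the matrix row at that index. The six-entry vocabulary and the 3-dimensional embedding values are made up for illustration.

```python
# Toy vocabulary in vocab.txt order (hypothetical; the real file has 21128 lines).
vocab_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "美", "爱"]
token_to_id = {token: index for index, token in enumerate(vocab_lines)}

# Toy embedding matrix: one row of size 3 per vocabulary entry.
embedding = [[float(10 * row + col) for col in range(3)] for row in range(len(vocab_lines))]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    # Row vector (length = number of rows) times matrix -> one value per column.
    columns = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix))) for c in range(columns)]


token_id = token_to_id["爱"]
via_one_hot = vector_times_matrix(one_hot(token_id, len(vocab_lines)), embedding)
via_lookup = embedding[token_id]
print(token_id, via_one_hot == via_lookup)
```

In practice frameworks skip the multiplication and index the embedding table directly, which is why the tokenizer only needs to produce integer IDs.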

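1.3 Model

Model covers the various model classes. Besides base models such as BERT and GPT, the library also defines task-specific models for downstream tasks, such as BertForQuestionAnswering. Exporting a model produces config.json and a weights file, pytorch_model.bin (recent versions of the library write model.safetensors by default). The former is the configuration file, which matches the intuition that the config and model classes are closely linked. The pytorch_model.bin file is the same kind of file that torch.save() produces, because every Model inherits, directly or indirectly, from PyTorch's Module class. This shows that Hugging Face's implementation stays close to PyTorch's native API.

Model is the core component. Models fall into three types, and different NLP tasks call for different types:

Encoder models, such as BERT, are commonly used for sentence classification, named entity recognition, word-level classification, and extractive question answering.
Decoder models, such as GPT and GPT-2, are commonly used for text generation.
Sequence-to-sequence models, such as BART, are commonly used for summarization, translation, and generative question answering.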
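As a rough lookup aid for the three families above, the sketch below maps a few tasks to the name of the AutoModel class that is usually used for them. The class names are kept as plain strings so that nothing is loaded; the table and the helper `suggest_auto_class` are illustrative and far from exhaustive.

```python
# Architecture family -> example model and tasks -> AutoModel class name (strings only).
MODEL_FAMILIES = {
    "encoder": {
        "example": "BERT",
        "tasks": {
            "sentence classification": "AutoModelForSequenceClassification",
            "named entity recognition": "AutoModelForTokenClassification",
            "extractive question answering": "AutoModelForQuestionAnswering",
        },
    },
    "decoder": {
        "example": "GPT-2",
        "tasks": {"text generation": "AutoModelForCausalLM"},
    },
    "sequence-to-sequence": {
        "example": "BART",
        "tasks": {
            "summarization": "AutoModelForSeq2SeqLM",
            "translation": "AutoModelForSeq2SeqLM",
        },
    },
}


def suggest_auto_class(task):
    """Return (family, class name) for a task, or None if the task is not in the table."""
    for family, info in MODEL_FAMILIES.items():
        if task in info["tasks"]:
            return family, info["tasks"][task]
    return None


print(suggest_auto_class("text generation"))
```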

1.4 Post processing

//TODO
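As a minimal sketch of what post-processing usually involves for a classification task, assume the model has already produced one row of logits. The logits and the label names below are hypothetical; the steps are a softmax followed by picking the highest-probability label.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Turn raw logits into a {label, score} result like the ones a pipeline prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits and label names for a two-class sentiment model.
result = postprocess([-1.2, 2.3], {0: "NEG", 1: "POS"})
print(result["label"], round(result["score"], 3))
```

Other tasks need different post-processing (for example, decoding token IDs back to text for translation), so this covers only the classification case.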

2. Inference with Transformers

For simple tasks, you can run inference directly with the Pipeline API provided by Transformers.

from transformers import pipeline

translator = pipeline("translation_en_to_fr")
print(translator("How old are you?"))

Output

[{'translation_text': ' quel âge êtes-vous?'}]

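For some specific tasks no default model is provided, but you can search the official site for a suitable model and specify it explicitly. For more pipelines, see the official documentation.

Loading a model may fail because a required library is missing. In that case, install the library and restart.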

pip install sentencepiece

translator = pipeline("translation_en_to_zh", model='Helsinki-NLP/opus-mt-en-zh')
translator("I'm learning deep learning.")

Output

[{'translation_text': '我在学习深思熟虑'}]

IV. BERT Models

1. The bert-base-chinese model

google-bert/bert-base-chinese

A very detailed introduction to Huggingface

1.1 Loading the model online

import torch
from transformers import BertModel, BertTokenizer, BertConfig

# Load the tokenizer and the config
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
config = BertConfig.from_pretrained('bert-base-chinese')

# Modify the model configuration
config.update({'output_hidden_states': True})

model = BertModel.from_pretrained("bert-base-chinese", config=config)

from transformers import AutoModel

checkpoint = "bert-base-chinese"
# Alternatively, load the model through AutoModel
model = AutoModel.from_pretrained(checkpoint)

1.2 Loading an offline model

Download the model files manually and load them.

Find the model you need in the official Hugging Face model hub and open its page.

# Download the model (requires git-lfs)
git lfs install
git clone https://huggingface.co/google-bert/bert-base-chinese

import transformers

MODEL_PATH = "/root/Downloads/models/bert-base-chinese"

# Load the tokenizer from the model directory (which contains vocab.txt)
tokenizer = transformers.BertTokenizer.from_pretrained(MODEL_PATH)

# Load the config file
model_config = transformers.BertConfig.from_pretrained(MODEL_PATH)

# Modify the config
model_config.output_hidden_states = True
model_config.output_attentions = True

# Load the model from the path with the modified config
model = transformers.BertModel.from_pretrained(MODEL_PATH, config=model_config)

1.3 Encoding with the tokenizer

BERT tokenizes Chinese at the character level and English at the sub-word level.
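The difference between the two levels can be illustrated with a simplified sketch of WordPiece-style tokenization: every Chinese character becomes its own token, while an English word is split greedily into the longest vocabulary pieces, with `##` marking continuation pieces. The toy vocabulary is hypothetical, and punctuation handling and Unicode normalization are omitted.

```python
def is_cjk(char):
    # Simplified check covering only the main CJK Unified Ideographs block.
    return "\u4e00" <= char <= "\u9fff"


def basic_split(text):
    """Lowercase, split on whitespace, and make every CJK character its own token."""
    spaced = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text.lower())
    return spaced.split()


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical; real vocabularies have tens of thousands of entries).
toy_vocab = {"learn", "##ing", "deep", "美", "和", "爱"}

tokens = [piece for word in basic_split("Deep learning 美和爱") for piece in wordpiece(word, toy_vocab)]
print(tokens)
```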

1.3.1 Encoding a single sentence

# encode returns only input_ids
tokenizer.encode("生活的真谛是美和爱")

Output

[101, 4495, 3833, 4638, 4696, 6465, 3221, 5401, 1469, 4263, 102]

1.3.2 Encoding a pair of sentences

# encode_plus returns input_ids, token_type_ids, and attention_mask
tokenizer.encode_plus("生活的真谛是美和爱","说的太好了")

Output

{'input_ids': [101, 4495, 3833, 4638, 4696, 6465, 3221, 5401, 1469, 4263, 102, 6432, 4638, 1922, 1962, 749, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here, 101 is the ID of the [CLS] token and 102 is the ID of the [SEP] token.
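How the three fields are assembled for a sentence pair can be sketched in plain Python. The helper `build_pair_inputs` is hypothetical; the character IDs are copied from the output above with the special tokens removed, so the result can be compared with what the tokenizer returned.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs quoted in the text above


def build_pair_inputs(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] plus the matching segment IDs and mask."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
        "token_type_ids": [0] * len(first) + [1] * len(second),
        # Nothing is padded here, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }


# Character IDs copied from the output above, without the special tokens.
sentence_a = [4495, 3833, 4638, 4696, 6465, 3221, 5401, 1469, 4263]
sentence_b = [6432, 4638, 1922, 1962, 749]

encoded = build_pair_inputs(sentence_a, sentence_b)
print(encoded["token_type_ids"])
```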

1.3.3 Encoding a list

sentences = ['网络安全开发分为三个层级',
             '车辆系统层级网络安全开发',
             '车辆功能层级网络安全开发',
             '车辆零部件层级网络安全开发',
             '测试团队根据车辆网络安全目标制定测试技术要求及测试计划',
             '测试团队在网络安全团队的支持下，完成确认测试并编制测试报告',
             '在车辆确认结果的基础上，基于合理的理由，确认在设计和开发阶段识别出的所有风险均已被接受',]

# Encode the list
test1 = tokenizer(sentences)

print(test1)

Output

{'input_ids': [[101, 5381, 5317, 2128, 1059, 2458, 1355, 1146, 711, 676, 702, 2231, 5277, 102], [101, 6756, 6775, 5143, 5320, 2231, 5277, 5381, 5317, 2128, 1059, 2458, 1355, 102], [101, 6756, 6775, 1216, 5543, 2231, 5277, 5381, 5317, 2128, 1059, 2458, 1355, 102], [101, 6756, 6775, 7439, 6956, 816, 2231, 5277, 5381, 5317, 2128, 1059, 2458, 1355, 102], [101, 3844, 6407, 1730, 7339, 3418, 2945, 6756, 6775, 5381, 5317, 2128, 1059, 4680, 3403, 1169, 2137, 3844, 6407, 2825, 3318, 6206, 3724, 1350, 3844, 6407, 6369, 1153, 102], [101, 3844, 6407, 1730, 7339, 1762, 5381, 5317, 2128, 1059, 1730, 7339, 4638, 3118, 2898, 678, 8024, 2130, 2768, 4802, 6371, 3844, 6407, 2400, 5356, 1169, 3844, 6407, 2845, 1440, 102], [101, 1762, 6756, 6775, 4802, 6371, 5310, 3362, 4638, 1825, 4794, 677, 8024, 1825, 754, 1394, 4415, 4638, 4415, 4507, 8024, 4802, 6371, 1762, 6392, 6369, 1469, 2458, 1355, 7348, 3667, 6399, 1166, 1139, 4638, 2792, 3300, 7599, 7372, 1772, 2347, 6158, 2970, 1358, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1]]}
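The encoded sequences above have different lengths, so they cannot be stacked into one tensor as they are. A minimal sketch of right-padding, assuming pad ID 0 as in the config shown earlier, looks like this; the two short ID lists are shortened, illustrative values.

```python
PAD_ID = 0  # matches "pad_token_id": 0 in the config shown earlier


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        missing = longest - len(ids)
        padded.append(ids + [pad_id] * missing)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        masks.append([1] * len(ids) + [0] * missing)
    return {"input_ids": padded, "attention_mask": masks}


# Two short ID sequences of different lengths (shortened, illustrative values).
batch = pad_batch([[101, 5381, 5317, 102], [101, 6756, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With the real tokenizer, passing `padding=True` (and `return_tensors="pt"` for PyTorch tensors) performs this step.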

1.4 Running inference

Feed the tokenized result into the model and run inference.

import torch
import transformers

MODEL_PATH = "/root/Downloads/models/bert-base-chinese"

# Load the tokenizer from the model directory (which contains vocab.txt)
tokenizer = transformers.BertTokenizer.from_pretrained(MODEL_PATH)

# Load the config file
model_config = transformers.BertConfig.from_pretrained(MODEL_PATH)

# Modify the config
model_config.output_hidden_states = True
model_config.output_attentions = True

# Load the model from the path with the modified config
model = transformers.BertModel.from_pretrained(MODEL_PATH, config=model_config)


encoder_result = tokenizer.encode_plus("生活的真谛是美和爱", "说的太好了")
input_ids = encoder_result["input_ids"]
token_type_ids = encoder_result["token_type_ids"]

# Add a batch dimension and convert to tensors
input_ids = torch.tensor([input_ids])
token_type_ids = torch.tensor([token_type_ids])

# Switch the model to eval mode
model.eval()

# Move the model and data to CUDA if it is available, otherwise use the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
tokens_tensor = input_ids.to(device)
segments_tensors = token_type_ids.to(device)
model.to(device)

# Run the encoder
with torch.no_grad():
    # See the model docstrings for details of the inputs and outputs.
    # The first field of the output is the hidden state of the last layer of the BERT model.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    encoded_layers = outputs

# Print the final encoding result and the names of its fields
print(encoded_layers)
print(encoded_layers.keys())

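Output

sequence_output, pooled_output, (hidden_states), (attentions)

Taking an input sequence of length 14 as an example:

Index	Name	Shape	Description
0	sequence_output	torch.Size([1, 14, 768])	Output sequence (last_hidden_state)
1	pooled_output	torch.Size([1, 768])	Result of pooling the output sequence (pooler_output)
2	(hidden_states)	tuple, 13 * torch.Size([1, 14, 768])	Hidden states (including the embedding layer); returned only if output_hidden_states is set in the model config
3	(attentions)	tuple, 12 * torch.Size([1, 12, 14, 14])	Attention weights; returned only if output_attentions is set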
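Where the numbers 13 and 12 in the table come from can be sketched as simple shape arithmetic. The helper `expected_output_shapes` is hypothetical and uses the bert-base sizes from the config shown earlier as defaults.

```python
def expected_output_shapes(batch_size, seq_len, hidden_size=768, num_layers=12, num_heads=12):
    """Expected shapes of the four outputs for the bert-base sizes in the config."""
    return {
        "sequence_output": (batch_size, seq_len, hidden_size),
        "pooled_output": (batch_size, hidden_size),
        # One entry for the embedding output plus one per encoder layer.
        "hidden_states": [(batch_size, seq_len, hidden_size)] * (num_layers + 1),
        # One attention map per layer: each head attends from every token to every token.
        "attentions": [(batch_size, num_heads, seq_len, seq_len)] * num_layers,
    }


shapes = expected_output_shapes(batch_size=1, seq_len=14)
print(len(shapes["hidden_states"]), shapes["attentions"][0])
```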

2. The bert-base-uncased model

Running inference

from transformers import pipeline


# Create a fill-mask pipeline that uses the bert-base-uncased model
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Predict the masked token
unmasker("The goal of life is [MASK].", top_k=5)

Output

[{'score': 0.10933335870504379,
  'token': 2166,
  'token_str': 'life',
  'sequence': 'the goal of life is life.'},
 {'score': 0.03941883519291878,
  'token': 7691,
  'token_str': 'survival',
  'sequence': 'the goal of life is survival.'},
 {'score': 0.032930612564086914,
  'token': 2293,
  'token_str': 'love',
  'sequence': 'the goal of life is love.'},
 {'score': 0.030096178874373436,
  'token': 4071,
  'token_str': 'freedom',
  'sequence': 'the goal of life is freedom.'},
 {'score': 0.024967128410935402,
  'token': 17839,
  'token_str': 'simplicity',
  'sequence': 'the goal of life is simplicity.'}]

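V. Fine-tuning Hugging Face Models

Fine-tune a pretrained model on a downstream task.

1. Question answering

Task input: a question and the passage containing the answer, "Who was Jim Henson?", "Jim Henson was a nice puppet"

Task output: the answer, "a nice puppet"

Pretrained model: BERT

1.1 Building the model

In general, each base model has a single Tokenizer, so there is no Tokenizer specific to a particular downstream task. Here, BertForQuestionAnswering is initialized from bert_model.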

from transformers import BertConfig, BertModel, BertTokenizer, BertForQuestionAnswering
import torch

MODEL_PATH = r"D:\transformr_files\bert-base-uncased/"

# Instantiate the tokenizer
tokenizer = BertTokenizer.from_pretrained(r"D:\transformr_files\bert-base-uncased\bert-base-uncased-vocab.txt")

# Load BERT's model_config
model_config = BertConfig.from_pretrained(MODEL_PATH)

# Create bert_model
bert_model = BertModel.from_pretrained(MODEL_PATH, config=model_config)

# There are two outputs: the start position and the end position
model_config.num_labels = 2

# Build BertForQuestionAnswering from the same model_config and reuse the pretrained backbone
model = BertForQuestionAnswering(model_config)
model.bert = bert_model

1.2 Encoding

# Set eval mode
model.eval()

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

# Get the input_ids encoding
input_ids = tokenizer.encode(question, text)

# Build token_type_ids by hand; encode_plus can be used instead
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]

# Get the scores (recent versions return an output object rather than a plain tuple)
outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
start_scores, end_scores = outputs.start_logits, outputs.end_logits

# Decode the IDs back into the original tokens
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
#['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'nice', 'puppet', '[SEP]']

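1.3 Task output

Convert the model output into the task output.

Model input: input_ids, token_type_ids

Model output: start_scores and end_scores, both of shape torch.Size([1, 14]), where 14 is the sequence length. They hold each position's score for being the start or the end of the answer.

# Decode the answer from the model output
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
# assert answer == "a nice puppet"
# The question-answering head has not been fine-tuned, so the output is poor.
print(answer)
# 'was jim henson ? [SEP] jim henson was a nice puppet [SEP]'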
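Taking the two argmax positions independently can produce spans that start in the question or run past the passage, as in the output above. One common remedy, sketched here with made-up scores standing in for a fine-tuned model's logits, is to search only over passage tokens and require the start to come before the end. The helper `best_answer_span` is hypothetical.

```python
def best_answer_span(start_scores, end_scores, token_type_ids, max_answer_len=30):
    """Pick (start, end) maximizing start + end score, with start <= end inside the passage."""
    # Only tokens of the second segment can be part of the answer; drop the closing [SEP].
    candidates = [i for i, segment in enumerate(token_type_ids) if segment == 1][:-1]
    best_span, best_score = None, float("-inf")
    for start in candidates:
        for end in candidates:
            if start <= end < start + max_answer_len:
                score = start_scores[start] + end_scores[end]
                if score > best_score:
                    best_span, best_score = (start, end), score
    return best_span


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', 'henson', 'was', 'a', 'nice', 'puppet', '[SEP]']
segments = [0] * 7 + [1] * 7

# Hypothetical scores standing in for a fine-tuned model's start and end logits.
start_scores = [0.1, 0.0, 2.5, 0.2, 0.1, 0.0, 0.0, 0.3, 0.2, 0.4, 3.0, 0.5, 0.6, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.3, 0.1, 0.2, 0.1, 0.4, 0.2, 0.3, 0.6, 2.8, 3.5]

start, end = best_answer_span(start_scores, end_scores, segments)
print(" ".join(tokens[start:end + 1]))
```

Here the highest end score sits on the final [SEP], which an unconstrained argmax would select; the constrained search skips it.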

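2. Text classification (sentiment analysis, etc.)

Task input: a sentence, "i like you, what about you"

Task output: the class the sentence belongs to, class1

Pretrained model: XLNet

2.1 Building the model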

from transformers import XLNetConfig, XLNetModel, XLNetTokenizer, XLNetForSequenceClassification
import torch

# Path to the local XLNet files
XLN_PATH = r"D:\transformr_files\XLNetLMHeadModel"

# Initialize the tokenizer
tokenizer = XLNetTokenizer.from_pretrained(XLN_PATH)

# Load the config
model_config = XLNetConfig.from_pretrained(XLN_PATH)

# Set the number of classes to 3
model_config.num_labels = 3

# Load XLNetForSequenceClassification from the XLNet weights with the modified config
cls_model = XLNetForSequenceClassification.from_pretrained(XLN_PATH, config=model_config)

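2.2 Encoding

# Set eval mode
cls_model.eval()

token_codes = tokenizer.encode_plus("i like you, what about you", return_tensors="pt")

# Run the classifier; logits has shape [1, 3]
with torch.no_grad():
    logits = cls_model(**token_codes).logits

2.3 Task output

Model input: input_ids, token_type_ids

Model output: logits and hidden states. logits has shape torch.Size([1, 3]), where 3 is the number of classes. During training, when labels are passed in, the first item of the output is the loss.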
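For single-label classification, the loss returned during training is typically the cross-entropy between the logits and the label. A minimal sketch of that computation for one example follows; the logit values are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    peak = max(logits)
    # log-sum-exp computed stably by factoring out the largest logit
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for one sentence over the 3 classes configured above.
logits = [2.0, 0.5, -1.0]
print(round(cross_entropy(logits, label=0), 4))
print(round(cross_entropy(logits, label=2), 4))
```

The loss is small when the labeled class already has the largest logit and grows as the model's preference moves away from the label.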

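VI. Transfer Learning with Hugging Face

0. Introduction

In many cases the models provided on Hugging Face do not fully meet our needs, so we still have to train a model ourselves. In that case we can start from a pretrained Hugging Face model and apply transfer learning.

Transfer learning with Hugging Face models follows almost the same approach as ordinary transfer learning:

First, choose a pretrained model for a task similar to yours, or choose a task-agnostic base model.
Next, take the backbone out of the original model.
Then, attach your own downstream task to build a new model.
Finally, start training.

Suppose the task is a binary sentiment classification problem. The following uses the google-bert/bert-base-uncased model as an example of transfer learning.

1. Testing the model

Copy the code from "Use this model" on the model page and run it.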

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")

inputs = tokenizer("Learning is a very happy [MASK].", return_tensors='pt')
print(inputs)

model(**inputs).logits.argmax(dim=-1)

tokenizer.convert_ids_to_tokens(2832)

Output

{'input_ids': tensor([[ 101, 4083, 2003, 1037, 2200, 3407,  103, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
'process'
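The snippet above takes the argmax at every position and then reads off the ID at the masked position by hand. A small sketch of locating that position programmatically follows, using the input_ids printed above and toy logits over a made-up 5-entry vocabulary; the helper `predict_masked_id` is hypothetical.

```python
MASK_ID = 103  # ID at the [MASK] position in the input_ids printed above


def predict_masked_id(input_ids, logits):
    """Return the highest-scoring vocabulary ID at the [MASK] position."""
    mask_position = input_ids.index(MASK_ID)
    row = logits[mask_position]
    return max(range(len(row)), key=row.__getitem__)


input_ids = [101, 4083, 2003, 1037, 2200, 3407, 103, 1012, 102]

# Toy logits over a 5-entry vocabulary, one row per token (hypothetical values).
toy_logits = [[0.0] * 5 for _ in input_ids]
toy_logits[6] = [0.1, 0.3, 2.4, 0.2, 0.0]

print(input_ids.index(MASK_ID), predict_masked_id(input_ids, toy_logits))
```

With the real model, the same idea amounts to indexing the logits tensor at the mask position before calling argmax; the tokenizer exposes the mask ID as `tokenizer.mask_token_id`.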

This matches the result shown by the Inference API on the model page.

2. Printing the model

print(model)

Output

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (transform_act_fn): GELUActivation()
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30522, bias=True)
    )
  )
)

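As the printout shows, the bert-base-uncased model consists of two main parts: the backbone bert and the head cls. For transfer learning, keep the backbone and replace the head.

3. Extracting the `bert` layer

print(model.bert)

Output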

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
)

Get the hidden states output by bert

outputs = model.bert(**inputs)
print(outputs)
print(outputs.last_hidden_state.size())

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0568,  0.1662,  0.0943,  ..., -0.0346, -0.0636,  0.1689],
         [-0.0402,  0.0757,  0.1923,  ..., -0.0217, -0.0459,  0.0711],
         [-0.1038, -0.0372,  0.5063,  ..., -0.1587,  0.0475,  0.5513],
         ...,
         [ 0.1763, -0.0111,  0.1922,  ...,  0.1891, -0.1079, -0.2163],
         [ 0.8013,  0.4953, -0.2258,  ...,  0.1501, -0.7685, -0.3709],
         [ 0.0572,  0.3405,  0.6527,  ...,  0.4695, -0.0455,  0.3055]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=None, hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)
torch.Size([1, 9, 768])

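4. Transfer learning

bert-base-uncased was loaded for Fill-Mask, a fill-in-the-blank task, whereas our task is sentiment classification, so the original head has to be removed.

Feed the hidden states output by bert into a linear layer for sentiment classification, compute the loss, and backpropagate to update the parameters. Note that the hidden states returned above have shape (1, 9, 768), where 1 is the batch_size, 9 is the number of tokens, and 768 is the vector dimension of each token. When using BERT for sentiment classification, the output of the first token ([CLS]) is usually used.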

import torch
from torch import nn

# Define the final linear layer for binary classification
cls = nn.Sequential(
    nn.Linear(768, 1),
    nn.Sigmoid()
)
# Use Binary Cross Entropy Loss, which is common for binary classification
criteria = nn.BCELoss()
# Only the parameters of the final linear layer are updated here
optimizer = torch.optim.SGD(cls.parameters(), lr=0.1)

# Use the first token ([CLS]) of the hidden states as input to the cls layer, then compute the loss against the label
loss = criteria(cls(outputs.last_hidden_state[:, 0, :]), torch.tensor([[1.0]]))
loss.backward()
optimizer.step()
optimizer.zero_grad()
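To see what one backward pass and optimizer step do to this small head, the update can be written out by hand. The sketch below uses plain Python and a 4-dimensional stand-in for the 768-dimensional first-token vector. All values are hypothetical, and the backbone is treated as fixed, so only the head's weights change.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(prob, label):
    # Binary cross-entropy for one example.
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def sgd_step(weights, bias, features, label, lr=0.1):
    """One SGD update of a linear + sigmoid head; returns (weights, bias, loss before update)."""
    prob = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
    error = prob - label  # gradient of the loss with respect to the pre-sigmoid value
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias, bce_loss(prob, label)


# A 4-dimensional stand-in for the first-token vector (hypothetical values).
features = [0.5, -1.0, 0.25, 2.0]
weights, bias = [0.0, 0.0, 0.0, 0.0], 0.0

losses = []
for _ in range(5):
    weights, bias, loss = sgd_step(weights, bias, features, label=1.0)
    losses.append(loss)
print([round(value, 4) for value in losses])
```

Repeating the step on the same example lowers the loss each time. A real training run would loop over batches of labeled sentences instead of a single vector.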