LLM: The Transformers Library

-柚子皮-

Last modified: 2023-11-14 16:08:59

Category: LLM. Tags: transformers

First published: 2023-06-02 11:55:36

Article link: https://blog.csdn.net/pipisorry/article/details/131003691

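Transformers is an open-source library whose pretrained models are all based on the transformer architecture. It supports the three most popular deep learning libraries (PyTorch, TensorFlow, and JAX). Its API makes it easy to download and train state-of-the-art pretrained models. Using pretrained models lowers compute cost and saves the time needed to train a model from scratch. The models cover tasks in several modalities, for example:
Text: text classification, information extraction, question answering, summarization, machine translation, and text generation.
Image: image classification, object detection, and image segmentation.
Audio: speech recognition and audio classification.
Multimodal: table question answering, OCR, information extraction from scanned documents, video classification, and visual question answering.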

Environment setup

Install the required Python libraries:
pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install sentence-transformers

Training models additionally requires:

pip install tensorflow
pip install datasets
pip install evaluate
pip install accelerate
pip install bitsandbytes

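Note: Hugging Face's Accelerate makes it quick to set up distributed parallel training, either across multiple machines or across multiple GPUs on one machine, and it also supports FP16 half-precision computation. In addition, loading a model with `load_in_8bit=True` requires accelerate and bitsandbytes to be installed. Usage example:
self.accelerator.backward(loss)
self.accelerator.clip_grad_norm_(model.parameters(), args.max_grad_norm)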

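Errors:

Running the script directly works, but running it under the debugger fails:
raise RuntimeError(
RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
DLL load failed while importing _ufuncs: 找不到指定的程序。 (The specified procedure could not be found.)
python-BaseException
While debugging, a prompt offers to install a debugger helper tool, and that installation fails with:
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
Both errors are caused by the missing Microsoft Visual C++ runtime.
Fix: go to https://visualstudio.microsoft.com/visual-cpp-build-tools/ > Download Build Tools
During installation, select the first option; it can also be added later by modifying the installation.

Finally, the computer must be rebooted (restarting PyCharm is not enough).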

[https://juejin.cn/s/python%20dll%20load%20failed%20while%20importing%20_ufuncs]
[https://learn.microsoft.com/en-us/answers/questions/136595/error-microsoft-visual-c-14-0-or-greater-is-requir]

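Examples

Sentence embedding example with a Chinese pretrained model (hfl/chinese-macbert-large)

In this example, the inputs are returned as PyTorch tensors (pt), and the sentence embedding is computed as the mean of the token embeddings.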

import torch
from itertools import combinations
from transformers import AutoTokenizer, AutoModel

# Load the pretrained model and its matching tokenizer with the AutoModel and AutoTokenizer classes
pretrained_model_name = 'hfl/chinese-macbert-large'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, resume_download=True)
model = AutoModel.from_pretrained(pretrained_model_name)

# Tokenize and encode the sentence
sentence = '这是一个句子'
encoded_input = tokenizer(sentence, return_tensors='pt')
output = model(**encoded_input)[0]    # ** unpacks the dict into keyword arguments; index 0 is output.last_hidden_state
sentence_emb = torch.mean(output, 1)  # average over tokens: sentence vector of shape [batch_size, hidden_size]
print('sentence_emb', sentence_emb.size())
# sentence_emb torch.Size([1, 1024])

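The sentence embedding can then be used for various NLP tasks, such as text classification, clustering, and similarity computation.

Application example: computing the pairwise cosine similarity between sentences from their embeddings

Method 1: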

sentences = ['这是一个句子', '那是一堆词', '不知道什么玩意儿']

encoded_input = tokenizer(sentences, return_tensors='pt', padding=True)
output = model(**encoded_input)[0]

attention_mask = encoded_input['attention_mask']
attention_mask = attention_mask.unsqueeze(-1).expand(output.shape).float()  # cast to float so that clamp(min=1e-9) below behaves as intended
masked_output = output * attention_mask
masked_sentence_emb = torch.sum(masked_output, 1) / torch.clamp(attention_mask.sum(1), min=1e-9)

cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
for a, b in combinations(zip(sentences, torch.split(masked_sentence_emb, 1, dim=0)), 2):
    cos_sim = cos(a[1], b[1])
    print("{}\n{}\ncos_sim:{}\n".format(a[0], b[0], cos_sim))
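For reference, the cosine similarity computed above is essentially the dot product of two vectors divided by the product of their norms, with a small eps guarding against division by zero. The following is a minimal, library-free sketch of that arithmetic; the helper name cosine_similarity and the toy vectors are made up for illustration and are not model output.

```python
import math
from itertools import combinations


def cosine_similarity(u, v, eps=1e-6):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps keeps the denominator away from zero for all-zero vectors
    return dot / max(norm_u * norm_v, eps)


# Toy 3-dimensional "sentence embeddings" (illustrative values only)
toy_embeddings = {
    "sentence A": [1.0, 0.0, 0.0],
    "sentence B": [1.0, 1.0, 0.0],
    "sentence C": [0.0, 0.0, 2.0],
}

# Same pairwise loop as above: every unordered pair of sentences
pairwise = {
    (name_a, name_b): cosine_similarity(vec_a, vec_b)
    for (name_a, vec_a), (name_b, vec_b) in combinations(toy_embeddings.items(), 2)
}
for pair, sim in pairwise.items():
    print(pair, round(sim, 4))
```

Orthogonal vectors score 0 and parallel vectors score 1, which is why the real sentence pairs, all fairly generic short Chinese phrases, land close to 1.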

Method 2:

Alternatively, the sentence_transformers library produces equivalent sentence embeddings:

import torch
from itertools import combinations
from sentence_transformers import SentenceTransformer as st

pretrained_model_name = "./chinese-macbert-large"
model = st(pretrained_model_name)

sentences = ['这是一个句子', '那是一堆词', '不知道什么玩意儿']

embeddings = model.encode(sentences, convert_to_tensor=True)

cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
for a, b in combinations(zip(sentences, torch.split(embeddings, 1, dim=0)), 2):
    cos_sim = cos(a[1], b[1])
    print("{}\n{}\ncos_sim:{}\n".format(a[0], b[0], cos_sim))

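Note:
embeddings = model.encode(sentences, convert_to_tensor=True)
runs the two modules [transformer_model, pooling_model] that are loaded by default through modules = self._load_auto_model(model_path).
The averaging happens in the pooling module, in sentence_transformers.models.Pooling.Pooling.forward, and its implementation is essentially the same as Method 1. Judging from the implementation, token_weights can also be applied.

Method 3: equivalently, encode each sentence on its own with the tokenizer and AutoModel model loaded in the first example. No padding is involved, so a plain mean is enough:

# Tokenize and encode each sentence separately, then compute the cosine similarities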
sentences = ['这是一个句子', '那是一堆词', '不知道什么玩意儿']
sentence_embs = []
for sentence in sentences:
    encoded_input = tokenizer(sentence, return_tensors='pt')
    output = model(**encoded_input)[0]
    sentence_emb = torch.mean(output, 1)
    sentence_embs.append((sentence, sentence_emb))

cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
for a, b in combinations(sentence_embs, 2):
    cos_sim = cos(a[1], b[1])
    print("{}\n{}\ncos_sim:{}\n".format(a[0], b[0], cos_sim))

Output:

这是一个句子
那是一堆词
cos_sim:tensor([0.9194], grad_fn=<SumBackward1>)

这是一个句子
不知道什么玩意儿
cos_sim:tensor([0.8901], grad_fn=<SumBackward1>)

那是一堆词
不知道什么玩意儿
cos_sim:tensor([0.9310], grad_fn=<SumBackward1>)

Module details

[https://www.cnblogs.com/shengshengwang/p/16641925.html]

AutoClass

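The transformers library provides a unified entry point: the "AutoClass" family of high-level classes. Calling an AutoClass's from_pretrained() method with either the name of a pretrained model or the directory that contains one creates the model quickly and conveniently. With AutoClass you only need to know the model's name, or have the model downloaded. The program uses the model_type field in the model's configuration file, or pattern matching on the model name or path, to decide automatically which model class to instantiate, so you no longer need to look up the model's class name in the transformers library. No AutoClass can be instantiated through __init__(); instances are created through from_pretrained() (or from_config()).

Note: 1 See the official Hugging Face documentation on "AutoClass".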
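The dispatch described above can be sketched without the library. This is only an illustration of the idea: the registry, the two placeholder classes, and the function resolve_model_class are hypothetical names, not the library's actual code.

```python
# Hypothetical stand-ins for real model classes
class ToyBertModel:
    pass


class ToyGPT2Model:
    pass


# Registry keyed by the "model_type" field of config.json (illustrative subset)
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def resolve_model_class(config, name_or_path):
    """Pick a class from config['model_type'], else by matching the name or path."""
    model_type = config.get("model_type")
    if model_type in MODEL_REGISTRY:
        return MODEL_REGISTRY[model_type]
    # Fallback: look for a known type inside the name or path, longest key first
    for pattern in sorted(MODEL_REGISTRY, key=len, reverse=True):
        if pattern in name_or_path:
            return MODEL_REGISTRY[pattern]
    raise ValueError(f"Cannot infer a model class for {name_or_path!r}")


# A toy config dict that declares its model_type explicitly
print(resolve_model_class({"model_type": "bert"}, "some/checkpoint-name").__name__)
# No model_type: fall back to matching "bert" inside the path
print(resolve_model_class({}, "./models/bert-base-chinese").__name__)
```

The explicit model_type takes priority, so a checkpoint whose name does not mention its architecture still resolves correctly as long as its config.json declares the type.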

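Downloading pretrained files

For the list of available pretrained models, see [https://huggingface.co/models]

1 Downloading a model directly through from_pretrained can be slow; adding resume_download=True lets an interrupted download resume. Downloaded models are cached automatically in C:\Users\**\.cache\huggingface\hub (Windows).

2 If the network is unreliable, you can also download the model manually from the Hugging Face website and then pass the local directory to from_pretrained.

Find the model at https://huggingface.co/models and open its Files and versions tab.
Download the needed files one by one, or git clone the repository locally: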
git lfs install
git clone https://huggingface.co/hfl/chinese-roberta-wwm-ext

[huggingface transformers预训练模型如何下载至本地，并使用？ - 知乎 (Zhihu: how to download Hugging Face transformers pretrained models locally and use them)]

Tokenizer

# model_args holds the parsed arguments of the surrounding training script (not defined in this snippet)
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

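Parameters:

use_fast=True: use the fast tokenizer implemented in Rust.

Passing a tokenizer name to AutoTokenizer.from_pretrained downloads the tokenizer automatically and instantiates the corresponding tokenizer from the tokenizers library. These tokenizer objects offer rich functionality, such as defining and loading vocabularies, truncation, padding, and specifying special tokens.

In most cases a pretrained model is used together with its own matching tokenizer; otherwise the pretrained model performs much worse.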

tokenizer(
sentences_list,
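padding=True, # pad shorter sequences to the longest one in the batch (padding='max_length' pads to max_length instead)
truncation=True, # truncate sequences longer than max_length
max_length=10,
return_tensors="pt", # type of the returned data; pt: PyTorch tensors, tf: TensorFlow tensors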
)

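Parameters:

padding=True: several sentences can be passed to the tokenizer at once as a list, but their lengths usually differ while the model input needs a uniform shape, so pass padding=True to tokenizer() to pad them. Padded positions get the pad token id in input_ids (0 for BERT-style vocabularies, i.e. [PAD]) and 0 in attention_mask.

truncation=True: a sentence may be too long for the model to handle; in that case pass truncation=True to tokenizer() to truncate it. Note: in sentence_transformers, the default maximum input length is max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

return_tensors: the return type. The default is Python lists; pt returns torch tensors, tf returns TensorFlow tensors, and np returns NumPy arrays.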

pt: {'input_ids': tensor([[ 101, ...102]]),
'token_type_ids': tensor([[0, ...0]]), 'attention_mask': tensor([[1...1]])}
tf: {'input_ids': <tf.Tensor: shape=(1, 32), dtype=int32, numpy=array([[ 101, ... 102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 32), dtype=int32, numpy=array([[0, ... 0]])>,
'attention_mask': <tf.Tensor: shape=(1, 32), dtype=int32, numpy=array([[1, ...1]])>}
Without return_tensors, the values are plain Python lists:
{'input_ids': [101... 102], 'token_type_ids': [0, ... 0], 'attention_mask': [1...1]}
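The padding and truncation behavior described above can be sketched in a few lines of plain Python. This is a simplified illustration, assuming BERT-style ids (101 for [CLS], 102 for [SEP], 0 for [PAD]); the function encode_batch and the token ids are made up, and a real tokenizer handles many more cases.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids used by the BERT-style examples in this article


def encode_batch(token_id_lists, max_length=10):
    """Add [CLS]/[SEP], truncate to max_length, then pad to the longest sequence."""
    sequences = []
    for ids in token_id_lists:
        ids = ids[: max_length - 2]  # leave room for the two special tokens
        sequences.append([CLS_ID] + ids + [SEP_ID])
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids below are invented; a real tokenizer looks them up in its vocabulary
batch = encode_batch([[11, 12, 13], [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]], max_length=6)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The long sentence is cut down to max_length including its special tokens, and the short one is padded up to the same length with its padded position masked out.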

Output

The tokenizer returns a dict-like object with three entries:

{'input_ids': tensor([[101, 7444, ..., 102, 0, ... 0],[101, 1920, ..., 106, 102]]),
'token_type_ids': tensor([[0, ..., 0],[0, ..., 0]]),
'attention_mask': tensor([[1, ... 1, 0, ..., 0],[1, ..., 1]])}
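input_ids: the vocabulary index of each token in the sentence.
token_type_ids: when there are multiple sequences, marks which sequence each token belongs to.
attention_mask: whether each token should be attended to (1 means attend, 0 means ignore; this feeds into the attention mechanism).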

[Pytorch Transformer Tokenizer常见输入输出实战详解_token_type_ids (a hands-on guide to common tokenizer inputs and outputs) - CSDN blog]

Note:

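1 All three have shape [batch_size, seq_len], where seq_len is the length after padding (if any).

2 Compared with the raw text, the output contains the extra tokens [CLS] and [SEP], which are special tokens added by BERT-style models: each sentence gets 101 ([CLS]) at the start and 102 ([SEP]) at the end, and shorter sentences are then filled with 0 ([PAD]) at the end.

After padding, the model output also contains embeddings at the [PAD] positions, so mean pooling over all positions gives a slightly different sentence embedding than the unpadded input does.

Fix for the padding error in sentence embeddings:
attention_mask = encoded_input['attention_mask']
attention_mask = attention_mask.unsqueeze(-1).expand(output.shape).float()  # cast to float so that clamp(min=1e-9) below behaves as intended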
masked_output = output * attention_mask
masked_sentence_emb = torch.sum(masked_output, 1) / torch.clamp(attention_mask.sum(1), min=1e-9)
print("masked_sentence_emb:", masked_sentence_emb)
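To see why the mask matters, here is a small library-free sketch of the same masked mean pooling on a single toy sentence. The helper mean_pool and all numbers are invented for illustration; real hidden states come from the model.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors; with a mask, only positions where mask == 1 count."""
    if mask is None:
        mask = [1] * len(token_vectors)
    hidden_size = len(token_vectors[0])
    summed = [0.0] * hidden_size
    for vector, keep in zip(token_vectors, mask):
        for i, value in enumerate(vector):
            summed[i] += value * keep
    count = max(sum(mask), 1e-9)  # same guard as torch.clamp(..., min=1e-9)
    return [s / count for s in summed]


# Toy hidden states for one sentence: 2 real tokens, then 2 padding positions.
# Padding positions still carry non-zero vectors, as they do in real model output.
real_tokens = [[1.0, 2.0], [3.0, 4.0]]
padded = real_tokens + [[9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

print("unpadded mean:", mean_pool(real_tokens))
print("naive mean   :", mean_pool(padded))
print("masked mean  :", mean_pool(padded, mask))
```

The masked mean over the padded input matches the mean over the unpadded input, while the naive mean is pulled toward the padding vectors.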

[https://www.cnblogs.com/shengshengwang/p/16641925.html]

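Note: 1 Converting token ids back to tokens

decoded_input = tokenizer.decode(encoded_input["input_ids"][0])
or decoded_input = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0])

For example, for "你好nigeshamao":
decode: [CLS] 你好 nigeshamao [SEP]
convert_ids_to_tokens: ['[CLS]', '你', '好', 'ni', '##ge', '##sha', '##ma', '##o', '[SEP]']

With multi-sentence padding, decoding gives something like: [CLS] 私信联系我 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

2 In a pretrained model's vocab, a token starting with "##" is a subword: a piece that continues a word instead of starting one. For example, "##依" is the character "依" appearing as a non-initial piece of a word. Splitting words into such pieces lets the model handle unknown words and complex language structure better.

3 Slow tokenizer: tokenization implemented in Python. Fast tokenizer: implemented on top of the Rust-based Tokenizers library (which is why installing transformers also installs the tokenizers package).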
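The "##" pieces in the example above can be reproduced with a greedy longest-match-first split, which is the general idea behind WordPiece tokenization. The sketch below is a simplified illustration; the function wordpiece and the toy vocabulary are assumptions, and real tokenizers add normalization, a maximum word length, and much larger vocabularies.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary containing only the pieces from the example above
toy_vocab = {"ni", "##ge", "##sha", "##ma", "##o"}
print(wordpiece("nigeshamao", toy_vocab))
```

Only the first piece appears without the prefix; every later piece is looked up with "##" in front, which is how the vocabulary distinguishes word-initial pieces from continuations.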

[transformers库使用 (Using the transformers library)]

Example

print(len(tokenizer))  # prints the vocabulary size

Loading a pretrained model with AutoModel

model = AutoModel.from_pretrained(pretrained_model_name)

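Loaded this way, the model takes its settings directly from the pretrained model's config.json.

You can also pass a configuration instance when loading, which lets you customize the pretrained model, e.g.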
model = AutoModel.from_pretrained(pretrained_model_name, config=config)

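Note: internally, the loading logic is: model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(...)

Configuration items can also be passed directly as keyword arguments to from_pretrained(); for example, the call below changes the number of encoder layers to 3. Note that these overrides are not applied to the configuration when a config argument is also given.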
model = AutoModel.from_pretrained(pretrained_model_name, num_hidden_layers=3)

Inspecting the model structure and trainable parameters after loading

print("model:\n", model)

for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)

Warnings when loading a model

"Some weights of the model checkpoint were not used"? Or "Some weights were not initialized from the model checkpoint"?

This is usually expected, because, in the words of the warning itself, "if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture".

Example 1:

When BertForMaskedLM loads bert-base-uncased or bert-base-chinese for further MLM pretraining, it reports: Some weights of the model checkpoint at ./models/bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
>> All the weights of BertForMaskedLM were initialized from the model checkpoint at ./models/bert-base-uncased.
Reason: It tells you that by loading the bert-base-uncased checkpoint in the BertForMaskedLM architecture, you're dropping two weights: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']. These are the weights used for next-sentence prediction, which aren't necessary for Masked Language Modeling. If you're only interested in doing masked language modeling, then you can safely disregard this warning. In other words, BertForMaskedLM in the Hugging Face transformers library has no NSP head; its head is only BertOnlyMLMHead.

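Note: Oddly, during MLM training 'bert.pooler.dense.weight' from the BertPooler layer is not reported as not used. Perhaps it was loaded but add_pooling_layer=False hides it from the model output? (If you know, please leave a comment.) A likely explanation, at least for some library versions: BertForMaskedLM builds its BertModel with add_pooling_layer=False, so the pooler weights are not loaded at all, and the class lists pooler in _keys_to_ignore_on_load_unexpected, which keeps those keys out of the warning.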

Example 2:

When BertForSequenceClassification loads pretrained bert-base-uncased or bert-base-chinese to finetune on a downstream task, it reports: Some weights of the model checkpoint at ../mlm/models/bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
>> Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ../mlm/models/bert-base-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
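Reason: "were not used" appears because the checkpoint's pretraining heads, MLM (cls.predictions) and NSP (cls.seq_relationship), are not needed for classification; "were not initialized" appears because the downstream classification head is not part of the pretraining tasks, so the checkpoint has no weights for it.

Note: the classification task should both load and use 'bert.pooler.dense.weight'.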
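The two lists in these warnings behave like set differences between the parameter names stored in the checkpoint and the names the target architecture expects. A minimal sketch of that bookkeeping, with a hypothetical helper compare_keys and heavily abbreviated key lists (a real BERT checkpoint has many more entries):

```python
def compare_keys(checkpoint_keys, model_keys):
    """Return (unused checkpoint keys, newly initialized model keys)."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    not_used = sorted(checkpoint_keys - model_keys)
    newly_initialized = sorted(model_keys - checkpoint_keys)
    return not_used, newly_initialized


# Abbreviated, illustrative key lists
checkpoint = [
    "bert.embeddings.word_embeddings.weight",
    "bert.pooler.dense.weight",
    "cls.predictions.bias",          # MLM head
    "cls.seq_relationship.weight",   # NSP head
]
classifier_model = [
    "bert.embeddings.word_embeddings.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
    "classifier.bias",
]

not_used, newly_initialized = compare_keys(checkpoint, classifier_model)
print("not used:", not_used)
print("newly initialized:", newly_initialized)
```

The pretraining heads end up in the first list and the classification head in the second, mirroring the two messages in Example 2.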

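Example 3:
Loading an MLM-finetuned checkpoint (MLM only, no NSP) with BertForPreTraining leaves the NSP parameters (and, as the log shows, the pooler) without checkpoint weights: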
>> All model checkpoint weights were used when initializing BertForPreTraining.
>> Some weights of BertForPreTraining were not initialized from the model checkpoint at ../mlm_wwm/models/new_chinese-roberta-wwm-ext and are newly initialized: ['bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight']

Example 4:
Loading bert-base-chinese with BertForPreTraining uses every checkpoint weight, and every model weight is initialized from the checkpoint.
>> All model checkpoint weights were used when initializing BertForPreTraining.
>> All the weights of BertForPreTraining were initialized from the model checkpoint at ./models/bert-base-chinese.

Model output

output = model(**encoded_input)

BaseModelOutputWithPoolingAndCrossAttentions(
last_hidden_state=tensor([[[-0.0522, 0.2279, 0.2827, ..., -0.9577, -0.6874, -0.1196],
[-1.0162, 1.4295, 0.6190, ..., -0.7054, -0.8687, 0.6819],
[ 0.1048, 0.1585, 0.4110, ..., 0.0747, -0.6967, 0.9410],
...,
[ 1.1614, -0.2327, 0.8371, ..., -0.4496, -1.1670, -0.4048],
[-0.0157, -0.2109, -0.1147, ..., 0.4460, -1.0719, -0.6256],
[-0.0522, 0.2279, 0.2827, ..., -0.9577, -0.6874, -0.1196]]], grad_fn=<NativeLayerNormBackward0>),
pooler_output=tensor([[... 768 values such as 0.1784 ...]], grad_fn=<TanhBackward0>),
hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

Getting the last hidden layer's embeddings

output[0] or output.last_hidden_state

This holds the embedding of each token in the sentence; its shape is [batch_size, seq_len, hidden_size].

Note: the return value is an OrderedDict subclass, and its element at index 0 is last_hidden_state:
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=sequence_output (= encoder_outputs[0]) ...
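The reason output[0] and output.last_hidden_state refer to the same object can be illustrated with a tiny stand-in class. TinyModelOutput below is hypothetical and much simpler than the library's output classes; it only shows how one ordered mapping can support key, attribute, and integer access at once.

```python
from collections import OrderedDict


class TinyModelOutput(OrderedDict):
    """Hypothetical, simplified stand-in for a transformers model output."""

    def __getattr__(self, name):
        # Attribute access falls back to the stored fields
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer indexing walks the stored values in insertion order
            return list(self.values())[key]
        return super().__getitem__(key)


# Placeholder values instead of tensors
output = TinyModelOutput(last_hidden_state=[[0.1, 0.2]], pooler_output=[0.3])
print(output[0] is output.last_hidden_state)
print(output[1] is output["pooler_output"])
```

Because fields are stored in order, index 0 is whichever field was stored first, which for the output shown above is last_hidden_state.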

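Saving a model

After a model has been modified or retrained, it can be saved again with model.save_pretrained(). This writes two files to the given directory: the configuration file (config.json) and the weights file (pytorch_model.bin; recent versions write model.safetensors by default).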
model.save_pretrained("./new_model/")

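Modifying model weights with load_state_dict

model.load_state_dict(model_05.state_dict(), strict=False). With strict=True, the copy checks for unexpected_keys or missing_keys and raises an error if there are any, so the code cannot continue; with strict=False these mismatches are ignored. [源码详解Pytorch的state_dict和load_state_dict - 知乎 (Zhihu: a source-level walkthrough of PyTorch's state_dict and load_state_dict)]

Example: here model_05 holds MLM-pretrained parameters (the BERT body [without BertPooler] plus the MLM head), while model, a classification model, is loaded from the roberta checkpoint that still has its original pooler (the BERT body [with BertPooler] plus a freshly initialized classification head, without the MLM and NSP parameters). After model loads model_05's parameters, the BERT body [without BertPooler] is overwritten by model_05, so the result is: model = model_05's BERT body [without BertPooler] + model's original BertPooler parameters + the freshly initialized classification parameters.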

from transformers import AutoConfig, AutoModelForMaskedLM, BertForSequenceClassification

pretrained_model = '../mlm_wwm/models/chinese-roberta-wwm-ext'
config = AutoConfig.from_pretrained(pretrained_model, num_labels=2, finetuning_task=None)
model = BertForSequenceClassification.from_pretrained(pretrained_model, from_tf=False, config=config)

# pretrained_model_05: path to the MLM-pretrained checkpoint (its value is not given in the original)
config_05 = AutoConfig.from_pretrained(pretrained_model_05, num_labels=2, finetuning_task=None)
model_05 = AutoModelForMaskedLM.from_pretrained(pretrained_model_05, from_tf=False, config=config_05)

model.load_state_dict(model_05.state_dict(), strict=False)

output_model = '../mlm_wwm/models/chinese-roberta-wwm-ext_06/'
model.save_pretrained(output_model)
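The merge described in this example can be traced with plain dictionaries. The sketch below is a simplified imitation of non-strict loading, not PyTorch's implementation: the function load_state_dict_sketch is hypothetical, the values are labels naming where each parameter came from, and in strict mode it raises before copying anything, which is a simplification.

```python
def load_state_dict_sketch(target, source, strict=True):
    """Copy matching entries from source into target; report the mismatches."""
    missing = sorted(k for k in target if k not in source)      # in model, not in source
    unexpected = sorted(k for k in source if k not in target)   # in source, not in model
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing={missing}, unexpected={unexpected}")
    for key, value in source.items():
        if key in target:
            target[key] = value
    return missing, unexpected


# Values are labels for the origin of each parameter (illustrative, not tensors)
model = {
    "bert.encoder.layer.0.weight": "roberta",
    "bert.pooler.dense.weight": "roberta",
    "classifier.weight": "random init",
}
model_05 = {
    "bert.encoder.layer.0.weight": "mlm",
    "cls.predictions.bias": "mlm",
}

missing, unexpected = load_state_dict_sketch(model, model_05, strict=False)
print(model)
print("missing:", missing, "unexpected:", unexpected)
```

Only the shared BERT body key is overwritten; the pooler and classifier keep their previous values, and the MLM head from model_05 is dropped as unexpected.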

Configuration with AutoConfig

Viewing a pretrained model's hyperparameters
Method 1:
View the model's configuration file directly
https://huggingface.co/hfl/chinese-roberta-wwm-ext/blob/main/config.json
Method 2:
from transformers import AutoConfig
config = AutoConfig.from_pretrained("hfl/chinese-roberta-wwm-ext")
print(config)

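Through the config instance you can modify configuration items. For example, the configuration above has 12 encoder layers; setting config.num_hidden_layers=5 changes that to 5. The resulting model's encoder then has only 5 layers, only the first 5 layers load pretrained weights, and the remaining weights are discarded. To reuse the modified configuration later, save it locally by passing a target path; it is written to that directory as config.json: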
config.save_pretrained("./models/bert-base-chinese")
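The effect of lowering num_hidden_layers on which pretrained weights survive can be pictured as a filter over parameter names. This is only a sketch of the idea, assuming layer weights are named with an encoder.layer.<index>. pattern as in BERT-style models; the helper keep_first_layers and the key names are illustrative.

```python
import re


def keep_first_layers(checkpoint_keys, num_hidden_layers):
    """Keep non-layer keys and encoder layers whose index is below num_hidden_layers."""
    kept, dropped = [], []
    for key in checkpoint_keys:
        match = re.search(r"encoder\.layer\.(\d+)\.", key)
        if match and int(match.group(1)) >= num_hidden_layers:
            dropped.append(key)
        else:
            kept.append(key)
    return kept, dropped


# One illustrative key per layer of a 12-layer encoder, plus an embedding key
keys = ["embeddings.word_embeddings.weight"] + [
    f"encoder.layer.{i}.output.dense.weight" for i in range(12)
]
kept, dropped = keep_first_layers(keys, num_hidden_layers=5)
print(len(kept), "kept;", len(dropped), "dropped")
```

Layers 0 to 4 and the embeddings are kept, while layers 5 to 11 have no counterpart in the smaller model and are discarded.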

ref:https://github.com/huggingface/transformers

https://www.cnblogs.com/chenhuabin/p/16997607.html
