Hugging Face notes: AutoTokenizer, AutoClass

UQI-LIUWJ

Last modified 2024-05-30 23:51:37


First published 2024-05-13 10:12:47

Article link: https://blog.csdn.net/qq_40206371/article/details/138789706


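AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task and its associated preprocessing class.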

1 AutoTokenizer

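A tokenizer preprocesses text into the arrays of numbers that a model takes as input. Several rules govern tokenization, including how a word is split and at what level it is split.
Instantiate the tokenizer with the same model name as the model, so that the tokenization rules match the ones used during pretraining.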

1.1 Loading a tokenizer with AutoTokenizer

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
encoding
'''
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
'''

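input_ids holds the vocabulary index of each token in the sentence.
attention_mask indicates whether each token should be attended to (1) or ignored (0).
token_type_ids identifies which sequence a token belongs to when there is more than one sequence.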
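
The single-sentence output above has token_type_ids that are all 0, so it may help to see what the three fields look like for a sentence pair. The following is a minimal pure-Python sketch, assuming a BERT-style layout of [CLS] A [SEP] B [SEP] with the ids 101 and 102 seen in the output above. build_pair_inputs is a hypothetical helper for illustration, not a transformers function.

```python
# Toy illustration of BERT-style inputs for a sentence pair.
# 101 ([CLS]) and 102 ([SEP]) match the ids in the outputs shown in this post;
# build_pair_inputs is a hypothetical helper, not part of transformers.
CLS_ID, SEP_ID = 101, 102

def build_pair_inputs(ids_a, ids_b):
    """Lay out two token-id lists as [CLS] A [SEP] B [SEP]."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        # 0 marks tokens of the first sequence, 1 marks the second
        "token_type_ids": [0] * len(first) + [1] * len(second),
        # every position holds a real token, so all of them are attended to
        "attention_mask": [1] * len(input_ids),
    }

pair = build_pair_inputs([11312, 10320], [11312, 18763])
print(pair)
```

With the real tokenizer, passing two texts in one call, as in tokenizer(text_a, text_b), should give the same kind of layout for BERT-style models.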

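1.2 The tokenizer accepts a list of inputs

The tokenizer also accepts a list of inputs, and can pad and truncate the texts to return a batch of uniform length.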

tokenizer(
    ["We are very happy to show you the Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
# "pt" here means the result is returned as PyTorch tensors
'''
{'input_ids': tensor([[  101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 58263,
         13299,   119,   102],
        [  101, 11312, 18763, 10855, 11530,   112,   162, 39487, 10197,   119,
           102,     0,     0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
'''

1.3 Recovering the input with decode

tokenizer.decode(encoding["input_ids"])
#'[CLS] we are very happy to show you the [UNK] transformers library. [SEP]'
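
The decoded string is not identical to the original input: the text is lowercased, special tokens appear, and the emoji comes back as [UNK]. A toy word-level tokenizer can show why such a round trip is lossy. This is only a sketch: VOCAB is a made-up vocabulary, and the encode and decode functions are hypothetical stand-ins, far simpler than the real subword tokenizer.

```python
# Toy word-level tokenizer. VOCAB is a made-up vocabulary for illustration,
# not the real vocabulary of any pretrained model.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "we": 4, "are": 5, "happy": 6, ".": 7}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text):
    """Lowercase, split on spaces, map unknown words to [UNK], add special tokens."""
    words = text.lower().replace(".", " .").split()
    ids = [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

def decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("We are happy 🤗.")
print(ids)
print(decode(ids))
```

Because the emoji is not in the vocabulary, it is stored as the [UNK] id, and decode has no way to recover the original character.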

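1.4 Padding

Sentences are not always the same length, which is a problem because the tensors fed to a model must have a uniform shape.
Padding is a strategy that makes the tensors rectangular by adding a special padding token to the shorter sentences.
Set the padding parameter to True to pad the shorter sequences in a batch to the length of the longest one:

Without padding (lengths differ):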

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences)
encoded_input
'''

{'input_ids': [[101, 10502, 11523, 10935, 10981, 61304, 136, 102], [101, 11530, 112, 162, 21506, 10191, 45864, 10935, 10981, 61304, 117, 16999, 10373, 119, 102], [101, 11523, 10935, 29577, 44682, 136, 102]], 
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
'''

With padding:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences,padding=True)
encoded_input
'''
{'input_ids': [[101, 10502, 11523, 10935, 10981, 61304, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 11530, 112, 162, 21506, 10191, 45864, 10935, 10981, 61304, 117, 16999, 10373, 119, 102], [101, 11523, 10935, 29577, 44682, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
'''

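The first and third sentences are now padded with 0 because they are shorter.

1.4.1 Which side to pad

By default, 0s are padded on the right.

To pad on the left instead: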

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                         padding_side='left')



batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences,
                          padding=True)
encoded_input
'''
{'input_ids': 
[[0, 0, 0, 0, 0, 0, 0, 101, 10502, 11523, 10935, 10981, 61304, 136, 102],
[101, 11530, 112, 162, 21506, 10191, 45864, 10935, 10981, 61304, 117, 16999, 10373, 119, 102], 
[0, 0, 0, 0, 0, 0, 0, 0, 101, 11523, 10935, 29577, 44682, 136, 102]], 
'token_type_ids': 
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
'attention_mask':
 [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]}
'''
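
The right-side and left-side padding shown in section 1.4 can be sketched in a few lines of plain Python. This is an illustration of the idea, assuming a pad id of 0 as in the outputs above. pad_batch and its side parameter are hypothetical names that mirror padding_side, not the library's implementation.

```python
def pad_batch(batch, pad_id=0, side="right"):
    """Pad every id list to the longest length and build matching attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = [pad_id] * (longest - len(ids))
        mask_pad = [0] * len(padding)   # padded positions are ignored
        real = [1] * len(ids)           # real tokens are attended to
        if side == "right":
            input_ids.append(ids + padding)
            attention_mask.append(real + mask_pad)
        elif side == "left":
            input_ids.append(padding + ids)
            attention_mask.append(mask_pad + real)
        else:
            raise ValueError("side must be 'right' or 'left'")
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7, 8, 102], [101, 7, 102]]
right = pad_batch(batch)
left = pad_batch(batch, side="left")
print(right)
print(left)
```

In both cases the attention mask moves together with the padding, so the model can tell padded positions from real tokens whichever side is used.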

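1.5 Truncation

At the other end of the spectrum, a sequence is sometimes too long for the model to handle.
In that case, truncate the sequence to a shorter length.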

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, 
                          padding=True, 
                          truncation=True,
                         max_length=5)
print(encoded_input)
'''
{'input_ids': [[101, 10502, 11523, 10935, 102], [101, 11530, 112, 162, 102], [101, 11523, 10935, 29577, 102]], 
'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

'''
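
In the output above every truncated sequence still starts with 101 and ends with 102, which suggests that the special tokens are kept and only the content in between is cut. A minimal sketch of that behavior follows, assuming a single sequence already wrapped in [CLS] and [SEP]. truncate is a hypothetical helper, not the library's truncation code.

```python
def truncate(ids, max_length, cls_id=101, sep_id=102):
    """Cut an encoded sequence to max_length while keeping [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(ids) <= max_length:
        return ids
    content = ids[1:-1]               # drop the [CLS] and [SEP] wrappers
    kept = content[: max_length - 2]  # leave room for the two special tokens
    return [cls_id] + kept + [sep_id]

# input_ids of "But what about second breakfast?" from section 1.4
full = [101, 10502, 11523, 10935, 10981, 61304, 136, 102]
short = truncate(full, max_length=5)
print(short)
```

The result matches the first row of input_ids in the truncated output above.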

1.6 return_tensors

Set the return_tensors parameter to "pt" to get PyTorch tensors or "tf" to get TensorFlow tensors, as in the call in section 1.2 that passes return_tensors="pt".

2 AutoModel

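Transformers provides a simple, unified way to load pretrained instances.
You can load an AutoModel the same way you load an AutoTokenizer.
The only difference is selecting the correct AutoModel for the task.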
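
The idea of choosing an architecture from a name or path can be pictured as a lookup from a model type to a class. The sketch below is a toy version of that dispatch, on the assumption that the checkpoint's configuration carries a model_type field. The Toy classes and the ARCHITECTURES registry are hypothetical and only stand in for the real mechanism.

```python
# Toy sketch of "pick the architecture from the checkpoint's config".
# The classes and the ARCHITECTURES registry are hypothetical, for illustration only.
class ToyBertModel:
    def __init__(self, config):
        self.config = config

class ToyGPT2Model:
    def __init__(self, config):
        self.config = config

ARCHITECTURES = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}

class ToyAutoModel:
    @classmethod
    def from_config(cls, config):
        """Look up the model class that matches config['model_type']."""
        model_type = config["model_type"]
        if model_type not in ARCHITECTURES:
            raise ValueError(f"Unknown model_type: {model_type!r}")
        return ARCHITECTURES[model_type](config)

model = ToyAutoModel.from_config({"model_type": "bert", "num_labels": 5})
print(type(model).__name__)
```

The caller never names ToyBertModel directly, which is the convenience an AutoClass offers.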

2.1 Example: text/sequence classification

For text (or sequence) classification, load AutoModelForSequenceClassification:

from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

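Then pass the already-tokenized batch to pt_model. Here pt_batch comes from the author's companion post "huggingface 笔记：AutoClass (quick tour 部分)"; it is the output of a tokenizer call with return_tensors="pt", like the one in section 1.2.

# The ** unpacks the dictionary into keyword arguments
pt_outputs = pt_model(**pt_batch)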

pt_outputs 
'''
SequenceClassifierOutput(loss=None, logits=tensor([[-2.6407, -2.7451, -0.8407,  2.0394,  3.2070],
        [ 0.0064, -0.1258, -0.0503, -0.1655,  0.1329]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
'''

Apply the softmax function to the logits to retrieve the probabilities:

from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
pt_predictions
'''
tensor([[0.0022, 0.0019, 0.0131, 0.2332, 0.7496],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
'''
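
To see what the softmax call does to the numbers, here is a plain-Python version applied to the first row of logits printed above. It is a sketch of the standard formula (exponentiate, then normalize), not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of pt_outputs.logits shown above
logits = [-2.6407, -2.7451, -0.8407, 2.0394, 3.2070]
probs = softmax(logits)
print([round(p, 4) for p in probs])

# Index of the most probable class
best = max(range(len(probs)), key=probs.__getitem__)
print(best)
```

The probabilities sum to 1, and the largest logit (the last class) receives the largest share.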

2.2 Saving the model

Once the model is fine-tuned, you can save it together with its tokenizer using PreTrainedModel.save_pretrained():

tf_save_directory = "./tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
pt_model.save_pretrained(tf_save_directory)

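When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():

from transformers import AutoModelForSequenceClassification

# The directory holds PyTorch weights (saved from pt_model above), so reload with the PyTorch class
pt_model = AutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")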

2.3 Converting between the PyTorch and TensorFlow frameworks

The from_pt or from_tf parameter converts a model from one framework to the other:

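from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
# The directory was saved from a PyTorch model, so load it into TensorFlow with from_pt=True.
# In the opposite direction (a TensorFlow checkpoint loaded into a PyTorch class), pass from_tf=True.
tf_model = TFAutoModelForSequenceClassification.from_pretrained(tf_save_directory,
                                                                from_pt=True)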

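2.4 Reducing precision with quantization

Quantization can reduce precision below 16 bits. It is a lossy way of compressing model weights.
It allows each parameter to be compressed to 8 bits or 4 bits.
Note: at 4 bits, the model's output may be negatively affected.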

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)  
# You can also try load_in_4bit=True


model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", 
                                 device_map="auto",                             
                                 quantization_config=quantization_config)
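
To make the "lossy compression" point concrete, here is a small pure-Python sketch of one simple quantization scheme: round each weight to a signed integer using a single shared scale, then map it back. This is plain absmax rounding chosen for illustration. It is an assumption of this sketch, not the algorithm that BitsAndBytesConfig actually uses, and the weights are made-up values.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers of the given bit width using one shared scale."""
    levels = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    scale = (max(abs(w) for w in weights) / levels) or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

def max_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    quantized, scale = quantize_absmax(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

weights = [0.12, -0.53, 0.98, -0.07, 0.31]       # made-up example weights
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)
print(err8, err4)
```

With fewer bits there are fewer representable values, so the 4-bit error is noticeably larger than the 8-bit error, which is consistent with the warning about 4-bit outputs.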

2.4.1 Loading a float16 model

Just set torch_dtype:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Load the weights directly in half precision
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16)

2.4.2 Another way: load_in_4bit / load_in_8bit

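from transformers import AutoModel

# Both shortcuts require the bitsandbytes package.
# Recent transformers releases deprecate these arguments in favor of
# quantization_config=BitsAndBytesConfig(...), as used in section 2.4.
model_name = "meta-llama/Llama-2-7b-hf"
model_8bit = AutoModel.from_pretrained(model_name, load_in_8bit=True)

model_4bit = AutoModel.from_pretrained(model_name, load_in_4bit=True)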

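2.5 Declare the data type when loading a model

PyTorch model weights are normally instantiated as torch.float32.
Loading a model as a different data type can therefore be a problem.
- For example, you may need twice as much memory: first to load the weights in torch.float32, and then again to load them in the desired data type (such as torch.float16).
- To avoid wasting memory this way, explicitly set the torch_dtype parameter to the desired data type, or set torch_dtype="auto" to load the weights in the most memory-efficient way (the data type is derived automatically from the model weights).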
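
A quick back-of-the-envelope calculation shows why the data type matters. The sketch below estimates the memory taken by the weights alone for several data types. The parameter count of 8 billion is an assumed round figure for an "8B" model, and weight_memory_gb is a hypothetical helper. Activations and framework overhead are not counted.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 8_000_000_000  # assumed round figure for an "8B" model
for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(num_params, dtype), "GB")
```

Under these assumptions, a float32 copy needs about twice the memory of the float16 model you actually want, which is the waste that setting torch_dtype up front avoids.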
