[Hugging Face Learning Series] Using Transformers

Author: 长命百岁

Last modified: 2023-02-11 21:30:11

Category: huggingface. Tags: python, deep learning

First published: 2023-02-10 22:47:23

Article link: https://blog.csdn.net/qq_52852138/article/details/128978420

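Preface

My experiments involve many huggingface-transformers models and operations, so I plan to work through them from the beginning by following the course.
This series will be updated continuously.
Later on I will probably also study the fairseq framework.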

Using Transformers

We start with a complete example and look at what actually happens during processing.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

This pipeline consists of three steps: preprocessing, passing the inputs through the model, and postprocessing.

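Preprocessing with a tokenizer

Like other neural networks, Transformer models cannot process raw text directly, so we first use a tokenizer to convert the text into numbers the model can work with. The tokenizer is responsible for:

Splitting the input into words, subwords, or symbols, which are called tokens
Mapping each token to an integer
Adding extra inputs that may be useful to the model

We use a pretrained tokenizer, loaded through the AutoTokenizer class and its from_pretrained() method: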

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Now we can pass text to the tokenizer:

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

>>{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}

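We can pass a single sentence or a list of sentences, and specify the type of tensors we want back.

Transformers models only accept tensors as input.

The output is a dictionary with two keys:

input_ids: the token IDs of each sentence, one row per sentence
attention_mask: marks which positions hold real tokens (1) and which hold padding (0)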

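Tokenizers in detail

As mentioned above, the job of a tokenizer is to convert raw text into a form the model can understand (numbers).

There are many ways to split text. For example, Python's .split() function splits text into words on whitespace, and we can also split on punctuation. With this kind of tokenizer we end up with a very large vocabulary, where a vocabulary is defined by the total number of independent tokens that we have in our corpus. Each word is assigned an ID (starting from 0), and the model uses these IDs to tell words apart.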
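The word-based approach described above can be sketched in a few lines of plain Python. This is only a toy illustration: the two-sentence corpus and the helper name build_vocab are made up for this example and are not part of the transformers library.

```python
def build_vocab(corpus):
    """Assign an integer ID (starting from 0) to every distinct whitespace-separated word."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free ID
    return vocab


# Hypothetical toy corpus, used only for illustration
corpus = ["Jim Henson was a puppeteer", "Jim was a director"]
vocab = build_vocab(corpus)

# Map a new sentence to IDs using the vocabulary built above
word_ids = [vocab[word] for word in "Jim was a puppeteer".split()]
print(vocab)
print(word_ids)
```

Every distinct word gets its own ID, which is why the vocabulary of a word-based tokenizer grows so quickly on a real corpus.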

For the different tokenization approaches, see: Tokenizers - Hugging Face Course

Loading and saving

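Loading and saving rely on two methods: from_pretrained() and save_pretrained(). These methods load or save the algorithm used by the tokenizer (analogous to a model's architecture) and its vocabulary (analogous to a model's weights).

Loading

from transformers import AutoTokenizer, BertTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # finds the matching tokenizer class from the checkpoint name

# A specific tokenizer class can also be loaded directly
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Saving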

tokenizer.save_pretrained("directory_on_my_computer")

Encoding

Let's look at how input_ids are generated (encoding). Encoding takes two steps:

Tokenization (split text into tokens)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

Conversion to input IDs: when we instantiate a tokenizer with .from_pretrained(), a vocabulary is downloaded. We use this vocabulary to perform the mapping.
```
ids = tokenizer.convert_tokens_to_ids(tokens)
```
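To make the two encoding steps concrete, here is a minimal sketch of greedy longest-match-first subword splitting (the strategy used by WordPiece tokenizers such as BERT's) followed by the vocabulary lookup. The toy vocabulary and the helper names are assumptions made for this example; the real bert-base-cased vocabulary is far larger and may split words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest subword pieces found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def toy_tokenize(text, vocab):
    """Step 1: split on whitespace, then split each word into subword tokens."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


def toy_convert_tokens_to_ids(tokens, vocab):
    """Step 2: look up each token in the vocabulary."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


# Hypothetical toy vocabulary, not the real bert-base-cased one
toy_vocab = {"[UNK]": 0, "Using": 1, "a": 2, "Trans": 3, "##former": 4,
             "network": 5, "is": 6, "simple": 7}

toy_tokens = toy_tokenize("Using a Transformer network is simple", toy_vocab)
toy_ids = toy_convert_tokens_to_ids(toy_tokens, toy_vocab)
print(toy_tokens)
print(toy_ids)
```

The word "Transformer" is not in the toy vocabulary as a whole, so it is broken into a leading piece and a "##" continuation piece, while the other words map to a single token each.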

Decoding

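Decoding goes the other way: given ids (which are simply the indices of tokens in the vocabulary), we recover the corresponding tokens. For this we can use decode():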

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

The decode() function not only converts the indices back to tokens, it also merges the tokens that belong to the same word, producing a readable sentence.
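The merging behaviour can be illustrated with a small sketch. It assumes the "##" continuation-prefix convention used by WordPiece vocabularies; the ID-to-token table and the function name toy_decode are invented for this example and do not reproduce the library's actual implementation.

```python
def toy_decode(ids, id_to_token):
    """Turn IDs back into tokens, then glue '##' continuation pieces onto the previous word."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the prefix and merge into the previous word
        else:
            words.append(token)
    return " ".join(words)


# Hypothetical toy table, not the real bert-base-cased vocabulary
toy_id_to_token = {1: "Using", 2: "a", 3: "Trans", 4: "##former",
                   5: "network", 6: "is", 7: "simple"}

decoded = toy_decode([1, 2, 3, 4, 5, 6, 7], toy_id_to_token)
print(decoded)
```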

Loading the tokenizer that matches a given model and using it to process sequences yields all the inputs that model needs (input ids, attention mask, and so on).

Model

Creating a Transformer

We take BERT as an example. The first step in instantiating BERT is to load a configuration object:

from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

Different loading methods

The code above creates a model with randomly initialized weights. We can also load a pretrained model:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

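The weights are downloaded and stored in the cache (the default path is ~/.cache/huggingface/transformers; recent versions of the library use ~/.cache/huggingface/hub). The cache folder can be customized by setting the HF_HOME environment variable.

Saving a model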

model.save_pretrained("directory_on_my_computer")

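This saves two files:

config.json: contains the attributes needed to build the model architecture, plus some metadata (such as the Transformers version used when the checkpoint was last saved)
pytorch_model.bin: contains all of the model weights
These two files go hand in hand: one describes the model architecture and the other provides the model parameters.

Using a model for inference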

import torch
sequences = ["Hello!", "Cool.", "Nice!"]
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]
model_inputs = torch.tensor(encoded_sequences)
output = model(model_inputs)

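The model accepts many different arguments, but only input_ids is required.

Tensors only accept rectangular data. If the sequences have different lengths, converting them to a tensor raises an error.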

Handling multiple sequences

Models expect a batch of inputs

First, let's perform the tokenizer's steps one by one and see what happens.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

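The model raises an error when it receives this input, because Transformers models expect a batch of sequences by default, while we passed only a single sequence.
Note that the tokenizer does more than convert the input IDs into a tensor: it also adds a dimension in front.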

tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])
>tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

In our manual version above, we need to add that dimension ourselves when converting the input IDs to a tensor:

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)
>Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
>Logits: [[-2.7276,  2.8789]]

We can also feed several sequences to the model at once, which is called batching:

batched_ids = [ids, ids]

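This is a batch containing two identical sequences.
If we convert it to a tensor and pass it to the model, we get the same logits as before, only twice.

When we feed several sequences to the model at once, they all have to go into one tensor. If their lengths differ, they cannot be converted to a tensor (tensors only accept rectangular data).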

Padding the inputs

Padding fills the shorter sequences with [pad_token] so that all sequences have the same length (that of the longest one).
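As a rough sketch of what padding to the longest sequence means, the helper below works on plain Python lists. The function name pad_to_longest is made up for this example, and pad_token_id=0 is an assumption that matches the zeros visible in the padded tokenizer output shown earlier.

```python
def pad_to_longest(batch, pad_token_id):
    """Right-pad every sequence with pad_token_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]


# Assumption: 0 is the pad token ID, as in the tokenizer output earlier in the post
padded_batch = pad_to_longest([[200, 200, 200], [200, 200]], pad_token_id=0)
print(padded_batch)
```

After padding, every row has the same length, so the batch is rectangular and can be turned into a tensor.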

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
>tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
>tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
>tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

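Something is off: the second row of the batch should give the same logits as the standalone [200, 200] sequence, but the values differ.

The reason is that the attention layers of a Transformer attend to every token in the sequence, including pad_token.
We need to remove the influence of the added pad_token.

The tokenizer supports several ways of padding; see the Put it all together section.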

Attention masks

attention_mask has the same shape as input_ids and consists of 0s and 1s.

0 means the token at this position should be ignored (it is a pad_token)
1 means the token at this position is not a pad_token
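A mask of this form can be derived from the original (unpadded) sequence lengths. The helper below is a hypothetical illustration in plain Python, assuming padding is added on the right as in the examples in this post.

```python
def build_attention_mask(lengths):
    """Return one row per sequence: 1 for real tokens, 0 for right-side padding."""
    max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]


# Two sequences of 3 and 2 real tokens, padded to length 3
toy_mask = build_attention_mask([3, 2])
print(toy_mask)
```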

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

This way the second row of the batch gives the same logits as the unpadded sequence.
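Why the mask restores the result can be seen in a simplified sketch of attention weights for a single query. The scores below are invented numbers, and real models apply the mask inside every attention head (typically by adding a large negative value to the scores), so treat this only as an illustration of the idea.

```python
import math


def attention_weights(scores, mask=None):
    """Softmax over attention scores; positions with mask 0 get zero weight."""
    if mask is not None:
        scores = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores: two real tokens, then one padding position
unpadded = attention_weights([1.0, 2.0])
padded_no_mask = attention_weights([1.0, 2.0, 0.5])
padded_masked = attention_weights([1.0, 2.0, 0.5], mask=[1, 1, 0])

print(unpadded)
print(padded_no_mask)
print(padded_masked)
```

Without the mask, the padding position takes a share of the attention, which changes the weights of the real tokens. With the mask, the weights of the real tokens match the unpadded case.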

Longer sequences

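Transformer models have a limit on the length of the input sequence. Most can handle at most 512 or 1024 tokens and fail on anything longer. Possible solutions:

Use a model that supports longer sequences
Split up or truncate the sequence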
```
sequence = sequence[:max_sequence_length]
```
The tokenizer also supports truncation (described below).
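If we want to keep the whole text rather than discard the tail, one option is to cut the list of IDs into consecutive pieces and run the model on each piece. The helper below is a hypothetical sketch in plain Python. It ignores special tokens and overlap between pieces, both of which a real setup would need to handle.

```python
def split_into_chunks(ids, max_length):
    """Cut a list of token IDs into consecutive chunks of at most max_length items."""
    return [ids[i:i + max_length] for i in range(0, len(ids), max_length)]


long_ids = list(range(10))  # stand-in for a long list of token IDs
chunks = split_into_chunks(long_ids, max_length=4)
truncated = long_ids[:4]    # plain truncation keeps only the first chunk
print(chunks)
print(truncated)
```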

Put it all together

Some features supported by the tokenizer

Pad according to several objectives (the tokenizer does not pad by default)

# Will pad the sequences up to the length of the longest sequence in the input
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length (requires the model to define a max length)
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

Truncate sequences

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

Choose the type of tensors returned

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

Special tokens

Besides converting the sequence into input_ids, the tokenizer adds two extra tokens, one at the start and one at the end (different tokenizers may add different tokens).

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
>[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
>[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

Let's decode both and compare:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
>"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
>"i've been waiting for a huggingface course my whole life."

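This is because the model was pretrained with these tokens, so we add them too in order to get the same results. Of course, different models add different tokens, and some add none at all.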
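For this particular checkpoint, the effect can be reproduced by hand. The sketch below assumes that 101 and 102 are the IDs of [CLS] and [SEP], as the printed output above suggests. The helper name is made up, and other tokenizers use different tokens and IDs.

```python
def add_special_tokens(ids, cls_id=101, sep_id=102):
    """Wrap a list of token IDs with a start token and an end token."""
    return [cls_id] + ids + [sep_id]


# IDs produced by tokenize() + convert_tokens_to_ids() in the example above
plain_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
             2607, 2026, 2878, 2166, 1012]
with_special = add_special_tokens(plain_ids)
print(with_special)
```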

The complete flow

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
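The flow above stops at the raw model output. The postprocessing step mentioned at the start of this post turns logits into labels and scores, which can be sketched in plain Python as below. The label order (0 for NEGATIVE, 1 for POSITIVE) is an assumption about this checkpoint's configuration, the function names are hypothetical, and the logits are the values printed in the batching example earlier.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: label order used by the sentiment checkpoint
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def postprocess(logits_batch):
    """Pick the most probable label for every row of logits."""
    results = []
    for logits in logits_batch:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": id2label[best], "score": probs[best]})
    return results


results = postprocess([[-2.7276, 2.8789]])
print(results)
```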