NLP: Transformers quick tour

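Article link: https://blog.csdn.net/Felaim/article/details/117327299

This post introduces the Hugging Face Transformers library, including how to use pipeline for tasks such as sentiment analysis, text generation, and named entity recognition. It walks through how pipeline works: how the tokenizer splits text, how the model is used, and how input text is processed. It also shows how to load and use pretrained models and how to customize a model, and touches on the workflow for training and saving models.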

Transformers: Quick tour

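1. Getting started on a task with pipeline

1.1 Task types provided by the transformers pipeline

Sentiment analysis: is the text positive or negative
Text generation (English): given a prompt, the model generates the text that follows
Named entity recognition (NER): given a sentence, label each word with the entity it represents (person, place, etc.)
Question answering: give the model some context and a question, and it extracts the answer from the context
Filling masked text: given text with masked words (e.g. replaced with [MASK]), fill in the blanks
Summarization: generate a summary of a long text
Translation: translate a text into another language
Feature extraction: return a tensor representation of the text

Code examples for all tasks: https://huggingface.co/transformers/task_summary.html

Example code. Note: all the example code in this post is the PyTorch version!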

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to show you the 🤗 Transformers library.')
# res: [{'label': 'POSITIVE', 'score': 0.9997795224189758}]

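The first time you run this command, it has to download the pretrained model and its tokenizer (covered in more detail later). The tokenizer preprocesses the text, and the model then makes a prediction from the result. The pipeline in transformers bundles all of these steps together and post-processes the prediction to make it readable. Personally, though, I find it wrapped so tightly that the steps inside are not easy to follow.

By default, this pipeline downloads the model "distilbert-base-uncased-finetuned-sst-2-english"

All the models available for transformers are listed at https://huggingface.co/models; there were roughly 9k+ of them at the time of writing

To use a different model, pass its name to pipeline through the model argument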

classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

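We can also use a pretrained model that has been saved locally. How do we do that?

For this we need two classes. The first is AutoTokenizer, which we use to download the tokenizer associated with the model we picked and to instantiate it. The second is AutoModelForSequenceClassification (TFAutoModelForSequenceClassification if you use TensorFlow), which we use to download the model itself.
Note that if we use the library for a different task, the model class changes; see https://huggingface.co/transformers/task_summary.html

To download a different pretrained model, we only need the from_pretrained() method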

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

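1.2 How does pipeline work?

As I said earlier, pipeline wraps the interface so thoroughly that it does not help most of us understand what is going on. From here on we use from_pretrained to create the model and the tokenizer ourselves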

from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

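1.2.1 Using the tokenizer

As mentioned briefly above, the tokenizer's main job is to preprocess the input text. First, it splits the input text into words (or parts of words, punctuation marks, etc.), usually called tokens. Different models handle this step differently (the various approaches are described at https://huggingface.co/transformers/tokenizer_summary.html). Because the preprocessing differs from model to model, we have to pass the name of the pretrained model when we instantiate the tokenizer.
Second, it converts those tokens into numbers, so that the input text can be turned into a tensor and fed to the model. The tokenizer has a vocabulary, which is downloaded when we initialize it, because we must use the same vocabulary the model was pretrained with.

To do all of this, we can pass the text straight to the tokenizer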

 inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary containing the ids of the tokens (https://huggingface.co/transformers/glossary.html#input-ids) and an attention mask (https://huggingface.co/transformers/glossary.html#attention-mask), which tells the model which tokens to attend to and which to ignore

print(inputs)
res: {'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
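
To make the two steps concrete, here is a minimal sketch of what a tokenizer does, in plain Python. It is not the real DistilBERT tokenizer (which uses WordPiece subword splitting). TOY_VOCAB and toy_encode are hypothetical names, and the vocabulary entries are simply read off the input_ids printed above by lining them up with the words of the sentence. The ids 100, 101 and 102 are the [UNK], [CLS] and [SEP] special tokens of the BERT uncased vocabulary.

```python
import re

# Hypothetical mini-vocabulary, inferred by aligning the sentence with the
# input_ids printed above. The real vocabulary has about 30k entries.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "very": 2200, "happy": 3407, "to": 2000,
    "show": 2265, "you": 2017, "the": 1996, "transformers": 19081,
    "library": 3075, ".": 1012,
}

def toy_encode(text):
    """Lowercase, split into words and punctuation, then map tokens to ids."""
    # Step 1: split the text into tokens (runs of word characters, or single symbols).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: look each token up; anything unknown (here the emoji) becomes [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # The model expects [CLS] at the start and [SEP] at the end.
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_encode("We are very happy to show you the 🤗 Transformers library.")
print(encoded)
```

The emoji is not in the vocabulary, which is why it shows up as 100 ([UNK]) in the real output as well.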

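We can pass a list of sentences directly to the tokenizer. If we want to send them through the network as a batch, we can truncate the texts to the maximum length the model accepts and pad them all to the same length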

 pt_batch = tokenizer(
     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
     padding=True,
     truncation=True,
     max_length=512,
     return_tensors="pt"
)

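Padding is applied automatically on the side the model expects (the right side in this example). The padded positions also get entries in the attention mask, with value 0, because the model should not attend to padding

for key, value in pt_batch.items():
    print(f"{key}: {value.numpy().tolist()}")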
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
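
The padding behaviour can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's implementation: pad_batch is a hypothetical helper, the id lists are copied from the output above, and 0 is the id of the [PAD] token in the BERT uncased vocabulary.

```python
PAD_ID = 0  # id of the [PAD] token in the BERT uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id list to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
])
for key, value in batch.items():
    print(f"{key}: {value}")
```

The second sentence is three tokens shorter than the first, so it receives three pad ids and three zeros in its mask.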

To learn more, see https://huggingface.co/transformers/preprocessing.html

1.2.2 Using the model

Once the tokenizer has processed the text, we can pass the result directly to the model

 pt_outputs = pt_model(**pt_batch)

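In transformers, every model output is an object that holds the model's final activations along with other data. These objects are described in more detail at https://huggingface.co/transformers/main_classes/output.html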

print(pt_outputs)
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
    [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

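Note that the output object has a logits attribute, which we can use to access the model's final output.

Note: in transformers, all models (PyTorch or TensorFlow) return the activations before the final activation function (such as softmax), because that final activation is usually combined with the loss computation.

Apply softmax to the result to get the final predictions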

 import torch.nn.functional as F
 pt_predictions = F.softmax(pt_outputs.logits, dim=-1)

This gives us the result after the activation

print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
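
As a sanity check, the same softmax can be worked out by hand without PyTorch. The sketch below uses only the standard library; the logits are copied (rounded) from the output printed earlier, and softmax is a small helper written for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    biggest = max(row)
    exps = [math.exp(x - biggest) for x in row]  # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-4.0833, 4.3364], [0.0818, -0.0418]]  # copied from the model output
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 5) for p in row])
```

For this model index 1 is the POSITIVE class, so the first row agrees with the score of about 0.9998 that the pipeline reported for the first sentence.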

If we also give the model the labels, its output includes the loss as well

import torch
pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))
print(pt_outputs)
SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833,  4.3364],
[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
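
The loss above can be reproduced by hand. The grad_fn name suggests a negative log-likelihood over softmax probabilities, averaged over the batch, so the sketch below assumes a standard mean cross-entropy. cross_entropy is a helper written for this example, and the logits are the rounded values from the printed output, so the result should only land close to the printed 0.3167.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class, computed from raw logits."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp trick: log(sum(exp(row))) computed in a stable way
        biggest = max(row)
        log_sum_exp = biggest + math.log(sum(math.exp(x - biggest) for x in row))
        # -log softmax(row)[label] == log_sum_exp - row[label]
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

logits = [[-4.0833, 4.3364], [0.0818, -0.0418]]  # copied from the model output
labels = [1, 0]                                   # same labels as in the call above
loss = cross_entropy(logits, labels)
print(round(loss, 4))
```

Almost all of the loss comes from the second sentence, where the model is close to undecided between the two classes.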

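Transformers models are standard torch.nn.Module or tf.keras.Model objects, so we can use them in an ordinary training loop. Transformers also provides a Trainer class (TFTrainer if you use TensorFlow) to help with training, taking care of things such as distributed training and mixed precision. For details see https://huggingface.co/transformers/training.html

Once training has finished, we can save the model and the tokenizer

save_directory = "./saved_model"  # example path, pick any directory you like
tokenizer.save_pretrained(save_directory)
pt_model.save_pretrained(save_directory)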

We can then load the model back with from_pretrained. PyTorch and TensorFlow checkpoints are interchangeable, so a model saved from one framework can be loaded in the other

# The model was saved from PyTorch, but we are working in TensorFlow
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)

# The model was saved from TensorFlow, but we are working in PyTorch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=True)

If we need all the hidden states and the attention weights, we can ask the model to return them like this

pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states  = pt_outputs.hidden_states
all_attentions = pt_outputs.attentions

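1.2.3 Accessing the code

The AutoModel and AutoTokenizer classes are just shortcuts that work automatically with any pretrained model. In the source code there is one model class for each combination of architecture and class, so the code is easy to access and adjust if needed.

In our earlier example the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it uses the DistilBERT architecture. Because we used AutoModelForSequenceClassification (TFAutoModelForSequenceClassification if you use TensorFlow), the model created automatically is a DistilBertForSequenceClassification. We can read its documentation for all the details specific to that model, or browse the source code.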

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

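1.2.4 Customizing the model

If you want to change how the model itself is built, you can define a custom configuration class. Each architecture comes with its own configuration by default.

For example, DistilBertConfig lets us specify parameters for DistilBERT such as the hidden dimension and the dropout rate. But if we make a core modification (such as changing the hidden size), we can no longer use the pretrained model and have to train from scratch.

We can then instantiate the model directly from this configuration.

The example below uses from_pretrained() to load a predefined vocabulary for the tokenizer. Unlike the tokenizer, the model is something we want to initialize from scratch, so we instantiate it from the configuration instead of using from_pretrained().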

from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)
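
One thing to keep in mind when picking these values: in multi-head attention the hidden size is split evenly across the heads, so dim should be divisible by n_heads. The sketch below checks the numbers used above. check_attention_config is a hypothetical helper, not part of transformers, and the divisibility rule is stated here as a general property of multi-head attention rather than quoted from the library source.

```python
def check_attention_config(n_heads, dim, hidden_dim):
    """Check that dim splits evenly across heads and report the derived sizes."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return {
        "head_dim": dim // n_heads,         # size of each attention head
        "ffn_expansion": hidden_dim / dim,  # feed-forward width relative to dim
    }

info = check_attention_config(n_heads=8, dim=512, hidden_dim=4 * 512)
print(info)
```

With n_heads=8 and dim=512 each head works on a 64-dimensional slice, and the feed-forward layer is four times as wide as the hidden size.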

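Sometimes we only need to change the model's head, for example the number of labels. In that case we can still use the pretrained model for the body. Concretely, we pass the argument to from_pretrained(), which uses it to update the default configuration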

from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)