Hugging Face transformers library manual (2): fast prediction with Hugging Face transformers pretrained models using the Pipeline API

Cleo_Gao

Last modified: 2023-08-07 15:22:31

Category: HuggingFace transformers library manual. Tags: artificial intelligence, deep learning, machine learning

First published: 2023-08-07 15:20:47

Article link: https://blog.csdn.net/cleo_gao/article/details/132147046

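Training involves two things that prediction does not: data augmentation and gradient backpropagation. Although that is only two extra pieces, training code is much more complex than prediction code, so we start with the simpler prediction process.

In Hugging Face transformers, prediction is handled entirely by the Pipeline class.

Pipelines are a simple way to run inference.

Instantiation: pipeline() returns a Pipeline object.

A Pipeline object consists of: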

A tokenizer in charge of mapping raw textual input to token.
A model to make predictions from the inputs.
Some (optional) post processing for enhancing model’s output.
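
The three parts listed above can be pictured as three functions chained together. The sketch below is a toy stand-in written in plain Python, not transformers code: the vocabulary, the weights, and the names toy_tokenizer, toy_model and toy_postprocess are all made up for illustration.

```python
# Toy stand-ins for the three parts of a Pipeline; none of this is transformers code.
VOCAB = {"this": 0, "restaurant": 1, "is": 2, "awesome": 3, "awful": 4}
WEIGHTS = {3: 2.0, 4: -2.0}  # hypothetical per-token sentiment weights


def toy_tokenizer(text):
    """Map raw text to token ids, skipping unknown words."""
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]


def toy_model(token_ids):
    """Return a single raw score (a stand-in for logits)."""
    return sum(WEIGHTS.get(t, 0.0) for t in token_ids)


def toy_postprocess(score):
    """Turn the raw score into a human-readable label."""
    return {"label": "POSITIVE" if score >= 0 else "NEGATIVE", "raw_score": score}


def toy_pipeline(text):
    # tokenizer -> model -> post-processing, in that order
    return toy_postprocess(toy_model(toy_tokenizer(text)))


print(toy_pipeline("This restaurant is awesome"))
```

A real Pipeline does the same three steps, with a trained tokenizer and model in place of the toy functions.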

Pipeline usage examples

Getting a pipeline for a task

The first positional argument of pipeline() is the task.

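>>> from transformers import pipeline
>>> # Get a Pipeline object; the string argument selects the pipeline type and is read as the task by default
>>> pipe = pipeline("text-classification")
>>> # Pass input data to the pipeline object to get the prediction
>>> pipe("This restaurant is awesome")
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]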

Getting a pipeline for a specific model

Instead of a task, you can specify the model you want by name:

>>> # A model name can be passed instead of a task
>>> pipe = pipeline(model="roberta-large-mnli")
>>> pipe("This restaurant is awesome")
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]

Passing a model object

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Sentiment analysis pipeline
analyzer = pipeline("sentiment-analysis")

# Question answering pipeline, specifying the checkpoint identifier
oracle = pipeline(
    "question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="bert-base-cased"
)

# Named entity recognition pipeline, passing in a specific model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
recognizer = pipeline("ner", model=model, tokenizer=tokenizer)

Predicting several inputs in one call

Use a list to handle several inputs

>>> pipe = pipeline("text-classification")
>>> pipe(["This restaurant is awesome", "This restaurant is awful"])
[{'label': 'POSITIVE', 'score': 0.9998743534088135},
 {'label': 'NEGATIVE', 'score': 0.9996669292449951}]
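
The list call returns one result per input, in input order, so inputs and predictions can be paired by position. A minimal sketch in plain Python: the results list is copied from the output above rather than produced by a live model, and the names labelled and negatives are only illustrative.

```python
texts = ["This restaurant is awesome", "This restaurant is awful"]
# Results copied from the example output above; a real run would call pipe(texts).
results = [
    {"label": "POSITIVE", "score": 0.9998743534088135},
    {"label": "NEGATIVE", "score": 0.9996669292449951},
]

# zip() pairs each input with the prediction at the same index
labelled = [(text, r["label"], round(r["score"], 4)) for text, r in zip(texts, results)]
negatives = [text for text, label, _ in labelled if label == "NEGATIVE"]

for row in labelled:
    print(row)
```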

Using datasets directly

import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# Simply pass the dataset to the pipeline instance
# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....

Passing a generator also works:

from transformers import pipeline

pipe = pipeline("text-classification")


def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
        # in a server
        # Caveat: because this is iterative, you cannot use `num_workers > 1` variable
        # to use multiple threads to preprocess data. You can still have 1 thread that
        # does the preprocessing while the main runs the big inference
        yield "This is a test"


for out in pipe(data()):
    print(out)
    # {"label": ..., "score": ...}
    # {"label": ..., "score": ...}
    # ....
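
The generator above never stops, so that loop runs until it is interrupted. A bounded variant is easier to experiment with. The sketch below assumes a hypothetical fake_pipe that stands in for pipe(...) and only measures string length; what it shows is the lazy one-in, one-out flow, not real model output.

```python
def data(limit):
    """Yield at most `limit` inputs; a bounded variant of the generator above."""
    for i in range(limit):
        yield f"This is test {i}"


def fake_pipe(stream):
    """Hypothetical stand-in for pipe(...): pulls inputs one at a time and yields one result each."""
    for text in stream:
        yield {"text": text, "n_chars": len(text)}


outputs = []
for out in fake_pipe(data(3)):
    outputs.append(out)
    print(out)
```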

`transformers.pipeline()` parameters

There are many parameters; only the most important ones are covered here.

task: str
model: str, PreTrainedModel, or TFPreTrainedModel
config: str or PretrainedConfig
- These are the hyperparameters needed to build the model, not the ones needed to train it
tokenizer: str, PreTrainedTokenizer, or PreTrainedTokenizerFast
device: int / str / torch.device
num_workers (int, optional, defaults to 8)
batch_size (int, optional, defaults to 1)
feature_extractor: str or SequenceFeatureExtractor
- The feature extractor that will be used by the pipeline to encode data for the model.
- Feature extractors are used for non-NLP models, such as Speech or Vision models as well as multi-modal models. Multi-modal models will also require a tokenizer to be passed.
image_processor: str or BaseImageProcessor
framework: str
- either “pt” for PyTorch or “tf” for TensorFlow.
revision: str, defaults to “main”
- A git branch name (a tag name or commit id also works); rarely needed
use_fast: bool
- Whether or not to use a Fast tokenizer if possible
model_kwargs: dict
- Additional arguments passed to from_pretrained()
kwargs: dict
- Additional arguments required by a specific pipeline

Supported tasks

“audio-classification”: will return an AudioClassificationPipeline.
“automatic-speech-recognition”: will return an AutomaticSpeechRecognitionPipeline.
“conversational”: will return a ConversationalPipeline.
“depth-estimation”: will return a DepthEstimationPipeline.
“document-question-answering”: will return a DocumentQuestionAnsweringPipeline.
“feature-extraction”: will return a FeatureExtractionPipeline.
“fill-mask”: will return a FillMaskPipeline.
“image-classification”: will return an ImageClassificationPipeline.
“image-segmentation”: will return an ImageSegmentationPipeline.
“image-to-text”: will return an ImageToTextPipeline.
“mask-generation”: will return a MaskGenerationPipeline.
“object-detection”: will return an ObjectDetectionPipeline.
“question-answering”: will return a QuestionAnsweringPipeline.
“summarization”: will return a SummarizationPipeline.
“table-question-answering”: will return a TableQuestionAnsweringPipeline.
“text2text-generation”: will return a Text2TextGenerationPipeline.
“text-classification” (alias “sentiment-analysis” available): will return a TextClassificationPipeline.
“text-generation”: will return a TextGenerationPipeline.
“token-classification” (alias “ner” available): will return a TokenClassificationPipeline.
“translation”: will return a TranslationPipeline.
“translation_xx_to_yy”: will return a TranslationPipeline.
“video-classification”: will return a VideoClassificationPipeline.
“visual-question-answering”: will return a VisualQuestionAnsweringPipeline.
“zero-shot-classification”: will return a ZeroShotClassificationPipeline.
“zero-shot-image-classification”: will return a ZeroShotImageClassificationPipeline.
“zero-shot-audio-classification”: will return a ZeroShotAudioClassificationPipeline.
“zero-shot-object-detection”: will return a ZeroShotObjectDetectionPipeline.

Pipeline chunk batching

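zero-shot-classification and question-answering use ChunkPipeline.

This is because a single input might yield multiple forward passes of the model. Under normal circumstances, this would cause issues with the batch_size argument.

Previously, passing the data to the pipeline was all there was to it; with a ChunkPipeline the three stages are run as separate steps, because preprocess() yields several chunks per input:

pipe.preprocess()
pipe.forward()
pipe.postprocess()

Basic pattern: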

all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
    model_outputs = pipe.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipe.postprocess(all_model_outputs)
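
To make the loop above concrete, here is a toy version in plain Python. It assumes a zero-shot-style input in which every candidate label becomes its own chunk; the three functions are hypothetical stand-ins for the pipeline methods, and the word-count "scorer" takes the place of a model.

```python
def preprocess(inputs):
    """Yield one model input per candidate label (one chunk each)."""
    for label in inputs["candidate_labels"]:
        yield {"text": inputs["text"], "label": label}


def forward(chunk):
    """Hypothetical scorer: counts how often the label occurs in the text."""
    score = chunk["text"].lower().count(chunk["label"])
    return {"label": chunk["label"], "score": score}


def postprocess(all_model_outputs):
    """Combine the per-chunk outputs into a single ranked answer."""
    return sorted(all_model_outputs, key=lambda o: o["score"], reverse=True)


inputs = {"text": "The food was great, great service too", "candidate_labels": ["great", "bad"]}

all_model_outputs = []
for preprocessed in preprocess(inputs):
    # One forward pass per chunk, even though there is a single input
    all_model_outputs.append(forward(preprocessed))
outputs = postprocess(all_model_outputs)
print(outputs)
```

One input produced two forward passes here, which is the situation that makes batching by input count awkward.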

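Writing your own pipeline

First, decide what the inputs and outputs are.

Inputs: strings / raw bytes / dictionaries / …; this will be the input to preprocess
Outputs: the simpler the better; this will be the output of postprocess

Four methods need to be implemented: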

import torch
from transformers import Pipeline


class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Route user-supplied kwargs to preprocess, _forward and postprocess
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        # Turn the pipeline input into the keyword arguments the model expects
        input_ids = torch.tensor(inputs["input_ids"])
        return {"input_ids": input_ids}

    def _forward(self, model_inputs):
        # model_inputs == {"input_ids": input_ids}
        outputs = self.model(**model_inputs)
        # Maybe {"logits": Tensor(...)}
        return outputs

    def postprocess(self, model_outputs):
        # Softmax over the last dimension gives class probabilities
        best_class = model_outputs["logits"].softmax(-1)
        return best_class

`preprocess()`

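Its input is the original input you decided on; the method processes it into the model input, which is the output of preprocess(). (Keep the pipeline input and the model input apart.)

The output of preprocess() is usually a dict, which is then passed to the model as **kwargs.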

`_forward()`

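forward() contains safeguards that make sure everything runs on the expected device; all other model-related code goes into _forward(), which forward() calls.

Note that only model-related code belongs in _forward(); pre-processing and post-processing go into their own methods.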

`postprocess()`

The output of _forward() is the input of postprocess(), which turns it into the output the user wants.

`_sanitize_parameters()`

This function exists to allow users to pass any parameters whenever they wish, be it at initialization time pipeline(...., maybe_arg=4) or at call time pipe = pipeline(...); output = pipe(...., maybe_arg=4)

The method returns 3 dicts, which are passed to preprocess(), _forward() and postprocess() respectively.
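
The routing described above can be modelled in a few lines of plain Python. This is a simplified sketch, not the transformers implementation: ToyPipeline and its list-based "model" are hypothetical, and the assumption is that values given at call time override those given at initialization.

```python
class ToyPipeline:
    """Simplified model of the parameter routing; not the transformers implementation."""

    def __init__(self, **kwargs):
        # Parameters given at initialization time become the defaults
        self._pre, self._fwd, self._post = self._sanitize_parameters(**kwargs)

    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs, postprocess_kwargs = {}, {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        if "top_k" in kwargs:
            postprocess_kwargs["top_k"] = kwargs["top_k"]
        return preprocess_kwargs, {}, postprocess_kwargs

    def __call__(self, inputs, **kwargs):
        pre, fwd, post = self._sanitize_parameters(**kwargs)
        # Assumption: call-time values override init-time values
        pre = {**self._pre, **pre}
        fwd = {**self._fwd, **fwd}
        post = {**self._post, **post}
        model_inputs = self.preprocess(inputs, **pre)
        model_outputs = self._forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)

    def preprocess(self, inputs, maybe_arg=2):
        return [x * maybe_arg for x in inputs]

    def _forward(self, model_inputs):
        return sorted(model_inputs, reverse=True)

    def postprocess(self, model_outputs, top_k=5):
        return model_outputs[:top_k]


pipe = ToyPipeline(maybe_arg=4)
print(pipe([1, 2, 3]))           # maybe_arg=4 comes from initialization
print(pipe([1, 2, 3], top_k=2))  # top_k is supplied at call time
```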

Example

Target behavior:

>>> pipe = pipeline("my-new-task")
>>> pipe("This is a test")
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]

>>> pipe("This is a test", top_k=2)
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]

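Subclassing Pipeline

The first call passed nothing besides the input data and returned 5 results, so top_k defaults to 5 (it is a parameter of postprocess()). To support this, edit _sanitize_parameters() so that the parameter is passed through: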

def postprocess(self, model_outputs, top_k=5):
    best_class = model_outputs["logits"].softmax(-1)
    # Logic that uses top_k to truncate the result goes here
    return best_class

def _sanitize_parameters(self, **kwargs):
    preprocess_kwargs = {}
    if "maybe_arg" in kwargs:
        preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]

    postprocess_kwargs = {}
    if "top_k" in kwargs:
        postprocess_kwargs["top_k"] = kwargs["top_k"]
    # Always return the three dicts, whether or not top_k was passed
    return preprocess_kwargs, {}, postprocess_kwargs
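
The postprocess() above accepts top_k but does not yet use it. One way that logic could look is sketched below on plain Python lists instead of tensors. The id2label mapping and the logits are made-up values, and the standalone postprocess function is illustrative rather than a drop-in method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label, top_k=5):
    """Return the top_k labels with their probabilities, best first."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [{"label": id2label[i], "score": p} for i, p in ranked[:top_k]]


# Hypothetical label mapping and logits, for illustration only
id2label = {i: f"{i + 1}-star" for i in range(5)}
logits = [3.0, 1.0, 0.5, 0.0, -0.5]

print(postprocess(logits, id2label))
print(postprocess(logits, id2label, top_k=2))
```

With a real model the logits would be a tensor and the label names would come from the model config.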

Registration

Call the PIPELINE_REGISTRY.register_pipeline() method:

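from transformers import AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY

# MyPipeline is the Pipeline subclass defined above
PIPELINE_REGISTRY.register_pipeline(
    "my-new-task",
    pipeline_class=MyPipeline,
    pt_model=AutoModelForSequenceClassification,
)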

Pipelines for specific tasks

ImageClassificationPipeline

>>> from transformers import pipeline
>>> classifier = pipeline(model="microsoft/beit-base-patch16-224-pt22k-ft22k")

>>> classifier("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.442, 'label': 'macaw'}, {'score': 0.088, 'label': 'popinjay'}, {'score': 0.075, 'label': 'parrot'}, {'score': 0.073, 'label': 'parodist, lampooner'}, {'score': 0.046, 'label': 'poll, poll_parrot'}]

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
top_k (int, optional, defaults to 5) — The number of top labels that will be returned by the pipeline.

ImageSegmentationPipeline

>>> from transformers import pipeline

>>> segmenter = pipeline(model="facebook/detr-resnet-50-panoptic")
>>> segments = segmenter("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
>>> len(segments)
2

>>> segments[0]["label"]
'bird'

>>> segments[1]["label"]
'bird'

>>> type(segments[0]["mask"])  # This is a black and white mask showing where is the bird on the original image.
<class 'PIL.Image.Image'>

>>> segments[0]["mask"].size
(768, 512)

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
subtask (str, optional) — Segmentation task to be performed, choose [semantic, instance and panoptic] depending on model capabilities. If not set, the pipeline will attempt to resolve in the following order: panoptic, instance, semantic.
threshold (float, optional, defaults to 0.9) — Probability threshold to filter out predicted masks.
mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
overlap_mask_area_threshold (float, optional, defaults to 0.5) — Mask overlap threshold to eliminate small, disconnected segments.
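
As a rough picture of what mask_threshold does, the sketch below binarizes a small grid of per-pixel probabilities. It is an illustration in plain Python, not the pipeline's code: the grid values are invented, and treating a value exactly equal to the threshold as background is an assumption.

```python
def binarize(mask_probs, mask_threshold=0.5):
    """Turn a 2D grid of per-pixel probabilities into 0/1 values."""
    # Assumption: strictly greater than the threshold counts as foreground
    return [[1 if p > mask_threshold else 0 for p in row] for row in mask_probs]


# Hypothetical 3x4 probability map for one predicted segment
mask_probs = [
    [0.1, 0.6, 0.9, 0.2],
    [0.4, 0.8, 0.7, 0.3],
    [0.0, 0.5, 0.6, 0.1],
]

binary = binarize(mask_probs)
for row in binary:
    print(row)
```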

ObjectDetectionPipeline

>>> from transformers import pipeline

>>> detector = pipeline(model="facebook/detr-resnet-50")

>>> detector("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.997, 'label': 'bird', 'box': {'xmin': 69, 'ymin': 171, 'xmax': 396, 'ymax': 507}}, {'score': 0.999, 'label': 'bird', 'box': {'xmin': 398, 'ymin': 105, 'xmax': 767, 'ymax': 507}}]

>>> # x, y  are expressed relative to the top left hand corner.
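
Since the box coordinates are measured from the top-left corner, the width and height of each detection follow by subtraction. A small sketch using the two detections copied from the output above; box_size is a hypothetical helper name.

```python
# Detections copied from the example output above
detections = [
    {"score": 0.997, "label": "bird", "box": {"xmin": 69, "ymin": 171, "xmax": 396, "ymax": 507}},
    {"score": 0.999, "label": "bird", "box": {"xmin": 398, "ymin": 105, "xmax": 767, "ymax": 507}},
]


def box_size(box):
    """Width and height of a box whose coordinates are measured from the top-left corner."""
    return box["xmax"] - box["xmin"], box["ymax"] - box["ymin"]


sizes = [box_size(d["box"]) for d in detections]
for d, (w, h) in zip(detections, sizes):
    print(f'{d["label"]}: {w} x {h}, score {d["score"]}')
```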

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
threshold (float, optional, defaults to 0.9) — The probability necessary to make a prediction.

ImageToTextPipeline

>>> from transformers import pipeline
>>> captioner = pipeline(model="ydshieh/vit-gpt2-coco-en")

>>> captioner("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'generated_text': 'two birds are standing next to each other '}]

__call__() parameters:

images (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
max_new_tokens (int, optional) — The amount of maximum tokens to generate. By default it will use generate default.
generate_kwargs (Dict, optional) — Pass it to send all of these arguments directly to generate allowing full control of this function.

VisualQuestionAnsweringPipeline

This visual question answering pipeline can currently be loaded from pipeline() using the following task identifiers: “visual-question-answering”, “vqa”.

>>> from transformers import pipeline

>>> oracle = pipeline(model="dandelin/vilt-b32-finetuned-vqa")
>>> image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png"
>>> oracle(question="What is she wearing ?", image=image_url)
[{'score': 0.948, 'answer': 'hat'}, {'score': 0.009, 'answer': 'fedora'}, {'score': 0.003, 'answer': 'clothes'}, {'score': 0.003, 'answer': 'sun hat'}, {'score': 0.002, 'answer': 'nothing'}]

>>> oracle(question="What is she wearing ?", image=image_url, top_k=1)
[{'score': 0.948, 'answer': 'hat'}]

>>> oracle(question="Is this a person ?", image=image_url, top_k=1)
[{'score': 0.993, 'answer': 'yes'}]

>>> oracle(question="Is this a man ?", image=image_url, top_k=1)
[{'score': 0.996, 'answer': 'no'}]

__call__() parameters:

image (str, List[str], PIL.Image or List[PIL.Image]) — The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
question (str, List[str]) — The question(s) asked. If given a single question, it can be broadcasted to multiple images.
top_k (int, optional, defaults to 5)