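Compared with inference, training adds two things: data augmentation and backpropagation. Although these are the only two additions, training code is much more complex than inference code, so we start with the simpler inference process.
In Hugging Face transformers, inference is handled entirely by the Pipeline class.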
Pipelines are a simple way to run inference.
Instantiation: pipeline() returns a Pipeline object.
A Pipeline object consists of:
- A tokenizer in charge of mapping raw textual input to token.
- A model to make predictions from the inputs.
- Some (optional) post processing for enhancing model’s output.
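To make the three components above concrete, here is a toy sketch in plain Python. Every name in it (toy_tokenize, toy_model, toy_postprocess, toy_pipeline) is a hypothetical stand-in and none of it is transformers code; it only shows how the stages chain together.

```python
# Toy stand-ins for the three pipeline components; this is not transformers code.

def toy_tokenize(text):
    """Map raw text to lowercase word tokens."""
    return text.lower().split()


def toy_model(tokens):
    """Return a raw score: +1 per positive word, -1 per negative word."""
    positive = {"awesome", "great"}
    negative = {"awful", "bad"}
    return float(sum((t in positive) - (t in negative) for t in tokens))


def toy_postprocess(score):
    """Turn the raw score into a label dict."""
    label = "POSITIVE" if score >= 0 else "NEGATIVE"
    return {"label": label, "raw_score": score}


def toy_pipeline(text):
    """Chain the three stages: tokenizer -> model -> post-processing."""
    return toy_postprocess(toy_model(toy_tokenize(text)))


print(toy_pipeline("This restaurant is awesome"))
```

A real Pipeline does the same kind of chaining, with a trained tokenizer and model in place of these toy functions.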
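Pipeline usage examples
Getting a pipeline for a task
The first positional argument of pipeline() is the task.
>>> from transformers import pipeline
>>> # Get a Pipeline object; the string argument selects the pipeline type and is interpreted as the task by default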
>>> pipe = pipeline("text-classification")
>>> # Pass input data to the pipeline object to get the predictions
>>> pipe("This restaurant is awesome")
[{'label': 'POSITIVE', 'score': 0.9998743534088135}]
Getting a pipeline for a specific model
Instead of a task, you can pass the name of the model you want:
>>> # The model name can be passed instead of a task
>>> pipe = pipeline(model="roberta-large-mnli")
>>> pipe("This restaurant is awesome")
[{'label': 'NEUTRAL', 'score': 0.7313136458396912}]
Passing a model object
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
# Sentiment analysis pipeline
analyzer = pipeline("sentiment-analysis")
# Question answering pipeline, specifying the checkpoint identifier
oracle = pipeline(
    "question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="bert-base-cased"
)
# Named entity recognition pipeline, passing in a specific model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
recognizer = pipeline("ner", model=model, tokenizer=tokenizer)
Running a pipeline on several inputs at once
Passing a list of inputs
>>> pipe = pipeline("text-classification")
>>> pipe(["This restaurant is awesome", "This restaurant is awful"])
[{'label': 'POSITIVE', 'score': 0.9998743534088135},
{'label': 'NEGATIVE', 'score': 0.9996669292449951}]
Passing a dataset directly
import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm
pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")
# Just pass the dataset to the pipeline instance
# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
for out in tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....
In fact, passing a generator works as well:
from transformers import pipeline
pipe = pipeline("text-classification")
def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
        # in a server
        # Caveat: because this is iterative, you cannot use `num_workers > 1` variable
        # to use multiple threads to preprocess data. You can still have 1 thread that
        # does the preprocessing while the main runs the big inference
        yield "This is a test"
for out in pipe(data()):
    print(out)
    # For text classification, each output is a dict with "label" and "score" keys
    # {"label": ...., "score": ....}
    # ....
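Because a generator produces inputs lazily, they can be grouped into batches on the fly, which is the idea behind the batch_size parameter. The sketch below shows only that grouping step in plain Python. The helper name batched and the finite data() source are assumptions made for illustration, and this is not how transformers implements batching internally.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, consuming the iterable lazily."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def data():
    # Hypothetical finite source; in practice this could be a queue or a database cursor.
    for i in range(5):
        yield f"This is test {i}"


batches = list(batched(data(), batch_size=2))
for batch in batches:
    print(batch)
```

The last batch may be smaller than batch_size, which is why five inputs with batch_size=2 give three batches.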
transformers.pipeline()
Parameters
There are many parameters; only the most important ones are listed here.
- task: str
- model: str, PreTrainedModel or TFPreTrainedModel
- config: str or PretrainedConfig
- config holds the hyperparameters needed to build the model, not the ones used for training
- tokenizer: str, PreTrainedTokenizer or PreTrainedTokenizerFast
- device: int / str / torch.device
- num_workers (int, optional, defaults to 8)
- batch_size (int, optional, defaults to 1)
- feature_extractor: str or SequenceFeatureExtractor
- The feature extractor that will be used by the pipeline to encode data for the model.
- Feature extractors are used for non-NLP models, such as Speech or Vision models as well as multi-modal models. Multi-modal models will also require a tokenizer to be passed.
- image_processor: str or BaseImageProcessor
- framework: str
- either “pt” for PyTorch or “tf” for TensorFlow.
- revision: str, defaults to "main"
- The model version to use: a git branch name, a tag name or a commit id; rarely needed
- use_fast: bool
- Whether or not to use a Fast tokenizer if possible
- model_kwargs: dict
- Additional keyword arguments passed to from_pretrained()
- kwargs: dict
- Additional arguments required by a specific pipeline
Supported tasks
- “audio-classification”: will return an AudioClassificationPipeline.
- “automatic-speech-recognition”: will return an AutomaticSpeechRecognitionPipeline.
- “conversational”: will return a ConversationalPipeline.
- “depth-estimation”: will return a DepthEstimationPipeline.
- “document-question-answering”: will return a DocumentQuestionAnsweringPipeline.
- “feature-extraction”: will return a FeatureExtractionPipeline.
- “fill-mask”: will return a FillMaskPipeline.
- “image-classification”: will return an ImageClassificationPipeline.
- “image-segmentation”: will return an ImageSegmentationPipeline.
- “image-to-text”: will return an ImageToTextPipeline.
- “mask-generation”: will return a MaskGenerationPipeline.
- “object-detection”: will return an ObjectDetectionPipeline.
- “question-answering”: will return a QuestionAnsweringPipeline.
- “summarization”: will return a SummarizationPipeline.
- “table-question-answering”: will return a TableQuestionAnsweringPipeline.
- “text2text-generation”: will return a Text2TextGenerationPipeline.
- “text-classification” (alias “sentiment-analysis” available): will return a TextClassificationPipeline.
- “text-generation”: will return a TextGenerationPipeline.
- “token-classification” (alias “ner” available): will return a TokenClassificationPipeline.
- “translation”: will return a TranslationPipeline.
- “translation_xx_to_yy”: will return a TranslationPipeline.
- “video-classification”: will return a VideoClassificationPipeline.
- “visual-question-answering”: will return a VisualQuestionAnsweringPipeline.
- “zero-shot-classification”: will return a ZeroShotClassificationPipeline.
- “zero-shot-image-classification”: will return a ZeroShotImageClassificationPipeline.
- “zero-shot-audio-classification”: will return a ZeroShotAudioClassificationPipeline.
- “zero-shot-object-detection”: will return a ZeroShotObjectDetectionPipeline.
Pipeline chunk batching
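zero-shot-classification and question-answering are implemented as ChunkPipeline rather than as a regular Pipeline,
because a single input might yield multiple forward passes of the model. Under normal circumstances, this would cause issues with the batch_size argument.
Calling the pipeline works the same way as before. Conceptually, a regular pipeline runs three steps in sequence:
pipe.preprocess()
pipe.forward()
pipe.postprocess()
In a ChunkPipeline, preprocess() yields several chunks per input, so the flow becomes: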
all_model_outputs = []
for preprocessed in pipe.preprocess(inputs):
    model_outputs = pipe.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipe.postprocess(all_model_outputs)
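The loop above can be tried out with a toy class. The following is a minimal sketch in plain Python: ToyChunkPipeline, its word-based chunking and its "longest word" scoring are all invented for illustration and are not the transformers ChunkPipeline implementation. It only shows one input fanning out into several forward passes and being merged again.

```python
class ToyChunkPipeline:
    """Toy illustration only; not the transformers ChunkPipeline implementation."""

    def __init__(self, chunk_size=3):
        self.chunk_size = chunk_size

    def preprocess(self, text):
        # One input may produce several chunks, hence several forward passes.
        words = text.split()
        for start in range(0, len(words), self.chunk_size):
            yield words[start:start + self.chunk_size]

    def forward(self, chunk):
        # Stand-in for a model call: pick the longest word of the chunk.
        longest = max(chunk, key=len)
        return {"candidate": longest, "score": len(longest)}

    def postprocess(self, all_model_outputs):
        # Merge the per-chunk outputs into one result for the original input.
        return max(all_model_outputs, key=lambda out: out["score"])

    def __call__(self, text):
        all_model_outputs = [self.forward(chunk) for chunk in self.preprocess(text)]
        return self.postprocess(all_model_outputs)


pipe = ToyChunkPipeline(chunk_size=3)
result = pipe("a single input might yield multiple forward passes")
print(result)
```

The caller still makes a single pipe(...) call; the chunking stays hidden inside the class.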
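Writing your own pipeline
First, decide what the inputs and outputs are:
- Inputs: strings / raw bytes / dictionaries / …; these are what preprocess receives
- Outputs: the simpler the better; these are what postprocess returns
Four methods need to be implemented: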
from torch import Tensor  # assumes the PyTorch backend
from transformers import Pipeline
class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # Route user-supplied kwargs to preprocess / _forward / postprocess
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}
    def preprocess(self, inputs, maybe_arg=2):
        # Turn the pipeline inputs into the model inputs
        model_input = Tensor(inputs["input_ids"])
        return {"model_input": model_input}
    def _forward(self, model_inputs):
        # model_inputs == {"model_input": model_input}
        outputs = self.model(**model_inputs)
        # Maybe {"logits": Tensor(...)}
        return outputs
    def postprocess(self, model_outputs):
        # Turn the raw model outputs into what the user should see
        best_class = model_outputs["logits"].softmax(-1)
        return best_class
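preprocess()
Takes the inputs you defined for the pipeline and converts them into the model's inputs (the return value of preprocess). Note the difference between the pipeline's inputs and the model's inputs.
The return value of preprocess() is usually a dictionary, which is then unpacked into the model call as keyword arguments (**model_inputs).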
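_forward()
forward() contains safeguards that make everything run on the expected device. All model-related code goes into _forward(), which forward() then calls.
Note that only model-related code belongs in _forward(); pre-processing and post-processing go into their own methods.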
postprocess()
The output of _forward() is the input of postprocess(), which turns it into the output the user wants.
_sanitize_parameters()
This function exists to allow users to pass any parameters whenever they wish, be it at initialization time, pipeline(...., maybe_arg=4), or at call time, pipe = pipeline(...); output = pipe(...., maybe_arg=4).
It returns three dicts, which are passed to preprocess(), _forward() and postprocess() respectively.
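To see why returning three dicts lets a parameter be set at either time, here is a simplified sketch in plain Python. ToyPipeline and Repeat are hypothetical classes written for this note. The assumption is that call-time values override initialization-time values; the real transformers.Pipeline does considerably more than this.

```python
class ToyPipeline:
    """Toy base class; a simplified sketch, not the real transformers.Pipeline."""

    def __init__(self, **kwargs):
        # Parameters given at initialization time become the defaults.
        self._pre, self._fwd, self._post = self._sanitize_parameters(**kwargs)

    def __call__(self, inputs, **kwargs):
        # Parameters given at call time override the initialization-time ones.
        pre, fwd, post = self._sanitize_parameters(**kwargs)
        pre = {**self._pre, **pre}
        fwd = {**self._fwd, **fwd}
        post = {**self._post, **post}
        model_inputs = self.preprocess(inputs, **pre)
        model_outputs = self._forward(model_inputs, **fwd)
        return self.postprocess(model_outputs, **post)


class Repeat(ToyPipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        if "maybe_arg" in kwargs:
            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
        return {"text": inputs, "times": maybe_arg}

    def _forward(self, model_inputs):
        # Stand-in for the model: repeat the text.
        return " ".join([model_inputs["text"]] * model_inputs["times"])

    def postprocess(self, model_outputs):
        return model_outputs


print(Repeat()("hi"))                          # default maybe_arg=2
print(Repeat(maybe_arg=4)("hi"))               # set at initialization time
print(Repeat(maybe_arg=4)("hi", maybe_arg=1))  # overridden at call time
```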
Example
Target behavior:
>>> pipe = pipeline("my-new-task")
>>> pipe("This is a test")
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05},
 {"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}]
>>> pipe("This is a test", top_k=2)
[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}]
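Subclassing Pipeline
The first call passes nothing but the input data and returns the top 5 results, so top_k defaults to 5 (it is a parameter of postprocess()). To support this, edit _sanitize_parameters() so that it forwards the parameter: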
def postprocess(self, model_outputs, top_k=5):
    best_class = model_outputs["logits"].softmax(-1)
    # Add logic to handle top_k
    return best_class
def _sanitize_parameters(self, **kwargs):
    preprocess_kwargs = {}
    if "maybe_arg" in kwargs:
        preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
    postprocess_kwargs = {}
    if "top_k" in kwargs:
        postprocess_kwargs["top_k"] = kwargs["top_k"]
    return preprocess_kwargs, {}, postprocess_kwargs
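The snippet above leaves the top_k handling itself unwritten. One possible way to fill it in is sketched below in plain Python, without torch, so that it runs anywhere. The logits, the id2label map and the standalone postprocess function are made up for illustration; inside a real pipeline the logits would come from model_outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label, top_k=5):
    """Return the top_k labels sorted by descending probability."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [{"label": id2label[i], "score": p} for i, p in ranked[:top_k]]


# Hypothetical logits and label map for a five-class star-rating model.
id2label = {i: f"{i + 1}-star" for i in range(5)}
logits = [3.0, 1.0, 0.5, 0.0, -0.5]

top2 = postprocess(logits, id2label, top_k=2)
print([item["label"] for item in top2])
```

With the default top_k=5 all five labels come back, which matches the target behavior described earlier.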
Registration
Call the PIPELINE_REGISTRY.register_pipeline() method:
from transformers import AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY
PIPELINE_REGISTRY.register_pipeline(
    "new-task",
    pipeline_class=MyPipeline,
    pt_model=AutoModelForSequenceClassification,
)
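Conceptually, registration maps a task name to the class that handles it, so that a later lookup by name can build the right object. A toy dict-based sketch of that idea follows. ToyRegistry, its build method and EchoPipeline are hypothetical names, and nothing here reflects how PIPELINE_REGISTRY is actually implemented.

```python
class ToyRegistry:
    """Toy task-name -> class mapping; not the transformers PIPELINE_REGISTRY."""

    def __init__(self):
        self._tasks = {}

    def register_pipeline(self, task, pipeline_class):
        self._tasks[task] = pipeline_class

    def build(self, task, **kwargs):
        # Look up the class registered for the task and instantiate it.
        if task not in self._tasks:
            raise KeyError(f"Unknown task {task!r}; known tasks: {sorted(self._tasks)}")
        return self._tasks[task](**kwargs)


class EchoPipeline:
    """Placeholder pipeline that returns its input unchanged."""

    def __call__(self, text):
        return text


registry = ToyRegistry()
registry.register_pipeline("new-task", EchoPipeline)
pipe = registry.build("new-task")
print(pipe("This is a test"))
```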
Task-specific pipelines
ImageClassificationPipeline
>>> from transformers import pipeline
>>> classifier = pipeline(model="microsoft/beit-base-patch16-224-pt22k-ft22k")
>>> classifier("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.442, 'label': 'macaw'}, {'score': 0.088, 'label': 'popinjay'}, {'score': 0.075, 'label': 'parrot'}, {'score': 0.073, 'label': 'parodist, lampooner'}, {'score': 0.046, 'label': 'poll, poll_parrot'}]
Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
top_k (int, optional, defaults to 5): The number of top labels that will be returned by the pipeline.
ImageSegmentationPipeline
>>> from transformers import pipeline
>>> segmenter = pipeline(model="facebook/detr-resnet-50-panoptic")
>>> segments = segmenter("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
>>> len(segments)
2
>>> segments[0]["label"]
'bird'
>>> segments[1]["label"]
'bird'
>>> type(segments[0]["mask"]) # This is a black and white mask showing where is the bird on the original image.
<class 'PIL.Image.Image'>
>>> segments[0]["mask"].size
(768, 512)
Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
subtask (str, optional): Segmentation task to be performed, choose [semantic, instance and panoptic] depending on model capabilities. If not set, the pipeline will attempt to resolve in the following order: panoptic, instance, semantic.
threshold (float, optional, defaults to 0.9): Probability threshold to filter out predicted masks.
mask_threshold (float, optional, defaults to 0.5): Threshold to use when turning the predicted masks into binary values.
overlap_mask_area_threshold (float, optional, defaults to 0.5): Mask overlap threshold to eliminate small, disconnected segments.
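To illustrate what "turning the predicted masks into binary values" means for mask_threshold, here is a tiny sketch on a hand-written 2x2 grid of probabilities. The binarize helper is hypothetical, and whether the library uses a strict or non-strict comparison is an assumption here.

```python
def binarize(mask_probs, mask_threshold=0.5):
    """Map each probability to 1 if it is strictly above the threshold, else 0."""
    return [[1 if p > mask_threshold else 0 for p in row] for row in mask_probs]


# Hypothetical per-pixel probabilities for a 2x2 mask.
probs = [
    [0.1, 0.6],
    [0.5, 0.9],
]

binary_mask = binarize(probs, mask_threshold=0.5)
print(binary_mask)
```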
ObjectDetectionPipeline
>>> from transformers import pipeline
>>> detector = pipeline(model="facebook/detr-resnet-50")
>>> detector("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'score': 0.997, 'label': 'bird', 'box': {'xmin': 69, 'ymin': 171, 'xmax': 396, 'ymax': 507}}, {'score': 0.999, 'label': 'bird', 'box': {'xmin': 398, 'ymin': 105, 'xmax': 767, 'ymax': 507}}]
>>> # x, y are expressed relative to the top left hand corner.
Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
threshold (float, optional, defaults to 0.9): The probability necessary to make a prediction.
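The detector output shown earlier is a plain list of dicts, so it is easy to process afterwards. As a small sketch, the code below copies the two detections from that example, filters them by score on the caller's side and computes a box area in pixels. The helper names filter_detections and box_area are made up, and keeping scores at or above the threshold is an assumption of this sketch.

```python
# The two detections from the example output above.
detections = [
    {"score": 0.997, "label": "bird", "box": {"xmin": 69, "ymin": 171, "xmax": 396, "ymax": 507}},
    {"score": 0.999, "label": "bird", "box": {"xmin": 398, "ymin": 105, "xmax": 767, "ymax": 507}},
]


def box_area(box):
    """Area in pixels; x and y are relative to the top left corner."""
    return (box["xmax"] - box["xmin"]) * (box["ymax"] - box["ymin"])


def filter_detections(items, threshold=0.9):
    """Keep detections whose score is at or above the threshold."""
    return [d for d in items if d["score"] >= threshold]


kept = filter_detections(detections, threshold=0.998)
print(len(kept), box_area(kept[0]["box"]))
```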
ImageToTextPipeline
>>> from transformers import pipeline
>>> captioner = pipeline(model="ydshieh/vit-gpt2-coco-en")
>>> captioner("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
[{'generated_text': 'two birds are standing next to each other '}]
Parameters of __call__():
images (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
max_new_tokens (int, optional): The amount of maximum tokens to generate. By default it will use generate default.
generate_kwargs (Dict, optional): Pass it to send all of these arguments directly to generate allowing full control of this function.
VisualQuestionAnsweringPipeline
This visual question answering pipeline can currently be loaded from pipeline() using the following task identifiers: “visual-question-answering”, “vqa”.
>>> from transformers import pipeline
>>> oracle = pipeline(model="dandelin/vilt-b32-finetuned-vqa")
>>> image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/lena.png"
>>> oracle(question="What is she wearing ?", image=image_url)
[{'score': 0.948, 'answer': 'hat'}, {'score': 0.009, 'answer': 'fedora'}, {'score': 0.003, 'answer': 'clothes'}, {'score': 0.003, 'answer': 'sun hat'}, {'score': 0.002, 'answer': 'nothing'}]
>>> oracle(question="What is she wearing ?", image=image_url, top_k=1)
[{'score': 0.948, 'answer': 'hat'}]
>>> oracle(question="Is this a person ?", image=image_url, top_k=1)
[{'score': 0.993, 'answer': 'yes'}]
>>> oracle(question="Is this a man ?", image=image_url, top_k=1)
[{'score': 0.996, 'answer': 'no'}]
Parameters of __call__():
image (str, List[str], PIL.Image or List[PIL.Image]): The pipeline handles three types of images:
- A string containing a http link pointing to an image
- A string containing a local path to an image
- An image loaded in PIL directly
question (str, List[str]): The question(s) asked. If given a single question, it can be broadcasted to multiple images.
top_k (int, optional, defaults to 5)