Transformers 4.37 Chinese Documentation (Part 91)

Latest recommended article published on 2024-07-01 08:04:26

绝不原创的飞龙

License: CC BY-NC-SA 4.0 / Proudly powered by Google Translate

Article link: https://blog.csdn.net/wizardforcel/article/details/139897899

Original: huggingface.co/docs/transformers

MGP-STR

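Original text: huggingface.co/docs/transformers/v4.37.2/en/model_doc/mgp-str

Overview

The MGP-STR model was proposed in Multi-Granularity Prediction for Scene Text Recognition by Peng Wang, Cheng Da, and Cong Yao. MGP-STR is a conceptually simple yet powerful vision Scene Text Recognition (STR) model built on the Vision Transformer (ViT). To integrate linguistic knowledge, a Multi-Granularity Prediction (MGP) strategy is proposed that injects information from the language modality into the model implicitly.

The abstract from the paper is the following:

Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks.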

MGP-STR architecture. Taken from the original paper.

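MGP-STR is trained on two synthetic datasets, MJSynth (MJ) and SynthText (www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST), without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE). This model was contributed by yuekun. The original code can be found here.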

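Inference example

MgpstrModel accepts images as input and generates three types of predictions, which represent textual information at different granularities. The three types of predictions are fused to give the final prediction result.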
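The fusion step can be illustrated with a small, library-free sketch. It assumes that each of the three heads (character, BPE, WordPiece) has already produced one candidate string and one confidence score per image, and that fusion simply keeps the highest-scoring candidate. The function name `fuse_predictions` and the sample strings and scores are hypothetical and only serve the illustration; the library's own logic may differ in detail.

```python
from typing import Dict, List


def fuse_predictions(
    char: List[str], char_scores: List[float],
    bpe: List[str], bpe_scores: List[float],
    wp: List[str], wp_scores: List[float],
) -> Dict[str, list]:
    """Pick, per image, the candidate string with the highest confidence."""
    fused_text, fused_scores = [], []
    for i in range(len(char)):
        candidates = [char[i], bpe[i], wp[i]]
        scores = [char_scores[i], bpe_scores[i], wp_scores[i]]
        best = scores.index(max(scores))  # ties resolve to the earliest head
        fused_text.append(candidates[best])
        fused_scores.append(scores[best])
    return {"generated_text": fused_text, "scores": fused_scores}


# Hypothetical per-head outputs for a batch of two images.
result = fuse_predictions(
    ["ticket", "c0ffee"], [0.91, 0.40],
    ["ticket", "coffee"], [0.85, 0.77],
    ["tickets", "coffee"], [0.30, 0.62],
)
print(result["generated_text"])  # ['ticket', 'coffee']
```

In this toy batch the character head wins for the first image and the BPE head for the second, which shows how a subword head can correct a character-level mistake.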

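The ViTImageProcessor class is responsible for preprocessing the input image, and MgpstrTokenizer decodes the generated character tokens into the target string. MgpstrProcessor wraps ViTImageProcessor and MgpstrTokenizer into a single instance that both extracts the input features and decodes the predicted token ids.

Step-by-step Optical Character Recognition (OCR)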

>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
>>> import requests
>>> from PIL import Image

>>> processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
>>> model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')

>>> # load image from the IIIT-5k dataset
>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(pixel_values)

>>> generated_text = processor.batch_decode(outputs.logits)['generated_text']

MgpstrConfig

`class transformers.MgpstrConfig`

( image_size = [32, 128] patch_size = 4 num_channels = 3 max_token_length = 27 num_character_labels = 38 num_bpe_labels = 50257 num_wordpiece_labels = 30522 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 mlp_ratio = 4.0 qkv_bias = True distilled = False layer_norm_eps = 1e-05 drop_rate = 0.0 attn_drop_rate = 0.0 drop_path_rate = 0.0 output_a3_attentions = False initializer_range = 0.02 **kwargs )

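Parameters

image_size (List[int], optional, defaults to [32, 128]) — The size (resolution) of each image.
patch_size (int, optional, defaults to 4) — The size (resolution) of each patch.
num_channels (int, optional, defaults to 3) — The number of input channels.
max_token_length (int, optional, defaults to 27) — The maximum number of output tokens.
num_character_labels (int, optional, defaults to 38) — The number of classes for the character head.
num_bpe_labels (int, optional, defaults to 50257) — The number of classes for the BPE head.
num_wordpiece_labels (int, optional, defaults to 30522) — The number of classes for the WordPiece head.
hidden_size (int, optional, defaults to 768) — The embedding dimension.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
mlp_ratio (float, optional, defaults to 4.0) — The ratio of the MLP hidden dimension to the embedding dimension.
qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
distilled (bool, optional, defaults to False) — Whether the model includes a distillation token and head, as in DeiT models.
layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
drop_rate (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings and the encoder.
attn_drop_rate (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
drop_path_rate (float, optional, defaults to 0.0) — The stochastic depth rate.
output_a3_attentions (bool, optional, defaults to False) — Whether or not the model should return the attentions of the A³ modules.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.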
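To make the relationship between image_size and patch_size concrete, the following sketch computes how many non-overlapping patches the default configuration yields. The helper `num_patches` is hypothetical and not part of the library; it assumes the image is split into a regular grid of square patches, and it rejects sizes that do not divide evenly as a simplification of its own.

```python
from typing import Sequence


def num_patches(image_size: Sequence[int], patch_size: int) -> int:
    """Count non-overlapping square patches in a (height, width) image."""
    height, width = image_size
    if height % patch_size or width % patch_size:
        # Assumption of this sketch: only evenly divisible sizes are handled.
        raise ValueError("image_size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


# Default values from the parameter list above.
default_patches = num_patches([32, 128], 4)
print(default_patches)  # 256 (an 8 x 32 grid)
```

With the defaults, a 32 x 128 input is cut into an 8 x 32 grid, so the encoder sees 256 patch embeddings before any extra tokens are added.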

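This is the configuration class to store the configuration of an MgpstrModel. It is used to instantiate an MGP-STR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the MGP-STR alibaba-damo/mgp-str-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example: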

>>> from transformers import MgpstrConfig, MgpstrForSceneTextRecognition

>>> # Initializing a Mgpstr mgp-str-base style configuration
>>> configuration = MgpstrConfig()

>>> # Initializing a model (with random weights) from the mgp-str-base style configuration
>>> model = MgpstrForSceneTextRecognition(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

MgpstrTokenizer

`class transformers.MgpstrTokenizer`

( vocab_file unk_token = '[GO]' bos_token = '[GO]' eos_token = '[s]' pad_token = '[GO]' **kwargs )

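Parameters

vocab_file (str) — Path to the vocabulary file.
unk_token (str, optional, defaults to "[GO]") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
bos_token (str, optional, defaults to "[GO]") — The beginning-of-sequence token.
eos_token (str, optional, defaults to "[s]") — The end-of-sequence token.
pad_token (str or tokenizers.AddedToken, optional, defaults to "[GO]") — A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms or loss computation.

Construct an MGP-STR character tokenizer.

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.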
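As a rough illustration of what a character-level tokenizer with these special tokens does, here is a toy version in plain Python. The class `ToyCharTokenizer` and the vocabulary layout (the two special tokens followed by digits and lowercase letters, 38 entries in total) are assumptions made for this sketch; the layout merely happens to match the default num_character_labels of 38, and the real vocabulary is whatever the vocab_file contains.

```python
import string
from typing import Dict, List

GO, EOS = "[GO]", "[s]"  # defaults for unk/bos/pad and eos in the list above


def build_vocab() -> Dict[str, int]:
    """Hypothetical vocabulary: 2 special tokens + 10 digits + 26 letters."""
    symbols = [GO, EOS] + list(string.digits + string.ascii_lowercase)
    return {sym: idx for idx, sym in enumerate(symbols)}


class ToyCharTokenizer:
    def __init__(self) -> None:
        self.vocab = build_vocab()
        self.ids_to_tokens = {i: s for s, i in self.vocab.items()}

    def encode(self, text: str) -> List[int]:
        # Characters outside the vocabulary fall back to the unknown token.
        unk = self.vocab[GO]
        return [self.vocab.get(ch, unk) for ch in text]

    def decode(self, ids: List[int]) -> str:
        chars = []
        for idx in ids:
            token = self.ids_to_tokens[idx]
            if token == EOS:  # stop at the end-of-sequence token
                break
            if token != GO:   # skip start/pad/unknown markers
                chars.append(token)
        return "".join(chars)


tokenizer = ToyCharTokenizer()
ids = tokenizer.encode("abc123")
print(len(tokenizer.vocab), tokenizer.decode(ids))  # 38 abc123
```

Because "[GO]" serves as unknown, start, and padding token at once, a single reserved id covers all three roles in this sketch, and decoding only needs to watch for "[s]" to know where the text ends.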

`save_vocabulary`

( save_directory: str filename_prefix: Optional[str] = None )

MgpstrProcessor

`class transformers.MgpstrProcessor`

( image_processor = None tokenizer = None **kwargs )

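Parameters

image_processor (ViTImageProcessor, optional) — An instance of ViTImageProcessor. The image processor is a required input.
tokenizer (MgpstrTokenizer, optional) — The tokenizer is a required input.

Constructs an MGP-STR processor which wraps an image processor and MGP-STR tokenizers into a single processor.

MgpstrProcessor offers all the functionalities of ViTImageProcessor and MgpstrTokenizer. See __call__() and batch_decode() for more information.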

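`__call__`

( text = None images = None return_tensors = None **kwargs )

When used in normal mode, this method forwards all its arguments to ViTImageProcessor's __call__() and returns its output. If text is not None, this method also forwards the text and kwargs arguments to MgpstrTokenizer's __call__() to encode the text. Please refer to the docstrings of the above methods for more information.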

`batch_decode`

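( sequences ) → Dict[str, any]

Parameters

sequences (torch.Tensor) — List of tokenized input ids.

Returns

Dict[str, any]

Dictionary of all the decoding results:

generated_text (List[str]) — The final results after fusing char, bpe, and wp.
scores (List[float]) — The final scores after fusing char, bpe, and wp.
char_preds (List[str]) — The list of character-decoded sentences.
bpe_preds (List[str]) — The list of BPE-decoded sentences.
wp_preds (List[str]) — The list of WordPiece-decoded sentences.

Convert a list of lists of token ids into a list of strings by calling decode.

This method forwards all its arguments to PreTrainedTokenizer's batch_decode(). Please refer to the docstring of that method for more information.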
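The per-head strings and scores can be thought of as the result of a greedy decode over each head's logits. The sketch below shows one plausible way to obtain them for a single sequence: take the most probable symbol at each step, stop at the end-of-sequence token, and use the product of the chosen probabilities as the confidence. The function `greedy_decode`, the three-symbol vocabulary, and the logits are hypothetical, and details such as how the start position is handled are simplified assumptions rather than the library's exact behavior.

```python
import math
from typing import List, Sequence, Tuple


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(
    logits: Sequence[Sequence[float]], symbols: Sequence[str], eos: str = "[s]"
) -> Tuple[str, float]:
    """Return the decoded text and the product of the per-step probabilities."""
    text, confidence = [], 1.0
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        confidence *= probs[best]  # the end-of-sequence step counts as well
        if symbols[best] == eos:
            break
        text.append(symbols[best])
    return "".join(text), confidence


# Hypothetical logits over a 3-symbol vocabulary, one row per output step.
symbols = ["[s]", "a", "b"]
logits = [
    [0.0, 5.0, 0.0],  # "a"
    [0.0, 0.0, 5.0],  # "b"
    [5.0, 0.0, 0.0],  # "[s]" ends the sequence
    [0.0, 5.0, 0.0],  # ignored, comes after the end token
]
text, confidence = greedy_decode(logits, symbols)
print(text)  # ab
```

A score computed this way drops quickly when any single step is uncertain, which is what makes it usable for comparing the character, BPE, and WordPiece candidates of the same image.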

MgpstrModel

`class transformers.MgpstrModel`

( config: MgpstrConfig )

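Parameters

config (MgpstrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare MGP-STR Model transformer outputting raw hidden states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.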
