Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (Paper Walkthrough)

Article link: https://blog.csdn.net/weixin_38252409/article/details/140345271

Contents

Preface
I. Introduction
II. Rethinking Vision Model Pre-training and Comprehensive Multitask Learning
- 1. Rethinking Vision Model Pre-training
- 2. Comprehensive Multitask Learning
III. Method
IV. Data Engine
V. Dataset
VI. Experiments
VII. Related Works
- 1. Vision-Language Foundation Models
- 2. Vision Datasets
VIII. Conclusion

Preface

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text prompts as task instructions and generate desirable results in text form, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.

Paper: https://arxiv.org/pdf/2311.06242

I. Introduction

In the realm of Artificial General Intelligence (AGI) systems, there has been a notable shift towards utilizing pretrained, versatile representations, acknowledged for task-agnostic benefits across diverse applications. This trend is evident in natural language processing (NLP), where advanced models [5, 6, 19, 43, 65, 66] show adaptability with comprehensive knowledge spanning various domains and tasks with simple instructions. The success of NLP motivates a parallel approach in computer vision.

Universal representation for diverse vision-related tasks presents unique challenges, notably the need for comprehensive perceptual abilities. Unlike NLP, which deals mainly with text, computer vision requires handling intricate visual data like object location, masked contours, and attributes. Attaining universal representation in computer vision demands adept management of a spectrum of complex tasks, organized two-dimensionally as illustrated in Figure 1:
• Spatial Hierarchy: The model must discern spatial details across varying scales, understanding image-level concepts and fine-grained pixel specifics. Accommodating the intricate spatial hierarchy within vision demands the model's proficiency in handling diverse levels of granularity.

• Semantic Granularity: Universal representation in computer vision should span a spectrum of semantic granularity. The model transitions from high-level captions to nuanced descriptions, enabling versatile understanding for diverse applications.

This pursuit is characterized by distinctiveness and substantial challenges. A key hurdle is the scarcity of comprehensive visual annotations, hindering the development of a foundational model capable of capturing the intricate nuances of spatial hierarchy and semantic granularity. Existing datasets, such as ImageNet [18], COCO [48], and Flickr30k Entities [61], tailored for specialized applications, are extensively labeled by humans. To overcome this constraint, it is imperative to generate extensive annotations for each image on a larger scale.

Another challenge is the absence of a unified pretraining framework with a singular network architecture that seamlessly integrates spatial hierarchy and semantic granularity in computer vision. Traditional models excel in tasks like object detection [26, 97], semantic segmentation [16, 82], and image captioning [45, 78] with task-specific designs. However, it is essential to develop a comprehensive, unified model that is capable of adapting across various vision tasks in a task-agnostic manner, even accommodating new tasks with minimal or no task-specific fine-tuning.

The model Florence [95] pioneers the integration of spatial, temporal, and multi-modal aspects in computer vision through unified pre-training and network architecture. The first evolutionary version [95] excels in transfer learning via pre-training with noisy text-image pairs and task-specific fine-tuning using specialized adapters. However, it relies on large task-specific datasets and adapters, leaving gaps in addressing the above dual key challenges.

In this paper, we introduce Florence-2, a universal backbone achieved through multitask learning with extensive visual annotations. This results in a unified, prompt-based representation for diverse vision tasks, effectively addressing the challenges of limited comprehensive data and the absence of a unified architecture.

Multitask learning necessitates large-scale, high-quality annotated data. Our data engine, instead of relying on labor-intensive manual annotation, autonomously generates a comprehensive visual dataset called FLD-5B, encompassing a total of 5.4B annotations for 126M images. This engine consists of two efficient processing modules. The first module uses specialized models to collaboratively and autonomously annotate images, moving away from the traditional single and manual annotation approach. Multiple models work together to reach a consensus, reminiscent of the wisdom of crowds concept [33, 80, 89], ensuring a more reliable and unbiased image understanding. The second module iteratively refines and filters these automated annotations using well-trained foundational models.

By utilizing this extensive dataset, our model employs a sequence-to-sequence (seq2seq) architecture [17, 19, 66, 76], which integrates an image encoder and a multi-modality encoder-decoder. This design accommodates a spectrum of vision tasks without the need for task-specific architectural modifications, aligning with the ethos of the NLP community for versatile model development with a consistent underlying structure. All annotations in the dataset FLD-5B are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization, using the same loss function as the objective. The outcome is a versatile vision foundation model, Florence-2, capable of performing a variety of tasks, such as object detection, captioning, and grounding, all within a single model governed by a uniform set of parameters. Task activation is achieved through textual prompts, reflecting the approach used by Large Language Models (LLMs) [65].

Our approach attains a universal representation, demonstrating broad applicability across various visual tasks. Key results include:
• As a versatile vision foundation model, Florence-2 achieves new state-of-the-art zero-shot performance in tasks such as captioning on COCO [48], visual grounding on Flickr30k [61], and referring expression comprehension on RefCOCO/+/g [31, 56, 93].
• After fine-tuning with public human-annotated data, Florence-2, despite its compact size, competes with larger specialist models. Notably, the fine-tuned Florence-2 establishes new state-of-the-art results on the RefCOCO/+/g benchmarks.
• The pre-trained Florence-2 backbone enhances performance on downstream tasks, e.g. COCO object detection and instance segmentation, and ADE20K semantic segmentation, surpassing both supervised and self-supervised models. Compared to pre-trained models on ImageNet, ours improves training efficiency by 4× and achieves substantial improvements of 6.9, 5.5, and 5.9 points on COCO [48] and ADE20K [98] datasets, using Mask-RCNN [26], DINO [97], and UperNet [82] frameworks respectively.

II. Rethinking Vision Model Pre-training and Comprehensive Multitask Learning

1. Rethinking Vision Model Pre-training

In pursuit of a versatile vision foundation model, we revisit three predominant pre-training paradigms: supervised (e.g., ImageNet classification [18]), self-supervised (e.g., SimCLR [9], MoCo [25], BEiT [4], MAE [24]), and weakly supervised (e.g., CLIP [64], Florence [95], SAM [32]). Each paradigm captures unique aspects of visual data but is inherently limited by the constraints of single-task learning frameworks. Supervised pre-training excels in object recognition but lacks adaptability [38]; self-supervised algorithms reveal intricate features but may overemphasize certain attributes [8]; weakly supervised methods leverage unstructured textual annotations but yield only image-level understanding [64]. To build a unified vision foundation model suitable for various applications, we must explore innovative pre-training strategies that overcome single-task limitations and integrate both textual and visual semantics.

Image understanding necessitates capturing multiple levels of granularity, from global semantics to local details, and comprehending spatial relationships between objects and entities in their semantic context. To address these core aspects of image understanding, our approach incorporates a diverse set of annotations, effectively capturing visual understanding nuances and bridging the gap between vision and language understanding.

2. Comprehensive Multitask Learning

To develop a versatile vision foundation model, we formulate a range of multitask learning objectives, each tailored to address specific aspects of visual comprehension. These objectives align with our predefined criteria: spatial hierarchy and semantic granularity, inspired by recent research on multitask learning [2,12,14,15,55,79]. Our multitask learning approach incorporates three distinct learning objectives, each addressing a different level of granularity and semantic understanding:

• Image-level understanding tasks capture high-level semantics and foster a comprehensive understanding of images through linguistic descriptions [13, 18, 34, 91]. They enable the model to comprehend the overall context of an image and grasp semantic relationships and contextual nuances in the language domain. Exemplar tasks include image classification, captioning, and visual question answering.

• Region/pixel-level recognition tasks facilitate detailed object and entity localization within images, capturing relationships between objects and their spatial context. Tasks include object detection, segmentation, and referring expression comprehension.

• Fine-grained visual-semantic alignment tasks require fine-grained understanding of both text and image. They involve locating the image regions that correspond to the text phrases, such as objects, attributes, or relations. These tasks challenge the ability to capture the local details of visual entities and their semantic contexts, as well as the interactions between textual and visual elements.

By combining these three learning objectives in a multitask learning framework, our foundation model learns to handle different levels of detail and semantic understanding. This strategic alignment enables our model to deal with various spatial details, distinguish levels of detail in understanding, and go beyond surface-level recognition—ultimately learning a universal representation for vision understanding.

III. Method

We present the foundation model Florence-2, designed for universal representation learning, capable of handling various vision tasks with a single set of weights and a unified architecture. As depicted in Figure 2, Florence-2 employs a sequence-to-sequence learning paradigm [77], integrating all tasks, described in Section 2, under a common language modeling objective. The model takes images coupled with task-prompt as task instructions, and generates the desirable results in text forms. It uses a vision encoder to convert images into visual token embeddings, which are then concatenated with text embeddings and processed by a transformer-based multi-modal encoder-decoder to generate the response. In the following sections, we will provide a detailed explanation of each model component.

Task formulation.

We adopt a sequence-to-sequence framework [10,15,55,77] to address various vision tasks in a unified manner. As shown in Table 13, we formulate each task as a translation problem: Given an input image and a task-specific prompt, we generate the corresponding output response. Depending on the task, the prompt and response can be either text or region:

• Text: When the prompt or answer is plain text without special formatting, we maintain it in our final sequence-to-sequence format.

• Region: For region-specific tasks, we add location tokens to the tokenizer's vocabulary list, representing quantized coordinates. We create 1,000 bins, similar to [10, 11, 55, 79], and represent regions using formats tailored to task requirements:

– Box representation (x0, y0, x1, y1): Utilized in tasks such as object detection and dense region captioning, with location tokens corresponding to the box coordinates. The location tokens are the coordinates of the top-left and bottom-right corners of the box.

– Quad box representation (x0, y0, …, x3, y3): For text detection and recognition tasks, using location tokens for each coordinate of the quadrilateral enclosing the text. The location tokens are the coordinates of each corner of the quad box, starting from the top-left and going clockwise.

– Polygon Representation (x0, y0, …, xn, yn): For referring segmentation tasks, with location tokens representing the vertices of the polygon. The location tokens are the coordinates of the vertices of the polygon, in clockwise order.

By extending the tokenizer’s vocabulary to include location tokens, we enable the model to process region-specific information in a unified learning format. This eliminates the need to design task-specific heads for different tasks and allows for a more data-centric approach.
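
The region formats above can be illustrated with a minimal sketch of coordinate quantization. It assumes coordinates are scaled linearly into 1,000 bins relative to the image size, and the `<loc_k>` token spelling and helper names are hypothetical choices for this example, not something the paper text specifies.

```python
NUM_BINS = 1000  # number of coordinate bins stated in the text


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    return min(max(bin_index, 0), num_bins - 1)


def region_to_tokens(points, width, height):
    """Turn a list of (x, y) points into a flat list of location tokens.

    The "<loc_k>" spelling is an assumption made for this sketch.
    """
    tokens = []
    for x, y in points:
        tokens.append(f"<loc_{quantize(x, width)}>")
        tokens.append(f"<loc_{quantize(y, height)}>")
    return tokens


def box_to_tokens(x0, y0, x1, y1, width, height):
    """Box format: top-left corner followed by bottom-right corner."""
    return region_to_tokens([(x0, y0), (x1, y1)], width, height)


# A detection-style target: category text followed by its box tokens.
box = box_to_tokens(64, 48, 320, 240, width=640, height=480)
target = "cat" + "".join(box)
```

Quad boxes and polygons would go through `region_to_tokens` directly, with four corners or an arbitrary number of vertices listed in clockwise order.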

Vision encoder

We employ DaViT [20] as the vision encoder. It processes an input image $I \in \mathbb{R}^{H \times W \times 3}$ (with $H$ and $W$ denoting height and width, respectively) into flattened visual token embeddings $V \in \mathbb{R}^{N_v \times D_v}$, where $N_v$ and $D_v$ represent the number and dimensionality of vision tokens, respectively.

Multi-modality encoder decoder

We use a standard encoder-decoder transformer architecture to process visual and language token embeddings. We first obtain prompt text embeddings $T_{prompt} \in \mathbb{R}^{N_t \times D}$ using our extended language tokenizer and word embedding layer [43]. Then, we concatenate vision token embeddings with prompt embeddings to form the multi-modality encoder module input, $X = [V', T_{prompt}]$, where $V' \in \mathbb{R}^{N_v \times D}$ is obtained by applying a linear projection and LayerNorm layer [3] to $V$ for dimensionality alignment.
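
To make the shapes concrete, here is a toy sketch of how the encoder input could be assembled: project the vision tokens from width D_v to D, apply layer normalization, then concatenate with the prompt embeddings along the sequence axis. The sizes, random values, and the omission of learned LayerNorm scale and shift are all simplifying assumptions for illustration.

```python
import math
import random


def linear(rows, weight):
    """Multiply an (N x Din) list of rows by a (Din x Dout) weight matrix."""
    return [[sum(r[k] * weight[k][j] for k in range(len(weight)))
             for j in range(len(weight[0]))] for r in rows]


def layer_norm(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for r in rows:
        mean = sum(r) / len(r)
        var = sum((v - mean) ** 2 for v in r) / len(r)
        out.append([(v - mean) / math.sqrt(var + eps) for v in r])
    return out


random.seed(0)
N_v, D_v, N_t, D = 4, 6, 3, 8  # toy sizes, chosen only for illustration

V = [[random.gauss(0, 1) for _ in range(D_v)] for _ in range(N_v)]       # vision tokens
T_prompt = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N_t)]  # prompt embeddings
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D_v)]         # projection D_v -> D

V_prime = layer_norm(linear(V, W))  # shape N_v x D
X = V_prime + T_prompt              # sequence concatenation: (N_v + N_t) x D
```

The encoder then sees one sequence of N_v + N_t tokens, all of width D.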

Optimization objective

Given the input $x$ combined from the image and the prompt, and the target $y$, we use the standard language modeling with cross-entropy loss for all the tasks:

$$\mathcal{L} = -\sum_{i=1}^{|y|} \log P_{\theta}(y_i \mid y_{<i}, x),$$

where $\theta$ are the network parameters and $|y|$ is the number of target tokens.
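
As a small worked example, the language modeling objective can be computed as the summed negative log-likelihood of the target tokens. The per-step probability tables below are made-up numbers, and summing rather than averaging over tokens is an assumption of this sketch.

```python
import math


def lm_loss(step_probs, target_ids):
    """Negative log-likelihood of the target sequence.

    step_probs[i] is the model's distribution over the vocabulary at step i,
    already conditioned on the input x and on the earlier target tokens.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))


# Toy vocabulary of 3 tokens and a target of length |y| = 2.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_ids = [0, 1]
loss = lm_loss(step_probs, target_ids)  # equals -(log 0.7 + log 0.8)
```

Because every task's output is serialized as tokens (text or location tokens), this one loss covers captioning, detection, and grounding alike.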

IV. Data Engine

To train our Florence-2 model, we require a comprehensive, large-scale, high-quality multitask dataset encompassing various image data aspects. Given the scarcity of such data, we have developed a new multitask image dataset. This dataset, FLD-5B, includes 126M images, 500M text annotations, 1.3B text-region annotations, and 3.6B text-phrase-region annotations across different tasks. We extensively explain our data collection and annotation procedures, encompassing adaptations for various annotation types. The data engine pipeline, shown in Figure 3, will be discussed in subsequent sections.

1. Image Collection

We construct our data by gathering a diverse collection of images from various sources. We begin with the identification of three key tasks that act as primary sources for our image corpus: image classification, object detection, and image captioning. Consequently, we curate and combine five distinct datasets originating from the aforementioned tasks: ImageNet-22k [18], Object 365 [70], Open Images [40], Conceptual Captions [71], and LAION [68] filtered by [45]. This combination results in a dataset of 126 million images in total.

2. Data Annotation

Our primary objective is to generate comprehensive annotations that can support multitask learning effectively. Accordingly, our annotation endeavors span a comprehensive range of tasks, encapsulated within three discrete annotation categories: text, region-text pairs, and text-phrase-region triplets, as illustrated in Figure 4. The data annotation workflow consists of three essential phases, each of which ensures the accuracy and quality of the annotations: (1) initial annotation employing specialist models, (2) data filtering to correct errors and remove irrelevant annotations, and (3) an iterative process for data refinement.

Initial annotation with specialist models

To initiate the annotation process for each annotation type, we employ synthetic labels obtained from specialist models. These specialist models are a combination of offline models trained on a diverse range of publicly available datasets and online services hosted on cloud platforms. They are specifically tailored to excel in annotating their respective annotation types.

It is worth noting that certain image datasets may already contain partial annotations for some annotation types. For instance, the Object 365 [70] dataset already includes human-annotated bounding boxes and corresponding categories as region-text annotations. In such cases, we merge the pre-existing annotations with the synthetic labels generated by the specialist models. This approach enhances the coverage and diversity of the annotations.

Moreover, specific annotations, such as detailed descriptions in the text annotation type, are represented by datasets of a considerably small size. This inherently poses challenges in obtaining high-performance specialist models. Consequently, we opt to omit these tasks during the initial annotation phase. Annotations for these tasks are generated later during the iterative data refinement process.

In summation, through the rigorous initial annotation procedures, we ensure that the aggregated dataset of 126 million images is comprehensively labeled across the majority of annotation types.

Data filtering and enhancement

The initial annotations obtained from the specialist models, while comprehensive, are susceptible to noise and imprecision. In response to this challenge, we have implemented a multifaceted filtering process to refine and eliminate undesired annotations. Our general filtering protocol mainly focuses on two data types in the annotations: text and region data.

First, pertaining to textual annotations, we are inspired by DiHT [63] and develop a parsing tool based on SpaCy [28] to extract objects, attributes, and actions. We filter out texts containing excessive objects, as they tend to introduce noise and may not accurately reflect the actual content in the corresponding images. Additionally, we assess the complexity of the actions and objects by measuring the degree of their nodes in the dependency parsing tree. We retain texts with a certain minimum action and object complexity to ensure the richness of visual concepts in the images.
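
The text filter can be sketched as follows. This is one possible reading, not the authors' tool: `ParsedToken` is a hypothetical stand-in for the output of a dependency parser, and the thresholds `max_objects` and `min_complexity` are invented, since the text gives no concrete values.

```python
from dataclasses import dataclass


@dataclass
class ParsedToken:
    text: str
    role: str  # "object", "action", or "other" (assigned by the parser)
    head: int  # index of the syntactic head; the root points to itself


def degrees(tokens):
    """Degree of each node in the dependency tree (edge to head plus edges to children)."""
    deg = [0] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.head != i:  # skip the root's self-reference
            deg[i] += 1
            deg[tok.head] += 1
    return deg


def keep_text(tokens, max_objects=5, min_complexity=2):
    """Hypothetical thresholds; the source gives no concrete values."""
    deg = degrees(tokens)
    objects = [i for i, t in enumerate(tokens) if t.role == "object"]
    actions = [i for i, t in enumerate(tokens) if t.role == "action"]
    if len(objects) > max_objects:
        return False  # too many objects: likely noisy
    richest = max((deg[i] for i in objects + actions), default=0)
    return richest >= min_complexity


# "A dog chases a red ball", with "chases" as the root.
sentence = [
    ParsedToken("A", "other", 1),
    ParsedToken("dog", "object", 2),
    ParsedToken("chases", "action", 2),
    ParsedToken("a", "other", 5),
    ParsedToken("red", "other", 5),
    ParsedToken("ball", "object", 2),
]
keep = keep_text(sentence)
```

In practice the roles and heads would come from a real parser; only the counting and thresholding logic is shown here.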

Second, in relation to the region annotations, specifically bounding boxes, we remove the noisy boxes under a confidence score threshold. Complementing this, we also employ non-maximum suppression to reduce redundant or overlapping bounding boxes.
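
A minimal sketch of this box filter, assuming greedy non-maximum suppression over `(x0, y0, x1, y1)` boxes. The threshold values are illustrative assumptions, since the text does not state them.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    # Step 1: drop boxes below the confidence threshold.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    # Step 2: greedy NMS, keeping a box only if it does not overlap a kept one too much.
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = filter_boxes(boxes, scores)
```

Here the second box overlaps the first heavily and is suppressed, and the last box falls below the confidence threshold.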

Iterative data refinement

Using our filtered initial annotations, we trained a multitask model that processes sequences of data. Upon evaluating this model against our training images, we discerned a marked enhancement in its predictions, particularly in instances where original labels were marred by inaccuracies or extraneous noise, such as in alt-texts. Motivated by these findings, we integrated these updated annotations with our original ones and subjected the model to another training iteration. This cyclical refinement process incrementally improves the quality of our training dataset.

In the case of tasks we initially bypassed due to insufficient data for the training of a robust specialist model, we leveraged the iteratively trained model for pre-training purposes. Subsequent fine-tuning of this pre-trained model with the sparse dataset showcased superior performance compared to a model trained from scratch on the same data. Thus, we harness the fine-tuned model as a specialist for annotating our expansive dataset comprising 126 million images, ensuring comprehensive annotation coverage.

3. Annotation-specific Variations

In Section 4.2, we introduce our general annotation workflow. This section delves into each annotation type and the corresponding variations of the annotation procedure.

Text

Text. Text annotations categorize images using three types of granularities: brief, detailed, and more detailed. The brief text includes only one sentence that demonstrates the most salient objects and activities, which is similar to COCO caption [13]. In contrast, the detailed text and more detailed text contain multiple sentences that describe the image with richer objects, attributes, and actions.

For the brief text, a Florence-2 model is trained as the specialist on publicly available image caption and image-text datasets, creating an image-to-text model for initial annotations. Iterative refinement is used to minimize noise in these texts. For the detailed text, prompts that include existing image annotations, such as the brief text and region-text annotations, are fed to large language models (LLMs) or large multimodal models (LMMs) to generate comprehensive descriptions. Due to the high cost of the large models, only a small set of detailed text and more detailed text is generated. These are used to fine-tune the caption specialist, developing a detailed description specialist for further annotations.

Region-text pairs

Region-text pairs. The region-text pairs provide descriptive textual annotation for semantic regions in the image. Semantic regions include regions of visual objects as well as text regions. Each region is represented by a tight bounding box that surrounds it. Moreover, each region can be annotated with varying degrees of granularity, including phrases and sentences, that contribute to a richer understanding of the region.

Region-text pairs are annotated differently for text regions and visual object regions. Text regions are labeled using Azure AI Services’ OCR API [1], while visual objects are initially annotated with a DINO object detector [97] trained on public datasets. Data filtering, including confidence thresholding and non-maximum suppression, removes noisy boxes. Textual annotations for the visual object regions are further enriched by brief text generated from an image-to-text model with cropped image regions. Each region then receives three textual annotations: phrase from object category, brief text, and noun phrase chunks from the brief text. The Florence-1 [95] model determines the most similar textual annotation to each image region.
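
The final selection step can be sketched as picking, among the three candidate texts, the one whose embedding is closest to the region's embedding. How Florence-1 actually scores similarity is not described here, so cosine similarity over embeddings is an assumption, and the vectors below are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_annotation(region_embedding, candidates):
    """candidates maps each text to its embedding; returns the best-matching text."""
    return max(candidates, key=lambda text: cosine(region_embedding, candidates[text]))


# Hypothetical embeddings for one cropped region and its three candidate texts.
region = [0.9, 0.1, 0.0]
candidates = {
    "dog": [1.0, 0.0, 0.0],                     # phrase from the object category
    "a dog running on grass": [0.7, 0.7, 0.0],  # brief text
    "grass": [0.0, 1.0, 0.0],                   # noun chunk from the brief text
}
best = pick_annotation(region, candidates)
```

With these toy vectors the category phrase wins; on real data the winner varies by region, which is why the resulting annotations mix phrases, noun chunks, and brief text.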

Text-phrase-region triplets

Text-phrase-region triplets. Text-phrase-region triplets consist of a descriptive text of the image, noun phrases in this text related to image objects, and region annotations for these objects. The text includes brief, detailed, and more detailed text generated earlier. For each text, the Grounding DINO model [50] identifies noun phrases and creates bounding boxes for them. Additionally, the SAM model [32] generates segmentation masks for each box, offering more precise object localization. During data filtering, a confidence score threshold is applied to both noun phrases and bounding boxes to ensure relevance. A blacklist is also used to exclude irrelevant noun phrases like pronouns and abstract concepts.
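
The filtering step for grounded phrases might look like the sketch below. The dictionary keys, threshold values, and blacklist entries are all hypothetical; the text only states that confidence thresholds and a blacklist are used.

```python
BLACKLIST = {"it", "they", "this", "something", "freedom"}  # illustrative entries only


def filter_triplets(groundings, phrase_thresh=0.3, box_thresh=0.3):
    """Keep groundings whose phrase is allowed and whose scores clear both thresholds."""
    kept = []
    for g in groundings:
        if g["phrase"].strip().lower() in BLACKLIST:
            continue  # pronoun or abstract concept
        if g["phrase_score"] < phrase_thresh or g["box_score"] < box_thresh:
            continue  # low-confidence phrase or box
        kept.append(g)
    return kept


groundings = [
    {"phrase": "a brown dog", "box": (12, 30, 200, 180), "phrase_score": 0.82, "box_score": 0.77},
    {"phrase": "It", "box": (0, 0, 50, 50), "phrase_score": 0.90, "box_score": 0.88},
    {"phrase": "a frisbee", "box": (210, 40, 260, 80), "phrase_score": 0.12, "box_score": 0.65},
]
kept = filter_triplets(groundings)
```

Only the first grounding survives: the second is a blacklisted pronoun and the third has a low phrase score.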

V. Dataset

This section introduces the statistics and analysis of FLD-5B, which we built using the data engine in Section 4. We begin with an overview of the dataset and compare it with recent works. We then show further analyses of detailed annotation statistics, semantic coverage and spatial coverage in the established dataset.

1. Overview

Following the data engine, we build a large-scale training set (FLD-5B) of 126M images, more than 500M text annotations, 1.3B region-text annotations, and 3.6B text-phrase-region annotations. Each image is annotated with text, region-text pairs, and text-phrase-region triplets, and each annotation type has multiple instances at varying granularity. An illustrative example of an image and its corresponding annotations can be found in Figure 4.
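
As a quick sanity check, the per-type counts above add up to the 5.4B annotations quoted for FLD-5B. The snippet below only redoes that arithmetic with the rounded figures from the text ("more than 500M" is treated as exactly 500M, which is an assumption).

```python
# Rounded counts as quoted in the text.
NUM_IMAGES = 126_000_000
TEXT = 500_000_000            # "more than 500M", treated as a lower bound
REGION_TEXT = 1_300_000_000
TRIPLETS = 3_600_000_000

total = TEXT + REGION_TEXT + TRIPLETS
per_image = total / NUM_IMAGES

print(f"total annotations: {total / 1e9:.1f}B")
print(f"annotations per image: {per_image:.1f}")
```

With these rounded inputs the total is 5.4B, or roughly 43 annotations per image.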

We provide a comparison between our data set and the existing data sets that are commonly used for training foundation models in Table 1. Our data set has several advantages over the previous ones, such as having more annotations in total and per image. Moreover, the annotations in our data set span multiple levels of spatial and semantic granularity, which allows for more diverse and comprehensive visual understanding tasks.


2. Data Analysis

Annotation statistics

The statistics for each annotation type within our dataset are presented in Table 2. First, we have around 500M text annotations, including brief, detailed, and more detailed texts of different lengths. Notably, our detailed and more detailed texts contain about 4x and 9x as many tokens as the brief text, which is similar to COCO captions [13]. These lengthy annotations provide much richer information for comprehensive visual understanding.

In addition, our dataset has around 1.3B region-text annotations, which is more than 30x larger than academic object detection datasets such as OpenImages [40] and Objects365 [70]. On average, each image has around 5 regions, and each region is annotated with either a phrase or a relatively longer brief text. Note that the regional brief text (2.55 avg tokens) is shorter than the typical brief text annotation (7.95 avg tokens), because the regional brief text annotation actually includes a mixture of phrases, noun chunks, and brief text selected by the Florence-1 score. More details can be found in Section 4.3 (region-text pairs).

Moreover, we collect text-phrase-region annotations that include more than 3.6B phrase-region pairs for the 500M text annotations. Specifically, the brief text annotation has 4.27 average phrase-region pairs, while detailed and more detailed text annotation has more than 10 pairs, indicating that the richer text annotation covers more objects and their corresponding phrases in the text.


Semantic coverage

Our text annotations comprise various text types, addressing different levels of detail. To assess semantic coverage, we employ SpaCy [28] for tokenization and parsing, inspired by DiHT [63]. This process yields part-of-speech (POS) tags and the dependency parsing tree among tokens. We establish heuristic rules based on POS tags, categorizing tokens into semantic element types, e.g., objects, attributes, actions, and proper nouns. Additionally, we introduce the concept of token complexity, measured by the total degree of the token in the dependency parsing tree when treated as an undirected graph. This complexity reflects the richness of semantic connections. In our study, we focus on measuring the complexity of objects and actions.
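
The token complexity measure can be illustrated with a small sketch. It takes a dependency parse as a list of head indices (the root points to itself, as in spaCy, where the same information is available as `token.head.i` and `token.pos_`) and counts each token's degree in the undirected tree. The example parse and the POS-to-element mapping (NOUN as object, VERB as action) are hand-written assumptions, not the authors' heuristic rules.

```python
def token_degrees(heads):
    """Degree of each token when the dependency tree is an undirected graph.

    heads[i] is the index of token i's head; the root has heads[i] == i.
    """
    deg = [0] * len(heads)
    for child, head in enumerate(heads):
        if head != child:          # every non-root token adds one edge
            deg[child] += 1
            deg[head] += 1
    return deg


def mean_complexity(heads, pos_tags, wanted_pos):
    """Average degree over tokens whose POS tag equals wanted_pos."""
    deg = token_degrees(heads)
    picked = [d for d, p in zip(deg, pos_tags) if p == wanted_pos]
    return sum(picked) / len(picked) if picked else 0.0


# Hand-written parse of "A dog chases a red ball" (assumed, not parser output).
tokens = ["A", "dog", "chases", "a", "red", "ball"]
heads = [1, 2, 2, 5, 5, 2]
pos = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]

object_complexity = mean_complexity(heads, pos, "NOUN")
action_complexity = mean_complexity(heads, pos, "VERB")
```

In this toy sentence "ball" has degree 3 (two modifiers plus its link to the verb), so richer descriptions with more modifiers raise the object complexity.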

Table 3 presents the statistics on the average number of semantic elements and their corresponding complexity. The results indicate that all measurements increase with the inclusion of more details in text annotations. Notably, average actions experience the most significant boost, with detailed and more detailed text exhibiting 7× and 15× increases, respectively, compared to brief text. This highlights the limitations of traditional brief text annotations in describing image actions. Conversely, the increment in proper nouns is relatively low, potentially because specialists often describe objects more generally than using specific proper nouns. In terms of complexity measurements, both objects and actions show more semantic connections in detailed text annotations. The complexity of actions exhibits a higher improvement, aligning with our observation of the increasing number of actions.


Spatial coverage

Our region-text and text-phrase-region annotations, represented by bounding boxes and masks, capture the location of visual concepts within images. The distribution of box areas, as shown in Figure 5a, reveals more small boxes in region-text pairs and a uniform box size distribution in text-phrase-region triplets. This difference stems from the divergent origins of these boxes: object detectors for region-text pairs and a grounding model for text-phrase-region triplets, which aligns boxes to textual phrases representing both localized and overarching image concepts. Figure 5b illustrates the distribution of aspect ratios on a log scale. Region-text pairs and text-phrase-region triplets exhibit similar symmetric distributions, covering a wide range of aspect ratios. Heatmaps of the box centers for each annotation type, shown in Figures 5c and 5d, indicate a center bias, with region-text pairs displaying a more uniform distribution than text-phrase-region triplets.
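
The three spatial statistics discussed above (box area, log aspect ratio, box center) can be computed per box as in the sketch below. The normalization by image size, the width/height orientation of the ratio, and the natural logarithm are assumptions; the text does not specify them.

```python
import math


def box_stats(box, img_w, img_h):
    """Spatial statistics for one box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return {
        # Fraction of the image covered by the box.
        "area_frac": (w * h) / (img_w * img_h),
        # Log aspect ratio: 0 for squares, symmetric for wide vs. tall boxes.
        "log_aspect": math.log(w / h),
        # Box center in normalized [0, 1] image coordinates.
        "center": ((x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h),
    }


tall = box_stats((0, 0, 50, 100), 100, 100)
wide = box_stats((0, 0, 100, 50), 100, 100)
```

Taking the logarithm is what makes a tall box and its transposed wide counterpart mirror each other around zero, which is why a log-scale aspect-ratio histogram can look symmetric.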


VI. Experiments

Our Florence-2 models are trained on FLD-5B to learn a universal image representation. We conduct our experiments in three main parts: (1) We evaluate the zero-shot performance of our method on various tasks to show its inherent ability to handle multiple tasks with one single generalist model, without any extra fine-tuning on task-specific data. (2) We show the adaptability of our method by further training one single generalist model with additional supervised data on a wide range of tasks, achieving competitive state-of-the-art performance. (3) We examine the performance of the learned visual representation as the backbone on downstream tasks to show the superiority of our pre-training method over previous approaches.

1. Setup

We investigate two model variants with different sizes: Florence-2-B model with 232 million parameters and Florence-2-L model with 771 million parameters. The detailed architectures of each model are given in Table 15. We initialize the weights of the image encoder and multi-modality encoder-decoder from UniCL [87] and BART [43], respectively.


We adopt AdamW [54] with cosine learning rate decay [53] for training our models. We leverage DeepSpeed [67] and mixed precision to improve the training efficiency. The maximum learning rate is set at 1e-4 for the base model and 1e-5 for the large model. A linear warmup to the maximum learning rate is applied during the first 5,000 optimization steps.
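
A linear warmup followed by cosine decay can be written as a small function of the step index. This is a generic sketch of that schedule shape, not the authors' training code: the final learning rate (`min_lr=0.0`) and the exact warmup formula are assumptions, and `total_steps` must be supplied by the caller.

```python
import math


def lr_at(step, total_steps, max_lr=1e-4, warmup_steps=5000, min_lr=0.0):
    """Learning rate at a given optimization step.

    Linear warmup from 0 to max_lr over warmup_steps, then cosine decay
    from max_lr to min_lr (assumed 0) over the remaining steps.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `max_lr=1e-4` (the base model's value), the rate is 5e-5 halfway through warmup, peaks at step 5,000, and reaches `min_lr` at `total_steps`.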

We train our models with a mini-batch size of 2048/3072 (base/large) and an image size of 384×384 until reaching 3 billion effective training samples. Similar to [15, 29, 64, 92, 95], we further conduct high-resolution tuning with an image size of 768×768 for 0.5 billion samples for the base model and 0.1 billion samples for the large model.

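
The sample budgets above translate into optimizer step counts once divided by the batch size. The sketch below does that arithmetic; it assumes the high-resolution stage reuses the same batch sizes, which the text does not state.

```python
def steps_for(samples, batch_size):
    """Number of optimization steps needed to consume `samples` (ceiling)."""
    return -(-samples // batch_size)


# Main pre-training stage: 3B effective samples at 384x384.
base_steps = steps_for(3_000_000_000, 2048)
large_steps = steps_for(3_000_000_000, 3072)

# High-resolution stage at 768x768 (batch sizes assumed unchanged).
base_hr_steps = steps_for(500_000_000, 2048)
large_hr_steps = steps_for(100_000_000, 3072)
```

Under these assumptions the base model takes about 1.46M steps and the large model about 0.98M steps in the main stage, so the 5,000 warmup steps are well under 1% of training.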

2. Zero-shot Evaluation Across Tasks

We present a powerful vision foundation model that does not require task-specific supervised annotations for fine-tuning. The zero-shot performance of our model is shown in Table 4. For image-level tasks, Florence-2-L achieves a 135.6 CIDEr score on the COCO caption benchmark [48], utilizing less than 1% of the parameters of the 80B Flamingo [2] model (which has an 84.3 CIDEr score).
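
The "less than 1% of the parameters" comparison follows directly from the model sizes quoted in this article (771M for Florence-2-L, 80B for Flamingo):

```python
florence2_l_params = 771e6   # Florence-2-L, as stated in the setup
flamingo_params = 80e9       # Flamingo, as stated in the text

ratio = florence2_l_params / flamingo_params
print(f"Florence-2-L has {ratio:.2%} of Flamingo's parameters")
```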

For region-level grounding and referring expression comprehension tasks, Florence-2-L establishes a new record in zero-shot performance, achieving a 5.7 improvement in Flickr30k [61] Recall@1 and approximately 4%, 8%, and 8% absolute improvements on RefCOCO, RefCOCO+, and RefCOCOg [94], respectively, compared to the Kosmos-2 [60] model, which has 1.6B parameters. Additionally, our pre-trained model attains a 35.8% mIoU in the RefCOCO referring expression segmentation (RES) [94] task, a capability not supported by prior foundation models.

3. Generalist Model with Public Supervised Data

We demonstrate the versatility and effectiveness of our model as a vision foundation that can be transferred to various downstream tasks. We fine-tune Florence-2 models by adding a collection of public datasets that cover image-level, region-level, and pixel-level tasks, yielding one generalist model for various vision tasks. The details of the dataset collection are provided in Appendix B. Tables 5 and 6 compare our model with other state-of-the-art models. Our key findings are:

Simple design for strong performance. Florence-2 demonstrates strong performance with a standard multi-modality Transformer encoder-decoder without special designs, particularly for region-level and pixel-level tasks. For example, Florence-2-L outperforms PolyFormer [49] on both the RefCOCO REC task and RES task by 3.0 Accuracy@0.5 and 3.54 mIoU, respectively, where PolyFormer [49] adopts a specifically designed regression-based prediction head for coordinates. Florence-2-L also outperforms the previous SOTA method UNINEXT [84] on RefCOCO by 0.8 Accuracy@0.5, where UNINEXT [84] is based on the advanced object detectors Deformable DETR [100] and DINO [97].
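
For readers unfamiliar with the Accuracy@0.5 metric used for referring expression comprehension, the sketch below shows the usual idea: a prediction counts as correct when its IoU with the ground-truth box reaches 0.5. Whether the benchmark uses `>=` or `>` at the boundary is an assumption here, and the function names are illustrative.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, thr=0.5):
    """Fraction of predicted boxes whose IoU with the paired ground truth >= thr."""
    hits = sum(1 for p, g in zip(preds, gts) if box_iou(p, g) >= thr)
    return hits / len(gts)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
acc = accuracy_at_iou(preds, gts)
```

Here the first prediction matches exactly and the second overlaps its target with an IoU of only 25/175, so the accuracy is 0.5.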

Competitive performance with fewer parameters. Florence-2-L achieves competitive performance without the need for LLMs, showcasing efficiency in handling diverse tasks while maintaining a compact size. For instance, Florence-2-L attains a CIDEr score of 140.0 on the COCO Caption Karpathy test split [30], outperforming models with significantly more parameters, such as Flamingo (80B parameters, 138.1 CIDEr score).

Adaptable generalization across task levels. Florence-2 demonstrates competitive performance across image-level, pixel-level, and region-level tasks, emphasizing its adaptability and effectiveness in addressing various challenges in computer vision and natural language processing. For example, in the TextVQA task, Florence-2-L sets a new state-of-the-art performance with an accuracy of 81.5 without any external OCR token input, surpassing previous SOTA methods [12, 15].

These achievements emphasize Florence-2’s efficiency in handling diverse tasks while maintaining a compact size, making it a unique and valuable asset in the ever-evolving landscape of AI research and applications.


4. Downstream Tasks Fine-tuning

In this section, we investigate the performance of our single model fine-tuning on downstream tasks. This experiment highlights the superiority of Florence-2 pre-training over previous approaches, as it demonstrates the effectiveness of the learned universal image representation. We use the base size model with about 80M parameters in our experiments to ensure fair comparison with other methods.


Object detection and segmentation. We conduct COCO object detection and instance segmentation [48] experiments with Mask R-CNN [26], and COCO object detection [48] experiments with DINO [97] to further demonstrate the effectiveness of Florence-2 pre-training. We train on the train2017 split and evaluate on the val2017 split.


For Mask R-CNN [26] experiments, we follow the common setup used in [51, 97] and use the standard 1× (12 epochs) schedule with multi-scale training for all experiments. The learning rate is stepped down by a factor of 0.1 at 67% and 89% of the training epochs. We do not use any additional augmentation (such as random crop or mosaic) or optimization techniques (such as EMA or weight normalization) during training to ensure a fair comparison. We do not use any test-time augmentation (TTA) either. Thanks to the strong universal representation learned by Florence-2 pre-training, we do not require longer training, such as 36 epochs in [51, 81, 85, 86] or 100 epochs in [46], to achieve better results.
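
The step schedule described above can be sketched as a function of the epoch index. Mapping "67% and 89% of 12 epochs" to epoch boundaries requires a rounding rule; this sketch rounds to the nearest epoch (giving drops before epochs 8 and 11, counting from 0), which is an assumption, as is the `base_lr` value used in the example.

```python
def step_lr(epoch, base_lr, total_epochs=12, milestones=(0.67, 0.89), gamma=0.1):
    """Learning rate for a 0-indexed epoch under a multi-step decay schedule.

    Milestone fractions are rounded to the nearest epoch (assumed rule).
    """
    boundaries = [round(m * total_epochs) for m in milestones]
    drops = sum(1 for b in boundaries if epoch >= b)
    return base_lr * gamma ** drops


# Hypothetical base learning rate, for illustration only.
schedule = [step_lr(e, base_lr=1e-4) for e in range(12)]
```

Under this rounding rule the first eight epochs run at the base rate, the next three at one tenth of it, and the last epoch at one hundredth.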

For DINO [97] experiments, we train DINO-4scale [97] detector for 12 epochs (1×) using the same data augmentation strategy as employed by [7].


First, our base model achieves a strong performance improvement compared to other approaches. As shown in Table 7, our DaViT-B model pre-trained by Florence-2 surpasses the previous best base model (ConvNeXt v2-B), which is pre-trained by FCMAE [81], by 0.7 APb using Mask R-CNN. Importantly, while ConvNeXt v2-B leverages a 3× schedule (36 epochs), our model efficiently employs a 1× schedule (12 epochs) thanks to our powerful pre-trained universal representation. For the DINO framework, our model significantly outperforms ViT-B, achieving a notable improvement of 4.2 AP.

Second, our pre-training demonstrates higher training efficiency. As shown in Table 8 and Figure 6, compared to the model with supervised ImageNet-1k pre-training, our model with Florence-2 pre-training achieves 4× efficiency and a significant improvement of 6.9 AP and 5.5 AP with the Mask R-CNN and DINO frameworks, respectively.

Third, our pre-training provides a good generic representation without extensive fine-tuning. Table 8 indicates that the models with Florence-2 pre-training maintain competitive performance when the first two stages are frozen, with drops of only 0.3 and 0.2 for Mask R-CNN and DINO, respectively. Moreover, our approach with a completely frozen backbone can outperform the model with supervised ImageNet-1k pre-training by 1.6 and 2.4 for Mask R-CNN and DINO.

Semantic segmentation. We conduct semantic segmentation experiments with UperNet [82] framework on ADE20k [98] dataset. We mostly follow the training and evaluation protocols from Swin [51]. Specifically, we use input size 512×512 and train the model for 40k iterations with a batch size of 64. We adopt the AdamW [54] optimizer with the optimal learning rate searched from {8e-4,4e-4,2e-4,1e-4}.


Our results show a similar trend to the object detection experiments. As illustrated in Table 9, our base model outperforms the previous SoTA model, which is BEiT pretrained ViT-B [4], by 1.3 and 1.4 points in single-scale and multi-scale testing protocol, respectively. With the same backbone architecture of DaViT-B [20], Florence-2 pretrained model achieves a remarkable improvement of 4.9 points and 4× efficiency compared to the ImageNet-1k pretrained counterpart as demonstrated in Table 8 and Figure 6.


VII. Related Works

1. Vision-Language Foundation Models

Recent vision-language pre-training models [29, 64, 95] have demonstrated impressive zero-shot transfer abilities to vision-language alignment and image classification tasks, thanks to the alignment of vision and text embeddings extracted from respective encoders through contrastive learning objectives [58, 74]. These models (e.g., [95]), trained on large-scale weakly labeled image-text data, have been further extended to more downstream tasks such as object detection, achieving state-of-the-art performance with task-specific adaptation heads.

In contrast, other studies [2, 45, 78, 92] propose using a multi-modality decoder to predict text in an autoregressive manner with language modeling pre-training objectives. Techniques for fusing vision and language embeddings vary: GIT [78] concatenates vision and text tokens as decoder input and designs a causal attention mask; CoCa [92] uses attentional poolers with learnable queries to select task-specific vision representations, which are then cross-attended via the decoder; and Flamingo [2] pools a fixed number of vision tokens with a Perceiver Resampler and adds new learnable cross-attention layers to the decoder while freezing the pre-trained vision encoder and text decoder.
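
To make the "concatenate vision and text tokens with a causal attention mask" idea concrete, here is a small sketch of one common way such a mask is laid out: image tokens attend to all image tokens, while text tokens attend to all image tokens and to earlier text tokens only. This layout is an assumption about the general technique, not a reproduction of the GIT implementation.

```python
def seq2seq_mask(n_img, n_txt):
    """Boolean attention mask over [image tokens | text tokens].

    mask[q][k] is True when query position q may attend to key position k.
    """
    n = n_img + n_txt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_img:
                mask[q][k] = True            # every token sees the image
            elif q >= n_img and k <= q:
                mask[q][k] = True            # text is causal over text
    return mask


mask = seq2seq_mask(n_img=2, n_txt=2)
```

For two image tokens and two text tokens, the image rows see only the image columns, and each text row additionally sees itself and the text tokens before it.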

Beyond image captioning pre-training task, some research [15,55,79] attempts to formulate more vision tasks in a unified sequence-to-sequence learning paradigm, including object detection and image segmentation. Customized special tokens accommodate representations beyond pure text, such as bounding boxes [10, 55, 79]. This approach uses the same architecture for pre-training and downstream tasks, potentially using the same set of weights for all tasks.

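
Representing a bounding box as special tokens usually means quantizing its coordinates into a fixed number of bins and emitting one token per coordinate. The sketch below illustrates that idea; the bin count of 1000 and the `<loc_k>` token naming are assumptions made for illustration, not details stated in this section.

```python
def box_to_tokens(box, img_w, img_h, num_bins=1000):
    """Quantize a pixel box (x1, y1, x2, y2) into four location tokens.

    num_bins and the '<loc_k>' token format are hypothetical choices.
    """
    def quantize(value, size):
        # Map [0, size] to an integer bin in [0, num_bins - 1].
        return min(num_bins - 1, max(0, int(value / size * num_bins)))

    x1, y1, x2, y2 = box
    bins = [quantize(x1, img_w), quantize(y1, img_h),
            quantize(x2, img_w), quantize(y2, img_h)]
    return [f"<loc_{b}>" for b in bins]


tokens = box_to_tokens((0, 0, 320, 240), img_w=640, img_h=480)
```

Because boxes become ordinary tokens in the output sequence, the same decoder and loss can serve captioning, detection, and grounding, which is what allows one set of weights to be shared across tasks.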

Our method, which falls into this category, aims to obtain foundation models that understand dense information beyond simple image-level captions. It shares the same encoder-decoder design as other multi-modality encoder-decoder models [15, 55] adapted for sequence-to-sequence learning, but uses our built large-scale comprehensive annotation data instead of combining existing sparse annotated data.


2. Vision Datasets

Comprehensive annotations. The quest for comprehensive understanding of visual scenes, the holy grail of computer vision [36], has evolved from focusing on individual datasets each targeting a single perspective, e.g., image classification [18], to providing multi-perspective [36,40, 48], comprehensive annotations for every visual data point. Notable datasets like MS-COCO [13, 48] and Visual Genome [36] integrate various types of annotations, enabling richer understanding in spatial and semantic granularities and better model interactions across annotations.


However, due to the high cost of human verification, these annotations are limited in size. Our datasets, while large-scale, maintain comprehensive annotations covering text, region-text pairs, and text-phrase-region triplets, with reduced human involvement.

Scalable annotations. Over the past decade, vision datasets have rapidly scaled up from thousands [37, 42] to billions of examples [29, 96] to encompass more visual concepts for better generalization. This shift is evident in recent foundation models that employ massive quantities of data [5]. These large datasets typically collect images from the web and parse noisy annotations from the corresponding metadata, such as category labels from queries [75, 96], short descriptions from alt-text [29, 64], and detailed descriptions from interleaved text [2, 41]. Despite their diversity, these annotations suffer from randomness and limited types (i.e., texts only). Some works [32, 45] attempt to scale up annotations using pseudo-label generation with iteratively trained models, which offer higher quality without significant diversity loss.

Our data pipeline extends these large-scale, web-crawled noisy annotations with higher-quality, autonomous annotations generated from multiple specialist models. The pipeline iteratively refines labels and completes missing pieces, resulting in a scalable and comprehensive dataset for learning a unified visual representation.

VIII. Conclusion

The Florence Project endeavors to develop a foundational vision model endowed with a diverse array of perceptual capabilities, encompassing spatial hierarchy and semantic granularity. To this end, we construct FLD-5B dataset containing an extensive collection of 126M images paired with 5B comprehensive annotations, which are collected by the Florence data engine. Subsequently, we pretrain Florence-2 on this rich dataset through comprehensive multitask learning in a unified manner. Florence-2 has exhibited remarkable zero-shot capabilities that extend across a wide spectrum of visual tasks, such as captioning, object detection, visual grounding, and referring segmentation, among others. The experimental findings underscore the potency of the universal representation pre-trained by Florence-2, revealing its substantial contributions to the enhancement of a multitude of downstream tasks.
