Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Author: 强化学习曾小健

Article link: https://blog.csdn.net/sinat_37574187/article/details/142592033

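Model Summary

This Hub repository contains the HuggingFace transformers implementation of the Florence-2 model from Microsoft.

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks such as captioning, object detection, and segmentation. It leverages our FLD-5B dataset, which contains 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture lets it perform well in both zero-shot and fine-tuned settings, making it a competitive vision foundation model.

Resources and technical documentation: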

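Model	Model size	Model description
Florence-2-base[HF]	0.23B	Pretrained model with FLD-5B
Florence-2-large[HF]	0.77B	Pretrained model with FLD-5B
Florence-2-base-ft[HF]	0.23B	Fine-tuned model on a collection of downstream tasks
Florence-2-large-ft[HF]	0.77B	Fine-tuned model on a collection of downstream tasks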

How to Get Started with the Model

Use the code below to get started with the model. All models are trained with float16.

import requests

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM 


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

prompt = "<OD>"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))

print(parsed_answer)

Tasks

The model can perform different tasks depending on the prompt it is given.

First, let's define a function to run a prompt.

import requests

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM 

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))

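    print(parsed_answer)
    # Return the parsed result as well, so callers can assign it
    # (e.g. results = run_example(...)).
    return parsed_answer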
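
The prompt-building step inside `run_example` is plain string concatenation: the task token comes first and any extra text follows it directly. Below is a minimal sketch of that step in isolation. It assumes that, among the tasks covered in this post, only `<CAPTION_TO_PHRASE_GROUNDING>` takes extra text; the helper name `build_prompt` and the `TASKS_NEEDING_TEXT` set are hypothetical and not part of the Florence-2 API.

```python
# Hypothetical helper; it mirrors the concatenation done in run_example.
# Assumption: of the tasks shown in this post, only phrase grounding needs text.
TASKS_NEEDING_TEXT = {"<CAPTION_TO_PHRASE_GROUNDING>"}


def build_prompt(task_prompt, text_input=None):
    """Return the full prompt string for a task token and optional text."""
    if task_prompt in TASKS_NEEDING_TEXT and not text_input:
        raise ValueError(f"{task_prompt} requires a text input")
    if text_input is None:
        return task_prompt
    # The task token and the text are joined with no separator.
    return task_prompt + text_input


print(build_prompt("<OD>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "A green car."))
```

Failing early on a missing caption is a design choice of this sketch; the original function simply sends the bare task token to the model.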

Here are the tasks Florence-2 can perform:

Caption

prompt = "<CAPTION>"
run_example(prompt)

Detailed Caption

prompt = "<DETAILED_CAPTION>"
run_example(prompt)

More Detailed Caption

prompt = "<MORE_DETAILED_CAPTION>"
run_example(prompt)

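Caption to Phrase Grounding

The caption-to-phrase-grounding task requires an additional text input: the caption.

Caption to phrase grounding result format: {'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}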

task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
results = run_example(task_prompt, text_input="A green car parked in front of a yellow building.")
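
The result format above stores boxes and labels as two parallel lists. Here is a minimal sketch of reading such a result, assuming `bboxes[i]` belongs to `labels[i]` and that the labels hold the grounded phrases. The sample values and the helper name `pair_phrases` are hypothetical; they are not model output.

```python
# Sample values below are made up for illustration; they are not model output.
sample = {
    "<CAPTION_TO_PHRASE_GROUNDING>": {
        "bboxes": [[34.0, 160.0, 597.0, 371.0], [0.0, 0.0, 640.0, 300.0]],
        "labels": ["A green car", "a yellow building"],
    }
}


def pair_phrases(parsed, task="<CAPTION_TO_PHRASE_GROUNDING>"):
    """Return (phrase, box) tuples; bboxes[i] is assumed to match labels[i]."""
    result = parsed[task]
    return list(zip(result["labels"], result["bboxes"]))


for phrase, (x1, y1, x2, y2) in pair_phrases(sample):
    print(f"{phrase}: ({x1}, {y1}) to ({x2}, {y2})")
```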

Object Detection

OD result format: {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]} }

prompt = "<OD>"
run_example(prompt)
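
Detection results in the format above are often easier to work with once grouped by class. The following is a rough sketch under the assumption that the parsed answer has exactly the `bboxes`/`labels` shape shown; the sample dictionary and the helper name `group_by_label` are hypothetical.

```python
from collections import defaultdict

# Made-up sample in the documented OD format; not real model output.
sample = {
    "<OD>": {
        "bboxes": [[34.0, 160.0, 597.0, 371.0], [454.0, 276.0, 580.0, 373.0],
                   [96.0, 280.0, 198.0, 371.0]],
        "labels": ["car", "wheel", "wheel"],
    }
}


def group_by_label(parsed, task="<OD>"):
    """Collect every box under its label, preserving detection order."""
    grouped = defaultdict(list)
    for box, label in zip(parsed[task]["bboxes"], parsed[task]["labels"]):
        grouped[label].append(box)
    return dict(grouped)


for label, boxes in group_by_label(sample).items():
    print(f"{label}: {len(boxes)} box(es)")
```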

Dense Region Caption

Dense region caption result format: {'<DENSE_REGION_CAPTION>' : {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]}}

prompt = "<DENSE_REGION_CAPTION>"
run_example(prompt)

Region Proposal

Region proposal result format: {'<REGION_PROPOSAL>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}

prompt = "<REGION_PROPOSAL>"
run_example(prompt)
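
Region proposals carry no labels, so one simple way to inspect them is by size. The sketch below ranks proposals by the fraction of the image they cover. It assumes boxes are `[x1, y1, x2, y2]` in pixels and that the image size is given as `(width, height)`, the same order passed to `post_process_generation` earlier; the sample values and helper names are hypothetical.

```python
def box_area_fraction(box, image_size):
    """Fraction of the image covered by an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    width, height = image_size
    return ((x2 - x1) * (y2 - y1)) / (width * height)


def largest_first(parsed, image_size, task="<REGION_PROPOSAL>"):
    """Return the proposed boxes sorted from largest to smallest."""
    boxes = parsed[task]["bboxes"]
    return sorted(boxes, key=lambda b: box_area_fraction(b, image_size), reverse=True)


# Made-up sample in the documented format; not real model output.
sample = {
    "<REGION_PROPOSAL>": {
        "bboxes": [[100, 100, 200, 150], [0, 0, 320, 240]],
        "labels": ["", ""],
    }
}

for box in largest_first(sample, image_size=(640, 480)):
    print(box, round(box_area_fraction(box, (640, 480)), 4))
```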

Optical Character Recognition (OCR)

prompt = "<OCR>"
run_example(prompt)

OCR with Region

OCR with region output format: {'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], ...], 'labels': ['text1', ...]}}

prompt = "<OCR_WITH_REGION>"
run_example(prompt)
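
Unlike the other tasks, OCR with region returns quadrilaterals (four corner points, eight numbers) rather than axis-aligned boxes. If downstream code expects `[x1, y1, x2, y2]` boxes, a quad can be reduced to its enclosing rectangle. This is a small sketch assuming the flat `[x1, y1, ..., x4, y4]` layout shown above; `quad_to_bbox` is a hypothetical helper and the sample quad is made up.

```python
def quad_to_bbox(quad):
    """Return the axis-aligned box [x_min, y_min, x_max, y_max] enclosing a quad."""
    if len(quad) != 8:
        raise ValueError("expected 8 numbers: x1, y1, x2, y2, x3, y3, x4, y4")
    xs = quad[0::2]  # every even index is an x coordinate
    ys = quad[1::2]  # every odd index is a y coordinate
    return [min(xs), min(ys), max(xs), max(ys)]


# Made-up, slightly rotated text region; not real model output.
quad = [10, 20, 110, 25, 108, 60, 8, 55]
print(quad_to_bbox(quad))
```

The enclosing rectangle loses the rotation of the text, so it is only suitable when a loose box is good enough.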

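For more detailed examples, please refer to the notebook.

Benchmarks

Florence-2 Zero-shot Performance

The following table presents the zero-shot performance of generalist vision foundation models on image captioning and object detection evaluation tasks. These models were not exposed to the training data of the evaluation tasks during their training phase.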

Method	# Params	COCO Cap. test CIDEr	NoCaps val CIDEr	TextCaps val CIDEr	COCO Det. val2017 mAP
Flamingo	80B	84.3	-	-	-
Florence-2-base	0.23B	133.0	118.7	70.1	34.7
Florence-2-large	0.77B	135.6	120.8	72.8	37.5
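
To put the size gap in the table above into numbers, the short calculation below compares each Florence-2 model with Flamingo on parameter count and COCO caption CIDEr. The figures are copied from the table; the variable names are illustrative only.

```python
# (parameters in billions, COCO Cap. test CIDEr), copied from the table above.
rows = {
    "Flamingo": (80.0, 84.3),
    "Florence-2-base": (0.23, 133.0),
    "Florence-2-large": (0.77, 135.6),
}

ref_params, ref_cider = rows["Flamingo"]
comparison = {}
for name, (params, cider) in rows.items():
    if name == "Flamingo":
        continue
    # How many times larger Flamingo is, and the CIDEr difference in points.
    comparison[name] = (ref_params / params, cider - ref_cider)

for name, (size_ratio, cider_gain) in comparison.items():
    print(f"{name}: Flamingo is {size_ratio:.0f}x larger, CIDEr gap {cider_gain:+.1f}")
```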

The following table continues the comparison on other vision-language evaluation tasks.

Method	Flickr30k test R@1	Refcoco val Accuracy	Refcoco test-A Accuracy	Refcoco test-B Accuracy	Refcoco+ val Accuracy	Refcoco+ test-A Accuracy	Refcoco+ test-B Accuracy	Refcocog val Accuracy	Refcocog test Accuracy	Refcoco RES val mIoU
Kosmos-2	78.7	52.3	57.4	47.3	45.5	50.7	42.2	60.6	61.7	-
Florence-2-base	83.6	53.9	58.4	49.7	51.5	56.4	47.9	66.3	65.1	34.6
Florence-2-large	84.4	56.3	61.6	51.4	53.6	57.9	49.9	68.0	67.0	35.8

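Florence-2 Fine-tuned Performance

We fine-tune Florence-2 models on a collection of downstream tasks, resulting in two generalist models, Florence-2-base-ft and Florence-2-large-ft, that can perform a wide range of downstream tasks.

The table below compares the performance of specialist and generalist models on various captioning and Visual Question Answering (VQA) tasks. Specialist models are fine-tuned specifically for each task, whereas generalist models are fine-tuned in a task-agnostic manner across all tasks. The symbol "▲" indicates the use of external OCR as input.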

Method	# Params	COCO Caption Karpathy test CIDEr	NoCaps val CIDEr	TextCaps val CIDEr	VQAv2 test-dev Acc	TextVQA test-dev Acc	VizWiz VQA test-dev Acc
Specialist Models
CoCa	2.1B	143.6	122.4	-	82.3	-	-
BLIP-2	7.8B	144.5	121.6	-	82.2	-	-
GIT2	5.1B	145.0	126.9	148.6	81.7	67.3	71.0
Flamingo	80B	138.1	-	-	82.0	54.1	65.7
PaLI	17B	149.1	127.0	160.0▲	84.3	58.8 / 73.1▲	71.6 / 74.4▲
PaLI-X	55B	149.2	126.3	147.0 / 163.7▲	86.0	71.4 / 80.8▲	70.9 / 74.6▲
Generalist Models
Unified-IO	2.9B	-	100.0	-	77.9	-	57.4
Florence-2-base-ft	0.23B	140.0	116.7	143.9	79.7	63.6	63.6
Florence-2-large-ft	0.77B	143.3	124.9	151.1	81.7	73.5	72.6
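
Some cells in the table above hold two scores separated by "/", with "▲" marking the score obtained with external OCR, and "-" marking a missing result. Anyone loading the table programmatically needs to split these apart. The following is a minimal sketch of such a parser; `parse_cell` is a hypothetical helper written for this table layout only.

```python
def parse_cell(cell):
    """Parse a table cell into a list of (score, uses_external_ocr) tuples.

    "-" (no reported result) yields an empty list.
    """
    cell = cell.strip()
    if cell == "-":
        return []
    scores = []
    for part in cell.split("/"):
        part = part.strip()
        uses_ocr = part.endswith("▲")  # marker for external OCR input
        scores.append((float(part.rstrip("▲")), uses_ocr))
    return scores


for raw in ["73.5", "58.8 / 73.1▲", "160.0▲", "-"]:
    print(repr(raw), "->", parse_cell(raw))
```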

Method	# Params	COCO Det. val2017 mAP	Flickr30k test R@1	RefCOCO val Accuracy	RefCOCO test-A Accuracy	RefCOCO test-B Accuracy	RefCOCO+ val Accuracy	RefCOCO+ test-A Accuracy	RefCOCO+ test-B Accuracy	RefCOCOg val Accuracy	RefCOCOg test Accuracy	RefCOCO RES val mIoU
Specialist Models
SeqTR	-	-	-	83.7	86.5	81.2	71.5	76.3	64.9	74.9	74.2	-
PolyFormer	-	-	-	90.4	92.9	87.2	85.0	89.8	78.0	85.8	85.9	76.9
UNINEXT	0.74B	60.6	-	92.6	94.3	91.5	85.2	89.6	79.8	88.7	89.4	-
Ferret	13B	-	-	89.5	92.4	84.4	82.8	88.1	75.2	85.8	86.3	-
Generalist Models
UniTAB	-	-	-	88.6	91.1	83.8	81.0	85.4	71.6	84.6	84.7	-
Florence-2-base-ft	0.23B	41.4	84.0	92.6	94.8	91.5	86.8	91.7	82.2	89.8	82.2	78.0
Florence-2-large-ft	0.77B	43.4	85.2	93.4	95.3	92.0	88.3	92.9	83.6	91.2	91.7	80.5

BibTex and Citation Info

@article{xiao2023florence,
  title={Florence-2: Advancing a unified representation for a variety of vision tasks},
  author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  journal={arXiv preprint arXiv:2311.06242},
  year={2023}
}