Alibaba International Open-Sources the Ovis2 Series of Multimodal Large Language Models in Six Versions

On February 21, 2025, Alibaba's international business team announced that its new multimodal large language model series, Ovis2, is now open source.

Ovis2 is the latest release in the Ovis series developed by Alibaba's international team. Compared with the previous 1.6 release, Ovis2 brings significant improvements in both data construction and training methodology. It not only increases the capability density of its smaller models, but also substantially strengthens chain-of-thought (CoT) reasoning through instruction tuning and preference learning. In addition, Ovis2 introduces video and multi-image processing, and improves multilingual support and OCR in complex scenes, making the models considerably more practical.

The open-sourced Ovis2 series comprises six versions: 1B, 2B, 4B, 8B, 16B, and 34B, each reaching SOTA (state-of-the-art) results among models of the same size. Ovis2-34B performs particularly well on the authoritative OpenCompass leaderboard: it ranks second among all open-source models on the multimodal general-capability leaderboard, surpassing many 70B open-source flagship models with less than half their parameters, and ranks first among all open-source models on the multimodal math-reasoning leaderboard, with the other sizes also showing strong reasoning ability. These results demonstrate both the effectiveness of the Ovis architecture and the open-source community's potential to advance multimodal large models.

The Ovis2 architecture is designed to address the mismatch between visual and textual embedding strategies. It consists of three key components: a visual tokenizer, a visual embedding table, and an LLM. The visual tokenizer splits the input image into patches, extracts features with a vision Transformer, and maps those features onto "visual words" through a visual head, producing probabilistic visual tokens. The visual embedding table stores an embedding vector for each visual word, and the LLM concatenates the visual embeddings with the text embeddings and processes them to generate the text output, completing the multimodal task.
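
To make the embedding-alignment idea concrete, here is a minimal, self-contained sketch of a probabilistic visual tokenizer feeding a visual embedding table. It is illustrative only, not the released implementation; module and parameter names (`vit_backbone`, `vocab_size`, `llm_hidden_size`) are assumptions.

import torch
import torch.nn as nn

class ProbabilisticVisualTokenizer(nn.Module):
    """Sketch of the Ovis-style idea: patch features are softly assigned to a
    vocabulary of 'visual words', and each patch embedding is the probability-
    weighted mix of rows of a learnable visual embedding table, mirroring how
    text tokens index a text embedding table."""

    def __init__(self, vit_backbone: nn.Module, feat_dim: int, vocab_size: int, llm_hidden_size: int):
        super().__init__()
        self.vit = vit_backbone                                   # returns (B, num_patches, feat_dim)
        self.visual_head = nn.Linear(feat_dim, vocab_size)        # patch feature -> logits over visual words
        self.visual_embedding_table = nn.Embedding(vocab_size, llm_hidden_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        patch_features = self.vit(pixel_values)                   # (B, N, feat_dim)
        probs = self.visual_head(patch_features).softmax(dim=-1)  # probabilistic visual tokens, (B, N, vocab_size)
        visual_embeds = probs @ self.visual_embedding_table.weight  # (B, N, llm_hidden_size)
        return visual_embeds

# The LLM then consumes the concatenation of visual and text embeddings, e.g.:
#   inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
#   outputs = llm(inputs_embeds=inputs_embeds, attention_mask=...)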

For training, Ovis2 adopts a four-stage strategy to fully unlock its multimodal understanding. Stage 1 freezes most of the LLM and ViT parameters and trains the visual module to learn the mapping from visual features to embeddings. Stage 2 further strengthens the visual module's feature extraction, improving high-resolution image understanding, multilingual ability, and OCR. Stage 3 aligns the visual embeddings with the LLM's dialogue format using conversational visual caption data. Stage 4 performs multimodal instruction tuning and preference learning to further improve instruction following and output quality across modalities.
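
As a rough illustration only (this is not the team's training code, and the exact set of trainable parameters per stage is an assumption inferred from the description above), a stage-wise freezing schedule could be expressed like this:

# Hypothetical stage -> trainable-module mapping inferred from the four-stage description.
STAGE_TRAINABLE_PREFIXES = {
    1: ["visual_tokenizer.visual_head", "visual_embedding_table"],   # learn visual-feature-to-embedding mapping
    2: ["visual_tokenizer"],                                         # strengthen visual features (high-res, multilingual, OCR)
    3: ["visual_tokenizer", "visual_embedding_table"],               # align visual embeddings with the LLM dialogue format
    4: ["visual_tokenizer", "visual_embedding_table", "llm"],        # multimodal instruction tuning + preference learning
}

def configure_stage(model, stage: int) -> None:
    """Freeze all parameters, then unfreeze only the modules trained in the given stage."""
    prefixes = STAGE_TRAINABLE_PREFIXES[stage]
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in prefixes)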

To improve video understanding, Ovis2 introduces an innovative keyframe selection algorithm that picks the most useful frames based on frame-text relevance, diversity among the selected frames, and their temporal order. Using conditional similarity computation in a high-dimensional space, a determinantal point process (DPP), and a Markov decision process (MDP), the algorithm selects keyframes efficiently within a limited visual context, improving video understanding performance.
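
The post does not include the algorithm itself; below is a minimal sketch of the relevance-plus-diversity part using greedy DPP-style selection (the MDP/sequentiality component is omitted). `frame_feats` and `text_feat` are assumed to be L2-normalized embeddings from any vision-language encoder.

import numpy as np

def select_keyframes(frame_feats: np.ndarray, text_feat: np.ndarray, k: int) -> list:
    """Greedy DPP-style keyframe selection balancing text relevance and frame diversity.
    frame_feats: (num_frames, dim) L2-normalized frame embeddings.
    text_feat:   (dim,) L2-normalized text embedding.
    """
    relevance = frame_feats @ text_feat                # cosine similarity of each frame to the text
    quality = np.exp(relevance)                        # per-frame quality scores
    similarity = frame_feats @ frame_feats.T           # frame-frame similarity (diversity term)
    kernel = quality[:, None] * similarity * quality[None, :]   # DPP kernel: quality x similarity x quality

    selected, remaining = [], list(range(len(frame_feats)))
    for _ in range(min(k, len(remaining))):
        # greedily add the frame that maximizes the determinant of the selected sub-kernel
        best_i, best_det = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            det = np.linalg.det(kernel[np.ix_(idx, idx)])
            if det > best_det:
                best_i, best_det = i, det
        selected.append(best_i)
        remaining.remove(best_i)
    return sorted(selected)                            # return chosen frames in temporal order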

The Ovis2 series stands out on the OpenCompass multimodal leaderboards, with models of different sizes achieving SOTA results across multiple benchmarks. For example, Ovis2-34B ranks second on the multimodal general-capability leaderboard and first on the math-reasoning leaderboard. Ovis2 also leads on the video-understanding leaderboard, further confirming its strength across multimodal tasks.

The Alibaba international team states that open source is a key driver of progress in AI. By sharing the Ovis2 research openly, the team hopes to explore the frontier of multimodal large models together with developers worldwide and to inspire more innovative applications. The Ovis2 code is available on GitHub, the models can be downloaded from Hugging Face and ModelScope, an online demo is provided, and the accompanying paper is published on arXiv for developers and researchers to consult.

We evaluate Ovis2 with VLMEvalKit, the toolkit used by the OpenCompass multimodal and reasoning leaderboards.
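
For reference, a typical VLMEvalKit invocation is sketched below in the same style as the install commands later in this post; the registered model name (`Ovis2-8B`) and the dataset choices are assumptions, so check VLMEvalKit's model and dataset configs for the exact identifiers.

git clone https://github.com/open-compass/VLMEvalKit.git && cd VLMEvalKit
pip install -e .
python run.py --data MMBench_DEV_EN MMStar --model Ovis2-8B --verbose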


Image Benchmarks

| Benchmark | Qwen2.5-VL-3B | SAIL-VL-2B | InternVL2.5-2B-MPO | Ovis1.6-3B | InternVL2.5-1B-MPO | Ovis2-1B | Ovis2-2B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 77.1 | 73.6 | 70.7 | 74.1 | 65.8 | 68.4 | 76.9 |
| MMStar | 56.5 | 56.5 | 54.9 | 52.0 | 49.5 | 52.1 | 56.7 |
| MMMU (val) | 51.4 | 44.1 | 44.6 | 46.7 | 40.3 | 36.1 | 45.6 |
| MathVista (testmini) | 60.1 | 62.8 | 53.4 | 58.9 | 47.7 | 59.4 | 64.1 |
| HallusionBench | 48.7 | 45.9 | 40.7 | 43.8 | 34.8 | 45.2 | 50.2 |
| AI2D | 81.4 | 77.4 | 75.1 | 77.8 | 68.5 | 76.4 | 82.7 |
| OCRBench | 83.1 | 83.1 | 83.8 | 80.1 | 84.3 | 89.0 | 87.3 |
| MMVet | 63.2 | 44.2 | 64.2 | 57.6 | 47.2 | 50.0 | 58.3 |
| MMBench (test) | 78.6 | 77 | 72.8 | 76.6 | 67.9 | 70.2 | 78.9 |
| MMT-Bench (val) | 60.8 | 57.1 | 54.4 | 59.2 | 50.8 | 55.5 | 61.7 |
| RealWorldQA | 66.5 | 62 | 61.3 | 66.7 | 57 | 63.9 | 66.0 |
| BLINK | 48.4 | 46.4 | 43.8 | 43.8 | 41 | 44.0 | 47.9 |
| QBench | 74.4 | 72.8 | 69.8 | 75.8 | 63.3 | 71.3 | 76.2 |
| ABench | 75.5 | 74.5 | 71.1 | 75.2 | 67.5 | 71.3 | 76.6 |
| MTVQA | 24.9 | 20.2 | 22.6 | 21.1 | 21.7 | 23.7 | 25.6 |

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B-MPO | MiniCPM-o-2.6 | Ovis1.6-9B | InternVL2.5-4B-MPO | Ovis2-4B | Ovis2-8B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 82.6 | 82.0 | 80.6 | 80.5 | 77.8 | 81.4 | 83.6 |
| MMStar | 64.1 | 65.2 | 63.3 | 62.9 | 61 | 61.9 | 64.6 |
| MMMU (val) | 56.2 | 54.8 | 50.9 | 55 | 51.8 | 49.0 | 57.4 |
| MathVista (testmini) | 65.8 | 67.9 | 73.3 | 67.3 | 64.1 | 69.6 | 71.8 |
| HallusionBench | 56.3 | 51.7 | 51.1 | 52.2 | 47.5 | 53.8 | 56.3 |
| AI2D | 84.1 | 84.5 | 86.1 | 84.4 | 81.5 | 85.7 | 86.6 |
| OCRBench | 87.7 | 88.2 | 88.9 | 83 | 87.9 | 91.1 | 89.1 |
| MMVet | 66.6 | 68.1 | 67.2 | 65 | 66 | 65.5 | 65.1 |
| MMBench (test) | 83.4 | 83.2 | 83.2 | 82.7 | 79.6 | 83.2 | 84.9 |
| MMT-Bench (val) | 62.7 | 62.5 | 62.3 | 64.9 | 61.6 | 65.2 | 66.6 |
| RealWorldQA | 68.8 | 71.1 | 68.0 | 70.7 | 64.4 | 71.1 | 72.5 |
| BLINK | 56.1 | 56.6 | 53.9 | 48.5 | 50.6 | 53.0 | 54.3 |
| QBench | 77.9 | 73.8 | 78.7 | 76.7 | 71.5 | 78.1 | 78.9 |
| ABench | 75.6 | 77.0 | 77.5 | 74.4 | 75.9 | 77.5 | 76.4 |
| MTVQA | 28.5 | 27.2 | 23.1 | 19.2 | 28 | 29.4 | 29.7 |

| Benchmark | Qwen2.5-VL-72B | InternVL2.5-38B-MPO | InternVL2.5-26B-MPO | Ovis1.6-27B | LLaVA-OV-72B | Ovis2-16B | Ovis2-34B |
|---|---|---|---|---|---|---|---|
| MMBench-V1.1 (test) | 87.8 | 85.4 | 84.2 | 82.2 | 84.4 | 85.6 | 86.6 |
| MMStar | 71.1 | 70.1 | 67.7 | 63.5 | 65.8 | 67.2 | 69.2 |
| MMMU (val) | 67.9 | 63.8 | 56.4 | 60.3 | 56.6 | 60.7 | 66.7 |
| MathVista (testmini) | 70.8 | 73.6 | 71.5 | 70.2 | 68.4 | 73.7 | 76.1 |
| HallusionBench | 58.8 | 59.7 | 52.4 | 54.1 | 47.9 | 56.8 | 58.8 |
| AI2D | 88.2 | 87.9 | 86.2 | 86.6 | 86.2 | 86.3 | 88.3 |
| OCRBench | 88.1 | 89.4 | 90.5 | 85.6 | 74.1 | 87.9 | 89.4 |
| MMVet | 76.7 | 72.6 | 68.1 | 68 | 60.6 | 68.4 | 77.1 |
| MMBench (test) | 88.2 | 86.4 | 85.4 | 84.6 | 85.6 | 87.1 | 87.8 |
| MMT-Bench (val) | 69.1 | 69.1 | 65.7 | 68.2 | - | 69.2 | 71.2 |
| RealWorldQA | 75.9 | 74.4 | 73.7 | 72.7 | 73.9 | 74.1 | 75.6 |
| BLINK | 62.3 | 63.2 | 62.6 | 48 | - | 59.0 | 60.1 |
| QBench | - | 76.1 | 76.0 | 77.7 | - | 79.5 | 79.8 |
| ABench | - | 78.6 | 79.4 | 76.5 | - | 79.4 | 78.7 |
| MTVQA | - | 31.2 | 28.7 | 26.5 | - | 30.3 | 30.6 |

Video Benchmarks

| Benchmark | Qwen2.5-VL-3B | InternVL2.5-2B | InternVL2.5-1B | Ovis2-1B | Ovis2-2B |
|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 61.5/67.6 | 51.9/54.1 | 50.3/52.3 | 48.6/49.5 | 57.2/60.8 |
| MVBench | 67.0 | 68.8 | 64.3 | 60.32 | 64.9 |
| MLVU (M-Avg/G-Avg) | 68.2/- | 61.4/- | 57.3/- | 58.5/3.66 | 68.6/3.86 |
| MMBench-Video | 1.63 | 1.44 | 1.36 | 1.26 | 1.57 |
| TempCompass | 64.4 | - | - | 51.43 | 62.64 |

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B | LLaVA-OV-7B | InternVL2.5-4B | Ovis2-4B | Ovis2-8B |
|---|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 65.1/71.6 | 64.2/66.9 | 58.2/61.5 | 62.3/63.6 | 64.0/66.3 | 68.0/71.6 |
| MVBench | 69.6 | 72.0 | 56.7 | 71.6 | 68.45 | 68.15 |
| MLVU (M-Avg/G-Avg) | 70.2/- | 68.9/- | 64.7/- | 68.3/- | 70.8/4.23 | 76.4/4.25 |
| MMBench-Video | 1.79 | 1.68 | - | 1.73 | 1.69 | 1.85 |
| TempCompass | 71.7 | - | - | - | 67.02 | 69.28 |

| Benchmark | Qwen2.5-VL-72B | InternVL2.5-38B | InternVL2.5-26B | LLaVA-OneVision-72B | Ovis2-16B | Ovis2-34B |
|---|---|---|---|---|---|---|
| VideoMME (wo/w subs) | 73.3/79.1 | 70.7/73.1 | 66.9/69.2 | 66.2/69.5 | 70.0/74.4 | 71.2/75.6 |
| MVBench | 70.4 | 74.4 | 75.2 | 59.4 | 68.6 | 70.3 |
| MLVU (M-Avg/G-Avg) | 74.6/- | 75.3/- | 72.3/- | 68.0/- | 77.7/4.44 | 77.8/4.59 |
| MMBench-Video | 2.02 | 1.82 | 1.86 | - | 1.92 | 1.98 |
| TempCompass | 74.8 | - | - | - | 74.16 | 75.97 |

Usage

pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
pip install flash-attn==2.7.0.post2 --no-build-isolation

Ovis2-1B

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()

text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# single-image input
image_path = '/data/images/example_1.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = 'Describe the image.'
query = f'<image>\n{text}'

## cot-style input
# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
# image_path = '/data/images/example_1.jpg'
# images = [Image.open(image_path)]
# max_partition = 9
# text = "What's the area of the shape?"
# query = f'<image>\n{text}\n{cot_suffix}'

## multiple-images input
# image_paths = [
#     '/data/images/example_1.jpg',
#     '/data/images/example_2.jpg',
#     '/data/images/example_3.jpg'
# ]
# images = [Image.open(image_path) for image_path in image_paths]
# max_partition = 4
# text = 'Describe each image.'
# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text

## video input (requires `pip install moviepy==1.0.3`)
# from moviepy.editor import VideoFileClip
# video_path = '/data/videos/example_1.mp4'
# num_frames = 12
# max_partition = 1
# text = 'Describe the video.'
# with VideoFileClip(video_path) as clip:
#     total_frames = int(clip.fps * clip.duration)
#     if total_frames <= num_frames:
#         sampled_indices = range(total_frames)
#     else:
#         stride = total_frames / num_frames
#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
# images = frames
# query = '\n'.join(['<image>'] * len(images)) + '\n' + text

## text-only input
# images = []
# max_partition = None
# text = 'Hello'
# query = text

# format conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')

Batch Inference

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# preprocess inputs
batch_inputs = [
    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
    ('/data/images/example_3.jpg', 'Is there any text in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
                                                  padding_value=0.0).flip(dims=[1])
batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
                                                       batch_first=True, padding_value=False).flip(dims=[1])
batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]

# generate outputs
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
                                **gen_kwargs)

for i in range(len(batch_inputs)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output {i + 1}:\n{output}\n')

References

Code: https://github.com/AIDC-AI/Ovis

Model (Hugging Face): https://huggingface.co/AIDC-AI/Ovis2-34B

Model (ModelScope): https://modelscope.cn/collections/Ovis2-1e2840cb4f7d45

Demo:https://huggingface.co/spaces/AIDC-AI/Ovis2-16B

arXiv: https://arxiv.org/abs/2405.20797
