MiniCPM-V 4.5 Model Analysis

Model Introduction

MiniCPM-V 4.5 is the latest multimodal large model in the MiniCPM-V series, offering leading vision-language understanding and generation capabilities. Built on Qwen3-8B and SigLIP2-400M for a total of 8B parameters, it performs strongly on a range of authoritative benchmarks: its average score of 77.0 on the OpenCompass evaluation surpasses mainstream closed-source models such as GPT-4o-latest and Gemini 2.0 Pro, as well as much larger open-source models such as Qwen2.5-VL 72B, making it one of the strongest multimodal language models under 30B parameters.

MiniCPM-V 4.5 also makes a major leap in video understanding. Its unified 3D-Resampler efficiently handles high-frame-rate and long videos, compressing video tokens by up to 96x. This yields state-of-the-art results on video understanding benchmarks such as Video-MME and LVBench while keeping inference cost low.
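The 96x figure can be sanity-checked with simple arithmetic. The official model card describes the 3D-Resampler jointly compressing groups of up to 6 video frames of 448×448 pixels into 64 tokens; the exact grouping is an internal detail of the resampler, so the numbers below are only illustrative.

frames, frame_size, patch_size, tokens_out = 6, 448, 14, 64
raw_tokens = frames * (frame_size // patch_size) ** 2   # 6 * 32 * 32 = 6144 ViT patches
print(raw_tokens / tokens_out)                          # 96.0x compression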

The model supports a fast/deep dual-mode reasoning scheme that users can switch between according to task complexity, balancing efficiency and quality. In addition, MiniCPM-V 4.5 inherits the LLaVA-UHD architecture and handles images up to 1344×1344 pixels, which makes it strong at OCR and document parsing; on OCRBench and OmniDocBench it even outperforms GPT-4o-latest and Gemini 2.5. The model also supports more than 30 languages and features enhanced trustworthy behavior.
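As a rough illustration of the high-resolution path (assuming the LLaVA-UHD-style slicing reflected in the slice_config shown later, plus one global thumbnail; the actual partitioning is decided by the model's image processor), a 1344×1344 input decomposes into 448×448 slices as follows:

image_side, scale_resolution, query_num = 1344, 448, 64
slices = (image_side // scale_resolution) ** 2        # 3 * 3 = 9 slices (matches max_slice_nums)
visual_tokens = (slices + 1) * query_num              # slices plus one global thumbnail
print(slices, visual_tokens)                          # 9 640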

MiniCPM-V 4.5 ships with a rich set of deployment options: local CPU inference with llama.cpp and Ollama, multiple quantization formats (int4/GGUF/AWQ), high-throughput serving via SGLang and vLLM, and fine-tuning with Transformers and LLaMA-Factory. A local WebUI demo, an optimized iOS app, and an online server demo are also available out of the box.
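As a hedged sketch of the server-based route: once an OpenAI-compatible endpoint is running (e.g., via vLLM or SGLang), it can be queried with the standard openai client. The base_url, port, and served model name below are assumptions; consult the official deployment docs for the exact launch command and the multimodal request format for image inputs.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4_5",                      # served model name is an assumption
    messages=[{"role": "user", "content": "Introduce yourself briefly."}],
)
print(resp.choices[0].message.content)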


Model Performance

(Figure: benchmark comparison results)
Inference Efficiency

(Figure: inference efficiency comparison)

Model Loading

import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('OpenBMB/MiniCPM-V-4_5', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('OpenBMB/MiniCPM-V-4_5', trust_remote_code=True)

2025-09-03 16:54:48,126 - modelscope - INFO - Got 1 files, start to download ...
Downloading [README.md]: 100%|██████████| 24.2k/24.2k [00:00<00:00, 36.2kB/s]
Processing 1 items: 100%|██████████| 1.00/1.00 [00:00<00:00, 1.43it/s]
2025-09-03 16:54:48,835 - modelscope - INFO - Download model 'OpenBMB/MiniCPM-V-4_5' successfully.
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 18.10it/s]

Model Structure

model
MiniCPMV(
  (llm): Qwen3ForCausalLM(
    (model): Qwen3Model(
      (embed_tokens): Embedding(151748, 4096)
      (layers): ModuleList(
        (0-35): 36 x Qwen3DecoderLayer(
          (self_attn): Qwen3Attention(
            (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3MLP(
            (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
            (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
            (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
          (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        )
      )
      (norm): Qwen3RMSNorm((4096,), eps=1e-06)
      (rotary_emb): Qwen3RotaryEmbedding()
    )
    (lm_head): Linear(in_features=4096, out_features=151748, bias=False)
  )
  (vpm): SiglipVisionTransformer(
    (embeddings): SiglipVisionEmbeddings(
      (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
      (position_embedding): Embedding(4900, 1152)
    )
    (encoder): SiglipEncoder(
      (layers): ModuleList(
        (0-26): 27 x SiglipEncoderLayer(
          (self_attn): SiglipAttention(
            (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
            (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
            (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
            (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
          )
          (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          (mlp): SiglipMLP(
            (activation_fn): PytorchGELUTanh()
            (fc1): Linear(in_features=1152, out_features=4304, bias=True)
            (fc2): Linear(in_features=4304, out_features=1152, bias=True)
          )
          (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
  )
  (resampler): Resampler(
    (kv_proj): Linear(in_features=1152, out_features=4096, bias=False)
    (attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=4096, out_features=4096, bias=True)
    )
    (ln_q): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
    (ln_kv): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
    (ln_post): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
  )
)
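A quick way to see how the roughly 8B parameters are split across the three sub-modules above (the Qwen3 LLM, the SigLIP vision encoder, and the resampler) is to count them directly, assuming the model has been loaded as in the previous section:

def count_params(module):
    return sum(p.numel() for p in module.parameters())

for name in ("llm", "vpm", "resampler"):
    print(name, f"{count_params(getattr(model, name)) / 1e9:.2f}B")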

Model Configuration

model.config
MiniCPMVConfig {
  "architectures": [
    "MiniCPMV"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_minicpm.MiniCPMVConfig",
    "AutoModel": "modeling_minicpmv.MiniCPMV",
    "AutoModelForCausalLM": "modeling_minicpmv.MiniCPMV"
  },
  "batch_3d_resampler": true,
  "batch_vision_input": true,
  "bos_token_id": 151643,
  "drop_vision_last_layer": false,
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_size": 448,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 40960,
  "max_window_layers": 36,
  "model_type": "minicpmv",
  "num_attention_heads": 32,
  "num_hidden_layers": 36,
  "num_key_value_heads": 8,
  "patch_size": 14,
  "query_num": 64,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "slice_config": {
    "max_slice_nums": 9,
    "model_type": "minicpmv",
    "patch_size": 14,
    "scale_resolution": 448
  },
  "slice_mode": true,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "use_image_id": true,
  "use_sliding_window": false,
  "version": 4.5,
  "vision_batch_size": 16,
  "vision_config": {
    "attention_dropout": 0.0,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1152,
    "image_size": 980,
    "intermediate_size": 4304,
    "layer_norm_eps": 1e-06,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_channels": 3,
    "num_hidden_layers": 27,
    "patch_size": 14
  },
  "vocab_size": 151748
}
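The configuration values fit together: with patch_size 14 and scale_resolution 448, each image slice yields 1024 ViT patches, which the resampler compresses to query_num = 64 tokens. A small illustrative calculation (not an official formula):

scale_resolution, patch_size, query_num = 448, 14, 64
patches_per_slice = (scale_resolution // patch_size) ** 2   # 32 * 32 = 1024
print(f"{patches_per_slice} ViT patches -> {query_num} resampler tokens per slice")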

Model Inference

(Figure: the input image used in the chat example below)

image = Image.open('image.png').convert('RGB')

enable_thinking = False  # If `enable_thinking=True`, deep thinking mode is enabled.
stream = True            # If `stream=True`, the answer is a generator yielding text chunks; if False, a string.

# First round chat 
question = "text in the image?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking,
    stream=stream
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')
Here is the text from the image:

When it comes to retailing industry, we often remind the both part of realistic store and internet shopping. Both of them are all have its pros and cons, but according the picture, we can find out both of the internet sales counting and its profit are all grown up every years between twenty eighteen to twenty twenty one. The years internet rate began with twenty eighteen only 10.3%, next year 14.8% and the next 20.3%, finally finished in twenty twenty one up to 24.5%. The sales profit also began with twenty eighteen only 25.17 million, next year 28.93, and the next 34.56, finally finished in twenty twenty one up to 43.03. Therefore, we can find out the internet shopping is grown up between the four years. Beside 2019, I according my observed, more of my friends change to internet shopping because of COVID-19. All above results provided the picture is to realistic. In my opinion, shopping on the internet can save many times to me, so I also do it when I
# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [generated_text]})
msgs.append({"role": "user", "content": ["keep going."]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    stream=True
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')
Sure! Here is the rest of the text from the image:

need. In my opinion, shopping on the internet can save many times to me, so I also do it when I want to buy something.

Let me know if you need any more help with this or anything else!
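For harder questions, deep thinking mode can be switched on through the enable_thinking flag introduced earlier. A minimal sketch, using a non-streaming call so the whole answer is returned as a string (the question text is just an example):

answer = model.chat(
    msgs=[{'role': 'user', 'content': [image, "Summarize the chart and its trend."]}],
    tokenizer=tokenizer,
    enable_thinking=True,   # deep thinking mode
    stream=False
)
print(answer)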