端侧多模态大模型再升级，只有968M！实测效果咋样

最新推荐文章于 2025-04-07 22:12:52 发布

AI码上来

最新推荐文章于 2025-04-07 22:12:52 发布

阅读量1.4k

点赞数 11

分类专栏： AI实战文章标签：人工智能 linux 开源

本文链接：https://blog.csdn.net/u010522887/article/details/144149525

版权

AI实战专栏收录该内容

82 篇文章

订阅专栏

前有 GPT-4o 一骑绝尘，各大厂商的多模态大模型也在遍地开花，然而这些模型都需要在云端部署。

今天看到一款开源的多模态模型 Omnivision，参数量只有968M，应该是当前最小的多模态模型。

Omnivision 由创业公司 Nexa AI 发布，这家公司的使命就是开发端侧 AI 模型，所谓端侧模型，就是无需云端部署，直接跑在本地设备，具有成本低、隐私保护的优势。

实测效果咋样？

今日分享，带大家一探 Omnivision 的究竟。

1.Omnivision 亮点

官方仓库：https://huggingface.co/NexaAIDev/omnivision-968M

官方更新迭代非常迅速，不到一个月，v2 版本已上线，部分指标甚至接近 Qwen2-VL-2B。

对于小模型而言，处理复杂任务的能力非常有限，为此官方进行了大量优化，v2 版的突出亮点有：

增强的艺术描述
更好的复杂图像理解
改进的动漫识别
更准确的颜色和细节检测
扩展的世界知识

为啥模型可以做到这么小？

Omnivision 是 LLava 架构，包括三个组件：

基础LLM：采用 Qwen2.5-0.5B-Instruct；
视觉编码器：SigLIP-400M，输入分辨率384x384；
投影层：采用MLP将视觉编码器与语言模型对齐，和 Llava 相比，图像 tokens 减少9倍（729->81）。

先看下官方测试效果：

接下来，我们实测一番。

2. 效果实测

2.1 本地部署

本地使用，官方提供了一个和 Ollama 非常类似的 Nexa-SDK，需安装后在终端使用！

Nexa-SDK 安装使用可参考：https://github.com/NexaAI/nexa-sdk

安装成功后，一键运行：

nexa run omnivision

当然，也可以部署成本地服务：

nexa server

有没有一种似曾相识的感觉？说的就是 Ollama …

2.2 API 调用

当然，如果不想本地部署，只想体验一番，可前往官方体验地址：https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo

在 app 中可以找到 API 地址：wss://nexa-omni.nexa4ai.com/ws/process-image/。

服务调用使用的是 websocket 协议，Python 中有两种实现：

异步实现：采用 websockets 库

pip install websockets

同步实现：采用 websocket-client 库

pip install websocket-client

官方地址中给出的是异步实现，所以这里给出同步实现代码：

from websocket import create_connection

def process_image_stream(image_path, prompt, max_tokens=512):
    try:
        # Read and convert image to base64
        with Image.open(image_path) as img:
            img = img.convert('RGB')
            buffer = io.BytesIO()
            img.save(buffer, format="JPEG")
            base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')
            
        websocket = create_connection('wss://nexa-omni.nexa4ai.com/ws/process-image/')
        websocket.send(json.dumps({
            "image": f"data:image/jpeg;base64,{base64_image}",
            "prompt": prompt,
            "task": "instruct",  # Fixed to instruct
            "max_tokens": max_tokens
        }))
            
        response = ""
        token_count = 0
        while True:
            message = websocket.recv()
            data = json.loads(message)
            if data['status'] == 'generating':
                if token_count < 3 and data["token"] in [" ", " \n", "\n", "<|im_start|>", "assistant"]:
                    token_count += 1
                    continue
                response += data["token"]
            elif data["status"] == "complete":
                break
            elif data["status"] == "error":
                return f"Error: {data['error']}"
        return response
    except Exception as e:
        return f"Error: {e}"

2.3 效果展示

先来一个简单的任务~

英文提示词的效果还是 OK 的：

不过，换成中文的，呃。。。

再来一个复杂的任务。

比如，拍照识别卡路里，中文提示词如下：

你是营养分析专家，请根据用户上传的食物图片，提供一份详细的营养分析报告，请根据周围物体作为近似分量大小的参考。

要求：结构化输出，包括食物名称、重量、总卡路里、营养成分细分以及健康小提示。不要输出其它任何无关内容。
例子：
食物名称：排骨蒸饭配清炒芽白
- 卡路里：620千卡
- 重量：600克
👩🏻‍🔬 营养博士 👩🏻‍🔬
- 蛋白质：25.5克
- 脂肪：15.5克
- 碳水化合物：85克
健康小提示：清炒芽白富含膳食纤维，可多吃。蒸排骨是优质蛋白质来源，但脂肪含量较高，建议一周吃2 - 3次。白米饭是碳水化合物的主要来源，可以搭配其他粗粮交替食用。

考虑到 omnivision 没有中文理解能力，我们将其翻译成英文输入：

As a nutrition analysis expert, please provide a detailed nutritional analysis report based on the food image uploaded by the user, including the food name, weight, total calories, protein, fat, carbohydrate content, as well as healthy eating advice.

Requirements: Structured output, including food name, weight, total calories, detailed nutritional components, and health tips.

Example:
Food Name: Steamed Rice with Spare Ribs and Stir-fried Bean Sprouts
- Calories: 620 kcal
- Weight: 600 grams
👩🏻‍🔬 Nutrition Expert 👩🏻‍🔬
- Protein: 25.5 grams
- Fat: 15.5 grams
- Carbohydrates: 85 grams
Health Tip: Stir-fried bean sprouts are rich in dietary fiber and can be consumed more frequently. Steamed spare ribs are a good source of protein, but they are high in fat, so it is recommended to consume them 2-3 times a week. White rice is the main source of carbohydrates and can be alternated with other whole grains for consumption.

输出如下：

The food in the image appears to be a meal that includes:

- Steamed Rice with Spare Ribs: This dish is likely to be a balanced meal with protein from the beans and a healthy fat content from the spare ribs. The rice provides a carbohydrate component, and the beans add protein and fiber.
- Stir-fried Vegetables: The vegetables are a source of vitamins, minerals, and fiber, and they appear to be cooked in a sauce, which may add flavor and additional protein.
- Rice with a topping of shredded vegetables: This dish seems to be a variation of a rice dish, with the shredded vegetables served on top, which adds fiber and a variety of nutrients.

The meal is presented on a white plate with a blue rim, and the food is served on white plates, which is a common presentation style for East Asian cuisine. The image does not provide information on the exact weight of the food, but the portion sizes suggest a balanced meal. The overall impression is of a well-prepared, nutritious meal that could be part of a larger healthy eating plan.

可以发现，对于指令遵循，还是差点意思。

换用 gemini-1.5-pro 来试下：

食物名称：清蒸鱼，蒜蓉辣椒酱，白菜，米饭
- 卡路里：约550千卡
- 重量：约500克
👩🏻‍🔬 营养博士 👩🏻‍🔬
- 蛋白质：约30克
- 脂肪：约15克
- 碳水化合物：约70克
健康小提示：清蒸鱼是优质蛋白质和低脂肪的来源，是很健康的烹饪方式。蒜蓉和辣椒酱虽然美味，但也要注意控制钠的摄入。白菜富含维生素和膳食纤维，可以促进肠道健康。米饭是碳水化合物的主要来源，建议适量食用，可以搭配其他粗粮一起食用以获得更均衡的营养。

Nice~

所以，要想更好的指令遵循，目前来看，还得是大模型。

至于端侧，依然任重道远啊。