Quickly Learn to Build a Multimodal RAG Question-Answering System on Milvus in One Article

Latest recommended article published on 2025-05-06 10:14:13

Knoka705

Article tags: milvus

Article link: https://blog.csdn.net/qq_61897309/article/details/147705318

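0. System Goal

The goal is to build a multimodal RAG system that can ingest data in several modalities, such as PDF, Word, Excel, and images (structured data only), retrieve it through conversation, then organize and return the results, which can include both images and text.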

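1. Server Preparation

We use Milvus as the foundation of the RAG system. It should already be deployed on a cloud server, with the connection tested successfully. See my previous tutorial for the setup; it is not repeated here.

Learn to Configure the Milvus Vector Database on a Cloud Server in One Article - Start Learning AI Today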

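2. Building the Multimodal RAG

2.1 Model Preparation

First we need an embedding model. Because the goal is a multimodal RAG, the model has to embed both text and images. The model chosen here is jina-clip-v2, a general-purpose multilingual multimodal embedding model. It builds on jina-clip-v1 and jina-embeddings-v3 and brings several key improvements: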

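Better performance: v2 improves on v1 by 3% in both text-image and text-text retrieval. As with v1, the v2 text encoder also works well for multilingual long-text dense retrieval, with performance comparable to jina-embeddings-v3, Jina's most advanced model at the time and the best multilingual embedding model under 1B parameters on the MTEB leaderboard;
Multilingual support: with jina-embeddings-v3 as the text tower, jina-clip-v2 supports multilingual image retrieval in 89 languages and outperforms nllb-clip-large-siglip by 4% on that task;
Higher image resolution: v2 accepts 512x512 input images, up from 224x224 in v1, which lets it capture more image detail, extract features more precisely, and recognize fine-grained visual elements more accurately;
Variable output dimensions: jina-clip-v2 uses Matryoshka Representation Learning (MRL), so setting the dimensions parameter yields vectors of the requested size, reducing storage cost while keeping strong performance.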
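
To make the variable-dimension idea concrete, here is a minimal sketch of what shortening an MRL-style embedding amounts to: keep the leading components and rescale to unit length. This is not the model's own API; the helper names and the toy 8-dimensional vectors are assumptions for illustration only, standing in for real 1024-dimensional embeddings.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (not produced by jina-clip-v2).
full_a = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
full_b = [0.8, 0.2, 0.25, 0.1, 0.01, 0.04, 0.03, 0.02]

short_a = truncate_and_normalize(full_a, 4)
short_b = truncate_and_normalize(full_b, 4)

# Compare similarity before and after truncation.
print(len(short_a), round(cosine(full_a, full_b), 3), round(cosine(short_a, short_b), 3))
```

Whatever dimension is chosen, the dim of the Milvus vector field has to match it, since a collection stores vectors of one fixed size.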

First install the modelscope library, then pull the models. Make sure the model paths are set correctly (xlm-roberta-flash-implementation may need to be moved manually into .cache/huggingface/modules/transformers_modules/xlm-roberta-flash-implementation).

pip install modelscope
modelscope download --model jinaai/jina-clip-v2 --local_dir ./jinaai/jina-clip-v2
modelscope download --model jinaai/jina-clip-implementation --local_dir ./jinaai/jina-clip-implementation
modelscope download --model jinaai/jina-embeddings-v3 --local_dir ./jinaai/jina-embeddings-v3
modelscope download --model jinaai/xlm-roberta-flash-implementation --local_dir ./jinaai/xlm-roberta-flash-implementation

2.2 Building the Vector Database

We use Python here. First import the pymilvus library and connect to the database through connections, as follows:

from pymilvus import utility, Collection
from pymilvus import CollectionSchema, FieldSchema, DataType
from pymilvus import connections

# Try to connect to the Milvus server
try:
    connections.connect(host='111.6.167.30', port=19530)
    print(f"Connected to the Milvus server on port {19530}")
except Exception as e:
    print(f"Connection failed: {e}")
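
The connection above hardcodes the server address. As one possible alternative, the sketch below reads the host and port from environment variables and passes them to connections.connect. The variable names MILVUS_HOST and MILVUS_PORT and both helper functions are hypothetical, chosen only for this illustration.

```python
import os


def milvus_connection_params(env=None):
    """Build connect() keyword arguments from environment-style settings.

    MILVUS_HOST and MILVUS_PORT are hypothetical variable names.
    """
    env = os.environ if env is None else env
    host = env.get("MILVUS_HOST", "127.0.0.1")
    port = int(env.get("MILVUS_PORT", "19530"))
    return {"alias": "default", "host": host, "port": port}


def connect_to_milvus(params):
    """Open the connection; pymilvus is imported lazily so the helper above
    can be used without the library installed."""
    from pymilvus import connections

    connections.connect(**params)
    return params["alias"]


# Example with an explicit mapping instead of the real environment.
params = milvus_connection_params({"MILVUS_HOST": "111.6.167.30"})
print(params)
```

Calling connect_to_milvus(params) then performs the same connection as the snippet above, without the address appearing in the source file.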

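Next, create two fields, a primary key field and a vector field:

Primary key field: an auto-increment primary key named id that uniquely identifies each record, so no value needs to be supplied manually;
Vector field: a float vector field named vector with dimension 1024 (determined by the model) that stores the text/image embeddings.

Dynamic fields are also turned on through enable_dynamic_field, which allows fields not defined in the schema (such as text for text content or image_url for an image address) to be added at insert time.

The collection is named multimodal_rag_demo and is created with shards_num=2 to improve concurrency for storage and queries. The index type is IVF_FLAT, a clustering-based inverted index that suits large datasets (millions of vectors and up) and speeds up search by partitioning vectors into clusters. Setting nlist=1024 splits the data into 1024 clusters to balance search speed and accuracy, and metric_type=COSINE uses cosine similarity to measure how similar two vectors are.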

# Define the collection parameters
collection_name = "multimodal_rag_demo"

id_field = FieldSchema(
    name="id",
    dtype=DataType.INT64,
    is_primary=True,
    auto_id=True
)

vector_field = FieldSchema(
    name="vector",
    dtype=DataType.FLOAT_VECTOR,
    dim=1024
)

# Create the collection schema (dynamic fields enabled)
schema = CollectionSchema(
    fields=[id_field, vector_field],
    description="Multimodal RAG vector store",
    enable_dynamic_field=True  # allow extra fields on insert (e.g. text/image_url)
)

# Create the collection
collection = Collection(
    name=collection_name,
    schema=schema,
    shards_num=2
)

# Create the index
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}

collection.create_index(
    field_name="vector",
    index_params=index_params
)

print("Index created successfully")
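
The clustering idea behind IVF_FLAT can be sketched in a few lines of plain Python. This is a toy illustration, not how Milvus is implemented: the centroids are hand-picked rather than learned, the data is 2-dimensional, and the function names are made up for this example. The number of buckets plays the role of nlist, and nprobe is the number of closest buckets scanned at query time.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its most similar centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
        buckets[best].append(idx)
    return buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets closest to the query, then rank candidates."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: cosine(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:nprobe] for i in buckets[c]]
    scored = sorted(((cosine(query, vectors[i]), i) for i in candidates), reverse=True)
    return [(i, score) for score, i in scored[:top_k]]


# Hand-picked centroids and toy data (assumptions for illustration).
centroids = [[1.0, 0.0], [0.0, 1.0]]
vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]

buckets = build_ivf(vectors, centroids)
hits = ivf_search([1.0, 0.2], vectors, centroids, buckets, nprobe=1, top_k=2)
print(buckets)
print(hits)
```

With nprobe=1 only one bucket is scanned, so the search touches half of the toy data. Raising nprobe scans more buckets, which costs time but lowers the chance of missing a close vector that landed in a neighboring cluster.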

Once the collection is created, we can test inserting data, for example:

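from pymilvus.exceptions import MilvusException

# Insert data (sample rows)
data_to_insert = [
    {"text": "sample text 1", "vector": [0.1]*1024},
    {"image_url": "http://example.com/img1.jpg", "vector": [0.2]*1024},
    # add more rows...
]

try:
    insert_result = collection.insert(data_to_insert)
    print(f"Insert succeeded, rows inserted: {insert_result.insert_count}")
except MilvusException as e:
    print(f"Insert failed: {e}")

# Flush so the inserted rows are persisted and counted by num_entities
collection.flush()

# Load the collection
collection.load()
print("Collection loaded")

# Verify the data
print(f"Rows in collection: {collection.num_entities}")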
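
Once the collection is loaded, it can also be queried. The following is a minimal sketch, assuming the collection object created earlier with its COSINE IVF_FLAT index on the vector field. The helper name search_similar and the default nprobe value of 16 are choices made for this example, not part of the original tutorial.

```python
def search_similar(collection, query_vector, top_k=3, nprobe=16):
    """Return the ids and similarity scores of the top_k nearest rows.

    `collection` is expected to be a loaded pymilvus Collection.
    """
    search_params = {"metric_type": "COSINE", "params": {"nprobe": nprobe}}
    results = collection.search(
        data=[query_vector],      # one query vector, so one result list
        anns_field="vector",      # the vector field defined in the schema
        param=search_params,
        limit=top_k,
    )
    # With COSINE, a larger distance value means a closer match.
    return [{"id": hit.id, "score": hit.distance} for hit in results[0]]
```

For instance, search_similar(collection, [0.1]*1024) should rank the first sample row highest. To get the stored text or image_url back with each hit, the search call can additionally be given output_fields with those dynamic field names.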

3. Inserting Data into the Multimodal RAG

First install the following dependencies:

pip install transformers torch timm

Then use the code below to load and instantiate the embedding model. It defines methods for text embedding and image embedding.

from transformers import AutoModel
from pymilvus import MilvusClient
import torch
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
# Define Encoder class to handle text and image embedding generation
class Encoder:
    def __init__(self):
        # Initialize the model (AutoModel from transformers instead of SentenceTransformer)
        # self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
        self.model = AutoModel.from_pretrained('./jinaai/jina-clip-v2', trust_remote_code=True)

    def encode_text(self, text: list[str]) -> list[float]:     # Generate embeddings for text only
        with torch.no_grad():
            text_emb = self.model.encode_text(text)
        return text_emb
    def encode_image(self, image_urls: list[str]) -> lis