Multimodal RAG with LlamaIndex + LLaVA (Getting Started)

Most recent recommended article published on 2025-04-02 14:02:10

cycyc123

Tags: artificial intelligence, python, language models, full-text search

Article link: https://blog.csdn.net/cycyc123/article/details/137225998

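This article introduces the basic concepts of RAG (Retrieval-Augmented Generation) and its use in mitigating hallucination in large language models. It focuses on how LlamaIndex loads, indexes, and stores a knowledge base, and on how to build and tune a multimodal RAG system with LlamaIndex, including document splitting, node operations, and multimodal retrieval augmentation.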

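I have been working on RAG recently and found that multimodal question answering combined with a local knowledge base has very broad applications in vertical domains. Part 1 of this article briefly introduces the basics of RAG. Part 2 covers the basics of LlamaIndex (two parts are not covered; please refer to the official documentation for those). Part 3 builds a multimodal RAG system. (Note: every model in this article is deployed locally, which requires a GPU with at least 20 GB of memory!)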

1. RAG Basics

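Put plainly, RAG (Retrieval-Augmented Generation) means generation (G) that is augmented by retrieval (RA). The concept was proposed mainly to address hallucination in large language models (such as ChatGPT), that is, answers that are simply inaccurate. In addition, because an LLM is limited to the knowledge it acquired beforehand, many LLMs cannot answer questions about recent information. RAG was introduced to address both problems. The basic RAG framework is: 1) the user submits a question plus instructions; 2) the input is converted into a query (Q); 3) the local knowledge base is searched for relevant content; 4) the retrieved content is sent to the LLM as augmented context; 5) the LLM generates an answer based on the retrieval-augmented context. (Figure source: "What is RAG? Retrieval-Augmented Generation Explained", AWS)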
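The five steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: retrieval is done by counting shared words instead of comparing embeddings, the knowledge base is a hard-coded list, and the final LLM call is left out. The names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, knowledge_base, top_k=1):
    """Step 3: rank knowledge-base entries by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(knowledge_base,
                    key=lambda doc: len(q & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, contexts):
    """Step 4: combine the retrieved context with the query for the LLM."""
    context_block = "\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuery: {query}\nAnswer: "

# Toy knowledge base (assumed content, for illustration only).
knowledge_base = [
    "LlamaIndex splits documents into nodes before indexing.",
    "Qdrant is a vector database.",
    "LLaVA is a multimodal model that accepts images and text.",
]

query = "Which model accepts images?"           # steps 1-2: question -> query
contexts = retrieve(query, knowledge_base)      # step 3: retrieval
prompt = build_prompt(query, contexts)          # step 4: augmentation
print(prompt)                                   # step 5 would send this to an LLM
```

In a real system the word-overlap scoring is replaced by embedding similarity, which is what the LlamaIndex index and retriever in Part 3 take care of.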

2. LlamaIndex Basics

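So how do we build a knowledge base efficiently and optimize the full RAG pipeline? Current technology stacks include LangChain, LlamaIndex, FlowiseAI, and AutoGen. These stacks streamline the overall RAG workflow, but some of them are not very beginner-friendly, mainly, in my opinion, because they pack in too much. After several attempts I decided to start with LlamaIndex (I later found that it works on much the same principles as LangChain).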

LlamaIndex is a powerful stack built specifically for RAG tasks. Put simply, it breaks the RAG workflow into loading, indexing, storing, querying, and evaluation (a lean skeleton in which every stage is very hands-on).

2.1 Loading

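I think the most important concepts in this part are Documents and Nodes, which are somewhat reminiscent of a knowledge graph. I drew a simple diagram to help explain them, using a PDF file as the example. First, LlamaIndex's built-in tools split the PDF page by page into several Documents. But each Document still contains a lot of content, which is bad for retrieval: whatever is retrieved drags along a lot of irrelevant context and increases token consumption. So LlamaIndex further splits each Document into Nodes, which makes retrieval more fine-grained.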

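The Node concept is very important. A Node can be thought of simply as a chunk. Each chunk holds text, and you can customize both the chunk size and the overlap (which preserves continuity of context across chunks). There is a Hugging Face Space that visualizes the chunk/node concept: https://huggingface.co/spaces/m-ric/chunk_visualizer.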
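To make chunk size and overlap concrete, here is a minimal character-based sketch. It is an assumption-laden simplification: LlamaIndex's own splitters work on tokens and sentence boundaries rather than raw characters, and `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text, chunk_size=20, overlap=5):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # the last window already reached the end
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, overlap=3)
for c in chunks:
    print(c)
```

Each chunk starts with the last three characters of the previous one, which is the "continuity of context" the overlap is meant to provide.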

LlamaIndex provides several node types, including BaseNode, TextNode (stores text), ImageNode (stores an image), and IndexNode (stores only an index reference). Once a Document has been split into chunks/nodes, we can inspect the details of each node; see the Jupyter notebook I shared on GitHub: Llama-index Loading, https://github.com/MMHHRR/urban_llmRAG/blob/main/llama-index/1.%20llamaindex_loading.ipynb. You will see that LlamaIndex assigns each node an id_ (which you can also set yourself), along with metadata, relationships, and text. Note that LlamaIndex attaches relationships to each node, of the following kinds:

SOURCE: The node is the source document.

PREVIOUS: The node is the previous node in the document.

NEXT: The node is the next node in the document.

PARENT: The node is the parent node in the document.

CHILD: The node is a child node in the document.
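The SOURCE, PREVIOUS, and NEXT relationships can be pictured with a toy structure like the one below. This is a sketch only: `ToyNode` and `link_chunks` are hypothetical stand-ins, not LlamaIndex's `TextNode`, and plain strings are used as relationship keys for readability.

```python
from dataclasses import dataclass, field

@dataclass
class ToyNode:
    """Simplified stand-in for a node: an id, some text, and its relationships."""
    id_: str
    text: str
    relationships: dict = field(default_factory=dict)

def link_chunks(doc_id, chunks):
    """Turn the chunks of one document into nodes linked by SOURCE/PREVIOUS/NEXT."""
    nodes = [
        ToyNode(id_=f"{doc_id}-{i}", text=chunk, relationships={"SOURCE": doc_id})
        for i, chunk in enumerate(chunks)
    ]
    # Walk over neighbouring pairs and link them in both directions.
    for prev_node, next_node in zip(nodes, nodes[1:]):
        prev_node.relationships["NEXT"] = next_node.id_
        next_node.relationships["PREVIOUS"] = prev_node.id_
    return nodes

nodes = link_chunks("doc", ["first chunk", "second chunk", "third chunk"])
for node in nodes:
    print(node.id_, node.relationships)
```

The first node has no PREVIOUS entry and the last has no NEXT entry, which is how a reader of the relationships can tell where a document begins and ends.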

2.2 Indexing

Another important concept is indexing; see my GitHub notebook: Llama-index Indexing, https://github.com/MMHHRR/urban_llmRAG/blob/main/llama-index/2.%20llamaindex_Indexing.ipynb. We can create an index directly from Documents or from Nodes. Here is a very simple example:

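from llama_index.core import Document, VectorStoreIndex

document1 = Document(
    text="This is a super-customized document",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex"})
document2 = Document(
    text="Hello world!",
    metadata={
        "file_name": "computersss.txt",
        "category": "RAG",
        "author": "Human"})
## Alternatively, collect multiple Documents in a list: [.., ..]
document = [document1, document2]
document

## Build an index from the Documents. This embeds the text, so an embedding
## model must be configured first (for example via Settings.embed_model).
index = VectorStoreIndex.from_documents(document)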

You can also create an index from Nodes. In addition, LlamaIndex provides management tools that make it easy to insert, delete, update, and refresh nodes. It can also track changes to the nodes:

print(index.ref_doc_info)

2.3 Storing

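You may want to run the storing notebook first: Llama-index Storing, https://github.com/MMHHRR/urban_llmRAG/blob/main/llama-index/3.%20llamaindex_storing.ipynb. Storing supports several storage modes: Documents, Indexes, and Vectors can be kept locally or in an external database (for example Qdrant), with the goal of making it easier to retrieve documents or images. Picture it this way: the data is embedded and placed in a store, and a query is then matched against the store (using cosine similarity or a similar search method). For example, MultiModalVectorStoreIndex() can store multimodal data.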
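The "embed, store, then search by cosine similarity" idea can be sketched without any vector database. The three-dimensional vectors and file names below are made up for illustration; a real store such as Qdrant holds model-produced embeddings with hundreds of dimensions, and `top_k` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "vector store": item name -> made-up embedding.
store = {
    "street.jpg": [0.9, 0.1, 0.0],
    "park.jpg":   [0.1, 0.9, 0.0],
    "bridge.jpg": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, store, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The `similarity_top_k` parameter used later in this article plays the same role as `k` here: it caps how many of the best-scoring entries are returned.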

3. Building a Multimodal RAG System

First, import the packages and set the embedding model for the system. (Note: LlamaIndex changes very quickly and much of the documentation is out of sync; this article uses the latest version as of the time of writing, 2024-4-1!) Second, this article uses Ollama to run the LLM locally, so the corresponding packages need to be installed. I have provided the complete code, so feel free to try it yourself: https://github.com/MMHHRR/urban_llmRAG/blob/main/coda/llamaindex_multimodal.ipynb

pip install llama_index
pip install qdrant_client
pip install llama_index-vector_stores-qdrant
pip install llama_index-embeddings-clip
pip install llama_index-multi_modal_llms-ollama
pip install llama_index-embeddings-huggingface

import qdrant_client
from llama_index.core import (SimpleDirectoryReader,
                               StorageContext,
                               Settings)
from llama_index.core.schema import ImageNode
from llama_index.core.schema import ImageDocument
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.embeddings.clip import ClipEmbedding
from llama_index.multi_modal_llms.ollama import OllamaMultiModal
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

import os
import io 
import json
from PIL import Image 
import matplotlib.pyplot as plt

## Set the embedding model and the LLaVA model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
mm_model = OllamaMultiModal(model="llava")

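Next, parse the original JSON file. Because this is image question answering (multimodal), the JSON stores the text for each image together with the corresponding image path.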

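import json

## Load the metadata file (this is the author's local path; adjust it to your own)
with open('E:/RAG_LLM/metadata.json') as f:
    doc = json.load(f)

document = []
for obj in doc:
    ## Strip quotes and brackets from the caption's string form
    caption = obj.get("Caption")
    captions = str(caption).replace("'", "").replace("[", "").replace("]", "")
    ## The city is the fourth underscore-separated field of the image name
    obj['City'] = str(obj.get("Image Name")).split('_')[3].split('.')[0]
    image_path = obj.get("Relative Path")
    image_paths = str(image_path)

    ## Whatever remains in obj is kept as the node's metadata
    obj.pop("Caption", None)
    obj.pop("Relative Path", None)

    document.append([image_paths, captions, obj])
document[0]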

# print--> 
['./test_img\\1.274615_103.797243_50f561a0fdc9f065f0005614_Singapore.JPG',
 'a car driving down a street next to a large bridge over a highway with a lot of trees',
 {'Image Name': '1.274615_103.797243_50f561a0fdc9f065f0005614_Singapore.JPG',
  'livelier': '4.514021433',
  'more beautiful': '5.15325436',
  'more boring': '4.950422095',
  'more depressing': '6.346440519',
  'safer': '2.17046243',
  'wealthier': '4.583960576',
  'City': 'Singapore'}]
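Judging from the sample output above, the image names appear to follow a `latitude_longitude_id_City.JPG` pattern; that reading is an assumption based on this one record plus the Guadalajara file name used later. Under that assumption, a slightly more defensive way to pull the fields apart might look like this (`parse_image_name` is a hypothetical helper, not part of the original code):

```python
import os

def parse_image_name(image_name):
    """Split 'lat_lon_id_City.ext' into its parts (assumed naming pattern)."""
    stem, _ext = os.path.splitext(image_name)          # drop the final extension only
    lat, lon, image_id, city = stem.split("_", 3)      # city keeps any further underscores
    return {"lat": float(lat), "lon": float(lon), "id": image_id, "city": city}

info = parse_image_name("1.274615_103.797243_50f561a0fdc9f065f0005614_Singapore.JPG")
print(info["city"], info["lat"], info["lon"])
```

Limiting the split to three separators means a city name that itself contains an underscore would stay in one piece, whereas indexing `split('_')[3]` would keep only its first part.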

Now we store each JSON entry in a node and then create the multimodal index.

test_image_nodess =[ImageNode(image_path=p, text=t,metadata=k) for p,t,k in document]
test_image_nodess[0]

## MultiModalVectorStoreIndex can store multimodal vectors
multi_index = MultiModalVectorStoreIndex(test_image_nodess, show_progress=True)

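Next we need to create a retriever. LlamaIndex offers several retrieval modes. On top of the multimodal vector store, we can run multimodal retrieval (images or text), image-to-image retrieval, or text-to-image retrieval.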

## Create a retriever with .as_retriever
urban_retrieve = multi_index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)

query_str = 'a building on the side of street in Singapore?'
img_path = '20.672456_-103.411312_514135aafdc9f04926004ab6_Guadalajara.JPG'

## retrieve_display and plot_images are helper functions that are not defined in this post; see the linked notebook for the full code
img, txt, score, metadata = retrieve_display(urban_retrieve.retrieve(query_str))  ## multimodal retrieval

# img, txt, score, metadata = retrieve_display(urban_retrieve.image_to_image_retrieve(img_path))  ## image-to-image retrieval

# img, txt, score, metadata = retrieve_display(urban_retrieve.text_to_image_retrieve(query_str))  ## text-to-image retrieval

## Collect the retrieved images/text/metadata here
image_documents = [ImageDocument(image_path=img_path)]
for res_img in img:
    image_documents.append(ImageDocument(image_path=res_img))
context_str = "".join(txt)
metadata_str = metadata

## Visualize the results
print(score)
plot_images(img)  ## visualization
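The `retrieve_display` helper is used above without being shown. One plausible shape for it, assuming each retrieval result exposes `.node` and `.score` and that each node carries `text`, `metadata`, and (for images) `image_path`, is sketched below. This is a guess at the helper's behavior, not the author's implementation, and stand-in objects replace real retriever output so that the sketch runs on its own.

```python
from types import SimpleNamespace

def retrieve_display(results):
    """Unpack retrieval results into parallel lists: image paths, texts, scores, metadata."""
    images, texts, scores, metadata = [], [], [], []
    for result in results:
        node = result.node
        image_path = getattr(node, "image_path", None)
        if image_path:                      # text-only nodes have no image to show
            images.append(image_path)
        texts.append(getattr(node, "text", "") or "")
        scores.append(result.score)
        metadata.append(getattr(node, "metadata", {}) or {})
    return images, texts, scores, metadata

# Stand-in results (made-up values) mimicking what a retriever might return.
fake_results = [
    SimpleNamespace(score=0.82, node=SimpleNamespace(
        image_path="./test_img/a.JPG", text="a street", metadata={"City": "Singapore"})),
    SimpleNamespace(score=0.50, node=SimpleNamespace(text="text only", metadata={})),
]

img, txt, score, metadata = retrieve_display(fake_results)
print(img, txt, score, metadata)
```

Returning `txt` as a list of strings matches how the code above joins it into `context_str`, and returning `img` as a list of paths matches how each path is wrapped in an `ImageDocument`.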

Finally, use the multimodal LLM to generate the result or to continue with further Q&A.

## Use a prompt that folds in the retrieved information to generate the answer
## LLM template
qa_tmpl_str = (
    "Given the provided information, including retrieved contents and metadata, "
    "accurately and precisely answer the query without any additional prior knowledge.\n"
    "Please ensure honesty and responsibility, refraining from any racist or sexist remarks.\n"
    "---------------------\n"
    "Context: {context_str}\n"     ## insert the retrieved context here
    "Metadata: {metadata_str} \n"  ## insert the original metadata here
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
query_str = 'a building on the side of street in Singapore?'

## Use '.complete' to invoke the LLM
response = mm_model.complete(
    prompt=qa_tmpl_str.format(
        context_str=context_str,
        metadata_str=metadata_str,
        query_str=query_str,
        ),
    image_documents=image_documents,
    )
print(response)