Multimodal Information Retrieval and Data Extraction with LlamaIndex and LLaVa

Latest recommended article published on 2024-09-15 22:31:42

qq_37836323

Tags: python, artificial intelligence, computer vision

Article link: https://blog.csdn.net/qq_29929123/article/details/140969102

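This article shows how to use LlamaIndex with the LLaVa model for multimodal information retrieval and data extraction. We first extract the information in an image into a structured object, and then walk through a retrieval-augmented pipeline for generating image descriptions.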

Environment Setup

First, install the required Python packages (the `!` prefix is notebook syntax; drop it when running in a shell):

# Install the required libraries
!pip install llama-index-multi-modal-llms-ollama
!pip install llama-index-readers-file
!pip install unstructured
!pip install llama-index-embeddings-huggingface
!pip install llama-index-vector-stores-qdrant
!pip install llama-index-embeddings-clip

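Structured Data Extraction

We show how to use the LLaVa model to extract the information in an image into a structured Pydantic object.

Loading the Data

First, we load an image of a fried chicken advertisement and display it: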

from pathlib import Path
from llama_index.core import SimpleDirectoryReader
from PIL import Image
import matplotlib.pyplot as plt

# Create the image directory if it does not exist yet
input_image_path = Path("restaurant_images")
input_image_path.mkdir(exist_ok=True)

# Download the image
!wget "https://docs.google.com/uc?export=download&id=1GlqcNJhGGbwLKjJK1QJ_nyswCTQ2K2Fq" -O ./restaurant_images/fried_chicken.png

# Load the image documents
image_documents = SimpleDirectoryReader("./restaurant_images").load_data()

# Display the image
image_path = "./restaurant_images/fried_chicken.png"
image = Image.open(image_path).convert("RGB")
plt.figure(figsize=(16, 5))
plt.imshow(image)
plt.show()

Defining the Data Model

We define a Restaurant data model to hold the information extracted from the image:

from pydantic import BaseModel

class Restaurant(BaseModel):
    """Data model for a restaurant."""
    restaurant: str
    food: str
    discount: str
    price: str
    rating: str
    review: str

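Extracting the Information

We instantiate MultiModalLLMCompletionProgram and use it to extract the information in the image:

from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.multi_modal_llms.ollama import OllamaMultiModal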

prompt_template_str = """\
{query_str}

Return the answer as a Pydantic object. The Pydantic schema is given below:

"""
mm_program = MultiModalLLMCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(Restaurant),
    image_documents=image_documents,
    prompt_template_str=prompt_template_str,
    multi_modal_llm=OllamaMultiModal(model="llava:13b"),
    verbose=True,
)

response = mm_program(query_str="Can you summarize what is in the image?")
for res in response:
    print(res)
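
To make the parsing step more concrete, here is a minimal sketch of what turning free-form model output into structured fields involves. It uses only the standard library and is not LlamaIndex's actual PydanticOutputParser implementation; the helper names `extract_json_object` and `missing_fields`, and the sample model output, are hypothetical and invented for illustration.

```python
import json

# Field names mirror the Restaurant model defined earlier in this article.
REQUIRED_FIELDS = ("restaurant", "food", "discount", "price", "rating", "review")


def extract_json_object(raw_text: str) -> dict:
    """Return the first decodable JSON object embedded in raw_text."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(raw_text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _ = decoder.raw_decode(raw_text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


def missing_fields(data: dict) -> list:
    """List the required fields that are absent from the parsed output."""
    return [name for name in REQUIRED_FIELDS if name not in data]


# Hypothetical model output, invented for illustration only.
sample_output = (
    'Here is the result: {"restaurant": "Example Diner", "food": "fried chicken", '
    '"discount": "10%", "price": "$5", "rating": "4"} Hope that helps.'
)

parsed = extract_json_object(sample_output)
print(missing_fields(parsed))  # ['review']
```

Checking for missing fields before building the final object is one way to detect that the model skipped part of the schema, in which case the query can be retried.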

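Retrieval-Augmented Image Description Generation

Next, we show how to use retrieval augmentation to generate an image description.

Loading the Data and Building the Vector Index

# Download the data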
!wget "https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1" -O tesla_2021_10k.htm
!wget "https://docs.google.com/uc?export=download&id=1THe1qqM61lretr9N3BmINc_NWDvuthYf" -O shanghai.jpg

from llama_index.readers.file import UnstructuredReader
from llama_index.core.schema import ImageDocument
from llama_index.core import VectorStoreIndex
from llama_index.core.embeddings import resolve_embed_model

# Load the text and image documents
loader = UnstructuredReader()
documents = loader.load_data(file=Path("tesla_2021_10k.htm"))
image_doc = ImageDocument(image_path="./shanghai.jpg")

embed_model = resolve_embed_model("local:BAAI/bge-m3")
vector_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
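# as_query_engine() answers with the globally configured LLM (Settings.llm), so set one up before querying
query_engine = vector_index.as_query_engine()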
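
As a rough illustration of what a vector index does at query time, the sketch below ranks text chunks by cosine similarity to a query vector. It is a conceptual toy with hand-written three-dimensional vectors, not how VectorStoreIndex is implemented; the function names, the chunk texts, and all vector values are assumptions made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the texts of the k chunks most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "index": (chunk text, embedding) pairs with invented values.
toy_index = [
    ("chunk about factories", [0.9, 0.1, 0.0]),
    ("chunk about finances", [0.1, 0.9, 0.0]),
    ("chunk about legal risks", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_index, k=1))  # ['chunk about factories']
```

In the real pipeline the embedding model produces the vectors, and the retrieved chunks are passed to the LLM as context for the answer.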

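Building the Query Pipeline and Running the Query

from llama_index.core.prompts import PromptTemplate
from llama_index.core.query_pipeline import QueryPipeline
from llama_index.multi_modal_llms.ollama import OllamaMultiModal

# Multimodal model used as the first stage of the pipeline
mm_model = OllamaMultiModal(model="llava:13b")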

query_prompt_str = """\
Please expand the initial statement using the provided context from the Tesla 10K report.

{initial_statement}

"""
query_prompt_tmpl = PromptTemplate(query_prompt_str)

# Build the query pipeline: multimodal model -> query prompt -> query engine
qp = QueryPipeline(
    modules={
        "mm_model": mm_model.as_query_component(partial={"image_documents": [image_doc]}),
        "query_prompt": query_prompt_tmpl,
        "query_engine": query_engine,
    },
    verbose=True,
)
qp.add_chain(["mm_model", "query_prompt", "query_engine"])
rag_response = qp.run("Which Tesla Factory is shown in the image?")

print(f"> Retrieval Augmented Response: {rag_response}")
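
The data flow of the chain above can be sketched with plain functions: each stage receives the previous stage's output. This is only an illustration under stated assumptions, not the QueryPipeline implementation; `run_chain`, `fake_mm_model`, `fill_prompt`, and `fake_query_engine` are hypothetical stand-ins that return placeholder strings rather than calling any model.

```python
QUERY_PROMPT = (
    "Please expand the initial statement using the provided context "
    "from the Tesla 10K report.\n\n{initial_statement}\n"
)


def fake_mm_model(question: str) -> str:
    """Stand-in for the multimodal model: returns a placeholder description."""
    return f"[image description answering: {question}]"


def fill_prompt(initial_statement: str) -> str:
    """Stand-in for the prompt template stage."""
    return QUERY_PROMPT.format(initial_statement=initial_statement)


def fake_query_engine(prompt: str) -> str:
    """Stand-in for the query engine: reports what it was asked."""
    return f"[answer grounded in retrieved context for a {len(prompt)}-char prompt]"


def run_chain(steps, initial_input):
    """Feed the output of each step into the next one, in order."""
    value = initial_input
    for step in steps:
        value = step(value)
    return value


result = run_chain(
    [fake_mm_model, fill_prompt, fake_query_engine],
    "Which Tesla Factory is shown in the image?",
)
print(result)
```

The point of the ordering is that the image is interpreted first, and only the resulting text statement is used to retrieve and expand on context from the report.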