Multi-Modal Data Processing with GPT4V and LlamaIndex

Latest recommended article published on 2024-07-27 23:02:02

ppoojjj

Article link: https://blog.csdn.net/ppoojjj/article/details/140283956

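In this article, we show how to use LlamaIndex with the new OpenAI GPT4V API to generate structured data from images. The user only needs to specify a Pydantic object that describes the desired output.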

Installing the required libraries

First, install the required libraries:

%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-multi-modal-llms-replicate

Setting the API keys

The API keys need to be set as environment variables:

import os

OPENAI_API_TOKEN = "sk-<your-openai-api-token>"
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

REPLICATE_API_TOKEN = ""  # Your Replicate API token here
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
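
Hard-coding keys in a notebook makes them easy to leak. As an alternative sketch, a key can be read from an environment variable that was exported before the notebook was started, failing early when it is missing. The helper name `require_env` is hypothetical and is not part of LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage, assuming the key was exported in the shell beforehand:
# OPENAI_API_TOKEN = require_env("OPENAI_API_KEY")
```

With this approach the token never appears in the notebook itself, and a missing key is reported before any API call is made.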

Downloading the image

Next, download the image used for testing:

from pathlib import Path

input_image_path = Path("restaurant_images")
# Create the directory if it does not exist yet
input_image_path.mkdir(parents=True, exist_ok=True)

!wget "https://docs.google.com/uc?export=download&id=1GlqcNJhGGbwLKjJK1QJ_nyswCTQ2K2Fq" -O ./restaurant_images/fried_chicken.png
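
A download link may return an HTML error page instead of the image, and the failure would only surface later when the file is opened. The sketch below checks that the saved file starts with the standard 8-byte PNG signature. The helper name `looks_like_png` is hypothetical.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def looks_like_png(path) -> bool:
    """Return True if the file exists and starts with the PNG signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


downloaded_image = Path("restaurant_images") / "fried_chicken.png"
if not looks_like_png(downloaded_image):
    print(f"{downloaded_image} is missing or is not a PNG; check the download.")
```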

Defining the Pydantic class

We define a Pydantic data model describing the information to extract from the restaurant image:

from pydantic import BaseModel

class Restaurant(BaseModel):
    """Data model for a restaurant."""

    restaurant: str
    food: str
    discount: str
    price: str
    rating: str
    review: str
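
To make the role of the Pydantic model concrete, the sketch below builds a `Restaurant` from a JSON string and shows that input with missing fields is rejected. The JSON values are hand-written placeholders, not output from any model, and the class is repeated only so that the snippet runs on its own.

```python
import json

from pydantic import BaseModel, ValidationError


class Restaurant(BaseModel):
    """Data model for a restaurant (same fields as above)."""

    restaurant: str
    food: str
    discount: str
    price: str
    rating: str
    review: str


# Hypothetical JSON, written by hand for illustration only.
sample_json = """
{
  "restaurant": "Example Diner",
  "food": "fried chicken",
  "discount": "10% off",
  "price": "$9.99",
  "rating": "4.5",
  "review": "Crispy and juicy."
}
"""

parsed = Restaurant(**json.loads(sample_json))
print(parsed.food)

# Input that lacks required fields fails validation instead of passing silently.
try:
    Restaurant(**{"restaurant": "Example Diner"})
    validation_failed = False
except ValidationError:
    validation_failed = True
print(validation_failed)
```

Because every field is declared as `str`, values such as the price and rating are kept as text exactly as they appear in the image.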

Loading the OpenAI GPT4V multi-modal model

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import SimpleDirectoryReader

# put your local directory here
image_documents = SimpleDirectoryReader("./restaurant_images").load_data()

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1000
)

Plotting the image

from PIL import Image
import matplotlib.pyplot as plt

image_path = "./restaurant_images/fried_chicken.png"
image = Image.open(image_path).convert("RGB")

plt.figure(figsize=(16, 5))
plt.imshow(image)
plt.show()

Generating structured data

from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

prompt_template_str = """\
    can you summarize what is in the image\
    and return the answer with json format \
"""
openai_program = MultiModalLLMCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(Restaurant),
    image_documents=image_documents,
    prompt_template_str=prompt_template_str,
    multi_modal_llm=openai_mm_llm,
    verbose=True,
)

response = openai_program()
for res in response:
    print(res)
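
Models often surround the JSON with extra prose, which is one reason an output parser sits between the model and the Pydantic class. The following is a simplified, standard-library illustration of that idea. It is not LlamaIndex's actual `PydanticOutputParser` implementation; the function name `extract_json_object` and the sample reply are hypothetical.

```python
import json


def extract_json_object(text: str) -> dict:
    """Return the first JSON object that can be decoded from `text`."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    raise ValueError("No JSON object found in model output")


# Hand-written stand-in for a model reply.
reply = (
    "Sure, here is the summary: "
    '{"restaurant": "Example Diner", "food": "fried chicken"} '
    "Hope this helps."
)
data = extract_json_object(reply)
print(data["food"])
```

The resulting dictionary could then be passed to a Pydantic class for validation, which is where missing or mistyped fields would be caught.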

Example code: using the Fuyu-8B model

from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import REPLICATE_MULTI_MODAL_LLM_MODELS

prompt_template_str = """\
    can you summarize what is in the image\
    and return the answer with json format \
"""

def pydantic_replicate(model_name, output_class, image_documents, prompt_template_str):
    mm_llm = ReplicateMultiModal(
        model=REPLICATE_MULTI_MODAL_LLM_MODELS[model_name],
        temperature=0.1,
        max_new_tokens=1000,
    )

    llm_program = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_class),
        image_documents=image_documents,
        prompt_template_str=prompt_template_str,
        multi_modal_llm=mm_llm,
        verbose=True,
    )

    response = llm_program()
    print(f"Model: {model_name}")
    for res in response:
        print(res)

# Use the Fuyu-8B model
pydantic_replicate("fuyu-8b", Restaurant, image_documents, prompt_template_str)
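
A model's reply may not parse into the Pydantic class, in which case the program call may raise an exception. The sketch below tries several models without stopping at the first failure. `run_for_models` is a hypothetical helper, and `run_one` stands for any callable that takes a model name and returns a result, for example a variant of `pydantic_replicate` that returns `response` instead of printing it. The `fake_run` function only simulates a model call.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_for_models(
    run_one: Callable[[str], Any], model_names: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call `run_one` for each model; collect results and error messages separately."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for name in model_names:
        try:
            results[name] = run_one(name)
        except Exception as exc:  # keep going so one failing model does not stop the loop
            errors[name] = f"{type(exc).__name__}: {exc}"
    return results, errors


# Stand-in for a real model call, used only to show the control flow.
def fake_run(model_name: str) -> dict:
    if model_name == "broken-model":
        raise ValueError("could not parse output")
    return {"model": model_name, "food": "fried chicken"}


results, errors = run_for_models(fake_run, ["fuyu-8b", "broken-model"])
print(results)
print(errors)
```

Keeping successes and failures in separate dictionaries makes it easy to see which models produced output that fits the schema.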