Multi-Modal Image Reasoning with Replicate LLaVa, Fuyu 8B, and MiniGPT4 Models

Article link: https://blog.csdn.net/qq_29929123/article/details/140704865

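In this article, we demonstrate how to use the multi-modal LLM class for image understanding and reasoning. The following models are currently supported:

LLaVa-13B
Fuyu-8B
MiniGPT-4

In the second part, we show how to use streaming completion and async completion with Replicate.

Note: at present, the Replicate multi-modal LLM can only process one image document per call.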

Install the required libraries

First, install the necessary libraries:

%pip install llama-index-multi-modal-llms-replicate
%pip install replicate

Load and initialize Replicate

import os

REPLICATE_API_TOKEN = "<your Replicate API token>"  # replace with your own Replicate API token
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
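
Hard-coding the token in a notebook makes it easy to leak when the notebook is shared. The following is a minimal sketch of an alternative, assuming the token is either already exported in the shell or can be typed interactively. The helper name `resolve_replicate_token` is hypothetical and is not part of `replicate` or `llama-index`.

```python
import os
from getpass import getpass


def resolve_replicate_token(env=os.environ, prompt_fn=getpass):
    """Return the Replicate API token, asking for it only if it is not already set.

    Hypothetical helper for illustration; not part of replicate or llama-index.
    """
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        # Fall back to an interactive prompt so the token never lands in the notebook source.
        token = prompt_fn("Replicate API token: ").strip()
    if not token:
        raise ValueError("REPLICATE_API_TOKEN is empty")
    # Store the cleaned value where the Replicate client is expected to look for it.
    env["REPLICATE_API_TOKEN"] = token
    return token
```

Calling `resolve_replicate_token()` once at the top of the notebook would replace the two assignment lines above.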

Download the images and load them locally

from PIL import Image
import requests
from io import BytesIO
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core.schema import ImageDocument

if not os.path.exists("test_images"):
    os.makedirs("test_images")

image_urls = [
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]
for idx, image_url in enumerate(image_urls):
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors instead of failing inside PIL
    img = Image.open(BytesIO(response.content))
    img.save(f"test_images/{idx}.png")

image_documents = [
    ImageDocument(image_path=f"test_images/{idx}.png")
    for idx in range(len(image_urls))
]

Visualize the images

import matplotlib.pyplot as plt
from llama_index.core.response.notebook_utils import display_image_uris

image_paths = [str(img_doc.image_path) for img_doc in image_documents]
display_image_uris(image_paths)

Provide different prompts to test the multi-modal LLMs

from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import REPLICATE_MULTI_MODAL_LLM_MODELS

prompts = [
    "What is shown in this image?",
    "How many people are shown in the image?",
    "Is there anything unusual in the image?",
]

Generate image reasoning results for each model with each prompt

res = []
for prompt_idx, prompt in enumerate(prompts):
    for image_idx, image_doc in enumerate(image_documents):
        for llm_idx, llm_model in enumerate(REPLICATE_MULTI_MODAL_LLM_MODELS):
            try:
                multi_modal_llm = ReplicateMultiModal(
                    model=REPLICATE_MULTI_MODAL_LLM_MODELS[llm_model],
                    max_new_tokens=100,
                    temperature=0.1,
                    num_input_files=1,
                    top_p=0.9,
                    num_beams=1,
                    repetition_penalty=1,
                )

                mm_resp = multi_modal_llm.complete(
                    prompt=prompt,
                    image_documents=[image_doc],
                )
            except Exception as e:
                print(f"Inference failed for prompt {prompt!r}, image {image_idx}, and multi-modal model {llm_model}")
                print("Reason for failure:", e)
                continue
            res.append(
                {
                    "model": llm_model,
                    "prompt": prompt,
                    "response": mm_resp,
                    "image": str(image_doc.image_path),
                }
            )
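
The three nested loops above visit every combination of prompt, image, and model, skipping any combination that raises. The same pattern can be sketched more compactly with `itertools.product`. In this sketch, `run_grid` and its `infer` argument are hypothetical: `infer` stands in for building the model wrapper and calling its `complete()` method, so the control flow can be tried without a Replicate account.

```python
from itertools import product


def run_grid(prompts, images, models, infer):
    """Call infer(model, prompt, image) for every combination; skip failures.

    `infer` is a hypothetical stand-in for constructing the model wrapper
    and calling its complete() method.
    """
    results, errors = [], []
    # product() iterates with prompts outermost and models innermost,
    # which matches the nesting order of the explicit loops.
    for prompt, image, model in product(prompts, images, models):
        try:
            response = infer(model, prompt, image)
        except Exception as exc:
            errors.append(
                {"model": model, "prompt": prompt, "image": image, "error": str(exc)}
            )
            continue
        results.append(
            {"model": model, "prompt": prompt, "response": response, "image": image}
        )
    return results, errors
```

Returning the errors alongside the results, instead of only printing them, makes it possible to retry just the failed combinations later.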

Display sample responses from the multi-modal LLMs

from IPython.display import display
import pandas as pd

pd.options.display.max_colwidth = None
df = pd.DataFrame(res)
display(df[:5])
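
The first five rows only cover a few combinations. To compare models on the same question, it can help to regroup the result rows so that each image and prompt pair maps to one answer per model. The following is a small standard-library sketch; `side_by_side` is a hypothetical helper, and it assumes each stored response converts to its text with `str()`.

```python
from collections import defaultdict


def side_by_side(rows):
    """Group rows by (image, prompt) and map each model name to its response text."""
    table = defaultdict(dict)
    for row in rows:
        key = (row["image"], row["prompt"])
        table[key][row["model"]] = str(row["response"])
    return dict(table)
```

Passing the `res` list built above to `side_by_side` would give one dictionary entry per image and prompt, holding the answers of all models that succeeded.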

Streaming completion with async

import asyncio

async def async_complete_task(multi_modal_llm, prompt, image_documents):
    resp = await multi_modal_llm.astream_complete(
        prompt=prompt,
        image_documents=image_documents,
    )
    async for delta in resp:
        print(delta.delta, end="")

multi_modal_llm = ReplicateMultiModal(
    model=REPLICATE_MULTI_MODAL_LLM_MODELS["fuyu-8b"],
    max_new_tokens=100,
    temperature=0.1,
    num_input_files=1,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1,
)

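# Async call. asyncio.run() starts its own event loop, so this line works in a plain script.
# Inside Jupyter, where an event loop is already running, use
# `await async_complete_task(...)` instead.
asyncio.run(async_complete_task(multi_modal_llm, "Tell me about this image", [image_documents[0]]))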
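
Because each call handles a single image, async completion is also a natural way to overlap several requests. The sketch below illustrates the two ideas involved, consuming a stream chunk by chunk and running several streams concurrently, using only `asyncio`. `fake_stream` is a hypothetical stand-in for a streaming response and does not call Replicate; `collect` and `collect_many` are likewise illustrative names.

```python
import asyncio


async def fake_stream(text, delay=0.0):
    """Hypothetical stand-in for a streaming response: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(delay)
        yield word + " "


async def collect(stream):
    """Consume an async generator and join the chunks into one string."""
    chunks = []
    async for chunk in stream:
        chunks.append(chunk)
    return "".join(chunks).strip()


async def collect_many(texts):
    """Run several streams concurrently and return the results in input order."""
    return await asyncio.gather(*(collect(fake_stream(t)) for t in texts))
```

With a real model, each `fake_stream(...)` would be replaced by the async generator obtained from `astream_complete`, one per image document.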