Multimodal Image Analysis and Querying with the GPT-4V Model

qq_29929123

Article link: https://blog.csdn.net/qq_29929123/article/details/140283188

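Introduction

In this article we build a multimodal ReAct agent. The agent accepts both text and images as the task definition and tries to solve the task through chain-of-thought reasoning and tool use. Two use cases fit this pattern; the code below walks through the RAG agent:

RAG agent: given text and images, it queries a RAG (Retrieval-Augmented Generation) pipeline to find the answer.
Web agent: given text and images, it queries a web tool to find relevant information.

Note that, at the time of writing, this feature works only with GPT-4V and is in beta, so the abstractions may change in future releases.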
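To make the "reason, act, observe" cycle concrete, here is a minimal sketch of a ReAct-style loop in plain Python. It is not how llama-index implements its agent: `toy_policy`, `react_loop`, and the `lookup` tool are hypothetical names invented for illustration, and the policy is a hard-coded stand-in for the model call.

```python
from typing import Callable, Dict, List, Tuple


def toy_policy(task: str, observations: List[str]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for the LLM: returns (thought, action, action_input)."""
    if not observations:
        return ("I should look this up.", "lookup", task)
    return ("I have enough information.", "finish", observations[-1])


def react_loop(
    task: str,
    tools: Dict[str, Callable[[str], str]],
    policy: Callable[[str, List[str]], Tuple[str, str, str]] = toy_policy,
    max_steps: int = 5,
) -> str:
    """Alternate between reasoning and tool calls until the policy finishes."""
    observations: List[str] = []
    for _ in range(max_steps):
        thought, action, action_input = policy(task, observations)
        print(f"Thought: {thought}")
        if action == "finish":
            return action_input
        # Run the chosen tool and feed its output back into the next step
        observation = tools[action](action_input)
        print(f"Action: {action}({action_input!r}) -> {observation}")
        observations.append(observation)
    return "No answer within the step budget."


# A stub tool; a real agent would call a query engine or a web search here
tools = {"lookup": lambda q: f"stub result for {q!r}"}
answer = react_loop("new OpenAI features", tools)
```

In the real agent, the policy is the multimodal model, which also sees the attached image at every step.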

Installation and Setup

First, install the required libraries and download the images we will query:

%pip install llama-index-llms-openai llama-index-readers-web llama-index-multi-modal-llms-openai llama-index-tools-metaphor

# Create the target directory, then download the images used in the queries
!mkdir -p other_images/openai
!wget "https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/new-models-and-developer-products-announced-at-devday.jpg?width=2000" -O other_images/openai/dev_day.png
!wget "https://drive.google.com/uc?id=1B4f5ZSIKN0zTTPPRlZ915Ceb3_uF9Zlq&export=download" -O other_images/adidas.png
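Outside a notebook, the same download can be done from plain Python. The sketch below uses only the standard library; `download` and its `opener` parameter are hypothetical names introduced here (the parameter exists only so the function can be exercised without network access).

```python
import os
import shutil
import urllib.request
from typing import Callable


def download(url: str, dest: str, opener: Callable = urllib.request.urlopen) -> str:
    """Download url to dest, creating the parent directory first."""
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Stream the response body to disk instead of loading it all into memory
    with opener(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (requires network access):
# download(
#     "https://images.openai.com/blob/a2e49de2-ba5b-4869-9c2d-db3b4b5dcc19/"
#     "new-models-and-developer-products-announced-at-devday.jpg?width=2000",
#     "other_images/openai/dev_day.png",
# )
```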

Setting Up the Data

We use SimpleWebPageReader to load the content of a web page:

from llama_index.readers.web import SimpleWebPageReader

url = "https://openai.com/blog/new-models-and-developer-products-announced-at-devday"
reader = SimpleWebPageReader(html_to_text=True)
documents = reader.load_data(urls=[url])
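The `html_to_text=True` flag asks the reader to turn the page's HTML into plain text before indexing. As a rough illustration of what such a conversion involves, here is a standard-library sketch; `TextExtractor` and `html_to_plain_text` are hypothetical names, and the reader's actual conversion is more thorough than this.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_plain_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample markup, not the real page
sample = "<html><head><style>p{}</style></head><body><h1>DevDay</h1><p>GPT-4 Turbo</p></body></html>"
text = html_to_plain_text(sample)
```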

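Setting Up the Tools

Next, we build a vector index over the loaded page and wrap its query engine as a tool; this is the RAG pipeline the agent will query: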

from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core import Settings

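# LLM used by the query engine to synthesize answers from retrieved text
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")

# Build an in-memory vector index over the loaded web page
vector_index = VectorStoreIndex.from_documents(documents)

# Expose the index's query engine as a tool the agent can call
query_tool = QueryEngineTool(
    query_engine=vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_tool",
        description="Useful to lookup new features announced by OpenAI",
    ),
)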
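The retrieval half of a RAG pipeline ranks text chunks by similarity to the query and passes the best ones to the LLM. The toy sketch below shows that ranking step with bag-of-words vectors and cosine similarity. It is only an illustration: a real vector index uses learned embeddings, and `bag_of_words`, `cosine`, `top_k`, and the sample chunks are all made up for this example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up sample chunks standing in for pieces of the indexed page
chunks = [
    "GPT-4 Turbo is a new model.",
    "The Assistants API lets developers build AI applications.",
]
best = top_k("What is the Assistants API?", chunks, k=1)
```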

Setting Up the Agent

We then set up the agent with the query tool and create a task that pairs the question with the image:

from llama_index.core.agent.react_multimodal.step import MultimodalReActAgentWorker
from llama_index.core.agent import AgentRunner
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.agent import Task
from llama_index.core.schema import ImageDocument

# Multimodal LLM that reasons over both the text prompt and the image
mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=1000)

# The worker runs one ReAct step at a time; the runner keeps track of task state
react_step_engine = MultimodalReActAgentWorker.from_tools(
    [query_tool],
    multi_modal_llm=mm_llm,
    verbose=True,
)
agent = AgentRunner(react_step_engine)

query_str = (
    "The photo shows some new features released by OpenAI. "
    "Can you pinpoint the features in the photo and give more details using relevant tools?"
)

image_document = ImageDocument(image_path="other_images/openai/dev_day.png")

# The image is attached to the task through extra_state
task = agent.create_task(
    query_str,
    extra_state={"image_docs": [image_document]},
)
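Before handing an image to the agent, it can help to confirm that the file exists and really is an image. The download step saves a URL ending in `.jpg` under a `.png` name, and a failed download can leave an HTML error page on disk. The sketch below checks the leading bytes of the file; `sniff_image_type` and `check_image` are hypothetical helpers, and only PNG and JPEG are recognized.

```python
import os

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"      # start-of-image marker plus next marker prefix


def sniff_image_type(path: str) -> str:
    """Return 'png', 'jpeg', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(PNG_MAGIC):
        return "png"
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def check_image(path: str) -> str:
    """Raise if the file is missing or not a PNG/JPEG; otherwise return its type."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Image not found: {path}")
    kind = sniff_image_type(path)
    if kind == "unknown":
        raise ValueError(f"Not a PNG or JPEG file: {path}")
    return kind


# Example: check_image("other_images/openai/dev_day.png")
```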

Running the Task

Finally, we run the task step by step until the agent reports that it has finished:

def execute_step(agent: AgentRunner, task: Task):
    step_output = agent.run_step(task.task_id)
    if step_output.is_last:
        response = agent.finalize_response(task.task_id)
        print(f"> Agent finished: {str(response)}")
        return response
    else:
        return None

def execute_steps(agent: AgentRunner, task: Task):
    response = execute_step(agent, task)
    while response is None:
        response = execute_step(agent, task)
    return response

response = execute_steps(agent, task)

print(str(response))
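The loop in `execute_steps` has no upper bound, so an agent that never reaches a final step would keep calling the API. One possible safeguard is a step budget. The sketch below is generic Python rather than part of llama-index; `run_with_step_limit` is a hypothetical helper that works with any step function returning `None` until it is done.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_step_limit(step: Callable[[], Optional[T]], max_steps: int = 10) -> T:
    """Call step() until it returns a non-None result or max_steps is reached."""
    for _ in range(max_steps):
        result = step()
        if result is not None:
            return result
    raise RuntimeError(f"No final response after {max_steps} steps")


# With the functions defined above, usage could look like:
# response = run_with_step_limit(lambda: execute_step(agent, task), max_steps=10)
```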

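Example Result

After running the code above, the agent parses the image, combines what it sees with information retrieved through the RAG tool, and produces an answer along these lines:

The photo shows a user interface with a "Playground" section and several options, such as "GPT-4.0-turbo", "Code Interpreter", "Translate", and "Chat". These are among the features newly released by OpenAI, which include the GPT-4 Turbo model (a more capable and cheaper language model), the Assistants API (which lets developers build AI applications), and multimodal capabilities (including vision and image creation).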

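Possible Errors

Network connectivity: downloading the images or reading the web page can fail if the network is unstable or access is restricted.
API call errors: requests to a model API such as OpenAI's fail if the API key is invalid or the rate limit or quota has been exceeded.
Inaccurate model output: on complex multimodal tasks the model's output may be inaccurate, so the prompt or tool descriptions may need further tuning.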
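Transient network failures are often handled by retrying with exponential backoff. The sketch below is a generic standard-library helper, not something llama-index or the OpenAI client requires; `retry` and its parameters are hypothetical, and the set of exceptions worth retrying depends on the client in use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            # Re-raise once the last attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Example: documents = retry(lambda: reader.load_data(urls=[url]))
```

Errors such as an invalid API key are not transient, so they should be left out of `retry_on` and fixed at the source.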
