Image Understanding and Reasoning with Anthropic's Multi-Modal LLM

Latest recommended article published on 2024-09-15 22:31:42

qq_37836323

Tags: python

Article link: https://blog.csdn.net/qq_29929123/article/details/140712377

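In recent years, multi-modal large language models (multi-modal LLMs) have performed well at image understanding and reasoning. Anthropic has released its multi-modal models Claude 3 Opus and Claude 3 Sonnet, both of which accept images as input. This article shows how to use Anthropic's multi-modal LLMs for image understanding and reasoning, with a few hands-on examples.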

Environment Setup

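Before starting, install the required Python libraries (the leading `!` is Jupyter notebook syntax; drop it when running the commands in a shell):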

!pip install llama-index-multi-modal-llms-anthropic
!pip install llama-index-vector-stores-qdrant
!pip install matplotlib

Understanding Images from a Local Directory

We can read an image from a local directory and have the Anthropic multi-modal LLM describe it.

import os
from PIL import Image
import matplotlib.pyplot as plt
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal

# Set the API key
os.environ["ANTHROPIC_API_KEY"] = "your_api_key_here"

# Open and display the image
img = Image.open("path_to_your_image_file.png")
plt.imshow(img)
plt.show()

# Load the local image file as LlamaIndex documents
image_documents = SimpleDirectoryReader(input_files=["path_to_your_image_file.png"]).load_data()

# Instantiate the Anthropic multi-modal LLM
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)

# Ask the model to describe the image
response = anthropic_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

print(response)

Note: use a relay API address if overseas APIs cannot be reached directly from your network, as is the case in mainland China.
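
The example above loads a single file. To describe every image in a folder, the paths can be collected first and then passed to `SimpleDirectoryReader(input_files=...)`. The following is a minimal sketch using only the standard library; the helper name `list_image_files` and the extension set are illustrative assumptions, not part of LlamaIndex.

```python
from pathlib import Path

# Extensions treated as images here; an assumption for illustration,
# adjust it to the formats you actually use.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def list_image_files(directory):
    """Return sorted paths (as strings) of image files directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

The result could then be used as `SimpleDirectoryReader(input_files=list_image_files("images")).load_data()`, where `"images"` is a placeholder directory name.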

Reasoning over Images Fetched from URLs

We can also load images from URLs and have the model describe and reason about them.

from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls

image_urls = ["your_image_url_here"]

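# Download and display the image
img_response = requests.get(image_urls[0])
img_response.raise_for_status()  # fail early on HTTP errors
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)
plt.show()

# Wrap the image URLs as image documents
image_url_documents = load_image_urls(image_urls)

# Ask the model to describe the image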
response = anthropic_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_url_documents,
)

print(response)
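
For context, Anthropic's Messages API accepts images as base64-encoded content blocks with an explicit media type. The sketch below builds such a block by hand from downloaded bytes. It is only an illustration of the payload shape, not how `load_image_urls` or `AnthropicMultiModal` are implemented, and the helper name `image_block_from_bytes` is hypothetical.

```python
import base64
import mimetypes

# Media types accepted by this sketch; assumed to match the image formats
# Anthropic documents for image input (JPEG, PNG, GIF, WebP).
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block_from_bytes(data, source_name):
    """Build an Anthropic-style image content block from raw image bytes.

    `source_name` is a file name or URL, used only to guess the media type
    from its extension.
    """
    media_type, _ = mimetypes.guess_type(source_name)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported or unknown image type for {source_name!r}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```

With the variables from the example above, `image_block_from_bytes(img_response.content, image_urls[0])` would produce a block for the downloaded image, provided the URL ends in a recognizable image extension.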

Parsing Structured Output from Images

With the Anthropic multi-modal LLM, we can also parse structured data out of an image.

from pydantic import BaseModel
from typing import List
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

# Define the output schema
class TickerInfo(BaseModel):
    direction: str
    ticker: str
    company: str
    shares_traded: int
    percent_of_total_etf: float

class TickerList(BaseModel):
    fund: str
    tickers: List[TickerInfo]

# Load the image file
image_documents = SimpleDirectoryReader(input_files=["path_to_your_image_file.png"]).load_data()

# Instantiate the Anthropic multi-modal LLM
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)

# Build the multi-modal LLM program
prompt_template_str = "Can you get the stock information in the image and return the answer in JSON format?"

llm_program = MultiModalLLMCompletionProgram.from_defaults(
    output_cls=TickerList,
    image_documents=image_documents,
    prompt_template_str=prompt_template_str,
    multi_modal_llm=anthropic_mm_llm,
    verbose=True,
)

# Run the program and print the parsed result
response = llm_program()
print(str(response))
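
Models sometimes wrap the requested JSON in explanatory text, so the JSON has to be located before it can be validated against `TickerList`. The sketch below illustrates that step with the standard library only. It is not LlamaIndex's parser, and `extract_json_object` is a hypothetical helper introduced here for illustration.

```python
import json


def extract_json_object(text):
    """Return the first JSON object embedded in `text` as a dict.

    Tries to decode a JSON value at each "{" in turn, so prose before or
    after the object is ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            # This "{" did not start valid JSON; move on to the next one.
            start = text.find("{", start + 1)
    raise ValueError("No JSON object found in model output")
```

The resulting dict could then be validated with `TickerList(**data)`, which raises a pydantic validation error if a field is missing or has the wrong type.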

Common Problems and Errors

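API key problems:
- Error message: Invalid API Key
- Fix: Check that the API key is correct and that the ANTHROPIC_API_KEY environment variable is set.
Image file path problems:
- Error message: FileNotFoundError
- Fix: Make sure the image file path is correct and the file exists.
Network problems:
- Error message: requests.exceptions.RequestException
- Fix: Check your network connection and make sure the URL is reachable.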
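
Some of these problems can be caught before any request is sent. The following is a minimal sketch of such a pre-flight check, assuming only the standard library; the function name `preflight_problems` is hypothetical, and it checks only that a key is present and that inputs look well-formed, not that the key is valid or that a URL is actually reachable.

```python
import os
from urllib.parse import urlparse


def preflight_problems(image_paths=(), image_urls=(), env=None):
    """Return a list of human-readable problems found before calling the model."""
    env = os.environ if env is None else env
    problems = []
    # API key problem: the environment variable is missing or empty.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    # File path problem: the path does not point to an existing file.
    for path in image_paths:
        if not os.path.isfile(path):
            problems.append(f"Image file not found: {path}")
    # URL problem: the string is not a well-formed http(s) URL.
    for url in image_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"Not a valid http(s) URL: {url}")
    return problems
```

Calling `preflight_problems(image_paths=["path_to_your_image_file.png"], image_urls=image_urls)` before the examples above returns an empty list when nothing obvious is wrong.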