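In this post, we show how to use Anthropic's multimodal LLMs for image understanding and reasoning. Anthropic recently released its latest multimodal models, Claude 3 Opus and Claude 3 Sonnet. We walk through image reasoning with these models and provide code examples along the way.
Installing dependencies
Before we start, install the required Python libraries (the leading ! runs the command from a notebook cell; drop it in a regular shell):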
!pip install llama-index-multi-modal-llms-anthropic
!pip install llama-index-vector-stores-qdrant
!pip install matplotlib
Reasoning over local images
First, we show how to use the Anthropic API to understand an image stored in a local directory.
import os
from PIL import Image
import matplotlib.pyplot as plt
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
# Set the API key
os.environ["ANTHROPIC_API_KEY"] = ""  # Fill in your Anthropic API key here
# Open and display the local image
img = Image.open("../data/images/prometheus_paper_card.png")
plt.imshow(img)
# Load the image as documents
image_documents = SimpleDirectoryReader(
    input_files=["../data/images/prometheus_paper_card.png"]
).load_data()
# Initialize the Anthropic multimodal LLM
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)
# Run inference on the image
response = anthropic_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)
print(response)
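The example above passes a single file through input_files. If you have a folder of images instead, a small helper can build that list for you. The following is a minimal sketch using only the standard library; the name find_image_files and the suffix set are assumptions made for illustration, not part of LlamaIndex.

```python
from pathlib import Path
from typing import List

# Assumed set of image suffixes to accept; adjust to your own data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def find_image_files(directory: str) -> List[str]:
    """Return sorted paths of image files directly inside `directory`.

    Hypothetical helper: it does not recurse into subdirectories and
    matches on the file suffix only (case-insensitive).
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {directory}")
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

The returned list can then be handed to the reader, for example SimpleDirectoryReader(input_files=find_image_files("../data/images")).load_data(). Matching is case-insensitive, so a file such as ark_email_sample.PNG is picked up as well.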
Reasoning over images from URLs
Next, we show how to use the AnthropicMultiModal class to load images from URLs and reason over them.
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
image_urls = [
    "https://venturebeat.com/wp-content/uploads/2024/03/Screenshot-2024-03-04-at-12.49.41%E2%80%AFAM.png",
    # Add your own URLs here
]
img_response = requests.get(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)
image_url_documents = load_image_urls(image_urls)
response = anthropic_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_url_documents,
)
print(response)
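When you add your own URLs to the list, a quick sanity check before calling load_image_urls can catch obvious mistakes such as a missing scheme or a link to an HTML page. This is a rough sketch with a hypothetical helper, looks_like_image_url; the suffix set is an assumption, and the check is purely syntactic, so a URL without a file extension may still serve a valid image.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Assumed set of image suffixes to accept; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def looks_like_image_url(url: str) -> bool:
    """Return True if `url` is http(s) and its path ends in an image suffix.

    Hypothetical helper: no network request is made, so this says nothing
    about whether the URL is actually reachable.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Decode percent-escapes before reading the suffix of the path.
    suffix = PurePosixPath(unquote(parsed.path)).suffix.lower()
    return suffix in IMAGE_SUFFIXES
```

One way to use it is to filter the list first, for example image_urls = [u for u in image_urls if looks_like_image_url(u)], and only then pass the result to load_image_urls.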
Generating structured output from images
We can also use a multimodal Pydantic program to generate structured output from an image.
from llama_index.core import SimpleDirectoryReader
from PIL import Image
import matplotlib.pyplot as plt
from pydantic import BaseModel
from typing import List
class TickerInfo(BaseModel):
    direction: str
    ticker: str
    company: str
    shares_traded: int
    percent_of_total_etf: float
class TickerList(BaseModel):
    fund: str
    tickers: List[TickerInfo]
image_documents = SimpleDirectoryReader(
    input_files=["../data/images/ark_email_sample.PNG"]
).load_data()
img = Image.open("../data/images/ark_email_sample.PNG")
plt.imshow(img)
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
prompt_template_str = """
Can you get the stock information in the image \
and return the answer? Pick just one fund.
Make sure the answer is a JSON format corresponding to a Pydantic schema. The Pydantic schema is given below.
"""
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)
llm_program = MultiModalLLMCompletionProgram.from_defaults(
    output_cls=TickerList,
    image_documents=image_documents,
    prompt_template_str=prompt_template_str,
    multi_modal_llm=anthropic_mm_llm,
    verbose=True,
)
response = llm_program()
print(str(response))
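To make the idea of structured output more concrete, here is a standard-library sketch of roughly what turning model text into the schema involves: find the JSON object in the text, decode it, and map it onto typed records. This is not how LlamaIndex or Pydantic implement it. The dataclasses only mirror the field names of TickerInfo and TickerList, parse_ticker_list is a hypothetical helper, dataclasses do not validate field types the way Pydantic does, and the sample string is made up for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class TickerInfo:
    direction: str
    ticker: str
    company: str
    shares_traded: int
    percent_of_total_etf: float


@dataclass
class TickerList:
    fund: str
    tickers: List[TickerInfo]


def parse_ticker_list(raw_text: str) -> TickerList:
    """Decode the span from the first '{' to the last '}' into a TickerList."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(raw_text[start : end + 1])
    # Unknown or missing keys raise TypeError here, which surfaces schema drift.
    tickers = [TickerInfo(**item) for item in payload["tickers"]]
    return TickerList(fund=payload["fund"], tickers=tickers)


# Made-up text for illustration only; not real model output or trade data.
sample = (
    'Here is the result: {"fund": "EXAMPLE", "tickers": [{"direction": "Buy", '
    '"ticker": "XYZ", "company": "Example Corp", "shares_traded": 100, '
    '"percent_of_total_etf": 0.5}]}'
)
parsed = parse_ticker_list(sample)
print(parsed.fund, parsed.tickers[0].ticker)
```

In the real program above, the returned object is a TickerList instance, so fields such as response.fund and response.tickers can be read directly instead of parsing text by hand.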
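Errors you may run into
- API key problems: make sure ANTHROPIC_API_KEY is set to a valid key; an empty or incorrect key causes authentication to fail.
- Wrong file paths: check that the image paths exist relative to where the code runs; otherwise the file cannot be read.
- Network problems: the URL-based example needs network access, so make sure the connection is working.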
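The first two items in the list above can be checked before any request is sent. The following is a minimal sketch, assuming a hypothetical helper named preflight_problems; it only tests that the key is non-empty and that the files exist, not that the key is actually valid or that the network is reachable.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping


def preflight_problems(
    image_paths: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return human-readable problems found before calling the API.

    Hypothetical helper: an empty list means the basic checks passed.
    """
    problems: List[str] = []
    # An unset key and an empty-string key are both treated as missing.
    if not env.get("ANTHROPIC_API_KEY", "").strip():
        problems.append("ANTHROPIC_API_KEY is not set or is empty")
    for path in image_paths:
        if not Path(path).is_file():
            problems.append(f"image file not found: {path}")
    return problems
```

For example, calling preflight_problems(["../data/images/prometheus_paper_card.png"]) right after setting the key would report a key left as the empty string, as well as a path that does not resolve from the current working directory.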
If you found this post helpful, please give it a like and follow my blog. Thanks!