Querying Structured Data with LlamaIndex and Natural-Language SQL Queries


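Structured data is a core part of modern data systems. It usually lives in a relational database such as Postgres or in a data warehouse such as Snowflake. LlamaIndex offers a number of advanced features powered by large language models (LLMs): it can create structured data from unstructured data, and it can analyze that structured data through augmented text-to-SQL capabilities.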

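This article walks through structured data querying with LlamaIndex. It covers:

Setup: defining an example SQL table
Building the table index: going from a SQL database to a table schema index
Using natural-language SQL queries: querying the SQL database in natural language

The demonstration uses a toy example table containing city, population, and country information.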

Setup

First, set up a simple SQLite database with SQLAlchemy:

from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, insert

# Create an in-memory SQLite database engine
engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

# Create the city_stats table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

# Insert some rows
rows = [
    {"city_name": "Toronto", "population": 2731571, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13929286, "country": "Japan"},
    {"city_name": "Berlin", "population": 600000, "country": "Germany"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        connection.execute(stmt)

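Building the Table Index

Next, we wrap the SQLAlchemy engine in a LlamaIndex SQLDatabase so that it can be used within LlamaIndex:

# Import path for llama-index >= 0.10 (the llama_index.core namespace)
from llama_index.core import SQLDatabase

# Wrap the SQLAlchemy engine, exposing only the city_stats table
sql_database = SQLDatabase(engine, include_tables=["city_stats"])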
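To get a feel for the kind of schema information a text-to-SQL prompt depends on, the sketch below builds a plain-text description of city_stats using only the standard-library sqlite3 module. The helper name describe_table and its output format are assumptions made for illustration; this is not the string that SQLDatabase produces internally.

```python
import sqlite3


def describe_table(conn, table):
    """Return a one-line schema summary for `table` (hypothetical helper)."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # The table name is interpolated directly, so only pass trusted names.
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    parts = [f"{name} ({ctype})" for _cid, name, ctype, *_rest in cols]
    return f"Table '{table}' has columns: " + ", ".join(parts)


# Recreate the toy schema from the setup section in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name VARCHAR(16) PRIMARY KEY, "
    "population INTEGER, "
    "country VARCHAR(16) NOT NULL)"
)

schema_text = describe_table(conn, "city_stats")
print(schema_text)
```

A description along these lines, placed in the prompt, is what lets a language model refer to the right table and column names when it writes SQL.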

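Using Natural-Language SQL Queries

Once the SQL database is built, we can use NLSQLTableQueryEngine to construct natural-language queries, which are synthesized into SQL queries:

# Import path for llama-index >= 0.10 (the llama_index.core namespace)
from llama_index.core.query_engine import NLSQLTableQueryEngine

# Create the query engine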
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)

# Ask a question in natural language
query_str = "Which city has the highest population?"
response = query_engine.query(query_str)
print(response)  # Print the query result
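For comparison, here is a hand-written SQL statement that answers the same question, run against the same toy rows with the standard-library sqlite3 module. The exact SQL an LLM generates may differ; this is only one plausible form, written by hand for illustration.

```python
import sqlite3

# Recreate the toy table and rows from the setup section.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name TEXT PRIMARY KEY, population INTEGER, country TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2731571, "Canada"),
        ("Tokyo", 13929286, "Japan"),
        ("Berlin", 600000, "Germany"),
    ],
)

# One way to express "Which city has the highest population?" in SQL.
sql = "SELECT city_name, population FROM city_stats ORDER BY population DESC LIMIT 1"
top_city = conn.execute(sql).fetchone()
print(top_city)
```

With the three rows above, the query returns the Tokyo row, which is the fact a natural-language answer would be built from.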

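Building Our Table Index

If we don't know ahead of time which table to use, and the combined size of all table schemas exceeds the context window, we should store the table schemas in an index so that the right schema can be retrieved at query time.

# Import paths for llama-index >= 0.10 (the llama_index.core namespace)
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import SQLTableNodeMapping, ObjectIndex, SQLTableSchema

# Define the table node mapping and one SQLTableSchema object per table
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    (SQLTableSchema(table_name="city_stats")),
]

# Build the object index; index_cls is passed by keyword so the call does not
# depend on the positional parameter order of from_objects
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    index_cls=VectorStoreIndex,
)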

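# Optionally attach extra context text describing the table
city_stats_text = (
    "This table gives information regarding the population and country of a given city.\n"
    "The user will query with codewords, where 'foo' corresponds to population and 'bar' "
    "corresponds to city."
)

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    (SQLTableSchema(table_name="city_stats", context_str=city_stats_text))
]

# Rebuild the object index so that it picks up the schema with the context text
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    index_cls=VectorStoreIndex,
)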
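The idea of retrieving only the relevant schema at query time can be illustrated without any embedding model. The toy sketch below ranks table descriptions by word overlap with the question, which stands in for the vector similarity a real index would use. The function names and the second table, product_orders, are hypothetical; this is not how ObjectIndex is implemented.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_tables(query, table_descriptions, top_k=1):
    """Rank tables by word overlap between the query and each description."""
    query_words = tokenize(query)
    scored = sorted(
        table_descriptions.items(),
        key=lambda item: len(query_words & tokenize(item[1])),
        reverse=True,
    )
    return [name for name, _desc in scored[:top_k]]


table_descriptions = {
    "city_stats": "population and country of a given city",
    # Hypothetical second table, added only to make the ranking non-trivial.
    "product_orders": "orders placed by customers, with product and price",
}

selected = retrieve_tables("Which city has the highest population?", table_descriptions)
print(selected)
```

Only the schemas of the selected tables would then be placed in the text-to-SQL prompt, which keeps the prompt within the context window even when the database has many tables.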

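Using Natural-Language SQL Queries with the Table Index

With the table schema index defined, we can construct a SQLTableRetrieverQueryEngine by passing in the SQLDatabase and a retriever built from the object index.

# Import path for llama-index >= 0.10 (the llama_index.core namespace)
from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine

# Create the retriever-backed query engine; only the top-1 matching table schema is retrieved
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=1)
)
response = query_engine.query("Which city has the highest population?")
print(response)  # Print the query result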