Using an LLM to Turn CSV Files into a Knowledge Graph and Uncover Relationships in the Data: A Healthcare Example

Article link: https://blog.csdn.net/chengxuyuanyy/article/details/140371615

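I recently came across the neo4j-runway project. Neo4j Runway is a Python library that simplifies migrating relational data into a graph database.

It provides tools that interact with OpenAI for data discovery and data-model generation, along with tools for generating ingestion code and loading the data into a Neo4j instance.

In other words, you upload a CSV file, and the LLM identifies the nodes and the relationships between them and generates a knowledge graph automatically.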

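In healthcare, knowledge graphs are a powerful tool for organizing and analyzing complex medical data.

They structure information in a way that is easier to understand and make the relationships between entities such as diseases, treatments, patients, and healthcare providers clearer.

Knowledge graphs have several useful applications in the healthcare industry: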

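Integrating diverse data sources. Knowledge graphs can combine data from many sources, such as electronic health records (EHRs), medical research papers, clinical trial results, genomic data, and patient histories.

Improving clinical decisions. By linking symptoms, diagnoses, treatments, and outcomes, knowledge graphs can strengthen clinical decision support systems (CDSS), because they take a large body of interconnected medical knowledge into account and may improve diagnostic accuracy and treatment effectiveness. This is the topic I explore in this article.

Personalized medicine. Relating patient-specific data to broader medical knowledge supports the design of personalized treatment plans. This includes understanding how genetic information, disease mechanisms, and treatment responses relate to one another, which allows for more tailored care.

Accelerating drug discovery. In pharmaceutical research, knowledge graphs can speed up drug discovery by identifying potential drug targets and clarifying the biological pathways involved in a disease.

Public health and epidemiology. Knowledge graphs help track disease outbreaks, understand epidemiological trends, and plan interventions, because they can integrate data from public health databases, social media, and other sources to provide real-time insight into public health threats.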

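Neo4j Runway is an open-source library created by Alex Gilmore. You can find the repository here.

Repository: https://github.com/a-s-g93/neo4j-runway

At present, the library only supports parsing CSV files with OpenAI's LLMs, and it offers the following features: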

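• Data discovery: use an LLM to extract meaningful insights from the data.
• Graph data modeling: use OpenAI and the Instructor Python library to develop an accurate graph data model.
• Data ingestion: use Runway's built-in PyIngest to load the data into Neo4j.
• No hand-written Cypher for modeling and ingestion, because the LLM does that work.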

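Besides demonstrating how to turn a CSV file into a knowledge graph with an LLM, I also use LangChain's GraphCypherQAChain to generate Cypher from a prompt, so the graph can be queried without writing a single line of Cypher (the SQL-like language used to query the Neo4j graph database).

The library ships with a financial-industry example; in this article I test how it performs in a healthcare scenario.

We start with a very simple dataset from Kaggle (the Disease Symptoms and Patient Profile Dataset).

The dataset has only 10 columns (disease, fever, cough, fatigue, difficulty breathing, age, gender, blood pressure, cholesterol level, and outcome variable). My goal is to hand the LLM a medical report and get diagnostic hypotheses back.

Dataset: https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient-profile-dataset

Let's go straight to the code.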

Load the required libraries and environment variables

First, install the dependencies and import the required libraries:

sudo apt install python3-pydot graphviz

pip install neo4j-runway

import numpy as np
import pandas as pd
from neo4j_runway import Discovery, GraphDataModeler, IngestionGenerator, LLM, PyIngest
from IPython.display import display, Markdown, Image

Load the environment variables: create an instance in Neo4j Aura and authenticate.
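A minimal sketch of this step, assuming the credentials are stored in environment variables. The Neo4j names mirror the ones passed to `Neo4jGraph` later in the article; `OPENAI_API_KEY` and the helper `load_settings` are assumptions made for illustration.

```python
import os

# Names used later in the Neo4jGraph call, plus an assumed OPENAI_API_KEY.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
)


def load_settings(env=None):
    """Return the required settings as a dict, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` with no argument reads from `os.environ`; passing a plain dict makes the function easy to try out without touching real credentials.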

Load and prepare the medical data

Download the CSV file from Kaggle and load it into a Jupyter notebook. The dataset is very simple, but it is useful for testing the concept.

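For example, we can list every disease that presents with difficulty breathing, which is interesting not only for choosing the nodes of the graph but also for developing diagnostic hypotheses: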

disease_df[disease_df['Difficulty Breathing']=='Yes']

All variables must be strings, even the integer ones (the library is designed this way).

Then we save the prepared CSV file.
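A minimal sketch of both steps, assuming the data lives in a pandas DataFrame named `disease_df` as in the filter above. The tiny inline frame is an illustrative stand-in for the Kaggle file, and the file name `disease_prepared.csv` mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Illustrative stand-in for the Kaggle data; the real file has 10 columns.
disease_df = pd.DataFrame(
    {"Disease": ["Asthma", "Influenza"], "Age": [25, 19], "Fever": ["Yes", "Yes"]}
)

# Strip stray whitespace from the headers, then cast every column to str.
disease_df.columns = disease_df.columns.str.strip()
disease_df = disease_df.astype(str)

# Write to an in-memory buffer here; pass a file path such as
# "disease_prepared.csv" (hypothetical name) to save to disk instead.
buffer = io.StringIO()
disease_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

After the cast, the `Age` values are strings such as `"25"` rather than integers, which is what the library expects.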

Now we describe the data for the LLM, including the possible values of each field:

DATA_DESCRIPTION = {
    'Disease': 'The name of the disease or medical condition.',
    'Fever': 'Indicates whether the patient has a fever (Yes/No).',
    'Cough': 'Indicates whether the patient has a cough (Yes/No).',
    'Fatigue': 'Indicates whether the patient experiences fatigue (Yes/No).',
    'Difficulty Breathing': 'Indicates whether the patient has difficulty breathing (Yes/No).',
    'Age': 'The age of the patient in years.',
    'Gender': 'The gender of the patient (Male/Female).',
    'Blood Pressure': 'The blood pressure level of the patient (Normal/High).',
    'Cholesterol Level': 'The cholesterol level of the patient (Normal/High).',
    'Outcome Variable': 'The outcome variable indicating the result of the diagnosis or assessment for the specific disease (Positive/Negative).'
}
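Because the description is keyed by column name, a typo in a key would leave a column undescribed without any warning. A small check along these lines can catch that before the LLM is called. The helper `check_description` is hypothetical, and the short lists below stand in for `disease_df.columns` and the full `DATA_DESCRIPTION`.

```python
def check_description(columns, description):
    """Return (undescribed columns, description keys with no matching column)."""
    cols = [c.strip() for c in columns]
    undescribed = [c for c in cols if c not in description]
    unknown = [k for k in description if k not in cols]
    return undescribed, unknown


# Stand-ins for list(disease_df.columns) and a shortened DATA_DESCRIPTION.
columns = ["Disease", "Fever", "Age "]
description = {
    "Disease": "The name of the disease or medical condition.",
    "Fever": "Indicates whether the patient has a fever (Yes/No).",
    "Agee": "The age of the patient in years.",  # deliberate typo
}

undescribed, unknown = check_description(columns, description)
print(undescribed, unknown)
```

Both returned lists should be empty when the description and the CSV header agree.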

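Identify the important data elements with the LLM

The next step is to have the LLM analyze the tabular data and identify the data elements that matter for generating a graph data model.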

This produces a Markdown summary of the data analysis.

Create the initial model

Now, let's create the initial model.

My focus here is on diseases, so we will rearrange some of the relationships.

gdm.iterate_model(user_corrections='''
Let's think step by step. Please make the following updates to the data model:
1. Remove the relationships between Patient and Disease, between Patient and Symptom and between Patient and Outcome.
2. Change the Patient node into Demographics.
3. Create a relationship HAS_DEMOGRAPHICS from Disease to Demographics.
4. Create a relationship HAS_SYMPTOM from Disease to Symptom. If the Symptom value is No, remove this relationship.
5. Create a relationship HAS_LAB from Disease to HealthIndicator.
6. Create a relationship HAS_OUTCOME from Disease to Outcome.
''')

from IPython.display import Image, display

# Render the current data model to output.png (requires Graphviz)
gdm.current_model.visualize().render('output', format='png')
# Load and display the image with a specific width
img = Image('output.png', width=1200)  # Adjust the width as needed
display(img)
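Correction 4 asks for a HAS_SYMPTOM relationship only when the symptom value is Yes. To make that rule concrete, here is a plain-Python sketch of which symptoms a single CSV row would link to its disease. It only illustrates the intended logic and is not how Runway or the LLM implements it; `symptoms_for_row` is a hypothetical helper and the row is made-up sample data.

```python
SYMPTOM_COLUMNS = ("Fever", "Cough", "Fatigue", "Difficulty Breathing")


def symptoms_for_row(row):
    """Return the symptom names whose value in this row is 'Yes'."""
    return [col for col in SYMPTOM_COLUMNS if row.get(col) == "Yes"]


# Made-up row in the dataset's column layout.
row = {
    "Disease": "Asthma",
    "Fever": "Yes",
    "Cough": "No",
    "Fatigue": "Yes",
    "Difficulty Breathing": "Yes",
}

# One (source, relationship, target) triple per symptom marked Yes.
edges = [(row["Disease"], "HAS_SYMPTOM", s) for s in symptoms_for_row(row)]
```

Here `Cough` is No, so no triple is produced for it.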

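Generate the graph with Neo4j

Now we can generate the Cypher code and the YAML file used to load the data into Neo4j.

If you are only testing, or running this for the second time, you may need to reset the instance to a blank state (this clears everything).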
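One way to clear the instance is to delete every node together with its relationships. The sketch below assumes a session-like object with a `run(query)` method, which is what the official `neo4j` Python driver provides; `reset_graph` is a hypothetical helper, and the commented driver usage relies on the connection variables named elsewhere in this article.

```python
# Cypher that removes every node and, via DETACH, every relationship.
RESET_QUERY = "MATCH (n) DETACH DELETE n"


def reset_graph(session):
    """Delete all nodes and relationships via any object exposing run(query)."""
    return session.run(RESET_QUERY)


# With the official neo4j Python driver this could look like:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
#   with driver.session(database=NEO4J_DATABASE) as session:
#       reset_graph(session)
#   driver.close()
```

This is destructive and cannot be undone, so it is only appropriate for a test instance.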

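Everything is ready. Let's load the data into the instance:

# pyingest_yaml is the YAML string produced in the ingestion-generation step
PyIngest(yaml_string=pyingest_yaml, dataframe=disease_df)

Go to your Neo4j Aura instance, open it, enter your password, and run this Cypher query: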

MATCH (n)
WHERE n:Demographics OR n:Disease OR n:Symptom OR n:Outcome OR n:HealthIndicator
OPTIONAL MATCH (n)-[r]->(m)
RETURN n, r, m

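Press CTRL + ENTER to run the query and see the resulting graph.

Inspecting the nodes and relationships shows a large number of interconnections among symptoms, health indicators, and demographics.

Let's look at diabetes. Since no filter is applied, both males and females appear, along with every possible lab, demographic, and outcome value.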

MATCH (n:Disease {name: 'Diabetes'})
OPTIONAL MATCH (n)-[r]->(m)
RETURN n, r, m

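Or look at all diseases that show high blood pressure in the clinical examination (the query also keeps only positive outcomes):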

// Match the Disease nodes
MATCH (d:Disease)
// Follow HAS_LAB to the health-indicator nodes and HAS_OUTCOME to the outcome nodes
MATCH (d)-[r:HAS_LAB]->(l)
MATCH (d)-[r2:HAS_OUTCOME]->(o)
// Keep only high blood pressure combined with a positive outcome
WHERE l.bloodPressure = 'High' AND o.result = 'Positive'
RETURN d, properties(d) AS disease_properties, r, properties(r) AS relationship_properties, l, properties(l) AS lab_properties
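As a sanity check on what the graph returns, the same question can be asked of the raw CSV. The sketch below uses only the standard library; the inline rows are made-up sample data in the dataset's column layout, and `high_bp_positive` is a hypothetical helper.

```python
import csv
import io

# Made-up rows using three of the dataset's column names.
SAMPLE_CSV = """Disease,Blood Pressure,Outcome Variable
Asthma,High,Positive
Influenza,Normal,Positive
Stroke,High,Negative
Asthma,High,Positive
"""


def high_bp_positive(csv_file):
    """Return sorted disease names that have a High/Positive row."""
    reader = csv.DictReader(csv_file)
    return sorted(
        {
            row["Disease"]
            for row in reader
            if row["Blood Pressure"] == "High"
            and row["Outcome Variable"] == "Positive"
        }
    )


diseases = high_bp_positive(io.StringIO(SAMPLE_CSV))
```

To run it against the real data, pass an open file handle for the prepared CSV instead of the `io.StringIO` wrapper.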

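Next, we submit a medical report to an LLM (Google's Gemini-1.5-Flash in this example) and let it generate the Cypher query automatically through LangChain (GraphCypherQAChain), returning the diseases the patient may have based on symptoms, health indicators, and so on.

Let's get started.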

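Fetch the knowledge graph and its schema from the instance; the schema lists the node properties and the relationship properties.

import textwrap

from langchain_community.graphs import Neo4jGraph

# Connect to the Aura instance using the credentials loaded earlier
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)
# Refresh the schema and print it wrapped at 60 characters
kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))
schema = kg.schema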