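I spent a long time looking for good finance-related data to build a knowledge graph with and never found any (mainly because the real data I could use is full of personal information that cannot be published here). I then came across a medical disease dataset shared by another developer, so I borrowed it to learn the basics of knowledge graphs by imitation.
First, a short introduction to what a knowledge graph is.
Introduction to knowledge graphs
A knowledge graph (Knowledge Graph), sometimes also called a scientific knowledge graph, is in essence a semantic network: a graph-based data structure made up of nodes (points) and edges. Each node represents an "entity" that exists in the real world, and each edge represents a "relationship" between two entities. This makes a knowledge graph one of the most direct ways to represent relationships.
Put simply, a knowledge graph is a network of relationships obtained by linking together many different kinds of information (heterogeneous information). It gives you the ability to analyze a problem from the point of view of "relationships".
In the graph below, 「刘德华」 (Andy Lau) sits at the center, and the information related to him can be looked up by following the edges outward, so the related information is presented visually.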
![fabf89a0e18ffe04b7d8cbac47a22cbe.png](https://i-blog.csdnimg.cn/blog_migrate/fb03d04e69a443dbed9ea3602ca67055.png)
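To make the node-and-edge idea concrete, here is a minimal sketch in plain Python that does not depend on any graph database. The triples are illustrative placeholders in the medical theme of this post, not rows from the dataset, and `build_index` and `neighbors` are hypothetical helper names.

```python
from collections import defaultdict

# Each fact is a (head entity, relationship, tail entity) triple.
# The values are illustrative placeholders, not real dataset rows.
triples = [
    ("disease_A", "HAS_SYMPTOM", "symptom_X"),
    ("disease_A", "DEPARTMENT_IS", "department_1"),
    ("disease_B", "HAS_SYMPTOM", "symptom_X"),
]


def build_index(facts):
    """Index the triples by entity so an edge can be followed in both directions."""
    index = defaultdict(list)
    for head, rel, tail in facts:
        index[head].append((rel, tail, "out"))
        index[tail].append((rel, head, "in"))
    return index


def neighbors(index, entity):
    """Return the sorted names of all entities directly connected to `entity`."""
    return sorted({other for _, other, _ in index[entity]})


index = build_index(triples)
print(neighbors(index, "disease_A"))  # entities one hop away from disease_A
print(neighbors(index, "symptom_X"))  # diseases that share symptom_X
```

Looking at the neighbors of `symptom_X` is the kind of "relationship" question a knowledge graph is meant to answer: which entities are connected through a shared node.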
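The tool for building the knowledge graph: Neo4j
Put simply, Neo4j is a graph database, and the Python package py2neo makes it easy to work with. As a newcomer to graphs, I could quickly create Neo4j nodes with it (using the graph.create(node) function), as shown below.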
![75938fdda8f9bb70c57cd6b46327439c.png](https://i-blog.csdnimg.cn/blog_migrate/ade13af252d7e68350ddc8811a0aa2b5.png)
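To create a relationship between two nodes, use py2neo's Relationship class, like this:
r = Relationship(node_a, "name of the relationship between A and B", node_b)
Once the relationship is written to the database (for example with graph.create(r)), the two nodes are connected as shown below.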
![7084510cf4ee482336c9c3936b00d0c6.png](https://i-blog.csdnimg.cn/blog_migrate/95add4880b352166da907d155ff8c9b9.png)
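The medical data
I learned from the code of the developer linked below; if you are interested, you can also try modifying it.
https://github.com/zhihao-chen/QASystemOnMedicalGraph
The author provides 19 categories, listed below.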
![7325b5b941b705955c100286b84e1659.png](https://i-blog.csdnimg.cn/blog_migrate/544965c2d486ff871b4cb47b86c8ddb5.png)
Here is the first record printed out, so you can see what information it contains.
![72bb607f8c9849c9345aef95655f68c5.png](https://i-blog.csdnimg.cn/blog_migrate/1c20f16f932f12319c02e5f20b98d0fc.png)
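Loading the medical data
Read all of the medical data from the data file (a CSV file, which can also be opened in Excel) and extract the entities and the relationships between them.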
from py2neo import Graph, Node, Relationship
import pandas as pd
import re
import os


def cell(value, default=""):
    """Return a CSV cell as a stripped string, or `default` when the cell is empty (NaN)."""
    return default if pd.isna(value) else str(value).strip()


class MedicalGraph:
    def __init__(self):
        cur_dir = os.path.dirname(os.path.abspath(__file__))
        self.data_path = os.path.join(cur_dir, 'DATA/disease.csv')
        # Adjust the URL and credentials to your own Neo4j instance. The accepted keyword
        # arguments differ between py2neo versions (newer ones document auth=(user, password)).
        self.graph = Graph("http://localhost:7474", username="neo4j", password="123456789")

    def read_file(self):
        """
        Read the file and collect the entities and the entity relationships.
        :return: entity sets, relationship lists and the list of disease attribute dicts
        """
        # cols = ["name", "alias", "part", "age", "infection", "insurance", "department", "checklist", "symptom",
        #         "complication", "treatment", "drug", "period", "rate", "money"]
        # Entities
        diseases = []  # diseases
        aliases = []  # aliases
        symptoms = []  # symptoms
        parts = []  # body parts
        departments = []  # hospital departments
        complications = []  # complications
        drugs = []  # drugs
        # Disease attributes: age, infection, insurance, checklist, treatment, period, rate, money
        diseases_infos = []
        # Relationships
        disease_to_symptom = []  # disease -> symptom
        disease_to_alias = []  # disease -> alias
        diseases_to_part = []  # disease -> body part
        disease_to_department = []  # disease -> department
        disease_to_complication = []  # disease -> complication
        disease_to_drug = []  # disease -> drug
        all_data = pd.read_csv(self.data_path, encoding='gb18030').values
        for data in all_data:
            disease_dict = {}  # attributes of one disease
            # Disease name
            disease = cell(data[0]).replace("...", " ").strip()
            diseases.append(disease)
            disease_dict["name"] = disease
            # Aliases; an empty cell falls back to "未知" ("unknown").
            # str(NaN) is the truthy string "nan", so the empty check is done in cell()
            line = re.sub("[,、;,.;]", " ", cell(data[1], "未知"))
            for alias in line.split():
                aliases.append(alias)
                disease_to_alias.append([disease, alias])
            # Body parts
            for part in cell(data[2], "未知").split():
                parts.append(part)
                diseases_to_part.append([disease, part])
            # Age group
            disease_dict["age"] = cell(data[3])
            # Infectiousness
            disease_dict["infection"] = cell(data[4])
            # Medical insurance
            disease_dict["insurance"] = cell(data[5])
            # Departments
            for department in cell(data[6]).split():
                departments.append(department)
                disease_to_department.append([disease, department])
            # Examinations
            disease_dict["checklist"] = cell(data[7])
            # Symptoms (the last token is dropped, as in the original code)
            for symptom in cell(data[8]).replace("...", " ").split()[:-1]:
                symptoms.append(symptom)
                disease_to_symptom.append([disease, symptom])
            # Complications (the last token is dropped, as in the original code)
            for complication in cell(data[9]).split()[:-1]:
                complications.append(complication)
                disease_to_complication.append([disease, complication])
            # Treatment (the last 4 characters are dropped, as in the original code)
            disease_dict["treatment"] = cell(data[10])[:-4]
            # Drugs (the last token is dropped, as in the original code)
            for drug in cell(data[11]).replace("...", " ").split()[:-1]:
                drugs.append(drug)
                disease_to_drug.append([disease, drug])
            # Recovery period
            disease_dict["period"] = cell(data[12])
            # Cure rate
            disease_dict["rate"] = cell(data[13])
            # Cost
            disease_dict["money"] = cell(data[14], "未知")
            diseases_infos.append(disease_dict)
        # Entities are de-duplicated with set(); relationships are returned as collected
        return (set(diseases), set(symptoms), set(aliases), set(parts), set(departments),
                set(complications), set(drugs), disease_to_alias, disease_to_symptom,
                diseases_to_part, disease_to_department, disease_to_complication,
                disease_to_drug, diseases_infos)
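The per-column logic above can be tried on a single in-memory row without the CSV file. This is a minimal sketch: the row values are made up, `split_cell` is a hypothetical helper, and the missing cell is represented by `float("nan")`, which is how pandas reports an empty CSV cell. Note that `str(float("nan"))` is the non-empty string `"nan"`, so a plain truthiness check on `str(value)` does not detect a missing cell.

```python
import math
import re


def split_cell(value, default="未知"):
    """Split a multi-valued cell into tokens; an empty (NaN) cell yields [default]."""
    if isinstance(value, float) and math.isnan(value):
        return [default]
    # Same separator set as the alias handling above, mapped to spaces.
    return re.sub("[,、;,.;]", " ", str(value)).split()


# Hypothetical row: (name, alias, part); the alias cell is missing.
row = ("disease_A", float("nan"), "part_1 part_2")

aliases, parts = set(), set()
disease_to_alias, disease_to_part = [], []

disease = str(row[0]).strip()
for alias in split_cell(row[1]):
    aliases.add(alias)
    disease_to_alias.append([disease, alias])
for part in split_cell(row[2]):
    parts.add(part)
    disease_to_part.append([disease, part])

print(disease_to_alias)  # pairs that later become Disease -> Alias edges
print(disease_to_part)   # pairs that later become Disease -> Part edges
```

Keeping the entities in sets and the relationships in lists of pairs mirrors the structure used in the loading code: the sets become nodes, the pairs become edges.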
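Creating the nodes
We use create_graphNodes to create the nodes. It builds them in two steps:
- Create the node properties: the detailed information of each disease is stored on its node, as shown in the figure below
- Create the nodes: one node per name under each label (such as Symptom, Part, ...)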
![ee6cecfd66c5ca8740faa87448a3bcba.png](https://i-blog.csdnimg.cn/blog_migrate/f55879500ddf8ae3e64681a832607209.png)
    # Methods of the graph-building class (they rely on self.graph and self.read_file)
    def create_node(self, label, nodes):
        """
        Create one node per name under the given label.
        :param label: node label
        :param nodes: iterable of node names
        :return:
        """
        count = 0
        for node_name in nodes:
            node = Node(label, name=node_name)
            self.graph.create(node)
            count += 1
            print(count, len(nodes))
        return

    def create_diseases_nodes(self, disease_info):
        """
        Create the disease nodes together with their properties.
        :param disease_info: list(Dict)
        :return:
        """
        count = 0
        for disease_dict in disease_info:
            node = Node("Disease", name=disease_dict['name'], age=disease_dict['age'],
                        infection=disease_dict['infection'], insurance=disease_dict['insurance'],
                        treatment=disease_dict['treatment'], checklist=disease_dict['checklist'],
                        period=disease_dict['period'], rate=disease_dict['rate'],
                        money=disease_dict['money'])
            self.graph.create(node)
            count += 1
            print(count)
        return

    def create_graphNodes(self):
        """
        Create the entities of the knowledge graph.
        :return:
        """
        # The unpacking spans two lines, so it has to be wrapped in parentheses
        (disease, symptom, alias, part, department, complication, drug, rel_alias, rel_symptom, rel_part,
         rel_department, rel_complication, rel_drug, rel_infos) = self.read_file()
        self.create_diseases_nodes(rel_infos)
        self.create_node("Symptom", symptom)
        self.create_node("Alias", alias)
        self.create_node("Part", part)
        self.create_node("Department", department)
        self.create_node("Complication", complication)
        self.create_node("Drug", drug)
        return
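Creating the relationships between nodes
Using the disease-to-category relationships collected earlier, we now connect the nodes. Two things are needed:
- Get the relationships between each disease and every category
- Use a Cypher query to look up the two nodes and create the relationship between them
The figure below shows the relationships between diseases and aliases (alias), which come from the disease_to_alias list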
![b9aede0abf2eb13eadc47768de856f81.png](https://i-blog.csdnimg.cn/blog_migrate/0af624d7418a80085741b01174a9dde2.png)
    # Methods of the graph-building class (they rely on self.graph and self.read_file)
    def create_graphRels(self):
        # The unpacking spans two lines, so it has to be wrapped in parentheses
        (disease, symptom, alias, part, department, complication, drug, rel_alias, rel_symptom, rel_part,
         rel_department, rel_complication, rel_drug, rel_infos) = self.read_file()
        self.create_relationship("Disease", "Alias", rel_alias, "ALIAS_IS", "别名")
        self.create_relationship("Disease", "Symptom", rel_symptom, "HAS_SYMPTOM", "症状")
        self.create_relationship("Disease", "Part", rel_part, "PART_IS", "发病部位")
        self.create_relationship("Disease", "Department", rel_department, "DEPARTMENT_IS", "所属科室")
        self.create_relationship("Disease", "Complication", rel_complication, "HAS_COMPLICATION", "并发症")
        self.create_relationship("Disease", "Drug", rel_drug, "HAS_DRUG", "药品")

    def create_relationship(self, start_node, end_node, edges, rel_type, rel_name):
        """
        Create the relationship edges between entities.
        :param start_node: label of the start node
        :param end_node: label of the end node
        :param edges: list of [start name, end name] pairs
        :param rel_type: relationship type
        :param rel_name: value stored in the relationship's name property
        :return:
        """
        count = 0
        # Remove duplicate edges
        unique_edges = set(tuple(edge) for edge in edges)
        total = len(unique_edges)  # renamed from `all`, which shadowed the built-in
        # Look up both nodes with Cypher and create the edge. Labels and the relationship
        # type cannot be Cypher parameters, so they are formatted into the text; the names
        # are passed as parameters so that quotes inside a name cannot break the query.
        query = ("match (p:%s), (q:%s) where p.name=$p_name and q.name=$q_name "
                 "create (p)-[rel:%s {name: $rel_name}]->(q)" % (start_node, end_node, rel_type))
        for p, q in unique_edges:
            try:
                self.graph.run(query, p_name=p, q_name=q, rel_name=rel_name)
                count += 1
                print(rel_type, count, total)
            except Exception as e:
                print(e)
        return
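Two details of the relationship step can be tried without a database. Values such as disease names can contain quotes, so passing them as query parameters is safer than formatting them into the query text; labels and relationship types cannot be parameters in Cypher, so it can be worth validating them. The sketch below only builds the query text and the parameter dict. It is not the original author's code: `unique_edges` and `build_rel_query` are hypothetical helper names, the edge values are made up, `MERGE` is swapped in for `CREATE` so that a second run does not duplicate relationships, and the `$name` parameter syntax assumes Neo4j 3.0 or later.

```python
import re

# Labels and relationship types are restricted to plain identifiers.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def unique_edges(edges):
    """Drop duplicate [start, end] pairs while keeping first-seen order."""
    seen = set()
    result = []
    for start, end in edges:
        if (start, end) not in seen:
            seen.add((start, end))
            result.append((start, end))
    return result


def build_rel_query(start_label, end_label, rel_type):
    """Return Cypher text in which node names are parameters, not formatted in."""
    for identifier in (start_label, end_label, rel_type):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError("unsafe label or relationship type: %r" % identifier)
    return (
        "MATCH (p:%s {name: $p_name}), (q:%s {name: $q_name}) "
        "MERGE (p)-[rel:%s {name: $rel_name}]->(q)"
        % (start_label, end_label, rel_type)
    )


# Made-up edges: one duplicate pair and one name containing a quote.
edges = [["disease_A", "alias_1"], ["disease_A", "alias_1"], ["it's", "alias_2"]]
query = build_rel_query("Disease", "Alias", "ALIAS_IS")
for p_name, q_name in unique_edges(edges):
    params = {"p_name": p_name, "q_name": q_name, "rel_name": "别名"}
    print(query, params)
    # With a py2neo Graph object this would be sent as: graph.run(query, **params)
```

The query text is the same for every edge; only the parameter dict changes, so a name like `it's` never becomes part of the Cypher string.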
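The relationship graph of the knowledge graph
Once everything has been created, open http://localhost:7474 in a browser to launch Neo4j and look at the finished knowledge graph.
The figure below shows the disease-to-department relationships: you can see how the dermatology department (皮肤科) is connected to its diseases.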
![c97628e1a1a9257dd802c03fa9251f03.png](https://i-blog.csdnimg.cn/blog_migrate/bd95d78c4493e7301cbff12d0bf9bcf2.jpeg)
That completes the relationships of the knowledge graph. If you are interested, try building a knowledge graph of your own!