1. Environment
Neo4j Version: 3.5.8
Edition: Community
2. Data source
http://www.openkg.cn/dataset/e5a5f1fb-505a-4ccc-bac5-4eb9856b667e
https://www.ownthink.com/docs/kg/#_1
The download is a single file, ownthink_v2.csv, containing about 140 million rows:
wc -l ownthink_v2.csv
#140919781 ownthink_v2.csv
The first 20 lines of ownthink_v2.csv (header: entity, attribute, value):
实体,属性,值
胶饴,描述,别名: 饴糖、畅糖、畅、软糖。
词条,描述,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。
词条,标签,文化
红色食品,描述,红色食品是指食品为红色、橙红色或棕红色的食品。
红色食品,中文名,红色食品
红色食品,是否含防腐剂,否
红色食品,主要食用功效,预防感冒,缓解疲劳
红色食品,适宜人群,全部人群
红色食品,用途,增强表皮细胞再生和防止皮肤衰老
红色食品,标签,非科学
红色食品,标签,生活
大龙湫,描述,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。
大龙湫,中文名称,大龙湫
大龙湫,外文名称,big dragon autrum
大龙湫,地理位置,浙江省温州市雁荡山景区
大龙湫,开放时间,08:00~18:00
大龙湫,门票价格,50元
大龙湫,著名景点,芙蓉峰
大龙湫,著名景点,剪刀峰
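Each line is an entity,attribute,value triple, and the value may itself contain commas (see the 主要食用功效 row above). The following is a minimal sketch, not part of the original workflow, of how one might stream the file and count attributes without loading all 140 million lines into memory. It assumes the file is UTF-8 and that only the value column can contain commas; `parse_triple` and `count_attributes` are names made up for this illustration.

```python
from collections import Counter
from itertools import islice


def parse_triple(line):
    """Split one 'entity,attribute,value' line into a 3-tuple.

    Assumption: only the value (last column) may contain commas, so the
    line is split on the first two commas only. Returns None if the line
    has fewer than three columns.
    """
    parts = line.rstrip("\n").split(",", 2)
    if len(parts) != 3:
        return None
    return tuple(parts)


def count_attributes(lines, limit=None):
    """Count how often each attribute occurs in an iterable of lines.

    `limit` caps the number of lines read (None means read everything).
    """
    counts = Counter()
    for line in islice(lines, limit):
        triple = parse_triple(line)
        if triple is None:
            continue
        counts[triple[1]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "红色食品,中文名,红色食品\n",
        "红色食品,主要食用功效,预防感冒,缓解疲劳\n",
        "红色食品,标签,非科学\n",
        "红色食品,标签,生活\n",
    ]
    print(parse_triple(sample[1]))
    print(count_attributes(sample).most_common(1))
```

For the real file, an open file object can be passed directly, for example `count_attributes(open("ownthink_v2.csv", encoding="utf-8"), limit=1000000)`. Note that the header row 实体,属性,值 is counted like any other triple.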
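3. Data cleaning
3.1 Download the rdf-converter tool
Get it from https://github.com/jievince/rdf-converter and run it following its instructions. It generates two files, edge.csv and vertex.csv.
vertex.csv looks like this (ID, name):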
-201035082963479683,实体
-1779678833482502384,值
4646408208538057683,胶饴
-1861609733419239066,别名: 饴糖、畅糖、畅、软糖。
-2047289935702608120,词条
5842706712819643509,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。
-3063129772935425027,文化
-2484942249444426630,红色食品
-3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。
-3402450096279275143,否
4786182067583989997,预防感冒,缓解疲劳
-8978611301755314833,全部人群
-382812815618074210,增强表皮细胞再生和防止皮肤衰老
3455734391170888430,非科学
-4368442157131186527,生活
-4016848910133347272,大龙湫
-1751058806841876591,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。
-4369745808943528904,big dragon autrum
-3278556255913778158,浙江省温州市雁荡山景区
-1081363081064284954,08:00~18:00
edge.csv looks like this (source ID, target ID, attribute):
-201035082963479683,-1779678833482502384,属性
4646408208538057683,-1861609733419239066,描述
-2047289935702608120,5842706712819643509,描述
-2047289935702608120,-3063129772935425027,标签
-2484942249444426630,-3877061284769534378,描述
-2484942249444426630,-2484942249444426630,中文名
-2484942249444426630,-3402450096279275143,是否含防腐剂
-2484942249444426630,4786182067583989997,主要食用功效
-2484942249444426630,-8978611301755314833,适宜人群
-2484942249444426630,-382812815618074210,用途
-2484942249444426630,3455734391170888430,标签
-2484942249444426630,-4368442157131186527,标签
-4016848910133347272,-1751058806841876591,描述
-4016848910133347272,-4016848910133347272,中文名称
-4016848910133347272,-4369745808943528904,外文名称
-4016848910133347272,-3278556255913778158,地理位置
-4016848910133347272,-1081363081064284954,开放时间
-4016848910133347272,3797530799472559859,门票价格
-4016848910133347272,6249183780323029504,著名景点
-4016848910133347272,6601364230245153029,著名景点
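In the samples above, equal text appears to map to the same ID: the entity 红色食品 and its 中文名 value share one ID, which is why that edge points from a node to itself. As a sanity check on a small sample, the IDs in edge.csv can be resolved back to names. This is only a sketch under the assumption that both files are plain `id,name` and `src,dst,attribute` lines; `load_vertices` and `resolve_edges` are hypothetical helper names, and loading every vertex into a dict is meant for samples, not for the full data set.

```python
def load_vertices(lines):
    """Build {id: name} from 'id,name' lines (the name may contain commas)."""
    names = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # partition() splits on the first comma only.
        vid, _, name = line.partition(",")
        names[vid] = name
    return names


def resolve_edges(edge_lines, names):
    """Yield (subject, attribute, object) with IDs replaced by names.

    IDs that are missing from `names` are left as the raw ID string.
    """
    for line in edge_lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(",", 2)
        if len(parts) != 3:
            continue
        src, dst, attribute = parts
        yield names.get(src, src), attribute, names.get(dst, dst)


if __name__ == "__main__":
    vertices = [
        "-2047289935702608120,词条",
        "-3063129772935425027,文化",
    ]
    edges = ["-2047289935702608120,-3063129772935425027,标签"]
    print(list(resolve_edges(edges, load_vertices(vertices))))
```

If the resolved triples match lines from ownthink_v2.csv, the conversion kept subjects and objects in the expected columns.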
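In this form, vertex.csv and edge.csv cannot be imported into Neo4j directly; they first have to be converted to the header format that neo4j-admin import expects.
3.2 Convert nodes and relationships to the Neo4j format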
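import csv


def prep_vertex_all():
    """Convert vertex.csv (id,name) into a node file for neo4j-admin import."""
    frname = "kg/kg-clean/vertex.csv"
    fwname = "kg/kg-clean/vertex_output_all.csv"
    fename = "kg/kg-clean/err_vertex.csv"  # log of lines that cannot be parsed
    with open(frname, 'r', encoding='utf-8') as fr, \
            open(fwname, 'w', encoding='utf-8', newline='') as fw, \
            open(fename, 'w', encoding='utf-8') as ferror:
        # csv.writer quotes any name that contains a comma, so it stays one column.
        writer = csv.writer(fw, lineterminator='\n')
        writer.writerow([":ID", "name", ":LABEL"])
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                # Split on the first comma only; the name itself may contain commas.
                vid, name = line.split(",", 1)
                writer.writerow([vid, name.replace('"', ''), "ENTITY"])
            except ValueError:
                # Malformed lines (no comma) are logged and skipped.
                ferror.write("{}\n".format(line))


def prep_edge_all():
    """Convert edge.csv (start_id,end_id,attribute) into a relationship file."""
    frname = "kg/kg-clean/edge.csv"
    fwname = "kg/kg-clean/edge_output_all.csv"
    fename = "kg/kg-clean/err_edge.csv"  # log of lines that cannot be parsed
    print(frname)
    print(fwname)
    with open(frname, 'r', encoding='utf-8') as fr, \
            open(fwname, 'w', encoding='utf-8', newline='') as fw, \
            open(fename, 'w', encoding='utf-8') as ferror:
        writer = csv.writer(fw, lineterminator='\n')
        writer.writerow([":START_ID", "name", ":END_ID", ":TYPE"])
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                # The two IDs never contain commas; the attribute is the remainder.
                start_id, end_id, attribute = line.split(",", 2)
                # Every edge gets the same type; the attribute is kept as a property.
                writer.writerow([start_id, attribute.replace('"', ''), end_id, "RELATIONSHIP"])
            except ValueError:
                ferror.write("{}\n".format(line))


if __name__ == '__main__':
    prep_vertex_all()
    prep_edge_all()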
Running the code above produces two files, edge_output_all.csv and vertex_output_all.csv, in the format that neo4j-admin import expects.
vertex_output_all.csv looks like this:
:ID,name,:LABEL
-201035082963479683,实体,ENTITY
-1779678833482502384,值,ENTITY
4646408208538057683,胶饴,ENTITY
-1861609733419239066,别名: 饴糖、畅糖、畅、软糖。,ENTITY
-2047289935702608120,词条,ENTITY
5842706712819643509,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。,ENTITY
-3063129772935425027,文化,ENTITY
-2484942249444426630,红色食品,ENTITY
-3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。,ENTITY
-3402450096279275143,否,ENTITY
4786182067583989997,预防感冒,缓解疲劳,ENTITY
-8978611301755314833,全部人群,ENTITY
-382812815618074210,增强表皮细胞再生和防止皮肤衰老,ENTITY
3455734391170888430,非科学,ENTITY
-4368442157131186527,生活,ENTITY
-4016848910133347272,大龙湫,ENTITY
-1751058806841876591,雁荡山景区分散,东起羊角洞,西至锯板岭;南起筋竹溪,北至六坪山。,ENTITY
-4369745808943528904,big dragon autrum,ENTITY
-3278556255913778158,浙江省温州市雁荡山景区,ENTITY
edge_output_all.csv looks like this:
:START_ID,name,:END_ID,:TYPE
-201035082963479683,属性,-1779678833482502384,RELATIONSHIP
4646408208538057683,描述,-1861609733419239066,RELATIONSHIP
-2047289935702608120,描述,5842706712819643509,RELATIONSHIP
-2047289935702608120,标签,-3063129772935425027,RELATIONSHIP
-2484942249444426630,描述,-3877061284769534378,RELATIONSHIP
-2484942249444426630,中文名,-2484942249444426630,RELATIONSHIP
-2484942249444426630,是否含防腐剂,-3402450096279275143,RELATIONSHIP
-2484942249444426630,主要食用功效,4786182067583989997,RELATIONSHIP
-2484942249444426630,适宜人群,-8978611301755314833,RELATIONSHIP
-2484942249444426630,用途,-382812815618074210,RELATIONSHIP
-2484942249444426630,标签,3455734391170888430,RELATIONSHIP
-2484942249444426630,标签,-4368442157131186527,RELATIONSHIP
-4016848910133347272,描述,-1751058806841876591,RELATIONSHIP
-4016848910133347272,中文名称,-4016848910133347272,RELATIONSHIP
-4016848910133347272,外文名称,-4369745808943528904,RELATIONSHIP
-4016848910133347272,地理位置,-3278556255913778158,RELATIONSHIP
-4016848910133347272,开放时间,-1081363081064284954,RELATIONSHIP
-4016848910133347272,门票价格,3797530799472559859,RELATIONSHIP
-4016848910133347272,著名景点,6249183780323029504,RELATIONSHIP
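Since the import runs for a while, it can be worth checking the generated files first. The sketch below is not part of the original workflow: it verifies the header and the column count of every row, on the assumption that the import tool treats double-quoted fields the same way Python's csv module does. `check_csv` is a hypothetical helper; a name containing an unquoted comma shows up as a row with an extra column.

```python
import csv
import io

NODE_HEADER = [":ID", "name", ":LABEL"]
EDGE_HEADER = [":START_ID", "name", ":END_ID", ":TYPE"]


def check_csv(stream, expected_header, max_rows=None):
    """Return a list of (line_number, problem) tuples for suspicious rows.

    `stream` is any iterable of text lines (for example an open file).
    `max_rows` stops the scan after that many lines of input, header included.
    """
    problems = []
    reader = csv.reader(stream)
    header = next(reader, None)
    if header != expected_header:
        problems.append((1, "unexpected header: {}".format(header)))
    for row in reader:
        if max_rows is not None and reader.line_num > max_rows:
            break
        if len(row) != len(expected_header):
            problems.append((
                reader.line_num,
                "expected {} columns, got {}".format(len(expected_header), len(row)),
            ))
    return problems


if __name__ == "__main__":
    demo = io.StringIO(
        ":ID,name,:LABEL\n"
        "-2047289935702608120,词条,ENTITY\n"
        "4786182067583989997,预防感冒,缓解疲劳,ENTITY\n"
    )
    for line_number, problem in check_csv(demo, NODE_HEADER):
        print(line_number, problem)
```

For the real files, open them with `newline=''` and `encoding='utf-8'` and pass the file object as `stream`, using `EDGE_HEADER` for edge_output_all.csv.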
4. Import the data into Neo4j
bin/neo4j-admin import --database=graph-all.db --mode=csv --nodes /ownthink_data/vertex_output_all.csv --relationships /ownthink_data/edge_output_all.csv --ignore-duplicate-nodes=true --ignore-missing-nodes=true --id-type=string
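If no error is reported, the import takes about 20 minutes and ends with a summary of the imported nodes and relationships, which means it succeeded.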
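5. Restart required
5.1 Replace the database
The databases directory now holds two databases: graph.db (the one used by default) and graph-all.db (newly created by the import). Overwrite graph.db with graph-all.db and set its permissions to 777.
Run: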
rm -r graph.db
mv graph-all.db graph.db
chmod -R 777 graph.db
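The `rm -r` above deletes the previous graph.db for good. A more cautious variant, sketched here in Python and not taken from the original steps, renames the old database to a timestamped backup instead. It assumes Neo4j is stopped while the directories are moved; `swap_database` and the `.bak-` suffix are made up for this illustration, and permissions still have to be set separately.

```python
import time
from pathlib import Path


def swap_database(databases_dir, new_name="graph-all.db", active_name="graph.db"):
    """Make `new_name` the active database, keeping the old one as a backup.

    Returns the backup path, or None if there was no active database.
    Assumption: the Neo4j service is not running during the swap.
    """
    databases_dir = Path(databases_dir)
    new_db = databases_dir / new_name
    active = databases_dir / active_name
    if not new_db.is_dir():
        raise FileNotFoundError(str(new_db))

    backup = None
    if active.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = databases_dir / "{}.bak-{}".format(active_name, stamp)
        # Rename instead of delete, so the old data can be restored.
        active.rename(backup)
    new_db.rename(active)
    return backup
```

Calling `swap_database("data/databases")` from the Neo4j home directory would be the typical use, with the path adjusted to wherever the databases directory actually lives.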
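5.2 Restart the service
Restart Neo4j (for example with bin/neo4j restart), then log in through the browser. The imported nodes and relationships are shown there, so the import succeeded.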
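6. Optimization
All relationships above share the single type RELATIONSHIP, so the browser cannot tell them apart.
To fix this, change how edge_output_all.csv is generated so that the attribute itself becomes the relationship type: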
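import csv


def prep_edge_all():
    """Convert edge.csv into a relationship file whose :TYPE is the attribute."""
    frname = "kg/kg-clean/edge.csv"
    fwname = "kg/kg-clean/edge_output_all.csv"
    fename = "kg/kg-clean/err_edge.csv"  # log of lines that cannot be parsed
    print(frname)
    print(fwname)
    with open(frname, 'r', encoding='utf-8') as fr, \
            open(fwname, 'w', encoding='utf-8', newline='') as fw, \
            open(fename, 'w', encoding='utf-8') as ferror:
        writer = csv.writer(fw, lineterminator='\n')
        writer.writerow([":START_ID", "name", ":END_ID", ":TYPE"])
        for line in fr:
            line = line.strip()
            if not line:
                continue
            try:
                start_id, end_id, attribute = line.split(",", 2)
                attribute = attribute.replace('"', '')
                # This is the changed line: the attribute is written as the :TYPE too.
                writer.writerow([start_id, attribute, end_id, attribute])
            except ValueError:
                ferror.write("{}\n".format(line))


if __name__ == '__main__':
    prep_edge_all()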
After the change, edge_output_all.csv looks like this:
:START_ID,name,:END_ID,:TYPE
-201035082963479683,属性,-1779678833482502384,属性
4646408208538057683,描述,-1861609733419239066,描述
-2047289935702608120,描述,5842706712819643509,描述
-2047289935702608120,标签,-3063129772935425027,标签
-2484942249444426630,描述,-3877061284769534378,描述
-2484942249444426630,中文名,-2484942249444426630,中文名
-2484942249444426630,是否含防腐剂,-3402450096279275143,是否含防腐剂
-2484942249444426630,主要食用功效,4786182067583989997,主要食用功效
-2484942249444426630,适宜人群,-8978611301755314833,适宜人群
-2484942249444426630,用途,-382812815618074210,用途
-2484942249444426630,标签,3455734391170888430,标签
-2484942249444426630,标签,-4368442157131186527,标签
-4016848910133347272,描述,-1751058806841876591,描述
-4016848910133347272,中文名称,-4016848910133347272,中文名称
-4016848910133347272,外文名称,-4369745808943528904,外文名称
-4016848910133347272,地理位置,-3278556255913778158,地理位置
-4016848910133347272,开放时间,-1081363081064284954,开放时间
-4016848910133347272,门票价格,3797530799472559859,门票价格
-4016848910133347272,著名景点,6249183780323029504,著名景点
Then repeat steps 4 and 5.
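Using the attribute as the relationship type creates one type per distinct attribute. To my knowledge Neo4j limits the number of distinct relationship types (on the order of 65,000 in the standard store format; check the documentation for your version), and a data set of this size may contain more attributes than that. One possible workaround, sketched here as an assumption rather than something the steps above do, is to keep only the most frequent attributes as their own types and map the rest to the generic RELATIONSHIP type. `choose_types` is a hypothetical helper.

```python
from collections import Counter


def choose_types(attributes, max_types, fallback="RELATIONSHIP"):
    """Map each attribute to a relationship type, capping the number of types.

    The `max_types - 1` most frequent attributes keep their own name as the
    type; every other attribute is mapped to `fallback`, so at most
    `max_types` distinct types are produced.
    """
    counts = Counter(attributes)
    keep = {name for name, _ in counts.most_common(max(max_types - 1, 0))}
    return {name: (name if name in keep else fallback) for name in counts}


if __name__ == "__main__":
    seen = ["标签", "标签", "标签", "描述", "描述", "用途"]
    mapping = choose_types(seen, max_types=3)
    for attribute, rel_type in mapping.items():
        print(attribute, "->", rel_type)
```

This needs two passes over edge.csv: one to collect the attributes and build the mapping, and one to write the file with `mapping[attribute]` in the :TYPE column. The name column still holds the original attribute, so no information is lost for the edges that fall back to RELATIONSHIP.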