Background
This is a demo project I built last September while studying knowledge graphs and recommendation. It grew out of an open-source knowledge graph project about the automotive industry that I found on GitHub; I reworked it into a knowledge-graph-based movie and TV series recommendation system.
Environment
Python 3, the Flask web framework, and the neo4j graph database (3.3.1)
Operating system: Windows 10
Project Structure
After cloning the automotive project above, the overall project structure looks like the figure below.
It contains two versions, 第一次验收 ("first acceptance") and 第二次验收 ("second acceptance"). The main difference is the database: the former uses MySQL, the latter neo4j. My rework is based on the second version. Opening it, its structure looks like the figure below.
Workflow Analysis
Before we can rework the original project, we need to walk through its workflow step by step.
Reading and Inserting Data
First we need to get the data into neo4j, so the first step is to start it. Open cmd and run the following command:
neo4j console
If cmd prints the message below, neo4j has started.
The last line shows the address http://localhost:7474; paste it into the browser's address bar and hit Enter, and you will see the neo4j console, as shown below.
With the database running, open the file kg\kg.py in the project. Its main code is as follows:
def data_init(self):
    # Connect to the graph database
    print('Starting data preprocessing')
    self.graph = Graph('http://localhost:7474', user="neo4j", password="szc")
    self.selector = NodeSelector(self.graph)
    self.graph.delete_all()

def insert_datas(self):
    print('Starting data insertion')
    with open('../data/tuples/three_tuples_2.txt', 'r', encoding='utf-8') as f:
        lines, num = f.readlines(), -1
        for line in lines:
            num += 1
            if num % 500 == 0:
                print('Progress: {}/{}'.format(num, len(lines)))
            line = line.strip().split(' ')
            if len(line) != 3:
                print('insert_datas error:', line)
                continue
            self.insert_one_data(line)

def insert_one_data(self, line):
    if '' in line:
        print('insert_one_data error', line)
        return
    start = self.look_and_create(line[0])
    for name in self.get_items(line[2]):
        end = self.look_and_create(name)
        r = Relationship(start, line[1], end, name=line[1])
        self.graph.create(r)  # Does not create a duplicate if it already exists

# Look up a node; create one if it does not exist yet
def look_and_create(self, name):
    end = self.graph.find_one(label="car_industry", property_key="name", property_value=name)
    if end is None:
        end = Node('car_industry', name=name)
    return end

def get_items(self, line):
    if '{' not in line and '}' not in line:
        return [line]
    # Sanity check
    if '{' not in line or '}' not in line:
        print('get_items Error', line)
    lines = [w[1:-1] for w in re.findall('{.*?}', line)]
    return lines
The data_init() function at the top connects to the neo4j database; just pass in the database address, user name, and password. It then calls graph.delete_all() to wipe the existing data before inserting; whether to keep this step depends on your own scenario.
Next is insert_datas(), which reads the txt file, iterates over its lines, and calls insert_one_data() on each one to parse it and create the nodes and relationships. As the code shows, each line has the form "head relation tail". For example, "安阳 位置 豫北" means the relation between the entities 安阳 and 豫北 is 位置 (location), directed 安阳 --> 位置 --> 豫北.
insert_one_data() first checks whether a node with the same name already exists in the database, and depending on the result either reuses the existing one or creates a new one; that logic lives in look_and_create().
In look_and_create(), "car_industry" is the node label (I think of it as analogous to a database name in MySQL, the one you'd select with use some_database). In find_one(), property_key corresponds to the parameter name `name` in Node's constructor, and property_value is that parameter's value, i.e. the entity name. Taking my hometown, the 安阳 entity, as an example, its storage in neo4j can be understood as {property_key: "name", property_value: "安阳"}.
The final get_items() function just validates and expands the entity field; I won't dwell on it.
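As a standalone illustration of the parsing above (plain functions, no neo4j; the sample lines mirror the "安阳 位置 豫北" example, and multi-valued tails use the brace syntax get_items() expands):

```python
import re

def parse_triple(line):
    """Split a raw line into (head, relation, tail-spec); return None if malformed."""
    parts = line.strip().split(' ')
    if len(parts) != 3 or '' in parts:
        return None
    return tuple(parts)

def get_items(field):
    """Expand a tail field: a plain value, or several values each wrapped in braces."""
    if '{' not in field and '}' not in field:
        return [field]
    return [w[1:-1] for w in re.findall('{.*?}', field)]

print(parse_triple('安阳 位置 豫北'))  # ('安阳', '位置', '豫北')
print(get_items('{豫北}{河南}'))       # ['豫北', '河南']
```

Each expanded tail then becomes one relationship from the head node, which is why a single line can create several edges.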
Running the Service
Once all the data is in the database, we can run the service, which corresponds to run_server.py:
if __name__ == '__main__':
    args = get_args()
    print('\nhttp_host:{},http_port:{}'.format('localhost', args.http_port))
    app.run(debug=True, host='210.41.97.169', port=8090)
The key line is the app.run() call; just replace the IP and port with your own.
Handling Page Requests
The business logic is: enter a URL with parameters in the browser and get back the relevant results.
The parameter handling lives in views.py, whose main code is as follows:
@app.route('/KnowGraph/v2', methods=["POST"])
def look_up():
    kg = KnowGraph(get_args())
    client_params = request.get_json(force=True)
    server_param = {}
    if client_params['method'] == 'entry_to_entry':
        kg.lookup_entry2entry(client_params, server_param)
    elif client_params['method'] == 'entry_to_property':
        kg.lookup_entry2property(client_params, server_param)
    elif client_params['method'] == 'entry':
        kg.lookup_entry(client_params, server_param)
    elif client_params['method'] == 'statistics':
        kg.lookup_statistics(client_params, server_param)
    elif client_params['method'] == 'live':
        params = {'success': 'true'}
        server_param['result'] = params
    server_param['id'] = client_params['id']
    server_param['jsonrpc'] = client_params['jsonrpc']
    server_param['method'] = client_params['method']
    print(server_param)
    return json.dumps(server_param, ensure_ascii=False).encode("utf-8")
As you can see, POST requests to the /KnowGraph/v2 path are routed to look_up(), which dispatches to different query functions on the kg object depending on the method parameter.
However, typing a URL with parameters into the browser and hitting Enter issues a GET request, and the routes that pass data to the Flask templates are missing as well, so this file needs a major rework.
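For reference, this is the general shape a Flask GET route over query parameters takes (a generic sketch in the direction the rework will go; the handler and parameter handling here are illustrative, not the project's actual views.py):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/KnowGraph/v2', methods=['GET'])
def look_up_get():
    # Query-string parameters arrive in request.args (a MultiDict)
    name = request.args.get('name', default='')
    deep = request.args.get('deep', default=2, type=int)
    if not name:
        return jsonify({'success': 'false'}), 400
    return jsonify({'success': 'true', 'name': name, 'deep': deep})

# Exercise the route without starting a server
client = app.test_client()
print(client.get('/KnowGraph/v2?name=雪豹').get_json())
```

Flask's test_client() lets you verify the routing before wiring in the real query logic.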
Data Queries
As just mentioned, views.py calls different functions on the kg object, depending on the value of the method parameter, to obtain different results.
The kg object's class, KnowGraph, lives in module.py. Taking the simplest and most fundamental case, entity lookup, let's see how it is implemented; this corresponds to the lookup_entry function:
def lookup_entry(self, client_params, server_param):
    # Supports setting the search depth
    start_time = time.time()
    params = client_params["params"]
    edges = set()
    self.lookup_entry_deep(edges, params, 0)
    if len(edges) == 0:
        server_param['result'] = {"success": 'false'}
    else:
        server_param['result'] = {'edges': [list(i) for i in edges], "success": 'true'}
    print('Triples found this time: {}, elapsed: {}s'.format(len(edges), time.time() - start_time))
Apart from the timing, it mainly extracts params from the client parameters, which contains the entity name to look up and the search depth, then calls lookup_entry_deep() to do the search, accumulating results in the edges set. Finally each element of edges is converted to a list and stored under 'edges' inside server_param's 'result' entry for the response.
Next, let's look at the implementation of lookup_entry_deep():
def lookup_entry_deep(self, edges, params, deep):
    # The current depth must not reach the requested depth
    if deep >= params['deep']:
        return
    # Forward search
    result1 = self.graph.data("match (s)-[r]->(e) where s.name='{}' return s.name,r.name,e.name".format(params['name']))
    # Backward search
    result2 = self.graph.data("match (e)<-[r]-(s) where e.name='{}' return s.name,r.name,e.name".format(params['name']))
    if len(result1) == 0 and len(result2) == 0:
        return
    for item in result1:
        edges.add((item['s.name'], item['r.name'], item['e.name']))
        if item['s.name'] != item['e.name']:  # Avoid infinite loops like 双面胶 -[中文名]-> 双面胶
            params['name'] = item['e.name']
            self.lookup_entry_deep(edges, params.copy(), deep + 1)
    for item in result2:
        edges.add((item['s.name'], item['r.name'], item['e.name']))
        if item['s.name'] != item['e.name']:  # Avoid infinite loops like 双面胶 -[中文名]-> 双面胶
            params['name'] = item['e.name']
            self.lookup_entry_deep(edges, params.copy(), deep + 1)
First, if the depth limit is reached, it returns immediately. Otherwise it queries the database forward and then backward for the entity name in params['name'], stores each row as a tuple in the edges set, and recurses with depth + 1.
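The recursion is easier to see against a tiny in-memory triple store; here is a minimal sketch with the same depth limit and self-loop guard (plain Python, no neo4j; the sample triples are hypothetical):

```python
def lookup_deep(triples, edges, name, deep, max_deep):
    """Collect triples reachable from `name` within max_deep hops."""
    if deep >= max_deep:
        return
    # Forward hits: name as head; backward hits: name as tail
    hits = [t for t in triples if t[0] == name or t[2] == name]
    for s, r, e in hits:
        if (s, r, e) in edges:
            continue  # already collected; avoids re-expanding the same edge
        edges.add((s, r, e))
        nxt = e if s == name else s  # hop to the other endpoint
        if s != e:  # guard against self-loops like 双面胶 -[中文名]-> 双面胶
            lookup_deep(triples, edges, nxt, deep + 1, max_deep)

triples = [('上将XXX', 'director', '安澜'), ('彭德怀元帅', 'director', '安澜')]
edges = set()
lookup_deep(triples, edges, '上将XXX', 0, 2)
print(edges)  # both triples: one hop to 安澜, a second hop back out
```

With max_deep=1 only the first triple would be collected, which is exactly the depth-cutoff behaviour of the original function.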
Modifications
That is the existing workflow. Now let's rework it for the movie and TV recommendation scenario.
Suppose a user has watched the TV series 上将XXX. Based on its director, cast, region, language, genre tags, and so on, we can recommend other shows the user might be interested in.
Data Format
Our files live in the wiki directory, all txt files containing one JSON record per line. One record looks like this:
{
.....
"title": "上将XXX",
"wikiData": {
.....
"wikiInfo": {
"country": "中国大陆",
"language": "普通话",
"directors": [
"安澜"
],
"actors": [
"宋春丽",
"王伍福",
"张秋歌",
"范明",
"刘劲",
"陶慧敏",
"侯勇"
],
....
},
....
"wikiTags": [
"电视剧",
"历史",
"战争",
"军旅",
"革命",
"动作",
"热血",
"激昂",
"24-36",
"36-45",
"45-55",
"55-70",
"上星剧",
"传记"
]
}
}
The useful fields, formatted as above, are things like the director and cast.
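As a standalone sketch of the extraction this implies (field names follow the sample record above; the helper name is mine), one record flattens into triples like this:

```python
import json

def record_to_triples(line):
    """Turn one JSON line into (title, relation, value) triples."""
    item = json.loads(line)
    title = item.get('title')
    info = item.get('wikiData', {}).get('wikiInfo', {})
    triples = []
    # Single-valued fields
    for key, relation in [('country', 'country'), ('language', 'language')]:
        if key in info:
            triples.append((title, relation, info[key]))
    # Multi-valued fields
    for key, relation in [('directors', 'director'), ('actors', 'actor')]:
        for value in info.get(key, []):
            triples.append((title, relation, value))
    return triples

line = '{"title": "上将XXX", "wikiData": {"wikiInfo": {"country": "中国大陆", "directors": ["安澜"], "actors": ["侯勇"]}}}'
print(record_to_triples(line))
```

Every triple then becomes one relationship in the graph, with the show title as the head node.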
With the workflow of the original project laid out, we can now rework it piece by piece.
Reading and Inserting Data
This corresponds to kg.py. First, define the data directory:
data_dir = "C:\\Users\\songzeceng\\Desktop\\wiki\\"
Then iterate over the files in that directory, reading and parsing each one:
def insert_data_from_txt(self, file_path):
    try:
        with open(file=file_path, mode="r", encoding="utf-8") as f:
            for line in f.readlines():
                item = json.loads(line)
                if 'title' not in item.keys():
                    continue
                title = self.look_and_create(item['title'])
                if 'wikiData' not in item.keys():
                    continue
                wikiData = item['wikiData']
                if 'wikiDesc' in wikiData.keys():
                    wikiDesc = self.look_and_create(wikiData['wikiDesc'])
                    self.create_sub_graph(entity1=title, entity2=wikiDesc, relation="desc")
                if 'wikiTags' in wikiData.keys():
                    for tag in wikiData['wikiTags']:
                        tag = self.look_and_create(tag)
                        self.create_sub_graph(entity1=title, entity2=tag, relation="tag")
                wikiInfo = wikiData['wikiInfo']
                if 'country' in wikiInfo.keys():
                    country = self.look_and_create(wikiInfo['country'])
                    self.create_sub_graph(entity1=title, entity2=country, relation="country")
                if 'language' in wikiInfo.keys():
                    language = self.look_and_create(wikiInfo['language'])
                    self.create_sub_graph(entity1=title, entity2=language, relation="language")
                if 'actors' in wikiInfo.keys():
                    for actor in wikiInfo['actors']:
                        actor = self.look_and_create(actor)
                        self.create_sub_graph(entity1=title, entity2=actor, relation="actor")
                if 'directors' in wikiInfo.keys():
                    for director in wikiInfo['directors']:
                        director = self.look_and_create(director)
                        self.create_sub_graph(entity1=title, entity2=director, relation="director")
        print(file_path, "done")
    except Exception as e:
        print("Error reading file " + file_path + ": " + str(e))
It looks long, but it just parses each field, looking up or creating the entity first via look_and_create(). Since my py2neo version differs from the original project's, I rewrote that function as follows:
def look_and_create(self, name):
    matcher = NodeMatcher(self.graph)
    end = matcher.match("car_industry", name=name).first()
    if end is None:
        end = Node('car_industry', name=name)
    return end
Then the entity relationships are created in create_sub_graph():
def create_sub_graph(self, entity1, relation, entity2):
    r = Relationship(entity1, relation, entity2, name=relation)
    self.graph.create(r)
The complete kg.py file is as follows:
# coding:utf-8
'''
Created on 2018-01-26
@author: qiujiahao
@email: 997018209@qq.com
'''
import sys
import re
import os
import json

sys.path.append('..')
from conf import get_args
from py2neo import Node, Relationship, Graph, NodeMatcher

data_dir = "C:\\Users\\songzeceng\\Desktop\\wiki\\"

class data(object):
    def __init__(self):
        self.args = get_args()
        self.data_process()

    def data_process(self):
        # Initialize the connection and insert the data
        self.data_init()
        print("Data preprocessing finished")

    def data_init(self):
        # Connect to the graph database
        print('Starting data preprocessing')
        self.graph = Graph('http://localhost:7474', user="neo4j", password="szc")
        # self.graph.delete_all()
        file_names = os.listdir(data_dir)
        for file_name in file_names:
            self.insert_data_from_txt(data_dir + file_name)

    def insert_data_from_txt(self, file_path):
        try:
            with open(file=file_path, mode="r", encoding="utf-8") as f:
                for line in f.readlines():
                    item = json.loads(line)
                    if 'title' not in item.keys():
                        continue
                    title = self.look_and_create(item['title'])
                    # id = self.look_and_create(item['id'])
                    # self.create_sub_graph(entity1=title, entity2=id, relation="title")
                    if 'wikiData' not in item.keys():
                        continue
                    wikiData = item['wikiData']
                    if 'wikiDesc' in wikiData.keys():
                        wikiDesc = self.look_and_create(wikiData['wikiDesc'])
                        self.create_sub_graph(entity1=title, entity2=wikiDesc, relation="desc")
                    if 'wikiTags' in wikiData.keys():
                        for tag in wikiData['wikiTags']:
                            tag = self.look_and_create(tag)
                            self.create_sub_graph(entity1=title, entity2=tag, relation="tag")
                    wikiInfo = wikiData['wikiInfo']
                    if 'country' in wikiInfo.keys():
                        country = self.look_and_create(wikiInfo['country'])
                        self.create_sub_graph(entity1=title, entity2=country, relation="country")
                    if 'language' in wikiInfo.keys():
                        language = self.look_and_create(wikiInfo['language'])
                        self.create_sub_graph(entity1=title, entity2=language, relation="language")
                    if 'actors' in wikiInfo.keys():
                        for actor in wikiInfo['actors']:
                            actor = self.look_and_create(actor)
                            self.create_sub_graph(entity1=title, entity2=actor, relation="actor")
                    if 'directors' in wikiInfo.keys():
                        for director in wikiInfo['directors']:
                            director = self.look_and_create(director)
                            self.create_sub_graph(entity1=title, entity2=director, relation="director")
            print(file_path, "done")
        except Exception as e:
            print("Error reading file " + file_path + ": " + str(e))

    def create_sub_graph(self, entity1, relation, entity2):
        r = Relationship(entity1, relation, entity2, name=relation)
        self.graph.create(r)

    def look_and_create(self, name):
        matcher = NodeMatcher(self.graph)
        end = matcher.match("car_industry", name=name).first()
        if end is None:
            end = Node('car_industry', name=name)
        return end

if __name__ == '__main__':
    data = data()
Running it, the command-line output is shown below.
The data isn't very clean and many files fail to parse, but never mind — it's just a demo. Fetching 25 records from the neo4j database gives the result shown below.
Running the Service
Here we just change the IP and port in run_server.py to our own.
Handling Requests
This step corresponds to views.py.
First we need to intercept GET requests to the /KnowGraph/v2 path, so we add a routed function:
@app.route('/KnowGraph/v2', methods=["GET"])
def getInfoFromServer():
    pass
Then we implement this function, starting with the request parameters. The full request URL looks like this:
http://localhost:8090/KnowGraph/v2?method=entry&jsonrpc=2.0&id=1&params=entry=上将许世友-deep=2
There are quite a few parameters, many of them fixed (jsonrpc, id, and so on), so I simplified it to:
http://localhost:8090/KnowGraph/v2?name=上将许世友
Then getInfoFromServer() just fills in all the default parameters; the code is as follows:
def handle_args(originArgs):
    if 'name' not in originArgs.keys():
        return None
    args = {}
    for key in originArgs:
        value = originArgs[key]
        if key == "params":
            # The params value packs key=value pairs separated by '-'
            kvs = str(value).split("-")
            kv_dic = {}
            for pair in kvs:
                kv = pair.split("=")
                k, v = kv[0], kv[1]
                kv_dic[k] = int(v) if v.isnumeric() else v
            args[key] = kv_dic
        else:
            args[key] = int(value) if value.isnumeric() else value
    if 'params' not in args.keys():
        args['params'] = {
            'name': args['name']
        }
        args.pop('name')
    # Escape single quotes so the name can be embedded in Cypher
    args['params']['name'] = args['params']['name'].replace('\'', '\\\'')
    if 'method' not in args.keys():
        args['method'] = 'entry'
    if 'deep' not in args['params'].keys():
        args['params']['deep'] = 2
    if 'jsonrpc' not in args.keys():
        args['jsonrpc'] = 2.0
    if 'id' not in args.keys():
        args['id'] = 1
    return args
It's essentially just iteration and default-filling.
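Its effect is easiest to see on a concrete input; here is a condensed re-implementation of the same default-filling behaviour (a sketch, not the project's function):

```python
def fill_defaults(origin_args):
    """Condensed handle_args: keep 'name', fill in the fixed defaults."""
    if 'name' not in origin_args:
        return None
    args = {
        'params': {'name': origin_args['name'].replace("'", "\\'"), 'deep': 2},
        'method': 'entry',
        'jsonrpc': 2.0,
        'id': 1,
    }
    # Explicit query parameters still win over the defaults
    for key in ('method', 'jsonrpc', 'id'):
        if key in origin_args:
            args[key] = origin_args[key]
    return args

print(fill_defaults({'name': '雪豹'}))
```

So the single query parameter ?name=雪豹 expands into the full parameter dict the original POST interface expected.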
With the parameters handled, we can run the query selected by the method field, take the result from server_param's result field, and hand it to the front end to render the page. So getInfoFromServer() can be written as follows:
@app.route('/KnowGraph/v2', methods=["GET"])
def getInfoFromServer():
    args = handle_args(request.args.to_dict())
    kg = KnowGraph(args)
    client_params = args
    server_param = {}
    if client_params['method'] == 'entry':
        kg.lookup_entry(client_params, server_param)
    server_param['id'] = client_params['id']
    server_param['jsonrpc'] = client_params['jsonrpc']
    server_param['method'] = client_params['method']
    print("server_param:\n", server_param)
    global mydata
    if 'result' in server_param.keys():
        mydata = server_param['result']
    else:
        mydata = '{}'
    print("mydata:\n", mydata)
    return render_template("index.html")
Here we only handle entity lookup, because our input is just the name of a show the user has watched.
When the page renders, it fetches its data from the /KnowGraph/data path, so we need to intercept that too:
@app.route("/KnowGraph/data")
def data():
    print("data:", mydata)
    return mydata
The complete views.py file is as follows:
# coding:utf-8
'''
Created on 2018-01-09
@author: qiujiahao
@email: 997018209@qq.com
'''
from flask import jsonify
from conf import *
from flask import Flask
from flask import request, render_template
from server.app import app
from server.module import KnowGraph
import json

mydata = ""

# http://210.41.97.89:8090/KnowGraph/v2?name=胜利之路
# http://113.54.234.209:8090/KnowGraph/v2?name=孤战
# http://localhost:8090/KnowGraph/v2?method=entry_to_property&jsonrpc=2.0&id=1&params=entry=水冶-property=位置

@app.route('/KnowGraph/v2', methods=["GET"])
def getInfoFromServer():
    args = handle_args(request.args.to_dict())
    kg = KnowGraph(args)
    client_params = args
    server_param = {}
    if client_params['method'] == 'entry':
        kg.lookup_entry(client_params, server_param)
    server_param['id'] = client_params['id']
    server_param['jsonrpc'] = client_params['jsonrpc']
    server_param['method'] = client_params['method']
    print("server_param:\n", server_param)
    global mydata
    if 'result' in server_param.keys():
        mydata = server_param['result']
    else:
        mydata = '{}'
    print("mydata:\n", mydata)
    return render_template("index.html")

def handle_args(originArgs):
    if 'name' not in originArgs.keys():
        return None
    args = {}
    for key in originArgs:
        value = originArgs[key]
        if key == "params":
            # The params value packs key=value pairs separated by '-'
            kvs = str(value).split("-")
            kv_dic = {}
            for pair in kvs:
                kv = pair.split("=")
                k, v = kv[0], kv[1]
                kv_dic[k] = int(v) if v.isnumeric() else v
            args[key] = kv_dic
        else:
            args[key] = int(value) if value.isnumeric() else value
    if 'params' not in args.keys():
        args['params'] = {
            'name': args['name']
        }
        args.pop('name')
    # Escape single quotes so the name can be embedded in Cypher
    args['params']['name'] = args['params']['name'].replace('\'', '\\\'')
    if 'method' not in args.keys():
        args['method'] = 'entry'
    if 'deep' not in args['params'].keys():
        args['params']['deep'] = 2
    if 'jsonrpc' not in args.keys():
        args['jsonrpc'] = 2.0
    if 'id' not in args.keys():
        args['id'] = 1
    return args

@app.route("/KnowGraph/data")
def data():
    print("data:", mydata)
    return mydata
Database Queries
Finally, we turn to the database queries and result handling in module.py.
For easy inspection, the results are also written to a json file, so the query results are kept in a dict in memory. Before each query the dict is cleared; after the query runs, different handling logic executes depending on whether there were results. The lookup_entry function therefore becomes:
def lookup_entry(self, client_params, server_param):
    # Supports setting the search depth
    start_time = time.time()
    params = client_params["params"]
    edges = set()
    sim_dict.clear()
    self.lookup_entry_deep(edges, params, 0)
    if len(edges) == 0:
        server_param['success'] = 'false'
    else:
        self.handleResult(edges, server_param, start_time)
The entity search itself lives in lookup_entry_deep(). Generally our depth is just two levels: at the first level we query the watched show's properties (say, the director of 上将许世友); at the second level we take each property and find the other entities it points to (say, what else that director has directed). The first level is thus a forward search, the second a backward one.
To avoid recommending the show the user just watched, we also deduplicate the results. For example, when searching from 上将XXX we find its director 安澜; if the backward search on 安澜 shows they directed only 上将XXX, then we must not add that show to the recommendation list.
For that "found nothing new" case I defined the return value 'nothing else'; if nothing at all was found, it's 'nothing got'; if the depth limit is exceeded, 'deep out'; and if all is well, 'ok'.
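An illustrative distillation of those four statuses (my own summary function, not code from the project):

```python
def classify(deep, max_deep, n_forward, n_backward, backward_is_only_origin):
    """Map a search situation to the status strings used by lookup_entry_deep."""
    if deep >= max_deep:
        return 'deep out'
    if n_forward == 0 and n_backward == 0:
        return 'nothing got'
    if n_backward == 1 and backward_is_only_origin:
        return 'nothing else'  # the only reverse hit is the show the user just watched
    return 'ok'

print(classify(0, 2, 3, 5, False))  # 'ok'
```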
First we run the two queries, forward and backward:
result1 = self.graph.run(cypher='''match (s)-[r]->(e) where s.name='{}'
    return s.name,r.name,e.name'''.format(params['name'])).data()
result2 = self.graph.run(cypher='''match (e)<-[r]-(s) where e.name='{}'
    return s.name,r.name,e.name'''.format(params['name'])).data()
Then both results are checked; if they are both empty, return 'nothing got':
if len(result1) == 0 and len(result2) == 0:
    return 'nothing got'
If result2 (the backward result) has exactly one row, whose s.name (the show title) is still the input entity and whose e.name (the property) is still the original property, return 'nothing else' directly:
if len(result2) == 1:
    item = result2[0]
    if origin_tv_name is not None and origin_property_name is not None:
        if origin_property_name == item['e.name'] and origin_tv_name == item['s.name']:
            return 'nothing else'
Here origin_tv_name and origin_property_name are parameters of lookup_entry_deep(), both defaulting to None.
Next we iterate over the forward results result1 and, for each property value (e.name) of the original show (s.name), recurse one level deeper with that property as the new search name:
for item in result1:
    tv_name = item['s.name']
    property_name = item['e.name']
    if tv_name != property_name:  # Avoid infinite loops like 双面胶 -[中文名]-> 双面胶
        if oldName != property_name:
            params['name'] = property_name
            self.lookup_entry_deep(edges, params.copy(), deep + 1,
                                   origin_tv_name=tv_name,
                                   origin_property_name=property_name)
oldName is the entity name of the current query; the extra check guards against loops, though in our scenario it will always pass.
Next we process the backward results. If a new show is found, we first derive a similarity score from its relation to the shared property, then add the new show, the shared property, and the score into the similarity dict (accumulating or creating an entry as needed) and into the edges set:
for item in result2:
    tv_name = item['s.name']
    property_name = item['e.name']
    relation_name = item['r.name']
    if tv_name != origin_tv_name:
        score = get_sim_score_accroding_to_relation(relation_name)
        if tv_name not in sim_dict.keys():
            sim_dict[tv_name] = {
                relation_name: [property_name],
                "similarity": score
            }
        else:
            item_dict = sim_dict[tv_name]
            if relation_name in item_dict.keys() and \
                    property_name in item_dict.values():
                continue
            if relation_name in item_dict.keys():
                item_dict[relation_name].append(property_name)
            else:
                item_dict[relation_name] = [property_name]
            item_dict["similarity"] += score
        edges.add((tv_name, relation_name, property_name))
The scoring function get_sim_score_accroding_to_relation() looks like this:
def get_sim_score_accroding_to_relation(relation_name):
    if relation_name in ['actor', 'director', 'tag']:
        return 1.0
    elif relation_name in ['language', 'country']:
        return 0.5
    return 0.0
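Under this weighting, a candidate's similarity is just the sum of the scores over its shared edges; for example (hypothetical shared properties for one candidate show):

```python
def get_sim_score_accroding_to_relation(relation_name):
    # Same weighting as above: people and tags count fully, language/country half
    if relation_name in ['actor', 'director', 'tag']:
        return 1.0
    elif relation_name in ['language', 'country']:
        return 0.5
    return 0.0

shared_edges = [('actor', '侯勇'), ('language', '普通话'),
                ('country', '中国大陆'), ('tag', '战争'), ('tag', '历史')]
similarity = sum(get_sim_score_accroding_to_relation(r) for r, _ in shared_edges)
print(similarity)  # 4.0 = 1.0 + 0.5 + 0.5 + 1.0 + 1.0
```

This is why shows sharing many tags and cast members dominate the final ranking, while a shared language or country contributes only a little.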
完整的lookup_entry_deep()函数如下所示
# 限制深度的查找
def lookup_entry_deep(self, edges, params, deep, origin_tv_name=None, origin_property_name=None):
# 当前查找深度不得等于要求的深度
if deep >= params['deep']:
return 'deep out'
# 正向查找
oldName = str(params['name'])
if oldName.__contains__("\'") and not oldName.__contains__("\\\'"):
params['name'] = oldName.replace("\'", "\\\'")
result1 = self.graph.run(cypher='''match (s)-[r]->(e) where s.name='{}'
return s.name,r.name,e.name'''.format(params['name'])).data()
result2 = self.graph.run(cypher='''match (e)<-[r]-(s) where e.name='{}'
return s.name,r.name,e.name '''.format(params['name'])).data()
if len(result1) == 0 and len(result2) == 0:
return 'nothing got'
if len(result2) == 1:
item = result2[0]
if origin_tv_name is not None and origin_property_name is not None:
if origin_property_name == item['e.name'] and origin_tv_name == item['s.name']:
return 'nothing else'
for item in result1:
tv_name = item['s.name']
property_name = item['e.name']
if tv_name != property_name: # 避免出现:双面胶:中文名:双面胶的死循环
if oldName != property_name:
params['name'] = property_name
has_result = self.lookup_entry_deep(edges, params.copy(), deep + 1,
origin_tv_name=tv_name,
origin_property_name=property_name)
for item in result2:
has_result = False
tv_name = item['s.name']
property_name = item['e.name']
relation_name = item['r.name']
if tv_name != origin_tv_name:
score = get_sim_score_accroding_to_relation(relation_name)
if tv_name not in sim_dict.keys():
sim_dict[tv_name] = {
relation_name: [property_name],
"similarity": score
}
else:
item_dict = sim_dict[tv_name]
if relation_name in item_dict.keys() and \
property_name in item_dict.values():
continue
if relation_name in item_dict.keys():
item_dict[relation_name].append(property_name)
else:
item_dict[relation_name] = [property_name]
item_dict["similarity"] += score
edges.add((tv_name, relation_name, property_name))
return 'ok'
When the query completes with results, we move to the handleResult() function to process them for output. It mainly sorts candidates by similarity in descending order, takes the top 20, and writes them to a json file:
def handleResult(self, edges, server_param, start_time):
    # ...
    # Sort candidates by similarity, highest first, and keep the top 20
    sorted_sim_list = sorted(sim_dict.items(), key=lambda x: x[1]['similarity'], reverse=True)
    ret = {}
    for i in range(len(sorted_sim_list)):
        if i >= 20:
            break
        ret[sorted_sim_list[i][0]] = sorted_sim_list[i][1]
    mydata = json.dumps(ret, ensure_ascii=False)
    print('Json path: %s' % fname)
    self.clear_and_write_file(fname, mydata)

def clear_and_write_file(self, fname, mydata):
    # Truncate the file, then write the new content
    with open(fname, 'w', encoding='utf-8') as f:
        f.write(str(""))
    with open(fname, 'a', encoding='utf-8') as f:
        f.write(str(mydata))
In addition, I also put the results into server_param for output to the front-end page:
ret = []
for result in edges:
    ret.append({
        "source": result[0],
        "target": result[2],
        "relation": result[1],
        "label": "relation"
    })
print("ret:", ret)
server_param['result'] = {"edges": ret}
server_param['success'] = 'true'
print('Triples found this time: {}, elapsed: {}s'.format(len(ret), time.time() - start_time))
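The sort-and-truncate step can be seen on a toy sim_dict (hypothetical shows; cutoff lowered to 2 for brevity):

```python
sim_dict = {
    '雪豹': {'similarity': 12.0},
    '亮剑': {'similarity': 14.0},
    '战将': {'similarity': 13.0},
}
# Sort candidates by similarity, highest first, then keep the top N
sorted_sim_list = sorted(sim_dict.items(), key=lambda x: x[1]['similarity'], reverse=True)
top = dict(sorted_sim_list[:2])
print(list(top))  # ['亮剑', '战将']
```

With the real data the cutoff is 20, which yields the json output shown in the Results section.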
The complete result-handling function is as follows:
def handleResult(self, edges, server_param, start_time):
    ret = []
    for result in edges:
        ret.append({
            "source": result[0],
            "target": result[2],
            "relation": result[1],
            "label": "relation"
        })
    print("ret:", ret)
    server_param['result'] = {"edges": ret}
    server_param['success'] = 'true'
    print('Triples found this time: {}, elapsed: {}s'.format(len(ret), time.time() - start_time))
    # Sort by similarity and keep the top 20
    sorted_sim_list = sorted(sim_dict.items(), key=lambda x: x[1]['similarity'], reverse=True)
    ret = {}
    for i in range(len(sorted_sim_list)):
        if i >= 20:
            break
        ret[sorted_sim_list[i][0]] = sorted_sim_list[i][1]
    mydata = json.dumps(ret, ensure_ascii=False)
    print('Json path: %s' % fname)
    self.clear_and_write_file(fname, mydata)
Results
First start the service by running run_server.py, then enter the following URL in the browser's address bar (XXX is the input name):
http://210.41.97.169:8090/KnowGraph/v2?name=XXX
The page output is shown below.
The graph is pretty dense, so let's look instead at the top 20 entries in the json file:
{
"XXX元帅": {
"actor": [
"侯勇",
"刘劲"
],
"similarity": 14.0,
"language": [
"普通话"
],
"country": [
"中国大陆"
],
"tag": [
"传记",
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"BBB": {
"actor": [
"刘劲",
"王伍福"
],
"similarity": 14.0,
"language": [
"普通话"
],
"country": [
"中国大陆"
],
"tag": [
"传记",
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"长征大会师": {
"actor": [
"刘劲",
"王伍福"
],
"similarity": 14.0,
"language": [
"普通话"
],
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"战将": {
"language": [
"普通话"
],
"similarity": 13.0,
"country": [
"中国大陆"
],
"tag": [
"传记",
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"炮神": {
"language": [
"普通话"
],
"similarity": 13.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"动作",
"革命",
"军旅",
"战争",
"历史",
"电视剧"
]
},
"独立纵队": {
"language": [
"普通话"
],
"similarity": 13.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"女子军魂": {
"language": [
"普通话"
],
"similarity": 13.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"革命",
"军旅",
"战争",
"历史",
"电视剧"
]
},
"热血军旗": {
"actor": [
"侯勇"
],
"similarity": 12.0,
"language": [
"普通话"
],
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"擒狼": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"信者无敌": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"我的抗战之猎豹突击": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"魔都风云": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"动作",
"革命",
"战争",
"电视剧"
]
},
"英雄戟之影子战士": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"第一声枪响": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"亮剑": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"飞虎队": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"伟大的转折": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"革命",
"战争",
"历史",
"电视剧"
]
},
"太行英雄传": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"45-55",
"36-45",
"24-36",
"激昂",
"热血",
"动作",
"革命",
"战争",
"历史",
"电视剧"
]
},
"雪豹": {
"language": [
"普通话"
],
"similarity": 12.0,
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"55-70",
"45-55",
"36-45",
"24-36",
"激昂",
"革命",
"军旅",
"战争",
"历史",
"电视剧"
]
},
"宜昌保卫战": {
"actor": [
"侯勇"
],
"similarity": 11.0,
"language": [
"普通话"
],
"country": [
"中国大陆"
],
"tag": [
"上星剧",
"45-55",
"36-45",
"24-36",
"激昂",
"革命",
"战争",
"历史",
"电视剧"
]
}
}
The top entries are all shows highly related to our input, with their similarity scores and shared properties listed alongside; the results look pretty decent.
Conclusion
This is only a demo, meant as a taste of applying knowledge graphs to recommendation systems.
Finally, thanks again to the original project's author; without the framework he built through hard work, I could hardly have taken this first practical step.
Once more, the original project's address: https://github.com/qiu997018209/KnowledgeGraph