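A Template-Matching Knowledge Base Question Answering System
Task Overview
This project ties together the whole Knowledge Engineering course by building a knowledge base question answering system, so that the theory and the hands-on skills learned in class are combined and applied to a real use case.
The data is collected from the official Tianjin University website and converted to RDF to build the knowledge base. Queries and inference are run as SPARQL statements on the Jena Fuseki engine. A demo of the question answering system is then implemented with template matching and regular expressions.
Data Acquisition and Processing
The required environment is Python 3 with the requests and BeautifulSoup libraries installed. The code below also uses the lxml parser.
First, find the request URL and request method for the page that holds the data. Both are visible in the browser's developer tools, under Headers in the Network tab, as shown in the figure
Then write the scraping code in Jupyter.
Step 1: Fetch the page to be scraped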
import requests
res = requests.get('http://cic.tju.edu.cn/jyjx/yjsjy/yjsdsml.htm')
res.encoding = 'utf-8'
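Step 2: Locate the content to scrape in the Elements panel
Scraping comes down to finding the right block in the page and extracting the information inside it. The data I need is the information on all master's supervisors of the college. Each teacher's name in the <table> is a hyperlink <a> to that teacher's personal page, which is where the data lives. So I first scrape the URLs of the personal pages and save them to a txt file, then read the file line by line, visit each URL, and scrape each teacher's information
# Scrape the URLs of the teachers' personal pages and save them to a txt file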
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'lxml')
tables = soup.find_all('table')
table = tables[1]
teachers = table.find_all('a')
for teacher in teachers:
    link = teacher.get('href')
    # Raw string so the backslashes in the Windows path are not read as escapes
    fo = open(r'E:\study\zhishi\实践\http.txt', 'ab+')
    fo.write((link + '\r\n').encode('utf-8'))
    fo.close()
Part of the result is shown in the figure
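The href values collected above may be relative paths rather than full URLs, in which case requests.get cannot fetch them directly. The following is a minimal sketch, assuming relative links, that resolves them against the list page with the standard library's urljoin. The helper name to_absolute and the sample href values are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the supervisor list page fetched in Step 1
BASE_URL = 'http://cic.tju.edu.cn/jyjx/yjsjy/yjsdsml.htm'

def to_absolute(base_url, hrefs):
    """Resolve each href against base_url, skipping empty values."""
    return [urljoin(base_url, href) for href in hrefs if href]

# Hypothetical href values, not taken from the real page
sample_hrefs = ['../../info/1234/5678.htm', 'http://example.com/teacher.htm', None]
absolute_links = to_absolute(BASE_URL, sample_hrefs)
print(absolute_links)
```

Links that are already absolute pass through unchanged, so the same call is safe for a mixed list.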
# Scrape data from each teacher's personal page and save it line by line to a txt file
import requests
from bs4 import BeautifulSoup
# Raw strings keep the '\t' in '\teacher.txt' from being read as a tab character
file1 = open(r'E:\study\zhishi\实践\http.txt')
file2 = open(r'E:\study\zhishi\实践\teacher.txt', 'ab+')
num = 0  # number of pages processed
for line in file1:
    line = line.strip('\r\n')
    res = requests.get(line)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    teachers = soup.find_all('div', class_ = 'v_news_content')
    for teacher in teachers:
        informations = teacher.find_all('p')
        for information in informations:
            out = information.text
            file2.write((out + '\r\n').encode('utf-8'))
    num = num + 1
file1.close()
file2.close()
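Step 3: Clean the scraped data. Not every teacher's page uses the same format, and the later RDF conversion needs each piece of content assigned to a property according to its label. So the raw data is read line by line and split on spaces, each chunk is treated as a value and given its corresponding key, and the result is stored as a dictionary. Finally the cleaned data is written to a txt file. Part of the result is shown in the figure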
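To make the cleaned layout concrete, here is a minimal sketch of turning one cleaned line into a dictionary. It assumes the one-record-per-line "key:value" layout that the conversion script later in this post reads, with ';' separating multiple values. The function name parse_record and the sample record are invented for illustration.

```python
def parse_record(line, field_sep=' ', kv_sep=':', value_sep=';'):
    """Turn 'key:v1;v2 key:v' into {'key': ['v1', 'v2'], ...}."""
    record = {}
    for chunk in line.strip().split(field_sep):
        if kv_sep not in chunk:
            continue  # skip chunks that carry no label
        # Split once so that values such as URLs keep their own colons
        key, value = chunk.split(kv_sep, 1)
        record[key] = [v for v in value.split(value_sep) if v]
    return record

# Invented sample line, not real scraped data
sample = '姓名:张三 职称:教授 主讲课程:数据库;知识工程 个人主页:http://example.com/zs'
record = parse_record(sample)
print(record)
```

Storing every value as a list, even a single one, means later code can loop over a field without checking how many values it holds.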
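Finally, convert the data to RDF so that it can be queried with SPARQL later. The data takes the form of entity-relation triples: the teacher name, professional title, courses taught and so on in the raw data are all entities, and they are linked to each other according to their relations.
Step 1: Define the triple formats, as shown in the code, where each "%s" is replaced with the raw data
# Teacher name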
teacher_name = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_name> \"%s\" ."
# Teacher-to-title triple and the title name
teacher_professional_title = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_professional_title> <http://kg.course/informations/%s> ."
professional_title_name = "<http://kg.course/informations/%s> <http://kg.course/informations/professional_title_name> \"%s\" ."
# Teacher-to-department triple and the department name
teacher_department = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_department> <http://kg.course/informations/%s> ."
department_name = "<http://kg.course/informations/%s> <http://kg.course/informations/department_name> \"%s\" ."
# Teacher-to-course triple and the course name
teacher_courses = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/%s> ."
courses_name = "<http://kg.course/informations/%s> <http://kg.course/informations/courses_name> \"%s\" ."
# Teacher-to-supervisor-type triple and the supervisor type name
teacher_type = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_type> <http://kg.course/informations/%s> ."
type_name = "<http://kg.course/informations/%s> <http://kg.course/informations/type_name> \"%s\" ."
# Teacher-to-email triple and the email address
teacher_email = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_email> <http://kg.course/informations/%s> ."
email_name = "<http://kg.course/informations/%s> <http://kg.course/informations/email_name> \"%s\" ."
# Teacher-to-research-field triple and the field name
teacher_field = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_field> <http://kg.course/informations/%s> ."
field_name = "<http://kg.course/informations/%s> <http://kg.course/informations/field_name> \"%s\" ."
# Teacher-to-research-direction triple and the direction name
teacher_direction = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_direction> <http://kg.course/informations/%s> ."
direction_name = "<http://kg.course/informations/%s> <http://kg.course/informations/direction_name> \"%s\" ."
# Teacher-to-homepage triple and the homepage address
teacher_webpage = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_webpage> <http://kg.course/informations/%s> ."
webpage_name = "<http://kg.course/informations/%s> <http://kg.course/informations/webpage_name> \"%s\" ."
# List that collects all generated triples
triples = []
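Step 2: Read the data line by line and store it as entity-relation triples using the templates defined above. Take the teacher-to-course triple as an example: a teacher may teach several courses, and each course needs its own triple linking it to that teacher
# '主讲课程' is the "courses taught" key and '姓名' is the "name" key
if (len(dict_teacher['主讲课程']) > 1):
    for num_1 in range(len(dict_teacher['主讲课程'])):
        teacher_courses_str = teacher_courses % (dict_teacher['姓名'][0], dict_teacher['主讲课程'][num_1])
        courses_name_str = courses_name % (dict_teacher['主讲课程'][num_1], dict_teacher['主讲课程'][num_1])
        triples.append(teacher_courses_str)
        triples.append(courses_name_str)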
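Iterating over a one-element list works the same way as iterating over a longer one, so the single-course and multi-course cases can share one loop. The sketch below shows that idea with the two course templates from Step 1. The helper name course_triples and the sample values are hypothetical, not part of the original script.

```python
teacher_courses = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/%s> ."
courses_name = "<http://kg.course/informations/%s> <http://kg.course/informations/courses_name> \"%s\" ."

def course_triples(teacher, courses):
    """Return two triples per course: teacher -> course, and course -> its name."""
    triples = []
    for course in courses:  # works for one course or many
        triples.append(teacher_courses % (teacher, course))
        triples.append(courses_name % (course, course))
    return triples

# Invented sample record
for triple in course_triples('张三', ['数据库', '知识工程']):
    print(triple)
```

The same pattern would apply to any other multi-valued field, such as the supervisor type.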
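Step 3: Later steps segment the question text and tag parts of speech. To keep teacher names from being segmented incorrectly, build a dictionary of teacher names with their part-of-speech tag in advance, as shown in the code below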
file_3.write(dict_teacher['姓名'][0] + ' ' + 'nr'+'\n')
The complete code is shown below
import io
import sys
import random
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='UTF-8')
file_1 = open('E:/study/zhishi/实践/teachers.txt', encoding='UTF-8')
file_2 = open('E:/study/zhishi/实践/teachers_information.nt', 'w', encoding='UTF-8')
file_3 = open('E:/study/zhishi/实践/teachers_name.txt', 'w', encoding='UTF-8')
teacher_name = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_name> \"%s\" ."
teacher_professional_title = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_professional_title> <http://kg.course/informations/%s> ."
professional_title_name = "<http://kg.course/informations/%s> <http://kg.course/informations/professional_title_name> \"%s\" ."
teacher_department = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_department> <http://kg.course/informations/%s> ."
department_name = "<http://kg.course/informations/%s> <http://kg.course/informations/department_name> \"%s\" ."
teacher_courses = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/%s> ."
courses_name = "<http://kg.course/informations/%s> <http://kg.course/informations/courses_name> \"%s\" ."
teacher_type = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_type> <http://kg.course/informations/%s> ."
type_name = "<http://kg.course/informations/%s> <http://kg.course/informations/type_name> \"%s\" ."
teacher_email = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_email> <http://kg.course/informations/%s> ."
email_name = "<http://kg.course/informations/%s> <http://kg.course/informations/email_name> \"%s\" ."
teacher_field = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_field> <http://kg.course/informations/%s> ."
field_name = "<http://kg.course/informations/%s> <http://kg.course/informations/field_name> \"%s\" ."
teacher_direction = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_direction> <http://kg.course/informations/%s> ."
direction_name = "<http://kg.course/informations/%s> <http://kg.course/informations/direction_name> \"%s\" ."
teacher_webpage = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_webpage> <http://kg.course/informations/%s> ."
webpage_name = "<http://kg.course/informations/%s> <http://kg.course/informations/webpage_name> \"%s\" ."
triples = []
for line in file_1:
    list_teacher = line.strip().split(' ')
    dict_teacher = {}
    for list_elm in list_teacher:
        # Split only on the first ':' so that values such as URLs keep their own colons
        list_to_dict = list_elm.split(':', 1)
        char = ';'
        if char in list_to_dict[1]:
            subvalue = list_to_dict[1].split(';')
            dict_teacher[list_to_dict[0]] = subvalue
        else:
            dict_teacher[list_to_dict[0]] = [list_to_dict[1]]
    print(dict_teacher)
    # Name dictionary entry: the teacher name tagged as a person name (nr)
    file_3.write(dict_teacher['姓名'][0] + ' ' + 'nr'+'\n')
    teacher_name_str = teacher_name % (dict_teacher['姓名'][0], dict_teacher['姓名'][0])
    triples.append(teacher_name_str)
    teacher_professional_title_str = teacher_professional_title % (dict_teacher['姓名'][0], dict_teacher['职称'][0])
    professional_title_name_str = professional_title_name % (dict_teacher['职称'][0], dict_teacher['职称'][0])
    triples.append(teacher_professional_title_str)
    triples.append(professional_title_name_str)
    teacher_department_str = teacher_department % (dict_teacher['姓名'][0], dict_teacher['所在系别'][0])
    department_name_str = department_name % (dict_teacher['所在系别'][0], dict_teacher['所在系别'][0])
    triples.append(teacher_department_str)
    triples.append(department_name_str)
    if (len(dict_teacher['主讲课程']) > 1):
        for num_1 in range(len(dict_teacher['主讲课程'])):
            teacher_courses_str = teacher_courses % (dict_teacher['姓名'][0], dict_teacher['主讲课程'][num_1])
            courses_name_str = courses_name % (dict_teacher['主讲课程'][num_1], dict_teacher['主讲课程'][num_1])
            triples.append(teacher_courses_str)
            triples.append(courses_name_str)
    if (len(dict_teacher['主讲课程']) == 1):
        teacher_courses_str = teacher_courses % (dict_teacher['姓名'][0], dict_teacher['主讲课程'][0])
        courses_name_str = courses_name % (dict_teacher['主讲课程'][0], dict_teacher['主讲课程'][0])
        triples.append(teacher_courses_str)
        triples.append(courses_name_str)
    if (len(dict_teacher['导师类型']) > 1):
        for num_2 in range(len(dict_teacher['导师类型'])):
            teacher_type_str = teacher_type % (dict_teacher['姓名'][0], dict_teacher['导师类型'][num_2])
            type_name_str = type_name % (dict_teacher['导师类型'][num_2], dict_teacher['导师类型'][num_2])
            triples.append(teacher_type_str)
            triples.append(type_name_str)
    if (len(dict_teacher['导师类型']) == 1):
        teacher_type_str = teacher_type % (dict_teacher['姓名'][0], dict_teacher['导师类型'][0])
        type_name_str = type_name % (dict_teacher['导师类型'][0], dict_teacher['导师类型'][0])
        triples.append(teacher_type_str)
        triples.append(type_name_str)
    teacher_email_str = teacher_email % (dict_teacher['姓名'][0], dict_teacher['电子邮件'][0])
    email_name_str = email_name % (dict_teacher['电子邮件'][0], dict_teacher['电子邮件'][0])
    triples.append(teacher_email_str)
    triples.append(email_name_str)
    teacher_field_str = teacher_field % (dict_teacher['姓名'][0], dict_teacher['研究领域'][0])
    field_name_str = field_name % (dict_teacher['研究领域'][0], dict_teacher['研究领域'][0])
    triples.append(teacher_field_str)
    triples.append(field_name_str)
    teacher_direction_str = teacher_direction % (dict_teacher['姓名'][0], dict_teacher['研究方向'][0])
    direction_name_str = direction_name % (dict_teacher['研究方向'][0], dict_teacher['研究方向'][0])
    triples.append(teacher_direction_str)
    triples.append(direction_name_str)
    teacher_webpage_str = teacher_webpage % (dict_teacher['姓名'][0], dict_teacher['个人主页'][0])
    webpage_name_str = webpage_name % (dict_teacher['个人主页'][0], dict_teacher['个人主页'][0])
    triples.append(teacher_webpage_str)
    triples.append(webpage_name_str)
file_2.write("\n".join(triples))
# Close the files so that everything written is flushed to disk
file_1.close()
file_2.close()
file_3.close()
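Part of the result is shown below
Importing the Knowledge Base into Apache Jena Fuseki
First download Jena Fuseki from the official website and unpack it to the chosen location. Then open cmd, change to the Jena Fuseki directory, start Jena Fuseki, and create the dataset. The commands are shown in the figure
Then enter localhost:3030 in a browser to open the Jena Fuseki interface. Under dataset, select the newly created dataset testds and upload the generated RDF data to it. A successful upload is shown in the figure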
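Because the triples are built by plain string substitution, a value that contains a space or an angle bracket would produce a line that is not valid N-Triples. A quick check before uploading can catch such lines early. The sketch below uses a simplified regular expression that I wrote for this purpose; it is a rough filter under that assumption, not a full N-Triples parser, and the sample lines are invented.

```python
import re

# Simplified shape: <subject> <predicate> then <object IRI> or "literal", then ' .'
TRIPLE_RE = re.compile(r'^<[^<>\s]+> <[^<>\s]+> (<[^<>\s]+>|"(?:[^"\\]|\\.)*") \.$')

def find_bad_lines(lines):
    """Return (line_number, text) pairs for non-empty lines that fail the check."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and not TRIPLE_RE.match(text):
            bad.append((number, text))
    return bad

# Invented sample: the second subject IRI contains a space, which IRIs must not
sample = [
    '<http://kg.course/informations/张三> <http://kg.course/informations/teacher_name> "张三" .',
    '<http://kg.course/informations/张 三> <http://kg.course/informations/teacher_name> "张 三" .',
]
print(find_bad_lines(sample))
```

To check the generated file, the same function can be called on the lines read from teachers_information.nt.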
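The next part is the design of the question answering system
Designing the Question Answering System
Step 1: Connect to the dataset through the SPARQLWrapper package. Note that the question answering code in this part uses Python 2 syntax, such as print statements and reload(sys)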
sparql_base = SPARQLWrapper("http://localhost:3030/testds")
Step 2: Design the SPARQL query templates
First, the SELECT template
SPARQL_TEM = u"{preamble}\n" + \
u"SELECT DISTINCT {select} WHERE {{\n" + \
u"{expression}\n" + \
u"}}\n"
Next, the COUNT template
SPARQL_TEM_count = u"{preamble}\n" + \
u"SELECT (COUNT({select}) AS {count}) WHERE {{\n" + \
u"{expression}\n" + \
u"}}\n"
Finally, the ASK template
SPARQL_ASK_TEM = u"{preamble}\n" + \
u"ASK WHERE{{\n" + \
u"{expression}\n" + \
u"}}\n"
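In these templates the doubled braces {{ and }} are how str.format writes a literal { or }, while the single-brace names are placeholders. The following sketch fills the SELECT template to show the resulting query text. The prefix is the one declared in the complete listing further down; the teacher name 张三 is invented for illustration.

```python
SPARQL_PREAMBLE = u"PREFIX school: <http://kg.course/informations/>"

SPARQL_TEM = u"{preamble}\n" + \
             u"SELECT DISTINCT {select} WHERE {{\n" + \
             u"{expression}\n" + \
             u"}}\n"

query = SPARQL_TEM.format(
    preamble=SPARQL_PREAMBLE,
    select=u"?x0",
    # 张三 is an invented name used only for this illustration
    expression=u"    school:张三 school:teacher_courses ?x0.",
)
print(query)
```

The printed text is a complete SPARQL query whose WHERE block is wrapped in single braces.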
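Step 3: Design the pattern matching
First, segment each question in the question list. To avoid segmenting teacher names incorrectly, load the external teacher-name dictionary built earlier, as shown in the code below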
jieba.load_userdict("E:/study/zhishi/teachers_name.txt")
default_questions = [
    u"硕士生导师类型有哪些老师?",
    u"硕士生导师类型有多少老师?",
    u"张小旺老师主讲了哪些课?",
    u"张小旺老师主讲了几门课?",
    u"张小旺老师是博士生导师吗?",
    u"李坤老师主讲了哪些课?",
    u"李坤老师主讲了几门课?",
    u"李坤老师是博士生导师吗?",
    u"毕重科老师主讲了哪些课?",
    u"毕重科老师主讲了几门课?",
    u"毕重科老师是博士生导师吗?",
]
questions = default_questions[0:]
seg_lists = []
for question in questions:
    words = pseg.cut(question)
    seg_list = [Word(word.encode("utf-8"), flag) for word, flag in words]
    seg_lists.append(seg_list)
Then define the keywords, so that the pattern matching can use them to identify the right type of question
tutor_type_master = (W("硕士生导师") | W("硕导")| W("硕士导师")| W("硕士生"))
tutor_type_PhD = (W("博士生导师") | W("博导")| W("博士导师")| W("博士生"))
teacher = (W(pos = "nr") | W(pos = "x"))
whose = (W("谁") | W("哪些"))
quantity = (W("多少") | W("几") | W("几门"))
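Each W above is a predicate over one segmented word: it holds a regular expression for the token and one for the part-of-speech tag, and both must match. The sketch below mimics that matching logic with the standard library only, so it runs without refo. The names Word (as a namedtuple) and w_matches are stand-ins I introduced, and the tag 'm' on the sample word is an assumption.

```python
import re
from collections import namedtuple

# Stand-in for the Word class: a token plus its part-of-speech tag
Word = namedtuple('Word', ['token', 'pos'])

def w_matches(word, token='.*', pos='.*'):
    """Mimic W.match: each pattern must match from the start through the end."""
    token_ok = re.compile(token + '$').match(word.token)
    pos_ok = re.compile(pos + '$').match(word.pos)
    return bool(token_ok and pos_ok)

print(w_matches(Word('几门', 'm'), token='几'))    # False: '$' rejects the extra character
print(w_matches(Word('几门', 'm'), token='几门'))  # True
print(w_matches(Word('张三', 'nr'), pos='nr'))     # True: any token tagged as a person name
```

This is why quantity lists both "几" and "几门": the trailing "$" makes each keyword match a whole token rather than a prefix.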
Next, write the matching rules. Take the first Rule as an example: its condition matches a tutor_type_master keyword, followed by any words, followed by a whose keyword, and when it matches, the query function who_is_master_tutor_question is used
class Rule(object):
    def __init__(self, condition=None, action=None):
        assert condition and action
        self.condition = condition
        self.action = action
    def apply(self, sentence):
        matches = []
        for m in finditer(self.condition, sentence):
            i, j = m.span()
            matches.extend(sentence[i:j])  # collect the matched words in order
        if __name__ == '__main__':
            print "----------applying %s----------" % self.action.__name__
        return self.action(matches)  # pass the matched words to the action function
rules = [
    # Which teachers are supervisors of a given type?
    Rule(condition = tutor_type_master + Star(Any(), greedy = False) + whose, action = who_is_master_tutor_question),
    # How many teachers are supervisors of a given type?
    Rule(condition = tutor_type_master + Star(Any(), greedy = False) + quantity, action = how_many_teachers_are_master_tutor_question),
    # Which courses does a given teacher teach?
    Rule(condition = teacher + Star(Any(), greedy = False) + whose, action = what_courses_teacher_question),
    # How many courses does a given teacher teach?
    Rule(condition = teacher + Star(Any(), greedy = False) + quantity, action = how_many_courses_teacher_question),
    # Is a given teacher a doctoral supervisor?
    Rule(condition = teacher + Star(Any(), greedy = False) + tutor_type_PhD, action = teacher_is_PhD_tutor_question)
]
Write the query functions
# Which teachers are supervisors of a given type?
def who_is_master_tutor_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.token == "硕士生" or w.token == "哪些":
            e = u"?x school:teacher_type school:{type}导师. ?x school:teacher_name ?x0.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql
# How many teachers are supervisors of a given type?
def how_many_teachers_are_master_tutor_question(x):
    select = u"?teachers"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.token.decode("utf-8") == "硕士生" or w.token.decode("utf-8") == "多少":
            e = u"?teachers school:teacher_type school:{type}导师.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql
# Which courses does a given teacher teach?
def what_courses_teacher_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?x0.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql
# How many courses does a given teacher teach?
def how_many_courses_teacher_question(x):
    select = u"?courses"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?courses.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql
# Is a given teacher a doctoral supervisor?
def teacher_is_PhD_tutor_question(x):
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_type school:博士生导师.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_ASK_TEM.format(preamble = SPARQL_PREAMBLE, expression = INDENT + e)
            break
    return sparql
Finally, take each question, match it against the rules, and output the answer. Note that an ASK answer is stored differently from SELECT and COUNT answers: it comes back under a "boolean" key instead of under "results", so the True and False cases have to be handled explicitly
for seg in seg_lists:
    # take one question
    question = []
    for s in seg:
        # print the segmented question
        print s.token
        question.append(s.token)
    file_3.write(u','.join(question))
    print
    for rule in rules:
        # try one rule
        query = rule.apply(seg)
        if query is None:
            continue
        print query
        file_3.write(query + '\n')
        if query:
            sparql_base.setQuery(query)
            sparql_base.setReturnFormat(JSON)
            results = sparql_base.query().convert()
            if "results" in results.keys():
                if not results["results"]["bindings"]:
                    print "No answer found :("
                    print
                    continue
                for result in results["results"]["bindings"]:
                    print "Result: ", result["x0"]["value"]
                    file_3.write("Result: " + result["x0"]["value"] + '\n')
                print
            else:
                print "Result: ", results["boolean"]
                boo = str(results["boolean"])
                if boo == "True":
                    file_3.write(u"Result: " + "True" + '\n')
                else:
                    file_3.write(u"Result: " + "False" + '\n')
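The two branches above follow from the SPARQL JSON results layout: SELECT and COUNT responses carry rows under results.bindings, while an ASK response carries a single boolean. As a minimal sketch written in Python 3, the helper below folds both cases into one function. The name extract_answers and the hand-written responses are illustrative assumptions, not output from the real dataset.

```python
def extract_answers(results, var='x0'):
    """Return a list of values for SELECT/COUNT results, or [bool] for ASK."""
    if 'boolean' in results:  # ASK response
        return [results['boolean']]
    bindings = results.get('results', {}).get('bindings', [])
    return [row[var]['value'] for row in bindings if var in row]

# Hand-written responses in the SPARQL JSON results layout (values are invented)
select_response = {
    'head': {'vars': ['x0']},
    'results': {'bindings': [
        {'x0': {'type': 'uri', 'value': 'http://kg.course/informations/知识工程'}},
    ]},
}
ask_response = {'head': {}, 'boolean': True}

print(extract_answers(select_response))
print(extract_answers(ask_response))
```

An empty list then means "no answer found", which keeps that case distinct from an ASK result of False.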
The complete code is shown below
# coding: utf-8
# standard import
import re
# third-party import
from refo import finditer, Predicate, Star, Any
import jieba.posseg as pseg
from jieba import suggest_freq
import jieba
from SPARQLWrapper import SPARQLWrapper, JSON
import io
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
# Load the external teacher-name dictionary
# jieba.load_userdict("D:/app/teachers_name.txt")
jieba.load_userdict("E:/study/zhishi/teachers_name.txt")
sparql_base = SPARQLWrapper("http://localhost:3030/testds")
# SPARQL config
# SPARQL templates
SPARQL_PREAMBLE = u"""
PREFIX school: <http://kg.course/informations/>
"""
SPARQL_TEM = u"{preamble}\n" + \
u"SELECT DISTINCT {select} WHERE {{\n" + \
u"{expression}\n" + \
u"}}\n"
SPARQL_TEM_count = u"{preamble}\n" + \
u"SELECT (COUNT({select}) AS {count}) WHERE {{\n" + \
u"{expression}\n" + \
u"}}\n"
SPARQL_ASK_TEM = u"{preamble}\n" + \
u"ASK WHERE{{\n" + \
u"{expression}\n" + \
u"}}\n"
INDENT = " "
class Word(object):
    """treated words as objects"""
    def __init__(self, token, pos):
        self.token = token
        self.pos = pos
class W(Predicate):
    """object-oriented regex for words"""
    def __init__(self, token=".*", pos=".*"):
        self.token = re.compile(token + "$")
        self.pos = re.compile(pos + "$")
        super(W, self).__init__(self.match)
    def match(self, word):
        m1 = self.token.match(word.token)
        m2 = self.pos.match(word.pos)
        return m1 and m2
class Rule(object):
    def __init__(self, condition=None, action=None):
        assert condition and action
        self.condition = condition
        self.action = action
    def apply(self, sentence):
        matches = []
        for m in finditer(self.condition, sentence):
            i, j = m.span()
            matches.extend(sentence[i:j])  # collect the matched words in order
        if __name__ == '__main__':
            print "----------applying %s----------" % self.action.__name__
        return self.action(matches)  # pass the matched words to the action function
# Which teachers are supervisors of a given type?
def who_is_master_tutor_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.token == "硕士生" or w.token == "哪些":
            e = u"?x school:teacher_type school:{type}导师. ?x school:teacher_name ?x0.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql
# How many teachers are supervisors of a given type?
def how_many_teachers_are_master_tutor_question(x):
    select = u"?teachers"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.token.decode("utf-8") == "硕士生" or w.token.decode("utf-8") == "多少":
            e = u"?teachers school:teacher_type school:{type}导师.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql
# Which courses does a given teacher teach?
def what_courses_teacher_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?x0.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql
# How many courses does a given teacher teach?
def how_many_courses_teacher_question(x):
    select = u"?courses"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?courses.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql
# Is a given teacher a doctoral supervisor?
def teacher_is_PhD_tutor_question(x):
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_type school:博士生导师.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_ASK_TEM.format(preamble = SPARQL_PREAMBLE, expression = INDENT + e)
            break
    return sparql
def encode(s):
    return ' '.join([bin(ord(c)).replace('0b', '') for c in s])
if __name__ == "__main__":
    default_questions = [
        u"硕士生导师类型有哪些老师?",
        u"硕士生导师类型有多少老师?",
        u"张小旺老师主讲了哪些课?",
        u"张小旺老师主讲了几门课?",
        u"张小旺老师是博士生导师吗?",
        u"李坤老师主讲了哪些课?",
        u"李坤老师主讲了几门课?",
        u"李坤老师是博士生导师吗?",
        u"毕重科老师主讲了哪些课?",
        u"毕重科老师主讲了几门课?",
        u"毕重科老师是博士生导师吗?",
    ]
    questions = default_questions[0:]
    seg_lists = []
    # tokenizing questions
    for question in questions:
        words = pseg.cut(question)
        seg_list = [Word(word.encode("utf-8"), flag) for word, flag in words]
        seg_lists.append(seg_list)
    # some rules for matching
    # TODO: customize your own rules here
    # keywords used by the matching rules
    tutor_type_master = (W("硕士生导师") | W("硕导")| W("硕士导师")| W("硕士生"))
    tutor_type_PhD = (W("博士生导师") | W("博导")| W("博士导师")| W("博士生"))
    teacher = (W(pos = "nr") | W(pos = "x"))
    whose = (W("谁") | W("哪些"))
    quantity = (W("多少") | W("几") | W("几门"))
    # matching rules
    rules = [
        # Which teachers are supervisors of a given type?
        Rule(condition = tutor_type_master + Star(Any(), greedy = False) + whose, action = who_is_master_tutor_question),
        # How many teachers are supervisors of a given type?
        Rule(condition = tutor_type_master + Star(Any(), greedy = False) + quantity, action = how_many_teachers_are_master_tutor_question),
        # Which courses does a given teacher teach?
        Rule(condition = teacher + Star(Any(), greedy = False) + whose, action = what_courses_teacher_question),
        # How many courses does a given teacher teach?
        Rule(condition = teacher + Star(Any(), greedy = False) + quantity, action = how_many_courses_teacher_question),
        # Is a given teacher a doctoral supervisor?
        Rule(condition = teacher + Star(Any(), greedy = False) + tutor_type_PhD, action = teacher_is_PhD_tutor_question)
    ]
    file_3 = io.open('result.txt', 'w', encoding='UTF-8')
    # matching and querying
    for seg in seg_lists:  # take one question
        # display question each
        question = []
        for s in seg:
            print s.token,  # print the segmented question
            question.append(s.token)
        file_3.write(u','.join(question))
        print
        for rule in rules:  # try one rule
            query = rule.apply(seg)
            if query is None:
                continue
            print query
            file_3.write(query + '\n')
            if query:
                sparql_base.setQuery(query)
                sparql_base.setReturnFormat(JSON)
                results = sparql_base.query().convert()
                if "results" in results.keys():
                    if not results["results"]["bindings"]:
                        print "No answer found :("
                        print
                        continue
                    for result in results["results"]["bindings"]:
                        print "Result: ", result["x0"]["value"]
                        file_3.write("Result: " + result["x0"]["value"] + '\n')
                    print
                else:
                    print "Result: ", results["boolean"]
                    boo = str(results["boolean"])
                    if boo == "True":
                        file_3.write(u"Result: " + "True" + '\n')
                    else:
                        file_3.write(u"Result: " + "False" + '\n')
    # Close the result file so that everything written is flushed to disk
    file_3.close()
Part of the final output is shown in the figure
Reference: Tianjin University, Knowledge Engineering course