Knowledge Engineering Course Project: Knowledge Base Question Answering

Latest recommended article published on 2024-03-05 10:12:44

荷月十六


Article tags: python, knowledge graph, natural language processing, web crawler

Article link: https://blog.csdn.net/u010744489/article/details/105923730


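Template-Matching-Based Knowledge Base Question Answering System

Task Overview

The project draws on all the content of the Knowledge Engineering course: it implements a knowledge base question answering system, combining the theory learned with hands-on practice and carrying both into a real application.

The data is obtained from the official website of Tianjin University and converted to RDF to build the knowledge base. Queries and inference are run with SPARQL through the Jena Fuseki engine. A demo of the knowledge base question answering system is implemented with template matching and regular expressions.

Data Acquisition and Processing

The required environment is Python 3 with the requests and BeautifulSoup libraries. The code below also uses the lxml parser, so lxml needs to be installed as well.

First, obtain the request URL and request method for the page that holds the data. Both can be seen in the browser's developer tools, under Headers in the Network panel, as shown in the figure.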

[Figure: request URL and request method under Headers in the browser's Network panel]

Then write the scraping code in Jupyter.
Step 1: fetch the page to be scraped

import requests
res = requests.get('http://cic.tju.edu.cn/jyjx/yjsjy/yjsdsml.htm')
res.encoding = 'utf-8'

Step 2: locate the content to be scraped in the Elements panel

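[Figure: location of the target content in the Elements panel]
Scraping is essentially a matter of locating the relevant block in the page and extracting the information inside it. The data I want is the information of all master's supervisors in the college. Each teacher's name in the <table> is a hyperlink <a> leading to that teacher's personal page, where the data is scraped. Here I first collect the URLs of the personal pages and save them to a txt file, then read the file line by line, visit each URL, and scrape each teacher's information.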

# Scrape the URLs of the teachers' personal pages and save them to a txt file
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'lxml')
tables = soup.find_all('table')
table = tables[1]
teachers = table.find_all('a')
# Open the file once; forward slashes avoid backslash-escape problems in Windows paths
with open('E:/study/zhishi/实践/http.txt', 'ab+') as fo:
	for teacher in teachers:
		link = teacher.get('href')
		fo.write((link + '\r\n').encode('utf-8'))
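
The post does not say whether the href values in the table are absolute or relative. If some of them are relative (an assumption), they would need to be resolved against the listing page URL before being requested in the next step. A minimal Python 3 sketch using only the standard library; `resolve_links` and the sample hrefs are hypothetical and not part of the original code:

```python
from urllib.parse import urljoin

LIST_URL = 'http://cic.tju.edu.cn/jyjx/yjsjy/yjsdsml.htm'

def resolve_links(hrefs, base=LIST_URL):
    """Turn possibly relative hrefs into absolute URLs, skipping empty ones."""
    return [urljoin(base, h) for h in hrefs if h]

# Made-up hrefs for illustration only
sample = ['../../info/1071/1234.htm', 'http://example.com/t.htm', None]
links = resolve_links(sample)
print(links)
```

Absolute URLs pass through `urljoin` unchanged, so the helper is safe to apply to every link.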

Part of the result is shown in the figure.

[Figure: part of the saved personal-page URLs]

# Scrape data from each teacher's personal page and save it line by line to a txt file
import requests
from bs4 import BeautifulSoup
# Forward slashes avoid accidental escapes such as '\t' in Windows paths
file1 = open('E:/study/zhishi/实践/http.txt', encoding='utf-8')
file2 = open('E:/study/zhishi/实践/teacher.txt', 'ab+')
num = 0  # number of paragraphs written
for line in file1:
	line = line.strip('\r\n')
	if not line:
		continue  # skip blank lines
	res = requests.get(line)
	res.encoding = 'utf-8'
	soup = BeautifulSoup(res.text, 'lxml')
	teachers = soup.find_all('div', class_ = 'v_news_content')
	for teacher in teachers:
		informations = teacher.find_all('p')
		for information in informations:
			out = information.text
			file2.write((out + '\r\n').encode('utf-8'))
			num = num + 1
file1.close()
file2.close()

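Step 3: clean the scraped data. Not every teacher's page contains content in the same format, and when the data is later converted to RDF, each piece of content has to be assigned to a property according to its label. So we read the raw data line by line, split it on spaces, treat each piece as a value, assign it the corresponding key, and store the result as a dictionary. Finally the result is written to a txt file. Part of the result is shown in the figure.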

[Figure: part of the cleaned data]
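
The cleaning code itself is not shown in the post. The sketch below illustrates how one cleaned line could be turned into a dictionary, following the same conventions the conversion script uses later (fields separated by spaces, a full-width colon between key and value, a full-width semicolon between multiple values). `parse_teacher_line` and the sample record are hypothetical:

```python
def parse_teacher_line(line):
    """Split 'key：value' fields separated by spaces; '；' separates multiple values."""
    record = {}
    for field in line.strip().split(' '):
        if '：' not in field:
            continue  # skip fragments without a key (assumed to be noise)
        key, value = field.split('：', 1)
        record[key] = value.split('；')
    return record

# Made-up record for illustration only
sample = '姓名：张三 职称：教授 主讲课程：数据库；编译原理'
record = parse_teacher_line(sample)
print(record)
```

Storing every value as a list, even when there is only one, keeps the later triple-generation code uniform.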
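Finally, the data is converted to RDF for the SPARQL queries later on. The data takes the form of entity-relation triples: the teacher name, professional title, courses taught, and so on in the raw data are all entities, and these entities are linked according to their relationships.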

Step 1: define the triple formats as shown in the code, where each "%s" will be replaced by the raw data

# Teacher name
teacher_name = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_name> \"%s\" ."

# Teacher-to-title triple and title name
teacher_professional_title = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_professional_title> <http://kg.course/informations/%s> ."
professional_title_name = "<http://kg.course/informations/%s> <http://kg.course/informations/professional_title_name> \"%s\" ."

# Teacher-to-department triple and department name
teacher_department = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_department> <http://kg.course/informations/%s> ."
department_name = "<http://kg.course/informations/%s> <http://kg.course/informations/department_name> \"%s\" ."

# Teacher-to-course triple and course name
teacher_courses = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/%s> ."
courses_name = "<http://kg.course/informations/%s> <http://kg.course/informations/courses_name> \"%s\" ."

# Teacher-to-supervisor-type triple and supervisor type name
teacher_type = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_type> <http://kg.course/informations/%s> ."
type_name = "<http://kg.course/informations/%s> <http://kg.course/informations/type_name> \"%s\" ."

# Teacher-to-email triple and email address
teacher_email = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_email> <http://kg.course/informations/%s> ."
email_name = "<http://kg.course/informations/%s> <http://kg.course/informations/email_name> \"%s\" ."

# Teacher-to-research-field triple and field name
teacher_field = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_field> <http://kg.course/informations/%s> ."
field_name = "<http://kg.course/informations/%s> <http://kg.course/informations/field_name> \"%s\" ."

# Teacher-to-research-direction triple and direction name
teacher_direction = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_direction> <http://kg.course/informations/%s> ."
direction_name = "<http://kg.course/informations/%s> <http://kg.course/informations/direction_name> \"%s\" ."

# Teacher-to-webpage triple and webpage address
teacher_webpage = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_webpage> <http://kg.course/informations/%s> ."
webpage_name = "<http://kg.course/informations/%s> <http://kg.course/informations/webpage_name> \"%s\" ."

# List that collects all generated triples
triples = []
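
To make the substitution concrete, here is a small Python 3 sketch that fills the `teacher_name` template. The `escape_literal` helper is an addition that is not in the original code: it escapes backslashes and double quotes so that a value containing them would not break the quoted literal.

```python
teacher_name = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_name> \"%s\" ."

def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace('\\', '\\\\').replace('"', '\\"')

def name_triple(name):
    # The first %s becomes part of the subject IRI, the second %s the literal value
    return teacher_name % (name, escape_literal(name))

triple = name_triple('张小旺')
print(triple)
```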

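Step 2: read the data line by line and store it as entity-relation triples according to the defined formats. Take the teacher-to-course triple as an example: a teacher may teach several courses, and each course needs its own triple linking it to the teacher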

if (len(dict_teacher['主讲课程']) > 1):
	for num_1 in range(len(dict_teacher['主讲课程'])):
		teacher_courses_str = teacher_courses % (dict_teacher['姓名'][0], dict_teacher['主讲课程'][num_1])
		courses_name_str = courses_name % (dict_teacher['主讲课程'][num_1], dict_teacher['主讲课程'][num_1])
		triples.append(teacher_courses_str)
		triples.append(courses_name_str)
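
The same pattern applies to every multi-valued attribute. As a sketch (Python 3, not the author's code), iterating over the value list directly handles one value and several values alike, and `dict.get` avoids a KeyError when a teacher's page lacks the attribute. `multi_valued_triples` and the sample record are hypothetical:

```python
teacher_courses = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/%s> ."
courses_name = "<http://kg.course/informations/%s> <http://kg.course/informations/courses_name> \"%s\" ."

def multi_valued_triples(record, key, link_template, name_template):
    """Emit a link triple and a name triple for every value stored under key."""
    teacher = record['姓名'][0]
    result = []
    for value in record.get(key, []):  # missing key -> no triples
        result.append(link_template % (teacher, value))
        result.append(name_template % (value, value))
    return result

# Made-up record for illustration only
record = {'姓名': ['张三'], '主讲课程': ['数据库', '编译原理']}
course_triples = multi_valued_triples(record, '主讲课程', teacher_courses, courses_name)
print('\n'.join(course_triples))
```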

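Step 3: later on, the question text needs to be segmented and POS-tagged. To keep teacher names from being segmented incorrectly, build a dictionary of teacher names with their POS tag in advance, as in the code below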

file_3.write(dict_teacher['姓名'][0] + ' ' + 'nr'+'\n')
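
Each line of a jieba user dictionary holds a word, an optional frequency, and an optional POS tag, separated by spaces; the line above omits the frequency and tags every name as `nr` (person name). A small Python 3 sketch of building such lines, with duplicate names dropped; `userdict_lines` is a hypothetical helper, and the names are the ones used in the sample questions later in this post:

```python
def userdict_lines(names, tag='nr'):
    """Build 'word tag' lines, dropping duplicate names but keeping their order."""
    unique = list(dict.fromkeys(n.strip() for n in names if n.strip()))
    return [name + ' ' + tag + '\n' for name in unique]

lines = userdict_lines(['张小旺', '李坤', '张小旺', '毕重科'])
print(''.join(lines), end='')
```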

The complete code is as follows

import io
import sys
import random

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='UTF-8') 
file_1 = open('E:/study/zhishi/实践/teachers.txt', encoding='UTF-8')
file_2 = open('E:/study/zhishi/实践/teachers_information.nt', 'w', encoding='UTF-8')
file_3 = open('E:/study/zhishi/实践/teachers_name.txt', 'w', encoding='UTF-8')

teacher_name = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_name> \"%s\" ."

teacher_professional_title = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_professional_title> <http://kg.course/informations/%s> ."
professional_title_name = "<http://kg.course/informations/%s> <http://kg.course/informations/professional_title_name> \"%s\" ."

teacher_department = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_department> <http://kg.course/informations/%s> ."
department_name = "<http://kg.course/informations/%s> <http://kg.course/informations/department_name> \"%s\" ."

teacher_courses = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/%s> ."
courses_name = "<http://kg.course/informations/%s> <http://kg.course/informations/courses_name> \"%s\" ."

teacher_type = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_type> <http://kg.course/informations/%s> ."
type_name = "<http://kg.course/informations/%s> <http://kg.course/informations/type_name> \"%s\" ."

teacher_email = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_email> <http://kg.course/informations/%s> ."
email_name = "<http://kg.course/informations/%s> <http://kg.course/informations/email_name> \"%s\" ."

teacher_field = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_field> <http://kg.course/informations/%s> ."
field_name = "<http://kg.course/informations/%s> <http://kg.course/informations/field_name> \"%s\" ."

teacher_direction = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_direction> <http://kg.course/informations/%s> ."
direction_name = "<http://kg.course/informations/%s> <http://kg.course/informations/direction_name> \"%s\" ."

teacher_webpage = "<http://kg.course/informations/%s> <http://kg.course/informations/teacher_webpage> <http://kg.course/informations/%s> ."
webpage_name = "<http://kg.course/informations/%s> <http://kg.course/informations/webpage_name> \"%s\" ."

triples = []

for line in file_1:
	list_teacher = line.strip().split(' ')
	dict_teacher = {}
	for list_elm in list_teacher:
		list_to_dict = []
		list_to_dict = list_elm.split('：')
		char = '；'
		if char in list_to_dict[1]:
			subvalue = list_to_dict[1].split('；')
			dict_teacher[list_to_dict[0]] = subvalue
		else:
			dict_teacher[list_to_dict[0]] = [list_to_dict[1]]
	print(dict_teacher)
	file_3.write(dict_teacher['姓名'][0] + ' ' + 'nr'+'\n')

	teacher_name_str = teacher_name % (dict_teacher['姓名'][0], dict_teacher['姓名'][0])
	triples.append(teacher_name_str)

	teacher_professional_title_str = teacher_professional_title % (dict_teacher['姓名'][0], dict_teacher['职称'][0])
	professional_title_name_str = professional_title_name % (dict_teacher['职称'][0], dict_teacher['职称'][0])
	triples.append(teacher_professional_title_str)
	triples.append(professional_title_name_str)

	teacher_department_str = teacher_department % (dict_teacher['姓名'][0], dict_teacher['所在系别'][0])
	department_name_str = department_name % (dict_teacher['所在系别'][0], dict_teacher['所在系别'][0])
	triples.append(teacher_department_str)
	triples.append(department_name_str)

	if (len(dict_teacher['主讲课程']) > 1):
		for num_1 in range(len(dict_teacher['主讲课程'])):
			teacher_courses_str = teacher_courses % (dict_teacher['姓名'][0], dict_teacher['主讲课程'][num_1])
			courses_name_str = courses_name % (dict_teacher['主讲课程'][num_1], dict_teacher['主讲课程'][num_1])
			triples.append(teacher_courses_str)
			triples.append(courses_name_str)

	if (len(dict_teacher['主讲课程']) == 1):
		teacher_courses_str = teacher_courses % (dict_teacher['姓名'][0], dict_teacher['主讲课程'][0])
		courses_name_str = courses_name % (dict_teacher['主讲课程'][0], dict_teacher['主讲课程'][0])
		triples.append(teacher_courses_str)
		triples.append(courses_name_str)

	if (len(dict_teacher['导师类型']) > 1):
		for num_2 in range(len(dict_teacher['导师类型'])):
			teacher_type_str = teacher_type % (dict_teacher['姓名'][0], dict_teacher['导师类型'][num_2])
			type_name_str = type_name % (dict_teacher['导师类型'][num_2], dict_teacher['导师类型'][num_2])
			triples.append(teacher_type_str)
			triples.append(type_name_str)

	if (len(dict_teacher['导师类型']) == 1):
		teacher_type_str = teacher_type % (dict_teacher['姓名'][0], dict_teacher['导师类型'][0])
		type_name_str = type_name % (dict_teacher['导师类型'][0], dict_teacher['导师类型'][0])
		triples.append(teacher_type_str)
		triples.append(type_name_str)

	teacher_email_str = teacher_email % (dict_teacher['姓名'][0], dict_teacher['电子邮件'][0])
	email_name_str = email_name % (dict_teacher['电子邮件'][0], dict_teacher['电子邮件'][0])
	triples.append(teacher_email_str)
	triples.append(email_name_str)

	teacher_field_str = teacher_field % (dict_teacher['姓名'][0], dict_teacher['研究领域'][0])
	field_name_str = field_name % (dict_teacher['研究领域'][0], dict_teacher['研究领域'][0])
	triples.append(teacher_field_str)
	triples.append(field_name_str)

	teacher_direction_str = teacher_direction % (dict_teacher['姓名'][0], dict_teacher['研究方向'][0])
	direction_name_str = direction_name % (dict_teacher['研究方向'][0], dict_teacher['研究方向'][0])
	triples.append(teacher_direction_str)
	triples.append(direction_name_str)

	teacher_webpage_str = teacher_webpage % (dict_teacher['姓名'][0], dict_teacher['个人主页'][0])
	webpage_name_str = webpage_name % (dict_teacher['个人主页'][0], dict_teacher['个人主页'][0])
	triples.append(teacher_webpage_str)
	triples.append(webpage_name_str)

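# Write all triples once, after every teacher has been processed, then close the files
file_2.write("\n".join(triples) + "\n")
file_1.close()
file_2.close()
file_3.close()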
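
Before loading the file into a triple store it can help to check that every line looks like a single triple. The sketch below is a rough Python 3 check, not a full N-Triples parser; `TRIPLE_RE` and `invalid_lines` are hypothetical. It flags, for example, values containing spaces that ended up inside `<...>`, which the N-Triples grammar does not allow.

```python
import re

# Simplified pattern: IRIs without whitespace; the object is an IRI or a quoted literal
TRIPLE_RE = re.compile(
    r'^<[^<>\s]+> <[^<>\s]+> (?:<[^<>\s]+>|"(?:[^"\\]|\\.)*") \.$'
)

def invalid_lines(lines):
    """Return (line_number, text) pairs that do not look like one triple."""
    return [(i, s) for i, s in enumerate(lines, 1)
            if s.strip() and not TRIPLE_RE.match(s.strip())]

# Made-up lines for illustration only; the second has a space inside an IRI
sample = [
    '<http://kg.course/informations/a> <http://kg.course/informations/teacher_name> "a" .',
    '<http://kg.course/informations/a b> <http://kg.course/informations/teacher_courses> <http://kg.course/informations/c> .',
]
bad = invalid_lines(sample)
print(bad)
```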

Part of the result is shown below.

[Figure: part of the generated RDF triples]

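Importing the Knowledge Base into Apache Jena Fuseki

First download Jena Fuseki from the official website and unpack it to a location of your choice. Then open cmd, change to the Jena Fuseki directory, start Jena Fuseki, and create the dataset. The commands are shown in the figure below.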

[Figure: commands for starting Jena Fuseki and creating the dataset]

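Then open localhost:3030 in a browser to reach Jena Fuseki. Under dataset, select the newly created dataset testds and upload the generated RDF data to it. After a successful upload, the page looks like the figure below.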
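
The upload can also be scripted instead of done through the web page. The Python 3 sketch below only builds the HTTP request and does not send it. It assumes the dataset exposes a Graph Store endpoint at `/testds/data`, which depends on how the Fuseki server is configured, so treat the URL as an assumption; `build_upload_request` is a hypothetical helper.

```python
from urllib.request import Request

def build_upload_request(nt_bytes, dataset_url='http://localhost:3030/testds'):
    """Build (but do not send) a POST of N-Triples data to the dataset's data endpoint."""
    return Request(
        dataset_url + '/data',  # assumed Graph Store endpoint
        data=nt_bytes,
        headers={'Content-Type': 'application/n-triples'},
        method='POST',
    )

# Made-up triple for illustration only
sample = '<http://kg.course/informations/a> <http://kg.course/informations/teacher_name> "a" .\n'
req = build_upload_request(sample.encode('utf-8'))
print(req.get_method(), req.full_url)
# To actually send it, call urllib.request.urlopen(req) while the Fuseki server is running
```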

[Figure: the dataset page after a successful upload]
The next step is to design the question answering system.

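Designing the Question Answering System

Step 1: connect to the database with the SPARQLWrapper package. Note that the code in this section uses Python 2 syntax (print statements, reload(sys)), unlike the Python 3 scraping code above.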

sparql_base = SPARQLWrapper("http://localhost:3030/testds")

Step 2: design the SPARQL query templates

First, the SELECT template

SPARQL_TEM = u"{preamble}\n" + \
             u"SELECT DISTINCT {select} WHERE {{\n" + \
             u"{expression}\n" + \
             u"}}\n"

Then the COUNT template

SPARQL_TEM_count = u"{preamble}\n" + \
                    u"SELECT (COUNT({select}) AS {count}) WHERE {{\n" + \
                    u"{expression}\n" + \
                    u"}}\n"

Finally, the ASK template

SPARQL_ASK_TEM = u"{preamble}\n" + \
                u"ASK WHERE{{\n" + \
                u"{expression}\n" + \
                u"}}\n"
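
To see what the templates produce, the Python 3 sketch below fills each of them once. The doubled braces `{{` and `}}` are how `str.format` writes literal braces. The prefix declaration is the one defined in the complete code further down; the templates are written on one line here only for brevity.

```python
SPARQL_PREAMBLE = "PREFIX school: <http://kg.course/informations/>"
SPARQL_TEM = "{preamble}\nSELECT DISTINCT {select} WHERE {{\n{expression}\n}}\n"
SPARQL_TEM_count = "{preamble}\nSELECT (COUNT({select}) AS {count}) WHERE {{\n{expression}\n}}\n"
SPARQL_ASK_TEM = "{preamble}\nASK WHERE{{\n{expression}\n}}\n"

select_query = SPARQL_TEM.format(
    preamble=SPARQL_PREAMBLE, select="?x0",
    expression="    school:张小旺 school:teacher_courses ?x0.")
count_query = SPARQL_TEM_count.format(
    preamble=SPARQL_PREAMBLE, select="?courses", count="?x0",
    expression="    school:张小旺 school:teacher_courses ?courses.")
ask_query = SPARQL_ASK_TEM.format(
    preamble=SPARQL_PREAMBLE,
    expression="    school:张小旺 school:teacher_type school:博士生导师.")

print(select_query)
print(count_query)
print(ask_query)
```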

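Step 3: design the regex matching

First, segment each question in the question list. To avoid segmenting teacher names incorrectly, load the external dictionary of teacher names created earlier, as shown below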

jieba.load_userdict("E:/study/zhishi/teachers_name.txt")

default_questions = [
    u"硕士生导师类型有哪些老师?",
    u"硕士生导师类型有多少老师?",
    u"张小旺老师主讲了哪些课?",
    u"张小旺老师主讲了几门课?",
    u"张小旺老师是博士生导师吗?",
    u"李坤老师主讲了哪些课?",
    u"李坤老师主讲了几门课?",
    u"李坤老师是博士生导师吗?",
    u"毕重科老师主讲了哪些课?",
    u"毕重科老师主讲了几门课?",
    u"毕重科老师是博士生导师吗?",
]

questions = default_questions[0:]
seg_lists = []
for question in questions:
    words = pseg.cut(question)
    seg_list = [Word(word.encode("utf-8"), flag) for word, flag in words]
    seg_lists.append(seg_list)

Then define the keywords, so that the regex matching can map each question to the right rule

tutor_type_master = (W("硕士生导师") | W("硕导")| W("硕士导师")| W("硕士生"))
tutor_type_PhD = (W("博士生导师") | W("博导")| W("博士导师")| W("博士生"))
teacher = (W(pos = "nr") | W(pos = "x"))
whose = (W("谁") | W("哪些"))
quantity = (W("多少") | W("几") | W("几门"))

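Next, write the matching rules. Taking the first Rule as an example, condition means that when the keywords tutor_type_master and whose both appear in that order, the query function who_is_master_tutor_question is used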

class Rule(object):
    def __init__(self, condition=None, action=None):
        assert condition and action
        self.condition = condition
        self.action = action

    def apply(self, sentence):
        matches = []
        for m in finditer(self.condition, sentence):
            i, j = m.span()
            matches.extend(sentence[i:j])  # collect the matched words in matches
        if __name__ == '__main__':
            print "----------applying %s----------" % self.action.__name__
        return self.action(matches)  # pass the matched word list to the action function
        
rules = [
    # Which teachers have a given supervisor type?
    Rule(condition = tutor_type_master + Star(Any(), greedy = False) + whose, action = who_is_master_tutor_question),

    # How many teachers have a given supervisor type?
    Rule(condition = tutor_type_master + Star(Any(), greedy = False) + quantity, action = how_many_teachers_are_master_tutor_question),

    # Which courses does a given teacher teach?
    Rule(condition = teacher + Star(Any(), greedy = False) + whose, action = what_courses_teacher_question),

    # How many courses does a given teacher teach?
    Rule(condition = teacher + Star(Any(), greedy = False) + quantity, action = how_many_courses_teacher_question),

    # Is a given teacher a doctoral supervisor?
    Rule(condition = teacher + Star(Any(), greedy = False) + tutor_type_PhD, action = teacher_is_PhD_tutor_question)
    ]
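
Each condition has the shape "keyword A, then as few words as possible, then keyword B". The plain Python 3 sketch below is a simplified stand-in for that idea. It does not use REfO or reproduce its API, it returns only the first match, and the question is segmented by hand for illustration (real segmentation comes from jieba). `first_match` is a hypothetical helper.

```python
def first_match(tokens, starts, ends):
    """Return the shortest slice that begins with a word in starts and ends with a word in ends."""
    for i, tok in enumerate(tokens):
        if tok in starts:
            for j in range(i + 1, len(tokens)):
                if tokens[j] in ends:
                    return tokens[i:j + 1]
    return []

tutor_type_master = {"硕士生导师", "硕导", "硕士导师", "硕士生"}
whose = {"谁", "哪些"}
quantity = {"多少", "几", "几门"}

# Hand-segmented question for illustration only
tokens = ["硕士生", "导师", "类型", "有", "哪些", "老师", "?"]
who_span = first_match(tokens, tutor_type_master, whose)
count_span = first_match(tokens, tutor_type_master, quantity)
print(who_span, count_span)
```

A non-empty span means the rule fires and its action would receive the matched words; an empty span means the next rule is tried.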

Write the query functions

# Which teachers have a given supervisor type?
def who_is_master_tutor_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.token == "硕士生" or w.token == "哪些":
            e = u"?x school:teacher_type school:{type}导师. ?x school:teacher_name ?x0.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql

# How many teachers have a given supervisor type?
def how_many_teachers_are_master_tutor_question(x):
    select = u"?teachers"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.token.decode("utf-8") == "硕士生" or w.token.decode("utf-8") == "多少":
            e = u"?teachers school:teacher_type school:{type}导师.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql

# Which courses does a given teacher teach?
def what_courses_teacher_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?x0.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql

# How many courses does a given teacher teach?
def how_many_courses_teacher_question(x):
    select = u"?courses"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?courses.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql

# Is a given teacher a doctoral supervisor?
def teacher_is_PhD_tutor_question(x):
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_type school:博士生导师.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_ASK_TEM.format(preamble = SPARQL_PREAMBLE, expression = INDENT + e)
            break
    return sparql
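
The functions above are written for Python 2, where tokens are byte strings that must be decoded. For readers on Python 3, here is a sketch of how one of them might look: tokens are already `str`, so no `.decode` call is needed. The `Word` namedtuple and the hand-assigned POS tags are assumptions made for illustration.

```python
from collections import namedtuple

Word = namedtuple("Word", ["token", "pos"])  # token and POS tag are both str in Python 3

SPARQL_PREAMBLE = "PREFIX school: <http://kg.course/informations/>"
SPARQL_TEM = "{preamble}\nSELECT DISTINCT {select} WHERE {{\n{expression}\n}}\n"
INDENT = "    "

def what_courses_teacher_question(words):
    """Build the 'which courses does this teacher teach' query from the matched words."""
    for w in words:
        if w.pos == "nr":  # the first person-name token is taken as the teacher
            e = "school:{person} school:teacher_courses ?x0.".format(person=w.token)
            return SPARQL_TEM.format(preamble=SPARQL_PREAMBLE, select="?x0",
                                     expression=INDENT + e)
    return None  # no teacher name among the matched words

# Hand-tagged words for illustration only
matched = [Word("张小旺", "nr"), Word("老师", "n"), Word("主讲", "v"),
           Word("了", "ul"), Word("哪些", "r")]
query = what_courses_teacher_question(matched)
print(query)
```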

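Finally, take each question, match it against the rules, and output the answer. Note that ASK results are stored differently from SELECT and COUNT results: the JSON response carries a top-level boolean field instead of a list of bindings, so True and False have to be handled explicitly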

for seg in seg_lists:
    # take one question
    question = []
    for s in seg:
        # print the question in its segmented form
        print s.token,
        question.append(s.token)
    file_3.write(u','.join(question))
    print

    for rule in rules:
        # try one rule
        query = rule.apply(seg)

        if query is None:
            continue
        print query
        file_3.write(query + '\n')

        if query:
            sparql_base.setQuery(query)
            sparql_base.setReturnFormat(JSON)
            results = sparql_base.query().convert()

            if "results" in results.keys():
                if not results["results"]["bindings"]:
                    print "No answer found :("
                    print
                    continue
                for result in results["results"]["bindings"]:
                    print "Result: ", result["x0"]["value"]
                    file_3.write("Result: " + result["x0"]["value"] + '\n')
                    print
            else:
                print "Result: ", results["boolean"]
                boo = str(results["boolean"])
                if boo == "True":
                    file_3.write(u"Result: " + "True" + '\n')
                else:
                    file_3.write(u"Result: " + "False" + '\n')
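
The two response shapes can be handled in one small function. The Python 3 sketch below works on hand-written dictionaries in the SPARQL JSON results layout, so it runs without a server; `extract_answers` is a hypothetical helper and the sample values are made up.

```python
def extract_answers(results, var="x0"):
    """Return a list of values for SELECT/COUNT results, or a bool for ASK results."""
    if "results" in results:
        return [b[var]["value"] for b in results["results"]["bindings"]]
    return bool(results["boolean"])

# Hand-written responses for illustration only
select_json = {"head": {"vars": ["x0"]},
               "results": {"bindings": [{"x0": {"type": "literal", "value": "3"}}]}}
ask_json = {"head": {}, "boolean": True}

select_answer = extract_answers(select_json)
ask_answer = extract_answers(ask_json)
print(select_answer, ask_answer)
```

An empty list then plays the role of the "No answer found" case in the loop above.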

The complete code is as follows

# coding: utf-8

# standard import
import re
# third-party import
from refo import finditer, Predicate, Star, Any
import jieba.posseg as pseg
from jieba import suggest_freq
import jieba
from SPARQLWrapper import SPARQLWrapper, JSON
import io


import sys
reload(sys)
sys.setdefaultencoding("utf-8")


# Load the external dictionary
# jieba.load_userdict("D:/app/teachers_name.txt")
jieba.load_userdict("E:/study/zhishi/teachers_name.txt")

sparql_base = SPARQLWrapper("http://localhost:3030/testds")

# SPARQL config
# SPARQL templates
SPARQL_PREAMBLE = u"""
PREFIX school: <http://kg.course/informations/>
"""

SPARQL_TEM = u"{preamble}\n" + \
             u"SELECT DISTINCT {select} WHERE {{\n" + \
             u"{expression}\n" + \
             u"}}\n"

SPARQL_TEM_count = u"{preamble}\n" + \
                    u"SELECT (COUNT({select}) AS {count}) WHERE {{\n" + \
                    u"{expression}\n" + \
                    u"}}\n"

SPARQL_ASK_TEM = u"{preamble}\n" + \
                u"ASK WHERE{{\n" + \
                u"{expression}\n" + \
                u"}}\n"

INDENT = "    "

class Word(object):
    """treated words as objects"""
    def __init__(self, token, pos):
        self.token = token
        self.pos = pos


class W(Predicate):
    """object-oriented regex for words"""
    def __init__(self, token=".*", pos=".*"):
        self.token = re.compile(token + "$")
        self.pos = re.compile(pos + "$")
        super(W, self).__init__(self.match)

    def match(self, word):
        m1 = self.token.match(word.token)
        m2 = self.pos.match(word.pos)
        return m1 and m2


class Rule(object):
    def __init__(self, condition=None, action=None):
        assert condition and action
        self.condition = condition
        self.action = action

    def apply(self, sentence):
        matches = []
        for m in finditer(self.condition, sentence):
            i, j = m.span()
            matches.extend(sentence[i:j])  # collect the matched words in matches
        if __name__ == '__main__':
            print "----------applying %s----------" % self.action.__name__
        return self.action(matches)  # pass the matched word list to the action function

# Which teachers have a given supervisor type?
def who_is_master_tutor_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.token == "硕士生" or w.token == "哪些":
            e = u"?x school:teacher_type school:{type}导师. ?x school:teacher_name ?x0.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql

# How many teachers have a given supervisor type?
def how_many_teachers_are_master_tutor_question(x):
    select = u"?teachers"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.token.decode("utf-8") == "硕士生" or w.token.decode("utf-8") == "多少":
            e = u"?teachers school:teacher_type school:{type}导师.".format(type = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql

# Which courses does a given teacher teach?
def what_courses_teacher_question(x):
    select = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?x0.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM.format(preamble = SPARQL_PREAMBLE, select = select, expression = INDENT + e)
            break
    return sparql

# How many courses does a given teacher teach?
def how_many_courses_teacher_question(x):
    select = u"?courses"
    count = u"?x0"
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_courses ?courses.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_TEM_count.format(preamble = SPARQL_PREAMBLE, select = select, count = count, expression = INDENT + e)
            break
    return sparql

# Is a given teacher a doctoral supervisor?
def teacher_is_PhD_tutor_question(x):
    sparql = None
    for w in x:
        if w.pos == "nr":
            e = u"school:{person} school:teacher_type school:博士生导师.".format(person = w.token.decode("utf-8"))
            sparql = SPARQL_ASK_TEM.format(preamble = SPARQL_PREAMBLE, expression = INDENT + e)
            break
    return sparql

def encode(s):
    return ' '.join([bin(ord(c)).replace('0b', '') for c in s])

if __name__ == "__main__":
    default_questions = [
        u"硕士生导师类型有哪些老师?",
        u"硕士生导师类型有多少老师?",
        u"张小旺老师主讲了哪些课?",
        u"张小旺老师主讲了几门课?",
        u"张小旺老师是博士生导师吗?",
        u"李坤老师主讲了哪些课?",
        u"李坤老师主讲了几门课?",
        u"李坤老师是博士生导师吗?",
        u"毕重科老师主讲了哪些课?",
        u"毕重科老师主讲了几门课?",
        u"毕重科老师是博士生导师吗?",
    ]


    questions = default_questions[0:]

    seg_lists = []

    # tokenizing questions
    for question in questions:
        words = pseg.cut(question)
        seg_list = [Word(word.encode("utf-8"), flag) for word, flag in words]
        seg_lists.append(seg_list)

    # some rules for matching
    # TODO: customize your own rules here
    # Keywords for the regex matching
    tutor_type_master = (W("硕士生导师") | W("硕导")| W("硕士导师")| W("硕士生"))
    tutor_type_PhD = (W("博士生导师") | W("博导")| W("博士导师")| W("博士生"))
    teacher = (W(pos = "nr") | W(pos = "x"))
    whose = (W("谁") | W("哪些"))
    quantity = (W("多少") | W("几") | W("几门"))
    
    # Matching rules
    rules = [

        # Which teachers have a given supervisor type?
        Rule(condition = tutor_type_master + Star(Any(), greedy = False) + whose, action = who_is_master_tutor_question),

        # How many teachers have a given supervisor type?
        Rule(condition = tutor_type_master + Star(Any(), greedy = False) + quantity, action = how_many_teachers_are_master_tutor_question),

        # Which courses does a given teacher teach?
        Rule(condition = teacher + Star(Any(), greedy = False) + whose, action = what_courses_teacher_question),

        # How many courses does a given teacher teach?
        Rule(condition = teacher + Star(Any(), greedy = False) + quantity, action = how_many_courses_teacher_question),

        # Is a given teacher a doctoral supervisor?
        Rule(condition = teacher + Star(Any(), greedy = False) + tutor_type_PhD, action = teacher_is_PhD_tutor_question)

    ]

    file_3 = io.open('result.txt', 'w', encoding='UTF-8')

    # matching and querying
    for seg in seg_lists:  # take one question
        # display question each
        question = []
        for s in seg:
            print s.token,  # print the question in its segmented form
            question.append(s.token)
        file_3.write(u','.join(question))
        print

        for rule in rules:  # try one rule
            query = rule.apply(seg)

            if query is None:
                continue
            print query
            file_3.write(query + '\n')

            if query:
                sparql_base.setQuery(query)
                sparql_base.setReturnFormat(JSON)
                results = sparql_base.query().convert()

                if "results" in results.keys():
                    if not results["results"]["bindings"]:
                        print "No answer found :("
                        print
                        continue
                    for result in results["results"]["bindings"]:
                        print "Result: ", result["x0"]["value"]
                        file_3.write("Result: " + result["x0"]["value"] + '\n')
                        print
                else:
                    print "Result: ", results["boolean"]
                    boo = str(results["boolean"])
                    if boo == "True":
                        file_3.write(u"Result: " + "True" + '\n')
                    else:
                        file_3.write(u"Result: " + "False" + '\n')