This part is based on this GitHub project.
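Supported question types
Because our dataset is specific to traditional Chinese medicine (TCM), the supported question types are all centered on TCM as well.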
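Entity extraction with an Aho-Corasick (AC) automaton
We use dictionary matching for intent recognition: the user's query is matched against the entries in a dictionary.
By adding the entity names of this domain to the AC tree, we can look up the entity names that occur in a question, which gives us entity extraction.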
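import ahocorasick  # provided by the pyahocorasick package

def build_actree(self, wordlist):
    """Build an Aho-Corasick automaton over the domain words."""
    actree = ahocorasick.Automaton()
    for index, word in enumerate(wordlist):
        # The (index, word) tuple is the payload returned when the word matches
        actree.add_word(word, (index, word))
    actree.make_automaton()
    return actree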
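To make the matching step less of a black box, here is a minimal pure-Python sketch of the Aho-Corasick idea (a trie plus failure links). It is not the pyahocorasick implementation; the names `build_automaton` and `find_all` and the sample words are assumptions made for illustration. Like `Automaton.iter`, it reports the index of the last character of each match.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links, and output lists for the words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # words that end at each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (end_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for end, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((end, word))
    return matches


# Hypothetical sample vocabulary and question
automaton = build_automaton(["乌鸡", "乌鸡白凤丸", "感冒"])
matches = find_all("乌鸡白凤丸能治感冒吗", automaton)
```

The text is scanned once, and both "乌鸡" and "乌鸡白凤丸" are reported, which is why a later filtering step is needed.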
Building the domain dictionary
def build_wdtype_dict(self):
    """Map every domain word to the list of entity types it belongs to."""
    wd_dict = dict()
    for wd in self.region_words:
        wd_dict[wd] = []
        # A word can belong to more than one type, so each check is independent
        if wd in self.disease_wds:
            wd_dict[wd].append('disease')
        if wd in self.drug_wds:
            wd_dict[wd].append('drug')
        if wd in self.ingredient_wds:
            wd_dict[wd].append('ingredient')
        if wd in self.taste_wds:
            wd_dict[wd].append('taste')
        if wd in self.people_wds:
            wd_dict[wd].append('people')
    return wd_dict
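As a standalone illustration of the resulting mapping, the sketch below builds the same kind of word-to-types dictionary from a few made-up vocabularies. The function signature, the `category_words` structure, and the sample words are assumptions for the example and are not taken from the project.

```python
def build_wdtype_dict(category_words):
    """Map each word to the list of categories whose vocabulary contains it."""
    wd_dict = {}
    for category, words in category_words.items():
        for wd in words:
            wd_dict.setdefault(wd, []).append(category)
    return wd_dict


# Hypothetical vocabularies; a real system would load these from its word lists
category_words = {
    "disease": ["感冒", "咳嗽"],
    "drug": ["乌鸡白凤丸", "甘草"],
    "ingredient": ["甘草", "乌鸡"],
}
wdtype_dict = build_wdtype_dict(category_words)
```

Here "甘草" ends up with two labels, `['drug', 'ingredient']`, which is why each word maps to a list rather than a single type. Iterating over the categories instead of testing every word against every list also avoids repeated linear membership checks when the vocabularies are plain lists.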
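Question filtering
AC automaton matching can return several overlapping matches for one name. For 乌鸡白凤丸, for example, both "乌鸡" and "乌鸡白凤丸" match, so the results have to be filtered to drop the shorter word "乌鸡".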
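def check_medical(self, question):
    # Entity names found in the question
    region_wds = []
    for i in self.region_tree.iter(question):
        # Each match is (end_index, (index, word)); keep the word
        wd = i[1][1]
        region_wds.append(wd)
    stop_wds = []
    # Filter: mark every word that is a proper substring of another match
    for wd1 in region_wds:
        for wd2 in region_wds:
            if wd1 in wd2 and wd1 != wd2:
                stop_wds.append(wd1)
    # All domain words in the question that survive the filter
    final_wds = [i for i in region_wds if i not in stop_wds]
    # Map each entity to its labels, e.g. 乌鸡白凤丸: ['drug']
    final_dict = {i: self.wdtype_dict.get(i) for i in final_wds}
    return final_dict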
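The filtering rule can be tried in isolation. The following is a minimal sketch, assuming the matched words are already available as a list; `drop_nested` is a hypothetical helper name and the sample matches are made up.

```python
def drop_nested(words):
    """Keep only words that are not a proper substring of another matched word."""
    unique = list(dict.fromkeys(words))  # de-duplicate, keep first-seen order
    return [
        w for w in unique
        if not any(w != other and w in other for other in unique)
    ]


# Hypothetical matches for the question "乌鸡白凤丸能治感冒吗"
matched = ["乌鸡", "乌鸡白凤丸", "感冒"]
kept = drop_nested(matched)
```

One consequence of filtering by substring is that a short word is dropped even when it also occurs on its own elsewhere in the question, because the check looks only at the words and not at their positions.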
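Question classification
The question type is determined from the interrogative words and the entity types that appear in the question.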
def classify(self, question):
    '''Main classification function; takes the question text.'''
    data = {}
    medical_dict = self.check_medical(question)
    if not medical_dict:
        return {}
    data['args'] = medical_dict
    # Collect the entity types that appear in the question
    types = []
    for type_ in medical_dict.values():
        types += type_
    question_type = 'others'
    # Question types detected for this question
    question_types = []
    # Which conditions the drug can treat
    if self.check_words(self.disease_wds, question) and ('drug' in types):
        question_type = 'drug_disease'
        question_types.append(question_type)
    # Nature and flavor of the drug
    if self.check_words(self.taste_qwds, question) and ('drug' in types):
        question_type = 'drug_taste'
        question_types.append(question_type)
    # Who should not take the drug
    if self.check_words(self.people_qwds, question) and ('drug' in types):
        question_type = 'drug_people'
        question_types.append(question_type)
    # Which drug to take for a disease
    if self.check_words(self.eat_qwds, question) and ('disease' in types):
        question_type = 'disease_drug'
        question_types.append(question_type)
    # Which ingredients the drug contains
    if self.check_words(self.ingredient_qwds, question) and ('drug' in types):
        question_type = 'drug_ingredient'
        question_types.append(question_type)
    # If no specific question type matched, fall back to the drug's description
    if question_types == [] and 'drug' in types:
        question_types = ['drug_desc']
    # Combine the classification results into a single dictionary
    data['question_types'] = question_types
    return data
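The classifier relies on a `check_words` helper that is not shown above. A plausible minimal version, together with a cut-down classification over two question types, might look like the sketch below. The body of `check_words`, the cue-word lists, and the function `question_types_for` are all assumptions for illustration, not the project's actual code.

```python
def check_words(wds, sent):
    """Return True if any of the cue words occurs in the sentence."""
    return any(wd in sent for wd in wds)


# Hypothetical cue-word lists; the real lists are not shown in this section
taste_qwds = ["性味", "味道"]
people_qwds = ["禁忌", "不能吃", "不宜"]


def question_types_for(question, types):
    """Combine cue words with the entity types found in the question."""
    result = []
    if check_words(taste_qwds, question) and "drug" in types:
        result.append("drug_taste")
    if check_words(people_qwds, question) and "drug" in types:
        result.append("drug_people")
    # No cue word matched: fall back to the drug description
    if not result and "drug" in types:
        result.append("drug_desc")
    return result


taste_result = question_types_for("乌鸡白凤丸的性味是什么", ["drug"])
fallback_result = question_types_for("乌鸡白凤丸", ["drug"])
```

Because every rule is an independent `if`, one question can receive several types at once, and the description fallback applies only when none of them fired.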
Results
Entity recognition and intent recognition for questions are now complete!