基于知识图谱的从患者与医生的对话提取特征词过程

chen_nnn

已于 2022-04-02 18:09:06 修改

阅读量3.1k

点赞数 1

分类专栏：笔记文章标签：知识图谱机器学习 nlp

于 2022-04-02 16:45:30 首次发布

本文链接：https://blog.csdn.net/chen_nnn/article/details/123922025

版权

笔记专栏收录该内容

12 篇文章 40 订阅

订阅专栏

又有新的任务了，要能够从患者与医生之间对话提取出关键词，然后再根据已经构建好的知识图谱的内容，去寻找回答患者的提问，这一部分工作同样刘老师都已经实现，这里仍然是对其进行解读。

question_classifier.py

question_classifier.py

import os
import pyahocorasick

class QuestionClassifier:



if __name__ == '__main__':
    handler = QuestionClassifier()
    while 1:
        question = input('input an question:')
        data = handler.classify(question)
        print(data)

ahocorasick库的介绍可以看这篇博客，而如果在下载中遇到问题的话可以试试，pip install pyahocorasick这两个库应该是一样的。主函数的内容也比较简单，就是不断循环，得到使用者提出的问题，然后根据知识图谱得到相关回答并输出。python ahocorasick介绍_追梦杏花天影的博客-CSDN博客_ahocorasickpython ahocorasick介绍ahocorasick模块介绍ahocorasick是个python模块，Aho-Corasick算法是多模式匹配中的经典算法，目前在实际应用中较多。由两种数据结构实现：trie和Aho-Corasick自动机，简称AC自动机。多模式匹配：多模式匹配就是有多个模式串P1,P2,P3…，Pm，求出所有这些模式串在连续文本T1…n中的所有可能出现的位...https://blog.csdn.net/u010569893/article/details/97136696

QuestionClassifier类：

class QuestionClassifier:
    def __init__(self):
        cur_dir = '/'.join(os.path.abspath(__file__).split('/')[:-1])
        #　特征词路径
        self.disease_path = os.path.join(cur_dir, 'dict/disease.txt')
        self.department_path = os.path.join(cur_dir, 'dict/department.txt')
        self.check_path = os.path.join(cur_dir, 'dict/check.txt')
        self.drug_path = os.path.join(cur_dir, 'dict/drug.txt')
        self.food_path = os.path.join(cur_dir, 'dict/food.txt')
        self.producer_path = os.path.join(cur_dir, 'dict/producer.txt')
        self.symptom_path = os.path.join(cur_dir, 'dict/symptom.txt')
        self.deny_path = os.path.join(cur_dir, 'dict/deny.txt')
        # 加载特征词
        self.disease_wds= [i.strip() for i in open(self.disease_path) if i.strip()]
        self.department_wds= [i.strip() for i in open(self.department_path) if i.strip()]
        self.check_wds= [i.strip() for i in open(self.check_path) if i.strip()]
        self.drug_wds= [i.strip() for i in open(self.drug_path) if i.strip()]
        self.food_wds= [i.strip() for i in open(self.food_path) if i.strip()]
        self.producer_wds= [i.strip() for i in open(self.producer_path) if i.strip()]
        self.symptom_wds= [i.strip() for i in open(self.symptom_path) if i.strip()]
        self.region_words = set(self.department_wds + self.disease_wds + self.check_wds + self.drug_wds + self.food_wds + self.producer_wds + self.symptom_wds)
        self.deny_words = [i.strip() for i in open(self.deny_path) if i.strip()]
        # 构造领域actree
        self.region_tree = self.build_actree(list(self.region_words))
        # 构建词典
        self.wdtype_dict = self.build_wdtype_dict()
        # 问句疑问词
        self.symptom_qwds = ['症状', '表征', '现象', '症候', '表现']
        self.cause_qwds = ['原因','成因', '为什么', '怎么会', '怎样才', '咋样才', '怎样会', '如何会', '为啥', '为何', '如何才会', '怎么才会', '会导致', '会造成']
        self.acompany_qwds = ['并发症', '并发', '一起发生', '一并发生', '一起出现', '一并出现', '一同发生', '一同出现', '伴随发生', '伴随', '共现']
        self.food_qwds = ['饮食', '饮用', '吃', '食', '伙食', '膳食', '喝', '菜' ,'忌口', '补品', '保健品', '食谱', '菜谱', '食用', '食物','补品']
        self.drug_qwds = ['药', '药品', '用药', '胶囊', '口服液', '炎片']
        self.prevent_qwds = ['预防', '防范', '抵制', '抵御', '防止','躲避','逃避','避开','免得','逃开','避开','避掉','躲开','躲掉','绕开',
                             '怎样才能不', '怎么才能不', '咋样才能不','咋才能不', '如何才能不',
                             '怎样才不', '怎么才不', '咋样才不','咋才不', '如何才不',
                             '怎样才可以不', '怎么才可以不', '咋样才可以不', '咋才可以不', '如何可以不',
                             '怎样才可不', '怎么才可不', '咋样才可不', '咋才可不', '如何可不']
        self.lasttime_qwds = ['周期', '多久', '多长时间', '多少时间', '几天', '几年', '多少天', '多少小时', '几个小时', '多少年']
        self.cureway_qwds = ['怎么治疗', '如何医治', '怎么医治', '怎么治', '怎么医', '如何治', '医治方式', '疗法', '咋治', '怎么办', '咋办', '咋治']
        self.cureprob_qwds = ['多大概率能治好', '多大几率能治好', '治好希望大么', '几率', '几成', '比例', '可能性', '能治', '可治', '可以治', '可以医']
        self.easyget_qwds = ['易感人群', '容易感染', '易发人群', '什么人', '哪些人', '感染', '染上', '得上']
        self.check_qwds = ['检查', '检查项目', '查出', '检查', '测出', '试出']
        self.belong_qwds = ['属于什么科', '属于', '什么科', '科室']
        self.cure_qwds = ['治疗什么', '治啥', '治疗啥', '医治啥', '治愈啥', '主治啥', '主治什么', '有什么用', '有何用', '用处', '用途',
                          '有什么好处', '有什么益处', '有何益处', '用来', '用来做啥', '用来作甚', '需要', '要']

        print('model init finished ......')

        return

这里很明显是在加载已经处理好的字典，并导入到该python文件内存储，寻找特征词路径并加载特征词。然后将疾病、科室、检查项目、药品名称、食物名称、生产商和症状都去重之后存储到领域字典内。然后是根据这一领域字典构造ACTree，然后在构造一个字典里面保存了领域字典内所有词的属性。之后则是根据使用者的提问来设置部分关键词，这些关键词也都是有着自己的分类，因为根据这些词出现的时候，我们需要推测出使用者想要什么回答，因为这个是根据字符串进行匹配，且都是由人本身想出来的，肯定会存在着缺失的情况，而这个问题可以在大量的原始数据的基础上看在什么情况下失效，然后进行补充。

build_actree():

    def build_actree(self, wordlist):
        actree = ahocorasick.Automaton()
        for index, word in enumerate(wordlist):
            actree.add_word(word, (index, word))
        actree.make_automaton()
        return actree

这里刘老师给的注释是构造actree，加速过滤。这里我的理解是构造actree这点和容易看出来，而加速过滤这一条是由这个自动机带来的好处。至于构造的方式一般是比较固定的，都是先ahocorasick.Automaton一下（我甚至觉得这里应该是Automation，库函数中少写了一个i），然后根据自己的需要构造add_word的形式（这里可以先放一下，之后再看为什么用这样的形式保存），最后是一句actree.make_automaton。

build_wdtype_dict():

    def build_wdtype_dict(self):
        wd_dict = dict()
        for wd in self.region_words:
            wd_dict[wd] = []
            if wd in self.disease_wds:
                wd_dict[wd].append('disease')
            if wd in self.department_wds:
                wd_dict[wd].append('department')
            if wd in self.check_wds:
                wd_dict[wd].append('check')
            if wd in self.drug_wds:
                wd_dict[wd].append('drug')
            if wd in self.food_wds:
                wd_dict[wd].append('food')
            if wd in self.symptom_wds:
                wd_dict[wd].append('symptom')
            if wd in self.producer_wds:
                wd_dict[wd].append('producer')
        return wd_dict

这里的注释内容为：构造词对应的类型，这就很好理解了，就是在这个函数中，我们首先定义好一个字典向量。然后针对领域字典内的所有词遍历，找到该词原本在哪个属性下，并记录下来，最后返回实现好的字典。这两个函数都可以是作为初始化的一个步骤。

check_medical():

    def check_medical(self, question):
        region_wds = []
        for i in self.region_tree.iter(question):
            wd = i[1][1]
            region_wds.append(wd)
        stop_wds = []
        for wd1 in region_wds:
            for wd2 in region_wds:
                if wd1 in wd2 and wd1 != wd2:
                    stop_wds.append(wd1)
        final_wds = [i for i in region_wds if i not in stop_wds]
        final_dict = {i:self.wdtype_dict.get(i) for i in final_wds}

        return final_dict

该函数注释为：问句过滤。我理解的含义是找到使用者提问的问题中关键的部分。根据前面我们已经构造好的actree，里面字典内包括疾病、科室、检查项目、药品名称、食物名称、生产商和症状这些信息，因此最终在问句中找到的也是这些信息，找到的这些信息都保存在region_wds里。然后在对这些已经找到的词进行检索，如果满足这其中的某个词会完全出现在另外一个词当中，就将短的那个词保存到stop_wds中，比方说“头痛”和“偏头痛”，则会将“头痛”保存（这是我的理解，倒不一定对，还得程序运行起来进行调试才知道）。而最终需要的final_wds是那些更长的词（比如“偏头痛”，感觉有点绕了），然后在字典中将这些词按键值对的方式进行保存，键名是疾病、科室、检查项目、药品名称、食物名称、生产商和症状下对应的具体单词（比如“内科”），值对应的则是单词的属性（“内科”的属性就是“department”），然后返回。

check_words():

    def check_words(self, wds, sent):
        for wd in wds:
            if wd in sent:
                return True
        return False

注释内容为：基于特征词进行分类。这个函数单看的话，虽然实现的功能很简单，就是判断给出的单词有没有出现在sent当中，但是在整个代码中的功能却比较抽象。在下面的classify函数中将出现并读懂本身含义，这里不作过多解释。

classify():

    def classify(self, question):
        data = {}
        medical_dict = self.check_medical(question)
        if not medical_dict:
            return {}
        data['args'] = medical_dict
        #收集问句当中所涉及到的实体类型
        types = []
        for type_ in medical_dict.values():
            types += type_
        question_type = 'others'

        question_types = []

        # 症状
        if self.check_words(self.symptom_qwds, question) and ('disease' in types):
            question_type = 'disease_symptom'
            question_types.append(question_type)

        if self.check_words(self.symptom_qwds, question) and ('symptom' in types):
            question_type = 'symptom_disease'
            question_types.append(question_type)

        # 原因
        if self.check_words(self.cause_qwds, question) and ('disease' in types):
            question_type = 'disease_cause'
            question_types.append(question_type)
        # 并发症
        if self.check_words(self.acompany_qwds, question) and ('disease' in types):
            question_type = 'disease_acompany'
            question_types.append(question_type)

        # 推荐食品
        if self.check_words(self.food_qwds, question) and 'disease' in types:
            deny_status = self.check_words(self.deny_words, question)
            if deny_status:
                question_type = 'disease_not_food'
            else:
                question_type = 'disease_do_food'
            question_types.append(question_type)

        #已知食物找疾病
        if self.check_words(self.food_qwds+self.cure_qwds, question) and 'food' in types:
            deny_status = self.check_words(self.deny_words, question)
            if deny_status:
                question_type = 'food_not_disease'
            else:
                question_type = 'food_do_disease'
            question_types.append(question_type)

        # 推荐药品
        if self.check_words(self.drug_qwds, question) and 'disease' in types:
            question_type = 'disease_drug'
            question_types.append(question_type)

        # 药品治啥病
        if self.check_words(self.cure_qwds, question) and 'drug' in types:
            question_type = 'drug_disease'
            question_types.append(question_type)

        # 疾病接受检查项目
        if self.check_words(self.check_qwds, question) and 'disease' in types:
            question_type = 'disease_check'
            question_types.append(question_type)

        # 已知检查项目查相应疾病
        if self.check_words(self.check_qwds+self.cure_qwds, question) and 'check' in types:
            question_type = 'check_disease'
            question_types.append(question_type)

        #　症状防御
        if self.check_words(self.prevent_qwds, question) and 'disease' in types:
            question_type = 'disease_prevent'
            question_types.append(question_type)

        # 疾病医疗周期
        if self.check_words(self.lasttime_qwds, question) and 'disease' in types:
            question_type = 'disease_lasttime'
            question_types.append(question_type)

        # 疾病治疗方式
        if self.check_words(self.cureway_qwds, question) and 'disease' in types:
            question_type = 'disease_cureway'
            question_types.append(question_type)

        # 疾病治愈可能性
        if self.check_words(self.cureprob_qwds, question) and 'disease' in types:
            question_type = 'disease_cureprob'
            question_types.append(question_type)

        # 疾病易感染人群
        if self.check_words(self.easyget_qwds, question) and 'disease' in types :
            question_type = 'disease_easyget'
            question_types.append(question_type)

        # 若没有查到相关的外部查询信息，那么则将该疾病的描述信息返回
        if question_types == [] and 'disease' in types:
            question_types = ['disease_desc']

        # 若没有查到相关的外部查询信息，那么则将该疾病的描述信息返回
        if question_types == [] and 'symptom' in types:
            question_types = ['symptom_disease']

        # 将多个分类结果进行合并处理，组装成一个字典
        data['question_types'] = question_types

        return data

来到真正的重头戏，前面都是些小菜，为这个服务的。首先对问题进行处理，得到medical_dict字典，这里我们在前面介绍了，里面的键值对前面是具体，后面是属性。将这些内容保存到data字典的args键名下。然后对medical_dict中出现的属性名进行汇总。首先是对症状进行分类，如果我们之前预定义好的询问症状的话语出现在问题当中而且问题当中确实出现了和disease属性的词（我感觉这里像是一道双保险），则我们可以认为这个问题是已知疾病想询问这个疾病都会有哪些症状的类型，然后以此类推，还有已知症状推测疾病的类型、已知疾病想知道成因、已知疾病想知道并发症、已知疾病想知道什么该吃什么不该吃、已知服务想知道什么病可以吃什么病不可以吃、已知疾病想知道吃什么药、已知药品想知道可以治什么病、已知疾病需要做什么检查项目、已知检查项目查响应疾病、已知疾病想知道该如何预防、已知疾病想知道要治多久、已知疾病想知道该如何治疗、已知疾病想知道治愈的可能性，已知疾病想知道易感人群。若以上情况都没有出现且问题中仍然出现了疾病名，则将该疾病的具体信息返回；若以上情况都没有出现但问题中确实出现了症状名，则将其归类到已知症状推测疾病这一类。将这些问题的类型都保存到data当中，并返回。

运行结果

运行结果po出来给大家看一下，数据来源是我自己爬取的部分数据。看结果之后很满意但是也没有那么满意，比如医生出现在症状当中，23出现在药品当中，这些可能是当初的数据来源不干净，爬虫做的数据处理也出现了一些问题，但是由于没有刘老师爬取的源文件，所以没有办法对其进行完全的改造，这个参考一下，日后如果有需要进行改进。