Python Learning Path: NLP (Character Name Extraction)

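Goals:

  • Read one of the Four Great Classical Novels
  • Extract person names chapter by chapter and rank them by frequency
  • Merge all per-chapter rankings into one final result dictionary
  • Display the final result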

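Problems encountered:

  • [listdir returns files in arbitrary order] Fix: sort the files by st_mtime. I thought of this as the creation time, but Linux does not provide a creation time here; st_mtime is the last-modification time. The sort therefore only matches chapter order if the files were written in chapter order.
  • file_names.sort(key = lambda x:os.stat(dir + '/'+x).st_mtime)

  • [Error while parsing files] On my Mac, parsing always failed on the last file. Debugging showed that the script was reading the system-generated .DS_Store file. Fix: check the file name with a regex and only accept names that start with a digit, ^[0-9]
  • # Only chapter files start with a digit; skip mydict.txt and .DS_Store
            if re.findall("^[0-9]", file.split('/')[-1]) == []:
                continue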
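The two fixes above can also be combined into a single step. The following is a minimal sketch, not the author's code: it assumes the chapter files are named with a leading chapter number (names such as `1.txt` are a guess; the post only says chapter names start with a digit), keeps only those names, and sorts them numerically instead of relying on modification time. `chapter_files` is a hypothetical helper name.

```python
import re

def chapter_files(file_names):
    """Keep names that start with a digit and sort them by that leading number."""
    numbered = []
    for name in file_names:
        m = re.match(r"\d+", name)  # leading digits, e.g. "10" in "10.txt"
        if m:
            numbered.append((int(m.group()), name))
    # Sorting the (number, name) pairs gives numeric chapter order
    return [name for _, name in sorted(numbered)]

# Assumed example listing; .DS_Store and mydict.txt are dropped
names = ["10.txt", ".DS_Store", "2.txt", "mydict.txt", "1.txt"]
print(chapter_files(names))  # ['1.txt', '2.txt', '10.txt']
```

Sorting on the parsed number avoids depending on file timestamps, which change whenever a chapter file is copied or edited.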

     

Full code:

import os
import jieba.posseg as psg
import jieba
import time
import re
novel = 'hlm'
dir = './Text/'+novel
#jieba.load_userdict(dir+'/mydict.txt')

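# Merge the per-chapter [(name, count), ...] lists into one {name: count} dict
def combineListToDic(chapter_lists):
    # Always return a dict, even for zero or one chapter,
    # because main() calls .items() on the result
    temp = {}
    for l in chapter_lists:
        for k in l:
            temp = sumDic(temp,{k[0]:k[1]})
    return temp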

def main():
    # Get the file names
    file_names = os.listdir(dir)
    # listdir returns the files in arbitrary order, so sort them by
    # last-modification time (here the files were written in chapter order)
    # file_names.sort(key = compare) would also work
    file_names.sort(key = lambda x:os.stat(dir + '/'+x).st_mtime)
    #print(file_names)
    # Prepend the directory to each file name
    file_list = [os.path.join(dir, file) for file in file_names]
    #print(file_list)
    # name_countList holds each chapter's own [(name, count), ...] list
    name_countList = []
    for file in file_list:
        # Only chapter files start with a digit; skip mydict.txt and .DS_Store
        if re.findall("^[0-9]", os.path.basename(file)) == []:
            continue
        time_start = time.time()  # start timing
        # Close the file as soon as it has been read
        with open(file, encoding='utf-8') as f:
            text = f.readlines()
        # Segment the text and rank the names
        name_count = getNameCountListFromText(text)
        name_countList.append(name_count)
        time_end = time.time()  # stop timing
        time_c = time_end - time_start  # elapsed time in seconds
        print(time_c,file,name_count)
    # Print every chapter's [(name, count), ...] list
    for n in name_countList:
        print(len(n), n)
    # Merge the per-chapter counts
    name_countListDic =  combineListToDic(name_countList)
    # Sort by count, descending
    name_count_total = sorted(name_countListDic.items(), key=lambda x: x[1], reverse=True)
    print('======== Overall name frequency ranking ========')
    print(name_count_total)

# Segment the text, extract person names and rank them by frequency
def getNameCountListFromText(text):
    # Debug helper: print each (word, POS tag) pair
    # for t in text:
    #     res = psg.cut(t)
    #     print([(item.word, item.flag) for item in res])

    # Count names
    counts = {}
    for t in text:
        # Skip blank lines; do not remove them from text here, because
        # removing items while iterating makes the loop skip the next line
        if t.strip() == '':
            continue
        res = psg.cut(t)
        for item in res:
            # jieba tags person names as 'nr'
            if item.flag == 'nr':
                counts[item.word] = counts.get(item.word, 0) + 1
    #print(counts)

    # Sort by count, descending
    name_count = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    return name_count

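# Sort key: last-modification time, matching the lambda used in main()
# (on Unix, st_ctime is the inode-change time, not the creation time)
def compare(x):
    return os.stat(dir + '/'+x).st_mtime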

# Merge two dicts, adding the counts of keys that appear in both
def sumDic(dict1,dict2):
    temp = dict()
    # In Python 3, dict_keys behaves like a set; | is the union
    for key in dict1.keys() | dict2.keys():
        # Adjust the line below to suit your own needs
        temp[key] = sum([d.get(key, 0) for d in (dict1, dict2)])
    return temp

if __name__ == '__main__':
    main()
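The merge step in `combineListToDic` calls `sumDic` once per (name, count) pair, and each call rebuilds the whole accumulated dictionary, so the cost grows with the size of the result. As a hedged alternative, here is a minimal sketch using `collections.Counter` from the standard library, which updates counts in place. `merge_counts` and the sample data are hypothetical and not part of the original script.

```python
from collections import Counter

def merge_counts(chapter_lists):
    """Sum per-chapter [(name, count), ...] lists into a single Counter."""
    total = Counter()
    for chapter in chapter_lists:
        for name, count in chapter:
            total[name] += count  # in-place update, no dictionary rebuild
    return total

# Made-up sample data in the same shape as name_countList
chapters = [[("宝玉", 3), ("黛玉", 1)], [("宝玉", 2), ("贾母", 4)]]
print(merge_counts(chapters).most_common())
# [('宝玉', 5), ('贾母', 4), ('黛玉', 1)]
```

`most_common()` returns the pairs sorted by count in descending order, so it would also replace the final `sorted(...)` call.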

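Process:

  • Each chapter takes roughly 400~600 ms to process, but there are many chapters, so the single-threaded run is slow overall. Next time I will look into multithreading [updated]
  • The final merge-and-sort pass is slow. Optimization is not my focus here, so I am ignoring it for now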
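One possible shape for the concurrent version mentioned above is sketched here with `concurrent.futures` from the standard library. This is an assumption about how it could be done, not the author's updated code. `count_chapters` and `fake_count` are hypothetical names; `fake_count` stands in for the real per-chapter work (reading a file and running `psg.cut`) so that the sketch runs without jieba or the novel's text.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chapters(paths, count_one, executor_cls=ThreadPoolExecutor, workers=4):
    """Run count_one(path) for every path concurrently and sum the results."""
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        # pool.map yields results in the same order as paths
        for chapter_counts in pool.map(count_one, paths):
            total.update(chapter_counts)  # adds counts, does not overwrite
    return total

# Hypothetical stand-in for the per-chapter name counting
def fake_count(path):
    return Counter({"宝玉": 2, path: 1})

result = count_chapters(["1.txt", "2.txt"], fake_count)
print(result["宝玉"])  # 4
```

Because jieba's segmentation is CPU-bound Python code, threads in CPython are likely to give little speedup. Passing `ProcessPoolExecutor` as `executor_cls`, with the call placed under an `if __name__ == '__main__':` guard, is probably the variant worth measuring.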

Final result:

[('宝玉', 3737), ('贾母', 1250), ('凤姐', 1215), ('黛玉', 1074), ('王夫人', 975), ('老太太', 965), ('宝钗', 753), ('贾琏', 682), ('薛姨妈', 445), ('贾政', 434), ('探春', 429), ('紫鹃', 411), ('凤姐儿', 405), ('李', 388), ('小丫头', 292), ('贾珍', 287), ('邢夫人', 274), ('尤氏', 260), ('贾', 243), ('薛蟠', 238), ('刘老老', 229), ('明白', 213), ('贾蓉', 189), ('贾政道', 171), ('周瑞家', 171), ('连', 167), ('言语', 158), ('惜春', 151), ('贾芸', 141), ('迎春', 141), ('赵姨娘', 134), ('林之孝', 134), ('林姑娘', 130), ('金桂', 128), ('贾母笑', 127), ('妙玉', 126), ('冷笑', 122), ('和尚', 119), ('薛', 117), ('老婆子', 111), ('贾环', 109), ('宝钗笑', 107), ('宝二爷', 104), ('雪雁', 100), ('宝琴', 99), ('林妹妹', 89), ('黛玉笑', 89), ('老祖宗', 84), ('宝', 81), ('宝姐姐', 81), ('芳', 70), ('贾母王', 69), ('府', 69), ('秦氏', 67), ('秦钟', 67), ('大老爷', 64), ('司棋', 62), ('秋纹', 61), ('那丫头', 60), ('冯紫英', 60), ('玉', 59), ('雨村', 59), ('刘老', 58), ('凤丫头', 58), ('小姐', 58), ('士隐', 57), ('北静王', 56), ('宝兄弟', 56), ('贾政听', 54), ('贾兰', 54), ('邢王二', 53), ('巧姐', 53), ('贾母因', 49), ('元妃', 49), ('贾瑞', 48), ('金钏儿', 48), ('尤二姐', 48), ('玉钏儿', 48), ('巧姐儿', 47), ('包勇', 46), ('从小儿', 46), ('那婆子', 45), ('况', 44), ('王爷', 44), ('小红', 43), ('那宝玉', 43), ('丰儿', 41), ('何曾', 41), ('老嬷嬷', 40), ('贾府', 40), ('史湘云', 37), ('祖宗', 37), ('甄宝玉', 36), ('金陵', 36), ('鲍二', 35), ('旺儿', 35), ('李嬷嬷', 34), ('安静', 33), ('贾芹', 33), ('向宝玉', 33), ('那黛玉', 33), ('宁府', 33), ('张华', 33), ('孙子', 32), ('王', 30), ('金荣', 30), ('王仁', 30), ('王善保', 30), ('张罗', 30), ('平儿忙', 30), ('宝丫头', 30), ('周瑞', 30), ('李婶娘', 29), ('桂花', 29), ('秋菱', 29), ('尤老娘', 28), ('尤氏笑', 28), ('赖', 27), ('倪二', 27), ('金', 27), ('梅花', 27), ('蒋玉函', 27), ('贾政笑', 27), ('李十儿', 27), ('贾政又', 26), ('秋桐', 26), ('凤姐姐', 26), ('问宝玉', 26), ('赖大', 26), ('老先生', 25), ('张道士', 25), ('黛', 25), ('甄家', 25), ('李贵', 25), ('么', 23), ('马道婆', 23), ('贾妃', 23), ('宁可', 23), ('贾芸道', 23), ('齐全', 23), ('李纹', 22), ('代儒', 22), ('贾政便', 22), ('黛玉忙', 22), ('雪雁道', 22), ('贾家', 22), ('春燕', 22), ('那玉', 22), ('薛家', 22), ('荣国府', 21), ('贾琏便', 21), ('道谢', 21), ('贾母房', 21), ('史', 20), ('陈设', 20), ('邢岫烟', 20), ('贾蔷', 20), ('宝钗因', 19), 
('水月庵', 19), ('宝钗见', 19), ('尤三姐', 18), ('胡闹', 18), ('贾母忙', 18), ('李氏', 18), ('谢恩', 18), ('王太医', 18), ('贾雨村', 18), ('王子腾', 18), ('相公', 17), ('小太监', 17), ('小红道', 17), ('林黛玉', 17), ('鲍', 17), ('王一贴', 16), ('莫若', 16), ('贾琏忙', 16), ('鸾', 16), ('贾宝玉', 16), ('齐备', 16), ('柳湘莲', 16), ('林丫头', 16), ('听宝玉', 16), ('明白人', 16), ('胡', 15), ('云儿', 15), ('任凭', 15), ('向黛玉', 15), ('邢姑娘', 15), ('小么儿', 15), ('金玉', 15), ('贾母问', 15), ('凤', 15), ('王大夫', 15), ('宝钗忙', 15), ('玉宝钗', 14), ('周', 14), ('向宝钗', 14), ('但凡', 14), ('轩', 14), ('荣府', 14), ('马', 14), ('贾政回', 14), ('耶', 14), ('庄子', 14), ('贾政因', 14), ('贾母命', 14), ('孙女儿', 14), ('贾环贾', 14), ('茜', 14), ('贾政忙', 13), ('老妈妈', 13), ('钱粮', 13), ('邢大舅', 13), ('宝贝', 13), ('佩凤', 13), ('兰儿', 13), ('卿', 13), ('赵堂官', 13), ('英莲', 13), ('天亮', 13), ('贾珍忙', 13), ('孙', 13), ('贾珍笑', 13), ('贾政正', 13), ('子孙', 13), ('赵嬷嬷', 13), ('蓼', 13), ('吴新登', 12), ('兰', 12), ('和宝钗', 12), ('薛大爷', 12), ('赖嬷嬷', 12), ('那道人', 12), ('恩', 12), ('宗祠', 12), ('贾母喜', 12), ('殷勤', 12), ('平儿见', 12), ('宫', 12), ('明贾母', 12), ('祖母', 12), ('翠墨', 12), ('周全', 12), ('宁荣二', 12), ('青埂峰', 12), ('贾蔷道', 12), ('向凤姐', 12), ('贾敬', 12), ('贾母正', 11), ('赖尚荣', 11), ('贾府中', 11), ('阎王', 11), ('别提', 11), ('宝妹妹', 11), ('毛', 11), ('张三', 11), ('贾环见', 11), ('古董', 11), ('夏', 11), ('俞禄', 11), ('太妃', 11), ('那道士', 11), ('小老婆', 11), ('荣宁', 11), ('巴巴儿', 11), ('向平儿', 11), ('老尼', 11), ('老世翁', 11), ('周姨娘', 10), ('那凤姐', 10), ('天恩', 10), ('贾政进', 10), ('贾珍便', 10), ('尼姑', 10), ('林四娘', 10), ('文武', 10), ('王府', 10), ('贾瑞道', 10), ('时宝钗', 10), ('林姐姐', 10), ('赖升', 10), ('赵', 10), ('薛蟠自', 10), ('冯大爷', 10), ('兰哥儿', 10), ('二叔叔', 10), ('呼唤', 10), ('平姐姐', 10), ('贺喜', 10), ('贾环便', 9), ('宝二叔', 9), ('小道士', 9), ('甄应嘉', 9), ('贾蓉忙', 9), ('山子石', 9), ('栗子', 9), ('贾政知', 9), ('姬', 9), ('甄士隐', 9), ('尔', 9), ('贾母薛', 9), ('薛蟠笑', 9), ('詹光', 9), ('红梅', 9), ('薛大哥', 9), ('子儿', 9), ('和黛玉', 9), ('翁', 9), ('子兴道', 9), ('贾珍贾', 9), ('伯', 9), ('贾政叫', 9), ('宁荣', 9), ('石青', 9), ('薛蟠见', 8), ('鬼混', 8), ('甄老爷', 8), ('牛黄', 8), ('那僧道', 8), ('贾兰道', 8), ('藏躲', 8), 
('凤凰', 8), ('乌进孝', 8), ('林如海', 8), ('程日兴', 8), ('宁府中', 8), ('秦', 8), ('孔雀', 8), ('夏家', 8), ('小名儿', 8), ('那秦钟', 8), ('张德辉', 8), ('云丫头', 8), ('玉微微', 8), ('清香', 8), ('托生', 8), ('喇', 8), ('秋纹笑', 8), ('老公', 8), ('斯文', 8), ('妙玉笑', 8), ('荣华', 8), ('西平王', 8), ('宝钗进', 8), ('施礼', 8), ('花柳', 8), ('卜世仁', 8), ('王子', 8), ('张王氏', 8), ('贾琏贾', 7), ('凤姐命', 7), ('千秋', 7), ('贾芸笑', 7), ('关了门', 7), ('吴良', 7), ('安逸', 7), ('诸公', 7), ('承', 7), ('贾大人', 7), ('小姑子', 7), ('阳', 7), ('齐声', 7), ('白石', 7), ('贾琮', 7), ('贾政问', 7), ('梅', 7), ('向贾珍', 7), ('金麒麟', 7), ('寿星', 7), ('银红', 7), ('贾政心', 7), ('金刚', 7), ('凤姐忙', 7), ('冯渊', 7), ('贾政叹', 7), ('赵姨奶奶', 7), ('妙师父', 7), ('宝钗黛', 7), ('芦雪庭', 7), ('詹光道', 7), ('凤姑娘', 7), ('小姑娘', 7), ('李妈', 7), ('老伯', 7), ('谢了恩', 7), ('花大姐姐', 7), ('贾政命', 7), ('贾菌', 7), ('小红笑', 7), ('贾宅', 7), ('王妃', 7), ('高明', 7), ('紫菱洲', 7), ('墨雨', 7), ('戴权', 7), ('贾母方', 7), ('寻思', 7), ('倪家', 7), ('贾母素', 7), ('王法', 7), ('冷汗', 7), ('司', 7), ('西风', 7), ('朱', 7), ('那僧', 7), ('禄', 7), ('甄夫人', 7), ('蒙圣恩', 6), ('周姐姐', 6), ('尤氏李', 6), ('隆恩', 6), ('红香圃', 6), ('黄', 6), ('薛公子', 6), ('紫檀', 6), ('郎中', 6), ('和宝琴', 6), ('林大娘', 6), ('老三', 6), ('毛丫头', 6), ('燕子', 6), ('荣国公', 6), ('贾政先', 6), ('神瑛侍者', 6), ('封书子', 6), ('宝钗正', 6), ('璜', 6), ('太老爷', 6), ('孔子', 6), ('年庚', 6), ('花香', 6), ('贾珍方', 6), ('有凤来仪', 6), ('裘', 6), ('依允', 6), ('冯家', 6), ('孙行者', 6), ('毛半仙', 6), ('贾珠', 6), ('宝钗心', 6), ('周大娘', 6), ('小性儿', 6), ('柳二爷', 6), ('寿礼', 6), ('胡君荣', 6), ('宝珠', 6), ('惜', 6), ('黄汤', 6), ('贾蓝', 6), ('孙女', 6), ('惜春笑', 6), ('宋妈妈', 6), ('灵', 6), ('秦邦业', 6), ('忠', 6), ('花名册', 6), ('兰花', 6), ('那五儿', 6), ('肯依', 6), ('亦且', 6), ('甄府', 6), ('侯', 6), ('桂', 6), ('孙绍祖', 6), ('甄', 6), ('张口', 6), ('通灵玉', 6), ('夏婆子', 6), ('龙禁尉', 6), ('玉石', 6), ('静静', 6), ('和睦', 6), ('宝钗素', 6), ('唐突', 6), ('方笑', 6), ('俊', 6), ('宋妈', 6), ('杨', 6), ('李宫裁', 5), ('贾门', 5), ('邢妹妹', 5), ('冷清清', 5), ('秋水', 5), ('和凤姐', 5), ('通灵宝玉', 5), ('史妹妹', 5), ('和芳官', 5), ('安郡王', 5), ('李贵忙', 5), ('那紫鹃', 5), ('谢礼', 5), ('钟爱', 5), ('王公', 5), ('那宝钗', 5), ('封锁', 5), ('秋爽斋', 5), 
('那太监', 5), ('宝二', 5), ('洪福', 5), ('须眉', 5), ('那巧姐儿', 5), ('薛蟠忙', 5), ('北静郡王', 5), ('薛宝钗', 5), ('花姑娘', 5), ('老东西', 5), ('柴胡', 5), ('楠', 5), ('白操', 5), ('香玉', 5), ('金刚经', 5), ('宝哥儿', 5), ('陈', 5), ('贾母才', 5), ('光辉', 5), ('贾琏宝玉', 5), ('千金小姐', 5), ('唐', 5), ('小人儿', 5), ('云雨', 5), ('老虎', 5), ('白玉', 5), ('小叔子', 5), ('王老爷', 5), ('金文翔', 5), ('顾', 5), ('贾珍之', 5), ('古圣贤', 5), ('周嫂子', 5), ('秋风', 5), ('金氏', 5), ('胡子', 5), ('侍', 5), ('秋香色', 5), ('那僧笑', 5), ('贾母先', 5), ('张', 5), ('隆儿', 5), ('美玉', 5), ('周旋', 5), ('向紫鹃', 5), ('芹', 5), ('黛玉正', 5), ('丁忧', 5), ('秦钟笑', 5), ('玉湘云', 5), ('红颜', 5), ('吴大人', 5), ('花园子', 5), ('平儿先', 5), ('贾菱', 5), ('凤姐笑', 5), ('张嘴', 4), ('孟子', 4), ('莫如', 4), ('那小子', 4), ('贾兰便', 4), ('金哥', 4), ('傅秋芳', 4), ('荣禧堂', 4), ('祖', 4), ('贾蔷忙', 4), ('帕子', 4), ('金钏', 4), ('楚', 4), ('冷子兴', 4), ('贾琼', 4), ('贾化', 4), ('沙弥', 4), ('燕', 4), ('老二', 4), ('过荣府', 4), ('包儿', 4), ('薛妹妹', 4), ('红香绿', 4), ('周贵妃', 4), ('贾母灵', 4), ('小爷', 4), ('宝石', 4), ('那小红', 4), ('富丽', 4), ('汝', 4), ('鸿蒙', 4), ('和暖', 4), ('伏侍宝玉', 4), ('翠竹', 4), ('春燕笑', 4), ('恕', 4), ('侍卫', 4), ('恩爱', 4), ('黛玉方', 4), ('孔', 4), ('贾琏拿', 4), ('贾政贾', 4), ('宝鉴', 4), ('贾政才', 4), ('紫英', 4), ('都晓', 4), ('小学生