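Python: Using the jieba Library
I. Installing jieba
1. Press Win+R, type cmd, and press Enter to open a command prompt.
2. In the command prompt, type pip install jieba and press Enter; the package installs automatically.
3. When the installation finishes, type python and press Enter, then type import jieba. If no error appears, the library is installed correctly.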
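The check in step 3 can also be done from a script. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of jieba.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("jieba"):
    print("jieba is available")
else:
    print("jieba not found; run: pip install jieba")
```

This only looks the package up and does not import it, so it works whether or not the installation succeeded.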
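II. Common jieba segmentation functions
Function | Description |
jieba.cut(s) | Precise mode; returns an iterable (a generator) |
jieba.cut(s,cut_all=True) | Full mode; yields every possible word found in the text s |
jieba.cut_for_search(s) | Search-engine mode; produces segmentation suited to building a search index |
jieba.lcut(s) | Precise mode; returns a list (recommended) |
jieba.lcut(s,cut_all=True) | Full mode; returns a list (recommended) |
jieba.lcut_for_search(s) | Search-engine mode; returns a list (recommended) |
jieba.add_word(w) | Adds the new word w to the segmentation dictionary |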
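To compare the three modes side by side, a small helper along these lines may be useful. It is only a sketch: the function name `segment_three_ways` and its `new_words` parameter are hypothetical, and only the jieba calls listed in the table above are used.

```python
def segment_three_ways(text, new_words=()):
    """Segment `text` in precise, full, and search-engine mode.

    `new_words` is an optional iterable of words to register with
    jieba.add_word before segmenting (a parameter made up for this sketch).
    """
    # jieba is a third-party package; it is imported here so that the
    # helper can be defined even where jieba is not installed yet.
    import jieba

    for w in new_words:
        jieba.add_word(w)

    return {
        "precise": jieba.lcut(text),               # non-overlapping tokens
        "full": jieba.lcut(text, cut_all=True),    # all candidate words, may overlap
        "search": jieba.lcut_for_search(text),     # extra splits of long words
    }
```

For example, `segment_three_ways("中华人民共和国是一个伟大的国家")` returns a dictionary with one token list per mode. In precise mode the tokens joined together reproduce the input text, while full and search-engine mode can contain overlapping words.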
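III. Practical application: counting word frequencies in a text
Character counts in Dream of the Red Chamber (《红楼梦》). Write a program that finds the 20 characters who are mentioned most often in the novel.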
import jieba

# Read the full text of the novel (adjust the path to wherever your copy is stored)
with open("D:/红楼梦.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# High-frequency words that are not character names
excludes={"什么","一个","我们","那里","你们","如今","知道",
"起来","说道","姑娘","这里","出来","他们","众人","奶奶","自己",
"一面","只见","怎么","两个","没有","不是","不知","这个","听见",
"这样","进来","咱们","告诉","就是","东西","回来","只是","大家",
"老爷","只得","丫头","这些","不敢","出去","所以","不过","的话",
"不好","姐姐","一时","不能","过来","心里","二爷","如此","今日",
"银子","几个","答应","二人","还有","只管","这么","说话","一回",
"那边","这话","外头","打发","自然","今儿","罢了","屋里","那些",
"听说","小丫头","如何","问道","看见","妹妹","人家","不用","媳妇"}

words = jieba.lcut(txt)  # precise mode, returns a list
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters
        continue
    elif word == "王夫人" or word == "太太":  # merge different names for the same person
        rword = "王夫人"
    elif word == "贾母" or word == "老太太":
        rword = "贾母"
    elif word == "凤姐" or word == "凤姐儿":
        rword = "凤姐"
    elif word == "黛玉" or word == "林黛玉":
        rword = "黛玉"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Drop the excluded words; pop() avoids a KeyError if a word never occurred
for word in excludes:
    counts.pop(word, None)

# Sort by count, highest first, and print the top 20
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:20]:
    print("{0:<10}{1:>5}".format(word, count))
Output:
runfile('D:/python/zuoye9.3.py', wdir='D:/python')
宝玉 3556
贾母 2148
王夫人 1748
凤姐 1556
黛玉 746
贾琏 664
平儿 590
宝钗 536
袭人 516
薛姨妈 445
探春 428
鸳鸯 421
贾政 332
晴雯 305
刘姥姥 293
湘云 292
邢夫人 285
贾珍 277
紫鹃 265
香菱 257
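The counting step of the program can also be written with collections.Counter from the standard library. The version below is a sketch rather than the program that produced the results above: the function name `top_names` and the `aliases` dictionary are introduced here for illustration, and the input is a short hand-written token list instead of real jieba.lcut output, so it runs without jieba or the novel's text.

```python
from collections import Counter


def top_names(tokens, aliases, excludes, top_n=20):
    """Count tokens, merging aliases and dropping excluded words.

    tokens   -- list of already segmented words (e.g. from jieba.lcut)
    aliases  -- dict mapping an alternative name to its canonical name
    excludes -- set of words that should not appear in the result
    """
    counts = Counter()
    for token in tokens:
        if len(token) == 1:  # skip single characters
            continue
        counts[aliases.get(token, token)] += 1
    for word in excludes:
        counts.pop(word, None)  # safe even if the word never occurred
    return counts.most_common(top_n)


# Tiny hand-made sample, for illustration only
tokens = ["宝玉", "说道", "老太太", "贾母", "的", "宝玉", "林黛玉", "黛玉", "宝玉"]
aliases = {"老太太": "贾母", "林黛玉": "黛玉"}
excludes = {"说道", "什么"}

for name, count in top_names(tokens, aliases, excludes, top_n=3):
    print("{0:<10}{1:>5}".format(name, count))
```

Keeping the alias table in a dictionary means a new alternative name is added as one entry instead of another elif branch.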