import jieba

# Read the novel and the whitespace-separated list of character names
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
with open("names of ThreeKingdoms.txt", "r", encoding="utf-8") as f:
    names = set(f.read().split())  # a set makes the membership test fast

# Segment the text and count only the words that are character names
words = jieba.lcut(txt)
counts = {}
for word in words:
    if word in names:
        counts[word] = counts.get(word, 0) + 1

# Sort by count, highest first, and print the top 15 (or fewer if there are not 15)
items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))
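threekingdoms.txt is an electronic copy of Romance of the Three Kingdoms, and names of ThreeKingdoms.txt lists every character who appears in the novel.
The script relies on the jieba library for Chinese word segmentation.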
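The filter-count-rank step can also be written with collections.Counter from the standard library. The following is only a minimal sketch: the helper name top_names and the tiny token list are hypothetical stand-ins for the output of jieba.lcut(txt), not real counts from the novel.

```python
from collections import Counter


def top_names(tokens, names, n=15):
    """Return the n most frequent tokens that also appear in names."""
    name_set = set(names)  # set lookup instead of scanning a list
    counts = Counter(tok for tok in tokens if tok in name_set)
    return counts.most_common(n)  # (word, count) pairs, highest count first


# Hypothetical sample data standing in for jieba.lcut(txt)
tokens = ["曹操", "说", "刘备", "曹操", "的", "关羽", "刘备", "曹操"]
names = ["曹操", "刘备", "关羽", "张飞"]

for word, count in top_names(tokens, names, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

Because most_common(n) returns at most n pairs, this version does not fail when fewer than n names are found.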
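A brief introduction to jieba
Precise mode: splits the text exactly, with no redundant tokens. jieba.lcut(s)
Full mode: returns every possible word found in the text, so tokens may overlap. jieba.lcut(s, cut_all=True)
Search engine mode: starts from precise mode and splits long words again. jieba.lcut_for_search(s)
Adding a new word to the jieba dictionary: jieba.add_word(w)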
By this count, the character mentioned most often in Romance of the Three Kingdoms is Cao Cao (曹操).