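May 16, 2018 [Day 217 in a row]
Title: The jieba library and text word-frequency counting
Contents:
A. The jieba library: a powerful third-party library for Chinese word segmentation.
It offers three modes: exact mode, full mode, and search engine mode: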
1. jieba.lcut(s)  (exact mode)
2. jieba.lcut(s, cut_all=True)  (full mode)
3. jieba.lcut_for_search(s)  (search engine mode)
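The three calls can be compared side by side on one sentence. The sketch below is only an illustration and not part of the original scripts: the helper name segment_three_ways is made up for the example, and jieba is imported inside the function so that the file still loads on a machine where jieba is not installed.

```python
def segment_three_ways(s):
    """Segment s with each of jieba's three modes and return the token lists."""
    import jieba  # third-party library: pip install jieba

    return {
        # Exact mode: tokens do not overlap and together cover the text.
        "exact": jieba.lcut(s),
        # Full mode: lists every dictionary word found, so tokens may overlap.
        "full": jieba.lcut(s, cut_all=True),
        # Search engine mode: exact mode, with long words split further.
        "search": jieba.lcut_for_search(s),
    }
```

Calling segment_three_ways on a short Chinese sentence returns a dict with the keys "exact", "full", and "search", each holding a list of strings. The exact token lists depend on the jieba version and dictionary, so none are shown here.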
B. Text word-frequency counting:
English version, using the number of times each word occurs in Hamlet as an example:
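#CalHamletV1.py
def getText():
    # Read the whole play and lowercase it so "The" and "the" count together.
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so split() sees clean word boundaries.
    for ch in "!@#$%^&*()_+-;'./`~?\\|{}[],.<>":
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
# Tally how many times each word occurs.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
# Sort by count, highest first, and print the top 10.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))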
Result:
the 1143
and 966
to 762
of 668
i 630
a 546
you 540
my 514
hamlet 466
in 450
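The dictionary tally in CalHamletV1.py can also be written with the standard library's collections.Counter. This is a minimal sketch under two assumptions: it runs on a short inline sample instead of hamlet.txt, and it strips every character in string.punctuation, which is a slightly larger set than the one listed in the script. The function name top_words is made up for the example.

```python
from collections import Counter
import string


def top_words(text, n=10):
    """Return the n most common lowercase words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # most_common sorts by count, highest first.
    return Counter(words).most_common(n)


sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common replaces the explicit list, sort, and range(10) steps, and it does not fail when the text has fewer than n distinct words.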
Using how often each character's name appears in Romance of the Three Kingdoms (三国演义) as an example:
#CalThreeKingdomsV2.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words that are not character names and should be left out.
excludes = {"将军","却说","荆州","二人","不可","不能","如此","如何","商议",
            "军士","军马","左右","引兵","次日","主公","大喜","天下","东吴",
            "于是","今日","不敢","魏兵","陛下","一人","人马","不知"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single characters; merge aliases of one person under one name.
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    elif word == "都督":
        rword = "周瑜"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# pop() with a default avoids a KeyError if an excluded word never occurred.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Result:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\WIN10\AppData\Local\Temp\jieba.cache
Loading model cost 1.561 seconds.
Prefix dict has been built succesfully.
曹操 1451
孔明 1383
刘备 1252
关羽 784
周瑜 438
张飞 358
吕布 300
赵云 278
孙权 264
司马懿 221
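The alias handling in CalThreeKingdomsV2.py can be kept in a lookup table instead of an elif chain, which makes new aliases a one-line change. The sketch below is one possible arrangement, not the original script: the names ALIASES and count_names are made up, and a small hand-written token list stands in for the output of jieba.lcut so that the example runs without jieba or threekingdoms.txt.

```python
from collections import Counter

# Alias table copied from the elif chain in CalThreeKingdomsV2.py.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
    "都督": "周瑜",
}


def count_names(words, excludes=frozenset()):
    """Count tokens after merging aliases, skipping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # single characters are not names
        name = ALIASES.get(word, word)  # fall back to the token itself
        if name not in excludes:
            counts[name] += 1
    return counts


# Hypothetical token list standing in for jieba.lcut(txt).
tokens = ["玄德", "曰", "丞相", "玄德曰", "将军", "曹操"]
for name, count in count_names(tokens, excludes={"将军"}).most_common():
    print("{0:<10}{1:>5}".format(name, count))
```

Filtering excluded words while counting gives the same totals as deleting them afterwards, and it needs no special case for excluded words that never occur.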