Day 217 (the jieba library and text word-frequency counting)

[May 16, 2018] [217 consecutive days]

Title: The jieba library and text word-frequency counting

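Content:
A. The jieba library: a powerful third-party library for Chinese word segmentation.

It offers three modes: precise mode, full mode, and search-engine mode.

1. jieba.lcut(s): precise mode; returns the segments of s as a list, with no overlapping words.

2. jieba.lcut(s, cut_all=True): full mode; returns every word that can be found in s, so the segments overlap.

3. jieba.lcut_for_search(s): search-engine mode; starts from precise mode and additionally splits long words into shorter ones.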

 

B. Text word-frequency counting:
English version, using the number of times each word occurs in Hamlet as the example:
 

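# CalHamletV1.py
# Count word frequencies in Hamlet and print the ten most common words.
def getText():
    # Read the whole file; the with-block closes it once the read finishes.
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so that split() yields clean words.
    for ch in "!@#$%^&*()_+-;'./`~?\\|{}[],.<>":
        txt = txt.replace(ch, " ")
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort by count, most frequent first.
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
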

Result:
the        1143
and         966
to          762
of          668
i           630
a           546
you         540
my          514
hamlet      466
in          450
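
The same counting step can also be written with the standard library's collections.Counter. The sketch below is an alternative, not the version used above: the function name top_words and the sample string are made up for illustration, and it strips every ASCII punctuation character (string.punctuation) rather than the hand-written list, so counts on the real file could differ slightly.

```python
from collections import Counter
import string


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # most_common() sorts by count, highest first.
    return Counter(words).most_common(n)


# Hypothetical sample text; hamlet.txt is not needed to try the function.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter replaces the counts.get(word, 0) + 1 idiom and the manual sort, at the cost of hiding how the dictionary is built.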

Chinese version, using how often each character's name appears in Romance of the Three Kingdoms as the example:
 

# CalThreeKingdomsV2.py
# Count how often each character is named in Romance of the Three Kingdoms.
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words that are not character names; they are removed before ranking.
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何", "商议",
            "军士", "军马", "左右", "引兵", "次日", "主公", "大喜", "天下", "东吴",
            "于是", "今日", "不敢", "魏兵", "陛下", "一人", "人马", "不知"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens, then merge aliases into one canonical name.
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    elif word == "都督":
        rword = "周瑜"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    # pop() with a default avoids a KeyError if an excluded word never occurred.
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))

Result:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\WIN10\AppData\Local\Temp\jieba.cache
Loading model cost 1.561 seconds.
Prefix dict has been built succesfully.
曹操         1451
孔明         1383
刘备         1252
关羽          784
周瑜          438
张飞          358
吕布          300
赵云          278
孙权          264
司马懿         221
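
The chain of elif branches that merges aliases grows with every new name. One possible alternative is a lookup table, sketched below. The names ALIASES, EXCLUDES, and count_names are hypothetical, the exclude set is shortened, and the hand-written tokens list only stands in for the output of jieba.lcut(txt) so the sketch runs without the novel's text.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif chain in CalThreeKingdomsV2.py.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
    "都督": "周瑜",
}
# Shortened exclude set, for illustration only.
EXCLUDES = {"将军", "却说"}


def count_names(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count multi-character words after merging aliases and dropping excludes."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue  # single characters are never full names
        name = aliases.get(word, word)  # fall back to the word itself
        if name in excludes:
            continue
        counts[name] += 1
    return counts


# Hand-written tokens standing in for jieba.lcut(txt).
tokens = ["玄德", "曰", "孔明", "诸葛亮", "将军", "刘备", "丞相"]
for name, count in count_names(tokens).most_common():
    print("{0:<10}{1:>5}".format(name, count))
```

Adding a new alias then means adding one dictionary entry instead of another branch.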
