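Installing jieba (Chinese word segmentation) with Anaconda + Jupyter Notebook:
The command below installs from the Douban mirror in China (faster downloads there).
Enter it at the Anaconda Prompt: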
pip install jieba -i https://pypi.douban.com/simple
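To confirm that the install worked before running the script, a check along the following lines may help. This is only a sketch using the standard library; the helper name is_installed is hypothetical and not part of jieba.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level module of that name is found
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("jieba installed:", is_installed("jieba"))
```

If this prints False in a Jupyter notebook, the notebook kernel may be using a different environment from the one where pip was run.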
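Code:
import jieba
from jieba import analyse  # imported in the original; not used below

def fenci():
    # Read the whole text as UTF-8; the with-block closes the file afterwards
    with open("E:/MyDownloads/python/anaconda/workspace/openfile/jieba_simple.txt", "r", encoding="utf-8") as f:
        tianlongbabu = f.read()
    print("------Tianlong Babu, chapter 1: %d characters in total------" % len(tianlongbabu))
    dic = {}
    resource = jieba.cut(tianlongbabu)  # generator of segmented words
    for word in resource:
        if len(word) == 1:
            continue  # skip single-character tokens (also drops punctuation)
        if word in dic:
            dic[word] += 1  # seen before: increment the count
        else:
            dic[word] = 1  # first occurrence: add it to the dictionary
    dic = list(dic.items())
    dic.sort(key=lambda x: x[1], reverse=True)  # sort by count (second field), descending
    for i in range(min(10, len(dic))):  # top 10, or fewer if the text is short
        word = dic[i][0]
        count = dic[i][1] / dic[0][1]  # count relative to the highest count
        print("-----{:<10}{:>5}".format(word, count))

if __name__ == '__main__':  # two underscores on each side, although they can look like one
    fenci()
Sample output:
------Tianlong Babu, chapter 1: 23635 characters in total------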
-----段誉 1.0
-----少女 1.0
-----司空玄 0.8461538461538461
-----左子穆 0.782051282051282
-----什么 0.782051282051282
-----钟灵 0.7435897435897436
-----无量 0.5897435897435898
-----龚光杰 0.5641025641025641
-----神农 0.5512820512820513
-----说道 0.5256410256410257
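The counting and sorting steps in the script above can also be written with collections.Counter from the standard library. The sketch below assumes the text has already been segmented (for example by jieba.cut) and works on any iterable of tokens. The function name top_relative_freq and its parameters are hypothetical, and the sample tokens are made up for illustration.

```python
from collections import Counter


def top_relative_freq(tokens, n=10, min_len=2):
    """Return up to n (word, count / max_count) pairs, most frequent first."""
    # Drop tokens shorter than min_len, mirroring the len(word) == 1 filter
    counts = Counter(t for t in tokens if len(t) >= min_len)
    top = counts.most_common(n)
    if not top:
        return []
    max_count = top[0][1]  # count of the most frequent word
    return [(word, count / max_count) for word, count in top]


if __name__ == "__main__":
    # Hand-made token list standing in for the output of jieba.cut(text)
    sample = ["段誉", "少女", "段誉", "的", "钟灵", "少女", "段誉", "了"]
    for word, ratio in top_relative_freq(sample):
        # Two decimal places keep the ratio column short
        print("-----{:<10}{:>5.2f}".format(word, ratio))
```

Counter.most_common handles both the tallying and the descending sort, and it returns fewer than n items when the text has fewer distinct words, so no explicit bounds check is needed.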