Python Chinese Word Frequency Statistics (Counting Chinese Characters in Python)

Latest recommended article published 2023-01-07 11:54:37

weixin_39850062


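Today I came across a tally of the high-frequency words in Jin Yong's novels. After a week of learning Python, I wanted to see whether I could produce the same kind of statistics myself.

The code below was found online; I reordered it and stitched the pieces together. The word segmentation library is jieba (结巴分词).

It also works around the problem in Python 2.7 where Chinese text inside a dictionary is displayed as garbled escape sequences.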
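The display problem mentioned above arises because printing a container shows escaped forms of non-ASCII strings rather than the characters themselves. The workaround used in this post is to serialize the result with `json.dumps` and `ensure_ascii=False`. Here is a minimal sketch of the effect, written in Python 3 syntax for convenience; the sample data is made up for illustration.

```python
import json

# Hypothetical sample data: (word, count) pairs like Counter.most_common() returns.
top_words = [("江湖", 12), ("武功", 9)]

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(top_words)

# ensure_ascii=False keeps the Chinese characters readable in the output.
readable = json.dumps(top_words, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-screen representation differs.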

Segmentation code: https://github.com/imwilsonxu/mao

Frequency counting: https://github.com/aolingwen/0006

jieba segmentation: https://github.com/fxsjy/jieba

# -*- coding: utf-8 -*-
# Written for Python 2.7 (print statement, byte-string file I/O).

import json
import re
from collections import Counter

import jieba

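class StatWords(object):

    def statTopN(self, path, n):
        """Print the n most frequent words in the file at path."""
        with open(path, 'r') as f:
            content = f.read()
        # Split on whitespace and ASCII punctuation; drop empty strings
        # so that a leading or trailing separator is not counted as a word.
        wordlist = [w for w in re.split(r'[\s,;.!]+', content) if w]
        # Counter tallies the words directly, replacing the manual dict loop.
        count = Counter(wordlist)
        # ensure_ascii=False prints Chinese as-is instead of escape sequences.
        print json.dumps(count.most_common(n), encoding="UTF-8", ensure_ascii=False)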

# Blacklist: stop words and punctuation that are excluded from the output.
STOPWORDS = [u'的', u'地', u'得', u'而', u'了', u'在', u'是', u'我', u'有', u'和',
             u'就', u'不', u'人', u'都', u'一', u'一个', u'上', u'也', u'很', u'到', u'说', u'要',
             u'去', u'你', u'会', u'着', u'没有', u'看', u'好', u'自己', u'这']

PUNCTUATIONS = [u'。', u'，', u'“', u'”', u'…', u'？', u'！', u'、', u'；', u'（',
                u'）', u'?', u'：']

# f_in is the original document; f_out receives the segmented text, one word per line.
f_in = open('file_in.txt')
f_out = open('file_out.txt', 'w')

try:
    for l in f_in:
        seg_list = jieba.cut(l)
        # print "/".join(seg_list)
        for seg in seg_list:
            # Skip whitespace-only tokens, stop words, and punctuation.
            if seg.strip() and seg not in STOPWORDS and seg not in PUNCTUATIONS:
                f_out.write(seg.encode('utf-8', 'strict') + "\n")
finally:
    f_in.close()
    f_out.close()

if __name__ == '__main__':
    s = StatWords()
    s.statTopN("file_out.txt", 10)
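For readers on Python 3, the filter-then-count step of the script above can be sketched without intermediate files. This is only an illustration under stated assumptions: `top_n` is a hypothetical helper that is not part of the original code, the stop word and punctuation sets are abbreviated, and the token list is hand-written so the example runs without jieba. In practice the tokens could come from `jieba.cut(text)`.

```python
from collections import Counter

# Assumed, abbreviated stop word and punctuation sets (sets give fast lookups).
STOPWORDS = {"的", "了", "在", "是"}
PUNCTUATIONS = {"。", "，", "？", "！"}


def top_n(tokens, n):
    """Return the n most common tokens after dropping stop words and punctuation."""
    kept = [
        t for t in tokens
        if t.strip() and t not in STOPWORDS and t not in PUNCTUATIONS
    ]
    return Counter(kept).most_common(n)


# Hypothetical pre-segmented input; real input could be list(jieba.cut(text)).
tokens = ["郭靖", "的", "武功", "，", "黄蓉", "的", "武功", "。", "郭靖", "武功"]
result = top_n(tokens, 2)
print(result)
```

Keeping the tokenizer outside the function means the counting logic can be checked on small hand-made lists before running it on a whole novel.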
