Word Frequency Statistics

wukk007

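Word frequency statistics means counting how many times each word occurs in a text, which is easy to do with Python's dictionary data structure. I use matplotlib to draw a bar chart of the top-K most frequent words. One thing to watch out for is the display of Chinese characters in the figure: the relevant matplotlib settings have to be changed before plotting. The exact steps are easy to find with a Google search, so I won't go into detail here.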
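For readers who want a starting point for the Chinese display settings mentioned above, here is a minimal sketch. It assumes a font named SimHei is installed on the machine; that font name is only an illustration and is not taken from the original post, so substitute any installed font that covers Chinese.

```python
import matplotlib

# Assumption: a CJK-capable font called "SimHei" is installed locally.
# Replace it with any installed font that contains Chinese glyphs.
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign renderable once a CJK font is the default.
matplotlib.rcParams['axes.unicode_minus'] = False
```

These two settings need to run before any figure is drawn.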

# -*- coding: utf-8 -*-
import numpy
import pylab


def getstr(word, count):
    # Format one "word,count" output line.
    return word + ',' + str(count)


def get_wordlist(infile):
    # Read a pre-segmented, space-separated text file and return its words.
    wordlist = []
    with open(infile, encoding='utf-8') as f:
        for line in f:
            # split() with no argument also strips the trailing newline.
            for word in line.split():
                # Skip single-character tokens.
                if len(word) > 1:
                    wordlist.append(word)
    return wordlist


def get_wordcount(wordlist, outfile):
    # Count every word with a dict.
    wordcnt = {}
    for word in wordlist:
        wordcnt[word] = wordcnt.get(word, 0) + 1
    # dict.items() is a view in Python 3, so use sorted() rather than .sort().
    worddict = sorted(wordcnt.items(), key=lambda a: -a[1])
    # Write the counts in descending order, GBK-encoded as in the original.
    with open(outfile, 'w', encoding='gbk') as out:
        for word, cnt in worddict:
            out.write(getstr(word, cnt) + '\n')
    return wordcnt


def barGraph(wcDict):
    # Keep words seen more than 5 times. The original Python 2 code measured
    # length in UTF-8 bytes, so encode before comparing to keep that filter.
    wordlist = []
    for key, val in wcDict.items():
        if val > 5 and len(key.encode('utf-8')) > 3:
            wordlist.append((key, val))
    wordlist.sort()
    keylist = [key for key, val in wordlist]
    vallist = [val for key, val in wordlist]
    barwidth = 0.5
    xVal = numpy.arange(len(keylist))
    # matplotlib 2.0+ centers bars on xVal by default, so the ticks go there too.
    pylab.xticks(xVal, keylist, rotation=45)
    pylab.bar(xVal, vallist, width=barwidth, color='y')
    # Title text: "Weibo word frequency chart".
    pylab.title('微博词频分析图')
    pylab.show()


if __name__ == '__main__':
    myfile = 'F://NLP/iWInsightor/weibo_filter.dat'
    outfile = 'F://NLP/iWInsightor/result.dat'
    wordlist = get_wordlist(myfile)
    wordcnt = get_wordcount(wordlist, outfile)
    barGraph(wordcnt)
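The script selects words for the chart with fixed thresholds rather than an explicit K. If a strict top-K selection is wanted, one possible sketch uses `collections.Counter` from the standard library. The helper name `top_k_words` and the sample data are hypothetical and are not part of the original script.

```python
from collections import Counter


def top_k_words(wordlist, k):
    # Counter is a dict subclass; most_common(k) returns the k entries with
    # the highest counts, ordered from most to least frequent.
    return Counter(wordlist).most_common(k)


# Illustrative input only, standing in for the output of get_wordlist().
sample = ['微博', '词频', '微博', '统计', '词频', '微博']
top2 = top_k_words(sample, 2)
print(top2)  # [('微博', 3), ('词频', 2)]
```

The resulting list of `(word, count)` pairs could be split into labels and heights and passed to the plotting code in place of the threshold filter.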

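With that, the work is done. Below is a bar chart of the word frequencies in my own Weibo posts. This was put together in my spare time and still has plenty of shortcomings.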
