python英文文本词频统计代码_python英文与中文的词频统计

最新推荐文章于 2024-04-10 23:11:57 发布

weixin_39627455

最新推荐文章于 2024-04-10 23:11:57 发布

阅读量1k

点赞数

文章标签： python英文文本词频统计代码

1、统计英文单词，

# 1.准备utf-8编码的文本文件file(已在文件夹中定义了一个名叫“head.txt.rtf”文本文件，详情请见截图)

def getTxt(): #3对文本预处理(包括)

txt = open('head.txt.rtf').read() #2.通过文件读取字符串 str

txt = txt.lower()#将所有的单词全部转化成小写

for ch in ",.!、!@#$%^'": #将所有除了单词以外的符号换成空格

txt.replace(ch, ' ')

return txt

#4、分析提取单词

txtArr = getTxt().split()

#5、单词计数字典

counts = {}

for word in txtArr:

counts[word] = counts.get(word, 0) + 1

#6、将字典转换为列表

countsList = list(counts.items())

countsList.sort(key=lambda x:x[1], reverse=True)

#8.输出TOP(20)

for i in range(20):

word, count = countsList[i]

print('{0:<20}{1:>10}'.format(word,count))

用到的知识点， split

str.split(str="", num=string.count(str)).

str -- 分隔符，默认为所有的空字符，包括空格、换行(\n)、制表符(\t)等。

num -- 分割次数。

实例：

#!/usr/bin/python

str = "Line1-abcdef \nLine2-abc \nLine4-abcd";

print str.split( );

print str.split(' ', 1 );

输出：

['Line1-abcdef', 'Line2-abc', 'Line4-abcd']

['Line1-abcdef', '\nLine2-abc \nLine4-abcd']

counts.get(word, 0) + 1：这个表达式代表的意思是，统计counts中的单词数，如果有word，就+1，没有word就返回0;这是一个统计单词数的方法，虽然统计完后，没有直接方法放到字典中，但它确实存在。counts.items()：以列表返回可遍历的(键, 值) 元组数组

实例：

#!/usr/bin/python

# coding=utf-8

dict = {'Google': 'www.google.com', 'Runoob': 'www.runoob.com', 'taobao': 'www.taobao.com'}

print "字典值 : %s" % dict.items()

# 遍历字典列表

for key,values in dict.items():

print key,values

结果：字典值 : [('Google', 'www.google.com'), ('taobao', 'www.taobao.com'), ('Runoob', 'www.runoob.com')]

Google www.google.com

taobao www.taobao.com

Runoob www.runoob.com

countsList.sort(key=lambda x:x[1], reverse=True)：在这道题中，显然是排序，以从大到小的顺序排列，但有了lambda x:x[1]以后，就是以元组中的第二个元素来进行排序，显然第二个元素也会带动第一个元素。这样就可以，将列表中每个元组也排序了。后面的reverse = true是从大到小排列，false则反之。

range(start, stop[, step])：

start: 计数从 start 开始。默认是从 0 开始。例如range(5)等价于range(0， 5);

stop: 计数到 stop 结束，但不包括 stop。例如：range(0， 5) 是[0, 1, 2, 3, 4]没有5

step：步长，默认为1。例如：range(0， 5) 等价于 range(0, 5, 1)

>>>range(10) # 从 0 开始到 10

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> range(1, 11) # 从 1 开始到 11

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

>>> range(0, 30, 5) # 步长为 5

[0, 5, 10, 15, 20, 25]

>>> range(0, 10, 3) # 步长为 3

[0, 3, 6, 9]

>>> range(0, -10, -1) # 负数

[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]

>>> range(0)

[]

>>> range(1, 0)

[]

print('{0:<20}{1:>10}'.format(word,count))：这个语法的意思是，将word空出20个占位符，count占10个占位符；而大于号与小于号决定word与count是居左还是剧右。

-------------------------------------------------------------------------------------------------------------------

weixin_39627455

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。