Counting English word frequencies in a TXT file with Python

Latest recommended article published on 2024-02-28 10:26:42

江山美女一锅端


Article link: https://blog.csdn.net/hnx19910729/article/details/72571783


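Every sentence ends with an English period (.), and other punctuation marks (, ! ? and so on) can appear as well.

Start with periods and commas.

If the text is split without any preprocessing: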

>>> "and port 4 communicates with the SGMII4 module.".split()
['and', 'port', '4', 'communicates', 'with', 'the', 'SGMII4', 'module.']

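Notice that the period is glued to the last word, producing a new string ('module.') that would be counted separately from 'module'.

The string replace method can remove the period.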

>>> "and port 4 communicates with the SGMII4 module.".replace('.','').split()
['and', 'port', '4', 'communicates', 'with', 'the', 'SGMII4', 'module']

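If the sentence contains both a period and a comma, removing only the period leaves the comma attached to a word, so chain a second replace: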

>>> "and port 4 communicates, with the SGMII4 module.".replace('.','').split()
['and', 'port', '4', 'communicates,', 'with', 'the', 'SGMII4', 'module']

>>> "and port 4 communicates, with the SGMII4 module.".replace('.','').replace(',','').split()
['and', 'port', '4', 'communicates', 'with', 'the', 'SGMII4', 'module']

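When the text contains many different punctuation marks, chaining replace calls becomes unwieldy, and the following approach works better.

First import string. The string module has a punctuation attribute, a string containing the ASCII punctuation characters.

Then remove each of them in a loop: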

# r2 starts out as the raw text read from the file
for i in string.punctuation:
    r2 = r2.replace(i, '')
r2 = r2.split()
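As an alternative to the loop above, str.translate can delete every character of string.punctuation in a single pass. This is a swapped-in technique, not the method used in this post, and it assumes Python 3. The helper name strip_punctuation and the sample sentence are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumes Python 3, where str.maketrans accepts a third "delete these" argument.
_DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)


def strip_punctuation(text):
    """Return text with all ASCII punctuation removed (hypothetical helper)."""
    return text.translate(_DELETE_PUNCTUATION)


sample = "and port 4 communicates, with the SGMII4 module."
words = strip_punctuation(sample).split()
print(words)
```

Like the loop, this deletes punctuation rather than replacing it with a space, so "a,b" becomes the single word "ab". string.punctuation covers ASCII only, so full-width marks such as ， are left untouched by either approach.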

The resulting r2 is a list. To count word frequencies, a dictionary is the natural data structure.

frequencies = {}
for word in r2:
    frequencies[word] += 1

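Python raises a KeyError here: frequencies[word] += 1 has to read the current value first, and the key does not exist yet. One way to fix this is to catch the exception: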

for word in r2:
    try:
        frequencies[word] += 1
    except KeyError:
        frequencies[word] = 1

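The setdefault() method is another option.

dict.setdefault() is a dictionary method that takes two arguments: the key and a default value. If the key is not in the dictionary, it inserts the key with the default value and returns that default; otherwise it returns the value already stored.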

>>> frequencies = {}
>>> for word in r2:
...     frequencies[word] = frequencies.setdefault(word, 0) + 1
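To make the two cases of setdefault() concrete, here is a minimal sketch. The dictionary counts and the key 'port' are illustrative only.

```python
counts = {}

# Key is missing: setdefault stores the default and returns it.
first = counts.setdefault('port', 0)

# Key now exists: setdefault ignores the new default and returns the stored value.
counts['port'] = first + 1
second = counts.setdefault('port', 100)

print(first, second, counts)
```

The second call shows why the counting loop is safe: once a word has been seen, the default of 0 is never applied again.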

collections.defaultdict also solves this problem neatly:

from collections import defaultdict

frequencies = defaultdict(int)  # pass the int function as the default factory
for word in r2:
    frequencies[word] += 1

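collections.defaultdict takes a function (any callable that needs no arguments) and uses it to initialize missing keys. In the example above we want frequencies[word] to start at 0, so we pass the int function itself to defaultdict. Called with no arguments, int() returns 0.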
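The factory behaviour can be checked directly with a small sketch. The word list and the by_length grouping are illustrative additions, not part of the original script.

```python
from collections import defaultdict

# int is passed uncalled; defaultdict calls it with no arguments for each missing key.
assert int() == 0

frequencies = defaultdict(int)
frequencies['module'] += 1  # missing key: starts at int() == 0, then becomes 1
frequencies['module'] += 1

# Any zero-argument callable works as the factory, e.g. list for grouping words by length.
by_length = defaultdict(list)
for word in ['and', 'port', 'the']:
    by_length[len(word)].append(word)

print(dict(frequencies))
print(dict(by_length))
```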

Complete code

# -*- coding: utf-8 -*-
# Count the English words in a TXT file and their frequencies
import string
from collections import defaultdict
import pprint

# raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\Users\hnx\Desktop\word.txt', 'r') as f:
    r = f.read()
print(r)
print(len(r))

r1 = r.replace('.', '').replace(',', '').split()  # handles only commas and periods
print(r1)
print(len(r1))

r2 = r
for i in string.punctuation:
    r2 = r2.replace(i, '')
r2 = r2.split()
print(r2)
print(len(r2))

frequencies = defaultdict(int)  # pass the int function as the default factory
for word in r2:
    frequencies[word] += 1
pprint.pprint(frequencies)  # for a defaultdict the output may look the same as print

frequencies = {}
for word in r2:
    frequencies[word] = frequencies.setdefault(word, 0) + 1
pprint.pprint(frequencies)  # a plain dict is pretty-printed
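For comparison, the standard library also provides collections.Counter, which performs the counting step in one call. The sketch below is an alternative rather than the approach used in this post. The function name word_frequencies and the sample text are hypothetical, and it works on an in-memory string instead of the file above.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Count words in text after deleting ASCII punctuation (hypothetical helper)."""
    for ch in string.punctuation:
        text = text.replace(ch, '')
    return Counter(text.split())


sample = "port 4 talks to the module. The module answers port 4."
freq = word_frequencies(sample)

# most_common(n) returns the n most frequent (word, count) pairs.
print(freq.most_common(2))
```

Counting here is case-sensitive, as in the script above, so 'The' and 'the' are separate entries. Lowercasing the text first with text.lower() would merge them if that is what you want.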
