How do you count the number of times a given word appears in a document or a string? This is a routine task in data analysis, statistics, and Python machine-learning work.
There are two common ways to count word frequencies: one uses a plain for loop, the other uses the Counter class from Python's standard-library collections module.
1. The for loop:
# An arbitrary sentence
string = "good good study day day up"
# Deduplicate the words to be counted
set_string = set(string.split(' '))
# A dict to hold the counts for display
result = dict()
# Walk the deduplicated words and store each word's count in the dict
for key in set_string:
    # Initialize this word's count
    value = 0
    for word in string.split(' '):
        if key == word:
            value += 1
    result[key] = value
print(result)
#>>{'day': 2, 'good': 2, 'up': 1, 'study': 1}
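As a side note, the nested loop can be avoided by walking the words only once and incrementing counts with dict.get; a minimal sketch of the same idea:
# Single-pass variant (a sketch): scan the words once and
# bump each word's count with dict.get
result = dict()
for word in string.split(' '):
    result[word] = result.get(word, 0) + 1
print(result)
#>>{'good': 2, 'study': 1, 'day': 2, 'up': 1}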
2. The Counter class:
from collections import Counter

res = Counter(string.split(' '))
print(res)
print(type(res))
print(dict(res))
"""
Counter({'good': 2, 'day': 2, 'study': 1, 'up': 1})
<class 'collections.Counter'>
{'good': 2, 'study': 1, 'day': 2, 'up': 1}
"""
For reference, here is the docstring of the Counter class from the collections module:
class Counter(dict):
'''Dict subclass for counting hashable items. Sometimes called a bag
or multiset. Elements are stored as dictionary keys and their counts
are stored as dictionary values.
>>> c = Counter('abcdeabcdabcaba') # count elements from a string
>>> c.most_common(3) # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c) # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements())) # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values()) # total of all counts
15
>>> c['a'] # count of letter 'a'
5
>>> for elem in 'shazam': # update counts from an iterable
... c[elem] += 1 # by adding 1 to each element's count
>>> c['a'] # now there are seven 'a'
7
>>> del c['b'] # remove all 'b'
>>> c['b'] # now there are zero 'b'
0
>>> d = Counter('simsalabim') # make another counter
>>> c.update(d) # add in the second counter
>>> c['a'] # now there are nine 'a'
9
>>> c.clear() # empty the counter
>>> c
Counter()
Note: If a count is set to zero or reduced to zero, it will remain
in the counter until the entry is deleted or the counter is cleared:
>>> c = Counter('aaabbc')
>>> c['b'] -= 2 # reduce the count of 'b' by two
>>> c.most_common() # 'b' is still in, but its count is zero
[('a', 3), ('c', 1), ('b', 0)]
'''
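Applied to the word counts above, most_common() returns the top entries directly; a small sketch:
from collections import Counter

string = "good good study day day up"
res = Counter(string.split(' '))
# The two most frequent words (ties keep first-seen order)
print(res.most_common(2))
#>>[('good', 2), ('day', 2)]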
Both methods produce the same result. Next, memory_profiler is used to compare the memory usage of the two approaches.
Method 1: the for loop:
from memory_profiler import profile

@profile
def func1(string):
    # Deduplicate the words to be counted
    set_string = set(string.split(' '))
    # A dict to hold the counts for display
    result = dict()
    # Walk the deduplicated words and store each word's count in the dict
    for key in set_string:
        # Initialize this word's count
        value = 0
        for word in string.split(' '):
            if key == word:
                value += 1
        result[key] = value
    print(result)
    return result
Memory profiling result:
Method 2: Counter():
from collections import Counter
from memory_profiler import profile

@profile
def func2(string):
    res = Counter(string.split(' '))
    print(res)
    print(type(res))
    print(dict(res))
    return res
Memory profiling result:
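The reports come from simply calling the decorated functions; a minimal driver sketch (the factor of 50 is an assumption standing in for the enlarged string mentioned in the summary below):
if __name__ == '__main__':
    # Enlarged input: repeat the sample sentence (an assumed factor of 50)
    string = ' '.join(["good good study day day up"] * 50)
    func1(string)
    func2(string)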
Summary: after growing the string above to nearly a few dozen times its original length, the profiles show that the two methods occupy about the same amount of memory, but the number of operations differs greatly, and so does the counting efficiency: the nested loop rescans the whole word list once per unique word (roughly O(n·m)), while Counter counts everything in a single pass (O(n)).
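To make the efficiency gap concrete, the two approaches can be timed with timeit; a rough sketch (the repeat factor and iteration count are assumptions, and exact numbers vary by machine):
import timeit
from collections import Counter

string = ' '.join(["good good study day day up"] * 50)

def loop_count(s):
    # Nested-loop version: one full scan of the words per unique word
    words = s.split(' ')
    return {key: sum(1 for w in words if w == key) for key in set(words)}

def counter_count(s):
    # Counter version: a single pass over the words
    return Counter(s.split(' '))

print(timeit.timeit(lambda: loop_count(string), number=1000))
print(timeit.timeit(lambda: counter_count(string), number=1000))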