How do you count the number of times a given word appears in a document or a string? This is a routine task in data analysis, statistics, and Python machine-learning work.
There are two common ways to count word frequencies: one uses a plain for loop, the other uses the Counter class from Python's standard-library collections module.
1. The for loop:
# An arbitrary sentence
string = "good good study day day up"
# Deduplicate the words to be counted
set_string = set(string.split(' '))
# A dict to hold the counts for display
result = dict()
# Walk the deduplicated words and store each word's count in the dict
for key in set_string:
    # Initialize this word's count
    value = 0
    for word in string.split(' '):
        if key == word:
            value += 1
    result[key] = value
print(result)
#>>{'day': 2, 'good': 2, 'up': 1, 'study': 1}
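As a side note, the nested loop can be avoided by walking the words only once and incrementing counts with dict.get; a minimal sketch of the same idea:
# Single-pass variant (a sketch): scan the words once and
# bump each word's count with dict.get
result = dict()
for word in string.split(' '):
    result[word] = result.get(word, 0) + 1
print(result)
#>>{'good': 2, 'study': 1, 'day': 2, 'up': 1}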
2. The Counter class:
from collections import Counter

res = Counter(string.split(' '))
print(res)
print(type(res))
print(dict(res))
"""
Counter({'good': 2, 'day': 2, 'study': 1, 'up': 1})
<class 'collections.Counter'>
{'good': 2, 'study': 1, 'day': 2, 'up': 1}
"""
For reference, here is the docstring of the Counter class from the collections module:
class Counter(dict):
'''Dict subclass for counting hashable items. Sometimes called a bag
or multiset. Elements are stored as dictionary keys and their counts
are stored as dictionary values.
>>> c = Counter('abcdeabcdabcaba') # count elements from a string
>>> c.most_common(3) # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c) # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements())) # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values()) # total of all counts
15
>>> c['a'] # count of letter 'a'
5
>>> for elem in 'shazam': # update counts from an iterable
... c[elem] += 1 # by adding 1 to each element's count
>>> c['a'] # now there are seven 'a'
7
>>> del c['b'] # remove all 'b'
>>> c['b'] # now there are zero 'b'
0
>>> d = Counter('simsalabim') # make another counter
>>> c.update(d) # add in the second counter
>>> c['a'] # now there are nine 'a'
9
>>> c.clear() # empty the counter
>>> c
Counter()
Note: If a count is set to zero or reduced to zero, it will remain
in the counter until the entry is deleted or the counter is cleared:
>>> c = Counter('aaabbc')
>>> c['b'] -= 2 # reduce the count of 'b' by two
>>> c.most_common() # 'b' is still in, but its count is zero
[('a', 3), ('c', 1), ('b', 0)]
'''
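Applied to the word counts above, most_common() returns the top entries directly; a small sketch:
from collections import Counter

string = "good good study day day up"
res = Counter(string.split(' '))
# The two most frequent words (ties keep first-seen order)
print(res.most_common(2))
#>>[('good', 2), ('day', 2)]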
Both methods produce the same result. Next, memory_profiler is used to compare the memory usage of the two approaches.
Method 1: the for loop:
from memory_profiler import profile

@profile
def func1(string):
    # Deduplicate the words to be counted
    set_string = set(string.split(' '))
    # A dict to hold the counts for display
    result = dict()
    # Walk the deduplicated words and store each word's count in the dict
    for key in set_string:
        # Initialize this word's count
        value = 0
        for word in string.split(' '):
            if key == word:
                value += 1
        result[key] = value
    print(result)
    return result
Memory profiling result:
Method 2: Counter():
from collections import Counter
from memory_profiler import profile

@profile
def func2(string):
    res = Counter(string.split(' '))
    print(res)
    print(type(res))
    print(dict(res))
    return res
Memory profiling result:
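The reports come from simply calling the decorated functions; a minimal driver sketch (the factor of 50 is an assumption standing in for the enlarged string mentioned in the summary below):
if __name__ == '__main__':
    # Enlarged input: repeat the sample sentence (an assumed factor of 50)
    string = ' '.join(["good good study day day up"] * 50)
    func1(string)
    func2(string)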
Summary: after growing the string above to nearly a few dozen times its original length, the profiles show that the two methods occupy about the same amount of memory, but the number of operations differs greatly, and so does the counting efficiency: the nested loop rescans the whole word list once per unique word (roughly O(n·m)), while Counter counts everything in a single pass (O(n)).
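To make the efficiency gap concrete, the two approaches can be timed with timeit; a rough sketch (the repeat factor and iteration count are assumptions, and exact numbers vary by machine):
import timeit
from collections import Counter

string = ' '.join(["good good study day day up"] * 50)

def loop_count(s):
    # Nested-loop version: one full scan of the words per unique word
    words = s.split(' ')
    return {key: sum(1 for w in words if w == key) for key in set(words)}

def counter_count(s):
    # Counter version: a single pass over the words
    return Counter(s.split(' '))

print(timeit.timeit(lambda: loop_count(string), number=1000))
print(timeit.timeit(lambda: counter_count(string), number=1000))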