Several Ways to Count Word Frequencies in Text (Python)

greatau

Last modified 2023-10-25 22:49:31

First published 2023-10-25 22:16:54

Article link: https://blog.csdn.net/greatau/article/details/134044945

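Word-frequency counting is a basic task in natural language processing: given a sentence, an article, or a collection of articles, count how many times each word occurs, and use those counts to identify the topic words and trending words of the text.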

1. Word-frequency counting for a single sentence

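Approach: first define an empty dictionary my_dict, then iterate over the words of the article (or sentence). For each word, check whether it is already a key of my_dict. If it is not, add the word as a key with the value 1; if it is, increase its value by 1.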

# Count how many times each word appears in a single sentence
news = "Xi, also general secretary of the Communist Party of China (CPC) Central Committee and chairman of the Central Military Commission, made the remarks while attending a voluntary tree-planting activity in the Chinese capital's southern district of Daxing."

def couWord(news_list):
    # Counting function. Input: list of words in the sentence. Output: dict mapping word -> count.
    my_dict = {}  # empty dict that stores how many times each word appears
    for v in news_list:
        if v in my_dict:
            my_dict[v] += 1
        else:
            my_dict[v] = 1
    return my_dict

print(couWord(news.split()))

Output

{'Xi,': 1, 'also': 1, 'general': 1, 'secretary': 1, 'of': 4, 'the': 4, 'Communist': 1, 'Party': 1, 'China': 1, '(CPC)': 1, 'Central': 2, 'Committee': 1, 'and': 1, 'chairman': 1, 'Military': 1, 'Commission,': 1, 'made': 1, 'remarks': 1, 'while': 1, 'attending': 1, 'a': 1, 'voluntary': 1, 'tree-planting': 1, 'activity': 1, 'in': 1, 'Chinese': 1, "capital's": 1, 'southern': 1, 'district': 1, 'Daxing.': 1}
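The same counting can also be done with the standard library's collections.Counter, which is a dict subclass built for tallying. The following is only a minimal sketch of that alternative, not the function used in this article; the sample sentence and the variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample text, chosen only to keep the example short.
sentence = "the cat and the dog and the bird"

# Counter accepts any iterable of hashable items and tallies each one.
word_counts = Counter(sentence.split())

print(word_counts["the"])   # 3
print(word_counts["and"])   # 2
print(word_counts["fish"])  # 0, a missing key counts as zero instead of raising KeyError
```

Because Counter is a dict subclass, the result can be passed to any code that expects a word-to-count dictionary.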

The couWord function above does count word frequencies, but it has the following two problems.

(1) Stop words are not removed

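The output contains stop words such as 'also', 'and', and 'in'. Stop words have little to do with the topic of a text, so they should be filtered out in word-frequency counting and similar processing.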

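(2) The result is not sorted by count

Once the words are sorted by how often they occur, the topic words or trending words of the text can be spotted directly.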
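As a small illustration of the sorting step on its own, a count dictionary can be ordered with sorted() and a key function. This is a sketch under assumptions: the counts dictionary below is invented for the example, and breaking ties alphabetically is just one reasonable choice.

```python
# Hypothetical counts, standing in for the dictionary returned by a counting function.
counts = {"tree": 3, "china": 2, "planting": 3, "district": 1}

# Sort by descending count; ties are broken alphabetically by the word itself.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(ranked[:2])  # [('planting', 3), ('tree', 3)]
```

Negating the count in the key gives descending order for counts while keeping ascending order for the words, which a plain reverse=True cannot do in one pass.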

The improved couWord function is as follows:

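def couWord(news_list, word_list, N):
    # Input: list of words in the text, list of stop words, N. Output: the top-N (count, word) pairs.
    my_dict = {}  # empty dict that stores how many times each word appears
    for v in news_list:
        if v not in word_list:  # skip words that are in the stop-word list
            if v in my_dict:
                my_dict[v] += 1
            else:
                my_dict[v] = 1

    # Sort (count, word) pairs in descending order; ties fall back to reverse alphabetical order of the word
    topWord = sorted(zip(my_dict.values(), my_dict.keys()), reverse=True)[:N]

    return topWord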

Load the English stop-word list:

stopPath = r'Data/stopword.txt'
with open(stopPath, encoding='utf-8') as file:
    word_list = file.read().split()  # read() returns the file contents as one string; split() turns it into a list

print(couWord(news.split(), word_list, 5))

Output

[(2, 'Central'), (1, 'voluntary'), (1, 'tree-planting'), (1, 'southern'), (1, 'secretary')]
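Splitting on whitespace leaves punctuation attached to tokens (for example 'Xi,' and 'Daxing.' in the first output) and treats differently cased forms as different words. One possible refinement, shown here only as a sketch, is to lowercase the text and extract words with a regular expression before counting. The helper name top_words, the regular expression, and the sample text are assumptions made for this example, and the result uses (word, count) pairs rather than the (count, word) pairs returned by couWord.

```python
import re
from collections import Counter

def top_words(text, stop_words, n):
    """Return the n most frequent words after simple normalization (hypothetical helper)."""
    # Lowercase, then keep runs of letters, optionally joined by hyphens or apostrophes.
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    # most_common(n) returns (word, count) pairs, highest count first.
    return counts.most_common(n)

# Hypothetical sample text and a one-word stop-word set.
sample = "Central Committee, central commission. The Central plan."
print(top_words(sample, {"the"}, 1))  # [('central', 3)]
```

Passing the stop words as a set keeps the membership test fast even for long stop-word lists.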