Counting Word Frequencies in English Text with a Hash Table

Ckyeka

Last modified: 2022-07-29 14:41:09

Categories: Data Structures, python. Tags: hash table, hash algorithm, data structure

First published: 2022-07-28 16:51:03

Article link: https://blog.csdn.net/Shallowmm/article/details/126037831

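In an interview today, the advisor asked me to use a hash table to count the word frequencies in a passage of text. The problem is fairly simple, yet I could not write the complete solution during the interview, which was painful.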

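The idea is simple. First, extract the words from the text into a list. I did not use the jieba library here and instead used Python's built-in split method directly. Note that splitting the string alone is not enough: the stop words (here, punctuation such as parentheses, commas, and periods) attached to the words must also be removed.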
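The extraction step can be sketched on its own as follows. This is a minimal illustration, not the final code of this post: the helper name `extract_words`, the constant `PUNCTUATION`, and the sample sentence are assumptions made for the example.

```python
# Characters trimmed from both ends of each token (an assumed set for this sketch).
PUNCTUATION = '",.?()'

def extract_words(text):
    """Split text on whitespace and trim punctuation from each token."""
    words = []
    for token in text.split():           # no argument: split on any run of whitespace
        word = token.strip(PUNCTUATION)  # strip() removes any of these characters from both ends
        if word:                         # skip tokens that were nothing but punctuation
            words.append(word)
    return words

sample = 'The group (NBA) created a number of "World" level\nchampionships.'
print(extract_words(sample))
# ['The', 'group', 'NBA', 'created', 'a', 'number', 'of', 'World', 'level', 'championships']
```

Calling split with no argument also splits on newlines, so a word at the end of one line is not glued to the first word of the next line.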

Once we have the word list, what remains is to build the hash table and count the frequencies. The hash table design has two parts:

1. Hash function

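How do we compute the hash value of a word? We can sum the ASCII codes of the word's characters to get its key, then apply the division (remainder) method to get the hash value: hash(key) = key % p, where p is a fairly large prime.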
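As a small worked example of this formula, the sketch below computes the hash of a few words. The function name `simple_hash` is made up for the illustration; the modulus 29989 is the value of HASH_NUM used in the code of this post.

```python
HASH_NUM = 29989  # modulus p; same value as HASH_NUM in the post's code

def simple_hash(word, p=HASH_NUM):
    """Sum the character codes of the word, then take the remainder modulo p."""
    key = sum(ord(ch) for ch in word)
    return key % p

print(simple_hash("the"))                      # 116 + 104 + 101 = 321
print(simple_hash("was"), simple_hash("saw"))  # 331 331
```

Because addition ignores the order of the characters, words made of the same letters, such as "was" and "saw", always get the same hash value. Both words appear in the sample text, so collisions really do occur and have to be handled.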

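2. Collision handling

Common collision-handling methods are open addressing and separate chaining; I use separate chaining, whose principle is shown in the figure. If the computed address is free, the node is placed there directly and becomes the head of that slot's linked list. Otherwise, we search the list for the same word: if it is found, its count is incremented by one; if not, a new node is appended to the tail of the list.

[Figure: separate chaining]

Once the hash table is built, we only need to traverse all of its nodes to obtain the frequency of every word.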
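To make the chaining rule concrete, here is a small sketch on a deliberately tiny table, so that collisions are easy to observe. The names `Node`, `add_word`, and `chain`, and the table size 7, are assumptions for this illustration only.

```python
class Node:
    """One link in a chain: a word, its count, and the next node."""
    def __init__(self, word):
        self.word = word
        self.count = 1
        self.next = None

def add_word(table, word):
    index = sum(ord(ch) for ch in word) % len(table)
    node = table[index]
    if node is None:              # free slot: the new node becomes the chain head
        table[index] = Node(word)
        return
    while True:
        if node.word == word:     # word already in the chain: bump its count
            node.count += 1
            return
        if node.next is None:     # reached the tail without finding the word
            break
        node = node.next
    node.next = Node(word)        # append a new node at the tail

def chain(table, index):
    """Return the (word, count) pairs stored in one slot, in chain order."""
    pairs = []
    node = table[index]
    while node:
        pairs.append((node.word, node.count))
        node = node.next
    return pairs

table = [None] * 7                # tiny table (hypothetical size) to force collisions
for w in ["was", "saw", "was", "the"]:
    add_word(table, w)

print(chain(table, 331 % 7))      # [('was', 2), ('saw', 1)]
```

"was" and "saw" both hash to slot 2 here, so they share one chain, while "the" lands alone in slot 6.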

The code is as follows:

text = """The National Wrestling Association was an early professional wrestling sanctioning body created in 1930 
by the National Boxing Association (NBA) (now the World Boxing Association, WBA) as an attempt to create a governing 
body for professional wrestling in the United States. The group created a number of "World" level 
championships as an attempt to clear up the professional wrestling rankings which at the time saw a number of
different championships promoted as the "true world championship". The National Wrestling Association's NWA 
World Heavyweight Championship was later considered part of the historical lineage of the National Wrestling 
Alliance's NWA World Heavyweight Championship when then National Wrestling Association champion Lou Thesz 
won the National Wrestling Alliance championship, folding the original championship into one title in 1949.
"""
HASH_NUM = 29989  # table size, also the modulus p of the hash function

class HashNode:
    """A node in a hash-table chain: a word, its count, and a link to the next node.
    """
    def __init__(self, word, count):
        self.count = count
        self.word = word
        self.next = None

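class HashTable:
    """Hash table that resolves collisions by separate chaining.
    """
    def __init__(self):
        self.table = [None] * HASH_NUM  # slots; the list index is the hash address
        self.word_indexs = []  # hash addresses of the words added so far (may repeat)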

    def hash(self, word):
        """Hash function: sum the character codes of the word, then
            hash(key) = key % HASH_NUM
        """
        key = 0
        for ch in word:
            key = key + ord(ch)  # ord() already returns an int

        return key % HASH_NUM

    def add_word(self, word):
        """Add a word to the hash table, or increment its count if already present.
        """
        index = self.hash(word)
        self.word_indexs.append(index)

        # If the slot is empty, the new node becomes the head of its chain
        if self.table[index] is None:
            self.table[index] = HashNode(word, 1)
            return

        # Otherwise search the chain at this slot for the same word
        node = self.table[index]
        while True:
            # Same word found: increment its count
            if node.word == word:
                node.count += 1
                return
            if node.next is None:
                break
            node = node.next

        # Word not found: append a new node at the tail of the chain
        new_node = HashNode(word, 1)
        node.next = new_node

    def display_count(self):
        """Print the word frequencies in descending order of count.
        """
        count_table = {}

        # Visit each occupied slot once and walk its chain
        for index in set(self.word_indexs):
            p = self.table[index]
            while p:
                count_table[p.word] = p.count
                p = p.next

        sorted_table = sorted(count_table.items(), key=lambda x: x[1], reverse=True)
        for key, value in sorted_table:
            print(key, ": ", value)


if __name__ == '__main__':
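    split_words = text.split()  # split on any whitespace, including newlines
    stop_words = '",.?()'  # punctuation to strip from both ends of each word
    word_list = []

    for word in split_words:
        # Strip the stop characters (punctuation) from both ends
        word = word.strip(stop_words)
        if word:  # skip tokens that were nothing but punctuation
            word_list.append(word)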

    hash_table = HashTable()
    for word in word_list:
        hash_table.add_word(word)
    hash_table.display_count()
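
One way to sanity-check the output of a hand-written table is to compare it with Python's built-in dict, which is itself a hash table. The sketch below uses collections.Counter from the standard library on a short made-up word list; it is only a cross-check, not part of the interview solution.

```python
from collections import Counter

# A short made-up word list standing in for the extracted words.
words = ["the", "National", "the", "was", "saw", "the"]

counts = Counter(words)  # Counter is a dict subclass mapping word -> count

# most_common() returns (word, count) pairs sorted by count, highest first
for word, count in counts.most_common():
    print(word, ":", count)
```

If the counts printed by display_count differ from the Counter result for the same word list, the bug is most likely in the word extraction or in the chain handling.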