Other Problems --- Top-K by Occurrence Count


wenbin1996


Category: Data Structures and Algorithms. Tags: python, heap

Article link: https://blog.csdn.net/qq_34342154/article/details/78494636


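【Problem】

Given an array of strings strArr and an integer k, print the k most frequently occurring strings in strict rank order. The required time complexity is O(N log k).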

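【Advanced Problem】

Design and implement a TopKRecord structure to which strings can be added continuously, and which can at any time print the k strings that have been added most often. Specifically:

1. k is specified when the TopKRecord instance is created and never changes afterwards (k is a constructor parameter).
2. It has an add(String str) method, which adds a string to the TopKRecord.
3. It has a printTopK() method, which prints the k most frequently added strings. Printing the strings and their counts is enough; strict rank order is not required.

Requirements:

At any time, the time complexity of add must not exceed O(log k).
At any time, the time complexity of printTopK must not exceed O(k).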

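【Approach】

Original problem. Use a hash table to count how often each string occurs. Then traverse the hash table once while maintaining a min-heap of size k keyed on frequency: while the heap holds fewer than k entries, insert the current entry; once it is full, replace the root only if the current string's count exceeds the root's count. After the traversal, the strings left in the heap are the k most frequent ones. Because the problem asks for strict rank order, the last step is a heap sort over those k entries, which on a min-heap yields descending order. The full procedure is shown in the code below: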

class FreNode:
    def __init__(self, st, times):
        self.str = st
        self.times = times

# Python 3.5
# Original problem
def printTopKAndRank(strArr, k):
    def heapInsert(heap, i):
        # Sift the node at index i up while its parent has a larger count.
        parent = (i - 1) // 2
        while parent >= 0 and heap[parent].times > heap[i].times:
            heap[parent], heap[i] = heap[i], heap[parent]
            i = parent
            parent = (i - 1) // 2

    def heapify(heap, i, heapSize):
        # Sift the node at index i down within heap[0:heapSize].
        left = 2 * i + 1
        right = 2 * i + 2
        while left < heapSize:
            smallest = i
            if heap[left].times < heap[smallest].times:
                smallest = left
            if right < heapSize and heap[right].times < heap[smallest].times:
                smallest = right
            if smallest == i:
                break
            heap[smallest], heap[i] = heap[i], heap[smallest]
            i = smallest
            left = 2 * i + 1
            right = 2 * i + 2

    # Invalid input (including k larger than the array) prints nothing.
    if strArr is None or len(strArr) == 0 or k < 1 or k > len(strArr):
        return
    # Step 1: count occurrences, O(N).
    counts = {}
    for element in strArr:
        if element in counts:
            counts[element] += 1
        else:
            counts[element] = 1
    # Step 2: keep the k most frequent entries in a min-heap, O(N log k).
    heap = [None] * k
    index = 0  # number of nodes currently in the heap
    for key, value in counts.items():
        curNode = FreNode(key, value)
        if index != k:
            heap[index] = curNode
            heapInsert(heap, index)
            index += 1
        else:
            if heap[0].times < curNode.times:
                heap[0] = curNode
                heapify(heap, 0, k)
    # Step 3: heap sort; on a min-heap this leaves the array in descending order.
    for i in range(index - 1, 0, -1):
        heap[0], heap[i] = heap[i], heap[0]
        heapify(heap, 0, i)
    for i in range(index):
        print("No." + str(i + 1) + " :" + heap[i].str + " times: " + str(heap[i].times))
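
For comparison, the same idea can be sketched with Python's standard library instead of a hand-written heap. This is only an illustrative alternative, not the code of this post: the function name `top_k_by_count` is made up here, and ties between equal counts are broken by the string itself, which the problem statement does not specify.

```python
import heapq
from collections import Counter


def top_k_by_count(words, k):
    """Return the k most frequent strings as (word, count), most frequent first."""
    if k < 1:
        return []
    counts = Counter(words)   # one O(N) pass to count occurrences
    heap = []                 # min-heap of (count, word), never larger than k
    for word, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            # The new word beats the weakest of the current top k.
            heapq.heapreplace(heap, (count, word))
    # Sorting the k survivors costs O(k log k), so the total stays O(N log k).
    return [(word, count) for count, word in sorted(heap, reverse=True)]


print(top_k_by_count(["a", "b", "a", "c", "b", "a"], 2))  # [('a', 3), ('b', 2)]
```

The heap never grows beyond k entries, which is what keeps each push or replace at O(log k).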

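Advanced problem. The key difficulty is that the occurrence counts now change dynamically. One could of course reuse the original solution: each add only updates the hash table, and the min-heap is built when it is needed. That makes add O(1), but every call to printTopK then has to traverse the whole hash table and rebuild the min-heap, which costs O(N log k) and clearly violates the requirements.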
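
The naive variant described above might look like the following sketch. The class name `NaiveTopKRecord` and the helper `top_k` are hypothetical and exist only to make the cost visible; `heapq.nlargest` stands in for rebuilding the size-k heap by hand.

```python
import heapq


class NaiveTopKRecord:
    """Baseline for illustration: cheap add, expensive print_top_k."""

    def __init__(self, k):
        self.k = k
        self.counts = {}  # string -> number of times added

    def add(self, s):
        # O(1) on average: only the hash table is touched.
        self.counts[s] = self.counts.get(s, 0) + 1

    def top_k(self):
        # O(N log k): every distinct string is scanned on each call.
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])

    def print_top_k(self):
        for s, times in self.top_k():
            print("Str: " + s + " Times: " + str(times))


rec = NaiveTopKRecord(2)
for s in ["a", "b", "a"]:
    rec.add(s)
rec.print_top_k()
```

The work in `top_k` grows with the number of distinct strings N, not with k, which is exactly what the O(k) bound on printTopK rules out.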

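To keep add within O(log k) and printTopK within O(k), each add should reuse the min-heap built so far instead of rebuilding it from scratch.

We therefore refine the original solution: every node records its frequency, and we also keep track of each node's current position in the min-heap. The benefit is as follows. Suppose a string is added once more. If it is already in the heap, we look up its position, increase its frequency by 1, and sift down from that position. If it is not in the heap and the heap still has room, we append it and sift up. If it is not in the heap and the heap is full, we check whether its incremented frequency exceeds the frequency at the root; if so, it replaces the root and we sift down from the top. Each adjustment costs O(log k), and printTopK only has to walk the k heap slots.

The implementation is shown in the code below: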

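# Advanced problem
class TopKRecord:
    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        # Keep all state per instance; class-level dicts would be shared by every TopKRecord.
        self.heap = [None] * size  # min-heap keyed on times
        self.index = 0             # number of nodes currently in the heap
        self.strNodeMap = {}       # string -> its FreNode
        self.nodeIndexMap = {}     # FreNode -> position in the heap, or -1 if not in the heap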

    def add(self, str1):
        preIndex = -1  # heap position before this add; -1 means not in the heap
        if str1 not in self.strNodeMap:
            # First occurrence: create the node and mark it as outside the heap.
            curNode = FreNode(str1, 1)
            self.strNodeMap[str1] = curNode
            self.nodeIndexMap[curNode] = -1
        else:
            curNode = self.strNodeMap[str1]
            curNode.times += 1
            preIndex = self.nodeIndexMap[curNode]
        if preIndex == -1:
            if self.index == len(self.heap):
                # Heap is full: replace the root only if this node now beats it.
                if curNode.times > self.heap[0].times:
                    self.nodeIndexMap[self.heap[0]] = -1
                    self.nodeIndexMap[curNode] = 0
                    self.heap[0] = curNode
                    self.heapify(0, self.index)
            else:
                # Heap has room: append at the end and sift up.
                self.nodeIndexMap[curNode] = self.index
                self.heap[self.index] = curNode
                self.heapInsert(self.index)
                self.index += 1
        else:
            # Already in the heap and its count grew: sift down from its position.
            self.heapify(preIndex, self.index)

    def printTopK(self):
        # O(k): prints the heap contents in heap order, not in rank order.
        print("TOP:")
        for i in range(self.index):
            print("Str: " + self.heap[i].str + " Times: " + str(self.heap[i].times))

    def heapify(self, i, heapSize):
        # Sift the node at index i down, keeping nodeIndexMap in sync with every swap.
        left = 2 * i + 1
        right = 2 * i + 2
        while left < heapSize:
            smallest = i
            if self.heap[left].times < self.heap[smallest].times:
                smallest = left
            if right < heapSize and self.heap[right].times < self.heap[smallest].times:
                smallest = right
            if smallest == i:
                break
            self.nodeIndexMap[self.heap[i]] = smallest
            self.nodeIndexMap[self.heap[smallest]] = i
            self.heap[i], self.heap[smallest] = self.heap[smallest], self.heap[i]
            i = smallest
            left = 2 * i + 1
            right = 2 * i + 2

    def heapInsert(self, i):
        # Sift the node at index i up, keeping nodeIndexMap in sync with every swap.
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[parent].times > self.heap[i].times:
                self.nodeIndexMap[self.heap[i]] = parent
                self.nodeIndexMap[self.heap[parent]] = i
                self.heap[i], self.heap[parent] = self.heap[parent], self.heap[i]
                i = parent
            else:
                break
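
To make the position-tracking idea easy to try out in isolation, here is a compact, self-contained sketch of the same technique. It is a hypothetical variant rather than the class above: the name `CompactTopK`, the method `top_k`, and the choice to store strings directly in the heap (with separate `counts` and `pos` dictionaries) are assumptions made for brevity.

```python
class CompactTopK:
    """Illustrative variant: the heap holds strings, ordered by counts[string]."""

    def __init__(self, k):
        self.k = k
        self.counts = {}  # string -> number of times added
        self.pos = {}     # string -> index in self.heap (only strings in the heap)
        self.heap = []    # min-heap of at most k strings

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        self.pos[h[i]], self.pos[h[j]] = i, j  # keep positions in sync

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.counts[self.heap[child]] < self.counts[self.heap[smallest]]:
                    smallest = child
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.counts[self.heap[parent]] <= self.counts[self.heap[i]]:
                return
            self._swap(i, parent)
            i = parent

    def add(self, s):
        self.counts[s] = self.counts.get(s, 0) + 1
        if s in self.pos:
            # Already in the heap and its count grew: sift down, O(log k).
            self._sift_down(self.pos[s])
        elif len(self.heap) < self.k:
            # Heap not full: append and sift up.
            self.heap.append(s)
            self.pos[s] = len(self.heap) - 1
            self._sift_up(self.pos[s])
        elif self.k > 0 and self.counts[s] > self.counts[self.heap[0]]:
            # Beats the current minimum: evict the root and sift down.
            del self.pos[self.heap[0]]
            self.heap[0] = s
            self.pos[s] = 0
            self._sift_down(0)

    def top_k(self):
        # O(k): heap order, not rank order.
        return [(s, self.counts[s]) for s in self.heap]


rec = CompactTopK(2)
for s in ["A", "B", "A", "C", "C", "C"]:
    rec.add(s)
print(sorted(rec.top_k()))  # [('A', 2), ('C', 3)]
```

In this run "B" is evicted when "C" reaches a count of 2, because a string outside the heap only gets in once its count strictly exceeds the count at the root.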
