LeetCode 0398. Random Pick Index (medium; hash table, reservoir sampling)

笨牛慢耕

Published 2022-04-25 06:49:29

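1. Problem description

Given an integer array that may contain duplicates, return a random index of a given target number. You may assume that the target is guaranteed to exist in the array.

Note:
The array may be very large. Solutions that use too much extra space will not pass the tests.

Example:

int[] nums = new int[] {1,2,3,3,3};
Solution solution = new Solution(nums);

// pick(3) should return index 2, 3, or 4, each with equal probability.
solution.pick(3);

// pick(1) should return 0, because only nums[0] equals 1.
solution.pick(1);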

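2. Method 1: hash table

2.1 Idea

Build a hash table from the input array nums, using each element value as the key and the list of indices at which it occurs as the value.

A pick operation then only needs to draw one index uniformly at random from the list stored for target.

2.2 Code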

from typing import List
import random
class Solution:

    def __init__(self, nums: List[int]):
        self.d = dict()
        for k,num in enumerate(nums):
            if num in self.d:
                self.d[num].append(k)
            else:
                self.d[num] = [k]

    def pick(self, target: int) -> int:
        return random.choice(self.d[target])        


# Your Solution object will be instantiated and called as such:
# obj = Solution(nums)
# param_1 = obj.pick(target)

if __name__ == "__main__":
    
    nums = [1,2,3,3,3]
    sln = Solution(nums)
    print(sln.pick(1))
    print(sln.pick(2))
    for k in range(10):
        print(sln.pick(3))

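Runtime: 92 ms, beats 44.97% of Python3 submissions

Memory usage: 24.8 MB, beats 17.77% of Python3 submissions

As a comparison, I replaced dict() with defaultdict(). The code becomes a little simpler, but in this submission it performed worse than the plain dict version. Perhaps this is the no-free-lunch principle at work.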

from collections import defaultdict
from typing import List
import random

class Solution:
    def __init__(self, nums: List[int]):
        # defaultdict(list) creates the empty index list on first access
        self.d = defaultdict(list)
        for k, num in enumerate(nums):
            self.d[num].append(k)

    def pick(self, target: int) -> int:
        return random.choice(self.d[target])

Runtime: 104 ms, beats 30.62% of Python3 submissions

Memory usage: 25.8 MB, beats 5.36% of Python3 submissions
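
Each timing above comes from a single submission, so part of the gap may be judge noise. Below is a minimal sketch for comparing the two construction strategies locally with timeit. The function names and the test array are hypothetical, and no particular outcome is implied:

```python
import timeit
from collections import defaultdict
from typing import Dict, List


def build_with_dict(nums: List[int]) -> Dict[int, List[int]]:
    """Index table built with a plain dict and an explicit membership test."""
    d: Dict[int, List[int]] = {}
    for k, num in enumerate(nums):
        if num in d:
            d[num].append(k)
        else:
            d[num] = [k]
    return d


def build_with_defaultdict(nums: List[int]) -> Dict[int, List[int]]:
    """Same table built with defaultdict(list)."""
    d: Dict[int, List[int]] = defaultdict(list)
    for k, num in enumerate(nums):
        d[num].append(k)
    return d


if __name__ == "__main__":
    # Hypothetical workload: 100,000 values spread over 1,000 distinct keys.
    nums = [i % 1000 for i in range(100_000)]
    for build in (build_with_dict, build_with_defaultdict):
        seconds = timeit.timeit(lambda: build(nums), number=10)
        print(f"{build.__name__}: {seconds:.3f} s for 10 builds")
```

Repeating the measurement several times, and with different key distributions, gives a fairer picture than one run.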

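3. Method 2: reservoir sampling

3.1 Idea

This follows the official solution (https://leetcode-cn.com/problems/random-pick-index/solution/sui-ji-shu-suo-yin-by-leetcode-solution-ofsq/). The name sounds impressive, and I had never heard of it before. There is always more to learn.

If the array is stored as a file (the reader may assume the constructor receives a file path) and the file is far larger than memory, we cannot read the file and keep all indices in memory, so we need an algorithm with lower space complexity.

We can implement the $\text{pick}$ operation as follows:

Traverse $\textit{nums}$. On the i-th encounter of an element equal to $\textit{target}$, draw a random integer from the interval [0,i); if it equals 0, set the return value to that element's index, otherwise leave the return value unchanged.

Suppose $\textit{nums}$ contains k elements equal to $\textit{target}$. The algorithm guarantees that each of these k indices becomes the final return value with probability $\dfrac{1}{k}$: the index of the i-th occurrence must be chosen when it is encountered (probability $\dfrac{1}{i}$) and must then survive every later draw.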
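
The probability can be written out step by step for the i-th occurrence of $\textit{target}$, with $1 \le i \le k$. This is a short derivation sketch, assuming the draws at different steps are independent; the event wording is for illustration only:

$$
\begin{aligned}
P(\text{index of the } i\text{-th occurrence is returned})
&= P(\text{draw at step } i \text{ equals } 0)\cdot\prod_{j=i+1}^{k} P(\text{draw at step } j \text{ is not } 0) \\
&= \frac{1}{i}\cdot\prod_{j=i+1}^{k}\left(1-\frac{1}{j}\right) \\
&= \frac{1}{i}\cdot\frac{i}{i+1}\cdot\frac{i+1}{i+2}\cdots\frac{k-1}{k} \\
&= \frac{1}{k}
\end{aligned}
$$

The product telescopes, so the result does not depend on i.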

3.2 Code

from typing import List
import random

class Solution:
    def __init__(self, nums: List[int]):
        self.nums = nums

    def pick(self, target: int) -> int:
        ans = cnt = 0
        for i, num in enumerate(self.nums):
            if num == target:
                cnt += 1  # this is the cnt-th time target is encountered
                if random.randrange(cnt) == 0:
                    ans = i
        return ans

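Runtime: 80 ms, beats 77.73% of Python3 submissions

Memory usage: 18.4 MB, beats 42.83% of Python3 submissions

Very clever. But every pick call has to traverse nums from start to end, so why is this faster than the hash table method?

Time complexity: initialization is O(1) and $\text{pick}$ is O(n), where n is the length of $\textit{nums}$.

Space complexity: O(1). Only a constant amount of space is needed for a few variables.

Reservoir sampling (蓄水池采样, also called 水塘抽样 in Chinese) is suited to drawing random samples from a data stream of unknown length and has low memory consumption. See also: 382. Linked List Random Node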
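
To illustrate the streaming use case, here is a minimal sketch of the same selection rule applied to an iterator instead of a list. The names pick_from_stream and number_stream are hypothetical and not part of the LeetCode interface; the generator merely stands in for a data source, such as a large file, whose length is not known in advance:

```python
import random
from typing import Iterable, Iterator, Optional


def pick_from_stream(stream: Iterable[int], target: int) -> Optional[int]:
    """Return a uniformly random position of target in stream, or None.

    The stream is consumed once and only two integers are kept in memory.
    """
    ans: Optional[int] = None
    cnt = 0
    for i, num in enumerate(stream):
        if num == target:
            cnt += 1  # this is the cnt-th occurrence of target
            # Keep the new position with probability 1/cnt.
            if random.randrange(cnt) == 0:
                ans = i
    return ans


def number_stream() -> Iterator[int]:
    """Hypothetical data source whose length the consumer does not know."""
    yield from [1, 2, 3, 3, 3]


if __name__ == "__main__":
    print(pick_from_stream(number_stream(), 3))  # one of 2, 3, 4
```

Unlike the LeetCode version, this sketch returns None when the target never appears, since a general stream gives no guarantee that it does.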
