Random Sampling (with/without replacement) & Random Shuffling

Latest recommended article published on 2025-04-13 09:31:47

idkmn_

Category: Federated Learning. Tags: algorithms, deep learning, python, artificial intelligence

Article link: https://blog.csdn.net/xbn20000224/article/details/138203850

1 Random sample & Random shuffle

1.1 Example

Commonly used sampling functions in Python's random module:

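import random
# random.random() returns a random float in [0, 1)
# random.uniform(num1, num2) returns a random float between num1 and num2
# random.randint(int1, int2) takes two integers and returns a random integer N with int1 <= N <= int2
# random.choice(lis) returns one random element of the list lis

# random.sample(lis, ele_num) returns a new list of ele_num elements drawn at random from lis; the original list is not modified
# random.shuffle(lis) randomly reorders the elements of lis in place: it changes the order of the sequence passed in and does not return a new sequence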
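
The calls listed above can be tried out with a minimal Python 3 sketch like the one below. The seed and the list contents are arbitrary choices made for illustration and are not from the original post.

```python
import random

random.seed(0)  # arbitrary seed, only to make runs repeatable

lis = [10, 20, 30, 40, 50]

x = random.random()          # float in [0.0, 1.0)
u = random.uniform(2, 5)     # float between 2 and 5
r = random.randint(1, 6)     # integer N with 1 <= N <= 6 (both ends included)
c = random.choice(lis)       # one element of lis

picked = random.sample(lis, 3)   # new list of 3 elements; lis is unchanged
original = list(lis)             # copy kept to compare after shuffling
ret = random.shuffle(lis)        # shuffles lis in place and returns None

print(x, u, r, c)
print(picked, lis, ret)
```

Note that random.shuffle() returns None, so writing `lis = random.shuffle(lis)` would discard the list.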

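An earlier blog post discusses random.sample() and random.shuffle() in detail. Their source code, taken from the Python 2 standard library (note the use of xrange), is as follows: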

def sample(self, population, k):
        """Chooses k unique random elements from a population sequence.

        Returns a new list containing elements from the population while
        leaving the original population unchanged.  The resulting list is
        in selection order so that all sub-slices will also be valid random
        samples.  This allows raffle winners (the sample) to be partitioned
        into grand prize and second place winners (the subslices).

        Members of the population need not be hashable or unique.  If the
        population contains repeats, then each occurrence is a possible
        selection in the sample.

        To choose a sample in a range of integers, use xrange as an argument.
        This is especially fast and space efficient for sampling from a
        large population:   sample(xrange(10000000), 60)
        """

        # Sampling without replacement entails tracking either potential
        # selections (the pool) in a list or previous selections in a set.

        # When the number of selections is small compared to the
        # population, then tracking selections is efficient, requiring
        # only a small set and an occasional reselection.  For
        # a larger number of selections, the pool tracking method is
        # preferred since the list takes less space than the
        # set and it doesn't suffer from frequent reselections.

        n = len(population)
        if not 0 <= k <= n:
            raise ValueError("sample larger than population")
        random = self.random
        _int = int
        result = [None] * k
        setsize = 21        # size of a small set minus size of an empty list
        if k > 5:
            setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
        if n <= setsize or hasattr(population, "keys"):
            # An n-length list is smaller than a k-length set, or this is a
            # mapping type so the other algorithm wouldn't work.
            pool = list(population)
            for i in xrange(k):         # invariant:  non-selected at [0,n-i)
                j = _int(random() * (n-i))
                result[i] = pool[j]
                pool[j] = pool[n-i-1]   # move non-selected item into vacancy
        else:
            try:
                selected = set()
                selected_add = selected.add
                for i in xrange(k):
                    j = _int(random() * n)
                    while j in selected:
                        j = _int(random() * n)
                    selected_add(j)
                    result[i] = population[j]
            except (TypeError, KeyError):   # handle (at least) sets
                if isinstance(population, list):
                    raise
                return self.sample(tuple(population), k)
        return result

def shuffle(self, x, random=None, int=int):
        """x, random=random.random -> shuffle list x in place; return None.

        Optional arg random is a 0-argument function returning a random
        float in [0.0, 1.0); by default, the standard random.random.
        """

        if random is None:
            random = self.random
        for i in reversed(xrange(1, len(x))):
            # pick an element in x[:i+1] with which to exchange x[i]
            j = int(random() * (i+1))
            x[i], x[j] = x[j], x[i]

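random.shuffle() is based on the Fisher–Yates shuffle:
“
Step 1: pick a random element among positions 0 to N-1 and swap it with the element at position N-1
Step 2: pick a random element among positions 0 to N-2 and swap it with the element at position N-2
Step k: pick a random element among positions 0 to N-k and swap it with the element at position N-k
(an element may be swapped with itself)
”
It is easy to verify that all permutations are equally likely after the shuffle: the N! possible sequences of choices are equally likely, and each one produces a different permutation.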

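import random

lis = [1, 2, 3]
count = 0
for test in range(10000):
    random.shuffle(lis)      # shuffle in place
    if lis == [1, 2, 3]:     # one particular permutation out of 3! = 6
        count += 1
print(count)  # should be close to 10000 / 6, i.e. about 1667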
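
The Fisher–Yates steps quoted above can also be written out directly. The following is a minimal Python 3 sketch, not the library's implementation; the function name fisher_yates_shuffle and the trial count are hypothetical choices for illustration. It tallies every resulting permutation rather than just one.

```python
import random
from collections import Counter


def fisher_yates_shuffle(items, rng=random):
    """Shuffle the list `items` in place (hypothetical helper, for illustration)."""
    for i in range(len(items) - 1, 0, -1):
        # pick j uniformly from 0..i; j == i means the element stays where it is
        j = rng.randrange(i + 1)
        items[i], items[j] = items[j], items[i]


trials = 60000
counts = Counter()
for _ in range(trials):
    lis = [1, 2, 3]
    fisher_yates_shuffle(lis)
    counts[tuple(lis)] += 1

# each of the 3! = 6 permutations should show up roughly trials / 6 times
for perm, c in sorted(counts.items()):
    print(perm, c)
```

If the permutations were not equally likely, some of the six counts would drift visibly away from trials / 6.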

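As for random.sample(), the source shows that it implements "Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement". Here "unique" refers to positions in the population, not to values, so repeated values can each be drawn. Every element is selected with probability k/n.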

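import random

lis = [1, 2, 2, 3, 3, 3]
count_a = 0
count_b = 0
for test in range(10000):
    a = random.sample(lis, 3)
    b = random.sample(lis, 3)
    if a == [3, 3, 3]:  # all three 3s drawn; expected probability 1/C(6,3) = 0.05
        count_a += 1
    if sum(b) == 6:  # the sample contains 1, 2 and 3; expected probability 0.3
        count_b += 1
print(count_a, count_b)  # should be close to 500 and 3000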
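
The k/n selection probability can be checked empirically as well. The sketch below is an illustration under assumed values (n = 6, k = 3 and the trial count are arbitrary); it samples positions instead of values so that repeated values cannot be confused with one another.

```python
import random
from collections import Counter

n, k, trials = 6, 3, 60000
hits = Counter()
for _ in range(trials):
    # sample positions 0..n-1 without replacement
    for idx in random.sample(range(n), k):
        hits[idx] += 1

# each position should be included in about k / n of the trials
freqs = {idx: hits[idx] / trials for idx in range(n)}
print(freqs)
```

With these values every frequency should land near 3/6 = 0.5.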

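The author of that blog post summarizes: "When listening to music on an MP3 player there are two modes, shuffle and random. The difference is that the former shuffles the playback order and guarantees that every song is played once, while the latter picks one song at random each time." In fact, this property of shuffling (every item is visited exactly once per pass) often helps generalization.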

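1.2 The role of Random Shuffle

In machine learning, when the dataset is very large, the data are usually not all stored in one place. Sampling schemes that draw from the entire dataset at every step, so that the drawn data must be returned to the pool each time, are therefore often hard to implement. Meanwhile, recent work has studied under which conditions random reshuffling SGD outperforms random sampling SGD; see the following papers: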
How Good is SGD with Random Shuffling?
Random Reshuffling: Simple Analysis with Vast Improvements
Random Shuffling Beats SGD after Finite Epochs
Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
Random Reshuffling is Not Always Better
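
To make the two schemes concrete, here is a minimal Python 3 sketch of the order in which data indices would be visited. It only generates index sequences and does not compute gradients or reproduce the analysis of any paper listed above; the function names, the seed, and the values of n and epochs are hypothetical.

```python
import random


def with_replacement_order(n, steps, rng):
    """Indices for `steps` SGD steps, each drawn independently from 0..n-1."""
    return [rng.randrange(n) for _ in range(steps)]


def random_reshuffling_order(n, epochs, rng):
    """Indices for `epochs` passes; every pass is a fresh permutation of 0..n-1."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)      # reshuffle at the start of every epoch
        order.extend(perm)
    return order


rng = random.Random(0)  # arbitrary seed for repeatability
n, epochs = 5, 2
print(with_replacement_order(n, n * epochs, rng))
print(random_reshuffling_order(n, epochs, rng))
```

In the reshuffling order each index appears exactly once per epoch, whereas the with-replacement order may repeat some indices and miss others within the same number of steps.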

A detailed analysis is given in the next blog post. Link:

2 Random sampling with / without replacement

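2.1 Concepts

The two differ as follows:
Random sampling with replacement: a subset of observations is drawn at random, and an observation may be drawn more than once; at every draw, each element of the population has the same chance of being selected.
Random sampling without replacement: a subset of observations is drawn at random, and once an observation has been drawn it cannot be drawn again.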
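
In Python's random module the two schemes correspond to random.choices() (with replacement, Python 3.6+) and random.sample() (without replacement). The sketch below is only an illustration; the population and the value of k are made up.

```python
import random

population = ["a", "b", "c", "d", "e"]
k = 4

with_repl = random.choices(population, k=k)    # with replacement: repeats allowed
without_repl = random.sample(population, k)    # without replacement: no position reused

print(with_repl)
print(without_repl)

# with replacement, k may even exceed the population size
print(random.choices(population, k=8))

# without replacement, asking for more than the population holds raises ValueError
try:
    random.sample(population, 8)
except ValueError as exc:
    print("ValueError:", exc)
```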

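2.2 Variance derivation

Below we give the sample mean and variance obtained when the system uses random sampling with and without replacement, respectively.
Suppose the system has $N$ users in total, and consider the finite-sum minimization problem: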
$\min x=\frac{1}{N}\sum_{i=1}^Nx_i$
Using each of the two sampling schemes, draw $K$ ($K < N$) users, denoted by