1 Random sample & Random shuffle
1.1 Example
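Commonly used sampling functions in Python's random module:
import random
# random.random() returns a random float in [0, 1)
# random.uniform(num1, num2) returns a random float in [num1, num2]
# random.randint(int1, int2) takes two integers and returns a random integer N with int1 <= N <= int2
# random.choice(lis) returns one random element of the list lis
# random.sample(lis, ele_num) returns a new list of ele_num elements drawn at random from lis; the original list is left unchanged
# random.shuffle(lis) randomly reorders the elements of lis in place: it changes the order of the sequence passed in and does not return a new one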
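To make the list above concrete, here is a minimal Python 3 sketch that calls each function once. The seed, the list `items`, and the numeric ranges are arbitrary choices made for illustration only.

```python
import random

random.seed(0)  # fixed seed only so that a run can be repeated

x = random.random()             # float in [0.0, 1.0)
u = random.uniform(2.0, 5.0)    # float between 2.0 and 5.0
n = random.randint(1, 6)        # integer in 1..6, both ends included

items = ["a", "b", "c", "d", "e"]
c = random.choice(items)        # one element of items
s = random.sample(items, 3)     # new list of 3 elements; items is unchanged

shuffled = items[:]             # copy first, because shuffle works in place
ret = random.shuffle(shuffled)  # ret is None

print(x, u, n, c, s, shuffled, ret)
```

Note that `random.shuffle` returns `None`, so writing `lis = random.shuffle(lis)` discards the list.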
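The two functions random.sample() and random.shuffle() have been discussed in detail in an earlier blog post. Their source code is shown below (the listing is the Python 2 version, hence the use of xrange):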
def sample(self, population, k):
    """Chooses k unique random elements from a population sequence.
    Returns a new list containing elements from the population while
    leaving the original population unchanged. The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples. This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).
    Members of the population need not be hashable or unique. If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.
    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population: sample(xrange(10000000), 60)
    """
    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.
    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection. For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("sample larger than population")
    random = self.random
    _int = int
    result = [None] * k
    setsize = 21  # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4))  # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in xrange(k):  # invariant: non-selected at [0,n-i)
            j = _int(random() * (n-i))
            result[i] = pool[j]
            pool[j] = pool[n-i-1]  # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in xrange(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                result[i] = population[j]
        except (TypeError, KeyError):  # handle (at least) sets
            if isinstance(population, list):
                raise
            return self.sample(tuple(population), k)
    return result
def shuffle(self, x, random=None, int=int):
    """x, random=random.random -> shuffle list x in place; return None.
    Optional arg random is a 0-argument function returning a random
    float in [0.0, 1.0); by default, the standard random.random.
    """
    if random is None:
        random = self.random
    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random() * (i+1))
        x[i], x[j] = x[j], x[i]
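random.shuffle() is built on the Fisher–Yates shuffle:
“
Step 1: pick a random element among positions 0 to N-1 and swap it with the element at position N-1
Step 2: pick a random element among positions 0 to N-2 and swap it with the element at position N-2
Step k: pick a random element among positions 0 to N-k and swap it with the element at position N-k
(an element may be swapped with itself)
”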
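The steps can be written out as a short Python 3 sketch. This is an illustrative re-implementation, not the standard library's code; the function name `fisher_yates_shuffle` and the `rng` parameter are assumptions introduced here.

```python
import random

def fisher_yates_shuffle(items, rng=random):
    """Shuffle the list `items` in place and return None (hypothetical helper)."""
    # Walk from the last position down to position 1.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform integer in 0..i, so j == i is allowed
        items[i], items[j] = items[j], items[i]

deck = list(range(10))
result = fisher_yates_shuffle(deck, random.Random(42))  # arbitrary seed
print(deck)
```

Allowing `j == i` (an element swapped with itself) is what keeps all N! orderings equally likely; excluding it would give a different algorithm that only produces cyclic permutations.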
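It is easy to check empirically that every permutation is equally likely after shuffle. A list of 3 elements has 3! = 6 permutations, so the count printed below should be close to 10000 / 6 ≈ 1667:
import random
lis = [1, 2, 3]
count = 0
for test in range(10000):
    # shuffle works in place, so each trial starts from the previous order
    random.shuffle(lis)
    if lis == [1, 2, 3]:
        count += 1
print(count)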
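The same check can be extended to all six permutations at once. The sketch below is one possible way to do it with `collections.Counter`; the trial count and the seed are arbitrary.

```python
import random
from collections import Counter

random.seed(1)   # arbitrary seed, only for repeatability
trials = 60000
lis = [1, 2, 3]
counts = Counter()

for _ in range(trials):
    random.shuffle(lis)
    counts[tuple(lis)] += 1   # tuples are hashable, lists are not

# Each relative frequency should be close to 1/6.
for perm in sorted(counts):
    print(perm, counts[perm] / trials)
```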
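As for random.sample(), the source shows that it does what its documentation says: "Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement." Every position in the population is included in the sample with the same probability k/n:
import random
lis = [1, 2, 2, 3, 3, 3]
count_a = 0
count_b = 0
for test in range(10000):
    a = random.sample(lis, 3)
    b = random.sample(lis, 3)
    if a == [3, 3, 3]:  # expected probability 3/6 * 2/5 * 1/4 = 0.05
        count_a += 1
    if sum(b) == 6:  # the sample contains 1, 2 and 3; expected probability 0.3
        count_b += 1
print(count_a, count_b)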
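The expected probabilities can also be obtained exactly by enumeration rather than by simulation. The sketch below assumes the idealized model in which every ordered choice of k distinct positions is equally likely; the variable names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import permutations

lis = [1, 2, 2, 3, 3, 3]
k = 3

# All ordered choices of k distinct positions (6 * 5 * 4 = 120 of them).
index_draws = list(permutations(range(len(lis)), k))
value_draws = [tuple(lis[i] for i in idx) for idx in index_draws]
total = len(index_draws)

p_all_threes = Fraction(sum(d == (3, 3, 3) for d in value_draws), total)
p_sum_six = Fraction(sum(sum(d) == 6 for d in value_draws), total)
# Probability that one fixed position (here position 0) is in the sample.
p_includes_first = Fraction(sum(0 in idx for idx in index_draws), total)

print(p_all_threes, p_sum_six, p_includes_first)  # prints: 1/20 3/10 1/2
```

The last value is k/n = 3/6, matching the inclusion probability stated above.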
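The author of that blog post sums it up: "An MP3 player offers two modes, shuffle and random. The difference is that the former scrambles the play order and guarantees that every song is played once, whereas the latter picks one song at random each time." In practice, this property of shuffle (every item is visited exactly once per pass) often helps generalization.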
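The two playback modes can be imitated in a few lines. This is only a toy sketch; the song names and the seed are made up for illustration.

```python
import random

songs = ["A", "B", "C", "D", "E"]
rng = random.Random(7)  # arbitrary seed

# "Shuffle" mode: one random permutation, each song is played exactly once.
shuffle_order = songs[:]
rng.shuffle(shuffle_order)

# "Random" mode: an independent pick for every slot, so repeats are possible
# and some songs may never be played.
random_order = [rng.choice(songs) for _ in songs]

print(shuffle_order)
print(random_order)
```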
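1.2 The role of random shuffling
In machine learning, when the dataset is very large, the data is usually not all stored in one place. An operation in the style of random.sample, which draws from the full dataset at every step and returns the drawn data to the pool afterwards, is then often hard to implement. Recent work has also studied the conditions under which random reshuffling SGD outperforms random sampling SGD; see the following papers: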
How Good is SGD with Random Shuffling?
Random Reshuffling: Simple Analysis with Vast Improvements
Random Shuffling Beats SGD after Finite Epochs
Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
Random Reshuffling is Not Always Better
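To fix the terminology used in these titles, the following sketch generates the order in which data indices would be visited under three schemes: independent sampling at every step, random reshuffling (a new permutation per epoch), and single shuffle (one permutation reused in every epoch). It only produces index schedules and performs no gradient step; the function names are hypothetical and chosen here for illustration.

```python
import random

def with_replacement_order(n, epochs, rng):
    """Each step draws an index independently; an epoch may skip or repeat data."""
    return [[rng.randrange(n) for _ in range(n)] for _ in range(epochs)]

def random_reshuffling_order(n, epochs, rng):
    """A fresh permutation per epoch; every index appears once per epoch."""
    orders = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        orders.append(perm)
    return orders

def single_shuffle_order(n, epochs, rng):
    """One permutation drawn up front and reused in every epoch."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [perm[:] for _ in range(epochs)]

rng = random.Random(0)  # arbitrary seed
n, epochs = 6, 3
wr = with_replacement_order(n, epochs, rng)
rr = random_reshuffling_order(n, epochs, rng)
ss = single_shuffle_order(n, epochs, rng)

for name, schedule in [("with replacement", wr),
                       ("random reshuffling", rr),
                       ("single shuffle", ss)]:
    print(name, schedule)
```

The reshuffling variants only need a permutation of indices per epoch, which is why they are convenient when the data is read sequentially from storage.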
A detailed analysis is given in the next blog post. Link:
2 Random sampling with / without replacement
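2.1 Concepts
The two differ as follows:
Random sampling with replacement: a subset of observations is selected at random, and the same observation may be selected more than once; every element of the population has the same chance of being selected on each draw.
Random sampling without replacement: a subset of observations is selected at random, and once an observation has been selected it cannot be selected again.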
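In Python's standard library the two schemes correspond to `random.choices` (with replacement, available since Python 3.6) and `random.sample` (without replacement). The sketch below contrasts them; the population, sample sizes, and seed are arbitrary.

```python
import random

rng = random.Random(3)        # arbitrary seed
population = list(range(10))  # N = 10
k = 5

with_repl = rng.choices(population, k=k)   # with replacement: repeats allowed
without_repl = rng.sample(population, k)   # without replacement: no repeats

# With replacement, the sample may even be larger than the population ...
oversized = rng.choices(population, k=15)

# ... whereas without replacement such a request is rejected.
try:
    rng.sample(population, 15)
    sample_rejected = False
except ValueError:
    sample_rejected = True

print(with_repl, without_repl, len(oversized), sample_rejected)
```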
2.2 Variance derivation
Below we give the sample mean and variance obtained when the system uses random sampling with and without replacement, respectively.
Suppose the system has $N$ users in total, and consider the finite-sum minimization problem:
$$\min x=\frac{1}{N}\sum_{i=1}^{N}x_i$$
Using each of the two sampling schemes, draw $K$ ($K<N$) users, denoted