Problem
Given a stream of elements whose length $n$ is not known in advance, randomly select $s$ items from it (with $n > s$) into a set $R$, such that every element of the stream has an equal probability of being selected.
Reservoir Sampling
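Let $k$ be the number of elements received so far. When $k \le s$, the current element is placed directly into $R$. When $k > s$, generate a random integer $r \in [1, k]$:
- If $r \le s$, the element is added to $R$, replacing the element at position $r$ of $R$;
- Otherwise, the element is discarded.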
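The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here, and the first $s$ elements are assumed to fill the reservoir directly.

```python
import random


def reservoir_sample(stream, s, rng=random):
    """Return s items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    # k counts the elements received so far, starting at 1 as in the text.
    for k, item in enumerate(stream, start=1):
        if k <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Uniform integer in [1, k], both ends inclusive.
            r = rng.randint(1, k)
            if r <= s:
                # Position r is 1-based in the text, so shift to a 0-based index.
                reservoir[r - 1] = item
            # Otherwise the element is discarded.
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5, random.Random(42)))
```

If the stream has fewer than `s` elements, this sketch simply returns all of them.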
# To initialize an array a to k random elements
# of S (which is of length n), both 0-based.
# Here k is the sample size (s in the text above),
# and random [0 .. i] is inclusive at both ends.
a[0] ← S[0]
for i from 1 to k - 1 do
    r ← random [0 .. i]
    a[i] ← a[r]
    a[r] ← S[i]
for i from k to n - 1 do
    r ← random [0 .. i]
    if (r < k) then a[r] ← S[i]
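A direct 0-based Python port of this pseudocode might look like the sketch below. The name `sample_indexed` and the `rng` parameter are hypothetical. Unlike a purely streaming version, it assumes `S` is an indexable sequence, and its first loop shuffles the first `k` elements instead of copying them in order.

```python
import random


def sample_indexed(S, k, rng=random):
    """Pick k random elements of the sequence S, following the pseudocode."""
    n = len(S)
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= len(S)")
    a = [None] * k
    a[0] = S[0]
    # Inside-out shuffle of the first k elements.
    for i in range(1, k):
        r = rng.randint(0, i)  # inclusive at both ends, like random [0 .. i]
        a[i] = a[r]
        a[r] = S[i]
    # Element i overwrites a uniformly chosen slot with probability k / (i + 1).
    for i in range(k, n):
        r = rng.randint(0, i)
        if r < k:
            a[r] = S[i]
    return a


if __name__ == "__main__":
    print(sample_indexed(list(range(20)), 4, random.Random(7)))
```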
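Analysis: the algorithm runs in $O(n)$ time, since it makes a single pass over the input and does a constant amount of work per element.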
Limitation
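The algorithm requires $k$ (the sample size, written $s$ above) to be a constant known in advance. In addition, the algorithm must maintain the set $R$ for the whole run.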
Reservoir sampling makes the assumption that the desired sample fits into main memory, often implying that k is a constant independent of n.
Proof
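The proof is by induction on $n$.
- When $n = s + 1$, each element remains in $R$ with probability $\frac{s}{s+1} = \frac{s}{n}$;
- For $n > s$, assume that each of the $n$ elements remains in $R$ with probability $\frac{s}{n}$;
- When the stream grows to $n + 1$ elements, the $(n+1)$-th element $a_{n+1}$ enters $R$ with probability $\frac{s}{n+1}$, while each of the first $n$ elements $a_i$, $1 \le i \le n$, remains in $R$ with probability $\frac{s}{n} \cdot \frac{n}{n+1} = \frac{s}{n+1}$. Here $\frac{n}{n+1} = 1 - \frac{s}{n+1} \cdot \frac{1}{s}$ is the probability that $a_{n+1}$ does not overwrite the slot holding $a_i$.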
□
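The induction can also be checked with exact arithmetic. The sketch below does not sample anything: it tracks, for each element, the probability that it is still in $R$ after every step, using the same update as the inductive step. The function name `inclusion_probabilities` is an assumption made for this illustration.

```python
from fractions import Fraction


def inclusion_probabilities(n, s):
    """Exact probability that each of n stream elements ends up in the reservoir."""
    if not 0 < s <= n:
        raise ValueError("need 0 < s <= n")
    # The first s elements are stored unconditionally.
    probs = [Fraction(1)] * s
    for k in range(s + 1, n + 1):
        enter = Fraction(s, k)          # element k is accepted with probability s/k
        evict = enter * Fraction(1, s)  # a given slot is overwritten: (s/k) * (1/s) = 1/k
        probs = [p * (1 - evict) for p in probs]
        probs.append(enter)
    return probs


if __name__ == "__main__":
    # Every entry equals s/n = 2/5.
    print(inclusion_probabilities(5, 2))
```

For $s = 2$ and $n = 5$, every element ends with probability $\frac{2}{5}$, matching $\frac{s}{n}$.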