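I am trying to get a random sample from a very large text corpus.
Your excellent synthesis answer currently shows iter_sample_fast(gen, pop) as the winner. However, I tried Katriel's recommendation of random.sample(list(gen), pop), and it is blazingly fast by comparison!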
import random

def iter_sample_easy(iterable, samplesize):
    return random.sample(list(iterable), samplesize)
Sampling 1000 from 10000
Using iter_sample_fast 0.0192 s
Using iter_sample_easy 0.0009 s
Sampling 10000 from 100000
Using iter_sample_fast 0.1807 s
Using iter_sample_easy 0.0103 s
Sampling 100000 from 1000000
Using iter_sample_fast 1.8192 s
Using iter_sample_easy 0.2268 s
Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_easy 0.3297 s
Sampling 500000 from 1000000
Using iter_sample_easy 0.5628 s
Sampling 2000000 from 5000000
Using iter_sample_easy 2.7147 s
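For anyone who wants to repeat the comparison, timings in this format can be collected with a small harness. This is only a sketch, not the script that produced the numbers above: time_sampler is a hypothetical helper name, and the choice to keep the best of three runs is an assumption.

```python
import random
import time

def iter_sample_easy(iterable, samplesize):
    return random.sample(list(iterable), samplesize)

def time_sampler(sampler, popsize, samplesize, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        gen = (i for i in range(popsize))  # fresh generator for every run
        start = time.perf_counter()
        sampler(gen, samplesize)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    for popsize, samplesize in [(10000, 1000), (100000, 10000)]:
        print("Sampling %d from %d" % (samplesize, popsize))
        for sampler in (iter_sample_easy,):  # add other samplers here
            elapsed = time_sampler(sampler, popsize, samplesize)
            print("Using %s %.4f s" % (sampler.__name__, elapsed))
```

Absolute numbers will differ by machine and Python version; only the ratios between samplers are meaningful.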
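Now, as your corpus gets very large, materializing the whole iterable into a list will use prohibitively large amounts of memory. But we can still exploit Python's blazing speed if we can chunk up the problem: basically, we pick a CHUNKSIZE that is "reasonably small", run random.sample on chunks of that size, and then use random.sample again to merge the results together. We just have to get the boundary conditions right.
I see how to do it if the length of list(iterable) is an exact multiple of CHUNKSIZE and not bigger than samplesize*CHUNKSIZE: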
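import itertools
import random

def iter_sample_dist_naive(iterable, samplesize):
    CHUNKSIZE = 10000
    samples = []
    it = iter(iterable)
    try:
        while True:
            # next() raises StopIteration once the iterable is exhausted
            first = next(it)
            chunk = itertools.chain([first], itertools.islice(it, CHUNKSIZE-1))
            # Sample each chunk independently and pool the results
            samples += iter_sample_easy(chunk, samplesize)
    except StopIteration:
        # Draw the final sample from the pooled per-chunk samples
        return random.sample(samples, samplesize)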
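However, the code above produces a non-uniform sample when len(list(iterable)) % CHUNKSIZE != 0, and it runs out of memory as len(list(iterable)) * samplesize / CHUNKSIZE becomes "very large". Fixing these bugs is above my pay grade, I'm afraid, but a solution is described in this blog post and sounds quite reasonable to me. (Search terms: "distributed random sampling", "distributed reservoir sampling".)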
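One way to avoid both problems, sketched here under my own assumptions and not taken from the blog post, is to tag every item with an independent random key and keep only the samplesize items with the smallest keys. The survivors form a uniform sample regardless of how the input is split into chunks, and at most roughly chunksize + samplesize tagged items are held at once. The name iter_sample_keyed and the chunksize parameter are hypothetical; I have not benchmarked it against the functions above, and it may well be slower.

```python
import heapq
import itertools
import random

def iter_sample_keyed(iterable, samplesize, chunksize=10000):
    """Uniform sample without replacement using random keys (a sketch)."""
    it = iter(iterable)
    counter = itertools.count()  # unique tie-breaker so items are never compared
    best = []                    # smallest (key, index, item) triples seen so far
    while True:
        chunk = list(itertools.islice(it, chunksize))
        if not chunk:
            break
        tagged = [(random.random(), next(counter), item) for item in chunk]
        # Keep only the samplesize smallest keys across old survivors and the new chunk
        best = heapq.nsmallest(samplesize, itertools.chain(best, tagged))
    if len(best) < samplesize:
        raise ValueError("sample larger than population")
    # Ordering by independent random keys is itself a random order
    return [item for _, _, item in best]
```

Because each chunk is reduced to its smallest keys independently, the per-chunk step could also run on separate workers, with the final merge taking the smallest keys overall.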
Sampling 1000 from 10000
Using iter_sample_fast 0.0182 s
Using iter_sample_dist_naive 0.0017 s
Using iter_sample_easy 0.0009 s
Sampling 10000 from 100000
Using iter_sample_fast 0.1830 s
Using iter_sample_dist_naive 0.0402 s
Using iter_sample_easy 0.0103 s
Sampling 100000 from 1000000
Using iter_sample_fast 1.7965 s
Using iter_sample_dist_naive 0.6726 s
Using iter_sample_easy 0.2268 s
Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_dist_naive 0.8209 s
Using iter_sample_easy 0.3297 s
Where we really win is when samplesize is very small relative to len(list(iterable)).
Sampling 20 from 10000
Using iterSample 0.0202 s
Using sample_from_iterable 0.0047 s
Using iter_sample_fast 0.0196 s
Using iter_sample_easy 0.0001 s
Using iter_sample_dist_naive 0.0004 s
Sampling 20 from 100000
Using iterSample 0.2004 s
Using sample_from_iterable 0.0522 s
Using iter_sample_fast 0.1903 s
Using iter_sample_easy 0.0016 s
Using iter_sample_dist_naive 0.0029 s
Sampling 20 from 1000000
Using iterSample 1.9343 s
Using sample_from_iterable 0.4907 s
Using iter_sample_fast 1.9533 s
Using iter_sample_easy 0.0211 s
Using iter_sample_dist_naive 0.0319 s
Sampling 20 from 10000000
Using iterSample 18.6686 s
Using sample_from_iterable 4.8120 s
Using iter_sample_fast 19.3525 s
Using iter_sample_easy 0.3162 s
Using iter_sample_dist_naive 0.3210 s
Sampling 20 from 100000000
Using iter_sample_easy 2.8248 s
Using iter_sample_dist_naive 3.3817 s
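The reason small samples favor the chunked version can be seen by counting how many candidates iter_sample_dist_naive pools before its final random.sample call, which is the len(list(iterable)) * samplesize / CHUNKSIZE quantity mentioned earlier. The helper below is a hypothetical illustration and assumes the population size is an exact multiple of the chunk size.

```python
def candidates_held(popsize, samplesize, chunksize=10000):
    """Items pooled in `samples` by iter_sample_dist_naive (hypothetical helper)."""
    num_chunks = (popsize + chunksize - 1) // chunksize  # ceiling division
    return num_chunks * samplesize

# Small sample from a huge population: far fewer items than a full list
print(candidates_held(100000000, 20))     # 200000

# Large sample: the pool is bigger than the population itself
print(candidates_held(1000000, 200000))   # 20000000
```

In the first case the pool is 500 times smaller than the 100000000-item list that iter_sample_easy builds; in the second it is 20 times larger than the population, which is where the memory problem comes from.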