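1. Parameters
The sample method takes three parameters, with the following meanings:
withReplacement: whether an element can be sampled more than once (sampling with replacement).
fraction: the expected size of the sample as a fraction of the RDD's size. When withReplacement=false, it is the probability that each element is chosen, and it must lie in [0, 1]. When withReplacement=true, it is the expected number of times each element is chosen, and it must be greater than or equal to 0.
seed: the seed for the random number generator.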
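To make the two meanings of fraction concrete, here is a minimal sketch in plain Scala that imitates both modes on a local collection. It is not Spark's implementation: the object SampleSemantics, its two methods, and the nextPoisson helper are hypothetical names introduced only for illustration, and the Poisson draw uses Knuth's simple method.

```scala
import scala.util.Random

// Hypothetical local illustration of the parameter semantics; not Spark code.
object SampleSemantics {

  // withReplacement = false: one Bernoulli trial per element, so every element
  // appears at most once and is kept with probability `fraction`.
  def sampleWithoutReplacement[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction ($fraction) must be in [0, 1]")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draws from a Poisson distribution with the given mean (Knuth's method,
  // which is adequate for the small means used in this sketch).
  private def nextPoisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // withReplacement = true: each element is emitted a Poisson(fraction)
  // distributed number of times, so it may appear zero, one, or several times.
  def sampleWithReplacement[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"fraction ($fraction) must be nonnegative")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(nextPoisson(fraction, rng))(x))
  }
}
```

Calling SampleSemantics.sampleWithoutReplacement(1 to 1000, 0.1, 42L) should return roughly 100 distinct elements, while SampleSemantics.sampleWithReplacement(1 to 1000, 2.0, 42L) should return roughly 2000 elements with repeats. Reusing the same seed reproduces the same sample.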
2. Source code walkthrough
Following sample down into the source, we can see that it calls the sample method of RDD:
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")
  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
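The first part of the method only checks that fraction is valid; the real work happens in the if and else branches. Both branches return a PartitionwiseSampledRDD object, whose constructor receives a PoissonSampler (when withReplacement=true) or a BernoulliSampler (when withReplacement=false). When withReplacement=false, fraction is the probability of choosing each element, so we look at BernoulliSampler first.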
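The relationship between the RDD wrapper and the sampler it receives can be sketched as follows. This is a simplified illustration in plain Scala, under the assumption that each partition is sampled independently by its own seeded sampler instance, as the name PartitionwiseSampledRDD and the sampler's setSeed method suggest. SimpleSampler, SimpleBernoulliSampler, and PartitionwiseSketch are hypothetical names and not Spark classes.

```scala
import scala.util.Random

// Hypothetical stand-in for a sampler: it can be seeded, and it turns an
// iterator over a partition into an iterator over the sampled elements.
trait SimpleSampler[T] {
  def setSeed(seed: Long): Unit
  def sample(items: Iterator[T]): Iterator[T]
}

// Keeps each element with probability `fraction` (the withReplacement = false case).
class SimpleBernoulliSampler[T](fraction: Double) extends SimpleSampler[T] {
  private val rng = new Random()
  override def setSeed(seed: Long): Unit = rng.setSeed(seed)
  override def sample(items: Iterator[T]): Iterator[T] =
    items.filter(_ => rng.nextDouble() < fraction)
}

object PartitionwiseSketch {
  // Samples every partition with a fresh sampler. Each partition receives its
  // own seed derived from the top-level seed (an assumption of this sketch),
  // so partitions do not replay the same random sequence, yet the overall
  // result is reproducible for a given top-level seed.
  def samplePartitions[T](
      partitions: Seq[Seq[T]],
      newSampler: () => SimpleSampler[T],
      seed: Long): Seq[Seq[T]] = {
    val seedSource = new Random(seed)
    partitions.map { part =>
      val sampler = newSampler()
      sampler.setSeed(seedSource.nextLong())
      sampler.sample(part.iterator).toVector
    }
  }
}
```

For example, PartitionwiseSketch.samplePartitions(Seq(1 to 5, 6 to 10), () => new SimpleBernoulliSampler[Int](0.5), 7L) samples the two partitions separately. Passing a different factory, such as one that builds a Poisson-based sampler, would switch the behavior to sampling with replacement without changing the wrapper.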
2. BernoulliSampler (Bernoulli distribution)
class BernoulliSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T] {
  /** epsilon slop to avoid failure from floating point jitter */
  require(
    fraction >= (0.0 - RandomSampler.roundingEpsilon)
      && fraction <= (1.0 + RandomSampler.roundingEpsilon),
    s"Sampling fraction ($fraction) must be on interval [0, 1]")
  private val rng: Random = RandomSampler.newDefaultRNG
  override def setSeed(seed: Long): Unit = rng.setSeed(seed)
  private lazy val gapSampling: GapSampling =
new GapSampli