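This is also an action operator, but it triggers at least two jobs (when num > 0 and the RDD is not empty): the first is count(), which gets the total number of records, and the second is collect(). If the collected sample is too small, a while loop runs this.sample(...).collect() again, so the total number of actions triggered is >= 2.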
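The method takes three parameters. Parameter 1, withReplacement: whether sampling is done with replacement. It decides both how the sampling fraction is computed and which sampler is used: true uses PoissonBounds for the fraction and PoissonSampler for the sampling, false uses BinomialBounds and BernoulliSampler. Parameter 2, num: the number of elements to sample. Parameter 3, seed: the random seed, which makes repeated sampling reproducible. If we fix the seed, the sample is the same no matter how many times we run it on the same data; if we leave it out, a random default is used and the result differs from run to run.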
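The effect of a fixed seed can be seen without a cluster. The following is a minimal local sketch, not Spark code: `localSample` is a hypothetical stand-in that shuffles a plain Scala collection with a seeded generator and keeps the first `num` elements.

```scala
import scala.util.Random

object SeedDemo {
  // Hypothetical local stand-in for sampling without replacement:
  // shuffle with a seeded generator, then keep the first `num` elements.
  def localSample(data: Seq[Int], num: Int, seed: Long): Seq[Int] =
    new Random(seed).shuffle(data).take(num)

  def main(args: Array[String]): Unit = {
    val data = 1 to 100
    val a = localSample(data, 5, seed = 42L)
    val b = localSample(data, 5, seed = 42L)
    // Same seed and same input give the same sample.
    println(a == b)
  }
}
```

With a different seed on each call, as happens when the default `Utils.random.nextLong` is used, the two samples would generally differ.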
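The method begins by calling this.count() to get the total number of records. If withReplacement is false and num is greater than or equal to that total, it triggers one job, this.collect(), and returns the whole data set in shuffled order.
Otherwise it computes a sampling fraction from the withReplacement and num arguments and the total record count, then calls sample, which returns a PartitionwiseSampledRDD. The collect() that follows triggers another job, and if the result is still too small the while loop triggers further jobs.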
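The job counting described above can be sketched locally. This is only a simulation under stated assumptions: `takeSampleLocal` is a hypothetical name, it covers only the without-replacement case, the sampling fraction is passed in directly instead of being computed, and a "job" is just a counter increment.

```scala
import scala.util.Random

object TakeSampleFlow {
  /** Returns the sample and the number of simulated jobs (without replacement only). */
  def takeSampleLocal(data: Vector[Int], num: Int, fraction: Double,
      seed: Long): (Vector[Int], Int) = {
    require(num >= 0, "Negative number of elements requested")
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    var jobs = 0
    if (num == 0) return (Vector.empty, jobs) // no job at all
    jobs += 1                                 // job 1: count()
    val total = data.size
    if (total == 0) return (Vector.empty, jobs)
    val rand = new Random(seed)
    if (num >= total) {
      jobs += 1                               // job 2: collect() of everything
      (rand.shuffle(data), jobs)
    } else {
      def sampleOnce(): Vector[Int] = {
        jobs += 1                             // one sample(...).collect() per call
        val r = new Random(rand.nextInt())
        data.filter(_ => r.nextDouble() < fraction)
      }
      var samples = sampleOnce()
      // Re-sample until enough elements were drawn, as the while loop does.
      while (samples.length < num) samples = sampleOnce()
      (rand.shuffle(samples).take(num), jobs)
    }
  }
}
```

For a non-empty input and num > 0 the counter always ends at 2 or more, which matches the claim that takeSample triggers at least two jobs.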
/**
 * TODO: returns an array with a fixed number of elements
 * Return a fixed-size sampled subset of this RDD in an array
 *
 * @param withReplacement whether sampling is done with replacement
 *                        TODO: true and false select two different samplers;
 *                        with replacement, the same element may be drawn more than once
 * @param num size of the returned sample
 * @param seed seed for the random number generator
 *             TODO: the default is a random value, so two runs of this operator without an
 *             explicit seed return different results;
 *             with a fixed seed, every run returns the same result
 * @return sample of specified size in an array
 *         TODO: not recommended when the result is large, because it is returned to driver memory
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = withScope {
  val numStDev = 10.0
  require(num >= 0, "Negative number of elements requested")
  require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    "Cannot support a sample size > Int.MaxValue - " +
    s"$numStDev * math.sqrt(Int.MaxValue)")
  if (num == 0) {
    new Array[T](0)
  } else {
    // TODO: triggers the first job: count() gets the total number of records
    val initialCount = this.count()
    if (initialCount == 0) {
      new Array[T](0)
    } else {
      val rand = new Random(seed)
      // TODO: without replacement and num >= total count: collect everything and
      // let randomizeInPlace shuffle the order before returning
      if (!withReplacement && num >= initialCount) {
        Utils.randomizeInPlace(this.collect(), rand)
      } else {
        // TODO: computes the sampling fraction with a different bound depending on
        // whether withReplacement is true or false
        val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
          withReplacement)
        // TODO: this.sample returns a PartitionwiseSampledRDD
        var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
        // If the first sample didn't turn out large enough, keep trying to take samples;
        // this shouldn't happen often because we use a big multiplier for the initial size
        var numIters = 0
        while (samples.length < num) {
          logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
          // TODO: the sample was too small, so run the sampling again inside the loop
          samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          numIters += 1
        }
        Utils.randomizeInPlace(samples, rand).take(num)
      }
    }
  }
}
/**
 * Shuffle the elements of an array into a random order, modifying the
 * original array. Returns the original array.
 */
def randomizeInPlace[T](arr: Array[T], rand: Random = new Random): Array[T] = {
  for (i <- (arr.length - 1) to 1 by -1) {
    val j = rand.nextInt(i + 1)
    val tmp = arr(j)
    arr(j) = arr(i)
    arr(i) = tmp
  }
  arr
}
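randomizeInPlace is a Fisher-Yates shuffle: it walks from the last index down to 1 and swaps each position with a randomly chosen position at or below it. A small standalone usage sketch follows; the function body is copied from above into a hypothetical `ShuffleDemo` object, and `java.util.Random` is assumed as the generator type.

```scala
import java.util.Random

object ShuffleDemo {
  // Same loop as Utils.randomizeInPlace above, copied so the example runs on its own.
  def randomizeInPlace[T](arr: Array[T], rand: Random = new Random): Array[T] = {
    for (i <- (arr.length - 1) to 1 by -1) {
      val j = rand.nextInt(i + 1) // random index in [0, i]
      val tmp = arr(j)
      arr(j) = arr(i)
      arr(i) = tmp
    }
    arr
  }

  def main(args: Array[String]): Unit = {
    val arr = Array(1, 2, 3, 4, 5)
    val out = randomizeInPlace(arr, new Random(1L))
    // The input array itself is modified and returned; no copy is made.
    println(out eq arr)
    // The result is a permutation: same elements, possibly different order.
    println(out.sorted.mkString(","))
  }
}
```

Because the shuffle is done in place, takeSample can shuffle the collected array and then simply `take(num)` from the front.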
SamplingUtils class
def computeFractionForSampleSize(sampleSizeLowerBound: Int, total: Long,
    withReplacement: Boolean): Double = {
  // TODO: withReplacement is true
  if (withReplacement) {
    PoissonBounds.getUpperBound(sampleSizeLowerBound) / total
  } else {
    // TODO: withReplacement is false
    val fraction = sampleSizeLowerBound.toDouble / total
    BinomialBounds.getUpperBound(1e-4, total, fraction)
  }
}
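To get a feel for how much the fraction is inflated above num / total, here is a standalone sketch. The two bound formulas are written from my recollection of Spark's PoissonBounds and BinomialBounds and should be treated as assumptions to check against the Spark version in use; the object and method names are hypothetical.

```scala
object FractionDemo {
  // Assumed form of PoissonBounds.getUpperBound: the mean plus a few standard deviations.
  def poissonUpperBound(s: Double): Double = {
    val numStd = if (s < 6.0) 12.0 else if (s < 16.0) 9.0 else 6.0
    math.max(s + numStd * math.sqrt(s), 1e-10)
  }

  // Assumed form of BinomialBounds.getUpperBound, capped at 1.0.
  def binomialUpperBound(delta: Double, n: Long, fraction: Double): Double = {
    val gamma = -math.log(delta) / n.toDouble
    math.min(1.0,
      math.max(1e-10, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)))
  }

  // Same branching as computeFractionForSampleSize above.
  def computeFraction(num: Int, total: Long, withReplacement: Boolean): Double =
    if (withReplacement) poissonUpperBound(num.toDouble) / total.toDouble
    else binomialUpperBound(1e-4, total, num.toDouble / total.toDouble)

  def main(args: Array[String]): Unit = {
    // Naive fraction would be 100 / 10000 = 0.01; both results are larger than that.
    println(computeFraction(100, 10000L, withReplacement = true))
    println(computeFraction(100, 10000L, withReplacement = false))
  }
}
```

Oversampling like this is why the while loop in takeSample rarely has to run: the first sample is very likely to contain at least num elements, and the surplus is cut off by `take(num)`.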
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")
  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      // TODO: samples directly from the first parent RDD
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
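The difference between the two samplers can be illustrated with simplified stand-ins. These are not Spark's PoissonSampler and BernoulliSampler, whose implementations are more involved; the sketch below only shows the idea: Bernoulli keeps each element at most once with probability `fraction`, while Poisson emits each element k times with k drawn from a Poisson distribution with mean `fraction`. All names are hypothetical, and the Poisson draw uses Knuth's multiplication method.

```scala
import java.util.Random

object SamplerSketch {
  // Without replacement: each element is kept independently with probability `fraction`.
  def bernoulli[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // Knuth's method: multiply uniforms until the product drops to exp(-mean) or below.
  def poissonDraw(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: each element is emitted k times, k ~ Poisson(fraction).
  def poisson[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
    items.flatMap(x => Iterator.fill(poissonDraw(fraction, rng))(x))

  def main(args: Array[String]): Unit = {
    val b = bernoulli((1 to 20).iterator, 0.5, new Random(1L)).toVector
    val p = poisson((1 to 20).iterator, 0.5, new Random(1L)).toVector
    println(b) // never contains an input element twice
    println(p) // may contain the same input element more than once
  }
}
```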
override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
  val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
  val thisSampler = sampler.clone
  thisSampler.setSeed(split.seed)
  // TODO: reads the data from the first parent RDD and filters it with the configured sampler
  /**
   * TODO:/** Returns the first parent RDD */
   * protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
   *   dependencies.head.rdd.asInstanceOf[RDD[U]]
   * }
   */
  thisSampler.sample(firstParent[T].iterator(split.prev, context))
}
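compute clones the sampler and seeds the clone with split.seed, so every partition samples with its own generator. The sketch below imitates that on local data. It assumes that the per-partition seeds are drawn from one generator created from the RDD-level seed, which is how I recall PartitionwiseSampledRDD building its partitions; this is not shown in the code above. The names are hypothetical and a plain Bernoulli filter stands in for the sampler.

```scala
import java.util.Random

object PartitionSeedSketch {
  // Assumption: one master generator hands out a distinct seed per partition.
  def partitionSeeds(seed: Long, numPartitions: Int): Vector[Long] = {
    val random = new Random(seed)
    Vector.fill(numPartitions)(random.nextLong())
  }

  // Each "partition" is sampled independently with its own seeded generator,
  // mirroring sampler.clone followed by setSeed(split.seed).
  def samplePartitions(partitions: Vector[Vector[Int]], fraction: Double,
      seed: Long): Vector[Vector[Int]] =
    partitions.zip(partitionSeeds(seed, partitions.size)).map { case (part, s) =>
      val rng = new Random(s)
      part.filter(_ => rng.nextDouble() < fraction)
    }

  def main(args: Array[String]): Unit = {
    val parts = Vector((1 to 10).toVector, (11 to 20).toVector, (21 to 30).toVector)
    // Same top-level seed gives the same per-partition samples on every run.
    println(samplePartitions(parts, 0.5, 99L) == samplePartitions(parts, 0.5, 99L))
  }
}
```

Seeding per partition keeps the result reproducible for a given top-level seed and partitioning, while avoiding the same random sequence being reused in every partition.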