How takeSample works: a source code walkthrough (Spark 3.0)

best啊李

Article link: https://blog.csdn.net/qq_27015119/article/details/120045910

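takeSample is also an action operator, but a single call normally triggers at least two jobs. The first is count(), which gets the total number of records. The second is collect(). If the collected sample is still smaller than num, a while loop runs this.sample(...).collect() again, so the total number of jobs is >= 2. The only exceptions are num == 0, which returns immediately without running any job, and an empty RDD, which stops after count().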

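The method takes three parameters. Parameter 1, withReplacement: whether sampling is done with replacement. It decides both how the sampling fraction is computed and which sampler is used: true uses PoissonBounds for the fraction and PoissonSampler for the sampling, false uses BinomialBounds and BernoulliSampler. Parameter 2, num: the number of elements to return. Parameter 3, seed: the random seed. It defaults to a random value, so two calls without an explicit seed generally return different samples. If you pass the same seed on every call, repeated calls on the same, unchanged RDD return the same result.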

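The method starts by calling this.count() to get the total number of records. If withReplacement is false and num is greater than or equal to that total, it triggers one more job, this.collect(), and returns every element, but in shuffled order.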

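Otherwise it derives a sampling probability (fraction) from withReplacement, num and the total record count, then calls sample, which returns a PartitionwiseSampledRDD. Calling collect() on that RDD triggers another job, and if the result contains fewer than num elements, the while loop triggers further jobs until there are enough.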
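
The control flow described above can be modelled without a cluster. The following is a minimal sketch, not Spark code: TakeSampleFlow, Outcome and the jobs counter are hypothetical names made up for illustration, the model only covers withReplacement = false, and it replaces Spark's fraction formula with a plain 2x oversampling factor (an assumption chosen for simplicity). It counts one job for every step that would call count() or collect() on a real RDD.

```scala
import scala.util.Random

// Local model of takeSample(withReplacement = false, ...); hypothetical names, not Spark code.
object TakeSampleFlow {
  // The sample plus the number of jobs a real RDD would have run.
  final case class Outcome(sample: Vector[Int], jobs: Int)

  def takeSampleModel(data: Vector[Int], num: Int, seed: Long): Outcome = {
    require(num >= 0, "Negative number of elements requested")
    if (num == 0) {
      Outcome(Vector.empty, jobs = 0)          // returns before any job runs
    } else if (data.isEmpty) {
      Outcome(Vector.empty, jobs = 1)          // only count() ran
    } else {
      var jobs = 1                             // count()
      val rand = new Random(seed)
      if (num >= data.length) {
        jobs += 1                              // collect() of the whole dataset
        Outcome(rand.shuffle(data), jobs)
      } else {
        // Assumption: 2x oversampling stands in for computeFractionForSampleSize.
        val fraction = math.min(1.0, 2.0 * num / data.length)
        def sampleOnce(): Vector[Int] = {
          jobs += 1                            // sample(...).collect()
          val r = new Random(rand.nextInt())
          data.filter(_ => r.nextDouble() < fraction)
        }
        var samples = sampleOnce()
        while (samples.length < num) samples = sampleOnce()  // re-sample when short
        Outcome(rand.shuffle(samples).take(num), jobs)
      }
    }
  }
}
```

For example, TakeSampleFlow.takeSampleModel(Vector(1, 2, 3), 5, 1L) reports two jobs and returns all three elements in shuffled order, which corresponds to the num >= count branch.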

/**
 * Return a fixed-size sampled subset of this RDD in an array
 * NOTE: the returned array has a fixed number of elements.
 *
 * @param withReplacement whether sampling is done with replacement
 *                        NOTE: true and false select two different samplers; with replacement,
 *                        the same element may be drawn more than once.
 * @param num size of the returned sample
 * @param seed seed for the random number generator
 *             NOTE: the default is a random value, so two runs without an explicit seed
 *             return different results; with a fixed seed, every run returns the same result.
 * @return sample of specified size in an array
 * NOTE: not recommended when the result is large, because it is all returned to driver memory.
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = withScope {
  val numStDev = 10.0

  require(num >= 0, "Negative number of elements requested")
  require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    "Cannot support a sample size > Int.MaxValue - " +
    s"$numStDev * math.sqrt(Int.MaxValue)")

  if (num == 0) {
    new Array[T](0)
  } else {
    // NOTE: triggers the first job; count() returns the total number of records
    val initialCount = this.count()
    if (initialCount == 0) {
      new Array[T](0)
    } else {
      val rand = new Random(seed)
      // NOTE: if withReplacement is false and num >= the total count, collect everything
      // and let randomizeInPlace shuffle the order before returning
      if (!withReplacement && num >= initialCount) {
        Utils.randomizeInPlace(this.collect(), rand)
      } else {
        // NOTE: computes the sampling probability (fraction) with a different formula
        // depending on whether withReplacement is true or false
        val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
          withReplacement)
        // NOTE: this.sample returns a PartitionwiseSampledRDD; collect() triggers another job
        var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

        // If the first sample didn't turn out large enough, keep trying to take samples;
        // this shouldn't happen often because we use a big multiplier for the initial size
        var numIters = 0
        while (samples.length < num) {
          logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
          // NOTE: the sample was too small, so run the sampling job again
          samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          numIters += 1
        }
        Utils.randomizeInPlace(samples, rand).take(num)
      }
    }
  }
}

/**
 * Shuffle the elements of an array into a random order, modifying the
 * original array. Returns the original array.
 */
def randomizeInPlace[T](arr: Array[T], rand: Random = new Random): Array[T] = {
  for (i <- (arr.length - 1) to 1 by -1) {
    val j = rand.nextInt(i + 1)
    val tmp = arr(j)
    arr(j) = arr(i)
    arr(i) = tmp
  }
  arr
}

SamplingUtils class:
def computeFractionForSampleSize(sampleSizeLowerBound: Int, total: Long,
    withReplacement: Boolean): Double = {
  // NOTE: withReplacement is true
  if (withReplacement) {
    PoissonBounds.getUpperBound(sampleSizeLowerBound) / total
  } else {
    // NOTE: withReplacement is false
    val fraction = sampleSizeLowerBound.toDouble / total
    BinomialBounds.getUpperBound(1e-4, total, fraction)
  }
}
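
To get a feel for how much computeFractionForSampleSize oversamples, the bounds can be re-implemented as a standalone function. This is a sketch under assumptions: FractionSketch and its methods are hypothetical names, and the formulas (a Poisson upper bound of s + k * sqrt(s) with k = 12, 9 or 6 depending on s, and a binomial upper bound built from gamma = -ln(delta) / total) are an assumption about what PoissonBounds and BinomialBounds compute. Only delta = 1e-4 appears in the code quoted above, so the rest should be checked against the Spark version in use.

```scala
// Standalone approximation of the fraction computation; hypothetical names, not Spark's classes.
object FractionSketch {
  // Assumed Poisson upper bound: s plus a number of standard deviations that shrinks as s grows.
  private def poissonUpperBound(s: Double): Double = {
    val numStd = if (s < 6.0) 12.0 else if (s < 16.0) 9.0 else 6.0
    math.max(s + numStd * math.sqrt(s), 1e-10)
  }

  // Assumed binomial upper bound: the naive fraction is raised by a margin that
  // depends on delta and n, then clamped to the range [1e-10, 1.0].
  private def binomialUpperBound(delta: Double, n: Long, fraction: Double): Double = {
    val gamma = -math.log(delta) / n
    val raised = fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)
    math.min(1.0, math.max(1e-10, raised))
  }

  // Mirrors the two branches of computeFractionForSampleSize.
  def fractionFor(num: Int, total: Long, withReplacement: Boolean): Double =
    if (withReplacement) poissonUpperBound(num.toDouble) / total
    else binomialUpperBound(1e-4, total, num.toDouble / total)
}
```

For num = 10 and total = 1000 the naive fraction would be 0.01. Both branches of this sketch return a noticeably larger value, which is why the first sample() call usually yields at least num elements and the while loop rarely runs.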

RDD class:
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")

  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      // NOTE: the new RDD samples directly from its first parent RDD, which is this RDD
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
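
The two branches differ in what happens to each element. A Bernoulli sampler keeps an element at most once, with probability fraction. A Poisson sampler emits each element k times, where k is drawn from a Poisson distribution with mean fraction, so duplicates are possible. The sketch below illustrates that difference with plain Scala iterators. SamplerSketch and its methods are hypothetical names, and the Poisson draw uses Knuth's multiplication method, which is an illustrative choice and not a claim about how Spark's PoissonSampler or BernoulliSampler work internally.

```scala
import scala.util.Random

// Illustrative samplers over plain iterators; hypothetical names, not Spark's classes.
object SamplerSketch {
  // Draw from Poisson(lambda) with Knuth's method; adequate for small lambda.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Without replacement: each element is kept at most once, with probability `fraction`.
  def bernoulli[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // With replacement: each element is emitted k ~ Poisson(fraction) times, possibly more than once.
  def poissonSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
    items.flatMap(x => Iterator.fill(poisson(fraction, rng))(x))
}
```

With distinct input, bernoulli never produces duplicates, while poissonSample can return the same element several times. That is the practical meaning of withReplacement.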


PartitionwiseSampledRDD class:
override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
  val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
  val thisSampler = sampler.clone
  thisSampler.setSeed(split.seed)
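  // NOTE: reads this partition's records from the first parent RDD and passes them through
  // the sampler chosen in sample(). For reference, firstParent is defined in RDD as:
  //   /** Returns the first parent RDD */
  //   protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
  //     dependencies.head.rdd.asInstanceOf[RDD[U]]
  //   }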
  thisSampler.sample(firstParent[T].iterator(split.prev, context))
}
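
compute() runs once per partition, each time with a cloned sampler seeded from split.seed. A rough local model of that arrangement is sketched below. It assumes that the per-partition seeds are drawn from a generator seeded with the RDD-level seed; the code quoted above only shows that each partition carries its own seed, not how that seed is derived. PartitionwiseSketch is a hypothetical name, partitions are plain Seq[Int] values, and the sampler is a simple Bernoulli filter.

```scala
import scala.util.Random

// Local model of per-partition sampling; hypothetical names, not Spark code.
object PartitionwiseSketch {
  def samplePartitions(partitions: Seq[Seq[Int]], fraction: Double, seed: Long): Seq[Seq[Int]] = {
    // Assumption: one seed per partition, derived from the single top-level seed.
    val seedGen = new Random(seed)
    val partitionSeeds = partitions.map(_ => seedGen.nextLong())

    partitions.zip(partitionSeeds).map { case (part, partSeed) =>
      // Each partition gets its own generator, like the cloned sampler in compute().
      val rng = new Random(partSeed)
      part.filter(_ => rng.nextDouble() < fraction)
    }
  }
}
```

Because every partition seed is derived deterministically from the one seed argument, repeating the call with the same seed and the same partitioning reproduces the same sample, which matches the behaviour described for the seed parameter of takeSample.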
