Spark Operator Execution Flow Explained, Part 1

亮亮-AC米兰


Article link: https://blog.csdn.net/wl044090432/article/details/59481423

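This article walks through the execution flow of several Spark operators, focusing on `take`, `first`, and `sortByKey`. `take` avoids a full computation by evaluating only some of the partitions, though it may trigger an action several times; `first` is implemented on top of `take` and returns the first element of the RDD; `sortByKey` creates a ShuffledRDD and uses a RangePartitioner for range partitioning so that the data is distributed evenly. The performance of sortByKey hinges on the RangePartitioner's partitioning strategy, in particular its use of the Reservoir Sampling algorithm to split the data efficiently.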

1.take

Returns the first num records of the RDD.

def take(num: Int): Array[T] = withScope {
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      // The number of partitions to try in this iteration. It is ok for this number to be
      // greater than totalParts because we actually cap it at totalParts in runJob.
      var numPartsToTry = 1
      if (partsScanned > 0) {
        // If we didn't find any rows after the previous iteration, quadruple and retry.
        // Otherwise, interpolate the number of partitions we need to try, but overestimate
        // it by 50%. We also cap the estimation in the end.
        if (buf.size == 0) {
          // Nothing has been collected so far: try 4x the number of partitions already scanned.
          numPartsToTry = partsScanned * 4
        } else {
          // Some rows are still missing: try
          // Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          // more partitions, capped at 4x the number of partitions already scanned.
          // the left side of max is >=1 whenever partsScanned >= 2
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }

      val left = num - buf.size
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)

      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += numPartsToTry
    }

    buf.toArray
  }
}

First, look at the parameters passed to sc.runJob:

/**
 * Run a job on a given set of partitions of an RDD, but take a function of type
 * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: Iterator[T] => U,
    partitions: Seq[Int],
    allowLocal: Boolean
    ): Array[U] = {
  val cleanedFunc = clean(func)
  runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions, allowLocal)
}

Here partitions: Seq[Int] is the set of partitions to compute. It may contain a single partition index or several.
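To make the role of the partitions argument concrete, here is a minimal sketch that runs without Spark. It assumes an RDD can be modelled as a sequence of in-memory partitions; `FakeRdd` and `runJobLocal` are hypothetical names for illustration, not part of the Spark API.

```scala
object RunJobSketch {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  type FakeRdd[T] = Seq[Seq[T]]

  // Applies func to the iterator of each requested partition only and
  // returns one result per requested partition, in request order.
  def runJobLocal[T, U](rdd: FakeRdd[T], func: Iterator[T] => U, partitions: Seq[Int]): Seq[U] =
    partitions.map(i => func(rdd(i).iterator))

  def main(args: Array[String]): Unit = {
    val rdd: FakeRdd[Int] = Seq(Seq(1, 2), Seq(3, 4, 5), Seq(6))
    // Only partitions 0 and 2 are computed; partition 1 is never touched.
    // The per-partition sums are 3 and 6.
    println(runJobLocal(rdd, (it: Iterator[Int]) => it.sum, Seq(0, 2)))
  }
}
```

Because the function is applied per partition, the caller gets back one result per requested partition and decides how to combine them, which is what take does with its buffer.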

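Next, consider the first loop iteration: partsScanned is 0 and numPartsToTry is 1, so only the first partition is computed. If that already yields num elements, the loop ends. Otherwise the next iteration widens the range of partitions to compute and may well scan several partitions at once.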
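The growth rule can be isolated as a pure function to see how many partitions each iteration would try. This is a sketch that mirrors the branches of the take loop quoted above; the object name `TakeGrowth` and the parameter name `collected` (standing in for buf.size) are chosen here for illustration.

```scala
object TakeGrowth {
  // Number of partitions to try in the next iteration, given how many
  // partitions were already scanned and how many rows were collected.
  def numPartsToTry(num: Int, partsScanned: Int, collected: Int): Int =
    if (partsScanned == 0) 1                      // first iteration: one partition
    else if (collected == 0) partsScanned * 4     // nothing found yet: quadruple
    else {
      // Interpolate from the hit rate so far, overestimating by 50%,
      // then cap at 4x the partitions already scanned.
      val estimate = Math.max((1.5 * num * partsScanned / collected).toInt - partsScanned, 1)
      Math.min(estimate, partsScanned * 4)
    }

  def main(args: Array[String]): Unit = {
    println(numPartsToTry(10, 0, 0))   // 1: first iteration
    println(numPartsToTry(10, 1, 0))   // 4: one partition scanned, nothing found
    println(numPartsToTry(10, 1, 5))   // 2: (1.5 * 10 * 1 / 5).toInt - 1
    println(numPartsToTry(100, 1, 1))  // 4: estimate 149 is capped at 4 * 1
  }
}
```

The cap keeps a single unlucky sample, such as one nearly empty partition, from causing the next iteration to scan almost the whole RDD.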

The execution flow is shown in the following figure:

[Figure: execution flow of take]

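take avoids computing the full RDD, so it usually finishes quickly. However, it may trigger the action several times, once per loop iteration.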
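The trade-off can be observed with a small simulation that runs without Spark. It assumes partitions are plain in-memory sequences and counts each loop iteration as one job; `simulateTake` is a hypothetical helper, and unlike the original loop it reports the number of partitions actually scanned rather than partsScanned + numPartsToTry.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeSimulation {
  // Returns (elements taken, jobs launched, partitions actually scanned).
  def simulateTake[T](parts: Seq[Seq[T]], num: Int): (Seq[T], Int, Int) = {
    val buf = new ArrayBuffer[T]
    val totalParts = parts.length
    var partsScanned = 0
    var jobs = 0
    while (buf.size < num && partsScanned < totalParts) {
      var numPartsToTry = 1
      if (partsScanned > 0) {
        if (buf.isEmpty) numPartsToTry = partsScanned * 4
        else {
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }
      val left = num - buf.size
      val range = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      // One simulated job: each selected partition returns at most `left` elements.
      val res = range.map(i => parts(i).take(left))
      jobs += 1
      res.foreach(r => buf ++= r.take(num - buf.size))
      partsScanned = range.end
    }
    (buf.toList, jobs, partsScanned)
  }

  def main(args: Array[String]): Unit = {
    // 10 partitions with 2 elements each: take(5) needs 2 jobs and scans 3 partitions.
    val dense = Seq.tabulate(10)(i => Seq(2 * i, 2 * i + 1))
    println(simulateTake(dense, 5))
    // 5 empty partitions followed by one non-empty: take(2) needs 3 jobs and scans all 6.
    val sparse = Seq.fill(5)(Seq.empty[Int]) :+ Seq(7, 8, 9)
    println(simulateTake(sparse, 2))
  }
}
```

In the dense case only 3 of 10 partitions are computed, while in the sparse case every partition is scanned across three separate jobs, which is the cost hinted at above.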

2.first

Returns the first element of the RDD.

/**
 * Return the first element in this RDD.
 */
def first(): T = withScope {
  take(1) match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }
}

first is implemented simply by calling take; for the details of that flow, see the take section above.
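The pattern match in first can be tried in isolation. In this sketch the array argument plays the role of the result of take(1); `firstOf` is a hypothetical name used only for illustration.

```scala
object FirstSketch {
  // `taken` stands in for the array returned by take(1):
  // exactly one element on success, no elements for an empty RDD.
  def firstOf[T](taken: Array[T]): T = taken match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }

  def main(args: Array[String]): Unit = {
    println(firstOf(Array(42)))
    // An empty array falls through to the second case and throws.
    println(scala.util.Try(firstOf(Array.empty[Int])).isFailure)
  }
}
```

Since take(1) returns at most one element, the only array that fails the first case is the empty one, so the exception message "empty collection" is accurate.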

3.sortByKey

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else
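
The visible part of sortByKey builds a RangePartitioner and a ShuffledRDD with a key ordering. The idea behind range partitioning can be sketched without Spark: if every key in partition i is no greater than every key in partition i + 1, sorting each partition independently yields a globally sorted result. The sketch below assumes Int keys, ascending order, and bounds that are given up front; `partitionFor` and `sortByKeyLocal` are hypothetical helpers, not Spark's implementation.

```scala
object RangePartitionSketch {
  // Index of the first bound that is >= key; keys above every bound
  // go to the last partition. n bounds define n + 1 partitions.
  def partitionFor(key: Int, bounds: Seq[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx < 0) bounds.length else idx
  }

  // Range-partition the pairs, sort each partition by key, and concatenate
  // the partitions in index order.
  def sortByKeyLocal[V](pairs: Seq[(Int, V)], bounds: Seq[Int]): Seq[(Int, V)] = {
    val byPartition = pairs.groupBy { case (k, _) => partitionFor(k, bounds) }
    (0 to bounds.length).flatMap { p =>
      byPartition.getOrElse(p, Seq.empty[(Int, V)]).sortBy(_._1)
    }
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((5, "e"), (1, "a"), (9, "i"), (3, "c"), (7, "g"))
    // Bounds 3 and 6 give three partitions: keys <= 3, keys in (3, 6], keys > 6.
    println(sortByKeyLocal(pairs, Seq(3, 6)))
  }
}
```

How evenly the data spreads depends entirely on the choice of bounds, which is why the summary at the top of the article singles out the RangePartitioner's partitioning strategy as the key to sortByKey's performance.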