Detailed Walkthrough of Spark Operator Execution, Part 8

This article examines the implementation and usage of the Spark RDD operators zip, zipPartitions, zipWithIndex, zipWithUniqueId, foreach and foreachPartition: how they combine elements of RDDs, where they fit in data processing, and in particular how foreach differs from foreachPartition and how each affects processing and partitioning.

36. zip

zip pairs the elements at the same positions in two RDDs into key-value tuples.
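For example, a minimal sketch (sc is an existing SparkContext; the data is illustrative):

val a = sc.parallelize(Seq("a", "b", "c"), 2)
val b = sc.parallelize(Seq(1, 2, 3), 2)  // same partition count and per-partition sizes
a.zip(b).collect()                       // Array((a,1), (b,2), (c,3))

Here is the implementation of zip: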

/**
 * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
 * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
 * partitions* and the *same number of elements in each partition* (e.g. one was made through
 * a map on the other).
 */
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) {
    // Emit a pair only while both iterators still have elements; if exactly one side
    // is exhausted, the partitions have different lengths and an exception is thrown,
    // so the two iterators must be of equal length.
    (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
  }
}
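Note that the element-count check happens lazily, inside hasNext, only when a task actually consumes the two iterators. A sketch of the failure mode (illustrative data, not from the Spark source):

val c = sc.parallelize(1 to 4, 2)
val d = c.filter(_ % 2 == 0)  // partition count unchanged, but fewer elements per partition
c.zip(d).count()              // throws SparkException: "Can only zip RDDs with same
                              //  number of elements in each partition"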

Next, look at the concrete implementation of zipPartitions:

def zipPartitions[B: ClassTag, V: ClassTag]
    (rdd2: RDD[B], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
}

private[spark] class ZippedPartitionsRDD2[A: ClassTag, B: ClassTag, V: ClassTag](
    sc: SparkContext,
    var f: (Iterator[A], Iterator[B]) => Iterator[V],
    var rdd1: RDD[A],
    var rdd2: RDD[B],
    preservesPartitioning: Boolean = false)
  extends ZippedPartitionsBaseRDD[V](sc, List(rdd1, rdd2), preservesPartitioning) {

  // compute simply applies f to the iterators of the two parent partitions; for zip,
  // f builds the iterator shown above, whose overridden hasNext and next return the data.
  override def compute(s: Partition, context: TaskContext): Iterator[V] = {
    val partitions = s.asInstanceOf[ZippedPartitionsPartition].partitions
    f(rdd1.iterator(partitions(0), context), rdd2.iterator(partitions(1), context))
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdd1 = null
    rdd2 = null
    f = null
  }
}
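Because zipPartitions hands the raw per-partition iterators straight to f, it can combine two RDDs at partition granularity without materializing a tuple per element. A hedged sketch (this dot-product example is our own, assuming x and y are partitioned identically):

val x = sc.parallelize(1 to 6, 3)
val y = sc.parallelize(4 to 9, 3)
// Emit one partial sum per partition, then add the partials up on the driver.
val dot = x.zipPartitions(y) { (xs, ys) =>
  Iterator(xs.zip(ys).map { case (a, b) => a * b }.sum)
}.sum()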

So how are the zipped RDD's partitioning and data locality computed? For that we need to look at its base class, ZippedPartitionsBaseRDD:

private[spark] abstract class ZippedPartitionsBaseRDD[V: ClassTag](
    sc: SparkContext,
    var rdds: Seq[RDD[_]],
    preservesPartitioning: Boolean = false)
  extends RDD[V](sc, rdds.map(x => new OneToOneDependency(x))) {

  // Since zip passes preservesPartitioning = false, the zipped RDD's partitioner is None.
  override val partitioner =
    if (preservesPartitioning) firstParent[Any].partitioner else None

  override def getPartitions: Array[Partition] = {
    val numParts = rdds.head.partitions.length
    // The RDDs being zipped must all have the same number of partitions.
    if (!rdds.forall(rdd => rdd.partitions.length == numParts)) {
      throw new IllegalArgumentException("Can't zip RDDs with unequal numbers of partitions")
    }
    Array.tabulate[Partition](numParts) { i =>
      // Collect the preferred locations of partition i from every parent RDD.
      val prefs = rdds.map(rdd => rdd.preferredLocations(rdd.partitions(i)))
      // Check whether there are any hosts that match all RDDs; otherwise return the union:
      // prefer hosts that hold partition i of every parent (the intersection), and fall
      // back to the union of all parents' preferred locations when there are none.
      val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))
      val locs = if (!exactMatchLocations.isEmpty) exactMatchLocations else prefs.flatten.distinct
      new ZippedPartitionsPartition(i, rdds, locs)
    }
  }
}
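A practical consequence of preservesPartitioning = false is that zip discards the parent's partitioner, so a later key-based operation on the zipped RDD will reshuffle. A minimal sketch (illustrative names):

import org.apache.spark.HashPartitioner

val kv = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(new HashPartitioner(2))
kv.partitioner          // Some(HashPartitioner)
kv.zip(kv).partitioner  // None: zip calls zipPartitions with preservesPartitioning = false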