Detailed Walkthrough of Spark Operator Execution, Part 8

This article examines the implementation and usage of the Spark RDD operators zip, zipPartitions, zipWithIndex, zipWithUniqueId, foreach and foreachPartition: how they combine elements of RDDs, where they fit in data processing, and in particular how foreach differs from foreachPartition and how each affects processing and partitioning.

36. zip

zip pairs the elements at the same positions in two RDDs into key-value tuples.
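For example, a minimal sketch (sc is an existing SparkContext; the data is illustrative):

val a = sc.parallelize(Seq("a", "b", "c"), 2)
val b = sc.parallelize(Seq(1, 2, 3), 2)  // same partition count and per-partition sizes
a.zip(b).collect()                       // Array((a,1), (b,2), (c,3))

Here is the implementation of zip: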

/**
 * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
 * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
 * partitions* and the *same number of elements in each partition* (e.g. one was made through
 * a map on the other).
 */
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
  zipPartitions(other, preservesPartitioning = false) {
    // Emit a pair only while both iterators still have elements; if exactly one side
    // is exhausted, the partitions have different lengths and an exception is thrown,
    // so the two iterators must be of equal length.
    (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
  }
}
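Note that the element-count check happens lazily, inside hasNext, only when a task actually consumes the two iterators. A sketch of the failure mode (illustrative data, not from the Spark source):

val c = sc.parallelize(1 to 4, 2)
val d = c.filter(_ % 2 == 0)  // partition count unchanged, but fewer elements per partition
c.zip(d).count()              // throws SparkException: "Can only zip RDDs with same
                              //  number of elements in each partition"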

Next, look at the concrete implementation of zipPartitions:

def zipPartitions[B: ClassTag, V: ClassTag]
    (rdd2: RDD[B], preservesPartitioning: Boolean)
    (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
  new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
}

private[spark] class ZippedPartitionsRDD2[A: ClassTag, B: ClassTag, V: ClassTag](
    sc: SparkContext,
    var f: (Iterator[A], Iterator[B]) => Iterator[V],
    var rdd1: RDD[A],
    var rdd2: RDD[B],
    preservesPartitioning: Boolean = false)
  extends ZippedPartitionsBaseRDD[V](sc, List(rdd1, rdd2), preservesPartitioning) {

  // compute simply applies f to the iterators of the two parent partitions; for zip,
  // f builds the iterator shown above, whose overridden hasNext and next return the data.
  override def compute(s: Partition, context: TaskContext): Iterator[V] = {
    val partitions = s.asInstanceOf[ZippedPartitionsPartition].partitions
    f(rdd1.iterator(partitions(0), context), rdd2.iterator(partitions(1), context))
  }

  override def clearDependencies() {
    super.clearDependencies()
    rdd1 = null
    rdd2 = null
    f = null
  }
}
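Because zipPartitions hands the raw per-partition iterators straight to f, it can combine two RDDs at partition granularity without materializing a tuple per element. A hedged sketch (this dot-product example is our own, assuming x and y are partitioned identically):

val x = sc.parallelize(1 to 6, 3)
val y = sc.parallelize(4 to 9, 3)
// Emit one partial sum per partition, then add the partials up on the driver.
val dot = x.zipPartitions(y) { (xs, ys) =>
  Iterator(xs.zip(ys).map { case (a, b) => a * b }.sum)
}.sum()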

So how are the zipped RDD's partitioning and data locality computed? For that we need to look at its base class, ZippedPartitionsBaseRDD:

private[spark] abstract class ZippedPartitionsBaseRDD[V: ClassTag](
    sc: SparkContext,
    var rdds: Seq[RDD[_]],
    preservesPartitioning: Boolean = false)
  extends RDD[V](sc, rdds.map(x => new OneToOneDependency(x))) {

  // Since zip passes preservesPartitioning = false, the zipped RDD's partitioner is None.
  override val partitioner =
    if (preservesPartitioning) firstParent[Any].partitioner else None

  override def getPartitions: Array[Partition] = {
    val numParts = rdds.head.partitions.length
    // The RDDs being zipped must all have the same number of partitions.
    if (!rdds.forall(rdd => rdd.partitions.length == numParts)) {
      throw new IllegalArgumentException("Can't zip RDDs with unequal numbers of partitions")
    }
    Array.tabulate[Partition](numParts) { i =>
      // Collect the preferred locations of partition i from every parent RDD.
      val prefs = rdds.map(rdd => rdd.preferredLocations(rdd.partitions(i)))
      // Check whether there are any hosts that match all RDDs; otherwise return the union:
      // prefer hosts that hold partition i of every parent (the intersection), and fall
      // back to the union of all parents' preferred locations when there are none.
      val exactMatchLocations = prefs.reduce((x, y) => x.intersect(y))
      val locs = if (!exactMatchLocations.isEmpty) exactMatchLocations else prefs.flatten.distinct
      new ZippedPartitionsPartition(i, rdds, locs)
    }
  }
}
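A practical consequence of preservesPartitioning = false is that zip discards the parent's partitioner, so a later key-based operation on the zipped RDD will reshuffle. A minimal sketch (illustrative names):

import org.apache.spark.HashPartitioner

val kv = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(new HashPartitioner(2))
kv.partitioner          // Some(HashPartitioner)
kv.zip(kv).partitioner  // None: zip calls zipPartitions with preservesPartitioning = false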