Detailed Explanation of Spark Operator Execution Flow, Part 6

This installment explains in detail how Spark's coalesce operator works: how it reduces the number of partitions, whether it performs a shuffle, and how CoalescedRDD computes its partitions. It then touches on other key operators such as repartition, sample, and takeSample and their use in data processing.

26. coalesce

coalesce, as the name suggests, merges an RDD's many partitions into fewer partitions, which reduces task scheduling overhead. But keep in mind: after coalescing, there is no guarantee that each partition of the resulting RDD holds a balanced number of records, because the merge does not take the record counts of the pre-merge partitions into account. Coalescing only reduces the number of partitions, so it cannot be used to fix data skew.
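As a quick illustration of this point, here is a minimal sketch (the app name, master URL, and the exact grouping of parent partitions chosen at runtime are assumptions for illustration): coalescing a deliberately skewed RDD leaves the skew intact.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))

// Build a deliberately skewed RDD: partition 0 keeps all 250 of its records,
// while the other three partitions keep only 10 each.
val skewed = sc.parallelize(1 to 1000, 4).mapPartitionsWithIndex {
  (i, it) => if (i == 0) it else it.take(10)
}

// Default shuffle = false: parent partitions are only grouped together,
// records are never redistributed, so the imbalance survives.
val merged = skewed.coalesce(2)
println(merged.getNumPartitions)  // 2
println(merged.mapPartitions(it => Iterator(it.size)).collect().mkString(", "))
// e.g. 260, 20 -- still heavily skewed

Now to the source of coalesce itself: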

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      // Seed a random starting position with this partition's index
      var position = (new Random(index)).nextInt(numPartitions)
      // Map each record to a (key, record) pair, where the key keeps incrementing
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed

    // Hash-partition the (key, record) pairs
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values // the records are (key, value) pairs, so finally take just the values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}

Look first at the shuffle parameter. When it is true, a ShuffledRDD is generated first and the CoalescedRDD is built on top of it; when it is false, the CoalescedRDD is generated directly. So let's first look at how the ShuffledRDD is produced:


[Figure: merging 3 partitions into 2 when shuffle = true]

The figure above shows 3 partitions being merged into 2: when shuffle is true, the ShuffledRDD that becomes the CoalescedRDD's parent is generated as shown; when shuffle is false, the CoalescedRDD is built directly from the RDD itself.
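To see how those keys land, here is a standalone sketch of the distributePartition logic (no Spark required; the helper name and sample data are made up for illustration). Each input partition seeds a Random with its own index and then emits strictly increasing keys, so HashPartitioner's key mod numPartitions assigns its records round-robin across the output partitions:

import scala.util.Random

val numPartitions = 2

// Mirrors distributePartition: seed with the input partition index,
// then pair every record with an incrementing key.
def distribute(index: Int, items: Seq[String]): Seq[(Int, String)] = {
  var position = new Random(index).nextInt(numPartitions)
  items.map { t =>
    position += 1
    (position, t)
  }
}

// Three input partitions feeding two output partitions.
for (i <- 0 until 3) {
  val assigned = distribute(i, Seq(s"a$i", s"b$i", s"c$i"))
    .map { case (k, v) => s"$v -> out${k % numPartitions}" }
  println(s"input partition $i: ${assigned.mkString(", ")}")
}
// Each input partition alternates between out0 and out1, so every output
// partition receives a roughly even share of every input partition.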

Now let's look at how CoalescedRDD computes its partitions:

private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    balanceSlack: Double = 0.10)
  extends RDD[T](prev.context, Nil) { // Nil since we implement getDependencies

  override def getPartitions: Array[Partition] = {
    val pc = new PartitionCoalescer(maxPartitions, prev, balanceSlack)
    pc.run().zipWithIndex.map {
      case (pg, i) =>
        val ids = pg.arr.map(_.index).toArray
        new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }
  ……
}
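The two code paths are easy to tell apart in an RDD's lineage. A sketch (the exact toDebugString output varies across Spark versions, so the lines below are indicative only):

val rdd = sc.parallelize(1 to 100, 4)

// shuffle = false: a single CoalescedRDD sits directly on the parent.
println(rdd.coalesce(2).toDebugString)
// (2) CoalescedRDD[...]
//  |  ParallelCollectionRDD[...]

// shuffle = true: distributePartition -> ShuffledRDD -> CoalescedRDD -> .values
println(rdd.coalesce(2, shuffle = true).toDebugString)
// (2) MapPartitionsRDD[...]     <- .values
//  |  CoalescedRDD[...]
//  |  ShuffledRDD[...]
//  +-(4) MapPartitionsRDD[...]  <- mapPartitionsWithIndex(distributePartition)
//     |  ParallelCollectionRDD[...]

The partitions that CoalescedRDD groups together are described by CoalescedRDDPartition: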

/**
 * Class that captures a coalesced RDD by essentially keeping track of parent partitions
 * @param index of this coalesced partition
 * @param rdd which it belongs to
 * @param parentsIndices list of indices in the parent that have been coalesced into this
 *        partition (i.e. which partitions of the parent RDD make up the partition at this
 *        index of the CoalescedRDD)
 * @param preferredLocation the preferred location for this partition
 */
private[spark] case class CoalescedRDDPartition(
    index: Int,
    @transient rdd: RDD[_],
    parentsIndices: Array[Int],
    @transient preferredLocation: Option[String] = None) extends Partition {
  var parents: Seq[Partition] = parentsIndices.map(rdd.partitions(_))

  @throws(classOf[IOException])
  private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
    // Update the reference to parent partition at the time of task serialization
    parents = parentsIndices.map(rdd.partitions(_))
    oos.defaultWriteObject()
  }

  /**
   * Computes the fraction of the parents' partitions containing preferredLocation within
   * their getPreferredLocs.
   * @return locality of this coalesced partition between 0 and 1
   */
  def localFraction: Double = {
    val loc = parents.count { p =>
      val parentPreferredLocations = rdd.context.getPreferredLocs(rdd, p.index).map(_.host)
      preferredLocation.exists(parentPreferredLocations.contains)
    }
    if (parents.size == 0) 0.0 else (loc.toDouble / parents.size.toDouble)
  }
}

The partitioning of a CoalescedRDD is thus determined by CoalescedRDDPartition, whose parentsIndices field records exactly which partitions of the parent RDD were coalesced into each output partition.
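Since CoalescedRDDPartition is private[spark], a hypothetical way to observe parentsIndices from user code is to tag every record with its original partition index and inspect the grouping after a shuffle-free coalesce (the exact grouping shown in the comments is an assumption; it is decided by PartitionCoalescer at runtime):

// Tag each record with the index of the partition it came from.
val tagged = sc.parallelize(1 to 8, 4).mapPartitionsWithIndex {
  (i, it) => it.map(x => (i, x))
}

// After coalescing without shuffle, each output partition is simply the
// concatenation of its parent partitions' iterators (see compute above).
tagged.coalesce(2).mapPartitionsWithIndex { (i, it) =>
  Iterator(s"output partition $i <- parents ${it.map(_._1).toSet.toSeq.sorted.mkString(", ")}")
}.collect().foreach(println)
// e.g. output partition 0 <- parents 0, 1
//      output partition 1 <- parents 2, 3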
