Detailed Explanation of Spark Operator Execution Flow, Part 6

This installment explains in detail how Spark's coalesce operator works: how it reduces the number of partitions, whether it performs a shuffle, and how CoalescedRDD computes its partitions. It then touches on other key operators such as repartition, sample, and takeSample and their use in data processing.

26. coalesce

coalesce, as the name suggests, merges an RDD's many partitions into fewer partitions, which reduces task scheduling overhead. But keep in mind: after coalescing, there is no guarantee that each partition of the resulting RDD holds a balanced number of records, because the merge does not take the record counts of the pre-merge partitions into account. Coalescing only reduces the number of partitions, so it cannot be used to fix data skew.
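As a quick illustration of this point, here is a minimal sketch (the app name, master URL, and the exact grouping of parent partitions chosen at runtime are assumptions for illustration): coalescing a deliberately skewed RDD leaves the skew intact.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))

// Build a deliberately skewed RDD: partition 0 keeps all 250 of its records,
// while the other three partitions keep only 10 each.
val skewed = sc.parallelize(1 to 1000, 4).mapPartitionsWithIndex {
  (i, it) => if (i == 0) it else it.take(10)
}

// Default shuffle = false: parent partitions are only grouped together,
// records are never redistributed, so the imbalance survives.
val merged = skewed.coalesce(2)
println(merged.getNumPartitions)  // 2
println(merged.mapPartitions(it => Iterator(it.size)).collect().mkString(", "))
// e.g. 260, 20 -- still heavily skewed

Now to the source of coalesce itself: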

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      // Seed a random starting position with this partition's index
      var position = (new Random(index)).nextInt(numPartitions)
      // Map each record to a (key, record) pair, where the key keeps incrementing
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed

    // Hash-partition the (key, record) pairs
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values // the records are (key, value) pairs, so finally take just the values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}

Look first at the shuffle parameter. When it is true, a ShuffledRDD is generated first and the CoalescedRDD is built on top of it; when it is false, the CoalescedRDD is generated directly. So let's first look at how the ShuffledRDD is produced:


[Figure: merging 3 partitions into 2 when shuffle = true]

The figure above shows 3 partitions being merged into 2: when shuffle is true, the ShuffledRDD that becomes the CoalescedRDD's parent is generated as shown; when shuffle is false, the CoalescedRDD is built directly from the RDD itself.
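To see how those keys land, here is a standalone sketch of the distributePartition logic (no Spark required; the helper name and sample data are made up for illustration). Each input partition seeds a Random with its own index and then emits strictly increasing keys, so HashPartitioner's key mod numPartitions assigns its records round-robin across the output partitions:

import scala.util.Random

val numPartitions = 2

// Mirrors distributePartition: seed with the input partition index,
// then pair every record with an incrementing key.
def distribute(index: Int, items: Seq[String]): Seq[(Int, String)] = {
  var position = new Random(index).nextInt(numPartitions)
  items.map { t =>
    position += 1
    (position, t)
  }
}

// Three input partitions feeding two output partitions.
for (i <- 0 until 3) {
  val assigned = distribute(i, Seq(s"a$i", s"b$i", s"c$i"))
    .map { case (k, v) => s"$v -> out${k % numPartitions}" }
  println(s"input partition $i: ${assigned.mkString(", ")}")
}
// Each input partition alternates between out0 and out1, so every output
// partition receives a roughly even share of every input partition.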

Now let's look at how CoalescedRDD computes its partitions:

private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    balanceSlack: Double = 0.10)
  extends RDD[T](prev.context, Nil) { // Nil since we implement getDependencies

  override def getPartitions: Array[Partition] = {
    val pc = new PartitionCoalescer(maxPartitions, prev, balanceSlack)
    pc.run().zipWithIndex.map {
      case (pg, i) =>
        val ids = pg.arr.map(_.index).toArray
        new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }
  ……
}
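The two code paths are easy to tell apart in an RDD's lineage. A sketch (the exact toDebugString output varies across Spark versions, so the lines below are indicative only):

val rdd = sc.parallelize(1 to 100, 4)

// shuffle = false: a single CoalescedRDD sits directly on the parent.
println(rdd.coalesce(2).toDebugString)
// (2) CoalescedRDD[...]
//  |  ParallelCollectionRDD[...]

// shuffle = true: distributePartition -> ShuffledRDD -> CoalescedRDD -> .values
println(rdd.coalesce(2, shuffle = true).toDebugString)
// (2) MapPartitionsRDD[...]     <- .values
//  |  CoalescedRDD[...]
//  |  ShuffledRDD[...]
//  +-(4) MapPartitionsRDD[...]  <- mapPartitionsWithIndex(distributePartition)
//     |  ParallelCollectionRDD[...]

The partitions that CoalescedRDD groups together are described by CoalescedRDDPartition: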

/**
 * Class that captures a coalesced RDD by essentially keeping track of parent partitions
 * @param index of this coalesced partition
 * @param rdd which it belongs to
 * @param parentsIndices list of indices in the parent that have been coalesced into this
 *        partition (i.e. which partitions of the parent RDD make up the partition at this
 *        index of the CoalescedRDD)
 * @param preferredLocation the preferred location for this partition
 */
private[spark] case class CoalescedRDDPartition(
    index: Int,
    @transient rdd: RDD[_],
    parentsIndices: Array[Int],
    @transient preferredLocation: Option[String] = None) extends Partition {
  var parents: Seq[Partition] = parentsIndices.map(rdd.partitions(_))

  @throws(classOf[IOException])
  private def writeObject(oos: ObjectOutputStream): Unit = Utils.tryOrIOException {
    // Update the reference to parent partition at the time of task serialization
    parents = parentsIndices.map(rdd.partitions(_))
    oos.defaultWriteObject()
  }

  /**
   * Computes the fraction of the parents' partitions containing preferredLocation within
   * their getPreferredLocs.
   * @return locality of this coalesced partition between 0 and 1
   */
  def localFraction: Double = {
    val loc = parents.count { p =>
      val parentPreferredLocations = rdd.context.getPreferredLocs(rdd, p.index).map(_.host)
      preferredLocation.exists(parentPreferredLocations.contains)
    }
    if (parents.size == 0) 0.0 else (loc.toDouble / parents.size.toDouble)
  }
}

The partitioning of a CoalescedRDD is thus determined by CoalescedRDDPartition, whose parentsIndices field records exactly which partitions of the parent RDD were coalesced into each output partition.
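Since CoalescedRDDPartition is private[spark], a hypothetical way to observe parentsIndices from user code is to tag every record with its original partition index and inspect the grouping after a shuffle-free coalesce (the exact grouping shown in the comments is an assumption; it is decided by PartitionCoalescer at runtime):

// Tag each record with the index of the partition it came from.
val tagged = sc.parallelize(1 to 8, 4).mapPartitionsWithIndex {
  (i, it) => it.map(x => (i, x))
}

// After coalescing without shuffle, each output partition is simply the
// concatenation of its parent partitions' iterators (see compute above).
tagged.coalesce(2).mapPartitionsWithIndex { (i, it) =>
  Iterator(s"output partition $i <- parents ${it.map(_._1).toSet.toSeq.sorted.mkString(", ")}")
}.collect().foreach(println)
// e.g. output partition 0 <- parents 0, 1
//      output partition 1 <- parents 2, 3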
