Building the RDD DAG in Spark
RDD is the core abstraction of Spark computation: an immutable, partitioned collection of distributed data elements that can be operated on in parallel. The base RDD class provides the common operations, such as map, filter, and reduce; when special behavior is needed, you can extend the RDD base class with your own subclass.

An RDD has five main properties:
- a list of partitions;
- a compute function that runs against each split;
- a list of the RDDs it depends on;
- an optional partitioner (hash partitioning by default);
- an optional list of preferred locations for computing each split (for example, when computing over an HDFS block, placing the computation on the machine that holds the block reduces network transfer).
To define a custom RDD, you typically implement the following methods:
```scala
// What computation to run on a single partition
def compute(split: Partition, context: TaskContext): Iterator[T]

// Return all partitions of this RDD
protected def getPartitions: Array[Partition]

// Return the dependencies of this RDD
protected def getDependencies: Seq[Dependency[_]] = deps

// Return the preferred locations for computing a partition
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

// The partitioner to use, if any
@transient val partitioner: Option[Partitioner] = None
```
Before a job is handed to the DAGScheduler, the RDD DAG must first be built. Building it means establishing the dependency relationships between RDDs, which in Spark are classified as either narrow or wide (shuffle) dependencies. Let us trace this process through a concrete computation chain:
```scala
sc.parallelize(1 to 5, 2).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
```
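To make the expected result concrete, the same pipeline can be reproduced with plain Scala collections (no Spark required); `groupBy` plus a per-key sum plays the role of `reduceByKey`'s shuffle-and-reduce:

```scala
// Spark-free equivalent of: parallelize(1 to 5).map((_, 1)).reduceByKey(_ + _)
val result: Map[Int, Int] = (1 to 5)
  .map((_, 1))                                     // pair each element with a count of 1
  .groupBy(_._1)                                   // group by key, like reduceByKey's shuffle
  .map { case (k, vs) => (k, vs.map(_._2).sum) }   // sum the counts within each key

// Every key is unique here, so each count is 1: Map(1 -> 1, ..., 5 -> 1)
```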
The chain executes from left to right. First, parallelize creates the initial RDD:
```scala
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
```

map then wraps the resulting RDD in a MapPartitionsRDD, recording the parent RDD as a (narrow) dependency via the `RDD[U](prev)` constructor:

```scala
def map[U: ClassTag](f: T => U): RDD[U] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext) =
    f(context, split.index, firstParent[T].iterator(split, context))
}
```
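Note how compute does no work itself: it merely wraps the parent's iterator, so each stage of the chain is pulled lazily when the final iterator is consumed. A Spark-free sketch of this iterator chaining (the names here are illustrative):

```scala
// Track which elements have actually been evaluated by the "parent" stage.
var evaluated = List.empty[Int]
val parent: Iterator[Int] = Iterator(1, 2, 3).map { x => evaluated ::= x; x }

// Analogous to f(context, split.index, firstParent.iterator(split, context)):
// the child stage just wraps the parent iterator with another transformation.
val mapped: Iterator[(Int, Int)] = parent.map(x => (x, 1))

val nothingRanYet = evaluated.isEmpty   // true: building the chain evaluates nothing
val out = mapped.toList                 // consuming the iterator drives the whole pipeline
```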
collect is an action. The map and reduceByKey calls above are transformations: they only record the transformation in the RDD lineage and perform no real computation. An action, by contrast, causes a job to be submitted to the cluster, where the scheduler splits the DAG into stages and schedules them:
```scala
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
```

sc.runJob forwards to the DAGScheduler, whose submitJob validates the requested partitions and posts a JobSubmitted event to the scheduler's event loop:

```scala
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties))
  waiter
}
```
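The partition-bounds check at the top of submitJob can be exercised on its own. The sketch below extracts just that validation logic into a standalone function (`validatePartitions` is a hypothetical helper name, not a Spark API):

```scala
// Standalone sketch of submitJob's partition validation.
def validatePartitions(partitions: Seq[Int], maxPartitions: Int): Unit = {
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      s"Attempting to access a non-existent partition: $p. " +
      s"Total number of partitions: $maxPartitions")
  }
}
```

For the example job, `rdd.partitions.length` is 2, so requesting partitions 0 and 1 passes, while requesting partition 2 (or any negative index) throws before any JobSubmitted event is posted.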