How Spark's Lazy Evaluation Works
Spark operators fall into two main categories:
- Transformation operators: these are lazily evaluated. A transformation that derives one RDD from another does not execute immediately; the actual computation is only triggered once an Action is invoked (a minimal demonstration follows this list).
- Action operators: these trigger Spark to submit a job (Job) and write out the data.
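To see the laziness directly, here is a minimal sketch that can be pasted into spark-shell (the sample data is made up): the println inside map fires only when the action runs.

val nums = sc.parallelize(Seq(1, 2, 3))
// Transformation only: nothing is printed yet, Spark just records the logic.
val doubled = nums.map { n => println(s"computing $n"); n * 2 }
// The action submits a job; only now do the printlns execute
// (on the executors, so in local mode they show up in the same console).
doubled.collect()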
Spark evaluates lazily: only an Action operator triggers the actual execution of a job. So how does Spark implement lazy evaluation? Understanding the mechanism comes down to three questions:
- How is the computation logic stored for later?
- How is that logic distributed?
- How is the computation logic restored?
1. How is the computation logic stored for later?
Take the WordCount code as an example:
val file = sc.textFile("...")
val wordCounts = file
.flatMap(line => line.split(","))
.map(word => (word, 1))
.reduceByKey(_ + _)
wordCounts.saveAsTextFile("...")
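At this point nothing has been read or computed; wordCounts only holds the recorded lineage. A convenient way to see that chain is the standard RDD.toDebugString method, which for the WordCount above prints (abbreviated) exactly the RDD subclasses examined below: a ShuffledRDD on top of the MapPartitionsRDDs produced by map, flatMap, and textFile's internal map, with a HadoopRDD at the root.

// Prints the lineage tree, newest RDD first; no job is triggered.
println(wordCounts.toDebugString)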
sc.textFile() does not read the file immediately; it merely returns a subclass of RDD, a HadoopRDD. Note that when HadoopRDD extends the RDD class, it passes Nil as its parent dependencies, which is natural: this is the input-reading step, so there is no parent to depend on.

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  ...
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}

class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging
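A quick way to convince yourself of this: because textFile only constructs the HadoopRDD, even a nonexistent path (the one below is deliberately bogus) does not fail at this point; the error typically surfaces as an InvalidInputException only when an action forces the input splits to be resolved.

// Returns immediately; no file I/O happens here.
val ghost = sc.textFile("hdfs://nowhere/does-not-exist")
// Only the action triggers split resolution, which now fails for the missing path.
ghost.count()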
flatMap likewise returns an RDD subclass, MapPartitionsRDD, passing the current RDD's reference this into the constructor's prev parameter when the MapPartitionsRDD is created. Again, no computation is performed here; what is saved is the RDD's lineage, the chain of dependencies.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev)
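That stored prev reference is visible through the public dependencies API: the new MapPartitionsRDD records a narrow OneToOneDependency pointing back at its parent. A small check (the input path is a placeholder):

val lines = sc.textFile("input.txt")
val words = lines.flatMap(_.split(","))
// Still no computation; the child merely records a one-to-one
// dependency whose rdd field is the parent RDD itself.
println(words.dependencies)                   // List(org.apache.spark.OneToOneDependency@...)
println(words.dependencies.head.rdd eq lines) // true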
map follows the same logic as flatMap and likewise returns a MapPartitionsRDD.

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
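The pattern all of these operators share — return a new object that stores a reference to its parent plus a function, and do no work — is easy to reproduce outside Spark. Below is a deliberately simplified, hypothetical sketch (MiniRDD is not a Spark class) of how computation logic can be stored and replayed later; the real MapPartitionsRDD does the same thing per partition with a TaskContext.

// A toy lineage node: it holds only its parent and the deferred computation.
class MiniRDD[T](val prev: Option[MiniRDD[_]], compute: () => Iterator[T]) {
  // Transformations wrap `this` in a new node; nothing runs here.
  def map[U](f: T => U): MiniRDD[U] =
    new MiniRDD[U](Some(this), () => compute().map(f))
  def flatMap[U](f: T => TraversableOnce[U]): MiniRDD[U] =
    new MiniRDD[U](Some(this), () => compute().flatMap(f))
  // The "action": only here is the stored chain actually evaluated.
  def collect(): List[T] = compute().toList
}

object MiniRDD {
  def fromSeq[T](data: Seq[T]): MiniRDD[T] =
    new MiniRDD[T](None, () => data.iterator)
}

// Building the chain allocates three MiniRDD objects but computes nothing;
// collect() finally runs the composed closures end to end.
val result = MiniRDD.fromSeq(Seq("a,b", "b,c"))
  .flatMap(_.split(","))
  .map(word => (word, 1))
  .collect()
println(result) // List((a,1), (b,1), (b,1), (c,1))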
reduceByKey also merely returns an RDD subclass: a MapPartitionsRDD when the input is already partitioned the right way, otherwise a ShuffledRDD. Note as well that ShuffledRDD, like HadoopRDD, passes Nil for its parent dependencies (its declaration, quoted after the code below, shows why).

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  ...
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
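For completeness, the reason ShuffledRDD can pass Nil upward is that it overrides getDependencies to produce a ShuffleDependency instead. Abridged from the Spark source (details elided), the declaration looks like this:

class ShuffledRDD[K, V, C](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {
  ...
  override def getDependencies: Seq[Dependency[_]] = {
    ...
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }
}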