How Spark's Lazy Evaluation Works
Spark operators fall into two main categories:
- Transformation operators: these are lazily evaluated. A transformation that derives one RDD from another does not execute immediately; the actual computation is only triggered once an Action is invoked (a minimal demonstration follows this list).
- Action operators: these trigger Spark to submit a job (Job) and write out the data.
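To see the laziness directly, here is a minimal sketch that can be pasted into spark-shell (the sample data is made up): the println inside map fires only when the action runs.

val nums = sc.parallelize(Seq(1, 2, 3))
// Transformation only: nothing is printed yet, Spark just records the logic.
val doubled = nums.map { n => println(s"computing $n"); n * 2 }
// The action submits a job; only now do the printlns execute
// (on the executors, so in local mode they show up in the same console).
doubled.collect()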
Spark evaluates lazily: only an Action operator triggers the actual execution of a job. So how does Spark implement lazy evaluation? Understanding the mechanism comes down to three questions:
- How is the computation logic stored for later?
- How is that logic distributed?
- How is the computation logic restored?
1. How is the computation logic stored for later?
Take the WordCount code as an example:
val file = sc.textFile("...")
val wordCounts = file
.flatMap(line => line.split(","))
.map(word => (word, 1))
.reduceByKey(_ + _)
wordCounts.saveAsTextFile("...")
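At this point nothing has been read or computed; wordCounts only holds the recorded lineage. A convenient way to see that chain is the standard RDD.toDebugString method, which for the WordCount above prints (abbreviated) exactly the RDD subclasses examined below: a ShuffledRDD on top of the MapPartitionsRDDs produced by map, flatMap, and textFile's internal map, with a HadoopRDD at the root.

// Prints the lineage tree, newest RDD first; no job is triggered.
println(wordCounts.toDebugString)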
sc.textFile() does not read the file immediately; it merely returns a subclass of RDD, a HadoopRDD. Note that when HadoopRDD extends the RDD class, it passes Nil as its parent dependencies, which is natural: this is the input-reading step, so there is no parent to depend on.

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  ...
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}

class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging
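A quick way to convince yourself of this: because textFile only constructs the HadoopRDD, even a nonexistent path (the one below is deliberately bogus) does not fail at this point; the error typically surfaces as an InvalidInputException only when an action forces the input splits to be resolved.

// Returns immediately; no file I/O happens here.
val ghost = sc.textFile("hdfs://nowhere/does-not-exist")
// Only the action triggers split resolution, which now fails for the missing path.
ghost.count()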
flatMap likewise returns an RDD subclass, MapPartitionsRDD, passing the current RDD's reference this into the constructor's prev parameter when the MapPartitionsRDD is created. Again, no computation is performed here; what is saved is the RDD's lineage, the chain of dependencies.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev)
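That stored prev reference is visible through the public dependencies API: the new MapPartitionsRDD records a narrow OneToOneDependency pointing back at its parent. A small check (the input path is a placeholder):

val lines = sc.textFile("input.txt")
val words = lines.flatMap(_.split(","))
// Still no computation; the child merely records a one-to-one
// dependency whose rdd field is the parent RDD itself.
println(words.dependencies)                   // List(org.apache.spark.OneToOneDependency@...)
println(words.dependencies.head.rdd eq lines) // true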
map follows the same logic as flatMap and likewise returns a MapPartitionsRDD.

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
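The pattern all of these operators share — return a new object that stores a reference to its parent plus a function, and do no work — is easy to reproduce outside Spark. Below is a deliberately simplified, hypothetical sketch (MiniRDD is not a Spark class) of how computation logic can be stored and replayed later; the real MapPartitionsRDD does the same thing per partition with a TaskContext.

// A toy lineage node: it holds only its parent and the deferred computation.
class MiniRDD[T](val prev: Option[MiniRDD[_]], compute: () => Iterator[T]) {
  // Transformations wrap `this` in a new node; nothing runs here.
  def map[U](f: T => U): MiniRDD[U] =
    new MiniRDD[U](Some(this), () => compute().map(f))
  def flatMap[U](f: T => TraversableOnce[U]): MiniRDD[U] =
    new MiniRDD[U](Some(this), () => compute().flatMap(f))
  // The "action": only here is the stored chain actually evaluated.
  def collect(): List[T] = compute().toList
}

object MiniRDD {
  def fromSeq[T](data: Seq[T]): MiniRDD[T] =
    new MiniRDD[T](None, () => data.iterator)
}

// Building the chain allocates three MiniRDD objects but computes nothing;
// collect() finally runs the composed closures end to end.
val result = MiniRDD.fromSeq(Seq("a,b", "b,c"))
  .flatMap(_.split(","))
  .map(word => (word, 1))
  .collect()
println(result) // List((a,1), (b,1), (b,1), (c,1))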
reduceByKey also merely returns an RDD subclass: a MapPartitionsRDD when the input is already partitioned the right way, otherwise a ShuffledRDD. Note as well that ShuffledRDD, like HadoopRDD, passes Nil for its parent dependencies (its declaration, quoted after the code below, shows why).

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  ...
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
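For completeness, the reason ShuffledRDD can pass Nil upward is that it overrides getDependencies to produce a ShuffleDependency instead. Abridged from the Spark source (details elided), the declaration looks like this:

class ShuffledRDD[K, V, C](
    @transient var prev: RDD[_ <: Product2[K, V]],
    part: Partitioner)
  extends RDD[(K, C)](prev.context, Nil) {
  ...
  override def getDependencies: Seq[Dependency[_]] = {
    ...
    List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
  }
}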