These operators fall into two categories:
Actions: reduce, aggregate, fold
Transformations: reduceByKey, aggregateByKey, foldByKey
The difference between actions and transformations:
An action triggers the execution of a Spark job, while a transformation only wraps the computation logic; nothing runs until an action is called.
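This lazy-vs-eager split can be illustrated without Spark at all. A minimal plain-Scala sketch (the `Iterator` stands in for an RDD's lazy pipeline; the object and names are hypothetical, not Spark API):

```scala
object LazyVsEager {
  // Returns (side effects seen before the "action", after it, and the result).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformation": wraps the computation; nothing executes yet
    val pipeline = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
    val before = evaluated   // still 0: only the logic has been wrapped
    // "Action": forces the whole pipeline to execute
    val total = pipeline.sum
    (before, evaluated, total)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Calling `run()` yields `(0, 3, 12)`: the `map` body never runs until `sum` (the action-like step) demands the results.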
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
The reduce action vs the reduceByKey transformation:
Apart from one being an action and the other a transformation, the computation logic is the same: a single binary function merges the values.
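The shared logic can be sketched on plain Scala collections (no Spark; `ReduceSim` and its methods are hypothetical names, not Spark API). The same binary function drives both cases; `reduceByKey` simply applies it per key:

```scala
object ReduceSim {
  val op: (Int, Int) => Int = _ + _   // one binary function for both cases

  // reduce (action-like): a single result over all values
  def reduceAll(xs: Seq[Int]): Int = xs.reduce(op)

  // reduceByKey (transformation-like): the same op, applied per key
  def reduceByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(op) }

  def main(args: Array[String]): Unit = {
    println(reduceAll(Seq(1, 2, 3, 4)))                      // 10
    println(reduceByKey(Seq("a" -> 1, "a" -> 2, "b" -> 3)))  // Map(a -> 3, b -> 3)
  }
}
```

In real Spark, `reduceByKey` additionally ships the per-key results through a shuffle, but the merging function itself is identical.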
aggregate vs aggregateByKey:
In aggregateByKey, the zeroValue only participates in the first computation for each key within each partition. In aggregate, the zeroValue participates not only in the intra-partition computation but also in the inter-partition combine.
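A plain-Scala simulation of this difference (no Spark; `AggregateSim` and the nested-`Seq` partition model are illustrative assumptions). Each inner `Seq` plays the role of one partition, and the code mirrors where the zeroValue enters each computation:

```scala
object AggregateSim {
  val seqOp: (Int, Int) => Int = math.max   // intra-partition function
  val combOp: (Int, Int) => Int = _ + _     // inter-partition function

  // aggregate: zeroValue seeds every partition AND the final combine
  def aggregate(parts: Seq[Seq[Int]], zero: Int): Int = {
    val perPartition = parts.map(_.foldLeft(zero)(seqOp))
    perPartition.foldLeft(zero)(combOp)     // zero participates again here
  }

  // aggregateByKey: zeroValue seeds each key once per partition; the
  // cross-partition merge uses combOp alone, with no zeroValue
  def aggregateByKey(parts: Seq[Seq[(String, Int)]], zero: Int): Map[String, Int] = {
    val perPartition = parts.map { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).foldLeft(zero)(seqOp) }
    }
    perPartition.reduce { (a, b) =>
      b.foldLeft(a) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(combOp(_, v)).getOrElse(v))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // zero = 10 dominates max() in both partitions, then is added once more
    // in the combine step: 10 + 10 + 10 = 30
    println(aggregate(Seq(Seq(1, 2), Seq(3, 4)), 10))
    // per key: "a" gets zero in each of its two partitions (10 + 10 = 20),
    // "b" only in one (10); zero never enters the merge itself
    println(aggregateByKey(Seq(Seq("a" -> 1, "a" -> 2, "b" -> 3), Seq("a" -> 4)), 10))
  }
}
```

Counting how often the 10 shows up in each result makes the rule concrete: in `aggregate` it appears once per partition plus once in the combine; in `aggregateByKey` it appears once per key per partition and never in the merge.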
fold vs foldByKey:
The difference is analogous to that between aggregate and aggregateByKey: in fold, the zeroValue participates in both the intra-partition and the inter-partition computation, while in foldByKey it only seeds the first computation for each key within each partition.
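Since fold is essentially aggregate with a single function used for both steps, the same simulation style applies (plain Scala, no Spark; `FoldSim` and the nested-`Seq` partition model are illustrative assumptions):

```scala
object FoldSim {
  val op: (Int, Int) => Int = _ + _

  // fold: zeroValue seeds every partition and the final combine, so with
  // N partitions it is applied N + 1 times
  def fold(parts: Seq[Seq[Int]], zero: Int): Int =
    parts.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  // foldByKey: zeroValue seeds each key once per partition only
  def foldByKey(parts: Seq[Seq[(String, Int)]], zero: Int): Map[String, Int] = {
    val perPartition = parts.map { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).foldLeft(zero)(op) }
    }
    perPartition.reduce { (a, b) =>
      b.foldLeft(a) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(op(_, v)).getOrElse(v))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // two partitions: (10+1+2) + (10+3+4) combined as 10 + 13 + 17 = 40
    println(fold(Seq(Seq(1, 2), Seq(3, 4)), 10))
    // "a" appears in both partitions: (10+1) + (10+2) = 23; zero skips the merge
    println(foldByKey(Seq(Seq("a" -> 1), Seq("a" -> 2)), 10))
  }
}
```

This also explains a common pitfall with fold: a non-neutral zeroValue is counted once per partition plus once more in the combine, so the result depends on the number of partitions.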