These operators fall into two categories:
Actions: reduce, aggregate, fold
Transformations: reduceByKey, aggregateByKey, foldByKey
The difference between actions and transformations:
An action triggers the execution of a Spark job, while a transformation only wraps the computation logic; nothing runs until an action is called.
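This lazy-vs-eager split can be illustrated without Spark at all. A minimal plain-Scala sketch (the `Iterator` stands in for an RDD's lazy pipeline; the object and names are hypothetical, not Spark API):

```scala
object LazyVsEager {
  // Returns (side effects seen before the "action", after it, and the result).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    // "Transformation": wraps the computation; nothing executes yet
    val pipeline = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
    val before = evaluated   // still 0: only the logic has been wrapped
    // "Action": forces the whole pipeline to execute
    val total = pipeline.sum
    (before, evaluated, total)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Calling `run()` yields `(0, 3, 12)`: the `map` body never runs until `sum` (the action-like step) demands the results.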
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
The reduce action vs the reduceByKey transformation:
Apart from one being an action and the other a transformation, the computation logic is the same: a single binary function merges the values.
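The shared logic can be sketched on plain Scala collections (no Spark; `ReduceSim` and its methods are hypothetical names, not Spark API). The same binary function drives both cases; `reduceByKey` simply applies it per key:

```scala
object ReduceSim {
  val op: (Int, Int) => Int = _ + _   // one binary function for both cases

  // reduce (action-like): a single result over all values
  def reduceAll(xs: Seq[Int]): Int = xs.reduce(op)

  // reduceByKey (transformation-like): the same op, applied per key
  def reduceByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(op) }

  def main(args: Array[String]): Unit = {
    println(reduceAll(Seq(1, 2, 3, 4)))                      // 10
    println(reduceByKey(Seq("a" -> 1, "a" -> 2, "b" -> 3)))  // Map(a -> 3, b -> 3)
  }
}
```

In real Spark, `reduceByKey` additionally ships the per-key results through a shuffle, but the merging function itself is identical.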
aggregate vs aggregateByKey:
In aggregateByKey, the zeroValue only participates in the first computation for each key within each partition. In aggregate, the zeroValue participates not only in the intra-partition computation but also in the inter-partition combine.
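A plain-Scala simulation of this difference (no Spark; `AggregateSim` and the nested-`Seq` partition model are illustrative assumptions). Each inner `Seq` plays the role of one partition, and the code mirrors where the zeroValue enters each computation:

```scala
object AggregateSim {
  val seqOp: (Int, Int) => Int = math.max   // intra-partition function
  val combOp: (Int, Int) => Int = _ + _     // inter-partition function

  // aggregate: zeroValue seeds every partition AND the final combine
  def aggregate(parts: Seq[Seq[Int]], zero: Int): Int = {
    val perPartition = parts.map(_.foldLeft(zero)(seqOp))
    perPartition.foldLeft(zero)(combOp)     // zero participates again here
  }

  // aggregateByKey: zeroValue seeds each key once per partition; the
  // cross-partition merge uses combOp alone, with no zeroValue
  def aggregateByKey(parts: Seq[Seq[(String, Int)]], zero: Int): Map[String, Int] = {
    val perPartition = parts.map { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).foldLeft(zero)(seqOp) }
    }
    perPartition.reduce { (a, b) =>
      b.foldLeft(a) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(combOp(_, v)).getOrElse(v))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // zero = 10 dominates max() in both partitions, then is added once more
    // in the combine step: 10 + 10 + 10 = 30
    println(aggregate(Seq(Seq(1, 2), Seq(3, 4)), 10))
    // per key: "a" gets zero in each of its two partitions (10 + 10 = 20),
    // "b" only in one (10); zero never enters the merge itself
    println(aggregateByKey(Seq(Seq("a" -> 1, "a" -> 2, "b" -> 3), Seq("a" -> 4)), 10))
  }
}
```

Counting how often the 10 shows up in each result makes the rule concrete: in `aggregate` it appears once per partition plus once in the combine; in `aggregateByKey` it appears once per key per partition and never in the merge.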
fold vs foldByKey:
The difference is analogous to that between aggregate and aggregateByKey: in fold, the zeroValue participates in both the intra-partition and the inter-partition computation, while in foldByKey it only seeds the first computation for each key within each partition.
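Since fold is essentially aggregate with a single function used for both steps, the same simulation style applies (plain Scala, no Spark; `FoldSim` and the nested-`Seq` partition model are illustrative assumptions):

```scala
object FoldSim {
  val op: (Int, Int) => Int = _ + _

  // fold: zeroValue seeds every partition and the final combine, so with
  // N partitions it is applied N + 1 times
  def fold(parts: Seq[Seq[Int]], zero: Int): Int =
    parts.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  // foldByKey: zeroValue seeds each key once per partition only
  def foldByKey(parts: Seq[Seq[(String, Int)]], zero: Int): Map[String, Int] = {
    val perPartition = parts.map { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).foldLeft(zero)(op) }
    }
    perPartition.reduce { (a, b) =>
      b.foldLeft(a) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(op(_, v)).getOrElse(v))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // two partitions: (10+1+2) + (10+3+4) combined as 10 + 13 + 17 = 40
    println(fold(Seq(Seq(1, 2), Seq(3, 4)), 10))
    // "a" appears in both partitions: (10+1) + (10+2) = 23; zero skips the merge
    println(foldByKey(Seq(Seq("a" -> 1), Seq("a" -> 2)), 10))
  }
}
```

This also explains a common pitfall with fold: a non-neutral zeroValue is counted once per partition plus once more in the combine, so the result depends on the number of partitions.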