Spark version: 2.4.0
Code location: org.apache.spark.rdd.PairRDDFunctions
foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Both overloads ultimately delegate to combineByKeyWithClassTag.
Usage example:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object FoldByKeyDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .appName("FoldByKeyDemo")
      .config("spark.master", "local")
      .config("spark.driver.host", "localhost")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("ERROR")

    val sourceRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3)))
    // Fold the values of each key into a sum, starting from the zero value 0
    val resRdd: RDD[(String, Int)] = sourceRdd.foldByKey(0)(
      (acc: Int, v: Int) => acc + v
    )
    resRdd.foreach(println)

    spark.stop()
  }
}
Output:
(a,3)
(b,5)
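Note the scaladoc requirement that the zero value be neutral for func: foldByKey folds it in once per key per partition, not once per key, so a non-neutral zero changes the result depending on the partitioning. A minimal sketch, reusing sc from the demo above:

// Sketch: zeroValue is folded in once per key per partition.
// With a non-neutral zero of 10 and addition, the result depends on partitioning:
val twoParts = sc.parallelize(Seq(("a", 1), ("a", 2)), 2) // "a" split across two partitions
twoParts.foldByKey(10)(_ + _).foreach(println) // (a,23): 10 is folded in twice
val onePart = sc.parallelize(Seq(("a", 1), ("a", 2)), 1)
onePart.foldByKey(10)(_ + _).foreach(println)  // (a,13): 10 is folded in once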
Source code:
Methods 1 and 2 both delegate to method 3.
Method 1:
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
// Required parameters: (initial value: zeroValue: V, number of partitions: numPartitions: Int)(combine function: func: (V, V) => V)
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  foldByKey(zeroValue, new HashPartitioner(numPartitions))(func)
}
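As a quick illustration (a sketch reusing sourceRdd from the demo above), passing numPartitions makes the shuffle use a HashPartitioner of that size:

// Sketch: method 1 wraps numPartitions in a HashPartitioner(4)
val res4: RDD[(String, Int)] = sourceRdd.foldByKey(0, 4)((acc, v) => acc + v)
println(res4.getNumPartitions) // 4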
Method 2:
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
// Required parameters: (initial value: zeroValue: V)(combine function: func: (V, V) => V)
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  foldByKey(zeroValue, defaultPartitioner(self))(func)
}
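Method 2 leaves the choice to defaultPartitioner(self), which reuses the largest suitable partitioner among the parent RDDs if one exists, and otherwise builds a HashPartitioner sized by spark.default.parallelism (or, if that is unset, the largest parent's partition count). A sketch, again reusing sourceRdd:

// Sketch: with no explicit partitioner, the result's partitioner comes from defaultPartitioner
val resDefault = sourceRdd.foldByKey(0)(_ + _)
println(resDefault.partitioner)      // e.g. Some(org.apache.spark.HashPartitioner@...)
println(resDefault.getNumPartitions) // here: matches sourceRdd's partition count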
Method 3:
/**
* Merge the values for each key using an associative function and a neutral "zero value" which
* may be added to the result an arbitrary number of times, and must not change the result
* (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
*/
def foldByKey(
    zeroValue: V,
    partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit())
  zeroBuffer.get(zeroArray)

  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

  val cleanedFunc = self.context.clean(func)
  combineByKeyWithClassTag[V](
    (v: V) => cleanedFunc(createZero(), v), // createCombiner: fold the zero value into a key's first value
    cleanedFunc,  // mergeValue: merge values within a partition (map-side combine)
    cleanedFunc,  // mergeCombiners: merge partial results across partitions after the shuffle
    partitioner)
}
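Putting it together: the demo's sourceRdd.foldByKey(0)(_ + _) expands to roughly the sketch below, written directly against combineByKeyWithClassTag (the serialization-based cloning of the zero value is elided; the overload without a partitioner falls back to defaultPartitioner):

// Sketch: a hand-expanded equivalent of foldByKey(0)(_ + _)
val equivalent: RDD[(String, Int)] = sourceRdd.combineByKeyWithClassTag[Int](
  (v: Int) => 0 + v,                     // createCombiner: fold the zero value into a key's first value
  (acc: Int, v: Int) => acc + v,         // mergeValue: merge values within a partition
  (acc1: Int, acc2: Int) => acc1 + acc2  // mergeCombiners: merge partial results across partitions
)
equivalent.foreach(println) // (a,3) and (b,5), matching foldByKey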