Number of partitions after a shuffle in Spark Streaming
When using reduceByKey, the number of partitions of the RDD used on the reduce side of the shuffle stage is explained by the source-code comments:
```scala
/**
 * Return a new DStream by applying `reduceByKey` to each RDD. The values for each key are
 * merged using the associative and commutative reduce function. Hash partitioning is used to
 * generate the RDDs with Spark's default number of partitions.
 */
def reduceByKey(func: JFunction2[V, V, V]): JavaPairDStream[K, V] =
  dstream.reduceByKey(func)

/**
 * Return a new DStream by applying `reduceByKey` to each RDD. The values for each key are
 * merged using the supplied reduce function. Hash partitioning is used to generate the RDDs
 * with `numPartitions` partitions.
 */
def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairDStream[K, V] =
  dstream.reduceByKey(func, numPartitions)
```
Reading the source: when you call the one-argument overload `reduceByKey(func: JFunction2[V, V, V])`, hash partitioning uses Spark's default number of partitions; when you pass the second parameter `numPartitions`, that value is used as the number of reduce tasks during the shuffle (i.e. the number of partitions in the resulting RDDs).
[Often, when you don't know how an API method behaves under the hood, read its source-code comments: the different parameter lists each spell out their meaning.]
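Under hash partitioning, each key lands in a partition determined by its hash code modulo `numPartitions`. Spark's `HashPartitioner` folds negative hash codes back into range with a non-negative modulo and sends `null` keys to partition 0. Below is a minimal standalone sketch of that logic (the class name and `main` demo are illustrative, not Spark API):

```java
public class HashPartitionSketch {
    // Java's % operator can return a negative result for negative
    // hash codes, so fold negatives back into [0, mod) — this mirrors
    // the non-negative modulo used by Spark's HashPartitioner.
    static int nonNegativeMod(int x, int mod) {
        int r = x % mod;
        return r < 0 ? r + mod : r;
    }

    // Sketch of HashPartitioner.getPartition: null keys go to
    // partition 0, everything else by hashCode modulo numPartitions.
    static int getPartition(Object key, int numPartitions) {
        if (key == null) return 0;
        return nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Integer.hashCode(-7) is -7, so the non-negative modulo matters here.
        System.out.println(getPartition(-7, 4));
        System.out.println(getPartition("spark", 4));
        System.out.println(getPartition(null, 4));
    }
}
```

So with `reduceByKey(func, 4)`, all records sharing a key are routed to one of 4 reduce-side partitions by this rule.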
For what the reduce side's "default number of partitions" means, see:
https://blog.csdn.net/bbaiggey/article/details/51984753
http://spark.apache.org/docs/latest/configuration.html
The relevant Execution Behavior configuration parameter is:
Property Name | Default | Meaning
---|---|---
`spark.default.parallelism` | For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like `parallelize` with no parent RDDs, it depends on the cluster manager:<br>Local mode: number of cores on the local machine<br>Mesos fine-grained mode: 8<br>Others: total number of cores on all executor nodes or 2, whichever is larger | Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.
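The table's rules can be condensed into a small helper. This is only an illustrative sketch of the documented defaults; the method names and the "Others" case (standalone/YARN: total executor cores or 2, whichever is larger) are restated from the table above, not real Spark API:

```java
import java.util.Arrays;

public class DefaultParallelismSketch {
    // Shuffle operations (reduceByKey, join): when
    // spark.default.parallelism is unset, use the largest number of
    // partitions among the parent RDDs.
    static int shuffleDefault(int... parentPartitionCounts) {
        return Arrays.stream(parentPartitionCounts).max().orElse(1);
    }

    // parallelize with no parent RDDs, "Others" cluster managers:
    // total number of cores on all executor nodes or 2, whichever is larger.
    static int parallelizeDefault(int totalExecutorCores) {
        return Math.max(totalExecutorCores, 2);
    }

    public static void main(String[] args) {
        // Parents with 8, 16, and 4 partitions -> shuffle gets 16.
        System.out.println(shuffleDefault(8, 16, 4));
        // A single-core cluster still gets at least 2 partitions.
        System.out.println(parallelizeDefault(1));
    }
}
```

In other words, if you never pass `numPartitions` to `reduceByKey` and never set `spark.default.parallelism`, the reduce side simply inherits the widest parent RDD's partition count.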