Number of partitions after a shuffle in Spark Streaming
When using reduceByKey, the number of partitions of the RDD used on the reduce side of the shuffle stage is explained by the source-code comments:
```scala
/**
 * Return a new DStream by applying `reduceByKey` to each RDD. The values for each key are
 * merged using the associative and commutative reduce function. Hash partitioning is used to
 * generate the RDDs with Spark's default number of partitions.
 */
def reduceByKey(func: JFunction2[V, V, V]): JavaPairDStream[K, V] =
  dstream.reduceByKey(func)

/**
 * Return a new DStream by applying `reduceByKey` to each RDD. The values for each key are
 * merged using the supplied reduce function. Hash partitioning is used to generate the RDDs
 * with `numPartitions` partitions.
 */
def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairDStream[K, V] =
  dstream.reduceByKey(func, numPartitions)
```
Reading the source: when you call the one-argument overload `reduceByKey(func: JFunction2[V, V, V])`, hash partitioning uses Spark's default number of partitions; when you pass the second parameter `numPartitions`, that value is used as the number of reduce tasks during the shuffle (i.e. the number of partitions in the resulting RDDs).
[Often, when you don't know how an API method behaves under the hood, read its source-code comments: the different parameter lists each spell out their meaning.]
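Under hash partitioning, each key lands in a partition determined by its hash code modulo `numPartitions`. Spark's `HashPartitioner` folds negative hash codes back into range with a non-negative modulo and sends `null` keys to partition 0. Below is a minimal standalone sketch of that logic (the class name and `main` demo are illustrative, not Spark API):

```java
public class HashPartitionSketch {
    // Java's % operator can return a negative result for negative
    // hash codes, so fold negatives back into [0, mod) — this mirrors
    // the non-negative modulo used by Spark's HashPartitioner.
    static int nonNegativeMod(int x, int mod) {
        int r = x % mod;
        return r < 0 ? r + mod : r;
    }

    // Sketch of HashPartitioner.getPartition: null keys go to
    // partition 0, everything else by hashCode modulo numPartitions.
    static int getPartition(Object key, int numPartitions) {
        if (key == null) return 0;
        return nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Integer.hashCode(-7) is -7, so the non-negative modulo matters here.
        System.out.println(getPartition(-7, 4));
        System.out.println(getPartition("spark", 4));
        System.out.println(getPartition(null, 4));
    }
}
```

So with `reduceByKey(func, 4)`, all records sharing a key are routed to one of 4 reduce-side partitions by this rule.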
For what the reduce side's "default number of partitions" means, see:
https://blog.csdn.net/bbaiggey/article/details/51984753
http://spark.apache.org/docs/latest/configuration.html
The relevant Execution Behavior configuration parameter is:
Property Name | Default | Meaning
---|---|---
`spark.default.parallelism` | For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like `parallelize` with no parent RDDs, it depends on the cluster manager:<br>Local mode: number of cores on the local machine<br>Mesos fine-grained mode: 8<br>Others: total number of cores on all executor nodes or 2, whichever is larger | Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.
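The table's rules can be condensed into a small helper. This is only an illustrative sketch of the documented defaults; the method names and the "Others" case (standalone/YARN: total executor cores or 2, whichever is larger) are restated from the table above, not real Spark API:

```java
import java.util.Arrays;

public class DefaultParallelismSketch {
    // Shuffle operations (reduceByKey, join): when
    // spark.default.parallelism is unset, use the largest number of
    // partitions among the parent RDDs.
    static int shuffleDefault(int... parentPartitionCounts) {
        return Arrays.stream(parentPartitionCounts).max().orElse(1);
    }

    // parallelize with no parent RDDs, "Others" cluster managers:
    // total number of cores on all executor nodes or 2, whichever is larger.
    static int parallelizeDefault(int totalExecutorCores) {
        return Math.max(totalExecutorCores, 2);
    }

    public static void main(String[] args) {
        // Parents with 8, 16, and 4 partitions -> shuffle gets 16.
        System.out.println(shuffleDefault(8, 16, 4));
        // A single-core cluster still gets at least 2 partitions.
        System.out.println(parallelizeDefault(1));
    }
}
```

In other words, if you never pass `numPartitions` to `reduceByKey` and never set `spark.default.parallelism`, the reduce side simply inherits the widest parent RDD's partition count.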