Article link: https://blog.csdn.net/oblesslyy/article/details/45898243

Spark-partitioner

@(spark)[partitioner]

Partitioner

/**                                                                                                                                                                     
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.                                                                              
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.                                                                                                      
 */                                                                                                                                                                     
abstract class Partitioner extends Serializable {                                                                                                                       
  def numPartitions: Int                                                                                                                                                
  def getPartition(key: Any): Int                                                                                                                                       
}
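
To make the contract concrete, here is a minimal sketch of a custom partitioner. To keep it runnable without Spark on the classpath, it redeclares a local stand-in for the abstract class quoted above; in a real job you would extend `org.apache.spark.Partitioner` instead. `PrefixPartitioner` and the `"region:id"` key format are hypothetical, made up for this example.

```scala
// Local stand-in mirroring the abstract class quoted above (assumption:
// used only so this sketch runs without a Spark dependency).
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Hypothetical partitioner: string keys such as "us:123" are grouped by the
// text before ':', so every key with the same prefix gets the same partition ID.
class PrefixPartitioner(val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0, "numPartitions must be positive")

  def getPartition(key: Any): Int = key match {
    case null => 0
    case s: String =>
      val prefix = s.takeWhile(_ != ':')
      // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
      Math.floorMod(prefix.hashCode, numPartitions)
    case other => Math.floorMod(other.hashCode, numPartitions)
  }
}
```

Whatever the mapping, the one hard requirement is the one stated in the doc comment: the returned ID must fall between 0 and `numPartitions - 1`.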

HashPartitioner

/**                                                                                                                                                                     
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using                                                                                     
 * Java's `Object.hashCode`.                                                                                                                                            
 *                                                                                                                                                                      
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,                                                                      
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will                                                                       
 * produce an unexpected or incorrect result.                                                                                                                           
 */                                                                                                                                                                     
class HashPartitioner(partitions: Int) extends Partitioner {
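
The class body is cut off here, but the idea can be sketched in plain Scala: take the key's `hashCode` and reduce it modulo the partition count, making sure the result is never negative. The object and method names below (`HashSketch`, `partitionFor`) are hypothetical and only illustrate the approach; they are not copied from the Spark source.

```scala
object HashSketch {
  // The JVM's % operator can return a negative value, so shift it into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    raw + (if (raw < 0) mod else 0)
  }

  // Hash-based partition ID for a key; null keys are sent to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }
}
```

The array warning in the doc comment follows directly from this: two arrays with identical contents are different objects with (almost always) different identity hash codes, so they can end up in different partitions. A content-based key such as a `Seq` avoids the problem:

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Seq hashes by content, so equal contents always map to the same partition.
HashSketch.partitionFor(a.toSeq, 8) == HashSketch.partitionFor(b.toSeq, 8) // true

// Negative hash codes still give a valid partition ID.
HashSketch.partitionFor(-7, 4) // 1
```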

RangePartitioner

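This is the partitioner used for sort-based partitioning. It works in two steps:
1. Take a sample of the data to get an approximate picture of the key distribution.
2. For each key, determine its partition from the range bounds derived from that sample.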
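
The two steps above can be sketched in plain Scala. This is a simplification for illustration, not Spark's actual code: the real RangePartitioner samples the RDD itself and works with any key type that has an `Ordering`, while the hypothetical `RangeSketch` below takes an in-memory sample of `Int` keys.

```scala
object RangeSketch {
  // Step 1: sort the sample and pick up to (partitions - 1) upper bounds
  // at roughly evenly spaced positions.
  def bounds(sample: Seq[Int], partitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty || partitions <= 1) Vector.empty
    else {
      val step = sorted.length.toDouble / partitions
      (1 until partitions)
        .map(i => sorted(math.min((i * step).toInt, sorted.length - 1)))
        .distinct // duplicate bounds would create empty ranges
        .toVector
    }
  }

  // Step 2: a key's partition ID is the number of bounds strictly below it,
  // so keys equal to a bound stay in the lower range.
  def partitionOf(key: Int, bs: Vector[Int]): Int = bs.count(_ < key)
}
```

A short usage example, including the case where the sample is smaller than the requested partition count:

```scala
val bs = RangeSketch.bounds(Seq(5, 1, 9, 3, 7, 2, 8, 4, 6, 10), 3) // Vector(4, 7)
RangeSketch.partitionOf(4, bs)   // 0
RangeSketch.partitionOf(5, bs)   // 1
RangeSketch.partitionOf(100, bs) // 2

// One sampled key yields a single bound, i.e. 2 ranges even though 4 were requested.
RangeSketch.bounds(Seq(42), 4)   // Vector(42)
```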

/**                                                                                                                                                                     
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly                                                                            
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.                                                                                
 *                                                                                                                                                                      
 * Note that the actual number of partitions created by the RangePartitioner might not be the same                                                                      
 * as the `partitions` parameter, in the case where the number of sampled records is less than                                                                          
 * the value of `partitions`.                                                                                                                                           
 */                                                                                                                                                                     
class RangePartitioner[K : Ordering : ClassTag, V](