Spark-Dependency/Aggregator

@(spark)[Dependency|Aggregator]
One of the core concepts of an RDD: its dependencies

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Base class for dependencies.                                                                                                                                         
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class Dependency[T] extends Serializable {                                                                                                                     
  def rdd: RDD[T]                                                                                                                                                       
}         

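Product2 is a trait from the Scala standard library. Its documentation describes it as "a cartesian product of 2 components". Tuple2 extends Product2, so an RDD of ordinary key-value tuples satisfies the RDD[_ <: Product2[K, V]] bound used by ShuffleDependency.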
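
As a small illustration of the point about Product2, the sketch below uses plain Scala with no Spark on the classpath. The object and method names (`Product2Sketch`, `describe`) are made up for this example; only `Product2` and `Tuple2` come from the standard library.

```scala
object Product2Sketch {
  // Written against Product2, so it accepts any two-component product,
  // including an ordinary tuple.
  def describe[K, V](pair: Product2[K, V]): String =
    s"key=${pair._1}, value=${pair._2}"

  def main(args: Array[String]): Unit = {
    val kv: (String, Int) = ("spark", 1) // Tuple2 extends Product2
    println(describe(kv))
  }
}
```

Typing code against Product2 rather than Tuple2 keeps it open to other two-component types, which appears to be the reason the shuffle code uses it.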

NarrowDependency

A relatively simple kind of dependency:

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Base class for dependencies where each partition of the child RDD depends on a small number                                                                          
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.                                                                                  
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {    
  /**                                                                                                                                                                   
   * Get the parent partitions for a child partition.                                                                                                                   
   * @param partitionId a partition of the child RDD                                                                                                                    
   * @return the partitions of the parent RDD that the child partition depends upon                                                                                     
   */                                                                                                                                                                   
  def getParents(partitionId: Int): Seq[Int]                                                                                                                            

  override def rdd: RDD[T] = _rdd                                                                                                                                       
}                                                                                                                                                                          
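
The doc comment above says narrow dependencies allow pipelined execution. The following is only a rough sketch of that idea in plain Scala, assuming a partition can be modeled as an iterator: two chained steps process each element in turn without materializing the intermediate result. `PipelineSketch` and `trace` are hypothetical names, not Spark APIs.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Records the order in which the two chained steps touch each element.
  def trace(partition: Seq[Int]): List[String] = {
    val log = ArrayBuffer.empty[String]
    // Both steps are lazy views over the same parent iterator.
    val mapped = partition.iterator.map { x => log += s"map($x)"; x * 2 }
    val filtered = mapped.filter { x => log += s"filter($x)"; x > 2 }
    filtered.foreach(_ => ()) // drain the iterator, as a task would
    log.toList
  }

  def main(args: Array[String]): Unit =
    println(trace(Seq(1, 2)).mkString(" "))
}
```

The log interleaves the map and filter steps element by element, which is the behavior pipelining relies on: a child partition can be computed directly from its parent partition's iterator.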

OneToOneDependency

A 1:1 mapping: child partition i depends only on parent partition i.

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.                                                                                  
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {                                                                                             
  override def getParents(partitionId: Int) = List(partitionId)                                                                                                         
} 
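
To make the mapping concrete without pulling in Spark, here is a minimal stand-in. The trait `NarrowDep` and the object `OneToOne` are simplified, hypothetical versions of the classes above that keep only the partition mapping and drop the RDD reference.

```scala
object NarrowSketch {
  // Simplified stand-in for NarrowDependency: only the partition mapping.
  trait NarrowDep {
    def getParents(partitionId: Int): Seq[Int]
  }

  // Same rule as OneToOneDependency: child partition i reads parent partition i.
  object OneToOne extends NarrowDep {
    override def getParents(partitionId: Int): Seq[Int] = List(partitionId)
  }

  def main(args: Array[String]): Unit =
    (0 until 3).foreach { child =>
      println(s"child $child <- parents ${OneToOne.getParents(child)}")
    }
}
```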

RangeDependency

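Determines the dependency by partition range: a contiguous range of parent partitions maps one-to-one onto a contiguous range of child partitions. Each RangeDependency describes a single such range, so an RDD built from several parents (UnionRDD, for example) holds one RangeDependency per parent.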

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.                                                                        
 * @param rdd the parent RDD                                                                                                                                            
 * @param inStart the start of the range in the parent RDD                                                                                                              
 * @param outStart the start of the range in the child RDD                                                                                                              
 * @param length the length of the range                                                                                                                                
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
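class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int) = {
    // Child partitions inside [outStart, outStart + length) map to the parent
    // partition at the same offset from inStart; others have no parent here.
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}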
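
The range parameters are easier to follow with numbers. The sketch below is a Spark-free stand-in: `RangeDep` is a hypothetical case class that keeps only `inStart`, `outStart` and `length`, and the scenario (a union of a 2-partition parent A and a 3-partition parent B) is an assumed example, not taken from the original post.

```scala
object RangeSketch {
  // Hypothetical stand-in for RangeDependency without the RDD reference.
  final case class RangeDep(inStart: Int, outStart: Int, length: Int) {
    // Child partitions inside [outStart, outStart + length) map to the parent
    // partition at the same offset from inStart; others have no parent here.
    def getParents(partitionId: Int): List[Int] =
      if (partitionId >= outStart && partitionId < outStart + length)
        List(partitionId - outStart + inStart)
      else Nil
  }

  // Child partitions 0-1 come from parent A, child partitions 2-4 from parent B.
  val fromA: RangeDep = RangeDep(inStart = 0, outStart = 0, length = 2)
  val fromB: RangeDep = RangeDep(inStart = 0, outStart = 2, length = 3)

  def main(args: Array[String]): Unit =
    (0 until 5).foreach { child =>
      println(s"child $child: A=${fromA.getParents(child)} B=${fromB.getParents(child)}")
    }
}
```

Each child partition gets a parent from exactly one of the two ranges, which is what makes this dependency narrow even though the child has several parent RDDs.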

ShuffleDependency

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,                                                                          
 * the RDD is transient since we don't need it on the executor side.                                                                                                    
 *                                                                                                                                                                      
 * @param _rdd the parent RDD                                                                                                                                           
 * @param partitioner partitioner used to partition the shuffle output                                                                                                  
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If set to None,                                                                      
 *                   the default serializer, as specified by `spark.serializer` config option, will                                                                     
 *                   be used.                                                                                                                                           
 * @param keyOrdering key ordering for RDD's shuffles                                                                                                                   
 * @param aggregator map/reduce-side aggregator for RDD's shuffle                                                                                                       
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)                                                                        
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
class ShuffleDependency[K, V, C](                                                                                                                                       
    @transient _rdd: RDD[_ <: Product2[K, V]],                                                                                                                          
    val partitioner: Partitioner,                                                                                                                                       
    val serializer: Option[Serializer] = None,                                                                                                                          
    val keyOrdering: Option[Ordering[K]] = None,                                                                                                                        
    val aggregator: Option[Aggregator[K, V, C]] = None,                                                                                                                 
    val mapSideCombine: Boolean = false)                                                                                                                                
  extends Dependency[Product2[K, V]] {
  // ... (class body omitted in this excerpt)
}
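
Unlike the narrow dependencies above, a shuffle lets every parent partition contribute to every child partition, because the target partition is decided by the key through the `partitioner`. The sketch below illustrates this with a hash-based rule (hash code modulo the partition count, kept non-negative). All names in it are hypothetical helpers written for this example, and it assumes hash partitioning; other partitioners assign keys differently.

```scala
object ShuffleSketch {
  // Non-negative modulo, so negative hash codes still map to a valid partition.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hash partitioning: the reduce partition is derived from the key alone.
  def partitionFor(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // For each map-side partition, the set of reduce partitions it writes to.
  def targets[K, V](mapPartitions: Seq[Seq[(K, V)]], numPartitions: Int): Seq[Set[Int]] =
    mapPartitions.map(_.map { case (k, _) => partitionFor(k, numPartitions) }.toSet)

  def main(args: Array[String]): Unit = {
    val parent = Seq(Seq(0 -> "a", 1 -> "b"), Seq(2 -> "c", 3 -> "d"))
    println(targets(parent, 2))
  }
}
```

With these sample keys both map-side partitions write to both reduce partitions, so no child partition can be computed from a single parent partition, and the stages cannot be pipelined.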

Aggregator

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * A set of functions used to aggregate data.                                                                                                                           
 *                                                                                                                                                                      
 * @param createCombiner function to create the initial value of the aggregation.                                                                                       
 * @param mergeValue function to merge a new value into the aggregation result.                                                                                         
 * @param mergeCombiners function to merge outputs from multiple mergeValue function.                                                                                   
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class Aggregator[K, V, C] (                                                                                                                                        
    createCombiner: V => C,                                                                                                                                             
    mergeValue: (C, V) => C,                                                                                                                                            
    mergeCombiners: (C, C) => C) {
  // ... (class body omitted in this excerpt)
}
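
The roles of the three functions are clearer with a worked example. The sketch below is a Spark-free approximation: `Agg`, `combineValues` and `combineCombiners` are hypothetical names, and a plain in-memory map stands in for whatever data structure Spark actually uses. It computes a per-key (sum, count) pair, first within each map-side partition and then across partitions.

```scala
import scala.collection.mutable

object AggregatorSketch {
  // Hypothetical stand-in mirroring the three functions of Aggregator[K, V, C].
  final case class Agg[K, V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C) {

    // Map-side step: fold raw values into one combiner per key.
    def combineValues(records: Iterator[(K, V)]): Map[K, C] = {
      val combiners = mutable.Map.empty[K, C]
      records.foreach { case (k, v) =>
        combiners(k) = combiners.get(k) match {
          case Some(c) => mergeValue(c, v)   // key seen before
          case None    => createCombiner(v)  // first value for this key
        }
      }
      combiners.toMap
    }

    // Reduce-side step: merge partial combiners from several map outputs.
    def combineCombiners(partials: Iterator[(K, C)]): Map[K, C] = {
      val combiners = mutable.Map.empty[K, C]
      partials.foreach { case (k, c) =>
        combiners(k) = combiners.get(k).map(mergeCombiners(_, c)).getOrElse(c)
      }
      combiners.toMap
    }
  }

  // Per-key (sum, count), from which an average could be derived.
  val sumCount: Agg[String, Int, (Int, Int)] = Agg[String, Int, (Int, Int)](
    v => (v, 1),
    (c, v) => (c._1 + v, c._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))

  def main(args: Array[String]): Unit = {
    val part1 = sumCount.combineValues(Iterator("a" -> 1, "a" -> 3, "b" -> 5))
    val part2 = sumCount.combineValues(Iterator("a" -> 2))
    println(sumCount.combineCombiners(part1.iterator ++ part2.iterator))
  }
}
```

Running `combineValues` on each partition before the shuffle corresponds to `mapSideCombine = true`: only one combiner per key per partition needs to be sent, rather than every raw value.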