Spark Source Code Analysis: ExternalSorter

Author: cclucc


Article link: https://blog.csdn.net/cclucc/article/details/79910996


  SortShuffleWriter calls two methods of ExternalSorter: insertAll and writePartitionedFile. 

 
 1】、blockManager 

 
 2】、diskBlockManager 

 
 3】、serializerManager 

 
 4】、fileBufferSize 

  spark.shuffle.file.buffer=32k 

 
 5】、serializerBatchSize 

  spark.shuffle.spill.batchSize=10000 

 
 6】、map (PartitionedAppendOnlyMap) 

  private var data = new Array[AnyRef](2 * capacity) 

  In other words, the memory this array consumes does not come from Storage memory. 
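
  The `2 * capacity` sizing exists because keys and values share one flat array: slot `2 * i` holds a key and slot `2 * i + 1` holds its value. The sketch below illustrates only that layout. `FlatPairTable` is a hypothetical toy class, not a Spark class, and it leaves out the hashing, probing and growth that the real map performs.

```scala
// Toy illustration only: FlatPairTable is a hypothetical name, not part of Spark.
// It shows the key/value interleaving and nothing else.
class FlatPairTable(capacity: Int) {
  // One array holds key0, value0, key1, value1, ...
  private val data = new Array[AnyRef](2 * capacity)

  def put(slot: Int, key: AnyRef, value: AnyRef): Unit = {
    require(slot >= 0 && slot < capacity, s"slot $slot out of range")
    data(2 * slot) = key       // even index: key
    data(2 * slot + 1) = value // odd index: value
  }

  def keyAt(slot: Int): AnyRef = data(2 * slot)
  def valueAt(slot: Int): AnyRef = data(2 * slot + 1)
  def arrayLength: Int = data.length
}

object FlatPairTableDemo {
  def main(args: Array[String]): Unit = {
    val table = new FlatPairTable(4)
    table.put(1, "k", Integer.valueOf(42))
    println(s"${table.keyAt(1)} -> ${table.valueAt(1)}, array length ${table.arrayLength}")
  }
}
```

  Keeping each value next to its key means a lookup touches adjacent array slots instead of two separate structures.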

 
 7】、buffer (PartitionedPairBuffer) 

 
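 8】、forceSpillFiles (ArrayBuffer[SpilledFile]) 

  When the PartitionedAppendOnlyMap no longer fits in memory, its contents have to be spilled to disk. Writing each record straight to disk would be too slow, so records first go into a buffer, and the buffer is then written to the disk file in one go. The size of that buffer is set by fileBufferSize. 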
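
  The buffering idea can be sketched with plain java.io streams. This is an illustration under assumptions, not Spark's own writer: `BufferedSpillSketch`, `writeBuffered` and `FileBufferSize` are hypothetical names, and the 32 KiB value simply mirrors the spark.shuffle.file.buffer=32k setting quoted earlier.

```scala
import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream}

object BufferedSpillSketch {
  // Mirrors spark.shuffle.file.buffer=32k; the constant name is hypothetical.
  val FileBufferSize: Int = 32 * 1024

  // Writes records through an in-memory buffer so the file receives few, large writes.
  // Returns the resulting file length in bytes.
  def writeBuffered(file: File, records: Iterator[(Int, String)]): Long = {
    val out = new DataOutputStream(
      new BufferedOutputStream(new FileOutputStream(file), FileBufferSize))
    try {
      records.foreach { case (partition, value) =>
        out.writeInt(partition) // 4 bytes
        out.writeUTF(value)     // 2-byte length prefix plus the encoded string
      }
    } finally {
      out.close() // close() flushes whatever is still sitting in the buffer
    }
    file.length()
  }

  def main(args: Array[String]): Unit = {
    val tmp = File.createTempFile("spill-sketch", ".bin")
    tmp.deleteOnExit()
    val bytes = writeBuffered(tmp, Iterator((0, "a"), (1, "bb")))
    println(s"wrote $bytes bytes")
  }
}
```

  A larger buffer means fewer system calls per spill at the cost of more memory held per open file.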

 
 9】、spills (ArrayBuffer[SpilledFile]) 

 
 10】、insertAll 

if (shouldCombine) {
  // Combine values in-memory first using our AppendOnlyMap
  val mergeValue = aggregator.get.mergeValue
  val createCombiner = aggregator.get.createCombiner
  var kv: Product2[K, V] = null
  val update = (hadValue: Boolean, oldValue: C) => {
    if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
  }
  while (records.hasNext) {
    addElementsRead()
    kv = records.next()
    map.changeValue((getPartition(kv._1), kv._1), update)
    maybeSpillCollection(usingMap = true)
  }
} else {
  // Stick values into our buffer
  while (records.hasNext) {
    addElementsRead()
    val kv = records.next()
    buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
    maybeSpillCollection(usingMap = false)
  }
}
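
  To make the two branches concrete, here is a small self-contained sketch. It is a simplification under stated assumptions: a `mutable.HashMap` stands in for PartitionedAppendOnlyMap, a `Vector` stands in for PartitionedPairBuffer, `partitionOf` is a hypothetical replacement for getPartition, and spilling is left out entirely.

```scala
import scala.collection.mutable

object InsertAllSketch {
  // Hypothetical stand-in for getPartition: non-negative hash modulo partition count.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Combine path: one entry per (partition, key); values are merged as they arrive.
  def combine(
      records: Iterator[(String, Int)],
      numPartitions: Int,
      createCombiner: Int => Long,
      mergeValue: (Long, Int) => Long): Map[(Int, String), Long] = {
    val map = mutable.HashMap.empty[(Int, String), Long]
    records.foreach { case (k, v) =>
      val pk = (partitionOf(k, numPartitions), k)
      map(pk) = map.get(pk) match {
        case Some(old) => mergeValue(old, v) // key seen before: merge
        case None      => createCombiner(v)  // first value for this key
      }
    }
    map.toMap
  }

  // Buffer path: every record is kept unchanged, tagged with its partition.
  def buffer(records: Iterator[(String, Int)], numPartitions: Int): Vector[(Int, String, Int)] =
    records.map { case (k, v) => (partitionOf(k, numPartitions), k, v) }.toVector

  def main(args: Array[String]): Unit = {
    val input = Seq("a" -> 1, "b" -> 2, "a" -> 3)
    println(combine(input.iterator, 2, _.toLong, _ + _))
    println(buffer(input.iterator, 2))
  }
}
```

  The combine path keeps memory proportional to the number of distinct keys, while the buffer path grows with the number of records.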

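  The changeValue called on the map above is overridden in SizeTrackingAppendOnlyMap, the parent class of PartitionedAppendOnlyMap: 

override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
  val newValue = super.changeValue(key, updateFunc)
  super.afterUpdate()
  newValue
}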

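  After every update to a collection that mixes in SizeTracker, SizeTracker's afterUpdate method has to be called. afterUpdate checks how many updates have happened so far and, when a sample is due, uses SizeEstimator's estimate method to estimate the size of the collection. Calling SizeEstimator is expensive (the source comments put it at several milliseconds), so it cannot be called on every update. SizeTracker therefore counts the updates and spaces the estimates out exponentially with base 1.1: an estimate is taken when the update count reaches roughly 1.1, 1.1 * 1.1, 1.1 * 1.1 * 1.1, and so on. 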

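  This exponential grows slowly at first: 1.1^96 is about 10,000 and 1.1^144 is about 1,000,000. So roughly 96 estimates run for 10,000 updates, 120 for 100,000 updates, 144 for 1,000,000 updates, and 169 for 10,000,000 updates. These figures come from taking the logarithm base 1.1 of the update count, so they are approximations; the real sample points are whole numbers of updates and the exact counts differ somewhat. 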
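
  The schedule can be checked with a short simulation. This is a sketch under assumptions rather than SizeTracker itself: it assumes the first estimate happens at update 1 and that each next sample point is the current one multiplied by 1.1 and rounded up. `SampleScheduleSketch`, `countEstimates` and `pureExponent` are hypothetical names.

```scala
object SampleScheduleSketch {
  // Growth rate quoted in the text; the constant name here is illustrative.
  val GrowthRate: Double = 1.1

  // Counts the estimates taken during `totalUpdates` updates, assuming the first
  // sample is at update 1 and the next sample point is ceil(current * GrowthRate).
  def countEstimates(totalUpdates: Long): Int = {
    var next = 1L
    var count = 0
    while (next <= totalUpdates) {
      count += 1
      next = math.ceil(next * GrowthRate).toLong
    }
    count
  }

  // Figure obtained when rounding is ignored: log base 1.1 of the update count.
  def pureExponent(totalUpdates: Long): Double =
    math.log(totalUpdates.toDouble) / math.log(GrowthRate)

  def main(args: Array[String]): Unit = {
    Seq(10000L, 100000L, 1000000L, 10000000L).foreach { n =>
      println(f"$n%9d updates: simulated ${countEstimates(n)}%d estimates, log_1.1(n) = ${pureExponent(n)}%.1f")
    }
  }
}
```

  Because rounding up makes the early sample points advance by one whole update at a time, the simulated counts come out below the pure logarithm, which is why the figures in the text are best read as approximate upper bounds.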
