Source Code Analysis of Partitioning During the Spark Shuffle


新宿一次狼

Category: spark. Tags: big data, spark

Article link: https://blog.csdn.net/youmianzhou/article/details/109718160


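During a shuffle, the data of a Spark RDD is regrouped and repartitioned. Records are assigned to partitions by key, so that all records with the same key end up in the same partition. Because the number of partitions is finite, different keys can also be assigned to the same partition. The assignment is implemented with a hash-based scheme.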
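
The idea can be tried out without Spark. The following is a minimal sketch, not Spark code: the object `KeyToPartitionSketch` and the helper `partitionOf` are hypothetical names introduced only for illustration, and the modulo rule is the one this article later traces in the Spark source.

```scala
object KeyToPartitionSketch {
  // Hypothetical helper: map a non-null key to a partition id in [0, numPartitions).
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    // The JVM remainder is negative for negative hash codes, so shift it back.
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 3
    // An Int's hashCode is the Int itself, which keeps the arithmetic easy to follow.
    val keys = Seq(1, 4, 7, 2, 1)
    keys.foreach { k =>
      println(s"key=$k -> partition ${partitionOf(k, numPartitions)}")
    }
  }
}
```

Keys 1, 4 and 7 are different, but each leaves remainder 1 when divided by 3, so all of them map to partition 1. The two occurrences of key 1 necessarily map to the same partition.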

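The partitioning rule is defined by the abstract class Partitioner. The default implementation is HashPartitioner.
Scrolling further down in the source, you can find HashPartitioner: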

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    // the actual partition computation is done in nonNegativeMod
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
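
To see how the class behaves without a Spark dependency, here is a small stand-in. `SimpleHashPartitioner` is a hypothetical name, not a Spark class. It follows the structure of the code quoted above, with one deliberate difference: the sketch requires at least one partition, so the modulo never divides by zero.

```scala
object HashPartitionerSketch {
  // Stand-in for Spark's HashPartitioner so the example runs without Spark.
  final class SimpleHashPartitioner(val numPartitions: Int) {
    require(numPartitions > 0, "this sketch needs at least one partition")

    def getPartition(key: Any): Int = key match {
      case null => 0 // null keys always go to partition 0
      case _ =>
        val raw = key.hashCode % numPartitions
        if (raw < 0) raw + numPartitions else raw
    }

    // Two partitioners are equal when their partition counts match.
    override def equals(other: Any): Boolean = other match {
      case p: SimpleHashPartitioner => p.numPartitions == numPartitions
      case _ => false
    }

    override def hashCode: Int = numPartitions
  }

  def main(args: Array[String]): Unit = {
    val a = new SimpleHashPartitioner(4)
    val b = new SimpleHashPartitioner(4)
    println(a.getPartition(null))              // null key
    println(a.getPartition("a"))               // "a".hashCode is 97, and 97 % 4 is 1
    println(a == b)                            // same partition count
    println(a == new SimpleHashPartitioner(8)) // different partition count
  }
}
```

Because equality depends only on the partition count, two instances created separately with the same count place every key in the same partition and compare as equal.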

Ctrl + left-click on nonNegativeMod in the IDE to jump to its definition:

 /* Calculates 'x' modulo 'mod', takes to consideration sign of x,
  * i.e. if 'x' is negative, than 'x' % 'mod' is negative too
  * so function return (x % mod) + mod in that case.
  */
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    // if the hash code is negative, the raw remainder is negative too, so return (x % mod) + mod
    rawMod + (if (rawMod < 0) mod else 0)
  }
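
A few concrete values make the sign handling easier to follow. The sketch below copies the function body quoted above into a hypothetical object `NonNegativeModSketch` so that it runs without Spark. The comparison with `java.lang.Math.floorMod` is an addition of this example and holds for a positive `mod`.

```scala
object NonNegativeModSketch {
  // Same body as the function quoted above, copied here to avoid a Spark dependency.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def main(args: Array[String]): Unit = {
    println(-7 % 3)                         // -1: the JVM remainder keeps the sign of x
    println(nonNegativeMod(-7, 3))          // 2: -1 + 3
    println(nonNegativeMod(7, 3))           // 1: already non-negative, left unchanged
    println(nonNegativeMod(-6, 3))          // 0: a zero remainder needs no correction
    println(java.lang.Math.floorMod(-7, 3)) // 2: same result for a positive modulus
  }
}
```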

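The computation is:
nonNegativeMod(key.hashCode, numPartitions)
1. Parameters
key.hashCode: the hash code of the key
numPartitions: the number of partitions

2. key.hashCode % numPartitions gives the remainder
This divides the possible key.hashCode values into numPartitions groups, and the remainder is the id of the group, that is, the partition id. Note that a key's hash code can be negative, which makes the raw remainder negative as well; nonNegativeMod above handles that case by adding mod to the remainder.

3. As a result, records with the same key are always placed in the same partition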
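
The three steps above can be sketched end to end on an in-memory list of records. This is only an illustration under stated assumptions: `GroupByPartitionSketch` and `assign` are hypothetical names, and a plain `groupBy` stands in for the actual shuffle machinery, which this article does not cover.

```scala
object GroupByPartitionSketch {
  // Same arithmetic as the nonNegativeMod function quoted earlier.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper: bucket (key, value) records by their partition id.
  def assign[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => nonNegativeMod(k.hashCode, numPartitions) }

  def main(args: Array[String]): Unit = {
    // Int keys are used because an Int's hashCode is the Int itself.
    val records = Seq((-5, "a"), (3, "b"), (-5, "c"), (7, "d"), (2, "e"))
    val byPartition = assign(records, 4)
    byPartition.toSeq.sortBy(_._1).foreach { case (p, rs) =>
      println(s"partition $p: ${rs.mkString(", ")}")
    }
  }
}
```

With 4 partitions, key -5 has raw remainder -1 and is shifted to partition 3, which it shares with keys 3 and 7. Both records with key -5 land together, and key 2 goes to partition 2.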
