How are the partitions of an RDD created by sc.parallelize determined?

How sc.parallelize splits data into partitions

1. The parallelize method
Core code related to partitioning

  def parallelize[T: ClassTag](
      seq: Seq[T],                          // the input data
      numSlices: Int = defaultParallelism   // number of partitions; the default is used when omitted
      ): RDD[T] = withScope {
    assertNotStopped()
    new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
  }

  // ParallelCollectionRDD: build the partitions
  override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }

  def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    if (numSlices < 1) {
      throw new IllegalArgumentException("Positive number of partitions required")
    }
    // Compute the (start, end) index pair of each slice
    def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
      (0 until numSlices).iterator.map { i =>
        val start = ((i * length) / numSlices).toInt
        val end = (((i + 1) * length) / numSlices).toInt
        (start, end)
      }
    }
    seq match {
      case r: Range =>
        positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
          // If the range is inclusive, use inclusive range for the last slice
          if (r.isInclusive && index == numSlices - 1) {
            new Range.Inclusive(r.start + start * r.step, r.end, r.step)
          }
          else {
            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
          }
        }.toSeq.asInstanceOf[Seq[Seq[T]]]
      case nr: NumericRange[_] =>
        // For ranges of Long, Double, BigInteger, etc
        val slices = new ArrayBuffer[Seq[T]](numSlices)
        var r = nr
        for ((start, end) <- positions(nr.length, numSlices)) {
          val sliceSize = end - start
          slices += r.take(sliceSize).asInstanceOf[Seq[T]]
          r = r.drop(sliceSize)
        }
        slices
      case _ =>
        val array = seq.toArray // To prevent O(n^2) operations for List etc
        // Compute the index boundaries of each slice
        positions(array.length, numSlices).map { case (start, end) =>
            // Copy the elements that belong to this slice
            array.slice(start, end).toSeq
        }.toSeq
    }
  }
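
  // Scala collection-library slice(from, until), which array.slice above ends up calling:
  // it copies the elements at indices [from, until) and returns them as one partition's data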
  def slice(from: Int, until: Int): Repr = {
    val lo    = math.max(from, 0)
    val hi    = math.min(math.max(until, 0), length)
    val elems = math.max(hi - lo, 0)
    val b     = newBuilder
    b.sizeHint(elems)
    
    var i = lo
    while (i < hi) {
      b += self(i)
      i += 1
    }
    b.result()
  }
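
The generic branch (case _) can be reproduced without Spark. The following is a minimal sketch in plain Scala, not Spark's own code: positions repeats the arithmetic of the helper quoted above, while SliceSketch and sliceGeneric are hypothetical names introduced only for illustration.

```scala
object SliceSketch {
  // Same arithmetic as the positions helper quoted above:
  // slice i covers indices [i * length / numSlices, (i + 1) * length / numSlices).
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  // Hypothetical helper: mimics the generic `case _` branch on a Vector.
  def sliceGeneric[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "Positive number of partitions required")
    val indexed = seq.toVector // indexed copy, so each slice call is cheap
    positions(indexed.length, numSlices).map { case (start, end) =>
      indexed.slice(start, end)
    }.toList
  }

  def main(args: Array[String]): Unit = {
    // Print one slice per line for the five-element example
    sliceGeneric(Seq(1, 2, 3, 4, 5), 3).foreach(println)
  }
}
```

Calling SliceSketch.sliceGeneric(Seq(1, 2, 3, 4, 5), 3) yields the slices (1), (2, 3) and (4, 5).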

2. Worked example

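Take sc.parallelize(Array(1,2,3,4,5),3) as an example. All divisions are integer divisions.
positions(5,3)
    i = 0: (0, 5/3) => (0,1)
        slice(0,1) => array(0) => (1)
    i = 1: (5/3, 10/3) => (1,3)
        slice(1,3) => array(1), array(2) => (2,3)
    i = 2: (10/3, 15/3) => (3,5)
        slice(3,5) => array(3), array(4) => (4,5)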
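
Two consequences of the integer division are worth checking by hand: slice sizes never differ by more than one, and asking for more slices than there are elements produces empty slices. The sketch below assumes plain Scala without Spark; BoundarySketch, boundaries and sizes are hypothetical names used only for illustration.

```scala
object BoundarySketch {
  // Hypothetical helper: same integer arithmetic as positions, materialised as a List.
  def boundaries(length: Long, numSlices: Int): List[(Int, Int)] =
    (0 until numSlices).toList.map { i =>
      (((i * length) / numSlices).toInt, (((i + 1) * length) / numSlices).toInt)
    }

  // Number of elements that land in each slice.
  def sizes(length: Long, numSlices: Int): List[Int] =
    boundaries(length, numSlices).map { case (start, end) => end - start }

  def main(args: Array[String]): Unit = {
    println(boundaries(5, 3)) // List((0,1), (1,3), (3,5))
    println(sizes(10, 4))     // List(2, 3, 2, 3)
    println(sizes(2, 4))      // List(0, 1, 0, 1): two of the four slices are empty
  }
}
```

When the length is not a multiple of numSlices, the extra elements are spread across the slices rather than all being placed in the first or the last one.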
Verifying the result
Running sc.parallelize(Array(1,2,3,4,5),3) in spark-shell confirms the split:
scala> sc.parallelize(Array(1,2,3,4,5),3).glom.collect
res21: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
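
The Range branch of slice quoted in section 1 uses the same boundaries but keeps every slice as a Range instead of copying elements. Below is a minimal sketch of that branch in plain Scala, assuming no Spark on the classpath; RangeSliceSketch and sliceRange are hypothetical names, and Range(...) / Range.inclusive(...) stand in for the constructors used in the quoted code.

```scala
object RangeSliceSketch {
  // Hypothetical helper mirroring the `case r: Range` branch quoted earlier:
  // each slice stays a Range, so no elements are copied.
  def sliceRange(r: Range, numSlices: Int): List[Range] = {
    require(numSlices >= 1, "Positive number of partitions required")
    val length = r.length.toLong
    (0 until numSlices).toList.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      if (r.isInclusive && i == numSlices - 1) {
        // The last slice of an inclusive range keeps the original inclusive end.
        Range.inclusive(r.start + start * r.step, r.end, r.step)
      } else {
        Range(r.start + start * r.step, r.start + end * r.step, r.step)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // positions(10, 3) gives (0,3), (3,6), (6,10)
    sliceRange(1 to 10, 3).map(_.toList).foreach(println)
  }
}
```

For 1 to 10 with three slices this gives 1 to 3, 4 to 6 and 7 to 10, following the same positions arithmetic as the array example.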