Spark top and takeOrdered: a source code walkthrough

手把手教你学AI

Article link: https://blog.csdn.net/zhaomengsen/article/details/82822211

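In RDD, the doc comment above top describes the operation in detail; the method itself simply calls takeOrdered with the reversed ordering.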

/**
 * Returns the top k (largest) elements from this RDD as defined by the specified
 * implicit Ordering[T] and maintains the ordering. This does the opposite of
 * [[takeOrdered]]. For example:
 * {{{
 *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
 *   // returns Array(12)
 *
 *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
 *   // returns Array(6, 5)
 * }}}
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 *
 * @param num k, the number of top elements to return
 * @param ord the implicit ordering for T
 * @return an array of top elements
 */
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  takeOrdered(num)(ord.reverse)
}
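The relationship between top and takeOrdered can be checked without a cluster. The following is a minimal sketch on plain Scala collections, not Spark code: takeOrderedLocal and topLocal are hypothetical helpers that mimic the semantics by sorting a local Seq.

```scala
object TopSketch {
  // Hypothetical local stand-in for RDD.takeOrdered:
  // the num smallest elements under ord, in ascending order.
  def takeOrderedLocal[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    data.sorted(ord).take(num)

  // Hypothetical local stand-in for RDD.top:
  // reuse takeOrderedLocal with the reversed ordering, as top does.
  def topLocal[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrderedLocal(data, num)(ord.reverse)

  def main(args: Array[String]): Unit = {
    val data = Seq(10, 4, 2, 12, 3)
    println(takeOrderedLocal(data, 2)) // List(2, 3)
    println(topLocal(data, 2))         // List(12, 10)
  }
}
```

Reversing the ordering is all that distinguishes the two calls, which is why top needs no logic of its own.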

takeOrdered handles both ascending and descending order, depending on the Ordering it receives.

Each map task (one per partition) keeps a bounded queue of length num.

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}
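The two phases above (a bounded queue per partition, then a pairwise merge) can be imitated locally. This is only a sketch under assumptions: partitions is a hypothetical Seq[Seq[T]] standing in for an RDD, and scala.collection.mutable.PriorityQueue stands in for Spark's BoundedPriorityQueue, so the bounding logic is written by hand here.

```scala
import scala.collection.mutable

object TakeOrderedSketch {
  // Keep at most num smallest elements in a max-heap: the head is the
  // largest element kept so far, so it is the one to evict.
  def smallestOf[T](items: Iterator[T], num: Int)(implicit ord: Ordering[T]): List[T] = {
    val heap = mutable.PriorityQueue.empty[T](ord)
    items.foreach { x =>
      if (heap.size < num) heap.enqueue(x)
      else if (heap.nonEmpty && ord.lt(x, heap.head)) {
        heap.dequeue()
        heap.enqueue(x)
      }
    }
    heap.toList // heap order is unspecified; sorting happens at the end
  }

  // Hypothetical local version of takeOrdered over simulated partitions.
  def takeOrderedLocal[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): List[T] =
    if (num <= 0 || partitions.isEmpty) Nil
    else {
      // Map phase: one bounded candidate list per partition.
      val perPartition = partitions.map(p => smallestOf(p.iterator, num))
      // Reduce phase: merge two candidate lists and bound the result again.
      perPartition
        .reduce((a, b) => smallestOf((a ++ b).iterator, num))
        .sorted(ord)
    }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(10, 4, 2), Seq(12, 3), Seq.empty[Int])
    println(takeOrderedLocal(parts, 2))                       // List(2, 3)
    println(takeOrderedLocal(parts, 2)(Ordering[Int].reverse)) // List(12, 10)
  }
}
```

Because every intermediate result holds at most num elements, the amount of data moved between the phases does not depend on the partition sizes.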

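Here util.collection.Utils.takeOrdered(items, num)(ord) computes the num smallest elements (under ord) within each partition.

The per-partition queues are then merged, converted to an array, and sorted: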

mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)

Now look at the code of Utils.takeOrdered.

It delegates to Ordering from Google Guava; many eviction-style (keep the best k) algorithms rely on this kind of implementation.

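The idea is an optimized selection: leastOf keeps only a bounded buffer of candidates instead of sorting the whole input.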

import scala.collection.JavaConverters._

import com.google.common.collect.{Ordering => GuavaOrdering}

/**
 * Returns the first K elements from the input as defined by the specified implicit Ordering[T]
 * and maintains the ordering.
 */
def takeOrdered[T](input: Iterator[T], num: Int)(implicit ord: Ordering[T]): Iterator[T] = {
  val ordering = new GuavaOrdering[T] {
    override def compare(l: T, r: T): Int = ord.compare(l, r)
  }
  ordering.leastOf(input.asJava, num).iterator.asScala
}

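As for the final Array sort that follows, I personally feel that a merge sort over the per-partition results would be more efficient.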

mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
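To make the merge idea concrete, here is a minimal sketch of a k-way merge that stops after num elements. It is not what Spark does: mergeTake is a hypothetical helper, and it assumes each per-partition result is already sorted under ord. Whether it would beat the final sort in practice is not measured here, since the reduced queue holds at most num elements anyway.

```scala
import scala.collection.mutable

object MergeSketch {
  // Hypothetical helper: merge already-sorted runs and stop after num elements.
  def mergeTake[T](sortedRuns: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): List[T] = {
    // Heap entries are (next value, iterator it came from). The ordering is
    // reversed so that the smallest value sits at the head of the heap.
    val byHead = Ordering.by[(T, Iterator[T]), T](_._1)(ord).reverse
    val heap = mutable.PriorityQueue.empty[(T, Iterator[T])](byHead)

    // Seed the heap with the first element of every non-empty run.
    sortedRuns.foreach { run =>
      val it = run.iterator
      if (it.hasNext) heap.enqueue((it.next(), it))
    }

    val out = mutable.ArrayBuffer.empty[T]
    while (out.size < num && heap.nonEmpty) {
      val (value, it) = heap.dequeue()
      out += value
      // Refill from the run that just supplied the smallest element.
      if (it.hasNext) heap.enqueue((it.next(), it))
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val runs = Seq(Seq(2, 4), Seq(3, 12), Seq(1, 9))
    println(mergeTake(runs, 3)) // List(1, 2, 3)
  }
}
```

The merge touches only about num elements plus one per run, so its cost depends on num and the number of partitions rather than on the total number of candidates.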
