Spark Source Code Analysis: topN

Latest recommended article published on 2022-06-09 21:15:54

会飞的犬良

Categories: Spark, Big Data. Tags: spark, source code, topN

Article link: https://blog.csdn.net/things_use/article/details/105612783

Introduction

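The top operator returns the N largest elements of an RDD under a given ordering. Does getting the top N require a strict sort of the whole dataset? It does not, and avoiding a full sort is exactly what makes this operator efficient.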
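To make the "no full sort" idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It keeps only n candidates in a bounded heap while scanning the input once. The object name TopNSketch and the helper topN are hypothetical names chosen for illustration; this is one common way to get the top N, not a copy of Spark's implementation.

```scala
import scala.collection.mutable

object TopNSketch {
  // Hypothetical helper (not part of Spark): returns the n largest
  // elements of `items` in descending order.
  def topN[T](items: Iterator[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
    if (n <= 0) {
      Nil
    } else {
      // mutable.PriorityQueue dequeues its "largest" element first, so
      // reversing the ordering gives a min-heap: head is the smallest kept element.
      val heap = mutable.PriorityQueue.empty[T](ord.reverse)
      items.foreach { x =>
        if (heap.size < n) {
          heap.enqueue(x)
        } else if (ord.gt(x, heap.head)) {
          // x beats the smallest kept element: evict it and keep x instead.
          heap.dequeue()
          heap.enqueue(x)
        }
      }
      // Only the (at most n) survivors are sorted, never the whole input.
      heap.toList.sorted(ord.reverse)
    }
  }
}
```

For example, TopNSketch.topN(Iterator(5, 1, 9, 3, 7), 3) yields List(9, 7, 5). The heap never holds more than n elements, so the cost is roughly O(m log n) for m input elements instead of O(m log m) for a full sort.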

Understanding the Source

def top(num: Int): JList[T] = {
  val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[T]]
  top(num, comp)
}

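The method takes an integer argument, creates a default comparator comp (the natural ordering), and delegates to the two-argument top overload. Following that call leads to RDD.top, which in turn calls the RDD's takeOrdered function with the ordering reversed: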

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= collectionUtils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}

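takeOrdered is built on the mapPartitions operator. This tells us that it first computes the candidate elements inside each partition, then merges the per-partition results, and sorts only the final num elements.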
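The two-phase flow can be imitated in plain Scala. In the sketch below, partitions are modeled as ordinary sequences, and localSmallest is a hypothetical stand-in for the combination of BoundedPriorityQueue and collectionUtils.takeOrdered. All names in the object are assumptions made for illustration, and nothing here calls into Spark.

```scala
import scala.collection.mutable

object PartitionedTakeOrdered {
  // Hypothetical stand-in for Spark's bounded queue: keeps at most `num`
  // smallest elements (by `ord`) seen in `items`, in no particular order.
  def localSmallest[T](items: Iterator[T], num: Int)(implicit ord: Ordering[T]): List[T] = {
    // Max-heap under `ord`: head is the largest kept element, the next to evict.
    val heap = mutable.PriorityQueue.empty[T](ord)
    if (num > 0) {
      items.foreach { x =>
        if (heap.size < num) heap.enqueue(x)
        else if (ord.lt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
      }
    }
    heap.toList
  }

  // Partitions are plain sequences here; no cluster is involved.
  def takeOrdered[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): List[T] =
    if (num <= 0 || partitions.isEmpty) Nil
    else partitions
      .map(p => localSmallest(p.iterator, num))                // per-partition step
      .reduce((a, b) => localSmallest((a ++ b).iterator, num)) // pairwise merge step
      .sorted(ord)                                             // sort at most num elements

  // top is takeOrdered under the reversed ordering.
  def top[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): List[T] =
    takeOrdered(partitions, num)(ord.reverse)
}
```

With partitions Seq(Seq(5, 1, 9), Seq(3, 7), Seq(8, 2)), top(partitions, 3) gives List(9, 8, 7) and takeOrdered(partitions, 3) gives List(1, 2, 3). Each partition contributes at most num elements to the merge, so the amount of data combined at the end stays small no matter how large the partitions are.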

The key piece is the collectionUtils.takeOrdered method, which calls ordering.leastOf(input.asJava, num). The body of that method is what matters most:

public <E extends T> List<E> leastOf(Iterator<E> elements, int k) {
  checkNotNull(e
