Spark Action Operators

Latest recommended article published on 2024-04-10 22:24:15

reddy_Hu


Tags: spark

Article link: https://blog.csdn.net/reddy_Hu/article/details/107922710


count

Return the number of elements in the RDD.

This operator counts how many records there are across all partitions. Because it calls runJob under the hood, it is an action.

package com.doit.spark.day05

import org.apache.spark.{SparkConf, SparkContext}

object Count {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val nums = sc.parallelize(List(1, 1, 2, 2, 2, 3, 4, 5, 4, 3, 2, 4, 2, 5), 4)
    // Call count on the RDD. With four partitions the result is 3 + 4 + 3 + 4 = 14.
    val c: Long = nums.count()
    println(c)
    // Hand-written version of what the source does: run a job that counts
    // the elements of each partition, then sum the counts on the driver.
    val counts: Array[Int] = sc.runJob(nums, (it: Iterator[Int]) => {
      var i = 0
      while (it.hasNext) {
        it.next()
        i += 1
      }
      i
    })
    println(counts.sum)
    sc.stop()
  }
}

top

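top(n) returns the n largest elements of the RDD in descending order. The example here takes the top two.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DistinctDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val nums: RDD[Int] = sc.parallelize(List(1, 8, 3, 6, 2, 9, 5, 4, 7), 2)
    // The two largest elements, largest first: 9, 8
    val r: Array[Int] = nums.top(2)
    println(r.mkString(", "))
    sc.stop()
  }
}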
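Since top takes an implicit Ordering, a different ordering can also be passed explicitly. The following is a minimal sketch, assuming Spark is on the classpath; the object name, helper names, and the sample score data are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TopWithOrdering {
  // Reversing the ordering makes top return the two smallest values.
  def smallestTwo(nums: RDD[Int]): Array[Int] =
    nums.top(2)(Ordering[Int].reverse)

  // Rank pairs by their second field and keep the two highest.
  def highestScores(scores: RDD[(String, Int)]): Array[(String, Int)] =
    scores.top(2)(Ordering.by[(String, Int), Int](_._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopWithOrdering").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val nums = sc.parallelize(List(1, 8, 3, 6, 2, 9, 5, 4, 7), 2)
    println(smallestTwo(nums).mkString(", "))

    // Hypothetical sample data, not from the original article.
    val scores = sc.parallelize(Seq(("a", 3), ("b", 7), ("c", 5)), 2)
    println(highestScores(scores).mkString(", "))

    sc.stop()
  }
}
```

With the reversed ordering the first call yields 1 and 2; the second yields the pairs with scores 7 and 5.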

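Source code

A custom ordering can be supplied. Each partition first keeps only its own top N elements in a bounded queue, created with new BoundedPriorityQueue[T](num)(ord.reverse).

The per-partition queues are then merged:

mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)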

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }


def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty
      } else {
        mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1
        }.toArray.sorted(ord)
      }
    }
  }
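The same "top N per partition, then merge" idea can be imitated without Spark. This is only a sketch: it uses scala.collection.mutable.PriorityQueue as a stand-in for Spark's BoundedPriorityQueue, models partitions as plain Seq values, and the names TopNSketch, topOfPartition, and top are made up for illustration.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep at most `num` elements of one partition. The heap is ordered by
  // ord.reverse, so dequeue() removes the smallest element under `ord`.
  def topOfPartition[T](items: Iterator[T], num: Int)(implicit ord: Ordering[T]): mutable.PriorityQueue[T] = {
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    items.foreach { x =>
      heap.enqueue(x)
      if (heap.size > num) heap.dequeue() // evict the current smallest
    }
    heap
  }

  // Merge the per-partition heaps pairwise, keeping each merged heap bounded,
  // then sort the survivors from largest to smallest.
  def top[T](partitions: Seq[Seq[T]], num: Int)(implicit ord: Ordering[T]): Seq[T] = {
    if (num <= 0 || partitions.isEmpty) {
      Seq.empty
    } else {
      val merged = partitions
        .map(p => topOfPartition(p.iterator, num))
        .reduce { (q1, q2) =>
          q2.foreach { x =>
            q1.enqueue(x)
            if (q1.size > num) q1.dequeue()
          }
          q1
        }
      merged.iterator.toList.sorted(ord.reverse)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same data as the TopN example, split the way two partitions would hold it.
    val partitions = Seq(Seq(1, 8, 3, 6), Seq(2, 9, 5, 4, 7))
    println(top(partitions, 2).mkString(", ")) // prints "9, 8"
  }
}
```

Each partition contributes at most N candidates, so the merge step never handles more than 2N elements at a time, no matter how large the partitions are.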

take

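take(n) returns the first n elements of the RDD, reading from partition 0 onward. It first runs a job on a single partition; if that does not yield n elements, it runs further jobs over more partitions, so one take call can trigger several jobs depending on how many partitions the requested data spans.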
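A small sketch of take, assuming Spark is on the classpath. It reuses the nine-element list from the top example; the object name TakeDemo and the helper firstN are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeDemo {
  // Return the first n elements of a two-partition RDD.
  def firstN(sc: SparkContext, n: Int): Array[Int] = {
    val nums = sc.parallelize(List(1, 8, 3, 6, 2, 9, 5, 4, 7), 2)
    nums.take(n)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TakeDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Partition 0 holds 1, 8, 3, 6, so three elements come from it alone.
    println(firstN(sc, 3).mkString(", "))
    // Six elements need partition 1 as well.
    println(firstN(sc, 6).mkString(", "))

    sc.stop()
  }
}
```

The elements come back in their original order, because take reads partitions in index order rather than sorting.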

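max and min

These return the largest and the smallest element of the RDD.

Under the hood both call reduce. For max, this is equivalent to:

val i: Int = nums.reduce((a, b) => Math.max(a, b))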
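To compare the built-in calls with their reduce equivalents, here is a minimal sketch, assuming Spark is on the classpath. The object name MaxMinDemo and the two helper methods are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MaxMinDemo {
  // Hand-written equivalents of max and min built on reduce.
  def maxViaReduce(nums: RDD[Int]): Int = nums.reduce((a, b) => math.max(a, b))
  def minViaReduce(nums: RDD[Int]): Int = nums.reduce((a, b) => math.min(a, b))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MaxMinDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val nums = sc.parallelize(List(1, 8, 3, 6, 2, 9, 5, 4, 7), 2)

    // The built-in actions and the reduce versions should agree.
    println(s"max: ${nums.max()} / ${maxViaReduce(nums)}")
    println(s"min: ${nums.min()} / ${minViaReduce(nums)}")

    sc.stop()
  }
}
```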
