Spark: Two Ways to Get the Top N per Group

yj2434

Article link: https://blog.csdn.net/yj2434/article/details/109366505

Input data:

class1 90
class2 56
class1 87
class1 76
class2 88
class1 95
class1 74
class2 87
class2 67
class2 77

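import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

/**
 * Grouped top-N: once the data is grouped by key, there are two ways to take
 * the top N values of each group:
 * 1. Sort the group with the standard collection sort. Drawback: a large group
 *    takes a lot of Executor memory and may cause an Executor OOM.
 * 2. Keep a fixed-length array of the N largest values seen so far.
 *
 * In Spark's mapPartitionsWithIndex, avoid calling toList on the iterator where possible.
 * Reason: toList caches every record in memory, which easily leads to an OutOfMemoryError.
 * An iterator is processed as a stream: the next record is read only after the current one
 * has been handled, records already read are discarded, and the iterator cannot be reused.
 * iterator.toList instead holds on to all records so that they can be traversed again.
 */
object GroupTopN {
  def main(args: Array[String]): Unit = {
    val context: SparkContext = new SparkContext(new SparkConf()
      .setAppName("group top to N")
      .setMaster("local"))
    val lines: RDD[String] = context.textFile("T:/code/spark_scala/data/score.txt")

    // Parse each line once: "class1 90" -> ("class1", 90)
    val pairs: RDD[(String, Int)] = lines.map { line =>
      val fields = line.split(" ")
      (fields(0), fields(1).toInt)
    }

    /**
     * Approach 1: standard collection sort. The whole group is copied into a List
     * and sorted, which can use a lot of Executor memory and may cause an OOM.
     */
    pairs.groupByKey()
      .map { case (cls, values) => (cls, values.toList.sorted(Ordering.Int.reverse)) }
      .foreach { case (cls, sorted) =>
        println(s"class = $cls,scores = ${sorted.take(3)}")
      }

    /**
     * Approach 2: fixed-length array. At most 3 scores are held in the array at any time,
     * so no sorted copy of the group is built; recommended for large data volumes.
     * Note that groupByKey itself still collects each group's values on one Executor.
     */
    pairs.groupByKey()
      .foreach { case (cls, values) =>
        val top = new Array[Int](3) // descending; only the first `filled` slots are valid
        var filled = 0              // explicit count, so a score of 0 is not mistaken for an empty slot
        for (score <- values) {
          // Find the first slot whose score is smaller than the new one.
          var pos = 0
          while (pos < filled && top(pos) >= score) pos += 1
          if (pos < top.length) {
            // Shift smaller scores one slot to the right; the last one drops out when the array is full.
            for (j <- math.min(filled, top.length - 1) until pos by -1) {
              top(j) = top(j - 1)
            }
            top(pos) = score
            if (filled < top.length) filled += 1
          }
        }
        println(s"class = $cls,scores = ${top.take(filled).toList}")
      }

    context.stop()
  }
}

Result (each approach prints the same two lines):

class = class1,scores = List(95, 90, 87)
class = class2,scores = List(88, 87, 77)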
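
The note about streaming iterators can be tried out without a cluster. The block below is only a minimal sketch in plain Scala, not part of the original program: the names `TopNSketch`, `insertTop`, `mergeTop` and `topNOf` are made up for illustration. It keeps at most n scores while consuming an iterator in a single pass, and never calls `toList` on the input. The same pair of functions has the shape that Spark's `aggregateByKey` expects (one function to fold a value into a partial result, one to merge two partial results), which would be one way to avoid building whole groups with `groupByKey`; that wiring is shown only as a comment and is an assumption, not something the original code does.

```scala
object TopNSketch {
  // Insert one score into a descending list that holds at most n scores.
  def insertTop(top: List[Int], score: Int, n: Int): List[Int] = {
    val (larger, rest) = top.span(_ >= score)
    (larger ::: score :: rest).take(n)
  }

  // Merge two descending top-n lists into one (combines partial results).
  def mergeTop(a: List[Int], b: List[Int], n: Int): List[Int] =
    b.foldLeft(a)((acc, s) => insertTop(acc, s, n))

  // Single pass over the iterator; the accumulator never holds more than n scores.
  def topNOf(scores: Iterator[Int], n: Int): List[Int] =
    scores.foldLeft(List.empty[Int])((acc, s) => insertTop(acc, s, n))

  def main(args: Array[String]): Unit = {
    val class1 = Iterator(90, 87, 76, 95, 74)
    val class2 = Iterator(56, 88, 87, 67, 77)

    println(topNOf(class1, 3)) // List(95, 90, 87)
    println(topNOf(class2, 3)) // List(88, 87, 77)

    // An iterator is single-pass: after the fold it is exhausted.
    println(class1.hasNext) // false

    // Hypothetical Spark wiring (assumes pairs: RDD[(String, Int)]):
    //   pairs.aggregateByKey(List.empty[Int])(insertTop(_, _, 3), mergeTop(_, _, 3))
  }
}
```

The immutable `List` accumulator is chosen for simplicity; for a small N such as 3 the cost of rebuilding it on each insert is negligible.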
