Spark Window Functions

Where window functions can be used:
1. Spark SQL
2. Hive SQL
3. JDBC

1: Definition

The biggest difference between a window function and GroupBy is that GroupBy produces a single result per group, while a window function produces a result for every row.

In other words, a window function takes the current row and computes a statistic over the group that row belongs to.
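
As a minimal sketch of the contrast (assuming a DataFrame named source with product, category, and revenue columns, as in the example at the end of this post):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// groupBy: one output row per category
source.groupBy('category).agg(avg('revenue) as "avg_revenue")

// Window function: every input row is kept, each annotated with its group's statistic
val w = Window.partitionBy('category)
source.select('product, 'category, 'revenue, avg('revenue) over w as "avg_revenue")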

2: How a Window Function Works

[Figure: execution logic of a window function]

3: Anatomy of a Window Function

Syntactically, a window function consists of two parts:
dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank

* The function itself: dense_rank()

* The window definition: PARTITION BY category ORDER BY revenue DESC

4: Types of Window Functions

Going by the examples below, window functions fall into three broad categories: ranking functions (rank, dense_rank, row_number, ntile), analytic functions (first_value, last_value, lag, lead), and ordinary aggregate functions (sum, avg, etc.) applied over a window.

5: Example Code

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.junit.Test

class demo05 {
  val spark: SparkSession = SparkSession.builder().master("local[6]").appName("aggregation").getOrCreate()
  import spark.implicits._
  // Create the dataset
  val df: DataFrame = Seq(("aa", 80), ("aa", 70), ("aa", 70), ("aa", 60), ("bb", 70), ("bb", 80), ("cc", 70), ("cc", 50), ("cc", 50), ("cc", 90), ("cc", 100))
    .toDF("name", "score")
  // Register a temporary view
  df.createTempView("class")

  @Test
  def rank(): Unit ={
    // Test the rank window function
    spark.sql(
      """
        |select name, score,
        |  rank() over(partition by name order by score desc) as number
        |from class
      """.stripMargin).show()
  }

  @Test
  def dense_rank(): Unit ={
    // Test the dense_rank window function
    spark.sql(
      """
        |select name, score,
        |  dense_rank() over(partition by name order by score desc) as number
        |from class
      """.stripMargin).show()
  }

  @Test
  def row_number(): Unit ={
    // Test the row_number window function
    spark.sql(
      """
        |select name, score,
        |  row_number() over(partition by name order by score desc) as number
        |from class
      """.stripMargin).show()
  }
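
  // For the "aa" partition (scores 80, 70, 70, 60 in descending order), the three
  // ranking functions above produce:
  //   row_number: 1, 2, 3, 4   -- always unique, ties broken arbitrarily
  //   rank:       1, 2, 2, 4   -- ties share a rank; the next rank skips ahead
  //   dense_rank: 1, 2, 2, 3   -- ties share a rank; no gaps afterwards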

  @Test
  def ntile(): Unit ={
    // Test the ntile window function: split each partition into 2 roughly equal buckets
    spark.sql(
      """
        |select name, score,
        |  ntile(2) over(partition by name order by score desc) as ntile
        |from class
      """.stripMargin).show()
  }

  @Test
  def first_value(): Unit ={
    // Test the first_value window function
    spark.sql(
      """
        |select name, score,
        |  first_value(score) over(partition by name order by score desc) as first
        |from class
      """.stripMargin).show()
  }

  @Test
  def last_value(): Unit ={
    // Test the last_value window function
    spark.sql(
      """
        |select name, score,
        |  last_value(score) over(partition by name order by score desc) as last
        |from class
      """.stripMargin).show()
  }
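
  // Note (added sketch): with an ORDER BY but no explicit frame, the frame defaults
  // to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so last_value above
  // returns the current row's score rather than the partition's true last value.
  // Widening the frame explicitly gives the intended result:
  @Test
  def last_value_full_frame(): Unit = {
    spark.sql(
      """
        |select name, score,
        |  last_value(score) over(
        |    partition by name order by score desc
        |    rows between unbounded preceding and unbounded following
        |  ) as last
        |from class
      """.stripMargin).show()
  }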

  @Test
  def lag(): Unit ={
    // Test the lag window function: the score from 2 rows earlier in the partition
    spark.sql(
      """
        |select name, score,
        |  lag(score, 2) over(partition by name order by score desc) as lag
        |from class
      """.stripMargin).show()
  }

  @Test
  def lead(): Unit ={
    // Test the lead window function: the score from 2 rows later in the partition
    spark.sql(
      """
        |select name, score,
        |  lead(score, 2) over(partition by name order by score desc) as lead
        |from class
      """.stripMargin).show()
  }

}

6: Two Ways to Use Window Functions

1. In a SQL statement
SELECT
  product,
  category,
  revenue
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
  FROM productRevenue) tmp
WHERE
  rank <= 2
The complete syntax of a window function in SQL is as follows:
function OVER (PARTITION BY ... ORDER BY ... FRAME_TYPE BETWEEN ... AND ...)
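
The FRAME_TYPE clause bounds which rows of the partition the function sees. As a hedged sketch (reusing the productRevenue table from the example above), a running total of revenue within each category:

SELECT
  product,
  category,
  revenue,
  sum(revenue) OVER (
    PARTITION BY category
    ORDER BY revenue DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM productRevenue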

2. Via the DataFrame API
val window: WindowSpec = Window.partitionBy('category)
  .orderBy('revenue.desc)

source.select('product, 'category, 'revenue, dense_rank() over window as "rank")
  .where('rank <= 2)
  .show()
WindowSpec: the window descriptor, describing what the window should look like
dense_rank() over window: apply the dense_rank() function over each window
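
A frame can be attached to a WindowSpec as well; a sketch (same assumed source DataFrame) computing a per-category running total with rowsBetween:

val runningWindow: WindowSpec = Window.partitionBy('category)
  .orderBy('revenue.desc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

source.select('product, 'category, 'revenue, sum('revenue) over runningWindow as "running_total")
  .show()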
