Window Functions in Spark SQL

An overview of window functions from the Databricks blog:

Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions. The available ranking functions and analytic functions are summarized in the table below. For aggregate functions, users can use any existing aggregate function as a window function.

There are three kinds of window functions:

  1. ranking functions
  2. analytic functions
  3. aggregate functions

The ranking and analytic functions are listed in the table below; any existing aggregate function (sum, avg, max, min, ...) can be used as a window function.

| Function Type | SQL | DataFrame API |
|--|--|--|
| Ranking | rank | rank |
| Ranking | dense_rank | denseRank |
| Ranking | percent_rank | percentRank |
| Ranking | ntile | ntile |
| Ranking | row_number | rowNumber |
| Analytic | cume_dist | cumeDist |
| Analytic | first_value | firstValue |
| Analytic | last_value | lastValue |
| Analytic | lag | lag |
| Analytic | lead | lead |
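
In SQL, every window function call has the same shape: the function, followed by an `over` clause that names the partitioning, the ordering, and optionally a frame. A schematic with placeholder names in angle brackets (not runnable as-is):

    <function>(<args>) over (
        partition by <columns>            -- split the rows into independent partitions
        order by <columns>                -- ordering inside each partition
        rows between <start> and <end>    -- optional frame relative to the current row (or "range between")
    )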

Let's walk through an example first.

Sample data in /root/score.json: student name, lesson, and score.

{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
./spark-shell --master local #本地启动spark-shell

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.hive.HiveContext


    sc.setLogLevel("WARN") // reduce log noise; optional
    val hiveContext = new HiveContext(sc)

    val df = hiveContext.read.json("file:///root/score.json")
    case class Score(name: String, lesson: String, score: Int) // record schema, for reference (not used below)

    df.registerTempTable("score") // register the DataFrame as a temporary table

    // the SQL statement: several ranking functions over the same window
    val stat = """select
                 |  name, lesson, score,
                 |  ntile(2) over (partition by lesson order by score desc) as ntile_2,
                 |  ntile(3) over (partition by lesson order by score desc) as ntile_3,
                 |  row_number() over (partition by lesson order by score desc) as row_number,
                 |  rank() over (partition by lesson order by score desc) as rank,
                 |  dense_rank() over (partition by lesson order by score desc) as dense_rank,
                 |  percent_rank() over (partition by lesson order by score desc) as percent_rank
                 |from score
                 |order by lesson, name, score""".stripMargin

    hiveContext.sql(stat).show // run the statement and show the result

/**
 * The same query, written with the DataFrame API.
 **/
    val window_spec = Window.partitionBy("lesson").orderBy(df("score").desc) // window spec shared by all the calls below

    df.select(df("name"), df("lesson"), df("score"),
      ntile(2).over(window_spec).as("ntile_2"),
      ntile(3).over(window_spec).as("ntile_3"),
      row_number().over(window_spec).as("row_number"),
      rank().over(window_spec).as("rank"),
      dense_rank().over(window_spec).as("dense_rank"),
      percent_rank().over(window_spec).as("percent_rank")
    ).orderBy("lesson", "name", "score").show


  • Both versions produce exactly the same output, shown in the table below.

| name | lesson | score | ntile_2 | ntile_3 | row_number | rank | dense_rank | percent_rank |
|--|--|--|--|--|--|--|--|--|
| A | E | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
| B | E | 99 | 1 | 1 | 2 | 2 | 2 | 0.3333333333333333 |
| C | E | 99 | 2 | 2 | 3 | 2 | 2 | 0.3333333333333333 |
| D | E | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
| A | Math | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
| B | Math | 100 | 1 | 1 | 2 | 1 | 1 | 0.0 |
| C | Math | 99 | 2 | 2 | 3 | 3 | 2 | 0.6666666666666666 |
| D | Math | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
  • rank: tied values share a rank, so ranks can have gaps (e.g. 1, 2, 2, 4)
  • dense_rank: tied values share a rank, but ranks are always consecutive (e.g. 1, 2, 2, 3)
  • row_number: a plain row number, like the row number in Excel; ties never cause repeats or gaps
  • percent_rank = (rank - 1) / (rows in the partition - 1); e.g. for C in Math: (3 - 1) / (4 - 1) = 0.6666...
  • ntile(n) deals a partition's rows round-robin into n buckets and returns the bucket index, starting from 1
  • Comparing with the Python DataFrame API calls in the official blog post (see the references at the bottom), the Scala calls are almost identical.
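
The analytic functions in the table above (lag, lead, cume_dist, ...) plug into the same window spec. A minimal sketch, assuming Spark 1.6+, where these snake_case names exist (on older versions cume_dist was called cumeDist):

    // lag/lead look one row backwards/forwards inside the window;
    // cume_dist is the fraction of partition rows ordered at or before the current row
    df.select(df("name"), df("lesson"), df("score"),
      lag(df("score"), 1).over(window_spec).as("prev_score"),  // null on each partition's first row
      lead(df("score"), 1).over(window_spec).as("next_score"), // null on each partition's last row
      cume_dist().over(window_spec).as("cume_dist")
    ).orderBy("lesson", "name", "score").show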

In the examples above, every row of a partition participated in the window computation. Now consider two scenarios:

  1. How far is each score from the highest, lowest, and average score of its lesson? Each row is compared against an aggregate over its whole partition, so the frame is absolute: all rows of the partition the row belongs to.
  2. After sorting each lesson's scores from high to low, how far is each score from the one ranked just above it? Each row is compared against its previous row, so the frame is relative to the current row.
    // how far each score is from the lesson's highest, lowest, and average score,
    // and, after sorting descending, how far each score is from the previous rank
    val window_clause = Window.partitionBy(df("lesson")).orderBy(df("score").desc)
    val window_spec2 = window_clause.rangeBetween(-Int.MaxValue, Int.MaxValue) // absolute frame: the whole partition
    val window_spec3 = window_clause.rowsBetween(-1, 0) // relative frame: the previous row and the current row
    df.select(
      df("name"),
      df("lesson"),
      df("score"),
      // current score minus the first score in the frame (the previous row's
      // score; a partition's first row sees only itself, so the gap is 0)
      (df("score") - first("score").over(window_spec3)).as("score-last_score"),

      // gap to the lesson's lowest, highest, and average score
      (min(df("score")).over(window_spec2)).as("min_score"),
      (df("score") - min(df("score")).over(window_spec2)).as("score-min"),
      (max(df("score")).over(window_spec2)).as("max_score"),
      (df("score") - max(df("score")).over(window_spec2)).as("score-max"),
      (avg(df("score")).over(window_spec2)).as("avg_score"),
      (df("score") - avg(df("score")).over(window_spec2)).as("score-avg")
    ).
      orderBy("lesson", "name", "score").
      show

| name | lesson | score | score-last_score | min_score | score-min | max_score | score-max | avg_score | score-avg |
|--|--|--|--|--|--|--|--|--|--|
| A | E | 100 | 0 | 98 | 2 | 100 | 0 | 99.0 | 1.0 |
| B | E | 99 | -1 | 98 | 1 | 100 | -1 | 99.0 | 0.0 |
| C | E | 99 | 0 | 98 | 1 | 100 | -1 | 99.0 | 0.0 |
| D | E | 98 | -1 | 98 | 0 | 100 | -2 | 99.0 | -1.0 |
| A | Math | 100 | 0 | 98 | 2 | 100 | 0 | 99.25 | 0.75 |
| B | Math | 100 | 0 | 98 | 2 | 100 | 0 | 99.25 | 0.75 |
| C | Math | 99 | -1 | 98 | 1 | 100 | -1 | 99.25 | -0.25 |
| D | Math | 98 | -1 | 98 | 0 | 100 | -2 | 99.25 | -1.25 |
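
A variant sketch: the same "gap to the previous rank" column can also be written with lag instead of first over a rowsBetween(-1, 0) frame; the difference is that lag yields null (rather than 0) on each partition's first row:

    // same column as "score-last_score", written with lag over the unframed window_clause
    df.select(df("name"), df("lesson"), df("score"),
      (df("score") - lag(df("score"), 1).over(window_clause)).as("score-prev_score")
    ).orderBy("lesson", "name", "score").show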

To be continued:

  • a closer look at the analytic functions
  • a walkthrough of the source code

References:

  1. percent_rank
  2. Databricks blog

Reposted from: https://my.oschina.net/corleone/blog/755393
