Spark SQL: window functions and the corresponding APIs

1. Kinds of window functions

  1. ranking
  2. analytic
  3. aggregate
| Function Type | SQL | DataFrame API | Description |
| --- | --- | --- | --- |
| Ranking | rank | rank | rank values may have gaps after ties |
| Ranking | dense_rank | denseRank | rank values are always consecutive |
| Ranking | percent_rank | percentRank | (rank - 1) / (count(score) - 1) within the partition |
| Ranking | ntile | ntile | splits the ordered partition into n buckets of as equal size as possible and returns the bucket index, starting from 1 |
| Ranking | row_number | rowNumber | a plain sequential row number, like a row number in Excel |
| Analytic | cume_dist | cumeDist | fraction of rows in the partition whose value is less than or equal to the current row's |
| Analytic | first_value | firstValue | first value in the window frame |
| Analytic | last_value | lastValue | last value in the window frame |
| Analytic | lag | lag | value from the row n rows before the current row |
| Analytic | lead | lead | value from the row n rows after the current row |
| Aggregate | min | min | minimum |
| Aggregate | max | max | maximum |
| Aggregate | sum | sum | sum |
| Aggregate | avg | avg | average |
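To build intuition for how rank, dense_rank, and percent_rank relate, here is a plain-Python sketch that mimics what Spark computes within one ordered partition (a simulation for illustration, not Spark's actual implementation):

```python
# Simulate Spark's ranking window functions for a single partition,
# ordered descending (like ORDER BY score DESC).
def ranking(scores):
    ordered = sorted(scores, reverse=True)
    distinct = sorted(set(ordered), reverse=True)
    n = len(ordered)
    rows = []
    for i, s in enumerate(ordered):
        row_number = i + 1                      # plain sequential number
        rank = ordered.index(s) + 1             # ties share the first position, gaps follow
        dense_rank = distinct.index(s) + 1      # ties share a position, no gaps
        percent_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        rows.append((s, row_number, rank, dense_rank, percent_rank))
    return rows

for row in ranking([100, 100, 99, 98]):
    print(row)
```

With the two tied 100s, rank jumps from 1 to 3 while dense_rank goes 1, 1, 2, 3 with no gap, which is exactly the difference the table above describes.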

2. Usage

count(...) over(partition by ... order by ...) -- row count within the partition.
sum(...) over(partition by ... order by ...) -- sum within the partition.
max(...) over(partition by ... order by ...) -- maximum within the partition.
min(...) over(partition by ... order by ...) -- minimum within the partition.
avg(...) over(partition by ... order by ...) -- average within the partition.
rank() over(partition by ... order by ...) -- rank; values may have gaps after ties.
dense_rank() over(partition by ... order by ...) -- rank; values are always consecutive.
first_value(...) over(partition by ... order by ...) -- first value in the window frame.
last_value(...) over(partition by ... order by ...) -- last value in the window frame.
lag(...) over(partition by ... order by ...) -- value from n rows before the current row.
lead(...) over(partition by ... order by ...) -- value from n rows after the current row.
ratio_to_report(...) over(partition by ...) -- the argument is the numerator and the over() partition's total is the denominator (an Oracle function; not available in Spark SQL).
percent_rank() over(partition by ... order by ...) -- (rank - 1) / (rows in partition - 1).
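The lag/lead semantics listed above can be sketched in plain Python (a simulation for intuition, not Spark itself): each looks a fixed number of rows back or ahead within the ordered partition, and positions that fall outside the partition yield null.

```python
# lag(col, n): value n rows before the current row; None when out of range.
def lag(values, n=1):
    return [values[i - n] if i - n >= 0 else None for i in range(len(values))]

# lead(col, n): value n rows after the current row; None when out of range.
def lead(values, n=1):
    return [values[i + n] if i + n < len(values) else None for i in range(len(values))]

scores = [100, 100, 99, 98]          # one partition, already ordered
print(lag(scores))   # [None, 100, 100, 99]
print(lead(scores))  # [100, 99, 98, None]
```

A typical use is computing the difference between a row and its predecessor, e.g. `score - lag(score, 1) over(...)` for day-over-day deltas.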

3. A worked example

Sample data (/root/score.json/score.json): student name, lesson, score.

{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
select
name,lesson,score,
ntile(2) over (partition by lesson order by score desc ) as ntile_2,
ntile(3) over (partition by lesson order by score desc ) as ntile_3,
row_number() over (partition by lesson order by score desc ) as row_number,
rank() over (partition by lesson order by score desc ) as rank,
dense_rank() over (partition by lesson order by score desc ) as dense_rank, 
percent_rank() over (partition by lesson order by score desc ) as percent_rank 
from score 
order by lesson,name,score

The query produces the output shown in the table below.

| name | lesson | score | ntile_2 | ntile_3 | row_number | rank | dense_rank | percent_rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | E | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
| B | E | 99 | 1 | 1 | 2 | 2 | 2 | 0.3333333333333333 |
| C | E | 99 | 2 | 2 | 3 | 2 | 2 | 0.3333333333333333 |
| D | E | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
| A | Math | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
| B | Math | 100 | 1 | 1 | 2 | 1 | 1 | 0.0 |
| C | Math | 99 | 2 | 2 | 3 | 3 | 2 | 0.6666666666666666 |
| D | Math | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
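The ntile_2 and ntile_3 columns above follow the bucket-splitting rule described earlier: the ordered partition is cut into n contiguous buckets of near-equal size, with earlier buckets taking the extra rows when the count does not divide evenly. A plain-Python sketch of that rule (an illustration, not Spark's internal code):

```python
# ntile(n) over an ordered partition of num_rows rows: return the
# bucket index (starting at 1) for each row in order. Earlier buckets
# get one extra row when num_rows does not divide evenly by n.
def ntile(num_rows, n):
    base, extra = divmod(num_rows, n)
    out = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out

print(ntile(4, 2))  # [1, 1, 2, 2]  -> the ntile_2 column per lesson
print(ntile(4, 3))  # [1, 1, 2, 3]  -> the ntile_3 column per lesson
```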


Reposted from the blog 程序媛说事儿; original article: https://www.cnblogs.com/abc8023/p/10910741.html
