Hive中分析型函数的那些骚操作

行列互转操作

列转行是一列拆成多行
idnums
1001A,B,AC,AB,AE

将上面原表的形式转换成

idnums
1001A
1001B
1001AC
1001AB
1001AE

这个过程使用的是lateral view explode(split(column,’,’)) num
select a.id,num from table a lateral view explode(split(a.nums,’,’)) n as num ;
如果是将上面的过程逆转过来就是
select a.id,concat_ws(’,’,collect_list(a.num)) as nums from table a group by a.id ;

窗口函数

窗口大小设置
窗口大小设置是使用 ROWS n PRECEDING AND n FOLLOWING / ROWS n PRECEDING AND CURRENT ROW / ROWS CURRENT ROW AND FOLLOWING 这三种设置方式
其中n PRECEDING是当前行的前n行,n FOLLOWING是当前行的后n行加入计算,CURRENT ROW 是表示当前行 ,如果不写这个窗口的大小默认是对整个窗口进行计算的。
比如
数据是这样的
±----±—±--------±----+
|film |user|pay_money|score|
±----±—±--------±----+
|A0006|Jack|40.6 |6.5 |
|A0002|Jack|32.5 |7.0 |
|A0005|Jack|31.5 |8.0 |
|A0001|Jack|42.6 |9.0 |
|A0004|Jack|42.6 |9.0 |
|A0001|Tom |33.5 |3.0 |
|A0003|Tom |33.5 |3.0 |
|A0002|Tom |42.6 |9.0 |
±----±—±--------±----+
使用设置窗口来求和
select a.film,a.user,SUM(a.score) over(partition by a.user ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) as sum_score from tb_film a
在上面sql中开出来的窗口就是按照user进行设置的,也就是说上面的数据会相当于开出两个窗口来,一个是
|A0006|Jack|40.6 |6.5 |
|A0002|Jack|32.5 |7.0 |
|A0005|Jack|31.5 |8.0 |
|A0001|Jack|42.6 |9.0 |
|A0004|Jack|42.6 |9.0 |

另一个是

|A0001|Tom |33.5 |3.0 |
|A0003|Tom |33.5 |3.0 |
|A0002|Tom |42.6 |9.0 |

 在这两个窗口中进行开窗函数中的求和操作,而窗口设置中又设置了求和边界ROWS BETOWEEN 1 PRECEDING AND CURRENT ROW其中n PRECEDING

是指当前行的前n行会加入计算,CURRENT ROW指的是当前行也就是说前一行和当前行会求和放入到当前数据的计算中。

在第一行数据计算出结果时就是0+6.5,因为第一行的上一行没有值所以是0,第二行就是6.5+7.0
同理第二个窗口中第一行的结果也是0+自身值。

开窗函数使用介绍
rank ,dense_rank  ,  row_number,count ,sum,avg,max,min,lead,lag,first_value,last_value,cume_dist,ntile,percent_rank

rank是具有排序功能的开窗函数,使用该函数将按照窗口里设置的partition by 分区,order by 进行排序,单独开出来的一列是排序序号
原始数据
±----±—±--------±----+
|film |user|pay_money|score|
±----±—±--------±----+
|A0006|Jack|40.6 |6.5 |
|A0002|Jack|32.5 |7.0 |
|A0005|Jack|31.5 |8.0 |
|A0004|Jack|42.6 |8.0 |
|A0001|Jack|42.6 |9.0 |
|A0001|Tom |33.5 |3.0 |
|A0003|Tom |33.5 |3.0 |
|A0002|Tom |42.6 |9.0 |
±----±—±--------±----+
经过sql:
select a.film,a.user,a.pay_money,a.score,rank() over(partition by a.user order by a.score asc ) as score_rank from tb_film a
查询结果数据
±----±—±--------±----±---------+
|film |user|pay_money|score|score_rank|
±----±—±--------±----±---------+
|A0006|Jack|40.6 |6.5 |1 |
|A0002|Jack|32.5 |7.0 |2 |
|A0005|Jack|31.5 |8.0 |3 |
|A0004|Jack|42.6 |8.0 |3 |
|A0001|Jack|42.6 |9.0 |5 |
|A0001|Tom |33.5 |3.0 |1 |
|A0003|Tom |33.5 |3.0 |1 |
|A0002|Tom |42.6 |9.0 |3 |
±----±—±--------±----±---------+

dense_rank() 也是用来排序的开窗函数,dense是密集的,稠密的意思,这个函数的排序后如果两个值相等会排序序号相等,下一个的序号是+1的值而不会像rank一样将该序号空出来。
原始数据
±----±—±--------±----+
|film |user|pay_money|score|
±----±—±--------±----+
|A0006|Jack|40.6 |6.5 |
|A0002|Jack|32.5 |7.0 |
|A0005|Jack|31.5 |8.0 |
|A0004|Jack|42.6 |8.0 |
|A0001|Jack|42.6 |9.0 |
|A0001|Tom |33.5 |3.0 |
|A0003|Tom |33.5 |3.0 |
|A0002|Tom |42.6 |9.0 |
±----±—±--------±----+
经过sql:
select a.film,a.user,a.pay_money,a.score,dense_rank() over(partition by a.user order by a.score asc ) as score_rank from tb_film a
查询结果
±----±—±--------±----±---------+
|film |user|pay_money|score|score_rank|
±----±—±--------±----±---------+
|A0006|Jack|40.6 |6.5 |1 |
|A0002|Jack|32.5 |7.0 |2 |
|A0005|Jack|31.5 |8.0 |3 |
|A0004|Jack|42.6 |8.0 |3 |
|A0001|Jack|42.6 |9.0 |4 |
|A0001|Tom |33.5 |3.0 |1 |
|A0003|Tom |33.5 |3.0 |1 |
|A0002|Tom |42.6 |9.0 |2 |
±----±—±--------±----±---------+

row_number() 这个函数也是用来排序的与rank() dense_rank()功能类似,不过对于相同的值不会产生相同的序号
原始数据
±----±—±--------±----+
|film |user|pay_money|score|
±----±—±--------±----+
|A0006|Jack|40.6 |6.5 |
|A0002|Jack|32.5 |7.0 |
|A0005|Jack|31.5 |8.0 |
|A0004|Jack|42.6 |8.0 |
|A0001|Jack|42.6 |9.0 |
|A0001|Tom |33.5 |3.0 |
|A0003|Tom |33.5 |3.0 |
|A0002|Tom |42.6 |9.0 |
±----±—±--------±----+
经过sql:
select a.film,a.user,a.pay_money,a.score,row_number() over(partition by a.user order by a.score asc ) as score_rank from tb_film a
查询结果
±----±—±--------±----±---------+
|film |user|pay_money|score|score_rank|
±----±—±--------±----±---------+
|A0006|Jack|40.6 |6.5 |1 |
|A0002|Jack|32.5 |7.0 |2 |
|A0005|Jack|31.5 |8.0 |3 |
|A0004|Jack|42.6 |8.0 |4 |
|A0001|Jack|42.6 |9.0 |5 |
|A0001|Tom |33.5 |3.0 |1 |
|A0003|Tom |33.5 |3.0 |2 |
|A0002|Tom |42.6 |9.0 |3 |
±----±—±--------±----±---------+

count 就是统计条数个数的函数
原始数据:同上
查询sql:select a.film,a.user,a.pay_money,a.score,count(1) over( partition by a.user ) as user_count from tb_film a
查询结果
±----±—±--------±----±---------+
|film |user|pay_money|score|user_count|
±----±—±--------±----±---------+
|A0006|Jack|40.6 |6.5 |5 |
|A0002|Jack|32.5 |7.0 |5 |
|A0005|Jack|31.5 |8.0 |5 |
|A0004|Jack|42.6 |8.0 |5 |
|A0001|Jack|42.6 |9.0 |5 |
|A0001|Tom |33.5 |3.0 |3 |
|A0003|Tom |33.5 |3.0 |3 |
|A0002|Tom |42.6 |9.0 |3 |
±----±—±--------±----±---------+

sum,avg,max,min 都是同类的函数在窗口里求和,平均值,最大值,最小值
=select a.film,a.user,a.pay_money,a.score,sum(a.score) over( partition by a.user ) as score_sum from tb_film a==
±----±—±--------±----±--------+
|film |user|pay_money|score|score_sum|
±----±—±--------±----±--------+
|A0006|Jack|40.6 |6.5 |38.5 |
|A0002|Jack|32.5 |7.0 |38.5 |
|A0005|Jack|31.5 |8.0 |38.5 |
|A0004|Jack|42.6 |8.0 |38.5 |
|A0001|Jack|42.6 |9.0 |38.5 |
|A0001|Tom |33.5 |3.0 |15.0 |
|A0003|Tom |33.5 |3.0 |15.0 |
|A0002|Tom |42.6 |9.0 |15.0 |
±----±—±--------±----±--------+

=select a.film,a.user,a.pay_money,a.score,avg(a.score) over( partition by a.user ) as score_avg from tb_film a==
±----±—±--------±----±--------+
|film |user|pay_money|score|score_avg|
±----±—±--------±----±--------+
|A0006|Jack|40.6 |6.5 |7.7 |
|A0002|Jack|32.5 |7.0 |7.7 |
|A0005|Jack|31.5 |8.0 |7.7 |
|A0004|Jack|42.6 |8.0 |7.7 |
|A0001|Jack|42.6 |9.0 |7.7 |
|A0001|Tom |33.5 |3.0 |5.0 |
|A0003|Tom |33.5 |3.0 |5.0 |
|A0002|Tom |42.6 |9.0 |5.0 |
±----±—±--------±----±--------+

=select a.film,a.user,a.pay_money,a.score,max(a.score) over( partition by a.user ) as score_max from tb_film a==
±----±—±--------±----±--------+
|film |user|pay_money|score|score_max|
±----±—±--------±----±--------+
|A0006|Jack|40.6 |6.5 |9.0 |
|A0002|Jack|32.5 |7.0 |9.0 |
|A0005|Jack|31.5 |8.0 |9.0 |
|A0004|Jack|42.6 |8.0 |9.0 |
|A0001|Jack|42.6 |9.0 |9.0 |
|A0001|Tom |33.5 |3.0 |9.0 |
|A0003|Tom |33.5 |3.0 |9.0 |
|A0002|Tom |42.6 |9.0 |9.0 |
±----±—±--------±----±--------+

=select a.film,a.user,a.pay_money,a.score,min(a.score) over( partition by a.user ) as score_min from tb_film a==
±----±—±--------±----±--------+
|film |user|pay_money|score|score_min|
±----±—±--------±----±--------+
|A0006|Jack|40.6 |6.5 |6.5 |
|A0002|Jack|32.5 |7.0 |6.5 |
|A0005|Jack|31.5 |8.0 |6.5 |
|A0004|Jack|42.6 |8.0 |6.5 |
|A0001|Jack|42.6 |9.0 |6.5 |
|A0001|Tom |33.5 |3.0 |3.0 |
|A0003|Tom |33.5 |3.0 |3.0 |
|A0002|Tom |42.6 |9.0 |3.0 |
±----±—±--------±----±--------+

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

=select a.film,a.user,a.pay_money,a.score,lag(a.score,1,0.0) over( partition by a.user order by a.score) as lag_1 from tb_film a==
±----±—±--------±----±----+
|film |user|pay_money|score|lag_1|
±----±—±--------±----±----+
|A0006|Jack|40.6 |6.5 |0.0 |
|A0002|Jack|32.5 |7.0 |6.5 |
|A0005|Jack|31.5 |8.0 |7.0 |
|A0004|Jack|42.6 |8.0 |8.0 |
|A0001|Jack|42.6 |9.0 |8.0 |
|A0001|Tom |33.5 |3.0 |0.0 |
|A0003|Tom |33.5 |3.0 |3.0 |
|A0002|Tom |42.6 |9.0 |3.0 |
±----±—±--------±----±----+

LEAD(col,n,default)用于统计窗口内往下第n行数据
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
和lag用法类似

=select a.film,a.user,a.pay_money,a.score,lead(a.score,1,0.0) over( partition by a.user order by a.score) as lead_1 from tb_film a==
±----±—±--------±----±-----+
|film |user|pay_money|score|lead_1|
±----±—±--------±----±-----+
|A0006|Jack|40.6 |6.5 |7.0 |
|A0002|Jack|32.5 |7.0 |8.0 |
|A0005|Jack|31.5 |8.0 |8.0 |
|A0004|Jack|42.6 |8.0 |9.0 |
|A0001|Jack|42.6 |9.0 |0.0 |
|A0001|Tom |33.5 |3.0 |3.0 |
|A0003|Tom |33.5 |3.0 |9.0 |
|A0002|Tom |42.6 |9.0 |0.0 |
±----±—±--------±----±-----+

first_value() , last_value() 取分组内排序后,截止到当前行,第一个值和最后一个值
注意last_value() 这里取得并不是这个分组中的最后一个值,而是截止到当前行最后一个值,其实也就是当前行上的值。
=select a.film,a.user,a.pay_money,a.score,first_value(a.score) over( partition by a.user order by a.score) as first_value from tb_film a==
±----±—±--------±----±----------+
|film |user|pay_money|score|first_value|
±----±—±--------±----±----------+
|A0006|Jack|40.6 |6.5 |6.5 |
|A0002|Jack|32.5 |7.0 |6.5 |
|A0005|Jack|31.5 |8.0 |6.5 |
|A0004|Jack|42.6 |8.0 |6.5 |
|A0001|Jack|42.6 |9.0 |6.5 |
|A0001|Tom |33.5 |3.0 |3.0 |
|A0003|Tom |33.5 |3.0 |3.0 |
|A0002|Tom |42.6 |9.0 |3.0 |
±----±—±--------±----±----------+

=select a.film,a.user,a.pay_money,a.score,last_value(a.score) over( partition by a.user order by a.score) as last_value from tb_film a==
±----±—±--------±----±---------+
|film |user|pay_money|score|last_value|
±----±—±--------±----±---------+
|A0006|Jack|40.6 |6.5 |6.5 |
|A0002|Jack|32.5 |7.0 |7.0 |
|A0005|Jack|31.5 |8.0 |8.0 |
|A0004|Jack|42.6 |8.0 |8.0 |
|A0001|Jack|42.6 |9.0 |9.0 |
|A0001|Tom |33.5 |3.0 |3.0 |
|A0003|Tom |33.5 |3.0 |3.0 |
|A0002|Tom |42.6 |9.0 |9.0 |
±----±—±--------±----±---------+

cume_dist() 小于当前行的值的行数/窗口内总行数,必须有分区规则和排序规则
=select a.film,a.user,a.pay_money,a.score,cume_dist() over( partition by a.user order by a.score asc ) as cume_dist from tb_film a==
±----±—±--------±----±-----------------+
|film |user|pay_money|score|cume_dist |
±----±—±--------±----±-----------------+
|A0006|Jack|40.6 |6.5 |0.2 |
|A0002|Jack|32.5 |7.0 |0.4 |
|A0005|Jack|31.5 |8.0 |0.8 |
|A0004|Jack|42.6 |8.0 |0.8 |
|A0001|Jack|42.6 |9.0 |1.0 |
|A0001|Tom |33.5 |3.0 |0.6666666666666666|
|A0003|Tom |33.5 |3.0 |0.6666666666666666|
|A0002|Tom |42.6 |9.0 |1.0 |
±----±—±--------±----±-----------------+

<=0.2的只有1行,窗口总行数为5所以比值为1/5=0.2

percent_rank() 分组内当前行的RANK值-1/分组内总行数-1
=select a.film,a.user,a.pay_money,a.score,rank() over(partition by a.user order by a.score asc) as score_rank,count(a.score) over(partition by a.user) as user_sum,percent_rank() over( partition by a.user order by a.score asc ) as percent_rank from tb_film a==
±----±—±--------±----±---------±-------±-----------+
|film |user|pay_money|score|score_rank|user_sum|percent_rank|
±----±—±--------±----±---------±-------±-----------+
|A0006|Jack|40.6 |6.5 |1 |5 |0.0 |
|A0002|Jack|32.5 |7.0 |2 |5 |0.25 |
|A0005|Jack|31.5 |8.0 |3 |5 |0.5 |
|A0004|Jack|42.6 |8.0 |3 |5 |0.5 |
|A0001|Jack|42.6 |9.0 |5 |5 |1.0 |
|A0001|Tom |33.5 |3.0 |1 |3 |0.0 |
|A0003|Tom |33.5 |3.0 |1 |3 |0.0 |
|A0002|Tom |42.6 |9.0 |3 |3 |1.0 |
±----±—±--------±----±---------±-------±-----------+

NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值

=select a.film,a.user,a.pay_money,a.score,ntile(2) over( partition by a.user order by a.score asc ) as ntil_value from tb_film a==
±----±—±--------±----±---------+
|film |user|pay_money|score|ntil_value|
±----±—±--------±----±---------+
|A0006|Jack|40.6 |6.5 |1 |
|A0002|Jack|32.5 |7.0 |1 |
|A0005|Jack|31.5 |8.0 |1 |
|A0004|Jack|42.6 |8.0 |2 |
|A0001|Jack|42.6 |9.0 |2 |
|A0001|Tom |33.5 |3.0 |1 |
|A0003|Tom |33.5 |3.0 |1 |
|A0002|Tom |42.6 |9.0 |2 |
±----±—±--------±----±---------+

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值