Hive窗口函数
over():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化
而变化;
放在over()里面的:
partition by x order by y [desc] :分区排序
distribute by x sort by y [desc]:分组排序
unbounded preceding表示从前面的起点;
n preceding:往前 n 行数据;
current row:当前行;
n following:往后 n 行数据;
unbounded following表示到后面的终点;
用法:(…rows between xxx and yyy)xxx表示前面的要填的行 yyy表示后面的要填的行,不能反过来
放在over之前的:
sum(col):所在组的和
avg(col):所在组的平均值
count(col):所在组的总个数
lag(col,n,default):往前第 n 行数据;
lead(col,n,default):往后第 n 行数据;
ntile(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从 1 开始,对于每一行,ntile返回此行所属的组的编号。注意:n 必须为 int 类型。
rank() 排序相同时会重复,总数不会变(1,2,4,.。。)
dense_rank() 排序相同时会重复,总数会减少,密集排序(1,2,3,4.,。。)
row_number() 会根据顺序计算
实操一
select * from emp;
select job,hiredate,sal
,sum(sal) over(partition by job order by sal rows between unbounded preceding and current row) t1 第一个到当前行
,sum(sal) over(partition by job order by sal rows between 1 preceding and current row) t2 当前行前一个到当前行
,sum(sal) over(partition by job order by sal rows between 1 preceding and 1 following) t3 当前行前一个到当前行后一个
,sum(sal) over(partition by job order by sal rows between current row and 1 following ) t4 当前行到当前行后一个
,sum(sal) over(partition by job order by sal rows between current row and unbounded following) t5 当前行到最后一个
from emp;
结果
select job,hiredate,sal,
lag(hiredate,1,"1970-01-01") over(partition by job order by sal) lag1,
lead(hiredate,1,"9999-12-31") over(partition by job order by sal) lead1
from emp;
ntile 一般用来取前百分之几
比如我要没个工作组前三分之一,首先给每组的所有人尽可能平均分配1~3的编号
取ntile=1就是我们要的结果
select job,hiredate,sal
from(
select job,hiredate,sal,
ntile(3) over(partition by job order by sal) ntile
from emp
) t
where ntile=1;
实操二
select * from score;
select name,subject,score,
rank() over(partition by subject order by score desc) r,
dense_rank() over(partition by subject order by score desc) dr,
row_number() over(partition by subject order by score desc) rm
from score;
英语成绩中有两个人成绩相同,
rank:在两人并列第一后直接跳到3
dense_rank:在两人并列第一后直接跳到2,属于密集排名
row_number:在两人并列第一后直接跳到3 直接打印1234,不管并列情况