序列函数
NTILE
ntile(n):把有序分区中的行发送到指定的组中(n是有几组),各个组有编号,编号是从1开始
ntile 是Hive很强大的一个分析函数。可以看成是:它把有序的数据集合 平均分配 到 指定的数量(num)个桶中, 将桶号分配给每一行。如果不能平均分配,则优先分配较小编号的桶,并且各个桶中能放的行数最多相差1。
例子:
select name,orderdate,cost,
ntile(3) over(partition by name), # 按照name进行分组,在分组内将数据切成3份
from t_order;
mart 2018-04-13 94 1
mart 2018-04-11 75 1
mart 2018-04-09 68 2
mart 2018-04-08 62 3
neil 2018-06-12 80 1
neil 2018-05-10 12 2
saml 2018-01-01 10 1
saml 2018-01-08 55 1
saml 2018-04-06 42 2
saml 2018-01-05 46 2
saml 2018-02-03 23 3
tony 2018-01-07 50 1
tony 2018-01-02 15 2
tony 2018-01-04 29 3
LAG和LEAD函数
-
lag返回当前数据行的前第n行的数据
-
lead返回当前数据行的后第n行的数据
-
lag(列,行数,默认值):往前第n行数据
lead(列,行数,默认值):往后第n行数据
需求:查询顾客上次购买的时间
select name,orderdate,cost,
lag(orderdate,1) over(partition by name order by orderdate) as time1
from t_order;
lag(colName,n[,default value]): 取字段的前第n个值。如果为null,显示默认值
select name,orderdate,cost,
lag(orderdate,1,'1990-01-01') over(partition by name order by orderdate ) as time1
from t_order;
取得顾客下次购买的时间
select name,orderdate,cost,
lead(orderdate,1) over(partition by name order by orderdate ) as time1
from t_order;
案例:求5分钟内点击100次的用户
dt id url
2019-08-22 19:00:01,1,www.baidu.com
2019-08-22 19:01:01,1,www.baidu.com
2019-08-22 19:02:01,1,www.baidu.com
2019-08-22 19:03:01,1,www.baidu.com
select id,dt,lag(dt,100) over(partition by id order by dt)
from tablename where dt-lag(dt,100) over(partition by id order by dt)<5分钟
first_value和last_value
- first_value 取分组内排序后,截止到当前行,第一个值
- last_value 分组内排序后,截止到当前行,最后一个值
案例:
select name,orderdate,cost,
first_value(orderdate) over(partition by name order by orderdate) as time1,
last_value(orderdate) over(partition by name order by orderdate) as time2
from t_order;
select name,orderdate,cost,
first_value(orderdate) over(partition by name order by orderdate) as time1,
first_value(orderdate) over(partition by name order by orderdate desc) as time2
from t_order;