Hive笔记2 窗口函数

基础

聚合函数, 宏观角度来看是统计特征值,计算代表特征的统计数据,这样的数据往往是一个数字,多 — fuc —> 少。
但窗口函数,可以把统治值分散给其他列, 多 —fuc—> 多。

栗子

举例说明:
随意捏点数据,

jack,2015-01-01,10
tony,2015-01-02,15
mart,2015-01-03,14
jack,2015-01-04,26
jim,2015-01-05,85
mike,2015-01-06,65
jim,2015-01-07,45
mike,2015-01-08,35
jack,2015-01-09,79
mart,2015-01-10,64
mike,2015-01-11,26
jim,2015-01-12,81
mart,2015-01-13,63
tony,2015-01-14,58
tony,2015-01-15,45
mike,2015-01-16,23
mart,2015-01-17,49
jack,2015-01-18,18
tony,2015-01-19,49
jack,2015-01-20,57

建表语句
create external table tmp.t_window(
name string,
orderdate string,
cost int)
row format delimited
fields terminated by ","
location  '/data/window';

先看这句话,有没有问题

select name, count(*) 
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';

select name 查看的是name所有的数据,而 count() 算的是有多少数据,是个值,肯定会报错,那怎样把count() 结果赋值给每行呢? 窗口函数,这样改:

select name, count(*) over()
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01'; 
  • 查询不同人的购买次数,以及购物消费金额
-- 不使用窗口函数
select name, count(*) count, sum(cost) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01'
group by name;

-- 使用窗口函数 
select distinct name, count(*) over(partition by name) id_cost,
sum(cost) over(partition by name) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01';

output : 
name    cost    sum
jack    5       190
jim     3       211
mart    4       190
mike    4       149
tony    4       167

  • 查询不同人的购买次数,以及购物消费金额, 最终结果按照消费的多少进行排序
-- 不使用窗口函数
select name, count(*) count, sum(cost) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01'
group by name
order by sum;

-- 使用窗口函数
select distinct name, count(*) over(partition by name) id_count,
sum(cost) over(partition by name ) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01'
order by sum;

output:
name    id_count        sum
mike    4       149
tony    4       167
mart    4       190
jack    5       190
jim     3       211

如果把 order by 加入窗口函数内,会是什么样子呢?

select distinct name, count(*) over(partition by name) id_count,
sum(cost) over(partition by name order by cost) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01';

output :
name    id_count        sum
jack    5       10
jack    5       28
jack    5       54
jack    5       111
jack    5       190
jim     3       45
jim     3       126
jim     3       211
mart    4       14
mart    4       63
mart    4       126
mart    4       190
mike    4       23
mike    4       49
mike    4       84
mike    4       149
tony    4       15
tony    4       60
tony    4       109
tony    4       167

总结一下就是,组内有序,整体无序

窗口函数

  • sum()
    分组内每行的累加值
select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate) as sum
from tmp.t_window;

output:
name    orderdate       cost    sum
jack    2015-01-01      10      10
jack    2015-01-04      26      36
jack    2015-01-09      79      115
jack    2015-01-18      18      133
jack    2015-01-20      57      190
jim     2015-01-05      85      85
  • ntile
    均匀切片,如果分布不均匀,则默认增加第一个切片分布
select name, orderdate, cost,
ntile (3) over(partition by name order by orderdate) as split
from tmp.t_window;

name    orderdate       cost    split
jack    2015-01-01      10      1
jack    2015-01-04      26      1
jack    2015-01-09      79      2
jack    2015-01-18      18      2
jack    2015-01-20      57      3
jim     2015-01-05      85      1
jim     2015-01-07      45      2
jim     2015-01-12      81      3

  • row_number, rank, dense_rank 排名函数
select name, orderdate, cost,
rank() over(partition by name order by cost desc) as rn1,
dense_rank() over(partition by name order by cost desc) as rn2, 
row_number() over(partition by name order by cost desc) as rn3
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01';

row_number 排名名词无重复
rank :排名相同时,名次会重复,不影响整体排序:1,2,2,2,5
dense_rank : 排名相同时,名次会重复,影响整体排序:1,2,2,2,3,3,3,3,4,4

  • lag, lead
select name, orderdate, cost,
lag(orderdate, 1, '0000-00-00' ) over(partition by name order by orderdate ) as time1,
lag(orderdate,2) over (partition by name order by orderdate) as time2
from tmp.t_window;

lag(colname, n, xxxxxx) 整行向下移动 n 行, 若 xxxx 不是 null,则 1 ~ n 行为 xxxx
lead 为向上

  • first_value, last_value
    取分组排序后,截止到当前行的第一个值与最后一个值
  • cume_dist
    增加等分位
select name, orderdate, cost,
cume_dist () over(partition by name order by orderdate) as split
from tmp.t_window;
out:
name    orderdate       cost    split
jack    2015-01-01      10      0.2
jack    2015-01-04      26      0.4
jack    2015-01-09      79      0.6
jack    2015-01-18      18      0.8
jack    2015-01-20      57      1.0  
jim     2015-01-05      85      0.3333333333333333
jim     2015-01-07      45      0.6666666666666666
jim     2015-01-12      81      1.0
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值