Hive笔记2 窗口函数

最新推荐文章于 2024-07-10 16:42:23 发布

YIFZZZ

最新推荐文章于 2024-07-10 16:42:23 发布

阅读量491

点赞数

分类专栏： BIgData 文章标签： hive 数据库 sql

本文链接：https://blog.csdn.net/weixin_46546516/article/details/127722645

版权

BIgData 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

基础

聚合函数，宏观角度来看是统计特征值，计算代表特征的统计数据，这样的数据往往是一个数字，多 — fuc —> 少。
但窗口函数，可以把统治值分散给其他列，多 —fuc—> 多。

栗子

举例说明：
随意捏点数据，

jack,2015-01-01,10
tony,2015-01-02,15
mart,2015-01-03,14
jack,2015-01-04,26
jim,2015-01-05,85
mike,2015-01-06,65
jim,2015-01-07,45
mike,2015-01-08,35
jack,2015-01-09,79
mart,2015-01-10,64
mike,2015-01-11,26
jim,2015-01-12,81
mart,2015-01-13,63
tony,2015-01-14,58
tony,2015-01-15,45
mike,2015-01-16,23
mart,2015-01-17,49
jack,2015-01-18,18
tony,2015-01-19,49
jack,2015-01-20,57

建表语句
create external table tmp.t_window(
name string,
orderdate string,
cost int)
row format delimited
fields terminated by ","
location  '/data/window';

先看这句话，有没有问题

select name, count(*) 
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';

select name 查看的是name所有的数据，而 count() 算的是有多少数据，是个值，肯定会报错，那怎样把count() 结果赋值给每行呢？窗口函数，这样改：

select name, count(*) over()
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';

查询不同人的购买次数，以及购物消费金额

-- 不使用窗口函数
select name, count(*) count, sum(cost) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01'
group by name;

-- 使用窗口函数 
select distinct name, count(*) over(partition by name) id_cost,
sum(cost) over(partition by name) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01';

output : 
name    cost    sum
jack    5       190
jim     3       211
mart    4       190
mike    4       149
tony    4       167

查询不同人的购买次数，以及购物消费金额, 最终结果按照消费的多少进行排序

-- 不使用窗口函数
select name, count(*) count, sum(cost) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01'
group by name
order by sum;

-- 使用窗口函数
select distinct name, count(*) over(partition by name) id_count,
sum(cost) over(partition by name ) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01'
order by sum;

output:
name    id_count        sum
mike    4       149
tony    4       167
mart    4       190
jack    5       190
jim     3       211

如果把 order by 加入窗口函数内，会是什么样子呢？

select distinct name, count(*) over(partition by name) id_count,
sum(cost) over(partition by name order by cost) sum 
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01';

output :
name    id_count        sum
jack    5       10
jack    5       28
jack    5       54
jack    5       111
jack    5       190
jim     3       45
jim     3       126
jim     3       211
mart    4       14
mart    4       63
mart    4       126
mart    4       190
mike    4       23
mike    4       49
mike    4       84
mike    4       149
tony    4       15
tony    4       60
tony    4       109
tony    4       167

总结一下就是，组内有序，整体无序

窗口函数

sum()
分组内每行的累加值

select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate) as sum
from tmp.t_window;

output:
name    orderdate       cost    sum
jack    2015-01-01      10      10
jack    2015-01-04      26      36
jack    2015-01-09      79      115
jack    2015-01-18      18      133
jack    2015-01-20      57      190
jim     2015-01-05      85      85

ntile
均匀切片，如果分布不均匀，则默认增加第一个切片分布

select name, orderdate, cost,
ntile (3) over(partition by name order by orderdate) as split
from tmp.t_window;

name    orderdate       cost    split
jack    2015-01-01      10      1
jack    2015-01-04      26      1
jack    2015-01-09      79      2
jack    2015-01-18      18      2
jack    2015-01-20      57      3
jim     2015-01-05      85      1
jim     2015-01-07      45      2
jim     2015-01-12      81      3

row_number, rank, dense_rank 排名函数

select name, orderdate, cost,
rank() over(partition by name order by cost desc) as rn1,
dense_rank() over(partition by name order by cost desc) as rn2, 
row_number() over(partition by name order by cost desc) as rn3
from tmp.t_window  
where substring(orderdate, 1, 7) = '2015-01';

row_number 排名名词无重复
rank ：排名相同时，名次会重复，不影响整体排序：1,2,2,2,5
dense_rank : 排名相同时，名次会重复，影响整体排序：1,2,2,2,3,3,3,3,4,4

lag, lead

select name, orderdate, cost,
lag(orderdate, 1, '0000-00-00' ) over(partition by name order by orderdate ) as time1,
lag(orderdate,2) over (partition by name order by orderdate) as time2
from tmp.t_window;

lag(colname, n, xxxxxx) 整行向下移动 n 行，若 xxxx 不是 null，则 1 ~ n 行为 xxxx
lead 为向上

first_value, last_value
取分组排序后，截止到当前行的第一个值与最后一个值
cume_dist
增加等分位

select name, orderdate, cost,
cume_dist () over(partition by name order by orderdate) as split
from tmp.t_window;
out:
name    orderdate       cost    split
jack    2015-01-01      10      0.2
jack    2015-01-04      26      0.4
jack    2015-01-09      79      0.6
jack    2015-01-18      18      0.8
jack    2015-01-20      57      1.0  
jim     2015-01-05      85      0.3333333333333333
jim     2015-01-07      45      0.6666666666666666
jim     2015-01-12      81      1.0