基础
聚合函数, 宏观角度来看是统计特征值,计算代表特征的统计数据,这样的数据往往是一个数字,多 — fuc —> 少。
但窗口函数,可以把统治值分散给其他列, 多 —fuc—> 多。
栗子
举例说明:
随意捏点数据,
jack,2015-01-01,10
tony,2015-01-02,15
mart,2015-01-03,14
jack,2015-01-04,26
jim,2015-01-05,85
mike,2015-01-06,65
jim,2015-01-07,45
mike,2015-01-08,35
jack,2015-01-09,79
mart,2015-01-10,64
mike,2015-01-11,26
jim,2015-01-12,81
mart,2015-01-13,63
tony,2015-01-14,58
tony,2015-01-15,45
mike,2015-01-16,23
mart,2015-01-17,49
jack,2015-01-18,18
tony,2015-01-19,49
jack,2015-01-20,57
建表语句
create external table tmp.t_window(
name string,
orderdate string,
cost int)
row format delimited
fields terminated by ","
location '/data/window';
先看这句话,有没有问题
select name, count(*)
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';
select name 查看的是name所有的数据,而 count() 算的是有多少数据,是个值,肯定会报错,那怎样把count() 结果赋值给每行呢? 窗口函数,这样改:
select name, count(*) over()
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';
- 查询不同人的购买次数,以及购物消费金额
-- 不使用窗口函数
select name, count(*) count, sum(cost) sum
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01'
group by name;
-- 使用窗口函数
select distinct name, count(*) over(partition by name) id_cost,
sum(cost) over(partition by name) sum
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';
output :
name cost sum
jack 5 190
jim 3 211
mart 4 190
mike 4 149
tony 4 167
- 查询不同人的购买次数,以及购物消费金额, 最终结果按照消费的多少进行排序
-- 不使用窗口函数
select name, count(*) count, sum(cost) sum
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01'
group by name
order by sum;
-- 使用窗口函数
select distinct name, count(*) over(partition by name) id_count,
sum(cost) over(partition by name ) sum
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01'
order by sum;
output:
name id_count sum
mike 4 149
tony 4 167
mart 4 190
jack 5 190
jim 3 211
如果把 order by 加入窗口函数内,会是什么样子呢?
select distinct name, count(*) over(partition by name) id_count,
sum(cost) over(partition by name order by cost) sum
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';
output :
name id_count sum
jack 5 10
jack 5 28
jack 5 54
jack 5 111
jack 5 190
jim 3 45
jim 3 126
jim 3 211
mart 4 14
mart 4 63
mart 4 126
mart 4 190
mike 4 23
mike 4 49
mike 4 84
mike 4 149
tony 4 15
tony 4 60
tony 4 109
tony 4 167
总结一下就是,组内有序,整体无序
窗口函数
- sum()
分组内每行的累加值
select name, orderdate, cost,
sum(cost) over(partition by name order by orderdate) as sum
from tmp.t_window;
output:
name orderdate cost sum
jack 2015-01-01 10 10
jack 2015-01-04 26 36
jack 2015-01-09 79 115
jack 2015-01-18 18 133
jack 2015-01-20 57 190
jim 2015-01-05 85 85
- ntile
均匀切片,如果分布不均匀,则默认增加第一个切片分布
select name, orderdate, cost,
ntile (3) over(partition by name order by orderdate) as split
from tmp.t_window;
name orderdate cost split
jack 2015-01-01 10 1
jack 2015-01-04 26 1
jack 2015-01-09 79 2
jack 2015-01-18 18 2
jack 2015-01-20 57 3
jim 2015-01-05 85 1
jim 2015-01-07 45 2
jim 2015-01-12 81 3
- row_number, rank, dense_rank 排名函数
select name, orderdate, cost,
rank() over(partition by name order by cost desc) as rn1,
dense_rank() over(partition by name order by cost desc) as rn2,
row_number() over(partition by name order by cost desc) as rn3
from tmp.t_window
where substring(orderdate, 1, 7) = '2015-01';
row_number 排名名词无重复
rank :排名相同时,名次会重复,不影响整体排序:1,2,2,2,5
dense_rank : 排名相同时,名次会重复,影响整体排序:1,2,2,2,3,3,3,3,4,4
- lag, lead
select name, orderdate, cost,
lag(orderdate, 1, '0000-00-00' ) over(partition by name order by orderdate ) as time1,
lag(orderdate,2) over (partition by name order by orderdate) as time2
from tmp.t_window;
lag(colname, n, xxxxxx) 整行向下移动 n 行, 若 xxxx 不是 null,则 1 ~ n 行为 xxxx
lead 为向上
- first_value, last_value
取分组排序后,截止到当前行的第一个值与最后一个值 - cume_dist
增加等分位
select name, orderdate, cost,
cume_dist () over(partition by name order by orderdate) as split
from tmp.t_window;
out:
name orderdate cost split
jack 2015-01-01 10 0.2
jack 2015-01-04 26 0.4
jack 2015-01-09 79 0.6
jack 2015-01-18 18 0.8
jack 2015-01-20 57 1.0
jim 2015-01-05 85 0.3333333333333333
jim 2015-01-07 45 0.6666666666666666
jim 2015-01-12 81 1.0