Hive 窗口函数在计算部分与整体之间的关系中常用到,可以达到简化SQL,减少Hive 执行计划产生的MR数。
实列:1
需求:统计各城市用户在总体的占比
数据结构:「user_city」表中有两个字段「 user_name」,「city」,表示用户姓名与所在城市
分析:上述需求中明显需要统计每个「城市」的用户数与「所有用户数」,然后做商得到最终结果
实现1:不适用窗口函数
select city,users/total as percentage
(
-- 计算每个城市的用户数
(select
count(user_name) as users,
city,
from user_city
group by city) city
left join
-- 计算用户总数
(select
count(user_name) as total
from user_city) total
on 1=1
) a
实现2:使用窗口函数
select city,
count(user_name)/sum(count(user_name)) over() as percentage
from user_city
group by city
实例:2
需求:统计每个城市的人数,且将用户按用户姓名排序编号
数据结构:同上
分析:最终需要的字段是「user_name,city,user_couty,rank」
实现1:分开使用窗口函数
select
user.user_name,
city.city,
city.user_county,
user.rank
from
(select
city,
count(user_name) as user_county
from user_city
group by city) city
left join
(select
user_name,
city,
row_number() over(partition by city order by user_name) as rank
from user_city
) user on user.city = city.city
实现2:使用两个窗口函数
select
user_name,
city,
count(user_name) over(partition by city) as user_county,
row_number() over(partition by city order by user_name) as rank
from user_city
补充:
-- 窗口函数中的 partition by 与 order by
partition by作用是限定逻辑计算的边界,border by是排序
user_name income month
a 10 1
a 10 2
a 10 3
b 20 1
b 20 2
b 20 3
需求:结果展示 user_name,income,month,累计收入(截止到当前月份的收入总额)
select
user_name,
income,month,
sum(income) over(partition name,month order by month)
from t_test;
**order by 在这里面起到排序并使计算从起始行到当前行的作用
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from t_window;