cube
hive中也有cube函数,可以实现多个任意维度的查询
cube(a,b,c)则首先会对(a,b,c)进行group by,
然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合
用cube函数就可以完成所有维度的聚合工作.
select os, --维度1
manufacturer,--维度2
carrier, --维度3
network_type,--维度4
area_code, --维度5
count(user_id) user_count,
sum(launch_count) launch_count
from dws.launch
where bdp_day = '20201223'
group by os, manufacturer, carrier, network_type, area_code
with cube --使用cube函数
limit 20;
结果预览
+----+------------+-------+------------+---------+----------+------------+
|os |manufacturer|carrier|network_type|area_code|user_count|launch_count|
+----+------------+-------+------------+---------+----------+------------+
|NULL|NULL |NULL |NULL |NULL |170426 |724588 |
|NULL|NULL |NULL |NULL |03060 |44 |184 |
|NULL|NULL |NULL |NULL |10002 |560 |2132 |
|NULL|NULL |NULL |NULL |10003 |672 |2752 |
|NULL|NULL |NULL |NULL |10004 |634 |2744 |
|NULL|NULL |NULL |NULL |10005 |906 |3756 |
|NULL|NULL |NULL |NULL |10007 |1272 |5116 |
|NULL|NULL |NULL |NULL |10008 |722 |3248 |
|NULL|NULL |NULL |NULL |10009 |1086 |4744 |
|NULL|NULL |NULL |NULL |10010 |714 |2888 |
|NULL|NULL |NULL |NULL |10013 |1524 |6536 |
|NULL|NULL |NULL |NULL |10014 |744 |3024 |
|NULL|NULL |NULL |NULL |10015 |684 |3108 |
|NULL|NULL |NULL |NULL |10016 |318 |1472 |
|NULL|NULL |NULL |NULL |10017 |406 |1776 |
|NULL|NULL |NULL |NULL |10018 |204 |828 |
|NULL|NULL |NULL |NULL |10019 |410 |1684 |
|NULL|NULL |NULL |NULL |10020 |224 |1108 |
|NULL|NULL |NULL |NULL |10021 |360 |1576 |
|NULL|NULL |NULL |NULL |11010 |312 |1268 |
+----+------------+-------+------------+---------+----------+------------+
grouping sets
当不需要cube将所有维度都列出来的时候,当只需要部分维度的时候
可以使用grouping sets来进行决定聚合那些维度
当使用grouping sets()进行指定维度聚合的时候,仅仅聚合你给出的维度组合,并不会自动帮你组合维度.
例如 grouping sets(col1,(col2,col3)) 只聚合col1维度,(col2,col3)维度.
并不会自动聚合(col1,col2,col3)维度
select os, --维度1
manufacturer,--维度2
carrier, --维度3
network_type,--维度4
area_code, --维度5
count(user_id) user_count,
sum(launch_count) launch_count
from dws.launch
where bdp_day = '20201223'
group by os, manufacturer, carrier, network_type, area_code
grouping sets ((manufacturer,carrier))
运行结果
+----+------------+-------+------------+---------+----------+------------+
|os |manufacturer|carrier|network_type|area_code|user_count|launch_count|
+----+------------+-------+------------+---------+----------+------------+
|NULL|02 |2 |NULL |NULL |9408 |39888 |
|NULL|02 |3 |NULL |NULL |9482 |40440 |
|NULL|03 |2 |NULL |NULL |9632 |40996 |
|NULL|03 |3 |NULL |NULL |9392 |40376 |
|NULL|04 |2 |NULL |NULL |8906 |37312 |
|NULL|04 |3 |NULL |NULL |9794 |41896 |
|NULL|05 |2 |NULL |NULL |9552 |40504 |
|NULL|05 |3 |NULL |NULL |9690 |41960 |
|NULL|06 |2 |NULL |NULL |9556 |40352 |
|NULL|06 |3 |NULL |NULL |9378 |40200 |
+----+------------+-------+------------+---------+----------+------------+
roll up
rollup函数
是cube
的子集,以最左侧维度
为主,按照顺序依次进行聚合.
例如聚合的维度为 col1,col2,col3
使用rollup
聚合的字段分别为 col1,(col1,col2),(col1,col3),(col1,col2,col3)
举个例子
select os, --维度1
manufacturer,--维度2
carrier, --维度3
network_type,--维度4
area_code, --维度5
count(distinct user_id) user_count,
sum(launch_count) launch_count
from dws.launch ---用户启动表
where bdp_day = '20201223'
group by os, manufacturer, carrier, network_type, area_code
with rollup --使用roll up函数
limit 10;
结果如下
就是说等价于统计出来了以下结果
group by 1 2 3 4 5
group by 2 3 4 5
group by 3 4 5
group by 4 5
group by 5
+----+------------+-------+------------+---------+----------+------------+
|os |manufacturer|carrier|network_type|area_code|user_count|launch_count|
+----+------------+-------+------------+---------+----------+------------+
|NULL|NULL |NULL |NULL |NULL |56885 |724588 |
|2 |NULL |NULL |NULL |NULL |56885 |724588 |
|2 |02 |NULL |NULL |NULL |6307 |80328 |
|2 |02 |2 |NULL |NULL |3126 |39888 |
|2 |02 |2 |0 |NULL |1042 |13412 |
|2 |02 |2 |0 |10002 |5 |48 |
|2 |02 |2 |0 |10003 |6 |84 |
|2 |02 |2 |0 |10004 |5 |44 |
|2 |02 |2 |0 |10005 |6 |80 |
|2 |02 |2 |0 |10007 |13 |180 |
+----+------------+-------+------------+---------+----------+------------+
总结
- 所有组合维度的聚合可以用
cube
指定维度
的聚合用grouping sets