Hive常见的多维度聚合函数有:
- With Cube:任意维度聚合
- Grouping Sets:指定维度聚合
- With Rollup:GROUP BY最左侧的维度为主层级聚合
测试sql:
use db01;
create table if not exists db01.t12(id int,name string,age string,dt string)
row format delimited fields terminated by ',' stored as textfile;
insert into db01.t12 values(1,'aa','12','2021-01-01'),(1,'aa','12','2021-01-02'),(2,'aa','14','2021-02-01'),(2,'bb','14','2021-02-02'),(3,'cc','16','2020-01-01');
1.分組(n个维度1种)
select id,name,count(dt)
from db01.t11
group by id,name
+-----+-------+------+
| id | name | _c2 |
+-----+-------+------+
| 1 | aa | 2 |
| 2 | aa | 1 |
| 3 | cc | 1 |
| 2 | bb | 1 |
+-----+-------+------+
2. with cube(n个维度2^n种)
select id,name,count(dt)
from db01.t11
group by id,name
with cube
+-------+-------+------+
| id | name | _c2 |
+-------+-------+------+
| NULL | aa | 3 |
| NULL | bb | 1 |
| 2 | NULL | 2 |
| NULL | NULL | 5 |
| NULL | cc | 1 |
| 1 | NULL | 2 |
| 3 | NULL | 1 |
+-------+-------+------+
| 1 | aa | 2 |
| 2 | aa | 1 |
| 3 | cc | 1 |
| 2 | bb | 1 |
3.with rollup(n个维度n+1种)
select id,name,count(dt)
from db01.t11
group by id,name
with rollup
+-------+-------+------+
| id | name | _c2 |
+-------+-------+------+
| 2 | NULL | 2 |
| NULL | NULL | 5 |
| 1 | NULL | 2 |
| 3 | NULL | 1 |
+-------+-------+------+
| 1 | aa | 2 |
| 2 | aa | 1 |
| 3 | cc | 1 |
| 2 | bb | 1 |
4.grouping sets
select id,name,count(dt),GROUPING__ID --两个下划线
from db01.t11
group by id,name
grouping sets((id,name),(),(id),(name))
+-------+-------+------+---------------+
| id | name | _c2 | grouping__id |
+-------+-------+------+---------------+
| NULL | aa | 3 | 2 |
| NULL | bb | 1 | 2 |
| 2 | NULL | 2 | 1 |
| 2 | bb | 1 | 0 |
| NULL | NULL | 5 | 3 |
| NULL | cc | 1 | 2 |
| 1 | NULL | 2 | 1 |
| 1 | aa | 2 | 0 |
| 2 | aa | 1 | 0 |
| 3 | NULL | 1 | 1 |
| 3 | cc | 1 | 0 |
+-------+-------+------+---------------+
grouping_id 算法
grouping_id的值是根据groupBy的列是否使用和列的顺序来决定。靠近groupBy的列为高位,远离groupBy的列为低位;列被使用则为’0’,列没有被使用则为’1’。按照此规则,对每种粒度的组合生成一组二进制数,然后将二进制数转成十进制数。(dim1, dim2), (dim1), (dim2), ( ) 对应的二进制数和十进制数见表格
代码实现 grouping_id
concat_ws
(
':'
, case when dim1 is not null then 'dim1' else null end
, case when dim2 is not null then 'dim2' else null end
) as group_id
聚合粒度 -> group_id
(dim1, dim2) -> dim1:dim2
(dim1) -> dim1
(dim2) -> dim2