Grouping_ID函数
当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)
Column1 (key) | Column2 (value) |
---|---|
1 | NULL |
1 | 1 |
2 | 2 |
3 | 3 |
3 | NULL |
4 | 5 |
hql统计:
SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP
统计结果如下:
NULL | NULL | 0 00 | 6 |
1 | NULL | 1 10 | 2 |
1 | NULL | 3 11 | 1 |
1 | 1 | 3 11 | 1 |
2 | NULL | 1 10 | 1 |
2 | 2 | 3 11 | 1 |
3 | NULL | 1 10 | 2 |
3 | NULL | 3 11 | 1 |
3 | 3 | 3 11 | 1 |
4 | NULL | 1 10 | 1 |
4 | 5 | 3 11 | 1 |
如果列中没有有null值
SELECT fact_1_id,
fact_2_id,
SUM(sales_value) AS sales_value,
(case when fact_1_id is null then 1 else 0 end) as f1g,
(case when fact_2_id is null then 1 else 0 end) as f2
FROM dimension_tab
GROUP BY fact_1_id, fact_2_id WITH CUBE
ORDER BY fact_1_id, fact_2_id;
如果列中本来就有null值
SELECT fact_1_id,
fact_2_id,
SUM(sales_value) AS sales_value,
(case when (CAST (GROUPING__ID AS INT) & 1) = 0 then 1 else 0 end) as f1g,
(case when (CAST (GROUPING__ID AS INT) & 2) = 0 then 1 else 0 end) as f2g
FROM dimension_tab
GROUP BY fact_1_id, fact_2_id WITH CUBE
ORDER BY fact_1_id, fact_2_id;
如下
(case when (CAST (GROUPING__ID AS INT) & 1) = 1 then '1 ' else 0 end),
(case when (CAST (GROUPING__ID AS INT) & 2) = 2 then '2' else 0 end) ,
(case when (CAST (GROUPING__ID AS INT) & 4) = 4 then '3' else 0 end) ,
第一位如果和1与为1,证明第一位存在同时包括本来存在的null值,
用来判断每一个字段知否为聚合的字段中,null值本来的null还是grouping set的null值。
参考
http://stackoverflow.com/questions/29577887/grouping-in-hive
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup