Hive GROUP BY, GROUPING SETS, WITH CUBE, WITH ROLLUP

weixin_34293059

Original article: https://my.oschina.net/trydaydayup/blog/1512298

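When processing data we often need to cross-analyze several dimensions. With only the GROUP BY clause, we would have to write one GROUP BY query per dimension or level and then stitch the result sets together with UNION, which makes the query long and clumsy.

To address this verbosity in HQL, Hive provides the following syntax:

Using GROUPING SETS after GROUP BY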

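Because the result is logically a UNION, and UNION requires every combined data set to have the same columns, GROUPING SETS fills each column that does not take part in a given grouping with NULL, turning it into a constant placeholder column. In the aggregated output, columns that were not grouped on therefore show NULL.

A GROUP BY with a GROUPING SETS clause returns only the groupings that are explicitly listed. Each listed item is either a single column or a parenthesized combination of columns; a multi-column combination such as (A, B) is aggregated only if it appears in the list.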
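The NULL placeholder behaviour described above can be sketched outside Hive. The following is a minimal Python emulation, not Hive code: `grouping_sets_sum` and the sample rows are hypothetical names and data introduced only for illustration, and `None` stands in for SQL NULL.

```python
from collections import defaultdict

def grouping_sets_sum(rows, columns, grouping_sets, value):
    """Emulate SELECT <columns>, SUM(<value>) ... GROUP BY <columns> GROUPING SETS (...)."""
    result = []
    for gset in grouping_sets:
        totals = defaultdict(int)
        for row in rows:
            # Columns outside the current grouping set become None (SQL NULL).
            key = tuple(row[c] if c in gset else None for c in columns)
            totals[key] += row[value]
        # The rows of every grouping set are appended, never de-duplicated.
        result.extend(key + (total,) for key, total in totals.items())
    return result

# Hypothetical sample data, not taken from the article.
rows = [
    {"a": "x", "b": "p", "c": 1},
    {"a": "x", "b": "q", "c": 2},
    {"a": "y", "b": "p", "c": 4},
]

# Corresponds to: GROUP BY a, b GROUPING SETS (a, b)
out = grouping_sets_sum(rows, ["a", "b"], [("a",), ("b",)], "c")
for line in out:
    print(line)
```

Each output row carries a value only in the column it was grouped on; the other column is `None`, which is how the per-grouping results can share one column layout.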

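Example 1: GROUP BY A GROUPING SETS (A)

Number of groupings produced: 1.

Grouping 1: GROUP BY A

Result set: the aggregated result of that single grouping.

Example 2: GROUP BY A, B GROUPING SETS (A, B)

Number of groupings produced: 2.

Grouping 1: GROUP BY A

Grouping 2: GROUP BY B

Result set: the union of the results of the two groupings above, with duplicates kept.

Example 3: GROUP BY A, B, C GROUPING SETS (A, B, C)

Number of groupings produced: 3.

Grouping 1: GROUP BY A

Grouping 2: GROUP BY B

Grouping 3: GROUP BY C

Result set: the union of the results of the three groupings above, with duplicates kept.

Example 4: GROUP BY A, B, C GROUPING SETS (A, B, C, (A, B))

Number of groupings produced: 4.

Grouping 1: GROUP BY A

Grouping 2: GROUP BY B

Grouping 3: GROUP BY C

Grouping 4: GROUP BY A, B

Result set: the union of the results of the four groupings above, with duplicates kept.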

Aggregate Query with GROUPING SETS	Equivalent Aggregate Query with GROUP BY
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b) )	SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a )	SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b UNION ALL SELECT a, null, SUM(c) FROM tab1 GROUP BY a
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( a, b )	SELECT a, null, SUM(c) FROM tab1 GROUP BY a UNION ALL SELECT null, b, SUM(c) FROM tab1 GROUP BY b
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )	SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b UNION ALL SELECT a, null, SUM(c) FROM tab1 GROUP BY a UNION ALL SELECT null, b, SUM(c) FROM tab1 GROUP BY b UNION ALL SELECT null, null, SUM(c) FROM tab1
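The article states that the combined result keeps duplicates. A small Python sketch shows why that matters when a grouped column holds a real NULL: two different groupings can then yield identical-looking rows, so the equivalent form is a duplicate-preserving union (UNION ALL) rather than a de-duplicating one. The helper `group_sum` and the single sample row are assumptions made for illustration, with `None` standing in for NULL.

```python
from collections import defaultdict

def group_sum(rows, columns, gset, value):
    """Emulate one GROUP BY over gset, padding the other columns with None."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[c] if c in gset else None for c in columns)
        totals[key] += row[value]
    return [key + (total,) for key, total in totals.items()]

# Hypothetical data: column b really contains NULL (None) in this row.
rows = [{"a": 1, "b": None, "c": 5}]

by_a_b = group_sum(rows, ["a", "b"], ("a", "b"), "c")  # GROUP BY a, b
by_a = group_sum(rows, ["a", "b"], ("a",), "c")        # GROUP BY a

union_all = by_a_b + by_a        # rows of GROUPING SETS ((a, b), a)
union_distinct = set(union_all)  # what a de-duplicating union would keep

print(union_all)
print(union_distinct)
```

Here `union_all` holds two rows and `union_distinct` only one, so a de-duplicating union would silently drop a row that GROUPING SETS returns.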

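Summary of the ROLLUP clause after GROUP BY

A GROUP BY with a ROLLUP clause can be understood as follows: it first generates several groupings according to a fixed rule, then aggregates the data for each grouping (whether that is a sum, a maximum, an average, and so on depends on the aggregate function in the SELECT list). Understanding ROLLUP therefore mainly means understanding the rule by which the groupings are generated. The result set it returns can be seen as the union of the result sets of the individual groupings, with duplicates kept. Examples follow:

1. For comparison, GROUP BY without ROLLUP

Example: GROUP BY A, B

Number of groupings produced: 1.

That is, GROUP BY A, B

Result set: the result of that one grouping.

2. GROUP BY with ROLLUP

Example 1: GROUP BY A, B WITH ROLLUP

Number of groupings produced: 3.

Grouping 1: GROUP BY A, B

Grouping 2: GROUP BY A

Grouping 3: GROUP BY NULL

(Note: GROUP BY NULL is not valid syntax; it is used here only as shorthand for no grouping at all, that is, one aggregate over all the data. With SUM, for example, it is the sum over all rows that satisfy the query conditions. The same shorthand is used below.)

Result set: the union of the results of the three groupings above, with duplicates kept.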

Given a table with the following data:

Column1 (key)	Column2 (value)
1	NULL
1	1
2	2
3	3
3	NULL
4	5

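Running the following query:

SELECT key, value, GROUPING__ID, count(*) FROM T1 GROUP BY key, value WITH ROLLUP

produces the following result (the GROUPING__ID values shown follow the numbering used by Hive releases before 2.3.0; later releases number the grouping sets differently):

key	value	GROUPING__ID	count(*)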
NULL	NULL	0	6
1	NULL	1	2
1	NULL	3	1
1	1	3	1
2	NULL	1	1
2	2	3	1
3	NULL	1	2
3	NULL	3	1
3	3	3	1
4	NULL	1	1
4	5	3	1
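As a cross-check, the counts in the result table can be reproduced with a short Python sketch. This is an emulation under stated assumptions, not Hive itself: `None` stands in for NULL, and the grouping id is computed as a bitmask chosen to match the numbering in the table above (bit i is set when column i takes part in the grouping).

```python
from collections import Counter

# The sample table T1 from above; None stands for NULL.
t1 = [(1, None), (1, 1), (2, 2), (3, 3), (3, None), (4, 5)]

# GROUP BY key, value WITH ROLLUP expands to these grouping sets,
# written as positions into (key, value).
grouping_sets = [(0, 1), (0,), ()]

counts = Counter()
for gset in grouping_sets:
    # Bitmask numbering that matches the result table above.
    grouping_id = sum(1 << i for i in gset)
    for row in t1:
        # Columns outside the grouping set are replaced by None.
        key = tuple(row[i] if i in gset else None for i in range(2))
        counts[key + (grouping_id,)] += 1

for (key, value, gid), n in counts.items():
    print(key, value, gid, n)
```

The rows (1, NULL) with id 3 and (1, NULL) with id 1 differ only in the grouping id: in the first the NULL is real data from the value column, in the second it is the placeholder for a column that was not grouped on.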

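Example 2: GROUP BY A, B, C WITH ROLLUP

Number of groupings produced: 4.

Grouping 1: GROUP BY A, B, C

Grouping 2: GROUP BY A, B

Grouping 3: GROUP BY A

Grouping 4: GROUP BY NULL

Result set: the union of the results of the four groupings above, with duplicates kept.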
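The rule by which ROLLUP generates its groupings (drop columns from the right, one at a time, until none are left) can be sketched in Python. `rollup_sets` is a hypothetical helper written for illustration, not a Hive function; the empty tuple plays the role of the "GROUP BY NULL" shorthand.

```python
def rollup_sets(columns):
    """Grouping sets produced by GROUP BY <columns> WITH ROLLUP."""
    # Keep the first n columns, for n from len(columns) down to 0.
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

print(rollup_sets(["A", "B"]))
# [('A', 'B'), ('A',), ()]

for gset in rollup_sets(["A", "B", "C"]):
    print(gset)
```

For n columns this yields n + 1 groupings, which matches the 3 and 4 groupings counted in the two examples.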

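Difference between ROLLUP and CUBE after GROUP BY

The only difference between GROUP BY with ROLLUP and GROUP BY with CUBE is this:

GROUP BY with CUBE produces more groupings. There is one grouping for every combination of the listed columns (combinations are order-independent), including the empty combination.

Example: GROUP BY A, B, C WITH CUBE

Number of groupings produced: 8.

Grouping 1: GROUP BY A, B, C

Grouping 2: GROUP BY A, B

Grouping 3: GROUP BY A, C

Grouping 4: GROUP BY B, C

Grouping 5: GROUP BY C

Grouping 6: GROUP BY B

Grouping 7: GROUP BY A

Grouping 8: GROUP BY NULL

Result set: the union of the results of the eight groupings above, with duplicates kept.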
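The eight groupings are simply all subsets of the three columns. A minimal Python sketch of that enumeration follows; `cube_sets` is a hypothetical helper for illustration, not a Hive function, and the empty tuple again stands for the "GROUP BY NULL" shorthand.

```python
from itertools import combinations

def cube_sets(columns):
    """Grouping sets produced by GROUP BY <columns> WITH CUBE."""
    sets = []
    # Every subset of the columns, from the full list down to the empty set.
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets

sets = cube_sets(["A", "B", "C"])
print(len(sets))
for gset in sets:
    print(gset)
```

For n columns CUBE yields 2^n groupings, compared with n + 1 for ROLLUP, so the size of the result grows quickly as columns are added.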

########################################################

Cubes and Rollups

The general syntax is WITH CUBE/ROLLUP. It is used only with GROUP BY. CUBE creates a subtotal for every possible combination of the set of columns in its argument. Once we compute a CUBE on a set of dimensions, we can answer all possible aggregation questions on those dimensions.

It might also be worth mentioning here that
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).

The ROLLUP clause is used with GROUP BY to compute aggregates at the hierarchy levels of a dimension.
GROUP BY a, b, c WITH ROLLUP assumes that the hierarchy is "a" drilling down to "b" drilling down to "c".

GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

Other references:

http://blog.csdn.net/ljb522744686/article/details/

http://blog.csdn.net/mashroomxl/article/details/22578471

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-CubesandRollups
