hive学习之grouping set /cube/rollup

最新推荐文章于 2022-12-12 18:39:51 发布

cclovezbf

最新推荐文章于 2022-12-12 18:39:51 发布

阅读量1k

点赞数 1

分类专栏： hive 文章标签： hive grouping sets cube

本文链接：https://blog.csdn.net/cclovezbf/article/details/126705112

版权

hive 专栏收录该内容

49 篇文章 10 订阅

订阅专栏

GROUPING SETS clause

The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set. All GROUPING SET clauses can be logically expressed in terms of several GROUP BY queries connected by UNION. Table-1 shows several such equivalent statements. This is helpful in forming the idea of the GROUPING SETS clause. A blank set ( ) in the GROUPING SETS clause calculates the overall aggregate.

这句话就是说 grouping sets 让我们一句sql可以计算出多个group by的值。

Table 1 - GROUPING SET queries and the equivalent GROUP BY queries

Aggregate Query with GROUPING SETS	Equivalent Aggregate Query with GROUP BY
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )	SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)	SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)	SELECT a, null, SUM( c ) FROM tab1 GROUP BY a UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )	SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b UNION SELECT null, null, SUM( c ) FROM tab1

数据准备

create table test.cc_groupingset(
a int ,
b int ,
c int );
insert into test.cc_groupingset values
(1,1,1),
(2,2,2),
(3,3,3),
(4,4,4),
(1,null,null),
(null,2,null),
(null,null,3)

需求1 ，算出根据a,b分组后c的sum值

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets((a,b))

select a,b,sum(c) from test.cc_groupingset group by a,b

需求2，同时算出需求1也就是根据a，b分组的sum(c)，和根据a分组后的sum(c)

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets((a,b),a)

select a,b,sum(c) from test.cc_groupingset group by a,b

union

select a,null,sum(c) from test.cc_groupingset group by a

注意我下面标红的第3、8、12行就是group by a 的值。

需求3，同时根据a分组后的sum(c)和根据b分组后的sum(c)

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b)

explain
select null,b,sum(c) from test.cc_groupingset group by b
union
select a,null,sum(c) from test.cc_groupingset group by a

需求4，算个各个维度的sum(c),即根据group a / group b/ group a,b/ 不分组直接sum

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b,(a,b))

根据我之前的select *可以看到

group by a 有 5条 group by b 有 5条，group by a,b 有 7条，直接sum()有1条

所以总数=5+5+7+1=18条

现在来总结下用法。

就是在我们根据多个维度分组的时候，例如group by a,b,c,d 有时候我们也需要这几个维度中部分维度的聚合值例如group by a,b 。这个时候通过group by a,b,c,d grouping sets((a,b,c,d),(a,b))即可，看起来简洁，使用方便。

上面的grouping sets 还是要手动指定各个维度。

Cubes and Rollups

The general syntax is WITH CUBE/ROLLUP. It is used with the GROUP BY only. CUBE creates a subtotal of all possible combinations of the set of column in its argument. Once we compute a CUBE on a set of dimension, we can get answer to all possible aggregation questions on those dimensions.

It might be also worth mentioning here that
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).

ROLLUP clause is used with GROUP BY to compute the aggregate at the hierarchy levels of a dimension.
GROUP BY a, b, c with ROLLUP assumes that the hierarchy is "a" drilling down to "b" drilling down to "c".

GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

简单来说WITH CUBE 就是各个维度都包含，

with ROLLUP 就是根据第一个维度去看所有和这个维度一起的聚合结果

因为cube 和roll up计算的维度比较多，所以会有多个任务使用下列参数优化

hive.new.job.grouping.set.cardinality

Whether a new map-reduce job should be launched for grouping sets/rollups/cubes.
For a query like: select a, b, c, count(1) from T group by a, b, c with rollup;
4 rows are created per row: (a, b, c), (a, b, null), (a, null, null), (null, null, null)
This can lead to explosion across map-reduce boundary if the cardinality of T is very high
and map-side aggregation does not do a very good job.

This parameter decides if hive should add an additional map-reduce job. If the grouping set
cardinality (4 in the example above), is more than this value, a new MR job is added under the
assumption that the orginal group by will reduce the data size

grouping__id

还是之前的原始数据

我们之前写的sql 有个问题。就是结果虽然全部出来了，但是我根本分不清谁是谁，比如

a=1，b=null ，sum(c)=1 这个到底是根据a 分的组 b默认是null ，还是根据ab 分组，a=1 b=null

所以多了一个Grouping__ID

之前的18条数据，杂乱不堪。

select a,b,sum(c),grouping__ID from test.cc_groupingset group by a,b with cube order by grouping__ID

为什么要学这种。因为这个效率高。

explain select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b)

explain
select null,b,sum(c) from test.cc_groupingset group by b
union
select a,null,sum(c) from test.cc_groupingset group by a

通过explain可以看到reduce的个数减少了

cclovezbf

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
hive学习之grouping set /cube/rollup

就是在我们根据多个维度分组的时候，例如group by a,b,c,d 有时候我们也需要这几个维度中部分维度的聚合值例如group by a,b。这个时候通过group by a,b,c,d grouping sets((a,b,c,d),(a,b))即可，看起来简洁，使用方便。a=1，b=null ，sum(c)=1 这个到底是根据a 分的组 b默认是null ，还是根据ab 分组，a=1 b=null。需求2，同时算出需求1也就是根据a，b分组的sum(c)，和根据a分组后的sum(c)
复制链接

扫一扫