hive学习之grouping set /cube/rollup

GROUPING SETS clause

The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set. All GROUPING SET clauses can be logically expressed in terms of several GROUP BY queries connected by UNION. Table-1 shows several such equivalent statements. This is helpful in forming the idea of the GROUPING SETS clause. A blank set ( ) in the GROUPING SETS clause calculates the overall aggregate.

这句话就是说 grouping sets 让我们一句sql可以计算出多个group by的值。

Table 1 - GROUPING SET queries and the equivalent GROUP BY queries

Aggregate Query with GROUPING SETS

Equivalent Aggregate Query with GROUP BY

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )

SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

UNION

SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)

SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

UNION

SELECT null, b, SUM( c ) FROM tab1 GROUP BY b

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )

SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

UNION

SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null

UNION

SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b

UNION

SELECT null, null, SUM( c ) FROM tab1

数据准备

create table test.cc_groupingset(
a int ,
b int ,
c int );
insert into test.cc_groupingset values
(1,1,1),
(2,2,2),
(3,3,3),
(4,4,4),
(1,null,null),
(null,2,null),
(null,null,3)

需求1 , 算出 根据a,b分组后c的sum值

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets((a,b))

select a,b,sum(c) from test.cc_groupingset group by a,b

 

需求2,同时算出需求1也就是根据a,b分组的sum(c),和根据a分组后的sum(c)

select a,b,sum(c) from  test.cc_groupingset group by a,b GROUPING sets((a,b),a)

select a,b,sum(c) from test.cc_groupingset group by a,b

union 

select a,null,sum(c) from test.cc_groupingset group by a

注意我下面标红的第3、8、12行就是group by a 的值。

 

需求3,同时根据a分组后的sum(c)和根据b分组后的sum(c)

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b)

explain 
select null,b,sum(c) from test.cc_groupingset group by b
union 
select a,null,sum(c) from test.cc_groupingset group by a

 需求4,算个各个维度的sum(c),即根据group a / group b/ group a,b/ 不分组直接sum

select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b,(a,b))

根据我之前的select *可以看到 

group by a 有 5条 group by b 有 5条,group by a,b 有 7条 ,直接sum()有1条

所以总数=5+5+7+1=18条

 现在来总结下用法。

就是在我们根据多个维度分组的时候,例如group by a,b,c,d 有时候我们也需要这几个维度中部分维度的聚合值 例如group by a,b 。这个时候通过group by a,b,c,d grouping sets((a,b,c,d),(a,b))即可,看起来简洁,使用方便。

上面的grouping sets 还是要手动指定各个维度。

Cubes and Rollups

The general syntax is WITH CUBE/ROLLUP. It is used with the GROUP BY only. CUBE creates a subtotal of all possible combinations of the set of column in its argument. Once we compute a CUBE on a set of dimension, we can get answer to all possible aggregation questions on those dimensions.

It might be also worth mentioning here that
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).

ROLLUP clause is used with GROUP BY to compute the aggregate at the hierarchy levels of a dimension.
GROUP BY a, b, c with ROLLUP assumes that the hierarchy is "a" drilling down to "b" drilling down to "c".

GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

简单来说WITH CUBE 就是各个维度都包含,

with ROLLUP 就是根据第一个维度去看所有和这个维度一起的聚合结果

 因为cube 和roll up计算的维度比较多,所以会有多个任务使用下列参数优化

hive.new.job.grouping.set.cardinality

Whether a new map-reduce job should be launched for grouping sets/rollups/cubes.
For a query like: select a, b, c, count(1) from T group by a, b, c with rollup;
4 rows are created per row: (a, b, c), (a, b, null), (a, null, null), (null, null, null)
This can lead to explosion across map-reduce boundary if the cardinality of T is very high
and map-side aggregation does not do a very good job.

This parameter decides if hive should add an additional map-reduce job. If the grouping set
cardinality (4 in the example above), is more than this value, a new MR job is added under the
assumption that the orginal group by will reduce the data size

grouping__id

还是之前的原始数据

我们之前写的sql 有个问题。就是结果虽然全部出来了,但是我根本分不清谁是谁,比如

a=1,b=null ,sum(c)=1  这个到底是根据a 分的组 b默认是null ,还是根据ab 分组,a=1 b=null

所以多了一个Grouping__ID   

之前的18条数据,杂乱不堪。

select a,b,sum(c),grouping__ID from test.cc_groupingset group by a,b with cube order by grouping__ID 

 

为什么要学这种。因为这个效率高。

explain select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b)

explain 
select null,b,sum(c) from test.cc_groupingset group by b
union 
select a,null,sum(c) from test.cc_groupingset group by a

通过explain可以看到reduce的个数减少了 

 

 

  • 1
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值