【hive】hive grouping sets和GROUPING__ID的用法

前言​

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP 这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。

grouping sets根据不同的维度组合进行聚合,等价于将不同维度的group by的结果进行 union all,简单来说就是将多个不同维度的group by逻辑写在了 一个sql中。

数据准备:

vim /root/test.txt

2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1

将数据导入到hdfs目录上:

hdfs dfs -put /root/test.txt /tmp

创建表

use test;

create table cookie5(month string, day string, cookieid string) 
row format delimited fields terminated by ',';

load data inpath "/tmp/test.txt" into table cookie5;

select * from cookie5;

在这里插入图片描述

开始使用grouping sets

来条sql语句:

select 
    month,day,count(cookieid) 
from cookie5 
    group by month,day 
grouping sets (month,day);

查询结果如下:
在这里插入图片描述

上面这个sql等同于多个group by + union all

select month,NULL as day,count(cookieid) as nums 
from cookie5 
group by month 
union all select NULL as month,day,count(cookieid) as nums 
from cookie5 
group by day;

注意点:使用unionunion all必须保证各个select 集合的结果有相同个数的列,并且每个列的类型是一样的。union all的表字段必须匹配,也就是上文的month 需要用 NULL as month 来进行填充。

在这里插入图片描述

结果一致!但是grouping sets速度要比group by + union 快!!!!!

但是有引出一个问题?为什么grouping sets 要比 group by + union 速度要快?

首先:用explain来解释下hive执行计划:

explain select 
    month,day,count(cookieid) 
from cookie5 
    group by month,day 
grouping sets (month,day);

执行计划如下:

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: cookie5                         |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: month (type: string), day (type: string), cookieid (type: string) |
|               outputColumnNames: month, day, cookieid |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(cookieid)      |
|                 keys: month (type: string), day (type: string), '0' (type: string) |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1, _col2, _col3 |
|                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
|                   sort order: +++                  |
|                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
|                   Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col3 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col3   |
|           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|           pruneGroupingSetId: true                 |
|           Select Operator                          |
|             expressions: _col0 (type: string), _col1 (type: string), _col3 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.TextInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink    

再看另外一个:

explain select month,NULL as day,count(cookieid) as nums from cookie5 group by month 
union all 
select NULL as month,day,count(cookieid) as nums from cookie5 group by day;

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1, Stage-3      |
|   Stage-3 is a root stage                          |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: cookie5                         |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: month (type: string), cookieid (type: string) |
|               outputColumnNames: month, cookieid   |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(cookieid)      |
|                 keys: month (type: string)         |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col0 (type: string), UDFToString(null) (type: string), _col1 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.TextInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           TableScan                                |
|             Union                                  |
|               Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.TextInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-3                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: cookie5                         |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: day (type: string), cookieid (type: string) |
|               outputColumnNames: day, cookieid     |
|               Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(cookieid)      |
|                 keys: day (type: string)           |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
|           Select Operator                          |
|             expressions: UDFToString(null) (type: string), _col0 (type: string), _col1 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+--+

比较:通过explain 可以发现

  • grouping sets时,有2个stage,一次reduce
  • group by + union时,有4个stage,两次reduce

那么肯定用grouping sets时,速度会快。

GROUPING__ID的使用:

来条sql语句:

select 
  month,
  day,
  count(distinct cookieid) as uv,
  GROUPING__ID
from cookie5 
group by month,day 
grouping sets (month,day) 
order by GROUPING__ID;

在这里插入图片描述

效果等价于:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day;

结果说明:

  • 第一列是按照month进行分组

  • 第二列是按照day进行分组

  • 第三列是按照month或day分组是,统计这一组有几个不同的cookieid

  • 第四列grouping_id表示这一组结果属于哪个分组集合,根据grouping sets中的分组条件month,day

再来个例子:

SELECT  month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM cookie5 
GROUP BY month,day 
GROUPING SETS (month,day,(month,day)) 
ORDER BY GROUPING__ID;

等价于:

SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month 
UNION ALL 
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day;

在这里插入图片描述

grouping_id计算方法

grouping sets 中的每一种粒度,都对应唯一的 grouping__id 值,其计算公式与 group by 的顺序、当前粒度的字段有关。
具体计算方法如下:

  1. 将每个字段都抽象成二进制,有2个字段就是2位,有5个字段就有5位二进制数。
  2. 靠近 group by 的字段是低位, 远离的则是高位。
  3. hive中参与了组合的字段是1,未参与是0。
  4. 这样就形成了一个二进制数,这个二进制数转为十进制,即为当前粒度对应的 grouping__id。

在这里插入图片描述

举例

以上述对 cookie5 的3种粒度的统计结果为例:

我们的上面的sql 为 group by month, day, 遵循第一条原则,month 在二进制中处于低位,day 处于高位。

序号grouping set二进制值对应的十进制(grouping__id 的值)
0()000
1(month)011
2(day)102
3(month, day)113

这就是上面 grouping sets 的结果中 grouping__id 值的由来。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值