HiveQL DQL7—高级聚合


通过使用 GROUPING SETS, CUBE, 和 ROLLUP,Hive提供了高级聚合功能

Grouping sets

GROUPING SETS可以对同一个数据集执行多个GROUP BY操作。事实上,GROUPING SETS 是一种将多个GROUP BY结果集使用UNION ALL连接起来的简写方式。GROUPING SETS 关键字会在一个作业的一个阶段中执行完成所有的处理,这样会更高效。GROUPING SETS子句中的空白集 () 会计算整个聚合。下面会用几个示例来说明GROUPING SETS 的等价性。为了更好地理解,我们可以说GROUPING SETS的外层括号定义了要执行UNION ALL的数据,内部括号定义了在每个UNION ALL中要执行的GROUP BY 数据。下表是GROUPING SETS查询及其等效的GROUP BY查询的对应表

使用 GROUPING SETS的聚合查询使用GROUP BY的等价的聚合查询
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b
UNION
SELECT null, null, SUM( c ) FROM tab1

示例:使用一个字段对元素的GROUPING SETS

> SELECT 
name, 
start_date,
count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name, start_date 
GROUPING SETS((name, start_date));
+----------+-------------+----------+
|   name   | start_date  | sin_cnt  |
+----------+-------------+----------+
| Lucy     | 2010-01-03  | 1        |
| Michael  | 2014-01-29  | 1        |
| Steven   | 2012-11-03  | 1        |
| Will     | 2013-10-02  | 1        |
+----------+-------------+----------+

等价于

> SELECT 
name, 
start_date, 
count(sin_number) AS sin_cnt 
FROM employee_hr
GROUP BY name, start_date;
+----------+-------------+----------+
|   name   | start_date  | sin_cnt  |
+----------+-------------+----------+
| Lucy     | 2010-01-03  | 1        |
| Michael  | 2014-01-29  | 1        |
| Steven   | 2012-11-03  | 1        |
| Will     | 2013-10-02  | 1        |
+----------+-------------+----------+

示例:使用2个元素的GROUPING SETS

> SELECT 
name, start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name, start_date 
GROUPING SETS(name, start_date);
+----------+-------------+----------+
|   name   | start_date  | sin_cnt  |
+----------+-------------+----------+
| NULL     | 2010-01-03  | 1        |
| NULL     | 2012-11-03  | 1        |
| NULL     | 2013-10-02  | 1        |
| NULL     | 2014-01-29  | 1        |
| Lucy     | NULL        | 1        |
| Michael  | NULL        | 1        |
| Steven   | NULL        | 1        |
| Will     | NULL        | 1        |
+----------+-------------+----------+

等价于

> SELECT 
name, null as start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name
UNION ALL
SELECT 
null as name, start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY start_date;
+-----------+-----------------+--------------+
| _u1.name  | _u1.start_date  | _u1.sin_cnt  |
+-----------+-----------------+--------------+
| Lucy      | NULL            | 1            |
| Michael   | NULL            | 1            |
| Steven    | NULL            | 1            |
| Will      | NULL            | 1            |
| NULL      | 2010-01-03      | 1            |
| NULL      | 2012-11-03      | 1            |
| NULL      | 2013-10-02      | 1            |
| NULL      | 2014-01-29      | 1            |
+-----------+-----------------+--------------+

示例:使用2个元素的GROUPING SETS,一个字段对,一个是单独字段

> SELECT 
name, start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name, start_date 
GROUPING SETS((name, start_date), name);
+----------+-------------+----------+
|   name   | start_date  | sin_cnt  |
+----------+-------------+----------+
| Lucy     | NULL        | 1        |
| Lucy     | 2010-01-03  | 1        |
| Michael  | NULL        | 1        |
| Michael  | 2014-01-29  | 1        |
| Steven   | NULL        | 1        |
| Steven   | 2012-11-03  | 1        |
| Will     | NULL        | 1        |
| Will     | 2013-10-02  | 1        |
+----------+-------------+----------+

等价于

> SELECT 
name, start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name, start_date
UNION ALL
SELECT 
name, null as start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name;
+-----------+-----------------+--------------+
| _u1.name  | _u1.start_date  | _u1.sin_cnt  |
+-----------+-----------------+--------------+
| Lucy      | 2010-01-03      | 1            |
| Michael   | 2014-01-29      | 1            |
| Steven    | 2012-11-03      | 1            |
| Will      | 2013-10-02      | 1            |
| Lucy      | NULL            | 1            |
| Michael   | NULL            | 1            |
| Steven    | NULL            | 1            |
| Will      | NULL            | 1            |
+-----------+-----------------+--------------+

示例:使用4个元素的GROUPING SETS

> SELECT 
name, start_date, count(sin_number) as sin_cnt 
FROM employee_hr
GROUP BY name, start_date 
GROUPING SETS((name, start_date), name, start_date, ());
+----------+-------------+----------+
|   name   | start_date  | sin_cnt  |
+----------+-------------+----------+
| NULL     | NULL        | 4        |
| NULL     | 2010-01-03  | 1        |
| NULL     | 2012-11-03  | 1        |
| NULL     | 2013-10-02  | 1        |
| NULL     | 2014-01-29  | 1        |
| Lucy     | NULL        | 1        |
| Lucy     | 2010-01-03  | 1        |
| Michael  | NULL        | 1        |
| Michael  | 2014-01-29  | 1        |
| Steven   | NULL        | 1        |
| Steven   | 2012-11-03  | 1        |
| Will     | NULL        | 1        |
| Will     | 2013-10-02  | 1        |
+----------+-------------+----------+

等价于

> SELECT 
name, start_date, count(sin_number) AS sin_cnt 
FROM employee_hr
GROUP BY name, start_date
UNION ALL
SELECT 
name, null as start_date, count(sin_number) AS sin_cnt 
FROM employee_hr
GROUP BY name
UNION ALL
SELECT 
null as name, start_date, count(sin_number) AS sin_cnt 
FROM employee_hr
GROUP BY start_date
UNION ALL
SELECT 
null as name, null as start_date, count(sin_number) AS sin_cnt 
FROM employee_hr;
+-----------+-----------------+--------------+
| _u1.name  | _u1.start_date  | _u1.sin_cnt  |
+-----------+-----------------+--------------+
| Lucy      | 2010-01-03      | 1            |
| Michael   | 2014-01-29      | 1            |
| Steven    | 2012-11-03      | 1            |
| Will      | 2013-10-02      | 1            |
| Lucy      | NULL            | 1            |
| Michael   | NULL            | 1            |
| Steven    | NULL            | 1            |
| Will      | NULL            | 1            |
| NULL      | 2010-01-03      | 1            |
| NULL      | 2012-11-03      | 1            |
| NULL      | 2013-10-02      | 1            |
| NULL      | 2014-01-29      | 1            |
| NULL      | NULL            | 4            |
+-----------+-----------------+--------------+

Rollup 和 Cube

ROLLUP语句使SELECT语句能够跨指定的维度组计算多级聚合。ROLLUP语句是GROUP BY子句的简单扩展,具有高效性和最小的查询开销。与创建指定聚合级别的GROUPING SETS相比,ROLLUP创建n+1级别的聚合,其中n是分组列数。首先,它会计算GROUP BY子句中指定的标准聚合值。然后,创建一个更高级别的小计(subtotals),通过分组列的组合列表从右向左移动。如,GROUP BY a,b,c WITH ROLLUP 等价于 GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),())。
CUBE语句接受一组指定的分组列,并为其所有可能的组合创建聚合。如果为CUBE指定了n列,则会返回2n个聚合组合。如,GROUP BY a,b,c WITH CUBE 等价于 GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),©,())。
GROUPING__ID函数作为扩展来区分整个行和其他行。它返回GROUP BY后指定的每列的BIT向量的十进制值。返回的十进制数是从1和0的二进制数转换而来的,表示列是(1)否(0)在行中聚合。另一方面,grouping(…)函数还通过直接返回二进制1或0来指示GROUP BY子句中的列是否聚合。在下面的示例中,列的顺序从GROUP BY中最近的列(如name)开始计算。结果集中的第一行表明GROUP BY中没有使用任何列。

> SELECT
name, start_date, count(employee_id) as emp_id_cnt,
GROUPING__ID,
grouping(name) as gp_name,
grouping(start_date) as gp_sd
FROM employee_hr
GROUP BY name, start_date
WITH CUBE ORDER BY name, start_date;
+----------+-------------+-------------+---------------+----------+--------+
|   name   | start_date  | emp_id_cnt  | grouping__id  | gp_name  | gp_sd  |
+----------+-------------+-------------+---------------+----------+--------+
| NULL     | NULL        | 4           | 3             | 1        | 1      |
| NULL     | 2010-01-03  | 1           | 2             | 1        | 0      |
| NULL     | 2012-11-03  | 1           | 2             | 1        | 0      |
| NULL     | 2013-10-02  | 1           | 2             | 1        | 0      |
| NULL     | 2014-01-29  | 1           | 2             | 1        | 0      |
| Lucy     | NULL        | 1           | 1             | 0        | 1      |
| Lucy     | 2010-01-03  | 1           | 0             | 0        | 0      |
| Michael  | NULL        | 1           | 1             | 0        | 1      |
| Michael  | 2014-01-29  | 1           | 0             | 0        | 0      |
| Steven   | NULL        | 1           | 1             | 0        | 1      |
| Steven   | 2012-11-03  | 1           | 0             | 0        | 0      |
| Will     | NULL        | 1           | 1             | 0        | 1      |
| Will     | 2013-10-02  | 1           | 0             | 0        | 0      |
+----------+-------------+-------------+---------------+----------+--------+

聚合条件过滤 HAVING

Hive从 0.7.0中开始添加了HAVING语句用以支持聚合结果的过滤。通过使用HAVING,可以避免在GROUP BY语句之后使用子查询。

> SELECT
 gender_age.age
 FROM employee
 GROUP BY gender_age.age
 HAVING count(*)=1;
+-----------------+
| gender_age.age  |
+-----------------+
| 27              |
| 30              |
| 35              |
| 57              |
+-----------------+

HAVING语句中也支持使用别名

> SELECT 
gender_age.age, 
count(*) as cnt 
FROM employee 
GROUP BY gender_age.age 
HAVING cnt=1
+-----------------+------+
| gender_age.age  | cnt  |
+-----------------+------+
| 27              | 1    |
| 30              | 1    |
| 35              | 1    |
| 57              | 1    |
+-----------------+------+

TIP:HAVING 也支持对常规字段进行过滤,但是为了能获得更好的性能,对于这样的过滤类型还是建议使用WHERE子句,而不是HAVING

如果不使用HAVING,就需要使用子查询来代替

> SELECT a.age
FROM
(SELECT count(*) as cnt, gender_age.age
FROM employee GROUP BY gender_age.age
) a WHERE a.cnt<=1;
+--------+
| a.age  |
+--------+
| 27     |
| 30     |
| 35     |
| 57     |
+--------+

参考

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
书籍 Apache Hive Essentials Second Edition (by Dayong Du) Chapter 6

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值