目录
通过使用 GROUPING SETS, CUBE, 和 ROLLUP,Hive提供了高级聚合功能
Grouping sets
GROUPING SETS可以对同一个数据集执行多个GROUP BY操作。事实上,GROUPING SETS 是一种将多个GROUP BY结果集使用UNION ALL连接起来的简写方式。GROUPING SETS 关键字会在一个作业的一个阶段中执行完成所有的处理,这样会更高效。GROUPING SETS子句中的空白集 () 会计算整个聚合。下面会用几个示例来说明GROUPING SETS 的等价性。为了更好地理解,我们可以说GROUPING SETS的外层括号定义了要执行UNION ALL的数据,内部括号定义了在每个UNION ALL中要执行的GROUP BY 数据。下表是GROUPING SETS查询及其等效的GROUP BY查询的对应表
使用 GROUPING SETS的聚合查询 | 使用GROUP BY的等价的聚合查询 |
---|---|
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) ) | SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b |
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a) | SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a |
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b) | SELECT a, null, SUM( c ) FROM tab1 GROUP BY a UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY b |
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) ) | SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b UNION SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null UNION SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b UNION SELECT null, null, SUM( c ) FROM tab1 |
示例:使用一个字段对元素的GROUPING SETS
> SELECT
name,
start_date,
count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date));
+----------+-------------+----------+
| name | start_date | sin_cnt |
+----------+-------------+----------+
| Lucy | 2010-01-03 | 1 |
| Michael | 2014-01-29 | 1 |
| Steven | 2012-11-03 | 1 |
| Will | 2013-10-02 | 1 |
+----------+-------------+----------+
等价于
> SELECT
name,
start_date,
count(sin_number) AS sin_cnt
FROM employee_hr
GROUP BY name, start_date;
+----------+-------------+----------+
| name | start_date | sin_cnt |
+----------+-------------+----------+
| Lucy | 2010-01-03 | 1 |
| Michael | 2014-01-29 | 1 |
| Steven | 2012-11-03 | 1 |
| Will | 2013-10-02 | 1 |
+----------+-------------+----------+
示例:使用2个元素的GROUPING SETS
> SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS(name, start_date);
+----------+-------------+----------+
| name | start_date | sin_cnt |
+----------+-------------+----------+
| NULL | 2010-01-03 | 1 |
| NULL | 2012-11-03 | 1 |
| NULL | 2013-10-02 | 1 |
| NULL | 2014-01-29 | 1 |
| Lucy | NULL | 1 |
| Michael | NULL | 1 |
| Steven | NULL | 1 |
| Will | NULL | 1 |
+----------+-------------+----------+
等价于
> SELECT
name, null as start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name
UNION ALL
SELECT
null as name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY start_date;
+-----------+-----------------+--------------+
| _u1.name | _u1.start_date | _u1.sin_cnt |
+-----------+-----------------+--------------+
| Lucy | NULL | 1 |
| Michael | NULL | 1 |
| Steven | NULL | 1 |
| Will | NULL | 1 |
| NULL | 2010-01-03 | 1 |
| NULL | 2012-11-03 | 1 |
| NULL | 2013-10-02 | 1 |
| NULL | 2014-01-29 | 1 |
+-----------+-----------------+--------------+
示例:使用2个元素的GROUPING SETS,一个字段对,一个是单独字段
> SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date), name);
+----------+-------------+----------+
| name | start_date | sin_cnt |
+----------+-------------+----------+
| Lucy | NULL | 1 |
| Lucy | 2010-01-03 | 1 |
| Michael | NULL | 1 |
| Michael | 2014-01-29 | 1 |
| Steven | NULL | 1 |
| Steven | 2012-11-03 | 1 |
| Will | NULL | 1 |
| Will | 2013-10-02 | 1 |
+----------+-------------+----------+
等价于
> SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
UNION ALL
SELECT
name, null as start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name;
+-----------+-----------------+--------------+
| _u1.name | _u1.start_date | _u1.sin_cnt |
+-----------+-----------------+--------------+
| Lucy | 2010-01-03 | 1 |
| Michael | 2014-01-29 | 1 |
| Steven | 2012-11-03 | 1 |
| Will | 2013-10-02 | 1 |
| Lucy | NULL | 1 |
| Michael | NULL | 1 |
| Steven | NULL | 1 |
| Will | NULL | 1 |
+-----------+-----------------+--------------+
示例:使用4个元素的GROUPING SETS
> SELECT
name, start_date, count(sin_number) as sin_cnt
FROM employee_hr
GROUP BY name, start_date
GROUPING SETS((name, start_date), name, start_date, ());
+----------+-------------+----------+
| name | start_date | sin_cnt |
+----------+-------------+----------+
| NULL | NULL | 4 |
| NULL | 2010-01-03 | 1 |
| NULL | 2012-11-03 | 1 |
| NULL | 2013-10-02 | 1 |
| NULL | 2014-01-29 | 1 |
| Lucy | NULL | 1 |
| Lucy | 2010-01-03 | 1 |
| Michael | NULL | 1 |
| Michael | 2014-01-29 | 1 |
| Steven | NULL | 1 |
| Steven | 2012-11-03 | 1 |
| Will | NULL | 1 |
| Will | 2013-10-02 | 1 |
+----------+-------------+----------+
等价于
> SELECT
name, start_date, count(sin_number) AS sin_cnt
FROM employee_hr
GROUP BY name, start_date
UNION ALL
SELECT
name, null as start_date, count(sin_number) AS sin_cnt
FROM employee_hr
GROUP BY name
UNION ALL
SELECT
null as name, start_date, count(sin_number) AS sin_cnt
FROM employee_hr
GROUP BY start_date
UNION ALL
SELECT
null as name, null as start_date, count(sin_number) AS sin_cnt
FROM employee_hr;
+-----------+-----------------+--------------+
| _u1.name | _u1.start_date | _u1.sin_cnt |
+-----------+-----------------+--------------+
| Lucy | 2010-01-03 | 1 |
| Michael | 2014-01-29 | 1 |
| Steven | 2012-11-03 | 1 |
| Will | 2013-10-02 | 1 |
| Lucy | NULL | 1 |
| Michael | NULL | 1 |
| Steven | NULL | 1 |
| Will | NULL | 1 |
| NULL | 2010-01-03 | 1 |
| NULL | 2012-11-03 | 1 |
| NULL | 2013-10-02 | 1 |
| NULL | 2014-01-29 | 1 |
| NULL | NULL | 4 |
+-----------+-----------------+--------------+
Rollup 和 Cube
ROLLUP语句使SELECT语句能够跨指定的维度组计算多级聚合。ROLLUP语句是GROUP BY子句的简单扩展,具有高效性和最小的查询开销。与创建指定聚合级别的GROUPING SETS相比,ROLLUP创建n+1级别的聚合,其中n是分组列数。首先,它会计算GROUP BY子句中指定的标准聚合值。然后,创建一个更高级别的小计(subtotals),通过分组列的组合列表从右向左移动。如,GROUP BY a,b,c WITH ROLLUP 等价于 GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(a),())。
CUBE语句接受一组指定的分组列,并为其所有可能的组合创建聚合。如果为CUBE指定了n列,则会返回2n个聚合组合。如,GROUP BY a,b,c WITH CUBE 等价于 GROUP BY a,b,c GROUPING SETS ((a,b,c),(a,b),(b,c),(a,c),(a),(b),©,())。
GROUPING__ID函数作为扩展来区分整个行和其他行。它返回GROUP BY后指定的每列的BIT向量的十进制值。返回的十进制数是从1和0的二进制数转换而来的,表示列是(1)否(0)在行中聚合。另一方面,grouping(…)函数还通过直接返回二进制1或0来指示GROUP BY子句中的列是否聚合。在下面的示例中,列的顺序从GROUP BY中最近的列(如name)开始计算。结果集中的第一行表明GROUP BY中没有使用任何列。
> SELECT
name, start_date, count(employee_id) as emp_id_cnt,
GROUPING__ID,
grouping(name) as gp_name,
grouping(start_date) as gp_sd
FROM employee_hr
GROUP BY name, start_date
WITH CUBE ORDER BY name, start_date;
+----------+-------------+-------------+---------------+----------+--------+
| name | start_date | emp_id_cnt | grouping__id | gp_name | gp_sd |
+----------+-------------+-------------+---------------+----------+--------+
| NULL | NULL | 4 | 3 | 1 | 1 |
| NULL | 2010-01-03 | 1 | 2 | 1 | 0 |
| NULL | 2012-11-03 | 1 | 2 | 1 | 0 |
| NULL | 2013-10-02 | 1 | 2 | 1 | 0 |
| NULL | 2014-01-29 | 1 | 2 | 1 | 0 |
| Lucy | NULL | 1 | 1 | 0 | 1 |
| Lucy | 2010-01-03 | 1 | 0 | 0 | 0 |
| Michael | NULL | 1 | 1 | 0 | 1 |
| Michael | 2014-01-29 | 1 | 0 | 0 | 0 |
| Steven | NULL | 1 | 1 | 0 | 1 |
| Steven | 2012-11-03 | 1 | 0 | 0 | 0 |
| Will | NULL | 1 | 1 | 0 | 1 |
| Will | 2013-10-02 | 1 | 0 | 0 | 0 |
+----------+-------------+-------------+---------------+----------+--------+
聚合条件过滤 HAVING
Hive从 0.7.0中开始添加了HAVING语句用以支持聚合结果的过滤。通过使用HAVING,可以避免在GROUP BY语句之后使用子查询。
> SELECT
gender_age.age
FROM employee
GROUP BY gender_age.age
HAVING count(*)=1;
+-----------------+
| gender_age.age |
+-----------------+
| 27 |
| 30 |
| 35 |
| 57 |
+-----------------+
HAVING语句中也支持使用别名
> SELECT
gender_age.age,
count(*) as cnt
FROM employee
GROUP BY gender_age.age
HAVING cnt=1
+-----------------+------+
| gender_age.age | cnt |
+-----------------+------+
| 27 | 1 |
| 30 | 1 |
| 35 | 1 |
| 57 | 1 |
+-----------------+------+
TIP:HAVING 也支持对常规字段进行过滤,但是为了能获得更好的性能,对于这样的过滤类型还是建议使用WHERE子句,而不是HAVING
如果不使用HAVING,就需要使用子查询来代替
> SELECT a.age
FROM
(SELECT count(*) as cnt, gender_age.age
FROM employee GROUP BY gender_age.age
) a WHERE a.cnt<=1;
+--------+
| a.age |
+--------+
| 27 |
| 30 |
| 35 |
| 57 |
+--------+
参考
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
书籍 Apache Hive Essentials Second Edition (by Dayong Du) Chapter 6