hive 新增加了以下四种聚合功能,扩展了之前只能 group by 一组的模式。
1. GROUPING SETS clause
2. Grouping__ID function
3. Cubes and Rollups
4. hive.new.job.grouping.set.cardinality
1. GROUPING SETS 这个功能比较容易理解,就如下表所示。
Aggregate Query with GROUPING SETS
Equivalent Aggregate Query with GROUP BY
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )
SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY b
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )
SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b
UNION
SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null
UNION
SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b
UNION
SELECT null, null, SUM( c ) FROM tab1
2. Grouping__ID 这个功能是针对上面GROUPING SETS,当聚集某一列时会显示为NULL,而如果当前列的值正好有为NULL的,会导致冲突。这就引入了Grouping__ID的功能。Grouping__ID会返回一个bitvector(如果包含第几列,那2进制下该列就是1,比如选中1,2列来聚合,则Grouping__ID=3;只选中1列,则Grouping__ID=1;一列都没选中,则Grouping__ID=0)。这样可以通过Grouping__ID值来区分是聚合值还是当前列是null的值。
3. Cubes and rollups 这个功能纯粹是为了为了节省sql的长度,不用自己去写有规律的长的GROUPING SETS
GROUP BY a, b, c WITH CUBE
等同于:
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).
GROUP BY a, b, c, WITH ROLLUP
等同于:
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).
4. hive.new.job.grouping.set.cardinality 默认值是30,也就是在GROUPING SETS里面集合数据超过这个值的时候,会增加一个job来降低mapreduce的处理。
5.示例操作如下:
5.0、基础知识
(1)pv:page view(页面访问量)
(2)uv:user view(访问人数)
(3)uv表的数据如下
(4)统计每个月的用户浏览量,"distinct"关键字是去除重复的值
select month, count(distinct id) from uv group by month;
5.1、union all:表联合操作
eg:统计每天和每月的用户访问量
select month, count(distinct id) from uv group by month union all select day, count(distinct id) from uv group by day;
5.2、grouping sets:只统计指定字段
eg:统计每天和每月的用户访问量,grouping__id是分组的组号(两个下划线)
select month, day, count(distinct id), grouping__id from uv group by month, day grouping sets(month, day);
5.3、with cube:统计指定字段的所有组合(包括NULL)
eg:统计每天和每月的用户访问量,grouping__id是分组的组号(两个下划线)
select month, day, count(distinct id), grouping__id from uv group by month, day with cube order by grouping__id;
5.4、with rollup:逐层统计指定字段
eg:统计每天和每月的用户访问量,grouping__id是分组的组号(两个下划线),组号扔按所有组合排序,但只显示逐层统计的记录
select month, day, count(distinct id), grouping__id from uv group by month, day with rollup order by grouping__id;