Hive中的多维分析函数
众所周知,Hive是一个OLAP数据库,拥有着很强大的数据分析能力。
但是当我们一张表的维度信息特别多,并且需要各种维度组合的统计分析结果时就很不方便。比如一张表有5个维度,所有的维度组合是2的5次方也就是32种,那也就意味着我们需要写32个查询的HQL解决我们的需求。同时不方便的点还存在于我们需要将32个HQL的查询结果写入到32张不同的表中,无形中给我们未来查询我们所需要的信息带来了额外的麻烦。
本篇文章就是介绍Hive中非常好用的多维分析函数帮助我们解决以上两个痛点问题。
1.环境说明
当前环境采用Hadoop3.1.3以及Hive3.1.2版本!
2.准备数据
[whalex@hadoop102data]$ vim test.txt
1001 John male sale beijing
1002 Tom female sale shanghai
1003 Bob male sale beijing
1004 Alex male product shanghai
3.建表并导入数据
create table test(id string,name string,sex string,dept_name string,addr string) row format delimited fields terminated by '\t';
load data local inpath '/opt/whalex/data/test.txt' into table test;
4.Grouping Sets计算多种自定义维度组合问题
此时需要计算如下需求:
- 按照性别分组计算每种性别的人数
- 按照部门和性别分组计算每个部门每种性别的人数
- 按照地区和性别分组计算每个地区每种性别的人数
我们可以采用如下3条HQL语句计算并插入到不同的3个结果表中
select sex,count(*) ct from test group by sex;
select dept_name,sex,count(*) ct from test group by dept_name,sex;
select addr,sex,count(*) ct from test group by addr,sex;
其实在Hive中提供了Grouping Sets帮助我们解决上述问题,Grouping Sets说明如下图所示:
所以以上3条HQL可以合并为如下一条HQL
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((sex),(dept_name,sex),(addr,sex)) order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | male | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | male | 1 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
+------------+-----------+---------+-----+
注:HQL中的order by只是为了让输出结果更好看一些,没有其他意义。
5.With Cube解决计算所有维度组合问题
此时需要计算地区、部门以及性别三个维度所有组合的聚合数据
我们当然还可以使用Grouping Sets来解决问题,HQL如下
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((addr,dept_name,sex),(addr,dept_name),(addr,sex),(dept_name,sex),(addr),(dept_name),(sex),()) order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | NULL | 4 |
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | NULL | 2 |
| NULL | beijing | male | 2 |
| NULL | shanghai | NULL | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | NULL | 1 |
| product | NULL | male | 1 |
| product | shanghai | NULL | 1 |
| product | shanghai | male | 1 |
| sale | NULL | NULL | 3 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
| sale | beijing | NULL | 2 |
| sale | beijing | male | 2 |
| sale | shanghai | NULL | 1 |
| sale | shanghai | female | 1 |
+------------+-----------+---------+-----+
此时我们也可以使用With Cube来解决问题
GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).
所以HQL可以改为:
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex with cube order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | NULL | 4 |
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | NULL | 2 |
| NULL | beijing | male | 2 |
| NULL | shanghai | NULL | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | NULL | 1 |
| product | NULL | male | 1 |
| product | shanghai | NULL | 1 |
| product | shanghai | male | 1 |
| sale | NULL | NULL | 3 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
| sale | beijing | NULL | 2 |
| sale | beijing | male | 2 |
| sale | shanghai | NULL | 1 |
| sale | shanghai | female | 1 |
+------------+-----------+---------+-----+
6.With Rollup解决计算层级维度组合问题
数据:
1 2020 12 31 56
2 2020 12 31 89
3 2021 01 01 78
4 2021 01 01 67
5 2021 01 02 56
建表:
create table test1(order_id int,year string,month string,day string,order_amount int) row format delimited fields terminated by '\t';
现在需要按照时间统计销售总额,注意按照时间聚合数据的时候,单独的月份、单独的天、月份和天以及年和天的维度聚合出来的指标是没有意义的,也就是说按照时间维度聚合指标时只有“年月日”、“年月”、“年”以及不添加维度信息计算总和时才是有意义的。
这种像年月日维度的可以称为层级维度,同样可以使用Grouping Sets来解决该需求,HQL如下:
select year,month,day,sum(order_amount) from test1 group by year,month,day grouping sets((year,month,day),(year,month),(year),()) order by year,month,day;
+-------+--------+-------+------+
| year | month | day | _c3 |
+-------+--------+-------+------+
| NULL | NULL | NULL | 346 |
| 2020 | NULL | NULL | 145 |
| 2020 | 12 | NULL | 145 |
| 2020 | 12 | 31 | 145 |
| 2021 | NULL | NULL | 201 |
| 2021 | 01 | NULL | 201 |
| 2021 | 01 | 01 | 145 |
| 2021 | 01 | 02 | 56 |
+-------+--------+-------+------+
此时我们还可以使用With Rollup来解决该需求
GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).
所以HQL可以改为:
select year,month,day,sum(order_amount) from test1 group by year,month,day with rollup order by year,month,day;
+-------+--------+-------+------+
| year | month | day | _c3 |
+-------+--------+-------+------+
| NULL | NULL | NULL | 346 |
| 2020 | NULL | NULL | 145 |
| 2020 | 12 | NULL | 145 |
| 2020 | 12 | 31 | 145 |
| 2021 | NULL | NULL | 201 |
| 2021 | 01 | NULL | 201 |
| 2021 | 01 | 01 | 145 |
| 2021 | 01 | 02 | 56 |
+-------+--------+-------+------+
7.总结
Hive中提供了grouping sets、with cube以及with rollup来解决多维分析的问题,当维度特别多时,我们根据具体情况可以使用对应的函数来解决问题。
我们可以将结果输出到一张表中方便后续查询,比如我们是a,b,c三个维度使用with cube计算的结果集,当我们只需要group by a的结果时,我们可以使用select a,ct from result_table where a is not null and b is null and c is null;
这样的HQL获取需要的结果,避免了我们在开头提到了多条HQL需要使用多张表来保存数据给最后查询带来麻烦的问题。