Hive中的多维分析函数

最新推荐文章于 2023-12-07 17:28:00 发布

沙漏的离歌oO

最新推荐文章于 2023-12-07 17:28:00 发布

阅读量2.8k

点赞数 18

分类专栏： Hive技术分享文章标签：大数据 hive

本文链接：https://blog.csdn.net/weixin_41875224/article/details/119744659

版权

Hive技术分享专栏收录该内容

2 篇文章 2 订阅

订阅专栏

Hive中的多维分析函数

众所周知，Hive是一个OLAP数据库，拥有着很强大的数据分析能力。
但是当我们一张表的维度信息特别多，并且需要各种维度组合的统计分析结果时就很不方便。比如一张表有5个维度，所有的维度组合是2的5次方也就是32种，那也就意味着我们需要写32个查询的HQL解决我们的需求。同时不方便的点还存在于我们需要将32个HQL的查询结果写入到32张不同的表中，无形中给我们未来查询我们所需要的信息带来了额外的麻烦。
本篇文章就是介绍Hive中非常好用的多维分析函数帮助我们解决以上两个痛点问题。

1.环境说明

当前环境采用Hadoop3.1.3以及Hive3.1.2版本!

2.准备数据

[whalex@hadoop102data]$ vim test.txt
1001    John    male    sale    beijing 
1002    Tom     female  sale    shanghai
1003    Bob     male    sale    beijing
1004    Alex    male    product shanghai

3.建表并导入数据

create table test(id string,name string,sex string,dept_name string,addr string) row format delimited fields terminated by '\t';
load data local inpath '/opt/whalex/data/test.txt' into table test;

4.Grouping Sets计算多种自定义维度组合问题

此时需要计算如下需求：

按照性别分组计算每种性别的人数
按照部门和性别分组计算每个部门每种性别的人数
按照地区和性别分组计算每个地区每种性别的人数

我们可以采用如下3条HQL语句计算并插入到不同的3个结果表中

select sex,count(*) ct from test group by sex;
select dept_name,sex,count(*) ct from test group by dept_name,sex;
select addr,sex,count(*) ct from test group by addr,sex;

其实在Hive中提供了Grouping Sets帮助我们解决上述问题，Grouping Sets说明如下图所示：
在这里插入图片描述

所以以上3条HQL可以合并为如下一条HQL

select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((sex),(dept_name,sex),(addr,sex)) order by dept_name,addr,sex;

+------------+-----------+---------+-----+
| dept_name  |   addr    |   sex   | ct  |
+------------+-----------+---------+-----+
| NULL       | NULL      | female  | 1   |
| NULL       | NULL      | male    | 3   |
| NULL       | beijing   | male    | 2   |
| NULL       | shanghai  | female  | 1   |
| NULL       | shanghai  | male    | 1   |
| product    | NULL      | male    | 1   |
| sale       | NULL      | female  | 1   |
| sale       | NULL      | male    | 2   |
+------------+-----------+---------+-----+

注：HQL中的order by只是为了让输出结果更好看一些，没有其他意义。

5.With Cube解决计算所有维度组合问题

此时需要计算地区、部门以及性别三个维度所有组合的聚合数据
我们当然还可以使用Grouping Sets来解决问题，HQL如下

select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((addr,dept_name,sex),(addr,dept_name),(addr,sex),(dept_name,sex),(addr),(dept_name),(sex),()) order by dept_name,addr,sex;

+------------+-----------+---------+-----+
| dept_name  |   addr    |   sex   | ct  |
+------------+-----------+---------+-----+
| NULL       | NULL      | NULL    | 4   |
| NULL       | NULL      | female  | 1   |
| NULL       | NULL      | male    | 3   |
| NULL       | beijing   | NULL    | 2   |
| NULL       | beijing   | male    | 2   |
| NULL       | shanghai  | NULL    | 2   |
| NULL       | shanghai  | female  | 1   |
| NULL       | shanghai  | male    | 1   |
| product    | NULL      | NULL    | 1   |
| product    | NULL      | male    | 1   |
| product    | shanghai  | NULL    | 1   |
| product    | shanghai  | male    | 1   |
| sale       | NULL      | NULL    | 3   |
| sale       | NULL      | female  | 1   |
| sale       | NULL      | male    | 2   |
| sale       | beijing   | NULL    | 2   |
| sale       | beijing   | male    | 2   |
| sale       | shanghai  | NULL    | 1   |
| sale       | shanghai  | female  | 1   |
+------------+-----------+---------+-----+

此时我们也可以使用With Cube来解决问题

GROUP BY a, b, c WITH CUBE is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).

所以HQL可以改为：

select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex with cube order by dept_name,addr,sex;

+------------+-----------+---------+-----+
| dept_name  |   addr    |   sex   | ct  |
+------------+-----------+---------+-----+
| NULL       | NULL      | NULL    | 4   |
| NULL       | NULL      | female  | 1   |
| NULL       | NULL      | male    | 3   |
| NULL       | beijing   | NULL    | 2   |
| NULL       | beijing   | male    | 2   |
| NULL       | shanghai  | NULL    | 2   |
| NULL       | shanghai  | female  | 1   |
| NULL       | shanghai  | male    | 1   |
| product    | NULL      | NULL    | 1   |
| product    | NULL      | male    | 1   |
| product    | shanghai  | NULL    | 1   |
| product    | shanghai  | male    | 1   |
| sale       | NULL      | NULL    | 3   |
| sale       | NULL      | female  | 1   |
| sale       | NULL      | male    | 2   |
| sale       | beijing   | NULL    | 2   |
| sale       | beijing   | male    | 2   |
| sale       | shanghai  | NULL    | 1   |
| sale       | shanghai  | female  | 1   |
+------------+-----------+---------+-----+

6.With Rollup解决计算层级维度组合问题

数据：

1       2020    12      31      56
2       2020    12      31      89
3       2021    01      01      78
4       2021    01      01      67
5       2021    01      02      56

建表：

create table test1(order_id int,year string,month string,day string,order_amount int) row format delimited fields terminated by '\t';

现在需要按照时间统计销售总额，注意按照时间聚合数据的时候，单独的月份、单独的天、月份和天以及年和天的维度聚合出来的指标是没有意义的，也就是说按照时间维度聚合指标时只有“年月日”、“年月”、“年”以及不添加维度信息计算总和时才是有意义的。
这种像年月日维度的可以称为层级维度，同样可以使用Grouping Sets来解决该需求，HQL如下：

select year,month,day,sum(order_amount) from test1 group by year,month,day grouping sets((year,month,day),(year,month),(year),()) order by year,month,day;

+-------+--------+-------+------+
| year  | month  |  day  | _c3  |
+-------+--------+-------+------+
| NULL  | NULL   | NULL  | 346  |
| 2020  | NULL   | NULL  | 145  |
| 2020  | 12     | NULL  | 145  |
| 2020  | 12     | 31    | 145  |
| 2021  | NULL   | NULL  | 201  |
| 2021  | 01     | NULL  | 201  |
| 2021  | 01     | 01    | 145  |
| 2021  | 01     | 02    | 56   |
+-------+--------+-------+------+

此时我们还可以使用With Rollup来解决该需求

GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

所以HQL可以改为：

select year,month,day,sum(order_amount) from test1 group by year,month,day with rollup order by year,month,day;

+-------+--------+-------+------+
| year  | month  |  day  | _c3  |
+-------+--------+-------+------+
| NULL  | NULL   | NULL  | 346  |
| 2020  | NULL   | NULL  | 145  |
| 2020  | 12     | NULL  | 145  |
| 2020  | 12     | 31    | 145  |
| 2021  | NULL   | NULL  | 201  |
| 2021  | 01     | NULL  | 201  |
| 2021  | 01     | 01    | 145  |
| 2021  | 01     | 02    | 56   |
+-------+--------+-------+------+

7.总结

Hive中提供了grouping sets、with cube以及with rollup来解决多维分析的问题，当维度特别多时，我们根据具体情况可以使用对应的函数来解决问题。
我们可以将结果输出到一张表中方便后续查询，比如我们是a,b,c三个维度使用with cube计算的结果集，当我们只需要group by a的结果时，我们可以使用select a,ct from result_table where a is not null and b is null and c is null;这样的HQL获取需要的结果，避免了我们在开头提到了多条HQL需要使用多张表来保存数据给最后查询带来麻烦的问题。