Comparing the deduplication performance of DISTINCT and GROUP BY in Hive


Preface

  • OS: CentOS 7
  • Hadoop: 2.7.7
  • Hive: 2.3.0
  • Goal: measure the execution time of deduplication queries that use DISTINCT versus GROUP BY on a column, for different numbers of distinct values in that column, and compare the two approaches' performance.
  • Experiment table (timings to be filled in below):

Table   Rows     Distinct values in queried column   DISTINCT   GROUP BY
tab_1   100000   3
tab_2   100000   10000

Experiment

1) Create the test tables

drop table if exists tab_1;
create table tab_1(
    id int,
    value int
)
row format delimited
fields terminated by '\t';

drop table if exists tab_2;
create table tab_2 like tab_1;

2) Load the test datasets

Test dataset tab_1.txt:

1	1
2	1
3	1
4	3
5	1
...
99997	3
99998	2
99999	3
100000	2

Test dataset tab_2.txt:

1	3715
2	7211
3	4909
4	2913
5	9839
...
99997	2884
99998	698
99999	4839
100000	2101

Load each dataset into its corresponding table:

load data local inpath '/tmp/hive/data/tbl/tab_1.txt' overwrite into table tab_1;
load data local inpath '/tmp/hive/data/tbl/tab_2.txt' overwrite into table tab_2;
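The two datasets are only shown in excerpt above. As a minimal sketch (the generator script and the exact values are my own, not from the article), files of the same shape can be produced locally before loading:

```shell
# Hypothetical generators for the two elided datasets ("..." above):
# 100,000 tab-separated rows each; tab_1's value column is drawn from
# 3 distinct values, tab_2's from up to 10,000. The exact values differ
# from the article's samples -- only the shape matters.
mkdir -p /tmp/hive/data/tbl
seq 1 100000 | awk 'BEGIN { srand(42); OFS = "\t" } { print $1, int(rand() * 3) + 1 }' \
    > /tmp/hive/data/tbl/tab_1.txt
seq 1 100000 | awk 'BEGIN { srand(42); OFS = "\t" } { print $1, int(rand() * 10000) + 1 }' \
    > /tmp/hive/data/tbl/tab_2.txt
```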

3) Run the queries and record execution times

Disable automatic local mode:

hive> set hive.exec.mode.local.auto = false;

Manually set the number of reducers:

hive> set mapreduce.job.reduces = 3;

Run the test queries and record their execution times:

select distinct(value) from tab_1; -- 31.335s
select value from tab_1 group by value; -- 31.587s
select distinct(value) from tab_2; -- 32.376s
select value from tab_2 group by value; -- 33.834s

4) Compare the execution plans

  • explain select distinct(value) from tab_1;
0: jdbc:hive2://hadoop101:10000/default (test)> explain select distinct(value) from tab_1;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: tab_1                           |
|             Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: value (type: int)       |
|               outputColumnNames: value             |
|               Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: value (type: int)            |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: int) |
|                   Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: int)              |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

  • explain select value from tab_1 group by value;
0: jdbc:hive2://hadoop101:10000/default (test)> explain select value from tab_1 group by value;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: tab_1                           |
|             Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: value (type: int)       |
|               outputColumnNames: value             |
|               Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: value (type: int)            |
|                 mode: hash                         |
|                 outputColumnNames: _col0           |
|                 Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: int) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: int) |
|                   Statistics: Num rows: 197223 Data size: 788895 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: int)              |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 98611 Data size: 394445 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

5) Results

Table   Rows     Distinct values in queried column   DISTINCT   GROUP BY
tab_1   100000   3                                   31.335s    31.587s
tab_2   100000   10000                               32.376s    33.834s

Conclusion:

In Hive 2.3.0, deduplicating with DISTINCT and deduplicating with GROUP BY produce identical execution plans and near-identical execution times, so the two approaches have essentially the same deduplication performance.

The procedure and conclusion above may have shortcomings; corrections are welcome, and the conclusion is offered for reference only.

PS: Be careful when using aggregate functions in Hive. For a global aggregate such as COUNT(DISTINCT ...), Hive generally performs the final aggregation in a single reducer, even if you manually set the reducer count, so on a very large input that one reducer becomes a bottleneck and can fail. In that situation you can use a subquery to aggregate in steps: first group by some column inside the subquery and aggregate there, which spreads the work across multiple reducers, then perform the overall aggregation on the subquery's result in the outer query.
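The stepwise aggregation described above can be sketched against this experiment's tables (a sketch, not code from the article; timings will vary with the data and cluster):

```sql
-- Single-stage global aggregate: every distinct key funnels through
-- one reducer, which becomes the bottleneck on large inputs.
select count(distinct value) from tab_2;

-- Two-stage rewrite: the inner GROUP BY deduplicates `value` across
-- multiple reducers; the outer query then only counts the already
-- unique keys.
select count(1)
from (
    select value
    from tab_2
    group by value
) t;
```

The same idea underlies Hive's `hive.groupby.skewindata` setting, which splits a skewed group-by aggregation into two MapReduce jobs.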


End~
