先说结论,两者没有区别,先看执行计划
1、group by
explain
select prov_id
from dim.dim_city
group by prov_id
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: dim_city
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: prov_id (type: int)
outputColumnNames: prov_id
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: prov_id (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
2、distinct
explain
select distinct prov_id
from dim.dim_city
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: dim_city
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: prov_id (type: int)
outputColumnNames: prov_id
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: prov_id (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
执行过程完全一致,distinct在map端同样会先做group by聚合,而不是都在reduce端做这个操作,老版本的hive没有这个优化,都在reduce端执行的话会有很大的性能差异