Aggregation performed in the Reduce phase
hive (default)> set hive.map.aggr=false;
hive (default)> explain
> select s_age,sum(s_score) avg_score
> from student_tb_txt
> where s_age<20
> group by s_age;
The output is as follows:
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: student_tb_txt
Statistics: Num rows: 32240060 Data size: 22568042496 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (s_age < 20L) (type: boolean)
Statistics: Num rows: 10746686 Data size: 7522680365 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: s_age (type: bigint)
sort order: +
Map-reduce partition columns: s_age (type: bigint)
Statistics: Num rows: 10746686 Data size: 7522680365 Basic stats: COMPLETE Column stats: NONE
value expressions: s_score (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: complete
outputColumnNames: _col0, _col1
Statistics: Num rows: 5373343 Data size: 3761340182 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5373343 Data size: 3761340182 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
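The map side runs TableScan -> Filter Operator -> Reduce Output Operator.
The map phase scans the whole table, filters the rows by the where condition (s_age < 20), and sends the columns the query needs to the reducers: s_age as the key and s_score as the value.
The reduce side runs Group By Operator -> File Output Operator.
The reduce phase does the grouping. The aggregation function is sum, and mode: complete means the entire aggregation happens in the reduce phase. outputColumnNames: _col0, _col1 lists the columns the aggregation outputs. File Output Operator describes how the result is written out (the input format, output format, and serde).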
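The data flow of this plan can be sketched in a few lines of Python. This is only a conceptual illustration, not Hive's actual implementation: the sample rows, the two-mapper split, and the function names map_task, shuffle, and reduce_task are all made up for the example.

```python
from collections import defaultdict

# Hypothetical sample rows (s_age, s_score), split across two map tasks.
map_inputs = [
    [(18, 90), (19, 80), (21, 70), (18, 60)],
    [(19, 50), (18, 40), (22, 30)],
]

def map_task(rows):
    """TableScan -> Filter Operator -> Reduce Output Operator."""
    for s_age, s_score in rows:      # TableScan: read every row
        if s_age < 20:               # Filter Operator: s_age < 20
            yield s_age, s_score     # Reduce Output Operator: key=s_age, value=s_score

def shuffle(map_outputs):
    """Collect the emitted values under their key, as the shuffle does."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_task(grouped):
    """Group By Operator with mode: complete, summing the raw values per key."""
    return {key: sum(values) for key, values in sorted(grouped.items())}

map_outputs = [list(map_task(rows)) for rows in map_inputs]
shuffled_records = sum(len(output) for output in map_outputs)
result = reduce_task(shuffle(map_outputs))

print(shuffled_records)  # 5: every row that passes the filter is shuffled
print(result)            # {18: 190, 19: 130}
```

With no map-side aggregation, the number of records sent to the reducers equals the number of rows that survive the filter.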
Aggregation in both the Map and Reduce phases
set hive.map.aggr=true;
explain
select s_age,sum(s_score) avg_score
from student_tb_txt
where s_age<20
group by s_age;
The output is as follows:
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: student_tb_txt
Statistics: Num rows: 32240060 Data size: 22568042496 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (s_age < 20L) (type: boolean)
Statistics: Num rows: 10746686 Data size: 7522680365 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(s_score)
keys: s_age (type: bigint)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 10746686 Data size: 7522680365 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 10746686 Data size: 7522680365 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5373343 Data size: 3761340182 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5373343 Data size: 3761340182 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
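Compared with the plan above, this one has an extra Group By Operator in the map phase, in front of the Reduce Output Operator, with mode: hash. It means that each map task aggregates its own rows once before sending them on. Correspondingly, the Group By Operator in the reduce phase changes from mode: complete to mode: mergepartial, because it now merges the partial sums produced by the map tasks.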
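The effect of the map-side Group By Operator can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Hive's code: the sample rows and the function names map_task_with_hash_aggr and reduce_merge_partial are hypothetical, and the sketch keeps each mapper's whole hash table in memory, whereas a real map task may flush partial results early.

```python
from collections import defaultdict

# Hypothetical sample rows (s_age, s_score), split across two map tasks.
map_inputs = [
    [(18, 90), (19, 80), (21, 70), (18, 60)],
    [(19, 50), (18, 40), (22, 30)],
]

def map_task_with_hash_aggr(rows):
    """TableScan -> Filter Operator -> Group By Operator (mode: hash)."""
    partial = defaultdict(int)
    for s_age, s_score in rows:
        if s_age < 20:                   # Filter Operator: s_age < 20
            partial[s_age] += s_score    # hash aggregation inside the map task
    # Reduce Output Operator: one (key, partial sum) record per distinct key.
    return list(partial.items())

def reduce_merge_partial(map_outputs):
    """Group By Operator with mode: mergepartial, merging the partial sums."""
    merged = defaultdict(int)
    for output in map_outputs:
        for key, partial_sum in output:
            merged[key] += partial_sum
    return dict(sorted(merged.items()))

map_outputs = [map_task_with_hash_aggr(rows) for rows in map_inputs]
rows_after_filter = sum(1 for rows in map_inputs for s_age, _ in rows if s_age < 20)
shuffled_records = sum(len(output) for output in map_outputs)
result = reduce_merge_partial(map_outputs)

print(rows_after_filter)  # 5 rows pass the filter
print(shuffled_records)   # 4 records are shuffled (2 distinct keys per mapper)
print(result)             # {18: 190, 19: 130}
```

The final sums are unchanged, but each map task emits at most one record per distinct s_age, so the saving grows as more rows share the same key within a mapper.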