1. Data Preparation
join_d1.c1<int> | join_d1.c2<string> | join_d1.c3<double> | join_d1.c4<string>
1 | a | 1.1 | a
2 | b | 1.2 | b
3 | c | 1.3 | c
4 | d | 1.4 | d
2 | e | 1.5 | e
3 | f | 1.6 | f
2. Viewing the Execution Plans
Case1: select c1 from hivedemo.join_d1 where c3 > 1.2 group by c1
hive> explain select c1 from hivedemo.join_d1 where c3 > 1.2 group by c1;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: join_d1
Statistics: Num rows: 4 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (c3 > 1.2) (type: boolean)
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: c1 (type: int)
outputColumnNames: c1
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: c1 (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.031 seconds, Fetched: 48 row(s)
Map phase: c1 is used to build the HiveKey, and duplicate c1 values are collapsed by the map-side hash aggregation. When the operator closes, the buffered entries are flushed to the MapReduce shuffle: the key is the HiveKey built from c1, the value is empty (an empty BytesWritable).
Reduce phase: simply emit the key.
This SQL compiles into two stages:
Stage-0 is a FetchTask that depends on Stage-1; the FetchTask creates a FetchOperator to read the data written by Stage-1's FileSinkOperator.
Stage-1 is the MapReduce job:
Map phase
ExecMapper (MapOperator) reads each row -> TableScanOperator -> FilterOperator (c3 > 1.2) -> SelectOperator -> GroupByOperator
Reduce phase
ExecReducer reads the shuffled records -> GroupByOperator -> FileSinkOperator
Step by step:
Map phase
ExecMapper (MapOperator) reads the first row 1,a,1.1,a from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.1 > 1.2) is false, so the row is discarded; on to the next row.
ExecMapper (MapOperator) reads the second row 2,b,1.2,b from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.2 > 1.2) is false, so the row is discarded; on to the next row.
ExecMapper (MapOperator) reads the third row 3,c,1.3,c from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.3 > 1.2) is true -> SelectOperator projects c1 -> GroupByOperator (the hash code of 3 locates the entry in the hashAggregations table; the value is the aggregation buffer, empty in this case. If the key already exists the entry is updated, otherwise a new one is created).
... and so on through the remaining rows.
When the map phase finishes, the hashAggregations entries are flushed to the shuffle, with each hash-table key as the output Key and an empty value.
Reduce phase
ExecReducer reads the first shuffled record, e.g. key 3 with an empty value -> GroupByOperator (update the hashAggregation buffer; nothing to update in this case), and the current key is set to 3. Note: because MapReduce sorts by key, once a key differs from the current key the earlier key can never appear again, so the finished group can safely be flushed.
-> When the key changes, the completed group is flushed, i.e. GroupByOperator.forward() is called -> FileSinkOperator writes the result out.
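To make the two phases concrete, here is a minimal, self-contained sketch in plain Java that mimics the Case1 data flow: map-side hash deduplication of c1, a sorted shuffle, and a reduce side that streams over sorted keys and forwards each group once. The class name GroupByNoAggDemo and the in-memory "shuffle" are illustrative assumptions, not Hive's real operator classes.

import java.util.*;

public class GroupByNoAggDemo {
    public static void main(String[] args) {
        // join_d1 rows as c1,c2,c3,c4 (the sample data above)
        String[][] rows = {
            {"1","a","1.1","a"}, {"2","b","1.2","b"}, {"3","c","1.3","c"},
            {"4","d","1.4","d"}, {"2","e","1.5","e"}, {"3","f","1.6","f"}
        };

        // Map phase: FilterOperator (c3 > 1.2) -> SelectOperator (c1) -> hash GroupByOperator.
        // The hash table keeps one entry per key; the value is empty (no aggregate).
        Set<Integer> hashAgg = new LinkedHashSet<>();
        for (String[] row : rows) {
            if (Double.parseDouble(row[2]) > 1.2) {      // FilterOperator
                hashAgg.add(Integer.parseInt(row[0]));   // GroupByOperator, mode: hash
            }
        }

        // Shuffle: MapReduce sorts the emitted keys (sort order: +).
        List<Integer> shuffled = new ArrayList<>(hashAgg);
        Collections.sort(shuffled);

        // Reduce phase: keys arrive sorted, so each key change means the previous
        // group is complete and can be forwarded to the FileSinkOperator.
        Integer currentKey = null;
        for (Integer key : shuffled) {
            if (!key.equals(currentKey)) {
                System.out.println(key);                 // forward() -> FileSinkOperator
                currentKey = key;
            }
        }
    }
}

Run on the sample data, this prints 2, 3, 4, matching what the real plan produces.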
Case2: select c1,count(c2) from hivedemo.join_d1 where c3 > 1.2 group by c1
hive> explain select c1,count(c2) from hivedemo.join_d1 where c3 > 1.2 group by c1;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: join_d1
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (c3 > 1.2) (type: boolean)
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: c1 (type: int), c2 (type: string)
outputColumnNames: c1, c2
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(c2)
keys: c1 (type: int)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.038 seconds, Fetched: 51 row(s)
Map phase: c1 forms the HiveKey; for rows sharing the same c1, the value is fed into the aggregate function, i.e. the UDAF's iterate(AggregationBuffer agg, Object[] parameters).
Reduce phase: read each key and fold its values through the aggregate function, i.e. the UDAF's merge(AggregationBuffer agg, Object partial).
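The iterate/merge split above can be modeled in a few lines. This is a simplified sketch of the evaluator contract for count in plain Java; it deliberately omits the real GenericUDAFEvaluator ObjectInspector machinery, and the class name CountEvaluatorModel is made up for illustration.

public class CountEvaluatorModel {
    // plays the role of GenericUDAFEvaluator.AggregationBuffer
    static class CountBuffer { long count = 0; }

    // Map side: iterate(AggregationBuffer agg, Object[] parameters) is called
    // once per input row; count increments the buffer for non-null arguments.
    void iterate(CountBuffer agg, Object[] parameters) {
        if (parameters[0] != null) agg.count++;
    }

    // Map close: the partial result shipped through the shuffle
    // ("value expressions: _col1 (type: bigint)" in the plan above).
    Object terminatePartial(CountBuffer agg) { return agg.count; }

    // Reduce side (mode: mergepartial): merge(AggregationBuffer agg, Object partial)
    // adds each mapper's partial count into the final buffer.
    void merge(CountBuffer agg, Object partial) {
        if (partial != null) agg.count += (Long) partial;
    }

    // final result for one group key
    Object terminate(CountBuffer agg) { return agg.count; }
}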
This SQL compiles into two stages:
Stage-0 is a FetchTask that depends on Stage-1; the FetchTask creates a FetchOperator to read the data written by Stage-1's FileSinkOperator.
Stage-1 is the MapReduce job:
Map phase
ExecMapper (MapOperator) reads each row -> TableScanOperator -> FilterOperator (c3 > 1.2) -> SelectOperator -> GroupByOperator
Reduce phase
ExecReducer reads the shuffled records -> GroupByOperator -> FileSinkOperator
Step by step:
Map phase
ExecMapper (MapOperator) reads the first row 1,a,1.1,a from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.1 > 1.2) is false, so the row is discarded; on to the next row.
ExecMapper (MapOperator) reads the second row 2,b,1.2,b from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.2 > 1.2) is false, so the row is discarded; on to the next row.
ExecMapper (MapOperator) reads the third row 3,c,1.3,c from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.3 > 1.2) is true -> SelectOperator projects c1 and c2 -> GroupByOperator (the hash code of 3 locates the entry in the hashAggregations table; the value is the aggregation buffer, here the running count. If the key already exists the count is updated via iterate(), otherwise a new entry is created).
... and so on through the remaining rows.
When the map phase finishes, the hashAggregations entries are flushed to the shuffle, with each hash-table key as the output Key and the partial count as the value.
Reduce phase
ExecReducer reads the first shuffled record, e.g. key 3 with a partial count as value -> GroupByOperator (folds the partial into the hashAggregation buffer via merge()), and the current key is set to 3. Note: because MapReduce sorts by key, once a key differs from the current key the earlier key can never appear again, so the finished group can safely be flushed.
-> When the key changes, the completed group is flushed, i.e. GroupByOperator.forward() is called -> FileSinkOperator writes the result out.
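Putting both sides together, here is a runnable sketch of the Case2 data flow in plain Java. GroupByCountDemo is an illustrative name; the sketch assumes a single mapper and models the shuffle as an in-memory sorted map.

import java.util.*;

public class GroupByCountDemo {
    public static void main(String[] args) {
        // join_d1 rows as c1,c2,c3,c4
        String[][] rows = {
            {"1","a","1.1","a"}, {"2","b","1.2","b"}, {"3","c","1.3","c"},
            {"4","d","1.4","d"}, {"2","e","1.5","e"}, {"3","f","1.6","f"}
        };

        // Map phase: filter c3 > 1.2, then hash-aggregate: key = c1, buffer = count.
        Map<Integer, Long> hashAgg = new HashMap<>();
        for (String[] row : rows) {
            if (Double.parseDouble(row[2]) > 1.2 && row[1] != null) {
                hashAgg.merge(Integer.parseInt(row[0]), 1L, Long::sum); // iterate()
            }
        }

        // Shuffle: (key, partial count) pairs, sorted by key.
        TreeMap<Integer, Long> shuffled = new TreeMap<>(hashAgg);

        // Reduce phase (mode: mergepartial): with one mapper each key has a single
        // partial; with several mappers the partials for a key would be summed here.
        Map<Integer, Long> finalCounts = new LinkedHashMap<>();
        shuffled.forEach((key, partial) ->
            finalCounts.merge(key, partial, Long::sum));                // merge()

        finalCounts.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}

On the sample data this prints 2 -> 1, 3 -> 2, 4 -> 1.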
Case3: select c1,count(distinct(c2)) from hivedemo.join_d1 where c3 > 1.2 group by c1
hive> explain select c1,count(distinct(c2)) from hivedemo.join_d1 where c3 > 1.2 group by c1;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: join_d1
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (c3 > 1.2) (type: boolean)
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: c1 (type: int), c2 (type: string)
outputColumnNames: c1, c2
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT c2)
keys: c1 (type: int), c2 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.022 seconds, Fetched: 50 row(s)
Map phase: the pair (c1, c2) forms the key, and only the first occurrence of a new key feeds the value into the aggregate function, i.e. the UDAF's iterate(AggregationBuffer agg, Object[] parameters). Even if the same (c1, c2) pair occurs many times, it is therefore counted only once. When the map closes, the records are emitted with the composite (c1, c2) key (sort order ++), but partitioned by c1 alone, so all c2 values of a given c1 reach the same reducer.
Reduce phase: read each key and fold its values through the aggregate function, i.e. the UDAF's merge(AggregationBuffer agg, Object partial).
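A sketch of just the map side in plain Java (DistinctMapSideDemo is an illustrative name, not a Hive class), showing how keying on the pair makes the distinct count correct:

import java.util.*;

public class DistinctMapSideDemo {
    public static void main(String[] args) {
        // join_d1 rows as c1,c2,c3,c4
        String[][] rows = {
            {"1","a","1.1","a"}, {"2","b","1.2","b"}, {"3","c","1.3","c"},
            {"4","d","1.4","d"}, {"2","e","1.5","e"}, {"3","f","1.6","f"}
        };

        // hash GroupByOperator: the key is the pair (c1, c2), so a repeated pair
        // is a hash hit and the aggregation is only touched on first occurrence.
        Set<List<String>> hashAgg = new LinkedHashSet<>();
        for (String[] row : rows) {
            if (Double.parseDouble(row[2]) > 1.2) {         // FilterOperator
                hashAgg.add(Arrays.asList(row[0], row[1])); // dedup by (c1, c2)
            }
        }

        // Reduce Output Operator: the sort key is (c1, c2) (sort order: ++) but
        // the partition column is c1 alone, so one reducer sees all c2 of a c1.
        hashAgg.forEach(k -> System.out.println("key=" + k + "  partition=" + k.get(0)));
    }
}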
This SQL compiles into two stages:
Stage-0 is a FetchTask that depends on Stage-1; the FetchTask creates a FetchOperator to read the data written by Stage-1's FileSinkOperator.
Stage-1 is the MapReduce job:
Map phase
ExecMapper (MapOperator) reads each row -> TableScanOperator -> FilterOperator (c3 > 1.2) -> SelectOperator -> GroupByOperator
Reduce phase
ExecReducer reads the shuffled records -> GroupByOperator -> FileSinkOperator
Step by step:
Map phase
ExecMapper (MapOperator) reads the first row 1,a,1.1,a from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.1 > 1.2) is false, so the row is discarded; on to the next row.
ExecMapper (MapOperator) reads the second row 2,b,1.2,b from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.2 > 1.2) is false, so the row is discarded; on to the next row.
ExecMapper (MapOperator) reads the third row 3,c,1.3,c from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.3 > 1.2) is true -> SelectOperator projects c1 and c2 -> GroupByOperator (the hash code of the pair <3,c> locates the entry in the hashAggregations table; if the pair is new, an entry is created and the count is updated once; if it already exists, the row is a duplicate and is skipped).
... and so on through the remaining rows.
When the map phase finishes, the hashAggregations entries are flushed to the shuffle keyed by (c1, c2); note that this plan has no value expressions: the distinct column travels inside the key itself.
Reduce phase
ExecReducer reads the first shuffled record, e.g. key (3, c) -> GroupByOperator (a previously unseen c2 for group key 3 bumps the distinct count via the aggregate function), and the current group key is set to 3. Note: because MapReduce sorts by the full (c1, c2) key and partitions by c1, all records of a group arrive together and in order, so once c1 changes the previous group is complete and can be flushed.
-> When the group key changes, the completed group is flushed, i.e. GroupByOperator.forward() is called -> FileSinkOperator writes the result out.
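And the matching reduce side in plain Java (DistinctReduceSideDemo is an illustrative name), which counts distinct c2 values per c1 in one pass over the sorted composite keys:

import java.util.*;

public class DistinctReduceSideDemo {
    public static void main(String[] args) {
        // what the shuffle delivers for the sample data: sorted (c1, c2) pairs
        String[][] sortedKeys = { {"2","e"}, {"3","c"}, {"3","f"}, {"4","d"} };

        String currentC1 = null, lastC2 = null;
        long distinctCount = 0;
        for (String[] key : sortedKeys) {
            if (!key[0].equals(currentC1)) {
                if (currentC1 != null)                      // c1 changed: forward()
                    System.out.println(currentC1 + "\t" + distinctCount);
                currentC1 = key[0];
                distinctCount = 0;
                lastC2 = null;
            }
            if (!key[1].equals(lastC2)) {                   // a new distinct c2
                distinctCount++;                            // iterate-like update
                lastC2 = key[1];
            }
        }
        if (currentC1 != null)                              // flush the last group
            System.out.println(currentC1 + "\t" + distinctCount);
    }
}

On the sample data this prints 2 -> 1, 3 -> 2, 4 -> 1, the same result the query returns.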