[HIVE] Source-Code Walkthrough of SQL -> MR [GroupBy]

1. Data Preparation

 join_d1.c1<int>   join_d1.c2<string>   join_d1.c3<double>   join_d1.c4<string>
 1                 a                    1.1                  a
 2                 b                    1.2                  b
 3                 c                    1.3                  c
 4                 d                    1.4                  d
 2                 e                    1.5                  e
 3                 f                    1.6                  f

2. Viewing the Execution Plans

Case1: select c1 from hivedemo.join_d1 where c3 > 1.2 group by c1

hive> explain select c1 from hivedemo.join_d1 where c3 > 1.2 group by c1;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: join_d1
            Statistics: Num rows: 4 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (c3 > 1.2) (type: boolean)
              Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: c1 (type: int)
                outputColumnNames: c1
                Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  keys: c1 (type: int)
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: int)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.031 seconds, Fetched: 48 row(s)

Map phase: c1 is wrapped into a HiveKey and used as the key, so duplicate c1 values are collapsed by the map-side hash aggregation. When the operator closes, the hash table is flushed to the map output (shuffle): the key is the HiveKey built from c1 and the value is empty (a ByteWritable).

Reduce phase: simply emit the key.

 

This SQL is split into two stages.

Stage-0 is a FetchTask and depends on Stage-1. The FetchTask creates a FetchOperator that reads the data written by Stage-1's FileSinkOperator.

Stage-1 is the MapReduce job:

Map phase

ExecMapper (MapOperator) reads each row -> TableScanOperator -> FilterOperator (c3 > 1.2) -> SelectOperator -> GroupByOperator

Reduce phase

ExecReducer reads the shuffled data -> GroupByOperator -> FileSinkOperator
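To make this operator chain concrete, here is a minimal Java sketch of the map-side pipeline. These are toy classes with illustrative names, not Hive's actual Operator hierarchy; they only show how each row is handed from parent to child via forward().

// Simplified sketch of the map-side operator pipeline, mirroring
// TableScanOperator -> FilterOperator -> SelectOperator -> GroupByOperator.
abstract class Op {
    protected Op child;                                     // single child for simplicity

    Op chainTo(Op next) { this.child = next; return next; }

    void forward(Object[] row) { if (child != null) child.process(row); }

    abstract void process(Object[] row);

    void close() { if (child != null) child.close(); }      // close cascades down the tree
}

class TableScanOp extends Op {
    @Override void process(Object[] row) { forward(row); }                            // pass-through (a limit could be enforced here)
}

class FilterOp extends Op {
    @Override void process(Object[] row) { if ((Double) row[2] > 1.2) forward(row); } // keep rows with c3 > 1.2
}

class SelectOp extends Op {
    @Override void process(Object[] row) { forward(new Object[]{ row[0] }); }         // project c1 only
}

ExecMapper drives this tree by calling process(row) on the root for every input record; when the split is exhausted it calls close(), which is where the map-side GroupByOperator flushes its hash table.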

 

The process is as follows:

Map phase

ExecMapper (MapOperator) reads the first row 1,a,1.1,a from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.1 > 1.2) is false, so the row is discarded; move on to the next row

ExecMapper (MapOperator) reads the second row 2,b,1.2,b from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.2 > 1.2) is false, so the row is discarded; move on to the next row

ExecMapper (MapOperator) reads the third row 3,c,1.3,c from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.3 > 1.2) is true -> SelectOperator projects c1 -> GroupByOperator (the hash code of 3 becomes the hashAggregations key; the value is the result of the aggregation functions, empty in this example; if the key already exists its entry is updated, otherwise a new one is created)

.. and so on for the remaining rows
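The map-side hash aggregation can be pictured with the toy sketch below. It is a plain HashMap model, not Hive's GroupByOperator; for Case 1 there are no aggregation functions, so the buffer stays empty and the hash table does nothing more than deduplicate keys.

import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Toy model of map-side hash aggregation (mode: hash) for "group by c1"
// with no aggregation functions: the hash table only deduplicates keys.
class MapSideGroupBy {
    private final Map<Integer, Object[]> hashAggregations = new HashMap<>();

    // Called for every row that survives the filter and the projection.
    void process(int c1) {
        // Look up the group key; create an (empty) aggregation buffer on first sight.
        hashAggregations.computeIfAbsent(c1, k -> new Object[0]);
        // With aggregation functions, this is where iterate() would update the buffer.
    }

    // Called when the mapper closes: every (key, empty value) pair is emitted
    // to the shuffle through the ReduceSinkOperator.
    void close(BiConsumer<Integer, Object[]> emit) {
        hashAggregations.forEach(emit);
    }
}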

 

When the map phase finishes, the hash table is flushed to the map output (shuffle), with the hashAggregations key as the output key.

Reduce phase

ExecReducer reads the first shuffled record, e.g. key 3 with an empty value -> GroupByOperator (updates the aggregation buffer; nothing to update in this example) and sets the current key to 3. Note: because MapReduce sorts records by key, once an incoming key differs from the current key, the previous key can never appear again, so its result can be flushed downstream.

-> When a different key arrives, the buffered result is flushed, i.e. GroupByOperator.forward() is called -> FileSinkOperator
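Since the shuffle delivers records to the reducer already sorted by key, the reduce-side GroupByOperator (mode: mergepartial) only ever keeps one group open. A sketch of that flush-on-key-change logic, with illustrative names rather than Hive's real implementation:

import java.util.function.Consumer;

// Reduce-side grouping sketch: records arrive sorted by key, so when the key
// changes the previous group is complete and can be forwarded downstream.
class ReduceSideGroupBy {
    private Integer currentKey = null;

    void process(int key, Consumer<Integer> forwardToFileSink) {
        if (currentKey == null) {
            currentKey = key;                         // first group
        } else if (currentKey != key) {
            forwardToFileSink.accept(currentKey);     // key changed: the previous group is finished
            currentKey = key;
        }
        // With aggregation functions, this is where merge() would fold in the partial value.
    }

    void close(Consumer<Integer> forwardToFileSink) {
        if (currentKey != null) forwardToFileSink.accept(currentKey);   // flush the last group
    }
}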

    

Case2: select c1,count(c2) from hivedemo.join_d1 where c3 > 1.2 group by c1

hive> explain select c1,count(c2) from hivedemo.join_d1 where c3 > 1.2 group by c1;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: join_d1
            Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (c3 > 1.2) (type: boolean)
              Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: c1 (type: int), c2 (type: string)
                outputColumnNames: c1, c2
                Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count(c2)
                  keys: c1 (type: int)
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: int)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col1 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.038 seconds, Fetched: 51 row(s)

Map phase: c1 is wrapped into a HiveKey and used as the key; rows sharing the same c1 have their values folded into the aggregation function, i.e. the UDAF's iterate(AggregationBuffer agg, Object[] parameters).

Reduce phase: read the key and fold the partial values into the aggregation function, i.e. the UDAF's merge(AggregationBuffer agg, Object partial).
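The two UDAF entry points have roughly the following roles for count(c2). This is a simplified stand-alone model of count semantics, not Hive's GenericUDAFCount source:

// Simplified model of a count UDAF: iterate() consumes raw rows on the map
// side, merge() folds partial counts together on the reduce side.
class CountAggregation {
    static class AggregationBuffer { long value = 0L; }

    // Map side: called once per row of the group; count(c2) skips NULL arguments.
    void iterate(AggregationBuffer agg, Object[] parameters) {
        if (parameters[0] != null) agg.value++;
    }

    // Reduce side: called once per partial count produced by a mapper.
    void merge(AggregationBuffer agg, Object partial) {
        if (partial != null) agg.value += ((Number) partial).longValue();
    }

    Object terminate(AggregationBuffer agg) { return agg.value; }    // final bigint result
}

Splitting the work into iterate and merge is what allows the map-side GroupByOperator (mode: hash) to pre-aggregate, so only one partial count per key needs to be shuffled to the reducer.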

 

This SQL is split into two stages.

Stage-0 is a FetchTask and depends on Stage-1. The FetchTask creates a FetchOperator that reads the data written by Stage-1's FileSinkOperator.

Stage-1 is the MapReduce job:

Map phase

ExecMapper (MapOperator) reads each row -> TableScanOperator -> FilterOperator (c3 > 1.2) -> SelectOperator -> GroupByOperator

Reduce phase

ExecReducer reads the shuffled data -> GroupByOperator -> FileSinkOperator

 

The process is as follows:

Map phase

ExecMapper (MapOperator) reads the first row 1,a,1.1,a from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.1 > 1.2) is false, so the row is discarded; move on to the next row

ExecMapper (MapOperator) reads the second row 2,b,1.2,b from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.2 > 1.2) is false, so the row is discarded; move on to the next row

ExecMapper (MapOperator) reads the third row 3,c,1.3,c from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.3 > 1.2) is true -> SelectOperator projects c1 and c2 -> GroupByOperator (the hash code of 3 becomes the hashAggregations key; the value is the result of the aggregation function, a count in this example; if the key already exists its entry is updated, otherwise a new one is created)

.. and so on for the remaining rows

 

When the map phase finishes, the hash table is flushed to the map output (shuffle), with the hashAggregations key as the output key and the partial count as the value.

Reduce phase

ExecReducer reads the first shuffled record, e.g. key 3 with a partial count as the value -> GroupByOperator (folds the partial count into the aggregation buffer via merge()) and sets the current key to 3. Note: because MapReduce sorts records by key, once an incoming key differs from the current key, the previous key can never appear again, so its result can be flushed downstream.

-> When a different key arrives, the buffered result is flushed, i.e. GroupByOperator.forward() is called -> FileSinkOperator
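Applied to the sample data, only the rows (3,c,1.3), (4,d,1.4), (2,e,1.5) and (3,f,1.6) survive the filter c3 > 1.2. The map-side hash table therefore holds the partial counts {2 -> 1, 3 -> 2, 4 -> 1}; the shuffle sorts them by key, and the reducer merges the partials and writes the final result 2:1, 3:2, 4:1.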

 

Case3: select c1,count(distinct(c2)) from hivedemo.join_d1 where c3 > 1.2 group by c1

hive> explain select c1,count(distinct(c2)) from hivedemo.join_d1 where c3 > 1.2 group by c1;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: join_d1
            Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (c3 > 1.2) (type: boolean)
              Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: c1 (type: int), c2 (type: string)
                outputColumnNames: c1, c2
                Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count(DISTINCT c2)
                  keys: c1 (type: int), c2 (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: int), _col1 (type: string)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.022 seconds, Fetched: 50 row(s)

 

Map phase: the pair (c1, c2) forms the hash key, and the value is folded into the aggregation function, i.e. the UDAF's iterate(AggregationBuffer agg, Object[] parameters), only when that key is seen for the first time (a new key). So even if the same (c1, c2) pair occurs many times, it is counted by the aggregation only once. When the map side closes, the output is partitioned by c1 (while being sorted on both c1 and c2).

Reduce phase: read the key and fold the values into the aggregation function, i.e. the UDAF's merge(AggregationBuffer agg, Object partial).
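For count(distinct c2) the map-side hash key becomes the pair (c1, c2), and the buffer is only touched when that pair shows up for the first time. A toy sketch of that rule (illustrative, not Hive's GroupByOperator):

import java.util.HashMap;
import java.util.Map;

// Toy model of map-side hash aggregation for count(DISTINCT c2) group by c1:
// the hash key is (c1, c2) and the aggregation runs only for a brand-new key,
// so a repeated (c1, c2) pair never inflates the count.
class MapSideDistinctGroupBy {
    private final Map<String, long[]> hashAggregations = new HashMap<>();

    void process(int c1, String c2) {
        String key = c1 + "\u0001" + c2;               // toy composite key for (c1, c2)
        if (!hashAggregations.containsKey(key)) {      // new key: first time this pair is seen
            long[] buffer = new long[] { 1L };         // aggregate (count) only on the new key
            hashAggregations.put(key, buffer);
        }
        // An existing (c1, c2) entry is left untouched.
    }
}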

 

This SQL is split into two stages.

Stage-0 is a FetchTask and depends on Stage-1. The FetchTask creates a FetchOperator that reads the data written by Stage-1's FileSinkOperator.

Stage-1 is the MapReduce job:

Map phase

ExecMapper (MapOperator) reads each row -> TableScanOperator -> FilterOperator (c3 > 1.2) -> SelectOperator -> GroupByOperator

Reduce phase

ExecReducer reads the shuffled data -> GroupByOperator -> FileSinkOperator

 

The process is as follows:

Map phase

ExecMapper (MapOperator) reads the first row 1,a,1.1,a from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.1 > 1.2) is false, so the row is discarded; move on to the next row

ExecMapper (MapOperator) reads the second row 2,b,1.2,b from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.2 > 1.2) is false, so the row is discarded; move on to the next row

ExecMapper (MapOperator) reads the third row 3,c,1.3,c from HDFS -> TableScanOperator (a limit could be applied here) -> FilterOperator: (1.3 > 1.2) is true -> SelectOperator projects c1 and c2 -> GroupByOperator (the hash code of <3,c> becomes the hashAggregations key; the value is the result of the aggregation function, a count in this example; only a brand-new key triggers the aggregation, otherwise the existing entry is left as is)

.. and so on for the remaining rows

 

When the map phase finishes, the hash table is flushed to the map output (shuffle), with the (c1, c2) pair from hashAggregations as the output key; no value expressions are emitted.

Reduce phase

ExecReducer reads the first shuffled record, e.g. key (3, c) with an empty value -> GroupByOperator groups on c1 = 3 and counts each distinct c2 within that group (count(DISTINCT KEY._col1:0._col0)), setting the current group key to 3. Note: because MapReduce sorts records by key and partitions them by c1, once an incoming c1 differs from the current one, the previous group can never appear again, so its result can be flushed downstream.

-> When a different group key arrives, the buffered result is flushed, i.e. GroupByOperator.forward() is called -> FileSinkOperator
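On the reduce side the records arrive sorted on (c1, c2) but partitioned on c1, so inside one c1 group every change of c2 marks a new distinct value. A sketch of that counting rule (an illustrative model, not Hive's implementation):

import java.util.function.BiConsumer;

// Reduce-side sketch for count(DISTINCT c2) group by c1: input arrives sorted
// on (c1, c2) and partitioned on c1. A new c1 closes the previous group;
// within a group, each change of c2 adds one to the distinct count.
class ReduceSideDistinctCount {
    private Integer currentC1 = null;
    private String lastC2 = null;
    private long distinctCount = 0L;

    void process(int c1, String c2, BiConsumer<Integer, Long> forward) {
        if (currentC1 == null || currentC1 != c1) {
            if (currentC1 != null) forward.accept(currentC1, distinctCount);   // flush the finished group
            currentC1 = c1;
            lastC2 = null;
            distinctCount = 0L;
        }
        if (lastC2 == null || !lastC2.equals(c2)) {     // sorted input: a changed c2 is a new distinct value
            distinctCount++;
            lastC2 = c2;
        }
    }

    void close(BiConsumer<Integer, Long> forward) {
        if (currentC1 != null) forward.accept(currentC1, distinctCount);       // flush the last group
    }
}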

 

Reprinted from: https://my.oschina.net/u/204498/blog/2885371
