Running EXPLAIN on a HiveQL query and analyzing the result

create table src119(key string, value string); 

EXPLAIN 
FROM src119 SELECT key , count(distinct value) group by key 

ABSTRACT SYNTAX TREE: 
  (TOK_QUERY (TOK_FROM (TOK_TABREF src119)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL key)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL value)))) (TOK_GROUPBY (TOK_TABLE_OR_COL key)))) 

STAGE DEPENDENCIES: 
  Stage-1 is a root stage 
  Stage-2 depends on stages: Stage-1 
  Stage-0 is a root stage 

STAGE PLANS: 
  Stage: Stage-1 
    Map Reduce 
      Alias -> Map Operator Tree: 
        src119 
          TableScan 
            alias: src119 
            Select Operator 
              expressions: 
                    expr: key 
                    type: string 
                    expr: value 
                    type: string 
              outputColumnNames: key, value 
              Group By Operator 
                aggregations: 
                      expr: count(DISTINCT value) 
                bucketGroup: false 
                keys: 
                      expr: key 
                      type: string 
                      expr: value 
                      type: string 
                mode: hash 
                outputColumnNames: _col0, _col1, _col2 
                Reduce Output Operator 
                  key expressions: 
                        expr: _col0 
                        type: string 
                        expr: _col1 
                        type: string 
                  sort order: ++ 
                  Map-reduce partition columns: 
                        expr: _col0 
                        type: string 
                  tag: -1 
                  value expressions: 
                        expr: _col2 
                        type: bigint 
      Reduce Operator Tree: 
        Group By Operator 
          aggregations: 
                expr: count(DISTINCT KEY._col1:0._col0) 
          bucketGroup: false 
          keys: 
                expr: KEY._col0 
                type: string 
          mode: partials 
          outputColumnNames: _col0, _col1 
          File Output Operator 
            compressed: false 
            GlobalTableId: 0 
            table: 
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat 
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat 

  Stage: Stage-2 
    Map Reduce 
      Alias -> Map Operator Tree: 
        file:/tmp/tianzhao/hive_2011-06-11_05-50-09_095_6055107404619839036/-mr-10002 
            Reduce Output Operator 
              key expressions: 
                    expr: _col0 
                    type: string 
              sort order: + 
              Map-reduce partition columns: 
                    expr: _col0 
                    type: string 
              tag: -1 
              value expressions: 
                    expr: _col1 
                    type: bigint 
      Reduce Operator Tree: 
        Group By Operator 
          aggregations: 
                expr: count(VALUE._col0) 
          bucketGroup: false 
          keys: 
                expr: KEY._col0 
                type: string 
          mode: final 
          outputColumnNames: _col0, _col1 
          Select Operator 
            expressions: 
                  expr: _col0 
                  type: string 
                  expr: _col1 
                  type: bigint 
            outputColumnNames: _col0, _col1 
            File Output Operator 
              compressed: false 
              GlobalTableId: 0 
              table: 
                  input format: org.apache.hadoop.mapred.TextInputFormat 
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 

  Stage: Stage-0 
    Fetch Operator 
      limit: -1 


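The input data is (the separator between key and value is lost in this listing; the first record is key=86, value=val_87):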
86val_87 
238val_22 
86val_165 
409val_419 
86val_255 
238val_278 
86val_98 
484val_488 
311val_341 
238val_278 


FROM src119 SELECT key , count(distinct value) group by key 
238 2 
311 1 
409 1 
484 1 
86 4 
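
As a cross-check on the result above, here is a minimal single-process sketch of what the query computes. It is not how Hive executes the query. The class name DistinctCountReference is hypothetical, and the rows are split into key and value on the assumption that the first record is key=86, value=val_87.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a reference computation for
// SELECT key, count(DISTINCT value) ... GROUP BY key.
final class DistinctCountReference {
    // The ten input rows, written as {key, value} pairs (assumed split).
    static final String[][] ROWS = {
        {"86", "val_87"}, {"238", "val_22"}, {"86", "val_165"},
        {"409", "val_419"}, {"86", "val_255"}, {"238", "val_278"},
        {"86", "val_98"}, {"484", "val_488"}, {"311", "val_341"},
        {"238", "val_278"}
    };

    // Collect the distinct values seen for each key, then report set sizes.
    // TreeMap orders the keys as strings.
    static Map<String, Integer> countDistinct(String[][] rows) {
        Map<String, Set<String>> valuesByKey = new TreeMap<>();
        for (String[] row : rows) {
            valuesByKey.computeIfAbsent(row[0], k -> new TreeSet<>()).add(row[1]);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : valuesByKey.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one "key count" line per group.
        countDistinct(ROWS).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

The duplicate row 238/val_278 is counted once, which is why key 238 yields 2 rather than 3.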



(TOK_QUERY
  (TOK_FROM (TOK_TABREF src119))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL key))
      (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL value))))
    (TOK_GROUPBY (TOK_TABLE_OR_COL key))))




  Stage: Stage-1 
    Map Reduce 
  Stage: Stage-2 
    Map Reduce 
  The "Map Reduce" label here shows that both stages are MapReduce jobs.
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1   // Stage-2 depends on Stage-1, so Stage-1 runs first.


Analysis of Stage-1:
      Alias -> Map Operator Tree:  // Map phase
        src119 
          TableScan 
            alias: src119 
            Select Operator 
              expressions: 
                    expr: key 
                    type: string 
                    expr: value 
                    type: string 
              outputColumnNames: key, value 
              Group By Operator 
                aggregations: 
                      expr: count(DISTINCT value) 
                bucketGroup: false 
                keys: 
                      expr: key 
                      type: string 
                      expr: value 
                      type: string 
                mode: hash 
                outputColumnNames: _col0, _col1, _col2 
                Reduce Output Operator 
                  key expressions: 
                        expr: _col0 
                        type: string 
                        expr: _col1 
                        type: string 
                  sort order: ++ 
                  Map-reduce partition columns: 
                        expr: _col0 
                        type: string 
                  tag: -1 
                  value expressions: 
                        expr: _col2 
                        type: bigint 
  The Map Operator Tree above lists the operations this MapReduce job performs in the map phase. The map phase runs four operators, in this order: TableScanOperator, SelectOperator, GroupByOperator, ReduceSinkOperator.


          TableScan 
            alias: src119 
The block above is the TableScanOperator, which scans table src119.
            Select Operator 
              expressions: 
                    expr: key 
                    type: string 
                    expr: value 
                    type: string 
              outputColumnNames: key, value 
The block above is the SelectOperator; the columns it needs are key and value.
              Group By Operator 
                aggregations: 
                      expr: count(DISTINCT value) 
                bucketGroup: false 
                keys: 
                      expr: key 
                      type: string 
                      expr: value 
                      type: string 
                mode: hash 
                outputColumnNames: _col0, _col1, _col2 
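The block above is the GroupByOperator. The aggregation it performs is count(DISTINCT value). Its grouping keys are both key and value, and mode: hash indicates hash-based aggregation on the map side.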
                Reduce Output Operator 
                  key expressions: 
                        expr: _col0 
                        type: string 
                        expr: _col1 
                        type: string 
                  sort order: ++ 
                  Map-reduce partition columns: 
                        expr: _col0 
                        type: string 
                  tag: -1 
                  value expressions: 
                        expr: _col2 
                        type: bigint 
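The block above is the ReduceSinkOperator. It sorts on _col0 and _col1 (sort order: ++) but partitions on _col0 only, so all rows for one key go to the same reducer.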
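
The ReduceSinkOperator settings above (partition on _col0, sort on _col0 and _col1) can be pictured with a small sketch. This is a sketch under assumptions, not Hive's code: ShuffleSketch and its methods are hypothetical, and the hash used for partitioning is chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: these names are hypothetical and are not Hive classes.
final class ShuffleSketch {
    // Pick a reducer from the grouping key alone, as "Map-reduce partition
    // columns: _col0" suggests. The hash function here is an assumption.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Order rows by key, then by value, both ascending ("sort order: ++").
    static List<String[]> sortRows(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((String[] r) -> r[0])
                .thenComparing(r -> r[1]));
        return sorted;
    }

    // Render rows as "key:value" tokens for easy inspection.
    static String render(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] r : rows) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(r[0]).append(':').append(r[1]);
        }
        return sb.toString();
    }

    // Four of the input rows, as {key, value} pairs.
    static List<String[]> sampleRows() {
        return Arrays.asList(
            new String[] {"86", "val_87"}, new String[] {"238", "val_22"},
            new String[] {"86", "val_165"}, new String[] {"238", "val_278"});
    }

    public static void main(String[] args) {
        System.out.println(render(sortRows(sampleRows())));
        System.out.println("partition for key 86: " + partitionFor("86", 3));
    }
}
```

Because the partition depends only on the key while the sort covers key and value, each reducer sees all values of a key next to each other and in order, which is what makes counting distinct values on the reduce side straightforward.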

MapRunner reads records one at a time and hands each record to the Mapper (ExecMapper) for processing.
For a single record such as 86val_87 (key=86 and value=val_87 once parsed; the separator is not rendered correctly on the blog), ExecMapper.map calls processOp on TableScanOperator, SelectOperator, GroupByOperator, and ReduceSinkOperator in turn. Finally, inside ReduceSinkOperator's processOp, out.collect(keyWritable, value); places the record in the MapTask's circular buffer. Here out is an OutputCollector; for OutputCollector.collect see http://caibinbupt.iteye.com/blog/401374.
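
The call order described above (each operator's processOp handing the record to the next operator, with the sink collecting it) can be sketched as follows. The classes are hypothetical stand-ins that only record the order of calls. Real Hive operators also transform or buffer rows, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical classes, not Hive's Operator hierarchy.
final class OperatorChainSketch {
    static class Op {
        final String name;
        Op child;
        Op(String name) { this.name = name; }

        // Record the visit, then forward the row to the child operator.
        void processOp(String[] row, List<String> trace) {
            trace.add(name);
            if (child != null) {
                child.processOp(row, trace);
            }
        }
    }

    // Last operator in the map-side chain: collects instead of forwarding.
    static final class SinkOp extends Op {
        final List<String> collected = new ArrayList<>();
        SinkOp(String name) { super(name); }

        @Override
        void processOp(String[] row, List<String> trace) {
            trace.add(name);
            collected.add(row[0] + "\t" + row[1]);
        }
    }

    // Link operators parent-to-child and return the root.
    static Op chain(Op... ops) {
        for (int i = 0; i + 1 < ops.length; i++) {
            ops[i].child = ops[i + 1];
        }
        return ops[0];
    }

    // Push one row through the four map-side operators and return the call order.
    static List<String> traceOneRow(String[] row) {
        Op root = chain(new Op("TableScanOperator"), new Op("SelectOperator"),
                new Op("GroupByOperator"), new SinkOp("ReduceSinkOperator"));
        List<String> trace = new ArrayList<>();
        root.processOp(row, trace);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(traceOneRow(new String[] {"86", "val_87"}));
    }
}
```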

   
      Reduce Operator Tree:  // Reduce phase
        Group By Operator 
          aggregations: 
                expr: count(DISTINCT KEY._col1:0._col0) 
          bucketGroup: false 
          keys: 
                expr: KEY._col0 
                type: string 
          mode: partials 
          outputColumnNames: _col0, _col1 
          File Output Operator 
            compressed: false 
            GlobalTableId: 0 
            table: 
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat 
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat 
The Reduce Operator Tree above lists the operations this MapReduce job performs in the reduce phase.

The reduce phase runs Group By Operator (GroupByOperator) and File Output Operator (FileSinkOperator).

        Group By Operator 
          aggregations: 
                expr: count(DISTINCT KEY._col1:0._col0) 
          bucketGroup: false 
          keys: 
                expr: KEY._col0 
                type: string 
          mode: partials 
          outputColumnNames: _col0, _col1 
The block above is the GroupByOperator.
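
One way a reduce-side operator can evaluate count(DISTINCT ...) over input that arrives sorted by (key, value) is to compare each value with the previous one. The sketch below illustrates that idea only. It is not taken from Hive's GroupByOperator, and the class SortedDistinctCount is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: distinct counting over rows sorted by (key, value).
final class SortedDistinctCount {
    // Returns one "key=count" entry per key, in input order.
    // Precondition: rows are sorted by key, then by value.
    static List<String> countDistinct(List<String[]> sortedRows) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        String lastValue = null;
        long count = 0;
        for (String[] row : sortedRows) {
            if (!row[0].equals(currentKey)) {
                // Key changed: emit the finished group and start a new one.
                if (currentKey != null) {
                    out.add(currentKey + "=" + count);
                }
                currentKey = row[0];
                lastValue = null;
                count = 0;
            }
            // Equal values are adjacent, so one comparison detects duplicates.
            if (!row[1].equals(lastValue)) {
                count++;
                lastValue = row[1];
            }
        }
        // Emit the last group, which no key change will trigger.
        if (currentKey != null) {
            out.add(currentKey + "=" + count);
        }
        return out;
    }

    // A small sorted sample that includes one duplicate pair.
    static List<String[]> sample() {
        return Arrays.asList(
            new String[] {"238", "val_22"}, new String[] {"238", "val_278"},
            new String[] {"238", "val_278"}, new String[] {"86", "val_165"},
            new String[] {"86", "val_87"});
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(sample()));
    }
}
```

The approach needs only constant memory per group, at the cost of requiring sorted input, which the shuffle already provides.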

          File Output Operator 
            compressed: false 
            GlobalTableId: 0 
            table: 
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat 
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat 
The block above is the FileSinkOperator.

Inside ReduceTask.run:
      while (values.more()) { 
        reduceInputKeyCounter.increment(1); 
        reducer.reduce(values.getKey(), values, collector, reporter); 
        if(incrProcCount) { 
          reporter.incrCounter(SkipBadRecords.COUNTER_GROUP, 
              SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS, 1); 
        } 
        values.nextKey(); 
        values.informReduceProgress(); 
      } 
Here reducer is ExecReducer.

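In the reduce phase, Group By Operator (GroupByOperator) and File Output Operator (FileSinkOperator) run. For each record (key/value pair), the reducer calls GroupByOperator.processOp once. After a certain number of records have been processed (1000 by default), a flush is needed, and the flush calls FileSinkOperator to write to HDFS. Finally, when the reducer is closed, each operator's close is called in order. The order follows from the parent-child relationship between the operators, so whatever data remains in GroupByOperator is forwarded to FileSinkOperator, which writes it to HDFS.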
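
The flush and close behaviour described above can be sketched with two hypothetical classes: a buffering parent standing in for GroupByOperator and an in-memory sink standing in for FileSinkOperator. The threshold is passed in as a parameter; the value 1000 is the figure quoted above, not something read from Hive's configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical classes, not Hive's operators.
final class FlushAndCloseSketch {
    // Stands in for FileSinkOperator by recording rows in memory.
    static final class Sink {
        final List<String> written = new ArrayList<>();
        void processOp(String row) { written.add(row); }
        void close() { /* a real sink would close its output file here */ }
    }

    // Stands in for GroupByOperator: buffers rows and flushes in batches.
    static final class BufferingOp {
        private final List<String> buffer = new ArrayList<>();
        private final int flushEvery;
        private final Sink child;

        BufferingOp(int flushEvery, Sink child) {
            this.flushEvery = flushEvery;
            this.child = child;
        }

        void processOp(String row) {
            buffer.add(row);
            if (buffer.size() >= flushEvery) {
                flush();
            }
        }

        // Forward everything buffered so far to the child.
        private void flush() {
            for (String row : buffer) {
                child.processOp(row);
            }
            buffer.clear();
        }

        // The parent closes first: leftovers are forwarded, then the child closes.
        void close() {
            if (!buffer.isEmpty()) {
                flush();
            }
            child.close();
        }
    }

    // Returns {rows written before close, rows written after close}.
    static int[] run(int rows, int flushEvery) {
        Sink sink = new Sink();
        BufferingOp op = new BufferingOp(flushEvery, sink);
        for (int i = 0; i < rows; i++) {
            op.processOp("row" + i);
        }
        int beforeClose = sink.written.size();
        op.close();
        return new int[] {beforeClose, sink.written.size()};
    }

    public static void main(String[] args) {
        int[] r = run(2500, 1000);
        System.out.println(r[0] + " rows written before close, " + r[1] + " after");
    }
}
```

With 2500 rows and a threshold of 1000, two batches reach the sink during processing and the remaining 500 rows arrive only when close runs, which is why closing parent before child matters.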
