How Hive Converts HQL into MapReduce Jobs

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capability. Put plainly, Hive is a MapReduce client: it translates the HQL that users write into MapReduce programs and runs them. So how exactly does Hive turn HQL into MapReduce?

[Figure: Hive architecture showing the Client interfaces, Driver, Metastore, and Hadoop (HDFS/MapReduce)]

As the figure shows, Hive works roughly as follows: it receives user commands (SQL) through a set of interactive interfaces, uses its Driver together with the metadata in the Metastore to translate those commands into MapReduce jobs, submits the jobs to Hadoop for execution, and finally returns the results to the user interface.
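To make this concrete, here is a minimal sketch (the table staff_demo and its columns are hypothetical). A bare SELECT with LIMIT can often be served by the Fetch Operator alone, while an aggregation forces Hive to compile and submit a MapReduce job:

-- Usually answered by a simple fetch; no MapReduce job needs to be launched.
select * from staff_demo limit 10;

-- The GROUP BY requires a shuffle, so Hive compiles this into a MapReduce job:
-- the map side emits (deptcode, 1) pairs and the reduce side sums them per key.
select deptcode, count(*) as cnt
from staff_demo
group by deptcode;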

1) User interfaces: Client

CLI (the hive shell), JDBC/ODBC (accessing Hive from Java), and WebUI (accessing Hive from a browser).

2) Metadata: Metastore

Metadata includes: the table name, the database the table belongs to (default by default), the table owner, column and partition fields, the table type (e.g., whether it is an external table), the directory where the table's data lives, and so on.

By default the metadata is stored in the embedded Derby database; using MySQL to back the Metastore is recommended.
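When the Metastore is backed by MySQL, the metadata listed above is ordinary relational data and can be inspected directly. A rough sketch against the stock Hive metastore schema (the DBS and TBLS tables exist in the standard schema, but details can vary across versions):

-- List each table with its database, owner, and type (MANAGED/EXTERNAL),
-- exactly as the Metastore records them.
SELECT d.NAME AS db_name, t.TBL_NAME, t.OWNER, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;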

3) Hadoop

HDFS is used for storage and MapReduce for computation.
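In other words, a Hive table is just a directory of files on HDFS plus metadata. Two quick ways to see the mapping from inside the hive shell (the table name and warehouse path are assumptions for illustration):

-- Prints the table's HDFS location, input/output formats, and SerDe.
describe formatted staff_demo;

-- The hive CLI can also run HDFS commands directly; the warehouse path
-- below is an assumption and depends on hive.metastore.warehouse.dir.
dfs -ls /user/hive/warehouse/staff_demo;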

4) Driver

(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually handled by a third-party library such as ANTLR. The AST is then checked semantically: do the referenced tables and columns exist, and is the SQL semantically valid?

(2) Compiler: compiles the AST into a logical execution plan.

(3) Optimizer (Query Optimizer): optimizes the logical execution plan.

(4) Executor (Execution): converts the logical execution plan into a physical plan that can actually run; for Hive, that means MR (or Spark) jobs.

Demo: the explain command shows the plan Hive generates; Hive first converts the SQL into a corresponding syntax tree and then into an execution plan (the AST itself is printed by explain extended, shown further below).

> explain select * from mytest_staff_info_demo4_cp1 

 > where statis_date='20180229'

 > order by name

 > limit 3;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: mytest_staff_info_demo4_cp1

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            Select Operator

              expressions: name (type: string), deptcode (type: string), id (type: int), account (type: string), areacode (type: string), statis_date (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              Reduce Output Operator

                key expressions: _col0 (type: string)

                sort order: +

                Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: string)

      Reduce Operator Tree:

        Extract

          Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

          Limit

            Number of rows: 3

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            File Output Operator

              compressed: true

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: 3

Time taken: 0.149 seconds, Fetched: 38 row(s)

As you can see, a Hive job is made up of one or more stages, and stages can depend on one another. The more complex the SQL, the more stages it produces and, usually, the longer it takes to run. As a rule of thumb at our company (a top-3 e-commerce firm), a task on our IDE platform should generally not run longer than 45 minutes; otherwise it is best to split and optimize the job.
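A common way to split such a job is to materialize an intermediate result, so that each step is smaller and can be scheduled, checked, and retried independently. A rough sketch (all table names hypothetical):

-- Step 1: run the heavy aggregation once and persist it.
create table tmp_dept_cnt as
select deptcode, count(*) as cnt
from staff_demo
where statis_date = '20180229'
group by deptcode;

-- Step 2: downstream queries read the small intermediate table
-- instead of re-scanning and re-aggregating the raw data.
select * from tmp_dept_cnt order by cnt desc limit 10;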

You can also use explain extended to see more detail.

> explain extended select * from mytest_staff_info_demo4_cp1 

> where statis_date='20180229'

> order by name

> limit 3;

OK

ABSTRACT SYNTAX TREE:

TOK_QUERY

   TOK_FROM

      TOK_TABREF

         TOK_TABNAME

            mytest_staff_info_demo4_cp1

   TOK_INSERT

      TOK_DESTINATION

         TOK_DIR

            TOK_TMP_FILE

      TOK_SELECT

         TOK_SELEXPR

            TOK_ALLCOLREF

      TOK_WHERE

         =

            TOK_TABLE_OR_COL

               statis_date

            '20180229'

      TOK_ORDERBY

         TOK_TABSORTCOLNAMEASC

            TOK_TABLE_OR_COL

               name

      TOK_LIMIT

         3 

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: mytest_staff_info_demo4_cp1

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            GatherStats: false

            Select Operator

              expressions: name (type: string), deptcode (type: string), id (type: int), account (type: string), areacode (type: string), statis_date (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              Reduce Output Operator

                key expressions: _col0 (type: string)

                sort order: +

                Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

                tag: -1

                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: string)

      Path -> Alias:

        hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229 [mytest_staff_info_demo4_cp1]

      Path -> Partition:

        hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229

          Partition

            base file name: statis_date=20180229

            input format: org.apache.hadoop.mapred.TextInputFormat

            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

            partition values:

              statis_date 20180229

            properties:

              bucket_count -1

              columns name,deptcode,id,account,areacode

              columns.comments

              columns.types string:string:int:string:string

              field.delim

              file.inputformat org.apache.hadoop.mapred.TextInputFormat

              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

              location hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229

              name fdm_sor.mytest_staff_info_demo4_cp1

              partition_columns statis_date

              partition_columns.types string

              serialization.ddl struct mytest_staff_info_demo4_cp1 { string name, string deptcode, i32 id, string account, string areacode}

              serialization.format  

              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              transient_lastDdlTime 1521166440

            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat

              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

              properties:

                bucket_count -1

                columns name,deptcode,id,account,areacode

                columns.comments

                columns.types string:string:int:string:string

                field.delim      

                file.inputformat org.apache.hadoop.mapred.TextInputFormat

                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                location hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1

                name fdm_sor.mytest_staff_info_demo4_cp1

                partition_columns statis_date

                partition_columns.types string

                serialization.ddl struct mytest_staff_info_demo4_cp1 { string name, string deptcode, i32 id, string account, string areacode}

                serialization.format       

                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                transient_lastDdlTime 1519780670

              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              name: fdm_sor.mytest_staff_info_demo4_cp1

            name: fdm_sor.mytest_staff_info_demo4_cp1

      Truncated Path -> Alias:

        /fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229 [mytest_staff_info_demo4_cp1]

      Needs Tagging: false

      Reduce Operator Tree:

        Extract

          Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

          Limit

            Number of rows: 3

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            File Output Operator

              compressed: true

              GlobalTableId: 0

              directory: alluxio-ft://namenode1-sit.cnsuning.com:19998/user/finance/tmp/hive-finance/hive_2018-03-19_09-35-04_376_4971094765120649963-1/-ext-10001

              NumFilesPerFileSink: 1

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              Stats Publishing Key Prefix: alluxio-ft://namenode1-sit.cnsuning.com:19998/user/finance/tmp/hive-finance/hive_2018-03-19_09-35-04_376_4971094765120649963-1/-ext-10001/

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  properties:

                    columns _col0,_col1,_col2,_col3,_col4,_col5

                    columns.types string:string:int:string:string:string

                    escape.delim \

                    hive.serialization.extend.nesting.levels true

                    serialization.format 1

                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              TotalFiles: 1

              GatherStats: false

              MultiFileSpray: false 

  Stage: Stage-0

    Fetch Operator

      limit: 3

Time taken: 0.151 seconds, Fetched: 133 row(s)
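Depending on the Hive version, explain accepts other modifiers as well; for example, explain dependency prints (as JSON) the input tables and partitions a query would read, which is handy for checking that partition pruning worked:

explain dependency
select * from mytest_staff_info_demo4_cp1
where statis_date = '20180229';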
