How Hive Converts HQL into MapReduce Jobs

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capability. Put plainly, Hive is a MapReduce client: it translates the HQL that users write into MapReduce programs and runs them. So how exactly does Hive turn HQL into MapReduce?

[Figure: Hive architecture showing the Client interfaces, Driver, Metastore, and Hadoop (HDFS/MapReduce)]

As the figure shows, Hive works roughly as follows: it receives user commands (SQL) through a set of interactive interfaces, uses its Driver together with the metadata in the Metastore to translate those commands into MapReduce jobs, submits the jobs to Hadoop for execution, and finally returns the results to the user interface.
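To make this concrete, here is a minimal sketch (the table staff_demo and its columns are hypothetical). A bare SELECT with LIMIT can often be served by the Fetch Operator alone, while an aggregation forces Hive to compile and submit a MapReduce job:

-- Usually answered by a simple fetch; no MapReduce job needs to be launched.
select * from staff_demo limit 10;

-- The GROUP BY requires a shuffle, so Hive compiles this into a MapReduce job:
-- the map side emits (deptcode, 1) pairs and the reduce side sums them per key.
select deptcode, count(*) as cnt
from staff_demo
group by deptcode;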

1) User interfaces: Client

CLI (the hive shell), JDBC/ODBC (accessing Hive from Java), and WebUI (accessing Hive from a browser).

2) Metadata: Metastore

Metadata includes: the table name, the database the table belongs to (default by default), the table owner, column and partition fields, the table type (e.g., whether it is an external table), the directory where the table's data lives, and so on.

By default the metadata is stored in the embedded Derby database; using MySQL to back the Metastore is recommended.
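When the Metastore is backed by MySQL, the metadata listed above is ordinary relational data and can be inspected directly. A rough sketch against the stock Hive metastore schema (the DBS and TBLS tables exist in the standard schema, but details can vary across versions):

-- List each table with its database, owner, and type (MANAGED/EXTERNAL),
-- exactly as the Metastore records them.
SELECT d.NAME AS db_name, t.TBL_NAME, t.OWNER, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;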

3) Hadoop

HDFS is used for storage and MapReduce for computation.
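In other words, a Hive table is just a directory of files on HDFS plus metadata. Two quick ways to see the mapping from inside the hive shell (the table name and warehouse path are assumptions for illustration):

-- Prints the table's HDFS location, input/output formats, and SerDe.
describe formatted staff_demo;

-- The hive CLI can also run HDFS commands directly; the warehouse path
-- below is an assumption and depends on hive.metastore.warehouse.dir.
dfs -ls /user/hive/warehouse/staff_demo;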

4) Driver

(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually handled by a third-party library such as ANTLR. The AST is then checked semantically: do the referenced tables and columns exist, and is the SQL semantically valid?

(2) Compiler: compiles the AST into a logical execution plan.

(3) Optimizer (Query Optimizer): optimizes the logical execution plan.

(4) Executor (Execution): converts the logical execution plan into a physical plan that can actually run; for Hive, that means MR (or Spark) jobs.

Demo: the explain command shows the plan Hive generates; Hive first converts the SQL into a corresponding syntax tree and then into an execution plan (the AST itself is printed by explain extended, shown further below).

> explain select * from mytest_staff_info_demo4_cp1 

 > where statis_date='20180229'

 > order by name

 > limit 3;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: mytest_staff_info_demo4_cp1

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            Select Operator

              expressions: name (type: string), deptcode (type: string), id (type: int), account (type: string), areacode (type: string), statis_date (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              Reduce Output Operator

                key expressions: _col0 (type: string)

                sort order: +

                Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: string)

      Reduce Operator Tree:

        Extract

          Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

          Limit

            Number of rows: 3

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            File Output Operator

              compressed: true

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: 3

Time taken: 0.149 seconds, Fetched: 38 row(s)

As you can see, a Hive job is made up of one or more stages, and stages can depend on one another. The more complex the SQL, the more stages it produces and, usually, the longer it takes to run. As a rule of thumb at our company (a top-3 e-commerce firm), a task on our IDE platform should generally not run longer than 45 minutes; otherwise it is best to split and optimize the job.
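A common way to split such a job is to materialize an intermediate result, so that each step is smaller and can be scheduled, checked, and retried independently. A rough sketch (all table names hypothetical):

-- Step 1: run the heavy aggregation once and persist it.
create table tmp_dept_cnt as
select deptcode, count(*) as cnt
from staff_demo
where statis_date = '20180229'
group by deptcode;

-- Step 2: downstream queries read the small intermediate table
-- instead of re-scanning and re-aggregating the raw data.
select * from tmp_dept_cnt order by cnt desc limit 10;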

You can also use explain extended to see more detail.

> explain extended select * from mytest_staff_info_demo4_cp1 

> where statis_date='20180229'

> order by name

> limit 3;

OK

ABSTRACT SYNTAX TREE:

TOK_QUERY

   TOK_FROM

      TOK_TABREF

         TOK_TABNAME

            mytest_staff_info_demo4_cp1

   TOK_INSERT

      TOK_DESTINATION

         TOK_DIR

            TOK_TMP_FILE

      TOK_SELECT

         TOK_SELEXPR

            TOK_ALLCOLREF

      TOK_WHERE

         =

            TOK_TABLE_OR_COL

               statis_date

            '20180229'

      TOK_ORDERBY

         TOK_TABSORTCOLNAMEASC

            TOK_TABLE_OR_COL

               name

      TOK_LIMIT

         3 

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: mytest_staff_info_demo4_cp1

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            GatherStats: false

            Select Operator

              expressions: name (type: string), deptcode (type: string), id (type: int), account (type: string), areacode (type: string), statis_date (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              Reduce Output Operator

                key expressions: _col0 (type: string)

                sort order: +

                Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

                tag: -1

                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: string)

      Path -> Alias:

        hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229 [mytest_staff_info_demo4_cp1]

      Path -> Partition:

        hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229

          Partition

            base file name: statis_date=20180229

            input format: org.apache.hadoop.mapred.TextInputFormat

            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

            partition values:

              statis_date 20180229

            properties:

              bucket_count -1

              columns name,deptcode,id,account,areacode

              columns.comments

              columns.types string:string:int:string:string

              field.delim

              file.inputformat org.apache.hadoop.mapred.TextInputFormat

              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

              location hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229

              name fdm_sor.mytest_staff_info_demo4_cp1

              partition_columns statis_date

              partition_columns.types string

              serialization.ddl struct mytest_staff_info_demo4_cp1 { string name, string deptcode, i32 id, string account, string areacode}

              serialization.format  

              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              transient_lastDdlTime 1521166440

            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat

              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

              properties:

                bucket_count -1

                columns name,deptcode,id,account,areacode

                columns.comments

                columns.types string:string:int:string:string

                field.delim      

                file.inputformat org.apache.hadoop.mapred.TextInputFormat

                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                location hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1

                name fdm_sor.mytest_staff_info_demo4_cp1

                partition_columns statis_date

                partition_columns.types string

                serialization.ddl struct mytest_staff_info_demo4_cp1 { string name, string deptcode, i32 id, string account, string areacode}

                serialization.format       

                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                transient_lastDdlTime 1519780670

              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              name: fdm_sor.mytest_staff_info_demo4_cp1

            name: fdm_sor.mytest_staff_info_demo4_cp1

      Truncated Path -> Alias:

        /fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229 [mytest_staff_info_demo4_cp1]

      Needs Tagging: false

      Reduce Operator Tree:

        Extract

          Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

          Limit

            Number of rows: 3

            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

            File Output Operator

              compressed: true

              GlobalTableId: 0

              directory: alluxio-ft://namenode1-sit.cnsuning.com:19998/user/finance/tmp/hive-finance/hive_2018-03-19_09-35-04_376_4971094765120649963-1/-ext-10001

              NumFilesPerFileSink: 1

              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE

              Stats Publishing Key Prefix: alluxio-ft://namenode1-sit.cnsuning.com:19998/user/finance/tmp/hive-finance/hive_2018-03-19_09-35-04_376_4971094765120649963-1/-ext-10001/

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  properties:

                    columns _col0,_col1,_col2,_col3,_col4,_col5

                    columns.types string:string:int:string:string:string

                    escape.delim \

                    hive.serialization.extend.nesting.levels true

                    serialization.format 1

                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              TotalFiles: 1

              GatherStats: false

              MultiFileSpray: false 

  Stage: Stage-0

    Fetch Operator

      limit: 3

Time taken: 0.151 seconds, Fetched: 133 row(s)
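Depending on the Hive version, explain accepts other modifiers as well; for example, explain dependency prints (as JSON) the input tables and partitions a query would read, which is handy for checking that partition pruning worked:

explain dependency
select * from mytest_staff_info_demo4_cp1
where statis_date = '20180229';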
