Hive execution plans

HQL:

insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
OK

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))


STAGE DEPENDENCIES:
  Stage-1 is a root stage              // the root stage
  Stage-0 depends on stages: Stage-1   // Stage-0 depends on Stage-1
  Stage-2 depends on stages: Stage-0   // Stage-2 depends on Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce                        // this stage is one MapReduce job
      Alias -> Map Operator Tree:     // the map operator tree, i.e. the map phase
        t1
          TableScan                   // scans the table to load rows (the FROM clause); the description can include row counts, sizes, etc.
            alias: t1                 // table alias
            Reduce Output Operator    // describes the map output, i.e. the reduce input: key, partition, sort order, etc.
              key expressions:        // the key t1 emits to the reduce phase
                    expr: bar
                    type: string
              sort order: +           // one sort column (the key, bar); multiple sort columns would show multiple +'s
              Map-reduce partition columns:  // partition columns: Hive partitions on the column(s) in the join's ON clause, guaranteeing that rows with equal values in those columns go to the same reducer
                    expr: bar
                    type: string
              tag: 0                  // tag marking rows that come from t1
              value expressions:      // the values t1 emits to the reduce phase
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        t2
          TableScan
            alias: t2
            Reduce Output Operator
              key expressions:
                    expr: bar
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: bar
                    type: string
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:           // the reduce operator tree, i.e. the reduce phase
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}  // corresponds to t1.bar, t1.foo above
            1 {VALUE._col0}                // corresponds to t2.foo above
          handleSkewJoin: false            // whether to handle join skew
          outputColumnNames: _col0, _col1, _col5
          Select Operator // column projection; the description includes column names and types, plus output type and size
            expressions:
                  expr: _col1
                  type: string
                  expr: _col0
                  type: int
                  expr: _col5
                  type: int
            outputColumnNames: _col0, _col1, _col2   // temporary column names generated for the intermediate result
            File Output Operator // writes the result to a temporary file; the description covers the compression setting and the output file format
              compressed: false
              GlobalTableId: 1
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.lpx

  Stage: Stage-0
    Move Operator // Stage-0 simply moves the result from the temporary directory into the directory of table lpx
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.lpx

  Stage: Stage-2
    Stats-Aggr Operator
From the header:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

we can read off the job structure of the plan: the whole task is executed as three jobs.
The first job consists of Stage-1;
the second job consists of Stage-0, which depends on the result of Stage-1;
the third job consists of Stage-2, which depends on the result of Stage-0.

Stage-1 and Stage-0 are explained separately below. The SQL can be split into two steps:
(1) SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
(2) insert overwrite TABLE lpx;
Stage: Stage-1
corresponds to one complete MapReduce job with two parts: the Map Operator Tree, which maps to the map tasks, and the Reduce Operator Tree, which maps to the reduce tasks.

The Map Operator Tree contains two parallel operations over t1 and t2, roughly SELECT t1.bar, t1.foo FROM t1; and SELECT t2.foo FROM t2;. Each map task produces the input of the reduce phase [Reduce Output Operator].

The Reduce Operator Tree shows that the map outputs are joined on the join condition, and the result is written to HDFS through the predefined output format matching the storage format of default.lpx. When lpx was created no storage format was specified, so it defaults to text, read and written with TextInputFormat and TextOutputFormat:
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.lpx
input format is org.apache.hadoop.mapred.TextInputFormat because the temporary output produced by the map phase is saved in TextOutputFormat, so the reduce side naturally reads it back through TextInputFormat. These details are handled by Hadoop MapReduce itself; Hive only needs to specify the formats.

serde is the org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe class, which at this point holds the values _col0, _col1, _col2, i.e. the t1.bar, t1.foo, t2.foo we asked for. Concretely, the serialized value is _col0 + lpx's column delimiter + _col1 + lpx's column delimiter + _col2. From output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat we know that class handles the output.

Stage-0 corresponds to the second step above. Stage-1 writes its result to a temporary file (say tmp), and Stage-0 moves it into table lpx. The Move Operator signals that this is not a MapReduce job: it only invokes MoveTask, which checks, before moving, that the input files match the storage format of lpx.
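The shuffle-join mechanics described above (tag each row with its table of origin, partition on the join column, join tag-0 and tag-1 rows in the reducer) can be sketched in plain Python. This is an illustrative simulation with made-up rows, not Hive code:

```python
from collections import defaultdict
from itertools import product

def map_phase(rows, tag, key_col):
    """Emit (key, (tag, row)) pairs, mirroring the Reduce Output Operator."""
    for row in rows:
        yield row[key_col], (tag, row)

def shuffle(pairs, num_reducers=2):
    """Hash-partition by key so equal keys land in the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged in pairs:
        partitions[hash(key) % num_reducers][key].append(tagged)
    return partitions

def reduce_phase(partitions):
    """Inner Join 0 to 1: pair every tag-0 row with every tag-1 row per key."""
    for part in partitions:
        for key, tagged_rows in part.items():
            side0 = [r for t, r in tagged_rows if t == 0]
            side1 = [r for t, r in tagged_rows if t == 1]
            for r0, r1 in product(side0, side1):
                yield (r0["bar"], r0["foo"], r1["foo"])

pokes   = [{"foo": 1, "bar": "a"}, {"foo": 2, "bar": "b"}]
invites = [{"foo": 10, "bar": "a"}, {"foo": 20, "bar": "c"}]
pairs = list(map_phase(pokes, 0, "bar")) + list(map_phase(invites, 1, "bar"))
result = sorted(reduce_phase(shuffle(pairs)))
print(result)  # [('a', 1, 10)]
```

Only the key "a" appears on both sides, so only one joined row survives; keys "b" and "c" have an empty side and produce nothing, matching inner-join semantics.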

select a.*, b.cust_uid
from ods_ad_bid_deliver_info b join mds_ad_algo_feed_monitor_data_table a
where a.dt<=20140101 and a.dt<=20140108 and key='deliver_id_bucket_id' and a.dt=b.dt and a.key_slice=b.deliver_id
Execution plan:

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME ods_ad_bid_deliver_info) b) (TOK_TABREF (TOK_TABNAME mds_ad_algo_feed_monitor_data_table) a))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_ALLCOLREF (TOK_TABNAME a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) cust_uid))) (TOK_WHERE (and (and (and (and (<= (. (TOK_TABLE_OR_COL a) dt) 20140101) (<= (. (TOK_TABLE_OR_COL a) dt) 20140108)) (= (TOK_TABLE_OR_COL key) 'deliver_id_bucket_id')) (= (. (TOK_TABLE_OR_COL a) dt) (. (TOK_TABLE_OR_COL b) dt))) (= (. (TOK_TABLE_OR_COL a) key_slice) (. (TOK_TABLE_OR_COL b) deliver_id))))))
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (key = 'deliver_id_bucket_id')  // the filter on key is applied in the map phase
                  type: boolean
              Reduce Output Operator
                sort order:
                tag: 1
                value expressions:  // select * causes every column to be sent to the reduce phase
                      expr: key
                      type: string
                      expr: key_slice
                      type: string
                      expr: billing_mode_slice
                      type: string
                      expr: bucket_id
                      type: string
                      expr: ctr
                      type: string
                      expr: ecpm
                      type: string
                      expr: auc
                      type: string
                      expr: pctr
                      type: string
                      expr: pctr_ctr
                      type: string
                      expr: total_pv
                      type: string
                      expr: total_click
                      type: string
                      expr: dt
                      type: string
        b
          TableScan
            alias: b
            Reduce Output Operator
              sort order:
              tag: 0
              value expressions:
                    expr: deliver_id
                    type: string
                    expr: cust_uid
                    type: string
                    expr: dt
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col6} {VALUE._col35}
            1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
          handleSkewJoin: false
          outputColumnNames: _col0, _col6, _col35, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
          Filter Operator
            predicate:
                expr: (((((_col49 <= 20140101) and (_col49 <= 20140108)) and (_col38 = 'deliver_id_bucket_id')) and (_col49 = _col35)) and (_col39 = _col0))
                type: boolean
            Select Operator
              expressions:
                    expr: _col38
                    type: string
                    expr: _col39
                    type: string
                    expr: _col40
                    type: string
                    expr: _col41
                    type: string
                    expr: _col42
                    type: string
                    expr: _col43
                    type: string
                    expr: _col44
                    type: string
                    expr: _col45
                    type: string
                    expr: _col46
                    type: string
                    expr: _col47
                    type: string
                    expr: _col48
                    type: string
                    expr: _col49
                    type: string
                    expr: _col6
                    type: string
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  Stage: Stage-0
    Fetch Operator
      limit: -1

After rewriting:
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b
join mds_ad_algo_feed_monitor_data_table a
on(a.dt<=20140101 and a.dt<=20140108 and a.dt=b.dt and a.key_slice=b.deliver_id and a.key='deliver_id_bucket_id')
Execution plan:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
           TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (key = 'deliver_id_bucket_id')
                  type: boolean
              Filter Operator
                predicate:
                    expr: (dt <= 20140101)  // the partition filter now takes effect on the map side
                    type: boolean
                Filter Operator
                  predicate:
                      expr: (dt <= 20140108)  // the partition filter now takes effect on the map side
                      type: boolean
                  Filter Operator
                    predicate:
                        expr: (key = 'deliver_id_bucket_id')
                        type: boolean
                    Reduce Output Operator
                      key expressions:
                            expr: dt
                            type: string
                            expr: key_slice
                            type: string
                      sort order: ++
                      Map-reduce partition columns:
                            expr: dt
                            type: string
                            expr: key_slice
                            type: string
                      tag: 1
                      value expressions:
                            expr: key
                            type: string
                            expr: key_slice
                            type: string
                            expr: billing_mode_slice
                            type: string
                            expr: bucket_id
                            type: string
                            expr: ctr
                            type: string
                            expr: ecpm
                            type: string
                            expr: auc
                            type: string
                            expr: pctr
                            type: string
                            expr: pctr_ctr
                            type: string
                            expr: total_pv
                            type: string
                            expr: total_click
                            type: string
                            expr: dt
                            type: string
        b
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: dt
                    type: string
                    expr: deliver_id
                    type: string
              sort order: ++
              Map-reduce partition columns:
                    expr: dt
                    type: string
                    expr: deliver_id
                    type: string
              tag: 0
              value expressions:
                    expr: cust_uid
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col6}
            1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
          handleSkewJoin: false
          outputColumnNames: _col6, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
          Select Operator
            expressions:
                  expr: _col38
                  type: string
                  expr: _col39
                  type: string
                  expr: _col40
                  type: string
                  expr: _col41
                  type: string
                  expr: _col42
                  type: string
                  expr: _col43
                   type: string
                  expr: _col44
                  type: string
                  expr: _col45
                  type: string
                  expr: _col46
                  type: string
                  expr: _col47
                  type: string
                  expr: _col48
                  type: string
                  expr: _col49
                  type: string
                  expr: _col6
                  type: string
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  Stage: Stage-0
    Fetch Operator
      limit: -1
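Comparing the two plans shows what the rewrite bought: in the first plan the Reduce Output Operators have no key expressions and an empty sort order, so every row is shuffled under one empty key and the join conditions only run in the Filter Operator after the Join Operator; in the second, (dt, key_slice) and (dt, deliver_id) become the shuffle keys. A rough Python sketch (made-up rows, not Hive internals) of the difference in pairs examined:

```python
from itertools import product

a = [{"dt": d, "key_slice": k} for d in ("20140101", "20140102") for k in ("x", "y")]
b = [{"dt": "20140101", "deliver_id": "x"}]

# Variant 1: join everything, then filter (the WHERE-only plan):
# the reducer sees the full Cartesian product.
pairs_examined_cartesian = len(list(product(a, b)))

# Variant 2: keyed join on (dt, key_slice) = (dt, deliver_id) (the ON plan):
# only rows whose keys collide are ever paired.
index = {}
for row in b:
    index.setdefault((row["dt"], row["deliver_id"]), []).append(row)
pairs_examined_keyed = sum(
    len(index.get((r["dt"], r["key_slice"]), [])) for r in a)

print(pairs_examined_cartesian, pairs_examined_keyed)  # 4 1
```

With real tables the Cartesian variant examines |a| x |b| pairs in a single reducer, which is exactly the blow-up the rewritten query avoids.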


Example:
select * from emp e
left join dept d on e.deptno=d.deptno
where d.dt='2018-06-04';

Time taken: 44.401 seconds, Fetched: 17 row(s)

Execution plan:
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
// 1. A local MapReduce task is launched to read the small table

    Map Reduce Local Work   // executed locally

      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 168 Basic stats: PARTIAL Column stats: PARTIAL

// 2. The small table is written into a hash table file

            HashTable Sink Operator
              keys:
                0 deptno (type: string)
                1 deptno (type: string)
// 3. The hash table file is uploaded to HDFS / the distributed cache (this step does not appear in the plan itself; it shows up in the job log):
2018-06-07 11:04:40	Uploaded 1 File to: file:/tmp/dw/fd4d06b7-523f-41a0-a8ff-ee4b13995b7e/hive_2018-06-07_11-04-35_149_2329359575876874876-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (583 bytes)

// 4. A map-only job reads the large table and joins it against the small table cached in memory

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 deptno (type: string)
                1 deptno (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
              Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
              Filter Operator
                predicate: (_col15 = '2018-06-04') (type: boolean)
                Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), '2018-06-04' (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
                  Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
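The map-join flow above (Stage-4 builds a hash table from the small table, Stage-3 streams the large table through map tasks and probes that table, with no shuffle at all) can be sketched in plain Python. The rows are made up for illustration:

```python
def build_hashtable(small_rows, key):
    """Stage-4: HashTable Sink Operator on the small side."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    return table

def map_join_left_outer(big_rows, table, key):
    """Stage-3: Map Join Operator, Left Outer Join 0 to 1."""
    for e in big_rows:
        matches = table.get(e[key])
        if matches:
            for d in matches:
                # prefix right-side columns so the two sides don't collide
                yield {**e, **{f"d_{k}": v for k, v in d.items()}}
        else:
            yield dict(e)  # unmatched left rows are kept (right side absent)

dept = [{"deptno": "10", "dname": "SALES", "dt": "2018-06-04"}]
emp  = [{"ename": "A", "deptno": "10"}, {"ename": "B", "deptno": "20"}]
rows = list(map_join_left_outer(emp, build_hashtable(dept, "deptno"), "deptno"))
print(len(rows))  # 2: one joined row, one left-only row
```

Because every map task holds the whole small table in memory, the join completes in a single map-only stage, which is why Hive's auto-converted map joins are usually much faster than shuffle joins when one side is small.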

-- After rewriting
select * from emp e
left join dept d on (e.deptno=d.deptno and  d.dt='2018-06-04');
Time taken: 23.804 seconds, Fetched: 17 row(s)

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: (dt = '2018-06-04') (type: boolean)
            Statistics: Num rows: 1 Data size: 84 Basic stats: PARTIAL Column stats: PARTIAL
            HashTable Sink Operator
              keys:
                0 deptno (type: string)
                1 deptno (type: string)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 deptno (type: string)
                1 deptno (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
              Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
                Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
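Why is the ON version faster here? In a LEFT JOIN, a right-table predicate in WHERE can only be evaluated after the join, while the same predicate in ON merely restricts which right-side rows are eligible to match, so dept is filtered before the hash table is built (visible as the filterExpr line in the rewritten Stage-4). Note the two forms are not always equivalent: WHERE drops left rows whose right side is NULL, whereas ON keeps them; on this particular data both happened to return 17 rows. A small Python sketch (made-up rows) of the semantic difference:

```python
def left_join(left, right, on):
    """Naive LEFT JOIN: unmatched left rows pair with None."""
    out = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

emp  = [{"ename": "A", "deptno": "10"}, {"ename": "B", "deptno": "20"}]
dept = [{"deptno": "10", "dt": "2018-06-04"},
        {"deptno": "20", "dt": "2018-06-05"}]

# Filter in ON: B's dept row fails the dt test, so B pairs with None
# but is still returned.
on_version = left_join(
    emp, dept,
    lambda e, d: e["deptno"] == d["deptno"] and d["dt"] == "2018-06-04")

# Filter in WHERE: join first, then drop rows whose dept side fails,
# which also drops B entirely.
joined = left_join(emp, dept, lambda e, d: e["deptno"] == d["deptno"])
where_version = [(e, d) for e, d in joined
                 if d is not None and d["dt"] == "2018-06-04"]

print(len(on_version), len(where_version))  # 2 1
```

So the rewrite is a safe speed-up only when every left row is known to have a matching, filter-passing right row (or when inner-join semantics are actually intended); otherwise it changes the result.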


Sources:

https://tech.meituan.com/hive-sql-to-mapreduce.html (Meituan tech blog)

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain (Hive documentation)

秦老师

