This article examines how Hive 1.2.2 executes a left join between two tables in the following two scenarios:
1. Two tables of the same size joined with a left join;
2. A large table as the driving table, joined with a temporary (intermediate) table;
Before running the experiments, here is the official description of how MapJoin works:
How Hive MapJoin works:
MapJoin reads the small table into memory during the Map phase, builds HashTableFiles from it, and then sequentially scans the large table to complete the join.
The figure above, taken from a slide deck on join optimization by Facebook engineer Liyin Tang, shows that MapJoin has two phases:
1. A MapReduce Local Task reads the small table into memory, builds HashTableFiles, and uploads them to the Distributed Cache; the HashTableFiles are compressed along the way;
2. In the Map phase of the MapReduce job, each Mapper reads the HashTableFiles from the Distributed Cache into memory, sequentially scans the large table, performs the join directly in the Map phase, and passes the result on to the next MapReduce task;
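Whether this conversion happens automatically is controlled by a handful of settings. The values below are the stock Hive 1.2 defaults (a sketch for reference; verify the actual values on your own cluster with `set -v`):

```sql
-- Enable automatic conversion of a common (reduce-side) join to MapJoin
set hive.auto.convert.join=true;
-- A table smaller than this many bytes is considered "small" enough
-- to be loaded into a HashTable (default 25 MB)
set hive.mapjoin.smalltable.filesize=25000000;
-- When the combined size of the small table(s) is below the threshold,
-- skip the ConditionalTask and convert to MapJoin directly at compile time
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
```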
Tips:
If one of the two tables being joined is a temporary (intermediate) table, Hive generates a ConditionalTask and decides at runtime whether to use MapJoin.
The component involved here is the CommonJoinResolver optimizer, which converts a CommonJoin into a MapJoin. The conversion works roughly as follows (adapted from material found online):
1. Traverse the task tree depth-first.
2. Find the JoinOperator and compare the data sizes of the left and right tables.
3. For small table + large table => MapJoinTask; for small/large table + intermediate table => ConditionalTask.
4. Walk the MapReduce tasks generated in the previous phase. If, for example,
MapReduceTask[Stage-2]
JOIN[8]
references a temporary table, Stage-2 is first deep-copied (the original execution plan must be preserved as a backup plan, hence the copy), a MapJoinOperator is generated to replace the JoinOperator, and a MapReduceLocalWork is generated to read the small table, build HashTableFiles, and upload them to the DistributedCache.
In the experiments below, some execution details turn out to differ from the description above. It is unclear whether this is caused by version differences or by `explain` not showing the full MapJoin process. As for how Hive decides which table is small and which is large, the explain plan itself gives no hint of the decision logic; the sizes are presumably read from the metastore while the compiler interacts with it.
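The `Statistics: Num rows / Data size` lines in the plans below are exactly those metastore statistics. If they look stale or all-zero, they can be refreshed explicitly (a sketch using this article's own table names):

```sql
-- Recompute basic table statistics in the metastore, so the optimizer's
-- small/large judgment (and the Statistics lines in explain) reflect reality
analyze table weblog_b compute statistics;
analyze table weblog_d compute statistics;
-- Inspect what the metastore currently records (numRows, totalSize, etc.)
describe formatted weblog_b;
```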
Trial 1: two tables of the same size joined with a left join;
hive> explain
> select a.ip
> ,b.req_url
> from weblog_b a
> left join weblog_d b
> on a.ip=b.ip;
OK
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
b
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
b
TableScan
alias: b
Statistics: Num rows: 3 Data size: 101 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 ip (type: string)
1 ip (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 1 Data size: 170 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 ip (type: string)
1 ip (type: string)
outputColumnNames: _col0, _col10
Statistics: Num rows: 3 Data size: 111 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col10 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 3 Data size: 111 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 111 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.168 seconds, Fetched: 56 row(s)
hive>
A brief walk-through of the execution:
1. The stage dependencies are listed first.
2. Map Reduce Local Work: according to the description above, this stage should only appear when joining against a temporary table, and a join between two fact tables should go through a MapReduce Local Task instead. In practice, however, joining two fact tables also produces Map Reduce Local Work. The key step here is reading the small table into memory and building the HashTable; compression is indeed applied while the HashTable is generated.
3. In the Map phase, the Mapper sequentially scans the large table, probes the HashTable, and passes the join result on to the next MapReduce task.
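To see the plan Hive would fall back to without this optimization, the automatic conversion can be switched off and the same query explained again. The plan should then show a single MapReduce stage with a reduce-side Join Operator instead of Map Join Operator plus local work (a sketch using the same tables):

```sql
-- Disable automatic MapJoin conversion to expose the common join plan
set hive.auto.convert.join=false;
explain
select a.ip, b.req_url
from weblog_b a
left join weblog_d b
on a.ip = b.ip;
```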
Trial 2: the large table as the driving table, joined with a temporary table (note that the query below actually uses an inner join, though the analysis applies equally to a left join);
hive> explain
> select a.ip
> ,b.req_url
> from weblog_b a
> inner join (select ip,req_url from weblog_c where time in ('600','900','1200')) b
> on a.ip=b.ip;
OK
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
TableScan
alias: a
Statistics: Num rows: 1 Data size: 170 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ip is not null (type: boolean)
Statistics: Num rows: 1 Data size: 170 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 ip (type: string)
1 _col0 (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: weblog_c
Statistics: Num rows: 5 Data size: 165 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((time) IN ('600', '900', '1200') and ip is not null) (type: boolean)
Statistics: Num rows: 1 Data size: 33 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: ip (type: string), req_url (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 33 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 ip (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col9
Statistics: Num rows: 1 Data size: 187 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col9 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 187 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 187 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.156 seconds, Fetched: 66 row(s)
hive>
A brief walk-through of the execution:
1. The stage dependencies are listed first.
2. Map Reduce Local Work: this matches the description above, which says this stage appears when joining against a temporary table. The key step is reading the small table into memory and building the HashTable; compression is indeed applied while the HashTable is generated.
3. In the Map phase, the Mapper sequentially scans the large table, probes the HashTable, and passes the join result on to the next MapReduce task.
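Notably, neither plan shows the ConditionalTask mentioned in the description, presumably because `hive.auto.convert.join.noconditionaltask` is on by default in Hive 1.2, letting the compiler convert to MapJoin unconditionally. An untested sketch to try surfacing the conditional/backup-plan behavior:

```sql
-- With noconditionaltask disabled, Hive should keep the runtime choice
-- between the MapJoin task and the backup common-join plan
set hive.auto.convert.join.noconditionaltask=false;
explain
select a.ip, b.req_url
from weblog_b a
inner join (select ip, req_url from weblog_c
            where time in ('600','900','1200')) b
on a.ip = b.ip;
```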