hive join vs mapjoin: a comparison of how they work


join

Take the following ordinary join statement as an example:

select u.id,d.name from user u join dept d on u.dept_id=d.dept_id;
  • The intermediate steps
  1. Two map tasks process user and dept separately, each emitting the join key together with a <flag,value> pair; the flag marks which of the two tables a value came from
  2. The shuffle pulls all records with the same key together and hands them to a reduce task
  3. The reduce stage is the complex part: it first iterates over the values, using the flag to separate the two tables into one deptInfo and one userList,
    then iterates over userList again, setting the dept into each user to build the result set
  4. The result is written out to disk
  • Diagram
    (figure: illustration of the ordinary reduce-side join)
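The four steps above can be sketched as a toy Python simulation (made-up rows, not Hive's actual implementation):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows mirroring the user and dept tables below.
users = [("1", "aaa", "1001"), ("2", "bbb", "1001"), ("3", "ccc", "1002")]
depts = [("1001", "ppp"), ("1002", "lll")]

# Map phase: each table emits (join_key, (flag, value)); the flag
# tells the reducer which table a value came from.
mapped = [(dept_id, ("u", (uid, name))) for uid, name, dept_id in users]
mapped += [(dept_id, ("d", name)) for dept_id, name in depts]

# Shuffle: bring records with the same key together (sort, then group).
mapped.sort(key=itemgetter(0))

# Reduce phase: per key, split the values by flag into the dept info
# and a user list, then iterate the user list to emit the joined rows.
result = []
for dept_id, group in groupby(mapped, key=itemgetter(0)):
    dept_name, user_list = None, []
    for _, (flag, value) in group:
        if flag == "d":
            dept_name = value
        else:
            user_list.append(value)
    for uid, username in user_list:
        result.append((uid, username, dept_id, dept_name))
```

The sort-then-group step is the stand-in for the shuffle; it is exactly this data movement that mapjoin avoids.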

mapjoin

  1. A MapReduce Local Task reads the small table into memory, generates HashTableFiles, and uploads them to the Distributed Cache; the HashTableFiles are compressed along the way.

  2. In the Map phase of the MapReduce job, each Mapper reads the HashTableFiles from the Distributed Cache into memory, scans the large table sequentially, performs the join directly in the Map phase, and passes the results to the next MapReduce task.
    (figure: mapjoin illustration)
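The two steps above amount to a broadcast hash join; a minimal Python sketch (toy rows with hypothetical values, standing in for the HashTableFiles mechanism):

```python
# Hypothetical sample rows; dept plays the role of the small table.
users = [("1", "aaa", "1001"), ("2", "bbb", "1001"), ("3", "ccc", "1002")]
depts = [("1001", "ppp"), ("1002", "lll")]

# Local task: build an in-memory hash table from the small table
# (the role of the HashTableFiles shipped via the Distributed Cache).
dept_hash = {dept_id: name for dept_id, name in depts}

# Map phase: stream the big table once and probe the hash table
# directly -- no shuffle and no reduce phase is needed.
result = [
    (uid, username, dept_id, dept_hash[dept_id])
    for uid, username, dept_id in users
    if dept_id in dept_hash  # inner join: drop unmatched keys
]
```

Note that the whole small table must fit in each mapper's memory, which is why this strategy only applies below a size threshold.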

  • Prepare test data
    Create the user table
    hive> create table  user(uid string,username string,dept_id string); 
    Create the dept table
    hive> create table  dept(dept_id string,deptname string); 
    After inserting some rows:
    hive> select * from user;
    OK
    1	aaa	1001
    2	bbb	1001
    3	ccc	1002
    4	ccc	1002
    5	eee	1003
    
    hive> select * from dept;
    OK
    1001	ppp
    1002	lll
    1003	mmm
  • The default is true, which enables small-table mapjoin
    hive> set hive.auto.convert.join;
    
    hive.auto.convert.join=true

hive> select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
Query ID = root_20190419143939_7e240803-dad1-4484-8ad1-96e8bea08c14
Total jobs = 1
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Execution log at: /tmp/root/root_20190419143939_7e240803-dad1-4484-8ad1-96e8bea08c14.log
==================The MapReduce Local Task starts======================================
2019-04-19 02:39:19	Starting to launch local task to process map join;	maximum memory = 1908932608 
============Read the small table into memory and generate HashTableFiles====================================
2019-04-19 02:39:20	Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/root/b2027549-5ad1-4224-adca-a3a50c579791/hive_2019-04-19_14-39-15_636_1590766828359575157-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
============Upload the generated HashTableFiles to the Distributed Cache==========================
2019-04-19 02:39:20	Uploaded 1 File to: file:/tmp/root/b2027549-5ad1-4224-adca-a3a50c579791/hive_2019-04-19_14-39-15_636_1590766828359575157-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (341 bytes)
2019-04-19 02:39:20	End of local task; Time Taken: 1.236 sec.
=============The MapReduce Local Task succeeded===========================
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555297938886_0129, Tracking URL = http://spark001:8088/proxy/application_1555297938886_0129/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job  -kill job_1555297938886_0129
==================Only one map, no reduce and no shuffle==================
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2019-04-19 14:39:30,393 Stage-3 map = 0%,  reduce = 0%
2019-04-19 14:39:36,600 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.28 sec
MapReduce Total cumulative CPU time: 3 seconds 280 msec
Ended Job = job_1555297938886_0129
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 3.28 sec   HDFS Read: 6793 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 280 msec
OK
1	aaa	1001	ppp
2	bbb	1001	ppp
3	ccc	1002	lll
4	ccc	1002	lll
5	eee	1003	mmm
Time taken: 23.169 seconds, Fetched: 5 row(s)
  • Inspect the execution plan with explain
hive> explain  select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
OK
==================Three stages are generated======================
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: dept_id is not null (type: boolean)
              Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 dept_id (type: string)
                  1 dept_id (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 5 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: dept_id is not null (type: boolean)
              Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 dept_id (type: string)
                  1 dept_id (type: string)
                outputColumnNames: _col0, _col1, _col6, _col7
                Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: string), _col6 (type: string), _col7 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.163 seconds, Fetched: 62 row(s)
  • Disable automatic mapjoin
hive> set hive.auto.convert.join=false;
hive> explain  select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
OK
================Two stages are generated===========
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 5 Data size: 50 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: dept_id is not null (type: boolean)
              Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: dept_id (type: string)
                sort order: +
                Map-reduce partition columns: dept_id (type: string)
                Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                value expressions: uid (type: string), username (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: dept_id is not null (type: boolean)
              Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: dept_id (type: string)
                sort order: +
                Map-reduce partition columns: dept_id (type: string)
                Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                value expressions: deptname (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 dept_id (type: string)
            1 dept_id (type: string)
          outputColumnNames: _col0, _col1, _col6, _col7
          Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col6 (type: string), _col7 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.09 seconds, Fetched: 59 row(s)

hive> select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
Query ID = root_20190419144444_04313ede-8844-4670-89f8-5a8134ebd282
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1555297938886_0130, Tracking URL = http://spark001:8088/proxy/application_1555297938886_0130/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job  -kill job_1555297938886_0130
===================Two maps and one reduce======================
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2019-04-19 14:44:39,720 Stage-1 map = 0%,  reduce = 0%
2019-04-19 14:44:45,992 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.92 sec
2019-04-19 14:44:51,152 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.64 sec
MapReduce Total cumulative CPU time: 8 seconds 640 msec
Ended Job = job_1555297938886_0130
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 8.64 sec   HDFS Read: 14732 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 640 msec
OK
2	bbb	1001	ppp
1	aaa	1001	ppp
4	ccc	1002	lll
3	ccc	1002	lll
5	eee	1003	mmm
Time taken: 21.517 seconds, Fetched: 5 row(s)
  • Summary
  1. mapjoin can be viewed as a broadcast join: the small table's data is loaded into memory and there is no shuffle phase, which speeds up processing.
    However, if the data volume is too large, loading it into memory may cause an OOM
  2. An ordinary join produces a shuffle, which hurts efficiency (data has to be transferred); it can also produce data skew (if one key has too many rows, the task handling it will be very slow)