Hive join vs mapjoin: how they work
join
For example, a plain (reduce-side) join SQL statement:
select u.id,d.name from user u join dept d on u.dept_id=d.dept_id;
- The intermediate steps
- Two map tasks read user and dept respectively, each emitting the join key plus a <flag,value> pair; the flag marks which of the two tables a record came from
- Shuffle pulls all records with the same key together and hands them to a reduce task
- Reduce has the most complex logic: it iterates over the values and splits them by flag into one deptInfo and one userList,
then iterates over userList again, setting the dept into each row to build the result set - the output is written to disk
- Diagram (original image not included)
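The reduce-side steps above can be sketched in plain Python. This is a simplified stand-in for illustration, not Hive's actual implementation: the tables are in-memory lists, the sort stands in for the shuffle, and the `"U"`/`"D"` strings play the role of the flag.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical in-memory stand-ins for the two tables.
users = [("1", "aaa", "1001"), ("2", "bbb", "1001"), ("3", "ccc", "1002")]
depts = [("1001", "ppp"), ("1002", "lll")]

# Map phase: emit (join_key, (flag, value)); the flag tells the
# reducer which table each record came from.
mapped = [(dept_id, ("U", (uid, name))) for uid, name, dept_id in users]
mapped += [(dept_id, ("D", name)) for dept_id, name in depts]

# Shuffle phase: sorting by key brings records with the same key
# into one contiguous group for the reducer.
mapped.sort(key=itemgetter(0))

# Reduce phase: for each key, split the values by flag into one
# dept_info and one user_list, then cross them to build result rows.
result = []
for key, group in groupby(mapped, key=itemgetter(0)):
    dept_info, user_list = None, []
    for _, (flag, value) in group:
        if flag == "D":
            dept_info = value
        else:
            user_list.append(value)
    if dept_info is not None:
        for uid, name in user_list:
            result.append((uid, name, key, dept_info))

print(result)
```

Note that every record with a given key, from both tables, must travel through the shuffle to the same reducer; that data movement is what mapjoin avoids.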
mapjoin
- A MapReduce Local Task reads the small table into memory, builds HashTableFiles, and uploads them to the Distributed Cache; the HashTableFiles are compressed before upload.
- In the Map stage of the MapReduce job, each Mapper loads the HashTableFiles from the Distributed Cache into memory, scans the big table sequentially, performs the join directly in the Map stage, and passes the data on to the next MapReduce task.
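The same idea in a minimal Python sketch (a dict stands in for the HashTableFiles that Hive ships via the Distributed Cache; the table contents are the sample data used below):

```python
# Local task: load the small table (dept) into an in-memory hash
# table, analogous to Hive's HashTableFiles.
depts = {"1001": "ppp", "1002": "lll", "1003": "mmm"}

users = [("1", "aaa", "1001"), ("2", "bbb", "1001"), ("5", "eee", "1003")]

# Map phase: stream the big table once and probe the hash table.
# No shuffle and no reduce phase is needed.
joined = [
    (uid, name, dept_id, depts[dept_id])
    for uid, name, dept_id in users
    if dept_id in depts  # inner join: drop rows with no match
]
print(joined)
```

Because the lookup happens inside each mapper, the join finishes in a map-only job, which matches the "0 reducers" seen in the log below.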
- Prepare test data
Create the user table:
hive> create table user(uid string,username string,dept_id string);
Create the dept table:
hive> create table dept(dept_id string,deptname string);
Insert a few rows into each, then check:
hive> select * from user;
OK
1 aaa 1001
2 bbb 1001
3 ccc 1002
4 ccc 1002
5 eee 1003
hive> select * from dept;
OK
1001 ppp
1002 lll
1003 mmm
- The default is true, which enables mapjoin for small tables
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
Query ID = root_20190419143939_7e240803-dad1-4484-8ad1-96e8bea08c14
Total jobs = 1
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Execution log at: /tmp/root/root_20190419143939_7e240803-dad1-4484-8ad1-96e8bea08c14.log
==================start the MapReduce Local Task======================================
2019-04-19 02:39:19 Starting to launch local task to process map join; maximum memory = 1908932608
============read the small table into memory and build the HashTableFiles====================================
2019-04-19 02:39:20 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/root/b2027549-5ad1-4224-adca-a3a50c579791/hive_2019-04-19_14-39-15_636_1590766828359575157-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
============upload the generated HashTableFiles to the Distributed Cache==========================
2019-04-19 02:39:20 Uploaded 1 File to: file:/tmp/root/b2027549-5ad1-4224-adca-a3a50c579791/hive_2019-04-19_14-39-15_636_1590766828359575157-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (341 bytes)
2019-04-19 02:39:20 End of local task; Time Taken: 1.236 sec.
=============MapReduce Local Task succeeded===========================
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1555297938886_0129, Tracking URL = http://spark001:8088/proxy/application_1555297938886_0129/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job -kill job_1555297938886_0129
==================only one map, no reduce and no shuffle==================
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2019-04-19 14:39:30,393 Stage-3 map = 0%, reduce = 0%
2019-04-19 14:39:36,600 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.28 sec
MapReduce Total cumulative CPU time: 3 seconds 280 msec
Ended Job = job_1555297938886_0129
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 3.28 sec HDFS Read: 6793 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 280 msec
OK
1 aaa 1001 ppp
2 bbb 1001 ppp
3 ccc 1002 lll
4 ccc 1002 lll
5 eee 1003 mmm
Time taken: 23.169 seconds, Fetched: 5 row(s)
- Use explain to view the execution plan
hive> explain select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
OK
==================three stages are generated======================
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: dept_id is not null (type: boolean)
Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 dept_id (type: string)
1 dept_id (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: u
Statistics: Num rows: 5 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: dept_id is not null (type: boolean)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 dept_id (type: string)
1 dept_id (type: string)
outputColumnNames: _col0, _col1, _col6, _col7
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col6 (type: string), _col7 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.163 seconds, Fetched: 62 row(s)
- Disable automatic mapjoin
hive> set hive.auto.convert.join=false;
hive> explain select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
OK
================two stages are generated===========
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: u
Statistics: Num rows: 5 Data size: 50 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: dept_id is not null (type: boolean)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: dept_id (type: string)
sort order: +
Map-reduce partition columns: dept_id (type: string)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
value expressions: uid (type: string), username (type: string)
TableScan
alias: d
Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: dept_id is not null (type: boolean)
Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: dept_id (type: string)
sort order: +
Map-reduce partition columns: dept_id (type: string)
Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: NONE
value expressions: deptname (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 dept_id (type: string)
1 dept_id (type: string)
outputColumnNames: _col0, _col1, _col6, _col7
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col6 (type: string), _col7 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.09 seconds, Fetched: 59 row(s)
hive> select u.uid,u.username,d.dept_id,d.deptname from user u join dept d on u.dept_id=d.dept_id;
Query ID = root_20190419144444_04313ede-8844-4670-89f8-5a8134ebd282
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1555297938886_0130, Tracking URL = http://spark001:8088/proxy/application_1555297938886_0130/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job -kill job_1555297938886_0130
===================2 maps and 1 reduce======================
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2019-04-19 14:44:39,720 Stage-1 map = 0%, reduce = 0%
2019-04-19 14:44:45,992 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.92 sec
2019-04-19 14:44:51,152 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.64 sec
MapReduce Total cumulative CPU time: 8 seconds 640 msec
Ended Job = job_1555297938886_0130
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 8.64 sec HDFS Read: 14732 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 640 msec
OK
2 bbb 1001 ppp
1 aaa 1001 ppp
4 ccc 1002 lll
3 ccc 1002 lll
5 eee 1003 mmm
Time taken: 21.517 seconds, Fetched: 5 row(s)
- Summary
- mapjoin can be seen as a broadcast join: the small table is loaded into memory and there is no shuffle phase, which speeds up processing.
However, if the small table is too large, loading it into memory can cause an OOM. - A plain join produces a shuffle, which hurts efficiency (data transfer); it can also produce data skew (if one key has too many records, the task handling it becomes very slow).
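The data-skew point can be made concrete with a toy key distribution (the key names and counts here are made up for illustration):

```python
from collections import Counter

# Hypothetical shuffle input where one join key dominates.
keys = ["hot"] * 9000 + ["k1"] * 50 + ["k2"] * 50

# All records with the same key go to the same reduce task, so the
# reducer that owns "hot" receives ~98% of the rows and becomes the
# straggler that the whole job waits on.
counts = Counter(keys)
print(counts.most_common(1))  # prints [('hot', 9000)]
```

mapjoin sidesteps this entirely: with no shuffle, the hot key is probed locally by each mapper instead of being funneled to a single reducer.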