Map Join and Common Join in Hive, Explained

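When Hive runs a join on MapReduce, it can use one of two execution plans: a common join or a map join. A map join is an optimization of the common join: it skips the shuffle and reduce phases, which can greatly reduce the job's run time.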
I. Prerequisites

  • The emp table
hive> select * from emp;
OK
369 SMITH   CLERK   7902    1980-12-17 00:00:00 800.0   NULL    20
7499    ALLEN   SALESMAN    7698    1981-02-20 00:00:00 1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22 00:00:00 1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02 00:00:00 2975.0  NULL    20
7654    MARTIN  SALESMAN    7698    1981-09-28 00:00:00 1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01 00:00:00 2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09 00:00:00 2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1982-12-09 00:00:00 3000.0  NULL    20
7839    KING    PRESIDENT   NULL    1981-11-17 00:00:00 5000.0  NULL    10
7844    TURNER  SALESMAN    7698    1981-09-08 00:00:00 1500.0  0.0 30
7876    ADAMS   CLERK   7788    1983-01-12 00:00:00 1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03 00:00:00 950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03 00:00:00 3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23 00:00:00 1300.0  NULL    10
Time taken: 0.161 seconds, Fetched: 14 row(s)

  • The dept table

hive> select * from dept;
OK
10  ACCOUNTING  NEW YORK
20  RESEARCH    DALLAS
30  SALES   CHICAGO
40  OPERATIONS  BOSTON
Time taken: 0.185 seconds, Fetched: 4 row(s)

II. Implementation details
1. Common join

  • Architecture diagram
    [Figure: common join architecture; image not preserved]

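The map phase reads both tables and emits key/value records keyed on the join column, in the form

emp: deptno, (e.empno, e.ename)
dept: deptno, (d.dname)

The records are then shuffled by deptno and merged in the reducer, which produces the join result.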
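The flow above can be sketched in plain Python. This is only a toy model of a reduce-side join, not Hive's implementation: the function names and the 0/1 source tags are assumptions made for illustration, and the rows are a small subset of the emp and dept tables shown earlier.

```python
from collections import defaultdict

# A few rows from the emp and dept tables shown earlier.
emp = [(7499, "ALLEN", 30), (7566, "JONES", 20), (7782, "CLARK", 10)]
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")]


def map_emp(row):
    empno, ename, deptno = row
    # Key is the join column; tag 0 marks the record as coming from emp.
    return deptno, (0, (empno, ename))


def map_dept(row):
    deptno, dname = row
    # Tag 1 marks the record as coming from dept.
    return deptno, (1, (dname,))


def shuffle(pairs):
    # Group all values by key, as the shuffle does before the reducers run.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(deptno, values):
    # Split one key group by source tag, then emit every emp/dept combination.
    emps = [v for tag, v in values if tag == 0]
    depts = [v for tag, v in values if tag == 1]
    return [(empno, ename, deptno, dname)
            for empno, ename in emps
            for (dname,) in depts]


def common_join(emp_rows, dept_rows):
    pairs = [map_emp(r) for r in emp_rows] + [map_dept(r) for r in dept_rows]
    result = []
    for deptno, values in sorted(shuffle(pairs).items()):
        result.extend(reduce_join(deptno, values))
    return result


for row in common_join(emp, dept):
    print(row)
```

Department 40 has no matching employees, so its key group produces no output, which matches inner join semantics.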

  • Execution plan (EXPLAIN output for the join query)
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: deptno (type: int)
                sort order: +
                Map-reduce partition columns: deptno (type: int)
                Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
                value expressions: empno (type: int), ename (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: deptno (type: int)
                sort order: +
                Map-reduce partition columns: deptno (type: int)
                Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
                value expressions: dname (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 deptno (type: int)
            1 deptno (type: int)
          outputColumnNames: _col0, _col1, _col7, _col12
          Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
  • Walkthrough
    1. Map side: scanning the emp table (the dept table is handled the same way)
TableScan
            alias: e
            Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: deptno (type: int) // key of the key/value pair
                sort order: +
                Map-reduce partition columns: deptno (type: int)
                Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
                value expressions: empno (type: int), ename (type: string) // value of the key/value pair
    2. Reduce-side join
 Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 deptno (type: int) // join condition: the deptno column of each table
            1 deptno (type: int)
          outputColumnNames: _col0, _col1, _col7, _col12 // internal output column positions
          Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string) // output columns and their types
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

    3. Finish: fetch the result

Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

2. Map join

  • Architecture diagram
    [Figure: map join architecture; image not preserved]

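Hive first launches a local task that reads the smaller table, dept, and writes it into hash table files, which are then uploaded to the Hadoop distributed cache. Next, a map-only job starts: for every row it reads from the large table, it looks up the join key in the cached small table and emits the joined row, until the whole large table has been read.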
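As a rough illustration of the build-then-probe idea described above, here is a minimal Python sketch. It is a toy model rather than Hive's implementation: `build_hash_table` and `map_join` are hypothetical names, the hash table lives in an ordinary dict instead of a file in the distributed cache, and the rows are a small subset of the tables shown earlier.

```python
# A few rows from the emp (large) and dept (small) tables shown earlier.
emp = [(7499, "ALLEN", 30), (7566, "JONES", 20), (7782, "CLARK", 10)]
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")]


def build_hash_table(small_rows):
    # Local task: load the small table into a hash table keyed on deptno.
    table = {}
    for deptno, dname in small_rows:
        table.setdefault(deptno, []).append(dname)
    return table


def map_join(big_rows, hash_table):
    # Map task: stream the large table and probe the hash table per row.
    # There is no shuffle and no reducer; rows come out in scan order.
    for empno, ename, deptno in big_rows:
        for dname in hash_table.get(deptno, ()):
            yield (empno, ename, deptno, dname)


dept_table = build_hash_table(dept)
for row in map_join(emp, dept_table):
    print(row)
```

Because the small table is held entirely in memory by every mapper, this approach only makes sense when one side of the join is small.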

  • Execution plan
 STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)
                outputColumnNames: _col0, _col1, _col7, _col12
                Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.191 seconds, Fetched: 62 row(s)
  • Walkthrough
    1. A local task is launched to read the small table
 Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE

    2. Write the rows to a hash table file

 HashTable Sink Operator
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)

    3. Upload the file to the Hadoop cache (this step does not appear in the execution plan, but it is visible in the log)

 2018-01-11 10:30:28    Uploaded 1 File to: file:/tmp/hadoop/aedaa8e1-17a9-4211-86b1-79debe362aba/hive_2018-01-11_22-30-12_222_6099353227386611286-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile32--.hashtable (373 bytes)

    4. Run a map-only job that reads the large table and joins each row against the cached small table

 Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)
                outputColumnNames: _col0, _col1, _col7, _col12
                Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

    5. Finish: fetch the result

Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

III. Experimental results

hive> select e.empno,e.ename,e.deptno, d.dname
    > from emp e join dept d on e.deptno=d.deptno;
Query ID = hadoop_20180111202424_2a1594f6-ef46-4a99-a85d-4a4cc82e1b9c
Total jobs = 1
18/01/12 00:08:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/hadoop/hadoop_20180111202424_2a1594f6-ef46-4a99-a85d-4a4cc82e1b9c.log
2018-01-12 12:08:21 Starting to launch local task to process map join;  maximum memory = 518979584
2018-01-12 12:08:24 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/hadoop/aedaa8e1-17a9-4211-86b1-79debe362aba/hive_2018-01-12_00-08-10_896_8719990077853360918-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable
2018-01-12 12:08:24 Uploaded 1 File to: file:/tmp/hadoop/aedaa8e1-17a9-4211-86b1-79debe362aba/hive_2018-01-12_00-08-10_896_8719990077853360918-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (373 bytes)
2018-01-12 12:08:24 End of local task; Time Taken: 2.59 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1515720212312_0005, Tracking URL = http://hadoop:8088/proxy/application_1515720212312_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1515720212312_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-01-12 00:08:39,954 Stage-3 map = 0%,  reduce = 0%
2018-01-12 00:08:53,406 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.59 sec
MapReduce Total cumulative CPU time: 2 seconds 590 msec
Ended Job = job_1515720212312_0005
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.59 sec   HDFS Read: 6800 HDFS Write: 309 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 590 msec
OK
369 SMITH   20  RESEARCH
7499    ALLEN   30  SALES
7521    WARD    30  SALES
7566    JONES   20  RESEARCH
7654    MARTIN  30  SALES
7698    BLAKE   30  SALES
7782    CLARK   10  ACCOUNTING
7788    SCOTT   20  RESEARCH
7839    KING    10  ACCOUNTING
7844    TURNER  30  SALES
7876    ADAMS   20  RESEARCH
7900    JAMES   30  SALES
7902    FORD    20  RESEARCH
7934    MILLER  10  ACCOUNTING
Time taken: 43.697 seconds, Fetched: 14 row(s)

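Which of the two plans is used comes down to a single Hive tuning parameter:

set hive.auto.convert.join = true;
If it is false, Hive runs a common join.
If it is true, Hive converts the join to a map join when one of the tables is small enough to be held in memory.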
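The choice can be modelled with a few lines of Python. This is a deliberately simplified sketch, not Hive's planner: `choose_join` is a hypothetical helper, and the 25,000,000-byte limit is an assumption for the example (the real decision depends on further Hive settings that this post does not cover). The table sizes are the "Data size" values reported in the execution plans above.

```python
# Assumed size limit for the "small" side of the join, for this sketch only.
SMALL_TABLE_LIMIT_BYTES = 25_000_000


def choose_join(auto_convert_join, table_sizes,
                small_table_limit=SMALL_TABLE_LIMIT_BYTES):
    """Return the join strategy this simplified model would pick.

    auto_convert_join: value of hive.auto.convert.join (True or False).
    table_sizes: mapping of table alias to its size in bytes.
    """
    if not auto_convert_join:
        return "common join"
    # A map join needs at least one side small enough to load into memory.
    if min(table_sizes.values()) <= small_table_limit:
        return "map join"
    return "common join"


sizes = {"e": 820, "d": 80}  # Data size values from the plans above
print(choose_join(True, sizes))
print(choose_join(False, sizes))
```

With the parameter enabled, the 80-byte dept table is far below the assumed limit, so the model picks a map join, which is consistent with the Stage-3 map-only job in the log above.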

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/weixin_39216383/article/details/79043299