oralce&Hive join 简介& 谓词下推

最新推荐文章于 2024-08-19 19:45:00 发布

不名一文

最新推荐文章于 2024-08-19 19:45:00 发布

阅读量2.7k

点赞数

本文链接：https://blog.csdn.net/u012485099/article/details/80588802

版权

大数据同时被 3 个专栏收录

20 篇文章 7 订阅

订阅专栏

hive

6 篇文章 0 订阅

订阅专栏

oracle

3 篇文章 0 订阅

订阅专栏

在平时写sql时，join语句可能是使用频率最高语句之一。可是，你真的足够了解join语句吗。接下来以oracle和hive为例子，介绍join相关的基础知识

1.Oracle

1.1连接类型

1.1.1nested loop join

嵌套循环连接将驱动表(外表)和被驱动表(内表)进行join，读取外表的每一行，和内表进行比较操作，数据库一般将建有索引的表作为内表。

适用范围：当数据集较小，访问列上有索引时

例子：

 SQL> select /*+ leading(t1) use_nl(t) */
    2         empno
    3       , ename
    4       , dname
    5       , loc
    6    from emp t
    7    join dept t1
    8      on t.deptno = t1.deptno;
  
  Execution Plan
  ----------------------------------------------------------
  Plan hash value: 4192419542
  
  ---------------------------------------------------------------------------
  | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
  ---------------------------------------------------------------------------
  |   0 | SELECT STATEMENT   |      |    15 |   420 |    13   (0)| 00:00:01 |
  |   1 |  NESTED LOOPS      |      |    15 |   420 |    13   (0)| 00:00:01 |
  |   2 |   TABLE ACCESS FULL| DEPT |     4 |    72 |     3   (0)| 00:00:01 |
  |*  3 |   TABLE ACCESS FULL| EMP  |     4 |    40 |     3   (0)| 00:00:01 |
  ---------------------------------------------------------------------------
  
  Predicate Information (identified by operation id):
  ---------------------------------------------------
  
     3 - filter("T"."DEPTNO" IS NOT NULL AND "T"."DEPTNO"="T1"."DEPTNO")

这里使用hint强制走nested loop, /*+ leading(t1) use_nl(t) */ 表时t1表为驱动表，t为驱动表。执行计划中靠NESTED LOOPS最近的为驱动表,即

NESTED LOOPS 
    outer_loop
    inner_loop

执行原理:

上面执行计划可以用如下伪代码表示：

for erow in (select empno,dname,loc from dept) loop
    for drow in (select * from empno,ename where depto = outer.deptno) loop
      if matched then output values from erow and drow
      if not matched then discard the row
      end loop
  end loop

通过伪代码可以看到，执行主要分三步：

选定驱动表(一般选小表较好)
选定另一张表为内表
将驱动表中每一行和内表中的记录逐一进行匹

1.1.2hash join

使用场景，当数据量较大，且为等值条件。尤其当小表可以完全读入内存，hash join效率较高

执行步骤如下:

扫描小表，对小表使用hash函数并在pga中建立hash表，伪代码如下

  for small_table_row in (select * from small_table)
  loop
    slot_number := hash(small_table_row.join_key);
    insert_hash_table(slot_number,small_table_row);
  end loop

访问另一张较大的表，称为探测表。数据库访问较大表每一行，利用hash函数生成hash值，并和内存中较小的表逐一匹配，伪代码如下

for large_table_row IN (select * from large_table)
  loop
     slot_number := hash(large_table_row.join_key);
     small_table_row = look_hash_table(slot_number,large_table_row.join_key);
     IF small_table_row found
     then
        output small_table_row + large_table_row;
     end if;
  end loop;

在此过程中，可能会遇到以下问题:

hash过程中发生hash碰撞。遇到这种情况，数据库将相同hash值数据存储一个在链表中，想了解LinkedList，猛戳此处
小表在pga中存储不够。此时将生成hash桶表，将较大的桶写入临时表空间(磁盘)中。对探测表每一行，应用hash函数后和pga中hash桶进行匹配，没找到的情况下再去磁盘hash桶中查找

1.1.3sort merge join

hash Join需要一张hash表和一张探测表，而sort merge需要对两张排序表。在实际过程中，sort消耗成本较大，但在如下情况会使用sort merge

Join 条件为不等连接，如 >,<, !=，此种情况下，hash join无法工作。
当join 的列中存在索引，此时可以避免对第一张表进行排序(但是数据库会忽略索引，对第二张表进行排序)。
当小表无法完全缓存进内存中时，此种情况可能会进行sort merge。相对hash join会将表写入磁盘中，多次进行读取操作。而sort merge此时会将两张表部分数据都写入磁盘中，并且只会读取一次。

sort merge 执行伪代码如下

read data_set_1 sort by join key to temp_ds1
  read data_set_2 sort by join key to temp_ds2
   
  read ds1_row from temp_ds1
  read ds2_row from temp_ds2
  
  while not eof on temp_ds1,temp_ds2
  loop
      IF ( temp_ds1.key = temp_ds2.key ) OUTPUT JOIN ds1_row,ds2_row
      ELSIF ( temp_ds1.key <= temp_ds2.key ) READ ds1_row FROM temp_ds1
      ELSIF ( temp_ds1.key => temp_ds2.key ) READ ds2_row FROM temp_ds2
  end loop

总结：nested loop适用于两个较小数据集，或者较大数据集但是只返回几行数据(first_rows hint)，nested loop在sga中进行；当数据量较大时且连接条件不为不等连接或没有索引，此时适用hash join，hash join发生在pag中；如果数据集较大且为不等连接或存在索引，此时sort merge登场了，sort merge在pga中进行。注意：此处阐述的规律为一般情况，不排除有其它的特殊情况；表述的较大或较小数据集指的是进行谓词过滤后的数据集。

1.2连接方法

连接方法主要分为内连接，外连接，半连接、反连接和笛卡尔连接

1.2.1内连接

内连接取两个数据集中重合部分。

1.2.2外连接

外链接分左外链接，右外链接和全连接。左外连接取左表全部数据及右表能关联上数据，右表不能关联上则值为null。右外链接和左外链接刚好相反。全连接同时取左右表全部数据，不能关联上则置空。以下分别为inner join,left outer join,right outer join,full outer join

1.2.3半连接

左连接发生在两个数据集之间。当第一个数据的值在第二个数据集中找到第一行匹配项则停止查找，此时，第一个数据集返回函数。这样可以避免分返回大量数据行。

使用场景：当查询语句中含有in 或 exists子句

1.2.4反连接

反连接发生在两个数据集之间。当第一个数据集中在第二个数据集中找到第一行匹配项则停止查找，此时，第一个数据集不返回函数

使用场景

语句中存在not In或not exists
语句中存在outer join且条件中存在is null条件，如下：

1.2.5笛卡尔连接

笛卡尔连接为每一个集合的每一条记录和另一个集合的每一条记录进行连接，此时不需要连接键。若两个集合记录数分别为M、N，则结果记录数为M*N

参考文档

2.hive

hive存储基于hdfs，计算主要是基于mapreduce框架。

2.1连接类型

2.1.1map side join

如果join中有一张表较小，这种情况下，可以使用map join，不需要进行reduce操作。

 select /*+ mapjoin(b) */ 
         a.key
       , a.value
    from a 
    join b 
      on a.key = b.key

hive中还可以通过以下两种方式开启map join

  --方式二
  --根据输入文件大小判断是否开启map Join
  -- 0.7.0增加参数，默认值为false，0.11.0及以后版本为true
  set hive.auto.convert.join=true;
  
  --方式三
  -- 根据输入文件大小判断是否开启map join，如果小于hive.auto.convert.join.noconditionaltask.size值则开启mapjoin，hive.auto.convert.join.noconditionaltask.size值默认10M;
  -- 0.11.0版本增加参数，可以自动识别各种案例进行优化
  set hive.auto.convert.join.noconditionaltask=true;
  set hive.auto.convert.join.noconditionaltask.size=10000000;

目前hive推荐使用第三种方式智能进行mapjoin配置

select /*+mapjoin(smallTableTwo)*/ 
         idOne
       , idTwo
       , value 
    from (select /*+MAPJOIN(smallTableOne)*/ 
                idOne
              , idTwo
              , value 
           from bigTable 
           join smallTableOne 
             on bigTable.idOne = smallTableOne.idOne) firstjoin                  
    join smallTableTwo 
      on firstjoin.idTwo = smallTableTwo.idTwo

以上查询语句不支持。如果没有mapjoin，以上查询会生成两个map join，采用第三种方式设置则生成一个map join

➢ 原理：map join执行步骤：

本地工作：在本地机器读入数据，在内存中建立hash表，然后写入本地磁盘并上传至hdfs。最后将hash表写入分布式缓存中
map任务:将数据从分布式缓存中读入内存，将表中数据逐一和hash表匹配。将匹配到数据进行合并并写入hdfs中
无reduce任务

➢ 实现

MR实战

2.1.2common join/reduce side join/shuffle join

若两种表都较大在map端无法缓存进内存，此时只能在reduce端进行join

如果多张表中用同一列进行join操作，那么hive会将其转成一个map/redcuce job

 -- 同一列进行join，转成一个map reduce job
  select a.val
       , b.val
       , c.val 
   from a 
   join b 
     on a.key = b.key1
   join c ON c.key = b.key1;
   
   -- b中关联条件时分别用到 key1和key2，此时生成两个 map/reduce job
  select a.val
       , b.val
       , c.val 
   from a 
   join b 
     on a.key = b.key1
   join c
     on c.key = b.key2

在每一个join相关的map reduce任务中，reducer时最后一张表会作为流式(stream)处理，其它表会进行缓存。所以将最大表放在最后可以减少需要的内存 (新版貌似没这个规则了，会自动优化)

另外，可以通过hint指定流式表。如果hint被忽略，hive将最右边的表作为流式表 (新版貌似没这个规则了，会自动优化)

 select /*+ streamtable(a) */ 
         a.val
       , b.val
       , c.val 
    from a 
    join b 
      on a.key = b.key1 
    join c 
      on c.key = b.key1

官网文档一官网文档二官网文档三

➢ 实战

MR实战另外您也可以查看 Mapreduce简介中 3.7部分

2.1.3bucket mapjoin

如果join的表在join列上进行了分桶，一张表的桶数是其它表的倍数，这种情况下，相关桶之间在map端进行join操作，称为bucketized map-side join,通过如下参数开启

  -- 方法一
  set hive.optimize.bucketmapjoin = true;

➢ 原理

上图阐述了原理：相应的桶在map端进行join，图中没有显示有cache.

补充：

1.bucketmapjoin可认为属于mapjoin的特例，打开MapJoinOperator 源码可以看到：

if (!conf.isBucketMapJoin() && !conf.isDynamicPartitionHashJoin()) {  //此处判断了是否为 BucketMapJoin，说明为mapjoin的一种实现
  /*
   * The issue with caching in case of bucket map join is that different tasks
   * process different buckets and if the container is reused to join a different bucket,
   * join results can be incorrect. The cache is keyed on operator id and for bucket map join
   * the operator does not change but data needed is different. For a proper fix, this
   * requires changes in the Tez API with regard to finding bucket id and
   * also ability to schedule tasks to re-use containers that have cached the specific bucket.
   */
  if (LOG.isDebugEnabled()) {
    LOG.debug("This is not bucket map join, so cache");
  }

  Future<Pair<MapJoinTableContainer[], MapJoinTableContainerSerDe[]>> future =
      cache.retrieveAsync(
          cacheKey, () ->loadHashTable(mapContext, mrContext));
  asyncInitOperations.add(future);
} else if (!isInputFileChangeSensitive(mapContext)) {
  loadHashTable(mapContext, mrContext);
  hashTblInitedOnce = true;
}

2.笔者以为小表按桶开启了缓存(仅个人观点，未测试)

2.1.4sort merge bucket join(SMBJ)

如果join表在join列上进行了分桶和排序，并且有相同的桶数，此时进行sort merge join操作。相关的桶会在mapper进行Join。需要进行如下设置开启sort merge join

  set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
  set hive.optimize.bucketmapjoin = true;
  set hive.optimize.bucketmapjoin.sortedmerge = true;

具体原理参考 1.1.3 sort merge join 。此时并不需要将数据完全加载进内存，由于事先已经排序好，因此效率较高。

参考文档

2.2连接类型

hive 连接类型和oracle类似，写法上稍有不同

2.2.1左半连接

在Hive 0.13版本以前，不支持in、not in、exists、not exists子句。此时用left semi join实现in操作。下面两种写法是等价的

-- 写法一
  select a.key
       , a.value
    from a
   where a.key in (select b.key
                     from b);
  
  -- 写法二
  select a.key
       , a.val
    from a 
    left semi join b 
      on a.key = b.key

3.谓词下推/JOIN简化

谓词下推 Predicate Pushdown（PPD）：简而言之，就是在不影响结果的情况下，尽量将过滤条件提前执行。谓词下推后，过滤条件在map端执行，减少了map端的输出，降低了数据在集群上传输的量，节约了集群的资源，也提升了任务的性能。

谓词下推(Predicate Pushdown) 原理

对于join，先给出如下定义:

保留行表:outer join中返回所有行的表，如left outer join中的左表，right outer join中的右表，full outer join中左右表均为保留表
空供应表：如果和主表没有匹配上返回空值的表，如left outer join中的右表，right outer join中左表，full outer join中左右表均为空供应表
join过程中谓词：写在join过程中的过滤条件
join后谓词：写在where条件中的过滤条件

Preserved Row table	The table in an Outer Join that must return all rows. For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.
Null Supplying table	This is the table that has nulls filled in for its columns in unmatched rows. In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.
During Join predicate	A predicate that is in the JOIN ON clause. For example, in '`R1 join R2 on R1.x = 5`' the predicate '`R1.x = 5`' is a During Join predicate.
After Join predicate	A predicate that is in the WHERE clause.

存在如下规则:

Join过程中谓词不能推移到保留行表
join后谓词不能推移到空供应表

During Join predicates cannot be pushed past Preserved Row tables.
After Join predicates cannot be pushed past Null Supplying tables.

正向解释下：join过程中的过滤条件可以推移到空供应表上，where 过程中的条件可以推移到保留行表。

	Join(inner join)		Left Outer Join		Right Outer Join		Full Outer Join
	Left Table	Right Table	Left Table	Right Table	Left Table	Right Table	Left Table	Right Table
Join Predicate	Pushed	Pushed	Not Pushed	Pushed	Pushed	Not Pushed	Not Pushed	Not Pushed
Where Predicate	Pushed	Pushed	Pushed	Not Pushed	Not Pushed	Pushed	Not Pushed	Not Pushed

实例

相信70%工程师都会写出来如下的代码:

 select t.emp_id
       , t.emp_name
       , t1.dept_name
    from (select emp_id
               , emp_name
               , dept_id
            from emp
           where emp_id != 123
             and dt = '20180618')t
    left join (select dept_id
                    , dept_name
                 from dept 
                where dept != 'abc'
                  and dt = '20180618'
                  )t1
      on t.dept_id = t1.dept_id

根据谓词下推规则，等价代码如下:

 select t.emp_id
       , t.emp_name
       , t1.dept_name
    from emp t     -- 保留表
    left join dept t1   -- 空供应表
      on t.dept_id = t1.dept_id
     and t1.dept_id = 'abc'
     and t1.dt = '20180618'
   where t.emp_id = '123'
     and t.dt = '20180618'

emp为保留行表，dept为空供应表。where中t.emp_id = '123' and dt = '20180618'条件推移到emp表上；join中t1.dept_id='abc' and t1.dt = '20180618'推移到dept表上。

以上两种写法等价，第一种写法比较容易理解；第二种写法非常简洁，个人推荐第二种写法。

公司环境下：对于full outer join而言，在on 条件下的过滤是可以进行下推的，实际中请查看执行计划，谨慎使用(其它没问题)

参考文档参考文档

优化

可能很多书籍或者文章写过如列裁剪，建索引等方法。这里介绍的是根据笔者经验，通过一个例子介绍思维上不一样的点(很久以前的例子了，具体性能待测试)，仅供参考

-- 优化前
  select order_id
       , greast(t.consume_time,t1.consume_time) consume_time  -- 取最小值和最大值中的最大值
    from (select order_id
               , min(consume_time) consume_time   -- 取最小值
            from order_info
           where dt = '${bizdate}'
           group by order_id
         )t
    join (select order_id
               , max(consume_time) consume_time   -- 取最大值
            from order_info_his
           where dt = '${bizdate}'
           group by order_id
         )t1
      on t.order_id = t1.order_id;
      
  -- 不同写法
  select order_id
       , greast(min(if(flg = 1,t.consume_time,null)),max(if(flg =2,t2.consume_time,null))) consume_time
    from (select 1 flg
               , order_id
               , consume_time consume_time
            from order_info
           where dt = '${bizdate}'
          
           union all 
          
          select 2 flg
               , order_id
               , consume_time consume_time
            from order_info_his
           where dt = '${bizdate}'
         )t
   group by order_id

更多优化请参考 hive优化

参考：

参考文档一参考文档二参考文档三