《HiveSQL 常见的JOIN误区》

文章目录

1. SQL JOIN 条件是写在 WHERE 条件后边还是 ON 条件后边?

1.1 平常我们写 SQL 会筛选条件写在 WHERE 条件后边还是 ON 条件后边会产生不同的结果么?

我们遇到的 JOIN SQL 或者是我们写 SQL JOIN 的时候有时候会不会有以下的疑惑?JOIN 中俩个表的筛选条件应该写在 ON 后边还是 WHERE 后边,还是使用子查询先进行条件筛选是一个比较优的一个方式,看下以下 5 段 SQL,他们是否会得到相同的结果,平常我们使用哪种方式来写 SQL

  • 先看下以下三段 SQL,他们的执行计划和执行结果都是一样的,都会先筛选数据,然后进行 left join 操作,为什么会有这样的结果?原因是来源于 HIVE 自身谓词下推的优化
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day='20221215')
WHERE c1.day='20221215'
;
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN
(
    SELECT
        order_id
        ,cost
        ,day
    FROM ymtods.order_supplys
    WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM
(
    SELECT
        order_no
        ,seller_id
        ,day
    FROM ymtods.orders
    WHERE day = '20221215'
) c1
LEFT JOIN
(
    SELECT
        order_id
        ,cost
        ,day
    FROM ymtods.order_supplys
    WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
;

1.2 HIVE 中谓词下推是什么,会对我们 sql 做什么优化?

谓词下推,在 Hive 中叫 Predicate Pushdown,含义是指在不影响结果的前提下,尽量将过滤条件提前执行,使得最后参与 join 的表的数据量更小。谓词下推后,过滤条件将在 map 端提前执行,减少 map 端输出,降低了数据传输 IO,节约资源,提升性能。在 Hive 中通过配置 hive.optimize.ppd 参数为 true,开启谓词下推,默认为开启状态。

2. 测试环境以及测试表结构和测试数据

2.1 测试环境(目前公司环境)

开启cbo优化
hive.cbo.enable=true
执行引擎
hive.execution.engine=spark
hive版本
Hive 2.3.3
开启谓词下推优化
hive.optimize.ppd=true

2.1 测试表(ymtods)数据结构和数据

2.1.1 测试表(ymtods.orders)数据结构
create table if not exists ymtods.orders(
    order_no bigint comment 'order_no'
    ,day bigint comment 'day'
    ,seller_id bigint comment 'seller_id'
)
COMMENT '测试orders表'
STORED AS PARQUET;
2.1.2 测试表(ymtods.orders)数据插入
insert overwrite table ymtods.orders
select 1 as order_no,20221215 as day,12 as seller_id
union all
select 2 as order_no,20221215 as day,13 as seller_id
union all
select 3 as order_no,20221214 as day,14 as seller_id
;
2.1.3 测试表(ymtods.orders)数据
order_nodayseller_id
12022121512
22022121513
32022121414

2.2 测试表(ymtods.order_supply)数据结构和数据

2.2.1 测试表(ymtods.orders)数据结构
create table if not exists ymtods.order_supplys(
    order_id bigint comment 'order_id'
    ,day bigint comment 'day'
    ,cost bigint comment 'cost'
)
COMMENT '测试order_supplys表'
STORED AS PARQUET;
2.2.2 测试表(ymtods.orders)数据插入
insert overwrite table ymtods.order_supplys
select 1 as order_id,20221215 as day,100 as cost
union all
select 2 as order_id,20221215 as day,200 as cost
union all
select 3 as order_id,20221214 as day,300 as cost
;
2.2.3 测试表(ymtods.orders)数据
order_iddaycost
120221215100
220221215200
320221214300

3. SQL JOIN 中表的分类以及谓词的含义

3.1 保留表定义

  • Preserved Row table:保留表
  a LEFT (outer) JOIN b 中的 a 表;
  a right (outer) JOIN b 中的 b 表;
  a FULL OUTER JOIN b a 和 b 表都是 保留表

3.2 保留表定义

  • Null Supplying table:提供 null 值的表,也就是 非保留表,在 JOIN 中如果不能匹配上的使用 null 填充的表
  a LEFT (outer) JOIN b 中的 b 表;
  a right (outer) JOIN b 中的 a 表,
  a FULL OUTER JOIN b a 和 b 表都是 null值保留表

3.3 JOIN 中谓词定义

  • During JOIN predicate:JOIN 中谓词,就是 ON 后面的条件。
  a JOIN b (on a.id = b.id AND a.id = 5) –> a.id = 5就是JOIN中谓词

3.4 JOIN 后谓词定义

  • After JOIN predicate:JOIN 后谓词,就是 WHERE 后面的条件。
  a LEFT JOIN b WHERE a.t=1 –> a.t=1 就是JOIN后谓词

4. LEFT OUTER JOIN & RIGHT OUTER JOIN 谓词下推实例分析

4.1 过滤条件写在 WHERE, 且是 保留表的字段

4.1.1 过滤条件写在 WHERE, 且是 保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
4.1.2 过滤条件写在 WHERE, 且是 保留表的字段执行计划分析
Map Operator Tree:
    TableScan
      alias: c1
      filterExpr: (day = 20221215) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: (day = 20221215) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
        Map JOIN Operator
          condition map:
               LEFT OUTER JOIN0 to 1
          keys:
            0 order_no (type: bigint)
            1 order_id (type: bigint)
          outputColumnNames: _col0, _col2, _col8
          input vertices:
            1 Map 2
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
4.1.3 过滤条件写在 WHERE, 且是 保留表的字段结果分析
  • 结论
可以谓词下推,首先对 c1 表按条件 day = 20221215 过滤,然后再进行 LEFT JOIN 操作
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

4.2 过滤条件写在 WHERE, 且是 非保留表的字段

4.2.1 过滤条件写在 WHERE, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c2.day='20221215'
;
4.2.2 过滤条件写在 WHERE, 且是 非保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c2
      filterExpr: ((day = 20221215) AND order_id is not null) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: ((day = 20221215) AND order_id is not null) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE

Map Operator Tree:
    TableScan
      alias: c1
      filterExpr: order_no is not null (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: order_no is not null (type: boolean)
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: order_no (type: bigint), seller_id (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
          Map JOIN Operator
            condition map:
                 INNER JOIN 0 to 1
            keys:
              0 _col0 (type: bigint)
              1 _col0 (type: bigint)
            outputColumnNames: _col0, _col1, _col4
            input vertices:
              1 Map 2
            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
4.2.3 过滤条件写在 WHERE, 且是 非保留表的字段结果分析
  • 结论
可以谓词下推,首先对 c1 表按条件 day = 20221215 过滤,然后再进行 INNER JOIN 操作
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

4.3 过滤条件写在 ON, 且是保留表的字段

4.3.1 过滤条件写在 ON, 且是保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c1.day='20221215')
;
4.3.2 过滤条件写在 ON, 且是保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c1
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Colustats: NONE
      Map JOIN Operator
        condition map:
             LEFT OUTER JOIN0 to 1
        filter predicates:
          0 {(day = 20221215)}
          1
        keys:
          0 order_no (type: bigint)
          1 order_id (type: bigint)
        outputColumnNames: _col0, _col2, _col8
4.3.3 过滤条件写在 ON, 且是保留表的字段结果分析
  • 结论
不可以谓词下推,c1 表和 c2 表进行 LEFT JOIN,条件是筛选 c1 表(day = 20221215)和 c2 表 LEFT JOIN,结果 c1 表会保留所有数据,c2 表会保留和 c1 表(day = 20221215)JOIN 到的数据
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215
31420221214NULLNULLNULL

4.4 过滤条件写在 ON, 且是 非保留表的字段

4.4.1 过滤条件写在 ON, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day = '20221215')
;
4.4.2 过滤条件写在 ON, 且是 非保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c2
      filterExpr: (day = 20221215) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: (day = 20221215) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
        Spark HashTable Sink Operator
          keys:
            0 order_no (type: bigint)
            1 order_id (type: bigint)

Map Operator Tree:
    TableScan
      alias: c1
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Map JOIN Operator
        condition map:
             LEFT OUTER JOIN0 to 1
        keys:
          0 order_no (type: bigint)
          1 order_id (type: bigint)
        outputColumnNames: _col0, _col2, _col8
        input vertices:
          1 Map 2
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
4.4.3 过滤条件写在 ON, 且是 非保留表的字段结果分析
  • 结论
可以谓词下推,首先对 c2 表进行 day = '20221215'条件筛选,然后将 c1 和 c2 表进行 LEFT JOIN 操作
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215
31420221214NULLNULLNULL

4.5 LEFT OUTER JOIN & RIGHT OUTER JOIN 谓词下推实例分析总结

ON/WHERE保留字段(LEFT 左表)非保留字段(LEFT 右表)
ON不可以可以
WHERE可以不可以
1. 对于 LEFT OUTER JOIN ,右侧的表筛选条件写在 ON 后面、左侧的表写在 WHERE 后面,性能上有提高;
2. 对于 RIGHT OUTER JOIN,左侧的表筛选条件写在 ON 后面、右侧的表写在 WHERE 后面,性能上有提高;

5. FULL OUTER JOIN 谓词下推实例分析

5.1 过滤条件写在 WHERE, 且是 保留表的字段

5.1.1 过滤条件写在 WHERE, 且是 保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
5.1.2 过滤条件写在 WHERE, 且是 保留表的字段 SQL 执行计划分析
Map Operator Tree:
TableScan
  alias: c1
  filterExpr: (day = 20221215) (type: boolean)
  Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
  Filter Operator
    predicate: (day = 20221215) (type: boolean)
    Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
    Select Operator
      expressions: order_no (type: bigint), seller_id (type: bigint)
      outputColumnNames: _col0, _col2
      Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
      Map JOIN Operator
        condition map:
             LEFT OUTER JOIN0 to 1
        keys:
          0 _col0 (type: bigint)
          1 _col0 (type: bigint)
        outputColumnNames: _col0, _col2, _col4
        input vertices:
          1 Map 2
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE

5.1.3 过滤条件写在 WHERE, 且是 保留表的字段结果分析
  • 结论
可以谓词下推, 会对 c1 表先进行 day = 20221215 筛选,然后进行 JOIN 操作,但是这里有个注意点,这里转换成了 LEFT JOIN 并不是 FULL JOIN
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

5.2 过滤条件写在 WHERE, 且是 非保留表的字段

5.2.1 过滤条件写在 WHERE, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c2.day='20221215'
;
5.2.2 过滤条件写在 WHERE, 且是 非保留表的字段 SQL 执行计划分析
TableScan
    alias: c2
    filterExpr: (day = 20221215) (type: boolean)
    Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
    Filter Operator
      predicate: (day = 20221215) (type: boolean)
      Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
      Select Operator
        expressions: order_id (type: bigint), cost (type: bigint)
        outputColumnNames: _col0, _col2
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
        Map JOIN Operator
          condition map:
               RIGHT OUTER JOIN0 to 1
          keys:
            0 _col0 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          input vertices:
            0 Map 1
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
;
5.2.3 过滤条件写在 WHERE, 且是 非保留表的字段结果分析
  • 结论
可以谓词下推,会对 c2 表先进行 day = 20221215 筛选,然后进行 JOIN 操作,但是这里有个注意点,这里转换成了 c1 right JOIN c2 并不是 FULL JOIN
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

5.3 过滤条件写在 ON, 且是 保留表的字段

5.3.1 过滤条件写在 ON, 且是 保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c1.day='20221215')
;
5.3.2 过滤条件写在 ON, 且是 保留表的字段 SQL 执行计划分析
Reducer 2
    Reduce Operator Tree:
      JOIN Operator
        condition map:
             Outer JOIN 0 to 1
        filter predicates:
          0 {(VALUE._col0 = 20221215)}
          1
        keys:
          0 _col0 (type: bigint)
          1 _col0 (type: bigint)
        outputColumnNames: _col0, _col2, _col4
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
5.3.3 过滤条件写在 ON, 且是 保留表的字段结果分析
  • 结论
不可以谓词下推, 会选择 c1(day = 20221215)和 c2 表进行 Outer JOIN 操作,c1 表会保留所有数据,c2 表会保留和 c1(day = 20221215) JOIN 上的数据
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215
31420221214NULLNULLNULL
NULLNULLNULL330020221214

5.4 过滤条件写在 ON, 且是 非保留表的字段

5.4.1 过滤条件写在 ON, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day = '20221215')
;
5.4.2 过滤条件写在 ON, 且是 非保留表的字段 SQL 执行计划分析
Reducer 2
    Reduce Operator Tree:
      JOIN Operator
        condition map:
             Outer JOIN 0 to 1
        filter predicates:
          0
          1 {(VALUE._col0 = 20221215)}
        keys:
          0 _col0 (type: bigint)
          1 _col0 (type: bigint)
        outputColumnNames: _col0, _col1, _col4
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
5.4.3 过滤条件写在 ON, 且是 非保留表的字段结果分析
  • 结论
不可以谓词下推,会选择 c2(day = 20221215)和 c1 表进行 Outer JOIN 操作,c2 表会保留所有数据,c1 表会保留和 c2(day = 20221215) JOIN 上的数据
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215
31420221214NULLNULLNULL
NULLNULLNULL330020221214

5.5 FULL OUTER JOIN 谓词下推实例分析总结

  1. 写在 ON 后面,还是 WHERE 后面,都不会谓词下推

6. INNER JOIN 谓词下推实例分析

6.1 过滤条件写在 WHERE, 且是 保留表的字段

6.1.1 过滤条件写在 WHERE, 且是 保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
6.1.2 过滤条件写在 WHERE, 且是 保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c1
      filterExpr: ((day = 20221215) AND order_no is not null) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: ((day = 20221215) AND order_no is not null) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE

Map Operator Tree:
    TableScan
      alias: c2
      filterExpr: order_id is not null (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: order_id is not null (type: boolean)
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: order_id (type: bigint), cost (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
          Map JOIN Operator
            condition map:
                 INNER JOIN 0 to 1
            keys:
              0 _col0 (type: bigint)
              1 _col0 (type: bigint)
            outputColumnNames: _col0, _col2, _col4
            input vertices:
              0 Map 1
            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
6.1.3 过滤条件写在 WHERE, 且是 保留表的字段结果分析
  • 结论
可以谓词下推, 对 c1 表先进行 day = 20221215 条件筛选,然后将 c1 表和 c2 表进行 INNER JOIN 操作
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

6.2 过滤条件写在 WHERE, 且是 非保留表的字段

6.2.1 过滤条件写在 WHERE, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c2.day='20221215'
;
6.2.2 过滤条件写在 WHERE, 且是 非保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c2
      filterExpr: ((day = 20221215) AND order_id is not null) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: ((day = 20221215) AND order_id is not null) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE

Map Operator Tree:
    TableScan
      alias: c1
      filterExpr: order_no is not null (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: order_no is not null (type: boolean)
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: order_no (type: bigint), seller_id (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
          Map JOIN Operator
            condition map:
                 INNER JOIN 0 to 1
            keys:
              0 _col0 (type: bigint)
              1 _col0 (type: bigint)
            outputColumnNames: _col0, _col1, _col4
            input vertices:
              1 Map 2
            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
6.2.3 过滤条件写在 WHERE, 且是 非保留表的字段结果分析
  • 结论
可以谓词下推,首先对 c2 表先进行 day = 20221215 筛选,然后将 c1 表和 c2 表进行 JOIN 操作
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

6.3 过滤条件写在 ON, 且是 保留表的字段

6.3.1 过滤条件写在 ON, 且是 保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c1.day='20221215')
;
6.3.2 过滤条件写在 ON, 且是 保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c1
      filterExpr: ((day = 20221215) AND order_no is not null) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: ((day = 20221215) AND order_no is not null) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE

Map Operator Tree:
    TableScan
      alias: c2
      filterExpr: order_id is not null (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: order_id is not null (type: boolean)
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: order_id (type: bigint), cost (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
          Map JOIN Operator
            condition map:
                 INNER JOIN 0 to 1
            keys:
              0 _col0 (type: bigint)
              1 _col0 (type: bigint)
            outputColumnNames: _col0, _col2, _col4
            input vertices:
              0 Map 1
            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
6.3.3 过滤条件写在 ON, 且是 保留表的字段结果分析
  • 结论
可以谓词下推,首先对 c1 表进行 day = 20221215 筛选,然后将 c1 表和 c2 表进行 INNER JOIN 操作
  • 运行结果
c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
11220221215110020221215
21320221215220020221215

6.4 过滤条件写在 ON, 且是 非保留表的字段

6.4.1 过滤条件写在 ON, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day = '20221215')
;
6.4.2 过滤条件写在 ON, 且是 非保留表的字段 SQL 执行计划分析
Map Operator Tree:
    TableScan
      alias: c2
      filterExpr: ((day = 20221215) AND order_id is not null) (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Filter Operator
        predicate: ((day = 20221215) AND order_id is not null) (type: boolean)
        Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE

TableScan
    alias: c1
    filterExpr: order_no is not null (type: boolean)
    Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
    Filter Operator
      predicate: order_no is not null (type: boolean)
      Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
      Select Operator
        expressions: order_no (type: bigint), seller_id (type: bigint)
        outputColumnNames: _col0, _col1
        Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
        Map JOIN Operator
          condition map:
               INNER JOIN 0 to 1
          keys:
            0 _col0 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          input vertices:
            1 Map 2
          Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
6.4.3 过滤条件写在 ON, 且是 非保留表的字段结果分析
  • 结论
可以谓词下推,首先对 c2 表进行 day = 20221215 筛选,然后将 c1 表和 c2 表进行 INNER JOIN 操作
  • 运行结果
    c1.order_noc1.seller_idc1.dayc2.order_idc2.costc2.day
    11220221215110020221215
    21320221215220020221215

6.5 INNER JOIN 谓词下推实例分析总结

INNER JOIN 不管写在 ON 后面 还是 WHERE 后面,都会进行谓词下推

7. HIVE SQL JOIN 经常使用总结

通过谓词下推的优化,我个人平时在写 sql 的时候对不同的 join 方式常使用不同的写法

  1. LEFT OUTER JOIN & RIGHT OUTER JOIN
    • 使用非子查询的方式,将保留表的条件写到 JOIN 后谓词(WHERE)后,将非保留表的条件写到 JOIN 谓词(ON)后
  2. FULL OUTER JOIN
    • 使用子查询的方式,将保留表和非保留表条件都写到子查询中,先进行过滤,再进行 FULL JOIN 操作
  3. INNER JOIN
    • 使用非子查询的方式,将保留表的条件写到 JOIN 后谓词(WHERE)后,将非保留表的条件写到 JOIN 谓词(ON)后

7.1 LEFT OUTER JOIN & RIGHT OUTER JOIN

  • 如果我们想要筛选保留表(ymtods.orders) day = ‘20221215’ 和 非保留表(ymtods.order_supplys)day = ‘20221215’ 有俩种常用方式,效果和执行计划一致都会谓词下推
  1. 非子查询方式
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day='20221215')
WHERE c1.day='20221215'
;
  1. 子查询方式
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM
(
    SELECT
        order_no
        ,seller_id
        ,day
    FROM ymtods.orders
    WHERE day = '20221215'
) c1
LEFT JOIN
(
    SELECT
        order_id
        ,cost
        ,day
    FROM ymtods.order_supplys
    WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
;

7.2 FULL OUTER JOIN

  • 如果我们想要筛选保留表(ymtods.orders) day = ‘20221215’ 和 非保留表所有数据进行 FULL JOIN 操作 建议使用子查询方式,非子查询方式可能会出现转换成 LEFT JOIN 操作(参看 FULL JOIN 谓词下推分析章节),得不到 FULL JOIN 效果
  1. 子查询方式(正确)
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM
(
    SELECT
        order_no
        ,seller_id
        ,day
    FROM ymtods.orders
    WHERE day = '20221215'
) c1
FULL JOIN
(
    SELECT
        order_id
        ,cost
        ,day
    FROM ymtods.order_supplys
) c2
ON (c1.order_no = c2.order_id)
;
  1. 非子查询操作(错误),需要着重注意
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;

7.3 INNER OUTER JOIN

  • 如果我们想要筛选保留表(ymtods.orders) day = ‘20221215’ 和 非保留表(ymtods.order_supplys)day = ‘20221215’ 的 INNER JOIN 操作,无论条件写到 ON/WHERE 都可以谓词下推,以下为例
  1. 非子查询方式
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day='20221215')
WHERE c1.day='20221215'
;
  1. 子查询方式
SELECT
    c1.order_no AS order_id
    ,c1.seller_id
    ,c1.day
    ,c2.order_id
    ,c2.cost
    ,c2.day
FROM
(
    SELECT
        order_no
        ,seller_id
        ,day
    FROM ymtods.orders
    WHERE day = '20221215'
) c1
JOIN
(
    SELECT
        order_id
        ,cost
        ,day
    FROM ymtods.order_supplys
    WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
;
  • 3
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值