1.1 平常我们写 SQL 会筛选条件写在 WHERE 条件后边还是 ON 条件后边会产生不同的结果么?
我们遇到的 JOIN SQL 或者是我们写 SQL JOIN 的时候有时候会不会有以下的疑惑?JOIN 中俩个表的筛选条件应该写在 ON 后边还是 WHERE 后边,还是使用子查询先进行条件筛选是一个比较优的一个方式,看下以下 5 段 SQL,他们是否会得到相同的结果,平常我们使用哪种方式来写 SQL
先看下以下三段 SQL,他们的执行计划和执行结果都是一样的,都会先筛选数据,然后进行 left join 操作,为什么会有这样的结果?原因是来源于 HIVE 自身谓词下推的优化
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day='20221215')
WHERE c1.day='20221215'
;
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
LEFT JOIN
(
SELECT
order_id
,cost
,day
FROM ymtods.order_supplys
WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM
(
SELECT
order_no
,seller_id
,day
FROM ymtods.orders
WHERE day = '20221215'
) c1
LEFT JOIN
(
SELECT
order_id
,cost
,day
FROM ymtods.order_supplys
WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
;
create table if not exists ymtods.orders(
order_no bigint comment 'order_no'
,day bigint comment 'day'
,seller_id bigint comment 'seller_id'
)
COMMENT '测试orders表'
STORED AS PARQUET;
2.1.2 测试表(ymtods.orders)数据插入
insert overwrite table ymtods.orders
select 1 as order_no,20221215 as day,12 as seller_id
union all
select 2 as order_no,20221215 as day,13 as seller_id
union all
select 3 as order_no,20221214 as day,14 as seller_id
;
2.1.3 测试表(ymtods.orders)数据
order_no
day
seller_id
1
20221215
12
2
20221215
13
3
20221214
14
2.2 测试表(ymtods.order_supply)数据结构和数据
2.2.1 测试表(ymtods.orders)数据结构
create table if not exists ymtods.order_supplys(
order_id bigint comment 'order_id'
,day bigint comment 'day'
,cost bigint comment 'cost'
)
COMMENT '测试order_supplys表'
STORED AS PARQUET;
2.2.2 测试表(ymtods.orders)数据插入
insert overwrite table ymtods.order_supplys
select 1 as order_id,20221215 as day,100 as cost
union all
select 2 as order_id,20221215 as day,200 as cost
union all
select 3 as order_id,20221214 as day,300 as cost
;
2.2.3 测试表(ymtods.orders)数据
order_id
day
cost
1
20221215
100
2
20221215
200
3
20221214
300
3. SQL JOIN 中表的分类以及谓词的含义
3.1 保留表定义
Preserved Row table:保留表
a LEFT (outer) JOIN b 中的 a 表;
a right (outer) JOIN b 中的 b 表;
a FULL OUTER JOIN b a 和 b 表都是 保留表
a LEFT (outer) JOIN b 中的 b 表;
a right (outer) JOIN b 中的 a 表,
a FULL OUTER JOIN b a 和 b 表都是 null值保留表
3.3 JOIN 中谓词定义
During JOIN predicate:JOIN 中谓词,就是 ON 后面的条件。
a JOIN b (on a.id = b.id AND a.id = 5) –> a.id = 5就是JOIN中谓词
3.4 JOIN 后谓词定义
After JOIN predicate:JOIN 后谓词,就是 WHERE 后面的条件。
a LEFT JOIN b WHERE a.t=1 –> a.t=1 就是JOIN后谓词
4. LEFT OUTER JOIN & RIGHT OUTER JOIN 谓词下推实例分析
4.1 过滤条件写在 WHERE, 且是 保留表的字段
4.1.1 过滤条件写在 WHERE, 且是 保留表的字段 SQL
EXPLAIN
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
可以谓词下推,首先对 c1 表按条件 day = 20221215 过滤,然后再进行 LEFT JOIN 操作
运行结果
c1.order_no
c1.seller_id
c1.day
c2.order_id
c2.cost
c2.day
1
12
20221215
1
100
20221215
2
13
20221215
2
200
20221215
4.2 过滤条件写在 WHERE, 且是 非保留表的字段
4.2.1 过滤条件写在 WHERE, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c2.day='20221215'
;
4.2.2 过滤条件写在 WHERE, 且是 非保留表的字段 SQL 执行计划分析
Map Operator Tree:
TableScan
alias: c2
filterExpr: ((day = 20221215) AND order_id is not null) (type: boolean)
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((day = 20221215) AND order_id is not null) (type: boolean)
Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
Map Operator Tree:
TableScan
alias: c1
filterExpr: order_no is not null (type: boolean)
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: order_no is not null (type: boolean)
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: order_no (type: bigint), seller_id (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
Map JOIN Operator
condition map:
INNER JOIN 0 to 1
keys:
0 _col0 (type: bigint)
1 _col0 (type: bigint)
outputColumnNames: _col0, _col1, _col4
input vertices:
1 Map 2
Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
4.2.3 过滤条件写在 WHERE, 且是 非保留表的字段结果分析
结论
可以谓词下推,首先对 c1 表按条件 day = 20221215 过滤,然后再进行 INNER JOIN 操作
运行结果
c1.order_no
c1.seller_id
c1.day
c2.order_id
c2.cost
c2.day
1
12
20221215
1
100
20221215
2
13
20221215
2
200
20221215
4.3 过滤条件写在 ON, 且是保留表的字段
4.3.1 过滤条件写在 ON, 且是保留表的字段 SQL
EXPLAIN
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c1.day='20221215')
;
可以谓词下推,首先对 c2 表进行 day = '20221215'条件筛选,然后将 c1 和 c2 表进行 LEFT JOIN 操作
运行结果
c1.order_no
c1.seller_id
c1.day
c2.order_id
c2.cost
c2.day
1
12
20221215
1
100
20221215
2
13
20221215
2
200
20221215
3
14
20221214
NULL
NULL
NULL
4.5 LEFT OUTER JOIN & RIGHT OUTER JOIN 谓词下推实例分析总结
ON/WHERE
保留字段(LEFT 左表)
非保留字段(LEFT 右表)
ON
不可以
可以
WHERE
可以
不可以
1. 对于 LEFT OUTER JOIN ,右侧的表筛选条件写在 ON 后面、左侧的表写在 WHERE 后面,性能上有提高;
2. 对于 RIGHT OUTER JOIN,左侧的表筛选条件写在 ON 后面、右侧的表写在 WHERE 后面,性能上有提高;
5. FULL OUTER JOIN 谓词下推实例分析
5.1 过滤条件写在 WHERE, 且是 保留表的字段
5.1.1 过滤条件写在 WHERE, 且是 保留表的字段 SQL
EXPLAIN
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
可以谓词下推, 会对 c1 表先进行 day = 20221215 筛选,然后进行 JOIN 操作,但是这里有个注意点,这里转换成了 LEFT JOIN 并不是 FULL JOIN
运行结果
c1.order_no
c1.seller_id
c1.day
c2.order_id
c2.cost
c2.day
1
12
20221215
1
100
20221215
2
13
20221215
2
200
20221215
5.2 过滤条件写在 WHERE, 且是 非保留表的字段
5.2.1 过滤条件写在 WHERE, 且是 非保留表的字段 SQL
EXPLAIN
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c2.day='20221215'
;
可以谓词下推,会对 c2 表先进行 day = 20221215 筛选,然后进行 JOIN 操作,但是这里有个注意点,这里转换成了 c1 right JOIN c2 并不是 FULL JOIN
运行结果
c1.order_no
c1.seller_id
c1.day
c2.order_id
c2.cost
c2.day
1
12
20221215
1
100
20221215
2
13
20221215
2
200
20221215
5.3 过滤条件写在 ON, 且是 保留表的字段
5.3.1 过滤条件写在 ON, 且是 保留表的字段 SQL
EXPLAIN
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c1.day='20221215')
;
如果我们想要筛选保留表(ymtods.orders) day = ‘20221215’ 和 非保留表(ymtods.order_supplys)day = ‘20221215’ 有俩种常用方式,效果和执行计划一致都会谓词下推
非子查询方式
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
LEFT JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day='20221215')
WHERE c1.day='20221215'
;
子查询方式
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM
(
SELECT
order_no
,seller_id
,day
FROM ymtods.orders
WHERE day = '20221215'
) c1
LEFT JOIN
(
SELECT
order_id
,cost
,day
FROM ymtods.order_supplys
WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
;
7.2 FULL OUTER JOIN
如果我们想要筛选保留表(ymtods.orders) day = ‘20221215’ 和 非保留表所有数据进行 FULL JOIN 操作 建议使用子查询方式,非子查询方式可能会出现转换成 LEFT JOIN 操作(参看 FULL JOIN 谓词下推分析章节),得不到 FULL JOIN 效果
子查询方式(正确)
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM
(
SELECT
order_no
,seller_id
,day
FROM ymtods.orders
WHERE day = '20221215'
) c1
FULL JOIN
(
SELECT
order_id
,cost
,day
FROM ymtods.order_supplys
) c2
ON (c1.order_no = c2.order_id)
;
非子查询操作(错误),需要着重注意
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id)
WHERE c1.day='20221215'
;
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM ymtods.orders c1
FULL JOIN ymtods.order_supplys c2
ON (c1.order_no = c2.order_id AND c2.day='20221215')
WHERE c1.day='20221215'
;
子查询方式
SELECT
c1.order_no AS order_id
,c1.seller_id
,c1.day
,c2.order_id
,c2.cost
,c2.day
FROM
(
SELECT
order_no
,seller_id
,day
FROM ymtods.orders
WHERE day = '20221215'
) c1
JOIN
(
SELECT
order_id
,cost
,day
FROM ymtods.order_supplys
WHERE day = '20221215'
) c2
ON (c1.order_no = c2.order_id)
;