Shuffle Hash Join
启用 Shuffle Hash Join 必须满足以下几个条件:
- 仅支持等值 Join,不要求参与 Join 的 Keys 可排序;
- spark.sql.join.preferSortMergeJoin 参数必须设置为 false,参数是从 Spark 2.0.0 版本引入的,默认值为 true,也就是默认情况下选择 Sort Merge Join;
- 小表的大小(plan.stats.sizeInBytes)必须小于 spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions;而且小表大小(stats.sizeInBytes)的三倍必须小于等于大表的大小(stats.sizeInBytes),也就是 a.stats.sizeInBytes * 3 < = b.stats.sizeInBytes
常见 Join 类型
亦可参考: https://mazhuang.org/2017/09/11/joins-in-sql/
inner join
select * from studentone_message a
INNER JOIN studenttwo_message b
on a.id=b.id
隐形内连接:where连接
有显性内连接,当然也会有隐形内连接。一般的where 连接多个表也属于内连接,在数据库中被称为隐性内链接。如下:
select * from studentone_message a,studenttwo_message b
WHERE a.id=b.id
left join
LEFT JOIN 一般被译作左连接,也写作 LEFT OUTER JOIN。左连接查询会返回左表(表 A)中所有记录,不管右表(表 B)中有没有关联的数据。在右表中找到的关联数据列也会被一起返回。
文氏图:
SELECT A.PK AS A_PK, B.PK AS B_PK,
A.Value AS A_Value, B.Value AS B_Value
FROM Table_A A
LEFT JOIN Table_B B
ON A.PK = B.PK;
FULL JOIN
select * from message1 a
FULL JOIN message2 b
on a.id=b.id
Join 的几种类型
参考链接:
https://qileq.com/tech/spark/sql/join/
Spark 的 HashJoin.scala 代码知,目前 Spark 支持如下几种类型的 join:
- Inner join
内连接,语法:relation [ INNER ] JOIN relation [ join_criteria ]。
- Cross join
笛卡尔连接,语法:relation CROSS JOIN relation [ join_criteria ]。
- Left (outer) join
左(外)连接,语法:relation LEFT [ OUTER ] JOIN relation [ join_criteria ]。
- Right (outer) join
右(外)连接,语法:relation RIGHT [ OUTER ] JOIN relation [ join_criteria ]。
- Full (outer) join
全(外)连接,语法:relation FULL [ OUTER ] JOIN relation [ join_criteria ]。
- (Left) Semi join
(左)半连接,语法:relation [ LEFT ] SEMI JOIN relation [ join_criteria ]。
- (Left) Anti join
(左)反连接,语法:relation [ LEFT ] ANTI JOIN relation [ join_criteria ]。
Join 的示例
-- 源表 department
> SELECT * FROM department;
deptno deptname
1 Marketing
2 Sales
3 Engineering
-- 源表 employee
> SELECT * FROM employee;
id name deptno
1 Amy 1
2 Lisa 2
3 John 2
4 Paul 4
-- Inner join
> SELECT id, name, deptname FROM employee INNER JOIN department ON employee.deptno = department.deptno;
id name deptname
1 Amy Marketing
2 Lisa Sales
3 John Sales
-- Left outer join
> SELECT id, name, deptname FROM employee LEFT JOIN department ON employee.deptno = department.deptno;
id name deptname
1 Amy Marketing
2 Lisa Sales
3 John Sales
4 Paul NULL
-- Right outer join
> SELECT id, name, deptname FROM employee RIGHT JOIN department ON employee.deptno = department.deptno;
id name deptname
1 Amy Marketing
3 John Sales
2 Lisa Sales
NULL NULL Engineering
-- Full outer join
> SELECT id, name, deptname FROM employee FULL JOIN department ON employee.deptno = department.deptno;
id name deptname
1 Amy Marketing
2 Lisa Sales
3 John Sales
NULL NULL Engineering
4 Paul NULL
-- Cross join
> SELECT id, name, deptname FROM employee CROSS JOIN department ON employee.deptno = department.deptno;
-- or implicit cross join:
-- SELECT employee.id AS id, name, deptname FROM employee, department;
id name deptname
1 Amy Marketing
1 Amy Sales
1 Amy Engineering
2 Lisa Marketing
2 Lisa Sales
2 Lisa Engineering
3 John Marketing
3 John Sales
3 John Engineering
4 Paul Marketing
4 Paul Sales
4 Paul Engineering
-- Cross join + on
> SELECT id, name, deptname FROM employee CROSS JOIN department ON employee.deptno = department.deptno;
id name deptname
1 Amy Marketing
2 Lisa Sales
3 John Sales
-- Semi join
> SELECT id, name, deptno FROM employee SEMI JOIN department ON employee.deptno = department.deptno;
id name deptno
1 Amy 1
2 Lisa 2
3 John 2
-- Anti join
> SELECT id, name, deptno FROM employee ANTI JOIN department ON employee.deptno = department.deptno;
id name deptno
4 Paul 4