Spark SQL supports all the basic join operations available in traditional SQL. Joins in Spark Core can have serious performance issues when not designed with care, because they involve shuffling data across the network. Spark SQL joins, on the other hand, come with more optimization by default (thanks to DataFrames and Datasets), though there are still performance considerations to keep in mind when using them.
(This points to the key reason a data shuffle hurts performance: a shuffle exchanges data across the network.)
In this join tutorial, you will learn the different join syntaxes and how to use different join types on two or more DataFrames and Datasets, using Scala examples.
- Join Syntax
- Join Types
- Self Join
- Using spark.sql
- Using joinType objects
- Join multiple Tables
- Source Code
1. Spark SQL Join Syntax
1) join(right: Dataset[_]): DataFrame
2) join(right: Dataset[_], usingColumn: String): DataFrame
// the shared join column is kept only once in the result
3) join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
4) join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
5) join(right: Dataset[_], joinExprs: Column): DataFrame
6) join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
// with a join expression (5 and 6), the join columns from both DataFrames are kept
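The column-retention behavior noted in the comments above can be checked with a minimal, self-contained sketch (the df1/df2 DataFrames and their columns here are hypothetical, invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]")
  .appName("join-syntax-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 10), (2, 20)).toDF("id", "dept_id")
val df2 = Seq((10, "Finance"), (20, "IT")).toDF("dept_id", "dept_name")

// Syntax 3: the shared column "dept_id" appears only once in the result
df1.join(df2, Seq("dept_id")).printSchema()

// Syntax 5: both "dept_id" columns are kept (disambiguate via df1(...)/df2(...))
df1.join(df2, df1("dept_id") === df2("dept_id")).printSchema()
```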
Alternatively, you can also write the join expression using the where() and filter() methods on a DataFrame, and you can join on multiple columns.
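As a sketch (assuming the empDF and deptDF DataFrames created in the next section), the same inner join can be written with where() or filter(); Spark's optimizer is generally able to push the equality condition into the join:

```scala
// where()/filter() form of the join condition; empDF and deptDF are the
// DataFrames built later in this tutorial
empDF.join(deptDF).where(empDF("emp_dept_id") === deptDF("dept_id")).show(false)
empDF.join(deptDF).filter(empDF("emp_dept_id") === deptDF("dept_id")).show(false)

// joining on multiple columns: chain conditions with &&
// (the c1/c2 column names below are hypothetical)
// empDF.join(deptDF, empDF("c1") === deptDF("c1") && empDF("c2") === deptDF("c2"))
```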
2. Spark Join Types
| JoinType | Join String | Equivalent SQL Join |
|---|---|---|
| Inner.sql | inner | INNER JOIN |
| FullOuter.sql | outer, full, fullouter, full_outer | FULL OUTER JOIN |
| LeftOuter.sql | left, leftouter, left_outer | LEFT JOIN |
| RightOuter.sql | right, rightouter, right_outer | RIGHT JOIN |
| Cross.sql | cross | |
| LeftAnti.sql | anti, leftanti, left_anti | |
| LeftSemi.sql | semi, leftsemi, left_semi | |
Note: cross, leftanti, and leftsemi are all available starting with Spark 2.0.
For syntaxes 4 and 6, you can use either a JoinType object or a "Join String" from the table above for the joinType string argument. When you use JoinType, you should import org.apache.spark.sql.catalyst.plans._, as this package defines the JoinType objects.
The rest of the tutorial explains Join Types using syntax 6.
Before we jump into the Spark SQL join examples, first, let's create emp and dept DataFrames. Here, column emp_id is unique in the emp dataset, dept_id is unique in the dept dataset, and emp_dept_id in emp references dept_id in dept.
// note this way of constructing a DataFrame
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
This prints the "emp" and "dept" DataFrames to the console.
Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|20 |
|Sales |30 |
|IT |40 |
+---------+-------+
2.1 Inner Join
Inner join is the default join in Spark and the most commonly used; it joins two datasets on key columns, and rows whose keys don't match are dropped from both datasets (emp and dept).
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)
2.2 Outer, Full, Fullouter Join
Outer, a.k.a. full or fullouter, join returns all rows from both datasets; where the join expression doesn't match, it returns null in the respective record's columns.
// the following three are equivalent
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer").show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full").show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter").show(false)
From our "emp" dataset, "emp_dept_id" with value 50 doesn't have a record in "dept", hence the dept columns are null, and "dept_id" 30 doesn't have a record in "emp", hence you see nulls in the emp columns. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|6 |Brown |2 |2010 |50 | |-1 |null |null |
|null |null |null |null |null |null |null |Sales |30 |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
2.3 Left, Leftouter Join
Left, a.k.a. leftouter, join returns all rows from the left dataset regardless of whether a match is found on the right dataset; when the join expression doesn't match, it assigns null for that record, and it drops records from the right where no match is found.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter")
.show(false)
From our dataset, "emp_dept_id" 50 doesn't have a record in the "dept" dataset, hence this record contains null in the "dept" columns (dept_name & dept_id), and "dept_id" 30 from the "dept" dataset is dropped from the results. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |
|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |
|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |
|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |
|5 |Brown |2 |2010 |40 | |-1 |IT |40 |
|6 |Brown |2 |2010 |50 | |-1 |null |null |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
2.4 Right, Rightouter Join
Right, a.k.a. rightouter, join is the opposite of the left join: it returns all rows from the right dataset regardless of whether a match is found on the left dataset; when the join expression doesn't match, it assigns null for that record, and it drops records from the left where no match is found.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter")
.show(false)
From our example, the right dataset's "dept_id" 30 doesn't have a match in the left dataset "emp", hence this record contains null in the "emp" columns, and "emp_dept_id" 50 is dropped as no match is found on the left.
2.5 Left Semi Join (worth special attention)
leftsemi join is similar to an inner join, the difference being that a leftsemi join returns all columns from the left dataset and ignores all columns from the right dataset. In other words, this join returns columns from only the left dataset for the records that match the right dataset on the join expression; records that don't match the join expression are ignored from both the left and right datasets.
That is, only the left table's columns are kept, and the row count equals that of an inner join (left-table rows that would have nulls on the right are dropped). A semi join can be replaced by an inner join plus a select of all the left table's columns.
The same result can be achieved using a select on the result of an inner join; however, using this join type is more efficient.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
.show(false)
Below is the result of the above join expression.
leftsemi join
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
Note: compare this with the left join result in 2.3: only the left table's columns appear, and the 6th row has been dropped.
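For comparison, here is a sketch of the inner-join-plus-select rewrite mentioned above. It produces the same rows in this example because dept_id is unique in deptDF; with duplicate keys on the right, the inner join could duplicate left rows where leftsemi would not:

```scala
// leftsemi expressed as inner join + select of only the left columns
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .select(empDF.columns.map(empDF(_)): _*)
  .show(false)
```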
2.6 Left Anti Join (useful for computing a set difference)
leftanti join does the exact opposite of leftsemi: a leftanti join returns only columns from the left dataset, and only for the non-matched records.
It is effectively the difference between the left and right tables on the join key (rows that are in the left table but not in the right).
Note: available starting with Spark 2.0.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"), "leftanti")
.show(false)
Yields below output
+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6 |Brown|2 |2010 |50 | |-1 |
+------+-----+---------------+-----------+-----------+------+------+
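The same set-difference idea works in the other direction; for example, to find departments that no employee belongs to, swap the sides (a small sketch using this tutorial's DataFrames):

```scala
// departments with no matching employee (yields the Sales row)
val noEmpDepts = deptDF.join(empDF,
    deptDF("dept_id") === empDF("emp_dept_id"), "leftanti")
noEmpDepts.show(false)
```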
3. Self Join
Joins are not complete without a self join. Though there is no self-join type available, we can use any of the join types explained above to join a DataFrame to itself. The example below uses an inner self join.
Although Spark doesn't provide a dedicated self-join type for us, we can implement one ourselves.
// note this usage
empDF.as("emp1").join(empDF.as("emp2"),
col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
.select(col("emp1.emp_id"),col("emp1.name"),
col("emp2.emp_id").as("superior_emp_id"),
col("emp2.name").as("superior_emp_name"))
.show(false)
Here, we are joining the emp dataset with itself to find the superior's emp_id and name for all employees.
+------+--------+---------------+-----------------+
|emp_id|name |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2 |Rose |1 |Smith |
|3 |Williams|1 |Smith |
|4 |Jones |2 |Rose |
|5 |Brown |2 |Rose |
|6 |Brown |2 |Rose |
+------+--------+---------------+-----------------+
4. Spark SQL Join
Since Spark SQL supports native SQL syntax, we can also write join operations after creating temporary tables on the DataFrames, using spark.sql().
This is the plain approach: it doesn't use any DataFrame methods and just takes the familiar SQL route.
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
//SQL JOIN
val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(false)
val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(false)
5. Using joinType objects
All join objects are defined in the joinTypes class. In order to use them, you need to import org.apache.spark.sql.catalyst.plans.{LeftOuter,Inner,....}.
Note that usage such as Inner.sql is only available starting with Spark 2.0.
println("LeftOuter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),LeftOuter.sql)
.show(false)
println("Inner join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),Inner.sql)
.show(false)
6. Spark Join on multiple DataFrame’s
When you need to join more than two tables, you either use a SQL expression after creating temporary views on the DataFrames, or use the result of one join operation to join with another DataFrame, chaining them. For example:
df1.join(df2, df1("id1") === df2("id2"), "inner")
   .join(df3, df1("id1") === df3("id3"), "inner")
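The SQL-expression alternative mentioned above can be sketched like this (df1, df2, df3 and the id columns are hypothetical, following the snippet above):

```scala
// hypothetical three-DataFrame join via temporary views and spark.sql()
df1.createOrReplaceTempView("T1")
df2.createOrReplaceTempView("T2")
df3.createOrReplaceTempView("T3")

spark.sql(
  """SELECT *
    |FROM T1
    |JOIN T2 ON T1.id1 = T2.id2
    |JOIN T3 ON T1.id1 = T3.id3""".stripMargin).show(false)
```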
For more details, I would recommend reading How to Join multiple tables.
7. Source Code of Spark SQL Join Example
package com.sparkbyexamples.spark.dataframe.join
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object JoinExample extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined","emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)
println("Inner join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
println("Outer join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer")
.show(false)
println("full join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full")
.show(false)
println("fullouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter")
.show(false)
println("right join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right")
.show(false)
println("rightouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter")
.show(false)
println("left join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left")
.show(false)
println("leftouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter")
.show(false)
println("leftanti join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftanti")
.show(false)
println("leftsemi join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
.show(false)
println("cross join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"cross")
.show(false)
println("Using crossJoin()")
empDF.crossJoin(deptDF).show(false)
println("self join")
empDF.as("emp1").join(empDF.as("emp2"),
col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
.select(col("emp1.emp_id"),col("emp1.name"),
col("emp2.emp_id").as("superior_emp_id"),
col("emp2.name").as("superior_emp_name"))
.show(false)
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
//SQL JOIN
val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(false)
val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(false)
}