[Spark] Spark SQL Join Types with examples

Spark SQL supports all of the basic join operations available in traditional SQL. Spark Core (RDD) joins can have severe performance issues when not designed with care, because they shuffle data across the network; Spark SQL joins, on the other hand, come with more optimizations by default (thanks to DataFrames & Datasets), though there are still performance considerations to keep in mind.

(This states the key reason shuffling hurts performance: a shuffle requires exchanging data across the network.)
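As an aside, a common mitigation when one side of the join is small is a broadcast hint, which ships the small dataset to every executor instead of shuffling the large one. A minimal hedged sketch (empDF and deptDF are the DataFrames built later in this tutorial):

import org.apache.spark.sql.functions.broadcast

// Broadcast the small deptDF so the large empDF does not need to be shuffled.
empDF.join(broadcast(deptDF), empDF("emp_dept_id") === deptDF("dept_id")).show(false)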

In this join tutorial, you will learn the different join syntaxes and how to use the different join types on two or more DataFrames and Datasets, using Scala examples.

  • Join Syntax
  • Join Types
  • Self Join
  • Using spark.sql
  • Using joinType objects
  • Join multiple Tables
  • Source Code

1. Spark SQL Join Syntax

1) join(right: Dataset[_]): DataFrame
2) join(right: Dataset[_], usingColumn: String): DataFrame
// only one copy of the join column is kept in the result (see the sketch below)
3) join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
4) join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
// with the joinExprs variants below, the join columns from both tables are kept
5) join(right: Dataset[_], joinExprs: Column): DataFrame
6) join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
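For instance, a hypothetical sketch of the column-deduplication difference noted in the comments above (df1/df2 and their columns are made up for illustration, and this assumes the implicits import shown later in the tutorial):

val df1 = Seq((1, "a")).toDF("id", "v1")
val df2 = Seq((1, "b")).toDF("id", "v2")
df1.join(df2, "id").printSchema()                    // id, v1, v2  -- one "id" kept
df1.join(df2, df1("id") === df2("id")).printSchema() // id, v1, id, v2 -- both kept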

Alternatively, you can express the join condition using the where() and filter() methods on the DataFrame, and you can join on multiple columns, as sketched below.
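A hedged sketch of the filter-style form, using the emp/dept DataFrames created below (Spark's optimizer folds the filter back into the join); a multi-column join simply combines such expressions with &&:

// Same inner join, written as an unconditioned join plus where():
empDF.join(deptDF)
  .where(empDF("emp_dept_id") === deptDF("dept_id"))
  .show(false)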

2. Spark Join Types

JoinType       | Join String                        | Equivalent SQL Join
---------------|------------------------------------|--------------------
Inner.sql      | inner                              | INNER JOIN
FullOuter.sql  | outer, full, fullouter, full_outer | FULL OUTER JOIN
LeftOuter.sql  | left, leftouter, left_outer        | LEFT JOIN
RightOuter.sql | right, rightouter, right_outer     | RIGHT JOIN
Cross.sql      | cross                              |
LeftAnti.sql   | anti, leftanti, left_anti          |
LeftSemi.sql   | semi, leftsemi, left_semi          |

Note: cross, leftanti, and leftsemi are all available starting with Spark 2.0.

For syntaxes 4 & 6 you can use either a JoinType or a "Join String" from the table above for the joinType string argument. When you use JoinType, you should import org.apache.spark.sql.catalyst.plans._, as this package defines the JoinType objects.

The rest of the tutorial explains Join Types using syntax 6.

Before we jump into the Spark SQL join examples, let's first create emp and dept DataFrames. Here, the column emp_id is unique in the emp dataset, dept_id is unique in the dept dataset, and emp_dept_id in emp references dept_id in dept.

// note this pattern for constructing a DataFrame from a Seq
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
  (2,"Rose",1,"2010","20","M",4000),
  (3,"Williams",1,"2010","10","M",1000),
  (4,"Jones",2,"2005","10","F",2000),
  (5,"Brown",2,"2010","40","",-1),
  (6,"Brown",2,"2010","50","",-1)
)

val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
  "emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)

val dept = Seq(("Finance",10),
  ("Marketing",20),
  ("Sales",30),
  ("IT",40)
)

val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)

This prints the "emp" and "dept" DataFrames to the console.

Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

Dept Dataset
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

2.1 Inner Join

Inner join is the default and most commonly used join in Spark. It joins two datasets on key columns; rows whose keys don't match are dropped from both datasets (emp & dept).

empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)

2.2 Outer, Full, Fullouter Join

Outer, a.k.a. full / fullouter, join returns all rows from both datasets; where the join expression doesn't match, it returns null in the respective record's columns.

// the following three are equivalent
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer").show(false)
empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"full").show(false)
empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"fullouter").show(false)

In our "emp" dataset, "emp_dept_id" with value 50 doesn't have a matching record in "dept", so the dept columns are null; and "dept_id" 30 doesn't have a matching record in "emp", so you see nulls in the emp columns. Below is the result of the above join expression.

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
|null  |null    |null           |null       |null       |null  |null  |Sales    |30     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

2.3 Left, Leftouter Join

Left, a.k.a. leftouter, join returns all rows from the left dataset regardless of whether a match is found on the right dataset; when the join expression doesn't match, it assigns null to the right-side columns for that record, and it drops records from the right where no match is found.

empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"left")
	.show(false)
empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"leftouter")
	.show(false)

From our dataset, "emp_dept_id" 50 doesn't have a record in the "dept" dataset, so this record contains null in the "dept" columns (dept_name & dept_id), and "dept_id" 30 from the "dept" dataset is dropped from the results. Below is the result of the above join expression.

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

2.4 Right, Rightouter Join

Right, a.k.a. rightouter, join is the opposite of the left join: it returns all rows from the right dataset regardless of whether a match is found on the left dataset; when the join expression doesn't match, it assigns null to the left-side columns for that record, and it drops records from the left where no match is found.

empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"right")
	.show(false)
empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"rightouter")
	.show(false)

From our example, "dept_id" 30 in the right dataset has no match in the left dataset "emp", so that record contains null in the "emp" columns; and the row with "emp_dept_id" 50 is dropped since no match was found in "dept".
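To make the "opposite of left join" point concrete, the same rows can be produced by swapping the operands into a left join (only the column order differs):

// Right join rewritten as a left join with the datasets swapped:
deptDF.join(empDF, empDF("emp_dept_id") === deptDF("dept_id"), "left")
  .show(false)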

2.5 Left Semi Join (worth special attention)

leftsemi join is similar to inner join; the difference is that a leftsemi join returns all columns from the left dataset and ignores all columns from the right dataset. In other words, this join returns columns from only the left dataset for records that match the right dataset on the join expression; records that don't match the join expression are ignored from both the left and right datasets.

That is, only the left table's columns are kept, and the row count is the same as after an inner join (rows that exist in the left table but would have null right-side columns are dropped). A semi join can therefore be replaced by an inner join plus a select of all the left table's columns.

The same result can be achieved by using select on the result of an inner join (sketched at the end of this section); however, using this join type is more efficient.

  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"leftsemi")
    .show(false)

Below is the result of the above join expression.

leftsemi join
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

Note: compare this with the left join result in 2.3. Only the left table's columns are present, and the 6th row has been dropped.
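As promised above, a hedged sketch of the inner join + select equivalent. Note one subtlety: if dept_id were not unique, the inner join would duplicate matching emp rows, while leftsemi returns each matching emp row exactly once.

// Inner join, then keep only the left table's columns:
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .select(empDF.columns.map(empDF(_)): _*)
  .show(false)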

2.6 Left Anti Join (special: can be used to compute a set difference)

leftanti join does the exact opposite of leftsemi: it returns only columns from the left dataset, and only for the non-matched records.

This is effectively the set difference of the left and right tables (rows that are in the left table but, after the join, have no match in the right table).

Note: available since Spark 2.0.

empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"), "leftanti")
    .show(false)

Yields the output below.

+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6     |Brown|2              |2010       |50         |      |-1    |
+------+-----+---------------+-----------+-----------+------+------+
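As a hedged sketch of the set-difference view, the same result can be reproduced with a left join plus a null filter on the right-side key, though leftanti expresses it more directly:

// Left join, keep rows whose right-side key is null, drop the right columns:
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "left")
  .filter(deptDF("dept_id").isNull)
  .select(empDF.columns.map(empDF(_)): _*)
  .show(false)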

3. Self Join

Joins are not complete without a self join. Though there is no dedicated self-join type, we can use any of the join types explained above to join a DataFrame to itself. The example below uses an inner self join.

Although Spark does not provide a built-in self join for us, we can implement one ourselves:

// note the aliasing pattern used here
empDF.as("emp1").join(empDF.as("emp2"),
	col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
  .select(col("emp1.emp_id"),col("emp1.name"),
          col("emp2.emp_id").as("superior_emp_id"),
          col("emp2.name").as("superior_emp_name"))
  .show(false)

Here, we join the emp dataset with itself to find the superior's emp_id and name for all employees.

+------+--------+---------------+-----------------+
|emp_id|name    |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2     |Rose    |1              |Smith            |
|3     |Williams|1              |Smith            |
|4     |Jones   |2              |Rose             |
|5     |Brown   |2              |Rose             |
|6     |Brown   |2              |Rose             |
+------+--------+---------------+-----------------+

4. Spark SQL Join

Since Spark SQL supports native SQL syntax, we can also write join operations after creating temporary views on the DataFrames, using spark.sql().

This is the unremarkable approach: it doesn't use the DataFrame methods at all and simply falls back to the old SQL route.

empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

// SQL JOIN
val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(false)

val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(false)
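The join types from the table in section 2 work in SQL syntax as well; for example, a hedged sketch of semi and anti joins via spark.sql():

spark.sql("select * from EMP e LEFT SEMI JOIN DEPT d ON e.emp_dept_id == d.dept_id").show(false)
spark.sql("select * from EMP e LEFT ANTI JOIN DEPT d ON e.emp_dept_id == d.dept_id").show(false)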

5. Using joinType objects

All join-type objects are defined in the org.apache.spark.sql.catalyst.plans package; in order to use them, you need to import org.apache.spark.sql.catalyst.plans.{LeftOuter, Inner, ...}.

Note that this usage (e.g. Inner.sql) is only available starting with Spark 2.0.

import org.apache.spark.sql.catalyst.plans.{LeftOuter, Inner}

println("LeftOuter join")
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), LeftOuter.sql)
  .show(false)

println("Inner join")
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), Inner.sql)
  .show(false)

6. Spark Join on multiple DataFrame’s

When you need to join more than two tables, you can either use a SQL expression after creating temporary views on the DataFrames, or chain join operations, using the result of one join as the input to the next. For example:

df1.join(df2, df1("id1") === df2("id2"), "inner")
   .join(df3, df1("id1") === df3("id3"), "inner")
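And a hypothetical sketch of the SQL alternative, assuming df1/df2/df3 have been registered as temp views t1/t2/t3 with createOrReplaceTempView (names are made up for illustration):

spark.sql("""
  SELECT * FROM t1
  JOIN t2 ON t1.id1 = t2.id2
  JOIN t3 ON t1.id1 = t3.id3
""").show(false)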

For more details, I would recommend reading How to Join multiple tables.

7. Source Code of Spark SQL Join Example

package com.sparkbyexamples.spark.dataframe.join

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object JoinExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
    (2,"Rose",1,"2010","20","M",4000),
    (3,"Williams",1,"2010","10","M",1000),
    (4,"Jones",2,"2005","10","F",2000),
    (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
  )
  val empColumns = Seq("emp_id","name","superior_emp_id","year_joined","emp_dept_id","gender","salary")
  import spark.sqlContext.implicits._
  val empDF = emp.toDF(empColumns:_*)
  empDF.show(false)

  val dept = Seq(("Finance",10),
    ("Marketing",20),
    ("Sales",30),
    ("IT",40)
  )

  val deptColumns = Seq("dept_name","dept_id")
  val deptDF = dept.toDF(deptColumns:_*)
  deptDF.show(false)


  println("Inner join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"inner")
    .show(false)

  println("Outer join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"outer")
    .show(false)
  println("full join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"full")
    .show(false)
  println("fullouter join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"fullouter")
    .show(false)

  println("right join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"right")
    .show(false)
  println("rightouter join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"rightouter")
    .show(false)

  println("left join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"left")
    .show(false)
  println("leftouter join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"leftouter")
    .show(false)

  println("leftanti join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"leftanti")
    .show(false)

  println("leftsemi join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"leftsemi")
    .show(false)

  println("cross join")
  empDF.join(deptDF,empDF("emp_dept_id") ===  deptDF("dept_id"),"cross")
    .show(false)

  println("Using crossJoin()")
  empDF.crossJoin(deptDF).show(false)

  println("self join")
  empDF.as("emp1").join(empDF.as("emp2"),
    col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
    .select(col("emp1.emp_id"),col("emp1.name"),
      col("emp2.emp_id").as("superior_emp_id"),
      col("emp2.name").as("superior_emp_name"))
      .show(false)

  empDF.createOrReplaceTempView("EMP")
  deptDF.createOrReplaceTempView("DEPT")

  //SQL JOIN
  val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
  joinDF.show(false)

  val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
  joinDF2.show(false)

}