What types of JOIN does Spark SQL support?

Preface

This article belongs to my column 《大数据技术体系》 (Big Data Technology System). It is original work; please credit the source when quoting, and feel free to point out shortcomings or errors in the comments. Thanks!

For the column's table of contents and references, see 大数据技术体系.


The 7 JOIN types

Spark 3.3.0 supports 7 types of JOIN:

  1. INNER JOIN
  2. FULL OUTER JOIN
  3. LEFT OUTER JOIN
  4. RIGHT OUTER JOIN
  5. LEFT SEMI JOIN
  6. LEFT ANTI JOIN
  7. CROSS JOIN

PySpark also supports pandas' merge_asof, i.e., ASOF JOIN. For details, see my post: What is ASOF JOIN, and how do you use pandas' merge_asof()?


Differences

INNER JOIN

An inner join returns only the rows from the left table and the right table that can be matched with each other.

For example:

SELECT A.PK AS A_PK, B.PK AS B_PK,
       A.Value AS A_Value, B.Value AS B_Value
FROM Table_A A
INNER JOIN Table_B B
ON A.PK = B.PK;

Result:

+------+------+---------+---------+
| A_PK | B_PK | A_Value | B_Value |
+------+------+---------+---------+
|    1 |    1 | both ab | both ba |
+------+------+---------+---------+
1 row in set (0.00 sec)

FULL OUTER JOIN

FULL OUTER JOIN can also be written simply as FULL JOIN in queries.

A full outer join returns all records from both the left and right tables; the rows that can be matched between the two tables are joined together.

For example:

SELECT A.PK AS A_PK, B.PK AS B_PK,
       A.Value AS A_Value, B.Value AS B_Value
FROM Table_A A
FULL OUTER JOIN Table_B B
ON A.PK = B.PK;

Result:

+------+------+---------+---------+
| A_PK | B_PK | A_Value | B_Value |
+------+------+---------+---------+
|    1 |    1 | both ab | both ba |
|    2 | NULL | only a  | NULL    |
| NULL |    3 | NULL    | only b  |
+------+------+---------+---------+
3 rows in set (0.00 sec)

LEFT OUTER JOIN

LEFT OUTER JOIN can also be written as LEFT JOIN.

A left join returns all records from the left table, whether or not the right table has matching rows.

The matching columns found in the right table are returned as well.

For example:

SELECT A.PK AS A_PK, B.PK AS B_PK,
       A.Value AS A_Value, B.Value AS B_Value
FROM Table_A A
LEFT JOIN Table_B B
ON A.PK = B.PK;

Result:

+------+------+---------+---------+
| A_PK | B_PK | A_Value | B_Value |
+------+------+---------+---------+
|    1 |    1 | both ab | both ba |
|    2 | NULL | only a  | NULL    |
+------+------+---------+---------+
2 rows in set (0.00 sec)

RIGHT OUTER JOIN

RIGHT OUTER JOIN can also be written as RIGHT JOIN.

A right join returns all records from the right table, whether or not the left table has matching rows.

The matching columns found in the left table are returned as well.

For example:

SELECT A.PK AS A_PK, B.PK AS B_PK,
       A.Value AS A_Value, B.Value AS B_Value
FROM Table_A A
RIGHT JOIN Table_B B
ON A.PK = B.PK;

Result:

+------+------+---------+---------+
| A_PK | B_PK | A_Value | B_Value |
+------+------+---------+---------+
|    1 |    1 | both ab | both ba |
| NULL |    3 | NULL    | only b  |
+------+------+---------+---------+
2 rows in set (0.00 sec)

LEFT SEMI JOIN

A LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery.

For example:

SELECT A.PK, A.Value
FROM A
WHERE A.PK IN (SELECT B.PK FROM B)

can be rewritten as:

SELECT A.PK, A.Value
FROM A LEFT SEMI JOIN B
ON (A.PK = B.PK)

Characteristics

  1. The restriction of LEFT SEMI JOIN is that the right table may only be filtered in the ON clause; filtering it in the WHERE clause, the SELECT clause, or anywhere else is not allowed.
  2. A LEFT SEMI JOIN passes only the table's join key to the map phase, so the final SELECT of a LEFT SEMI JOIN may only reference columns from the left table.
  3. Because a LEFT SEMI JOIN acts like in(keySet), when the right table contains duplicate records the left row is emitted once and then skipped, whereas a JOIN keeps traversing. With duplicate values in the right table, LEFT SEMI JOIN therefore produces one row where JOIN produces several, which also makes LEFT SEMI JOIN faster (see the sketch after this list).
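
A minimal sketch of point 3, assuming the SparkSession spark from the full example below; the right side holds a duplicate key, so the semi join emits each matching left row once while the inner join emits one row per match:

val left = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("pk", "v")
val right = spark.createDataFrame(Seq((1, "x"), (1, "y"))).toDF("pk", "w")

// Semi join: pk=1 appears once, regardless of duplicates on the right.
left.join(right, Seq("pk"), "semi").show()
// Inner join: pk=1 appears twice, once per matching right row.
left.join(right, Seq("pk"), "inner").show()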

LEFT ANTI JOIN

A LEFT ANTI JOIN B removes from A the rows that have a match in B.

LEFT SEMI JOIN returns only the left-table rows that can be matched to the right table; LEFT ANTI JOIN returns only the left-table rows that cannot.

In fact, the effects of INNER JOIN and LEFT SEMI/ANTI JOIN can all be reproduced with a LEFT OUTER JOIN plus a WHERE filter, but that approach is less efficient, as the sketch below illustrates.
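
A hedged sketch of the anti-join rewrite, assuming the t1/t2 temp views registered in the full example below:

// LEFT ANTI JOIN expressed as LEFT OUTER JOIN + WHERE.
spark.sql(
  """SELECT t1.*
    |FROM t1 LEFT OUTER JOIN t2 ON t1.name = t2.name
    |WHERE t2.name IS NULL""".stripMargin).show()

// The equivalent dedicated form, which the optimizer handles more efficiently.
spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2 ON t1.name = t2.name").show()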


CROSS JOIN

If no WHERE clause is given, a cross join returns the Cartesian product of the two joined tables; the number of result rows equals the product of the two tables' row counts.
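
A minimal sketch of the pure Cartesian product through the Dataset API (Dataset also offers a dedicated crossJoin method that takes no join condition), assuming the df1/df2 DataFrames from the full example below:

// 3 rows x 3 rows = 9 rows.
df1.crossJoin(df2).show()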

For example, queries A, B, and C below produce the same result but differ in efficiency:

A:

SELECT a.*, b.* FROM t1 a, t2 b WHERE a.id=b.id

B:

SELECT * FROM t1 a CROSS JOIN t2 b WHERE a.id=b.id

Note: in standard SQL, a condition attached to a CROSS JOIN can only go in WHERE, not ON (Spark SQL, however, also accepts CROSS JOIN ... ON, as the console output later in this post shows).

C:

SELECT * FROM t1 a INNER JOIN t2 b ON a.id=b.id

A and B are generally not recommended: with a WHERE clause, the engine often first materializes a table whose row count is the product of the two tables' row counts, and only then selects from it according to the WHERE condition.

So if the two tables being joined are large, this can be extremely slow and should be avoided.


joinType

The joinType strings accepted for each type:

+------------+----------------------------------------+
| Type       | joinType                               |
+------------+----------------------------------------+
| Inner      | inner                                  |
| FullOuter  | outer / full / fullouter / full_outer  |
| LeftOuter  | leftouter / left / left_outer          |
| RightOuter | rightouter / right / right_outer       |
| LeftSemi   | leftsemi / left_semi / semi            |
| LeftAnti   | leftanti / left_anti / anti            |
| Cross      | cross                                  |
+------------+----------------------------------------+

We can choose a joinType through the Dataset API:

def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
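
Note that this usingColumns overload keeps a single copy of the join column in the output, which is why the DataFrame results below show one name column while the SQL results show two. A Column-expression overload also exists; a minimal sketch, assuming the df1/df2 DataFrames from the example below:

// usingColumns: one shared "name" column in the output.
df1.join(df2, Seq("name"), "inner").show()
// Join condition as a Column expression: both "name" columns are kept.
df1.join(df2, df1("name") === df2("name"), "inner").show()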

Practice

Source code

The spark-examples code is open source; the project aims to provide the most practical guide to Apache Spark development.

Click the link to download the source from GitHub: spark-examples


package com.shockang.study.spark.sql.join

import com.shockang.study.spark.util.Utils.formatPrint
import org.apache.spark.sql.SparkSession

/**
 *
 * Tests of the JOIN types supported by Spark SQL
 *
 * @author Shockang
 */
object JoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinExample").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val df1 = spark.createDataFrame(List(
      ("Shockang", "程序员"),
      ("Tom", "猫"),
      ("Jerry", "老鼠")
    )).toDF("name", "occupation").cache()

    df1.createTempView("t1")

    val df2 = spark.createDataFrame(List(
      ("Alice", 12),
      ("Tom", 10),
      ("Jerry", 11)
    )).toDF("name", "age").cache()

    df2.createTempView("t2")

    // INNER JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "inner").show()""")
    df1.join(df2, Seq("name"), joinType = "inner").show()

    formatPrint("""spark.sql("SELECT * FROM t1 INNER JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 INNER JOIN t2 ON t1.name=t2.name").show()
    formatPrint("""spark.sql("SELECT * FROM t1 JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.name=t2.name").show()

    // FULL OUTER JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "outer").show()""")
    df1.join(df2, Seq("name"), joinType = "outer").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "full").show()""")
    df1.join(df2, Seq("name"), joinType = "full").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "fullouter").show()""")
    df1.join(df2, Seq("name"), joinType = "fullouter").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "full_outer").show()""")
    df1.join(df2, Seq("name"), joinType = "full_outer").show()

    formatPrint("""spark.sql("SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.name=t2.name").show()
    formatPrint("""spark.sql("SELECT * FROM t1 FULL JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 FULL JOIN t2 ON t1.name=t2.name").show()

    // LEFT OUTER JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "leftouter").show()""")
    df1.join(df2, Seq("name"), joinType = "leftouter").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "left").show()""")
    df1.join(df2, Seq("name"), joinType = "left").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "left_outer").show()""")
    df1.join(df2, Seq("name"), joinType = "left_outer").show()

    formatPrint("""spark.sql("SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.name=t2.name").show()
    formatPrint("""spark.sql("SELECT * FROM t1 LEFT JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 LEFT JOIN t2 ON t1.name=t2.name").show()

    // RIGHT OUTER JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "rightouter").show()""")
    df1.join(df2, Seq("name"), joinType = "rightouter").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "right").show()""")
    df1.join(df2, Seq("name"), joinType = "right").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "right_outer").show()""")
    df1.join(df2, Seq("name"), joinType = "right_outer").show()

    formatPrint("""spark.sql("SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.name=t2.name").show()
    formatPrint("""spark.sql("SELECT * FROM t1 RIGHT JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 RIGHT JOIN t2 ON t1.name=t2.name").show()

    // LEFT SEMI JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "leftsemi").show()""")
    df1.join(df2, Seq("name"), joinType = "leftsemi").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "left_semi").show()""")
    df1.join(df2, Seq("name"), joinType = "left_semi").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "semi").show()""")
    df1.join(df2, Seq("name"), joinType = "semi").show()

    formatPrint("""spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2 ON t1.name=t2.name").show()
    formatPrint("""spark.sql("SELECT * FROM t1 SEMI JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 SEMI JOIN t2 ON t1.name=t2.name").show()

    // LEFT ANTI JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "leftanti").show()""")
    df1.join(df2, Seq("name"), joinType = "leftanti").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "left_anti").show()""")
    df1.join(df2, Seq("name"), joinType = "left_anti").show()
    formatPrint("""df1.join(df2, Seq("name"), joinType = "anti").show()""")
    df1.join(df2, Seq("name"), joinType = "anti").show()

    formatPrint("""spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2 ON t1.name=t2.name").show()
    formatPrint("""spark.sql("SELECT * FROM t1 ANTI JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 ANTI JOIN t2 ON t1.name=t2.name").show()

    // CROSS JOIN
    formatPrint("""df1.join(df2, Seq("name"), joinType = "cross").show()""")
    df1.join(df2, Seq("name"), joinType = "cross").show()

    formatPrint("""spark.sql("SELECT * FROM t1 CROSS JOIN t2 ON t1.name=t2.name").show()""")
    spark.sql("SELECT * FROM t1 CROSS JOIN t2 ON t1.name=t2.name").show()

    spark.stop()
  }
}

Console output

========== df1.join(df2, Seq("name"), joinType = "inner").show() ==========
+-----+----------+---+
| name|occupation|age|
+-----+----------+---+
|  Tom|        猫| 10|
|Jerry|      老鼠| 11|
+-----+----------+---+

========== spark.sql("SELECT * FROM t1 INNER JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+-----+---+
| name|occupation| name|age|
+-----+----------+-----+---+
|  Tom|        猫|  Tom| 10|
|Jerry|      老鼠|Jerry| 11|
+-----+----------+-----+---+

========== spark.sql("SELECT * FROM t1 JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+-----+---+
| name|occupation| name|age|
+-----+----------+-----+---+
|  Tom|        猫|  Tom| 10|
|Jerry|      老鼠|Jerry| 11|
+-----+----------+-----+---+

========== df1.join(df2, Seq("name"), joinType = "outer").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|   Alice|      null|  12|
|   Jerry|      老鼠|  11|
|Shockang|    程序员|null|
|     Tom|        猫|  10|
+--------+----------+----+

========== df1.join(df2, Seq("name"), joinType = "full").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|   Alice|      null|  12|
|   Jerry|      老鼠|  11|
|Shockang|    程序员|null|
|     Tom|        猫|  10|
+--------+----------+----+

========== df1.join(df2, Seq("name"), joinType = "fullouter").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|   Alice|      null|  12|
|   Jerry|      老鼠|  11|
|Shockang|    程序员|null|
|     Tom|        猫|  10|
+--------+----------+----+

========== df1.join(df2, Seq("name"), joinType = "full_outer").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|   Alice|      null|  12|
|   Jerry|      老鼠|  11|
|Shockang|    程序员|null|
|     Tom|        猫|  10|
+--------+----------+----+

========== spark.sql("SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.name=t2.name").show() ==========
+--------+----------+-----+----+
|    name|occupation| name| age|
+--------+----------+-----+----+
|    null|      null|Alice|  12|
|   Jerry|      老鼠|Jerry|  11|
|Shockang|    程序员| null|null|
|     Tom|        猫|  Tom|  10|
+--------+----------+-----+----+

========== spark.sql("SELECT * FROM t1 FULL JOIN t2 ON t1.name=t2.name").show() ==========
+--------+----------+-----+----+
|    name|occupation| name| age|
+--------+----------+-----+----+
|    null|      null|Alice|  12|
|   Jerry|      老鼠|Jerry|  11|
|Shockang|    程序员| null|null|
|     Tom|        猫|  Tom|  10|
+--------+----------+-----+----+

========== df1.join(df2, Seq("name"), joinType = "leftouter").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|Shockang|    程序员|null|
|     Tom|        猫|  10|
|   Jerry|      老鼠|  11|
+--------+----------+----+

========== df1.join(df2, Seq("name"), joinType = "left").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|Shockang|    程序员|null|
|     Tom|        猫|  10|
|   Jerry|      老鼠|  11|
+--------+----------+----+

========== df1.join(df2, Seq("name"), joinType = "left_outer").show() ==========
+--------+----------+----+
|    name|occupation| age|
+--------+----------+----+
|Shockang|    程序员|null|
|     Tom|        猫|  10|
|   Jerry|      老鼠|  11|
+--------+----------+----+

========== spark.sql("SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.name=t2.name").show() ==========
+--------+----------+-----+----+
|    name|occupation| name| age|
+--------+----------+-----+----+
|Shockang|    程序员| null|null|
|     Tom|        猫|  Tom|  10|
|   Jerry|      老鼠|Jerry|  11|
+--------+----------+-----+----+

========== spark.sql("SELECT * FROM t1 LEFT JOIN t2 ON t1.name=t2.name").show() ==========
+--------+----------+-----+----+
|    name|occupation| name| age|
+--------+----------+-----+----+
|Shockang|    程序员| null|null|
|     Tom|        猫|  Tom|  10|
|   Jerry|      老鼠|Jerry|  11|
+--------+----------+-----+----+

========== df1.join(df2, Seq("name"), joinType = "rightouter").show() ==========
+-----+----------+---+
| name|occupation|age|
+-----+----------+---+
|Alice|      null| 12|
|  Tom|        猫| 10|
|Jerry|      老鼠| 11|
+-----+----------+---+

========== df1.join(df2, Seq("name"), joinType = "right").show() ==========
+-----+----------+---+
| name|occupation|age|
+-----+----------+---+
|Alice|      null| 12|
|  Tom|        猫| 10|
|Jerry|      老鼠| 11|
+-----+----------+---+

========== df1.join(df2, Seq("name"), joinType = "right_outer").show() ==========
+-----+----------+---+
| name|occupation|age|
+-----+----------+---+
|Alice|      null| 12|
|  Tom|        猫| 10|
|Jerry|      老鼠| 11|
+-----+----------+---+

========== spark.sql("SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+-----+---+
| name|occupation| name|age|
+-----+----------+-----+---+
| null|      null|Alice| 12|
|  Tom|        猫|  Tom| 10|
|Jerry|      老鼠|Jerry| 11|
+-----+----------+-----+---+

========== spark.sql("SELECT * FROM t1 RIGHT JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+-----+---+
| name|occupation| name|age|
+-----+----------+-----+---+
| null|      null|Alice| 12|
|  Tom|        猫|  Tom| 10|
|Jerry|      老鼠|Jerry| 11|
+-----+----------+-----+---+

========== df1.join(df2, Seq("name"), joinType = "leftsemi").show() ==========
+-----+----------+
| name|occupation|
+-----+----------+
|  Tom|        猫|
|Jerry|      老鼠|
+-----+----------+

========== df1.join(df2, Seq("name"), joinType = "left_semi").show() ==========
+-----+----------+
| name|occupation|
+-----+----------+
|  Tom|        猫|
|Jerry|      老鼠|
+-----+----------+

========== df1.join(df2, Seq("name"), joinType = "semi").show() ==========
+-----+----------+
| name|occupation|
+-----+----------+
|  Tom|        猫|
|Jerry|      老鼠|
+-----+----------+

========== spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+
| name|occupation|
+-----+----------+
|  Tom|        猫|
|Jerry|      老鼠|
+-----+----------+

========== spark.sql("SELECT * FROM t1 SEMI JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+
| name|occupation|
+-----+----------+
|  Tom|        猫|
|Jerry|      老鼠|
+-----+----------+

========== df1.join(df2, Seq("name"), joinType = "leftanti").show() ==========
+--------+----------+
|    name|occupation|
+--------+----------+
|Shockang|    程序员|
+--------+----------+

========== df1.join(df2, Seq("name"), joinType = "left_anti").show() ==========
+--------+----------+
|    name|occupation|
+--------+----------+
|Shockang|    程序员|
+--------+----------+

========== df1.join(df2, Seq("name"), joinType = "anti").show() ==========
+--------+----------+
|    name|occupation|
+--------+----------+
|Shockang|    程序员|
+--------+----------+

========== spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2 ON t1.name=t2.name").show() ==========
+--------+----------+
|    name|occupation|
+--------+----------+
|Shockang|    程序员|
+--------+----------+

========== spark.sql("SELECT * FROM t1 ANTI JOIN t2 ON t1.name=t2.name").show() ==========
+--------+----------+
|    name|occupation|
+--------+----------+
|Shockang|    程序员|
+--------+----------+

========== df1.join(df2, Seq("name"), joinType = "cross").show() ==========
+-----+----------+---+
| name|occupation|age|
+-----+----------+---+
|  Tom|        猫| 10|
|Jerry|      老鼠| 11|
+-----+----------+---+

========== spark.sql("SELECT * FROM t1 CROSS JOIN t2 ON t1.name=t2.name").show() ==========
+-----+----------+-----+---+
| name|occupation| name|age|
+-----+----------+-----+---+
|  Tom|        猫|  Tom| 10|
|Jerry|      老鼠|Jerry| 11|
+-----+----------+-----+---+