Join Types in Spark DataFrame

Original post: https://blog.csdn.net/anjingwunai/article/details/51934921

July 17, 2016

Joins on a Spark DataFrame look much like SQL: there are inner join, left join, right join, and full join.
So how does the join method select among these join types?
Look at its signatures:
def join(right : DataFrame, usingColumns : Seq[String], joinType : String) : DataFrame
def join(right : DataFrame, joinExprs : Column, joinType : String) : DataFrame
As you can see, the join type is selected by passing a String-valued joinType.
joinType can be "inner", "left", "right", or "full", corresponding to inner join, left join, right join, and full join respectively; the default is "inner", i.e. an inner join.
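
The first overload joins on column names shared by both sides. A minimal sketch of it, assuming the personDataFrame and orderDataFrame built in the full test source further down:

// Joining on the shared column name yields a single id_person column in the
// output, unlike the joinExprs form used in the rest of this post.
personDataFrame.join(orderDataFrame, Seq("id_person"), "inner").show()

The examples below all use the second, joinExprs overload, which is why both sides' id_person columns appear in the results.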

personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person")).show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "inner").show()

The result:

+---------+----+-------+--------+--------+---------+
|id_person|name|address|id_order|orderNum|id_person|
+---------+----+-------+--------+--------+---------+
|        1|张三|   深圳|       3|     533|        1|
|        1|张三|   深圳|       4|     444|        1|
|        2|李四|   成都|       1|     325|        2|
|        3|王五|   厦门|       2|      34|        3|
+---------+----+-------+--------+--------+---------+
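
Since DataFrame joins mirror SQL, the same inner join can also be written as a query. A rough sketch, using temp-table names persons and orders that I am choosing here (the original post registers none):

// Register the DataFrames under temporary table names, then join in SQL.
personDataFrame.registerTempTable("persons")
orderDataFrame.registerTempTable("orders")
sqlContext.sql("SELECT * FROM persons p JOIN orders o ON p.id_person = o.id_person").show()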

"left", "left_outer", or "leftouter" denotes a left join.

personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left_outer").show()

The result:

+---------+----+-------+--------+--------+---------+
|id_person|name|address|id_order|orderNum|id_person|
+---------+----+-------+--------+--------+---------+
|        1|张三|   深圳|       3|     533|        1|
|        1|张三|   深圳|       4|     444|        1|
|        2|李四|   成都|       1|     325|        2|
|        3|王五|   厦门|       2|      34|        3|
|        4|朱六|   杭州|    null|    null|     null|
+---------+----+-------+--------+--------+---------+
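
A common use of the left join is finding left rows with no match, by filtering where the right side's key stayed null. A sketch on the same data:

// Persons with no orders: after a left outer join, unmatched rows carry a
// null right-side key.
personDataFrame
  .join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left_outer")
  .filter(orderDataFrame("id_person").isNull)
  .select(personDataFrame("id_person"), personDataFrame("name"), personDataFrame("address"))
  .show()
// Expected: the single row (4, 朱六, 杭州).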

"right", "right_outer", or "rightouter" denotes a right join.

personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right_outer").show()

The result:

+---------+----+-------+--------+--------+---------+
|id_person|name|address|id_order|orderNum|id_person|
+---------+----+-------+--------+--------+---------+
|        2|李四|   成都|       1|     325|        2|
|        3|王五|   厦门|       2|      34|        3|
|        1|张三|   深圳|       3|     533|        1|
|        1|张三|   深圳|       4|     444|        1|
|     null|null|   null|       5|     777|       11|
+---------+----+-------+--------+--------+---------+
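
Note that a right outer join is just a left outer join with the operands swapped; only the column order differs. A sketch:

// Same rows as the right join above, with the order columns listed first.
orderDataFrame.join(personDataFrame, orderDataFrame("id_person") === personDataFrame("id_person"), "left_outer").show()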

"full", "outer", "full_outer", or "fullouter" denotes a full join.

personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full_outer").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "outer").show()

The result:

+---------+----+-------+--------+--------+---------+
|id_person|name|address|id_order|orderNum|id_person|
+---------+----+-------+--------+--------+---------+
|        1|张三|   深圳|       3|     533|        1|
|        1|张三|   深圳|       4|     444|        1|
|        2|李四|   成都|       1|     325|        2|
|        3|王五|   厦门|       2|      34|        3|
|        4|朱六|   杭州|    null|    null|     null|
|     null|null|   null|       5|     777|       11|
+---------+----+-------+--------+--------+---------+
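
After a full outer join, either side's id_person can be null on unmatched rows; a single key column can be rebuilt with coalesce. A sketch, assuming the functions import:

import org.apache.spark.sql.functions.coalesce

// Merge the two nullable key columns into one non-null id_person column.
personDataFrame
  .join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full_outer")
  .select(coalesce(personDataFrame("id_person"), orderDataFrame("id_person")).as("id_person"),
    personDataFrame("name"), orderDataFrame("orderNum"))
  .show()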

Full Scala test source:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

// Case classes describing the two tables; id_person is the join key.
case class Persons(id_person: Int, name: String, address: String)
case class Orders(id_order: Int, orderNum: Int, id_person: Int)

object DataFrameTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DataFrameTest")
    val sc = new SparkContext(conf)

    val sqlContext = new SQLContext(sc)

    // Person 4 (朱六) has no orders, and order 5 references person 11, which
    // does not exist; these two rows are what distinguish the outer joins.
    val personDataFrame = sqlContext.createDataFrame(List(Persons(1, "张三", "深圳"), Persons(2, "李四", "成都"), Persons(3, "王五", "厦门"), Persons(4, "朱六", "杭州")))
    val orderDataFrame = sqlContext.createDataFrame(List(Orders(1, 325, 2), Orders(2, 34, 3), Orders(3, 533, 1), Orders(4, 444, 1), Orders(5, 777, 11)))

    // Inner join is the default, so the next two calls are equivalent.
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person")).show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "inner").show()
    // Left outer join: both spellings are accepted.
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left_outer").show()
    // Right outer join.
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right_outer").show()
    // Full outer join: three equivalent spellings.
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full_outer").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "outer").show()

    sc.stop()
  }
}

How is this implemented? Looking at the sql part of the Spark source, the String is converted into a JoinType.
In JoinType's companion object, the string typ is first lowercased, then its underscores _ are removed, and a pattern match decides which join type to use. The source also shows that besides the inner, left, right, and full outer joins there is a LeftSemi join; I have not used it myself, but its behavior is sketched below.
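
A left semi join, as I understand it, returns only the left side's columns, keeping just the left rows that have at least one match on the right, each at most once. A small sketch on the same data (not in the original post):

personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "leftsemi").show()
// Expected: only the id_person, name, address columns, with persons 1, 2, 3
// each appearing once; person 4, who has no orders, is dropped.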

The JoinType source in Spark:

object JoinType {
  def apply(typ: String): JoinType = typ.toLowerCase.replace("_", "") match {
    case "inner" => Inner
    case "outer" | "full" | "fullouter" => FullOuter
    case "leftouter" | "left" => LeftOuter
    case "rightouter" | "right" => RightOuter
    case "leftsemi" => LeftSemi
    case _ =>
      val supported = Seq(
        "inner",
        "outer", "full", "fullouter",
        "leftouter", "left",
        "rightouter", "right",
        "leftsemi")

      throw new IllegalArgumentException(s"Unsupported join type '$typ'. " +
        "Supported join types include: " + supported.mkString("'", "', '", "'") + ".")
  }
}

sealed abstract class JoinType

case object Inner extends JoinType

case object LeftOuter extends JoinType

case object RightOuter extends JoinType

case object FullOuter extends JoinType

case object LeftSemi extends JoinType
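
The default case of the match is what a misspelled join type hits. A quick illustration, using "lefty" as an arbitrary invalid string and wrapping the call in Try:

import scala.util.Try

// "lefty" survives the lowercasing and underscore removal unchanged, matches
// no case, and triggers the IllegalArgumentException from the default branch.
Try(personDataFrame.join(orderDataFrame,
  personDataFrame("id_person") === orderDataFrame("id_person"), "lefty"))
// Failure(java.lang.IllegalArgumentException: Unsupported join type 'lefty'. ...)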