Join Operations in Spark and How to Fix Common Problems

I. Data Preparation

grade.txt:(id, grade)
1,75
1,86
1,64
2,76

address.txt:(id, address)
1,shanghai
1,beijing

II. Joins in Spark Core

Each line first needs to be converted into a key-value pair, so that the PairRDD join operators can be applied.

val grade = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\grade.txt")
  .map(_.split(",")).map(x => (x(0), x(1)))
val address = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\address.txt")
  .map(_.split(",")).map(x => (x(0), x(1)))

join (inner join):

// foreach returns Unit, so bind the join result first, then print it.
val joinRDD = grade.join(address)
joinRDD.foreach(println)

(1,(75,shanghai))
(1,(75,beijing))
(1,(86,shanghai))
(1,(86,beijing))
(1,(64,shanghai))
(1,(64,beijing))

leftOuterJoin:

val joinRDD2 = grade.leftOuterJoin(address)
joinRDD2.foreach(println)

(2,(76,None))
(1,(75,Some(shanghai)))
(1,(75,Some(beijing)))
(1,(86,Some(shanghai)))
(1,(86,Some(beijing)))
(1,(64,Some(shanghai)))
(1,(64,Some(beijing)))
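
For completeness, the RDD API also provides rightOuterJoin and fullOuterJoin, which wrap the potentially missing side in Option in the same way. A minimal sketch against the same two RDDs:

// Right outer join: every address row is kept; a missing grade becomes None.
// Result type: RDD[(String, (Option[String], String))]
grade.rightOuterJoin(address).foreach(println)

// Full outer join: both sides are wrapped in Option.
// Result type: RDD[(String, (Option[String], Option[String]))]
grade.fullOuterJoin(address).foreach(println)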

III. Joins in Spark SQL

1. Common Usage

import spark.implicits._
val grade = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\grade.txt")
  .map(_.split(",")).map(x => (x(0), x(1)))
  .toDF("id", "grade")

val address = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\address.txt")
  .map(_.split(",")).map(x => (x(0), x(1)))
  .toDF("id", "address")

A plain join with no join condition computes the Cartesian product of the two tables, i.e. it behaves like a cross join.

grade.join(address).show()

+---+-----+---+--------+
| id|grade| id| address|
+---+-----+---+--------+
|  1|   75|  1|shanghai|
|  1|   86|  1|shanghai|
|  1|   75|  1| beijing|
|  1|   86|  1| beijing|
|  1|   64|  1|shanghai|
|  2|   76|  1|shanghai|
|  1|   64|  1| beijing|
|  2|   76|  1| beijing|
+---+-----+---+--------+
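
If the Cartesian product is what you actually want, it is clearer to call crossJoin explicitly (available since Spark 2.1). Note that on Spark 2.x a condition-less join may be rejected unless implicit cross joins are enabled; Spark 3.x allows it by default. A minimal sketch:

// Explicit Cartesian product; equivalent to the condition-less join above.
grade.crossJoin(address).show()

// Spark 2.x only: allow the plain condition-less join() form as well.
// spark.conf.set("spark.sql.crossJoin.enabled", "true")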

You need to specify the join condition grade("id") === address("id"); however, the result still contains a duplicate id column.

grade.join(address, grade("id") === address("id")).show()

+---+-----+---+--------+
| id|grade| id| address|
+---+-----+---+--------+
|  1|   75|  1|shanghai|
|  1|   75|  1| beijing|
|  1|   86|  1|shanghai|
|  1|   86|  1| beijing|
|  1|   64|  1|shanghai|
|  1|   64|  1| beijing|
+---+-----+---+--------+

With left_outer, every row of the left table is kept; where there is no match, the right-side columns are null.

grade.join(address, grade("id") === address("id"), "left_outer").show()

+---+-----+----+--------+
| id|grade|  id| address|
+---+-----+----+--------+
|  1|   75|   1|shanghai|
|  1|   75|   1| beijing|
|  1|   86|   1|shanghai|
|  1|   86|   1| beijing|
|  1|   64|   1|shanghai|
|  1|   64|   1| beijing|
|  2|   76|null|    null|
+---+-----+----+--------+

Passing Seq("id") as the join key removes the duplicate column.

grade.join(address, Seq("id"), "left_outer").show()

+---+-----+--------+
| id|grade| address|
+---+-----+--------+
|  1|   75|shanghai|
|  1|   75| beijing|
|  1|   86|shanghai|
|  1|   86| beijing|
|  1|   64|shanghai|
|  1|   64| beijing|
|  2|   76|    null|
+---+-----+--------+
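
An alternative to Seq("id") is to drop the duplicate column after the join, which also works when the join keys have different names on the two sides. A minimal sketch:

// Keep the explicit join condition, then drop the right-hand id column.
grade.join(address, grade("id") === address("id"), "left_outer")
  .drop(address("id"))
  .show()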

2. The Five Join Strategies in Spark SQL

Reference: Spark的五种JOIN策略解析 (an analysis of Spark's five join strategies). In short, the five physical strategies are Broadcast Hash Join, Shuffle Hash Join, Sort Merge Join, Cartesian Product Join, and Broadcast Nested Loop Join.
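
A quick way to see which strategy the planner picks is to inspect the physical plan with explain(). A minimal sketch, reusing the grade and address DataFrames from above:

import org.apache.spark.sql.functions.broadcast

// Force a Broadcast Hash Join via the broadcast() hint; the physical plan
// printed by explain() should contain "BroadcastHashJoin".
grade.join(broadcast(address), Seq("id")).explain()

// Without the hint (and with a right side too large to broadcast),
// an equi-join typically falls back to a Sort Merge Join.
grade.join(address, Seq("id")).explain()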

IV. Solving Data Skew

1. A Large Table Joins a Small Table (~100 MB): Broadcast It Directly

Note: the SQL below assumes grade and address have been registered as temp views (e.g. via createOrReplaceTempView), and that address.txt now also contains the line 2,wuhan, which is why id 2 finds a match in the results.

spark.sql(
  """
    |select /*+ BROADCAST (address) */ * from grade, address where grade.id = address.id
    |""".stripMargin).show()

Result:

+---+-----+---+--------+
| id|grade| id| address|
+---+-----+---+--------+
|  1|   75|  1| beijing|
|  1|   75|  1|shanghai|
|  1|   86|  1| beijing|
|  1|   86|  1|shanghai|
|  1|   64|  1| beijing|
|  1|   64|  1|shanghai|
|  2|   76|  2|   wuhan|
+---+-----+---+--------+
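
The same broadcast can be expressed in the DataFrame API. Spark also broadcasts automatically whenever a table is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), so raising that threshold is another option for tables in the 100 MB range. A sketch:

import org.apache.spark.sql.functions.broadcast

// Explicit broadcast hint in the DataFrame API.
grade.join(broadcast(address), Seq("id")).show()

// Or raise the automatic broadcast threshold (the value is in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)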

2. A Large Table Joins a Small Table (GB scale): Salt the Large Table, Replicate the Small Table

spark.sql(
  """
    |select id, grade, concat(id, ceiling(rand() * 100) % 3) as new_id from grade
    |""".stripMargin).createOrReplaceTempView("grade_new")

spark.sql(
    """
      |select id, address, concat(id, suffix) as new_id
      |from (
      |   select id, address, suffix from address Lateral View explode(array(0, 1, 2)) tmp as suffix
      |)
      |""".stripMargin).createOrReplaceTempView("address_new")

spark.sql(
  """
    |select SUBSTRING(grade_new.new_id, 0, 1) as id, grade, address
    |from grade_new, address_new where grade_new.new_id = address_new.new_id
    |""".stripMargin).show()

Result:

+---+-----+--------+
| id|grade| address|
+---+-----+--------+
|  1|   75|shanghai|
|  1|   75| beijing|
|  1|   86|shanghai|
|  1|   86| beijing|
|  1|   64|shanghai|
|  1|   64| beijing|
|  2|   76|   wuhan|
+---+-----+--------+
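
The same salting idea can be written with the DataFrame API, for reference. This is a sketch, assuming 3 salt buckets and the grade/address DataFrames from section III; the column names salted_id and salt are made up for illustration, and the "_" separator keeps the original id recoverable without a fixed-width substring:

import org.apache.spark.sql.functions._

val buckets = 3

// Salt the large table: append a random suffix 0..2 to the join key.
val gradeSalted = grade.withColumn("salted_id",
  concat(col("id"), lit("_"), (rand() * buckets).cast("int")))

// Replicate the small table: one copy of each row per possible suffix.
val addressExpanded = address
  .withColumn("salt", explode(array((0 until buckets).map(lit): _*)))
  .withColumn("salted_id", concat(col("id"), lit("_"), col("salt")))

gradeSalted.join(addressExpanded, Seq("salted_id"))
  .select(gradeSalted("id"), col("grade"), col("address"))
  .show()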

3. A Large Table with a Few Skewed Keys Joins Another Table: Broadcast the Skewed Part

spark.sql(
  """
    |select * from grade where id <> 1
    |""".stripMargin).createOrReplaceTempView("grade_noskew")

spark.sql(
  """
    |select * from grade where id = 1
    |""".stripMargin).createOrReplaceTempView("grade_skew")

spark.sql(
  """
    |select * from (
    |   select grade_noskew.id, grade, address
    |       from grade_noskew, address where grade_noskew.id = address.id
    |   union all
    |   select /*+ BROADCAST (grade_skew) */ grade_skew.id, grade, address
    |       from grade_skew, address where grade_skew.id = address.id
    |)
    |""".stripMargin).show()

Result:

+---+-----+--------+
| id|grade| address|
+---+-----+--------+
|  2|   76|   wuhan|
|  1|   64|shanghai|
|  1|   86|shanghai|
|  1|   75|shanghai|
|  1|   64| beijing|
|  1|   86| beijing|
|  1|   75| beijing|
+---+-----+--------+
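
On Spark 3.x, Adaptive Query Execution can perform this kind of split automatically: with skew-join handling enabled, AQE detects oversized partitions at runtime and splits them into smaller tasks. A configuration sketch:

// Spark 3.x: let AQE detect and split skewed partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")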