目录
一.数据准备
grade.txt:(id, grade)
1,75
1,86
1,64
2,76
address.txt:(id, address)
1,shanghai
1,beijing
二.Spark Core中的Join
需要把每行转换成键值对的形式。
val grade = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\grade.txt")
.map(_.split(",")).map(x => (x(0), x(1)))
val address = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\address.txt")
.map(_.split(",")).map(x => (x(0), x(1)))
innerJoin:
val joinRDD = grade.join(address).foreach(println)
(1,(75,shanghai))
(1,(75,beijing))
(1,(86,shanghai))
(1,(86,beijing))
(1,(64,shanghai))
(1,(64,beijing))
leftOuterJoin:
val joinRDD2 = grade.leftOuterJoin(address).foreach(println)
(2,(76,None))
(1,(75,Some(shanghai)))
(1,(75,Some(beijing)))
(1,(86,Some(shanghai)))
(1,(86,Some(beijing)))
(1,(64,Some(shanghai)))
(1,(64,Some(beijing)))
三.Spark SQL中的Join
1.常用案例
import spark.implicits._
val grade = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\grade.txt")
.map(_.split(",")).map(x => (x(0), x(1)))
.toDF("id", "grade")
val address = sc.textFile("D:\\study\\ideaProject\\first_maven\\input\\address.txt")
.map(_.split(",")).map(x => (x(0), x(1)))
.toDF("id", "address")
如果只是单纯的Join,是求的笛卡尔积操作。感觉和Cross Join一样。
grade.join(address).show()
+---+-----+---+--------+
| id|grade| id| address|
+---+-----+---+--------+
| 1| 75| 1|shanghai|
| 1| 86| 1|shanghai|
| 1| 75| 1| beijing|
| 1| 86| 1| beijing|
| 1| 64| 1|shanghai|
| 2| 76| 1|shanghai|
| 1| 64| 1| beijing|
| 2| 76| 1| beijing|
+---+-----+---+--------+
需要指定grade(“id”) === address(“id”),不过也会出现重复列 id。
grade.join(address, grade("id") === address("id")).show()
+---+-----+---+--------+
| id|grade| id| address|
+---+-----+---+--------+
| 1| 75| 1|shanghai|
| 1| 75| 1| beijing|
| 1| 86| 1|shanghai|
| 1| 86| 1| beijing|
| 1| 64| 1|shanghai|
| 1| 64| 1| beijing|
+---+-----+---+--------+
left_outer操作,左表会完整显示,右表部分为null。
grade.join(address, grade("id") === address("id"), "left_outer").show()
+---+-----+----+--------+
| id|grade| id| address|
+---+-----+----+--------+
| 1| 75| 1|shanghai|
| 1| 75| 1| beijing|
| 1| 86| 1|shanghai|
| 1| 86| 1| beijing|
| 1| 64| 1|shanghai|
| 1| 64| 1| beijing|
| 2| 76|null| null|
+---+-----+----+--------+
使用Seq(“id”),可以去掉重复列。
grade.join(address, Seq("id"), "left_outer").show()
+---+-----+--------+
| id|grade| address|
+---+-----+--------+
| 1| 75|shanghai|
| 1| 75| beijing|
| 1| 86|shanghai|
| 1| 86| beijing|
| 1| 64|shanghai|
| 1| 64| beijing|
| 2| 76| null|
+---+-----+--------+
2.Spark SQL中的5种Join
参考文章:Spark的五种JOIN策略解析
四.解决数据倾斜问题
1.大表Join小表(100MB级别),直接广播
spark.sql(
"""
|select /*+ BROADCAST (address) */ * from grade, address where grade.id = address.id
|""".stripMargin).show()
运行结果:
+---+-----+---+--------+
| id|grade| id| address|
+---+-----+---+--------+
| 1| 75| 1| beijing|
| 1| 75| 1|shanghai|
| 1| 86| 1| beijing|
| 1| 86| 1|shanghai|
| 1| 64| 1| beijing|
| 1| 64| 1|shanghai|
| 2| 76| 2| wuhan|
+---+-----+---+--------+
2.大表Join小表(GB级别),大表打散,小表扩容
spark.sql(
"""
|select id, grade, concat(id, ceiling(rand() * 100) % 3) as new_id from grade
|""".stripMargin).createOrReplaceTempView("grade_new")
spark.sql(
"""
|select id, address, concat(id, suffix) as new_id
|from (
| select id, address, suffix from address Lateral View explode(array(0, 1, 2)) tmp as suffix
|)
|""".stripMargin).createOrReplaceTempView("address_new")
spark.sql(
"""
|select SUBSTRING(grade_new.new_id, 0, 1) as id, grade, address
from grade_new, address_new where grade_new.new_id = address_new.new_id
|""".stripMargin).show()
运行结果:
+---+-----+--------+
| id|grade| address|
+---+-----+--------+
| 1| 75|shanghai|
| 1| 75| beijing|
| 1| 86|shanghai|
| 1| 86| beijing|
| 1| 64|shanghai|
| 1| 64| beijing|
| 2| 76| wuhan|
+---+-----+--------+
3.大表(少量倾斜)Join其他表,倾斜的部分广播
spark.sql(
"""
|select * from grade where id <> 1
|""".stripMargin).createOrReplaceTempView("grade_noskew")
spark.sql(
"""
|select * from grade where id = 1
|""".stripMargin).createOrReplaceTempView("grade_skew")
spark.sql(
"""
|select * from (
| select grade_noskew.id, grade, address
| from grade_noskew, address where grade_noskew.id = address.id
| union all
| select /*+ BROADCAST (grade_skew) */ grade_skew.id, grade, address
| from grade_skew, address where grade_skew.id = address.id
|)
|""".stripMargin).show()
运行结果:
+---+-----+--------+
| id|grade| address|
+---+-----+--------+
| 2| 76| wuhan|
| 1| 64|shanghai|
| 1| 86|shanghai|
| 1| 75|shanghai|
| 1| 64| beijing|
| 1| 86| beijing|
| 1| 75| beijing|
+---+-----+--------+