I recently used Spark SQL at work to process some data and ran into a problem doing a full join between two tables. There is surprisingly little material online about Spark SQL full joins, but after some Googling I found the root of the issue. I'm writing it down here both for others and for my own future reference.
The actual work data is confidential, so I'll use two DataFrames instead.
1. First, create the DataFrames:
import spark.implicits._  // needed for toDF; pre-imported in spark-shell
val left = Seq((0, "zero"), (1, "one")).toDF("id", "left")
val right = Seq((0, "zero"), (2, "two"), (3, "three")).toDF("id", "right")
2. Full join the two DataFrames:
scala> left.join(right, Seq("id"), "fullouter").show
+---+----+-----+
| id|left|right|
+---+----+-----+
| 1| one| null|
| 3|null|three|
| 2|null| two|
| 0|zero| zero|
+---+----+-----+
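The same join can also be written in SQL syntax by registering the DataFrames as temporary views. This is a sketch assuming a SparkSession named spark (as in spark-shell); the view names l and r are mine. The USING clause behaves like Seq("id") in the DataFrame API: the join key comes back as a single column.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession, as you would have in spark-shell
val spark = SparkSession.builder().master("local[*]").appName("fullJoinDemo").getOrCreate()
import spark.implicits._

val left = Seq((0, "zero"), (1, "one")).toDF("id", "left")
val right = Seq((0, "zero"), (2, "two"), (3, "three")).toDF("id", "right")

left.createOrReplaceTempView("l")
right.createOrReplaceTempView("r")

// USING (id) merges the two id columns into one, like Seq("id") does
val joinedSql = spark.sql("SELECT * FROM l FULL OUTER JOIN r USING (id)")
joinedSql.show()
```

This is handy when comparing the DataFrame API plan against what an equivalent Hive-style query would produce.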
Note: the Seq here is an argument to the join method; it holds the common key column(s) that the two tables are joined on.
3. Run explain on this query:
scala> left.join(right, Seq("id"), "fullouter").explain
== Physical Plan ==
*Project [coalesce(id#50, id#60) AS id#85, left#51, right#61]
+- SortMergeJoin [id#50], [id#60], FullOuter
:- *Sort [id#50 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#50, 200)
: +- LocalTableScan [id#50, left#51]
+- *Sort [id#60 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#60, 200)
+- LocalTableScan [id#60, right#61]
As the plan shows, in *Project [coalesce(id#50, id#60) AS id#85, left#51, right#61] the common key "id" has already been run through coalesce, so when selecting columns from the result there is no need to check which side's key is null yourself.
For example:
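A quick way to confirm this behavior (a sketch, rebuilding the same left/right DataFrames as above): in the joined result the merged id column is never null, even for rows that exist on only one side.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession, as in spark-shell
val spark = SparkSession.builder().master("local[*]").appName("coalesceCheck").getOrCreate()
import spark.implicits._

val left = Seq((0, "zero"), (1, "one")).toDF("id", "left")
val right = Seq((0, "zero"), (2, "two"), (3, "three")).toDF("id", "right")

val joined = left.join(right, Seq("id"), "fullouter")

// The coalesced key is never null, even for one-sided rows
assert(joined.filter("id IS NULL").count() == 0)
// ids from both sides survive: 0, 1, 2, 3
assert(joined.select("id").collect().map(_.getInt(0)).sorted.sameElements(Array(0, 1, 2, 3)))
```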
dataframe1: id, name, age
dataframe2: id, name, value
val result = dataframe1.as("d1").join(dataframe2.as("d2"), Seq("id","name"), "full")
.selectExpr("id","name","d1.age","d2.value")
The common keys here, id and name, must not be qualified with the table aliases. Writing them as below is wrong, which is a difference from Hive SQL.
This version throws an error:
val result = dataframe1.as("d1").join(dataframe2.as("d2"), Seq("id","name"), "full")
.selectExpr("coalesce(d1.id,d2.id)","coalesce(d1.name,d2.name)","d1.age","d2.value")
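If you really do want the Hive-style, alias-qualified form, join on an explicit column condition instead of Seq(...). Then both sides' key columns survive into the result, and coalescing them yourself is valid. This is a sketch with hypothetical sample data matching the dataframe1/dataframe2 schemas above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local SparkSession, as in spark-shell
val spark = SparkSession.builder().master("local[*]").appName("explicitCondition").getOrCreate()
import spark.implicits._

// Hypothetical sample data matching the schemas in the example
val dataframe1 = Seq((0, "a", 20), (1, "b", 30)).toDF("id", "name", "age")
val dataframe2 = Seq((0, "a", 100), (2, "c", 200)).toDF("id", "name", "value")

// With an explicit join condition, both id and name columns are kept
// from each side, so the manual coalesce (Hive style) works.
val result = dataframe1.as("d1")
  .join(dataframe2.as("d2"),
    col("d1.id") === col("d2.id") && col("d1.name") === col("d2.name"),
    "full")
  .selectExpr("coalesce(d1.id, d2.id) AS id",
              "coalesce(d1.name, d2.name) AS name",
              "d1.age", "d2.value")
result.show()
```

The trade-off: you get full control over the key columns, but you have to write the coalesce yourself, exactly what the Seq(...) form does for you automatically.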