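A join on a Spark DataFrame works much like a join in SQL: inner join, left join, right join, and full join are all available.
So how does the join method select among the different join types?
Look at its signatures: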
def join(right : DataFrame, usingColumns : Seq[String], joinType : String) : DataFrame
def join(right : DataFrame, joinExprs : Column, joinType : String) : DataFrame
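joinType can be "inner", "left", "right", or "full", corresponding to inner join, left join, right join, and full join respectively. When joinType is omitted (the overloads that take only the right DataFrame and the join columns or expression), the join defaults to "inner", i.e. an inner join.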
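As a minimal sketch of the two overloads above, the following builds two tiny DataFrames and joins them both ways. It assumes Spark 2.x or later (hence SparkSession, and a local master); the object name JoinTypeSketch, the helper sample, and the columns k, l, r are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinTypeSketch {
  // Hypothetical sample data: the two frames share only the key k = 2.
  def sample(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val lhs = Seq((1, "x"), (2, "y")).toDF("k", "l")
    val rhs = Seq((2, "p"), (3, "q")).toDF("k", "r")
    (lhs, rhs)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-type-sketch")
      .getOrCreate()
    val (lhs, rhs) = sample(spark)

    // usingColumns overload: the join column k appears once in the output.
    lhs.join(rhs, Seq("k"), "inner").show()

    // joinExprs overload: both k columns are kept in the output.
    lhs.join(rhs, lhs("k") === rhs("k"), "left").show()

    // No joinType argument: behaves as an inner join.
    lhs.join(rhs, Seq("k")).show()

    spark.stop()
  }
}
```

The usingColumns form is convenient when both sides use the same column names; the joinExprs form is needed when the names differ or the condition is not a plain equality.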
Example:
Table a          Table b
id  name         id  job  parent_id
1   张3          1   23   1
2   李四         2   34   2
3   王武         3   34   4
1. Inner join
vvDf.join(vvtDf, Seq("city", "state"), "inner").show
vvDf.join(vvtDf, Seq("city", "state")).show
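The Seq lists the columns to join on. The join is roughly equivalent to the following SQL (the DataFrame call itself neither selects specific columns nor sorts the result):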
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a INNER JOIN publishers AS p
ON a.city = p.city
AND a.state = p.state
ORDER BY a.au_lname ASC, a.au_fname ASC
For tables a and b above, joined on a.id = b.parent_id, the result is:
id  name  id  job  parent_id
1   张3   1   23   1
2   李四  2   34   2
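The inner join on the sample tables can be sketched as follows. The join condition a.id = b.parent_id is an assumption inferred from the result rows (row 3 of table a has no match because table b has no parent_id of 3); the object name InnerJoinSketch, the helper tables, and the use of SparkSession (Spark 2.x or later) are likewise illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InnerJoinSketch {
  // Tables a and b from the example, built in memory.
  def tables(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val a = Seq((1, "张3"), (2, "李四"), (3, "王武")).toDF("id", "name")
    val b = Seq((1, 23, 1), (2, 34, 2), (3, 34, 4)).toDF("id", "job", "parent_id")
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inner-join-sketch")
      .getOrCreate()
    val (a, b) = tables(spark)

    // Assumed condition: a.id = b.parent_id. The column names differ,
    // so the joinExprs overload is used instead of a Seq of names.
    val joined = a.join(b, a("id") === b("parent_id"), "inner")

    // Two rows survive: ids 1 and 2 of table a.
    joined.show()

    spark.stop()
  }
}
```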
2. Left outer join
vvDf.join(vvtDf, Seq("city", "state"), "left").show
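For tables a and b above, joined on a.id = b.parent_id, the result is:
id  name  id    job   parent_id
1   张3   1     23    1
2   李四  2     34    2
3   王武  null  null  null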
The other join types work the same way.
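To make the remaining join types concrete, here is a small sketch that runs left, right, and full joins over the same sample tables a and b. As before, the condition a.id = b.parent_id is an assumption inferred from the example results, and OuterJoinSketch, tables, and joinAs are hypothetical names; SparkSession implies Spark 2.x or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OuterJoinSketch {
  // Tables a and b from the example, built in memory.
  def tables(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val a = Seq((1, "张3"), (2, "李四"), (3, "王武")).toDF("id", "name")
    val b = Seq((1, 23, 1), (2, 34, 2), (3, 34, 4)).toDF("id", "job", "parent_id")
    (a, b)
  }

  // Same assumed condition for every join type; only joinType varies.
  def joinAs(a: DataFrame, b: DataFrame, joinType: String): DataFrame =
    a.join(b, a("id") === b("parent_id"), joinType)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("outer-join-sketch")
      .getOrCreate()
    val (a, b) = tables(spark)

    // left: every row of a is kept; unmatched b columns are null.
    joinAs(a, b, "left").show()
    // right: every row of b is kept; unmatched a columns are null.
    joinAs(a, b, "right").show()
    // full: rows from both sides are kept, matched where possible.
    joinAs(a, b, "full").show()

    spark.stop()
  }
}
```

With this data the left and right joins each return three rows, and the full join returns four: the two matched pairs, the unmatched row of a, and the unmatched row of b.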