Using the Various JOINs in Spark Core
1. inner join
An inner join returns only the records whose keys match on both sides.
>>> data2 = sc.parallelize(range(6,15)).map(lambda line:(line,1))
>>> data2.collect()
[(6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
>>> data1 = sc.parallelize(range(10)).map(lambda line:(line,1))
>>> data1.collect()
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
>>> data1.join(data2)
PythonRDD[14] at RDD at PythonRDD.scala:43
>>> data1.join(data2).collect()
[(6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]
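Note that join by itself is a lazy transformation: it only returns a new RDD (the PythonRDD object shown above), and nothing is computed until an action such as collect() runs. Also, when a key repeats on either side, join emits one output pair per matching combination (a per-key Cartesian product). A minimal sketch in the same pyspark shell session, using two throwaway RDDs a and b:
>>> a = sc.parallelize([(1, 'x'), (1, 'y'), (2, 'z')])
>>> b = sc.parallelize([(1, 10), (1, 20)])
>>> sorted(a.join(b).collect())
[(1, ('x', 10)), (1, ('x', 20)), (1, ('y', 10)), (1, ('y', 20))]
Key 2 exists only in a, so it is dropped, while key 1 produces 2 x 2 = 4 output pairs.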
2. left outer join
left: the left RDD is the anchor, and the result leans on the left side.
Every record from the left RDD (a) is kept; where the right RDD (b) has a matching key its value is filled in (Some(x) in the Scala API), and where it does not, the slot is padded with None.
>>> data1.leftOuterJoin(data2).collect()
[(0, (1, None)), (1, (1, None)), (2, (1, None)), (3, (1, None)), (4, (1, None)), (5, (1, None)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]
>>> data2.leftOuterJoin(data1).collect()
[(6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1)), (10, (1, None)), (11, (1, None)), (12, (1, None)), (13, (1, None)), (14, (1, None))]
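A common follow-up to leftOuterJoin is replacing the None slots with a default value via mapValues. A minimal sketch, assuming the same shell session; sorted() is only there to make the output order deterministic:
>>> sorted(data1.leftOuterJoin(data2).mapValues(lambda v: (v[0], v[1] if v[1] is not None else 0)).collect())
[(0, (1, 0)), (1, (1, 0)), (2, (1, 0)), (3, (1, 0)), (4, (1, 0)), (5, (1, 0)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]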
3. right outer join
right: the right RDD is the anchor, and the result leans on the right side.
>>> data1.rightOuterJoin(data2).collect()
[(6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1)), (10, (None, 1)), (11, (None, 1)), (12, (None, 1)), (13, (None, 1)), (14, (None, 1))]
>>> data2.rightOuterJoin(data1).collect()
[(0, (None, 1)), (1, (None, 1)), (2, (None, 1)), (3, (None, 1)), (4, (None, 1)), (5, (None, 1)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]
Every record from the right RDD (b) is kept; where the left RDD (a) has a matching key its value is filled in, and where it does not, the slot is padded with None.
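The two directions are mirror images: a.rightOuterJoin(b) yields the same keys as b.leftOuterJoin(a), just with the value positions swapped. A quick check in the same session:
>>> left = data2.leftOuterJoin(data1).mapValues(lambda v: (v[1], v[0]))
>>> sorted(left.collect()) == sorted(data1.rightOuterJoin(data2).collect())
True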
4. full outer join
A full outer join keeps every key from either side, padding the missing side with None. Note: before using any JOIN, be clear about what the structure of the joined records will be.
>>> data1.fullOuterJoin(data2).collect()
[(0, (1, None)), (1, (1, None)), (2, (1, None)), (3, (1, None)), (4, (1, None)), (5, (1, None)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1)), (10, (None, 1)), (11, (None, 1)), (12, (None, 1)), (13, (None, 1)), (14, (None, 1))]
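Because fullOuterJoin pads the missing side with None, its output can be filtered to classify keys as present only on the left, only on the right, or on both. A minimal sketch in the same session:
>>> full = data1.fullOuterJoin(data2)
>>> sorted(full.filter(lambda kv: kv[1][1] is None).keys().collect())  # keys only in data1
[0, 1, 2, 3, 4, 5]
>>> sorted(full.filter(lambda kv: kv[1][0] is None).keys().collect())  # keys only in data2
[10, 11, 12, 13, 14]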
5. union
union simply concatenates the two RDDs; it does not match keys and does not remove duplicates.
>>> data1.union(data2).collect()
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
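Unlike SQL's UNION, RDD union keeps duplicates (note that keys 6 through 9 appear twice above); chain distinct() for set semantics. A sketch in the same session:
>>> sorted(data1.union(data2).distinct().collect())
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]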
Reference: https://blog.csdn.net/wawa8899/article/details/81027633