Using JOIN in Spark Core
Note: parallelize expects a single Seq, so all the pair tuples must be wrapped inside one Array: Array((…), (…), …). Passing them as separate arguments fails:
scala> val a = sc.parallelize(Array("A","a1"),("B","b1"),("C","c1"),("D","d1"),("E","e1"))
<console>:24: error: too many arguments for method parallelize: (seq: Seq[T], numSlices: Int)(implicit evidence$1: scala.reflect.ClassTag[T])org.apache.spark.rdd.RDD[T]
val a = sc.parallelize(Array("A","a1"),("B","b1"),("C","c1"),("D","d1"),("E","e1"))
^
1、scala> val a = sc.parallelize(Array(("A","a1"),("C","c1"),("D","d1"),("F","f1"),("F","f2")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[28] at parallelize at <console>:24
2、scala> val b = sc.parallelize(Array(("A","a2"),("C","c2"),("C","c3"),("D","d2"),("E","e1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[29] at parallelize at <console>:24
Contents:
a: (A,a1),(C,c1),(D,d1),(F,f1),(F,f2)
b: (A,a2),(C,c2),(C,c3),(D,d2),(E,e1)
3、scala> a.join(b).collect
res9: Array[(String, (String, String))] = Array((D,(d1,d2)), (A,(a1,a2)), (C,(c1,c2)), (C,(c1,c3)))
------------------------------------------------------------------
The key (e.g. A) is the join condition. a has one C and b has two, so the result contains two C rows; a has F but b does not, so F cannot be matched.
An inner join returns only the keys that match on both sides.
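The inner-join semantics above can be sketched on plain Scala collections, with no SparkContext needed. The helper `innerJoin` below is hypothetical (not a Spark API); it only mirrors what `a.join(b)` produces for the same data:

```scala
// Same pairs as the RDDs a and b above.
val a = Seq(("A","a1"), ("C","c1"), ("D","d1"), ("F","f1"), ("F","f2"))
val b = Seq(("A","a2"), ("C","c2"), ("C","c3"), ("D","d2"), ("E","e1"))

// Inner join: for every key present on both sides,
// emit the cross product of left and right values.
def innerJoin(l: Seq[(String, String)], r: Seq[(String, String)]): Seq[(String, (String, String))] =
  for ((k, lv) <- l; (k2, rv) <- r; if k == k2) yield (k, (lv, rv))

// C appears twice ((C,(c1,c2)) and (C,(c1,c3))); F and E have no match and are dropped.
innerJoin(a, b).foreach(println)
```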
4、leftOuterJoin
scala> a.leftOuterJoin(b).collect
res2: Array[(String, (String, Option[String]))] = Array((F,(f1,None)), (F,(f2,None)), (D,(d1,Some(d2))), (A,(a1,Some(a2))), (C,(c1,Some(c2))), (C,(c1,Some(c3))))
Observation: the left side is the base. Every pair on the left is returned, joined against the right; where a left key has no match on the right, the right value is None.
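The left-outer behavior can also be sketched on plain collections. `leftOuterJoin` here is a hypothetical helper (not Spark's), kept to the same name and result type to match the transcript:

```scala
// Same pairs as the RDDs a and b above.
val a = Seq(("A","a1"), ("C","c1"), ("D","d1"), ("F","f1"), ("F","f2"))
val b = Seq(("A","a2"), ("C","c2"), ("C","c3"), ("D","d2"), ("E","e1"))

// Left outer join: keep every left pair; the right value becomes
// Option[String] -- Some on a match, None when the key is missing on the right.
def leftOuterJoin(l: Seq[(String, String)], r: Seq[(String, String)]): Seq[(String, (String, Option[String]))] =
  l.flatMap { case (k, lv) =>
    val matches = r.collect { case (`k`, rv) => rv }
    if (matches.isEmpty) Seq((k, (lv, None)))
    else matches.map(rv => (k, (lv, Some(rv))))
  }

// F stays as (F,(f1,None)); E (right-only) never appears.
leftOuterJoin(a, b).foreach(println)
```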
5、rightOuterJoin uses the right side as the base
scala> a.rightOuterJoin(b).collect
res4: Array[(String, (Option[String], String))] = Array((D,(Some(d1),d2)), (A,(Some(a1),a2)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(None,e1)))
6、fullOuterJoin
scala> a.fullOuterJoin(b).collect
res2: Array[(String, (Option[String], Option[String]))] = Array((F,(Some(f1),None)), (F,(Some(f2),None)), (D,(Some(d1),Some(d2))), (A,(Some(a1),Some(a2))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (E,(None,Some(e1))))
The data type of a.fullOuterJoin(b):
org.apache.spark.rdd.RDD[(String, (Option[String], Option[String]))]
Note: always be clear about the data type produced by each operation.
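The full-outer behavior (both sides wrapped in Option) can be sketched the same way. `fullOuterJoin` is again a hypothetical plain-Scala helper mirroring the transcript, which makes the RDD[(String, (Option[String], Option[String]))] type above easy to see:

```scala
// Same pairs as the RDDs a and b above.
val a = Seq(("A","a1"), ("C","c1"), ("D","d1"), ("F","f1"), ("F","f2"))
val b = Seq(("A","a2"), ("C","c2"), ("C","c3"), ("D","d2"), ("E","e1"))

// Full outer join: every key from either side appears; a missing side is None,
// and keys present on both sides produce the cross product of their values.
def fullOuterJoin(l: Seq[(String, String)], r: Seq[(String, String)]): Seq[(String, (Option[String], Option[String]))] = {
  val keys = (l.map(_._1) ++ r.map(_._1)).distinct
  keys.flatMap { k =>
    val lvs = l.collect { case (`k`, v) => v }
    val rvs = r.collect { case (`k`, v) => v }
    if (lvs.isEmpty)      rvs.map(rv => (k, (None, Some(rv))))
    else if (rvs.isEmpty) lvs.map(lv => (k, (Some(lv), None)))
    else                  for (lv <- lvs; rv <- rvs) yield (k, (Some(lv), Some(rv)))
  }
}

// Left-only F -> (Some(f),None); right-only E -> (None,Some(e1)); 7 rows in total.
fullOuterJoin(a, b).foreach(println)
```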