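Join operations in Spark
The join operations are methods of the PairRDDFunctions class. For details, see the official API documentation: http://spark.apache.org/docs/latest/api/scala/index.html
RDDs support many kinds of joins. The most common ones are introduced below: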
(1)join
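Anyone familiar with SQL will recognize join. The join here behaves much like SQL's inner join: the result contains only the records whose keys match in both the first and the second collection, and records that cannot be matched are dropped.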
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.
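A concrete example:
val a = sc.parallelize(Array(("1", 4.0), ("2", 8.0), ("3", 9.0)))
val b = sc.parallelize(Array(("1", 2.0), ("2", 8.0)))
val c = a.join(b)
c.foreach(println)
// Printed output (order may vary):
// (2,(8.0,8.0))
// (1,(4.0,2.0))
// Key 3 has no match in b, so it is filtered out; the values of the matching keys are combined.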
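The API description says join returns "all pairs of elements with matching keys", which matters when a key occurs more than once. The following is a minimal sketch of that case, not part of the original example: the object name JoinDuplicateKeys and the data are made up for illustration, and it assumes a local master so it can run outside spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; the data is made up for illustration.
object JoinDuplicateKeys {
  // Returns every (k, (v, w)) combination for keys present in both RDDs.
  def run(sc: SparkContext): Array[(String, (Double, Double))] = {
    val a = sc.parallelize(Seq(("1", 4.0), ("1", 5.0), ("2", 8.0)))
    val b = sc.parallelize(Seq(("1", 2.0), ("1", 3.0)))
    a.join(b).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch does not need a cluster.
    val conf = new SparkConf().setAppName("join-duplicate-keys").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Key "1" appears twice on each side, so it yields 2 x 2 = 4 pairs;
      // key "2" exists only in a and is dropped.
      run(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With duplicate keys the result grows multiplicatively per key, so it can be worth checking key cardinality before joining large RDDs.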
(2)leftOuterJoin
leftOuterJoin is similar to SQL's left outer join: the result is driven by the first RDD, and a record with no match in the second RDD gets an empty value (None).
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
Perform a left outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k. Hash-partitions the output using the existing partitioner/parallelism level.
A concrete example:
val a = sc.parallelize(Array(("1", 4.0), ("2", 8.0), ("3", 9.0)))
val b = sc.parallelize(Array(("1", 2.0), ("2", 8.0)))
val c = a.leftOuterJoin(b)
c.foreach(println)
// Printed output (order may vary):
// (2,(8.0,Some(8.0)))
// (3,(9.0,None))
// (1,(4.0,Some(2.0)))
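Because the right-hand value comes back as an Option, it usually has to be unwrapped before further computation. The sketch below shows one possible way to do that with mapValues and getOrElse, reusing the data from the example above. The object name LeftOuterJoinDefaults, the choice of 0.0 as the default, and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; the default of 0.0 is an illustrative choice.
object LeftOuterJoinDefaults {
  // Sums the two values per key, treating a missing right-hand value as 0.0.
  def run(sc: SparkContext): scala.collection.Map[String, Double] = {
    val a = sc.parallelize(Seq(("1", 4.0), ("2", 8.0), ("3", 9.0)))
    val b = sc.parallelize(Seq(("1", 2.0), ("2", 8.0)))
    a.leftOuterJoin(b)
      .mapValues { case (v, w) => v + w.getOrElse(0.0) }
      .collectAsMap()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch does not need a cluster.
    val conf = new SparkConf().setAppName("left-outer-join-defaults").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Key "3" has no partner in b, so only its own value 9.0 remains.
      run(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Pattern matching on Some and None works as well when the two cases need different handling rather than a single default.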
(3)rightOuterJoin
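rightOuterJoin is similar to SQL's right outer join: the result is driven by the second RDD, that is, the one passed as the argument, and a record with no match in the first RDD gets an empty value (None).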
def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k. Hash-partitions the resulting RDD using the existing partitioner/parallelism level.
A concrete example:
val a = sc.parallelize(Array(("1", 4.0), ("2", 8.0), ("3", 9.0)))
val b = sc.parallelize(Array(("1", 2.0), ("2", 8.0)))
val c = a.rightOuterJoin(b)
c.foreach(println)
// Printed output (order may vary):
// (2,(Some(8.0),8.0))
// (1,(Some(4.0),2.0))
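In the example above every key of b also exists in a, so the None case never shows up. The following sketch uses different, made-up data in which b contains a key that a lacks. The object name RightOuterJoinNone and the local master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; the data is made up for illustration.
object RightOuterJoinNone {
  // Keeps every key of b; the left-hand value is None when a has no such key.
  def run(sc: SparkContext): scala.collection.Map[String, (Option[Double], Double)] = {
    val a = sc.parallelize(Seq(("1", 4.0), ("2", 8.0)))
    val b = sc.parallelize(Seq(("1", 2.0), ("4", 7.0)))
    a.rightOuterJoin(b).collectAsMap()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch does not need a cluster.
    val conf = new SparkConf().setAppName("right-outer-join-none").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Key "4" exists only in b and is paired with None;
      // key "2" exists only in a and is dropped.
      run(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```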