The join operation connects two pair RDDs, matching records that share the same key.
For each matching key, the result is a record of the form (key, (value1, value2)), where value1 comes from the first RDD and value2 from the second.
Unlike a primary key in a database, which must be unique across records, the key in a pair RDD may repeat; when it does, join emits one output record for every combination of matching values. Keys that appear in only one of the two RDDs are dropped (an inner join).
The code demonstration below runs in spark-shell, using the following two input files:
pair_test
1,11,111,1111
2,22,222,2222
3,33,333,3333
4,44,444,4444
5,55,555,5555
6,66,666,6666
pair_test2
1,aa,aaa
2,bb,bbb
c,cc,ccc
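Before walking through the files above, the join semantics can be sketched with small in-memory RDDs (a minimal sketch, assuming a running spark-shell where sc is in scope):

```scala
// Repeated keys: join emits every combination of matching values,
// and keys present in only one RDD (here "x") are dropped.
val a = sc.parallelize(Seq(("k", 1), ("k", 2), ("x", 9)))
val b = sc.parallelize(Seq(("k", "A"), ("k", "B")))
a.join(b).collect()
// yields four records for key "k" (element order may vary):
// Array((k,(1,A)), (k,(1,B)), (k,(2,A)), (k,(2,B)))
```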
scala> val data1=sc.textFile("/home/wangtuntun/pair_test").map(_.split(","))
data1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:27
scala> val data2=sc.textFile("/home/wangtuntun/pair_test2").map(_.split(","))
data2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:27
scala> val pair1=data1.map( x=>((x(0)),(x(1),x(2),x(3))) )
pair1: org.apache.spark.rdd.RDD[(String, (String, String, String))] = MapPartitionsRDD[6] at map at <console>:29
scala> val pair2=data2.map( x=>((x(0)),(x(1),x(2))) )
pair2: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[7] at map at <console>:29
scala> val join=pair1.join(pair2)
join: org.apache.spark.rdd.RDD[(String, ((String, String, String), (String, String)))] = MapPartitionsRDD[10] at join at <console>:35
The filter below compares the first field of each value tuple as strings (lexicographic order), keeping records where pair1's first value sorts before pair2's first value:
scala> val filter=join.filter( x=> (x._2._1._1)<(x._2._2._1) )
filter: org.apache.spark.rdd.RDD[(String, ((String, String, String), (String, String)))] = MapPartitionsRDD[11] at filter at <console>:37
scala> filter.take(5)
res7: Array[(String, ((String, String, String), (String, String)))] = Array((1,((11,111,1111),(aa,aaa))), (2,((22,222,2222),(bb,bbb))))
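Note that the record with key c from pair_test2 never appears in the results: join is an inner join, so keys found in only one RDD are dropped. To keep them, pair RDDs also offer leftOuterJoin, rightOuterJoin, and fullOuterJoin. A minimal sketch, reusing the pair1 and pair2 defined above:

```scala
// leftOuterJoin keeps every key of the left RDD (pair2 here);
// unmatched keys carry None in the right slot.
val outer = pair2.leftOuterJoin(pair1)
outer.collect()
// e.g. (c,((cc,ccc),None)) appears alongside
//      (1,((aa,aaa),Some((11,111,1111))))
```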