// Use the join operator to relate two pair RDDs.
// join matches elements by key and returns a new JavaPairRDD.
// The first generic type of the result is the key type shared by the two input
// JavaPairRDDs, since the join is performed on the key.
// The second generic type is Tuple2<V1, V2>, whose two generics are the value
// types of the original RDDs.
// Each element of the returned RDD is one pair that was matched up by key.
public static void myJoin() {
    SparkConf conf = new SparkConf()
            .setAppName("join")
            .setMaster("local");
    // Create the JavaSparkContext
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Tuple2<Integer, String>> studentList = Arrays.asList(
            new Tuple2<Integer, String>(1, "leo"),
            new Tuple2<Integer, String>(2, "jack"),
            new Tuple2<Integer, String>(3, "tom"));
    List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
            new Tuple2<Integer, Integer>(2, 90),
            new Tuple2<Integer, Integer>(1, 100),
            new Tuple2<Integer, Integer>(3, 60));

    JavaPairRDD<Integer, String> students = sc.parallelizePairs(studentList);
    JavaPairRDD<Integer, Integer> scores = sc.parallelizePairs(scoreList);

    // Inner join by key: each result value is a Tuple2<name, score>
    JavaPairRDD<Integer, Tuple2<String, Integer>> studentScores = students.join(scores);

    studentScores.foreach(
            new VoidFunction<Tuple2<Integer, Tuple2<String, Integer>>>() {
                private static final long serialVersionUID = 1L;

                @Override
                public void call(Tuple2<Integer, Tuple2<String, Integer>> t)
                        throws Exception {
                    System.out.println("student id: " + t._1);
                    System.out.println("student name: " + t._2._1);
                    System.out.println("student score: " + t._2._2);
                    System.out.println("===============================");
                }
            });

    sc.close();
}
Output:
student id: 1
student name: leo
student score: 100
===============================
student id: 3
student name: tom
student score: 60
===============================
student id: 2
student name: jack
student score: 90
===============================
What happens if an element is deleted from one of the lists (here, the score tuple for key 1)?
As the output shows, only the pairs whose key is found in both RDDs are returned, because join is an inner join:
student id: 3
student name: tom
student score: 60
===============================
student id: 2
student name: jack
student score: 90
===============================
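The inner-join semantics above can be mimicked with plain Java maps, without Spark. This is a hypothetical sketch (the class and method names are my own, not part of the Spark API): a key that is missing from either side simply does not appear in the result, which is why deleting one score tuple also drops that student.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinSketch {
    // Mimic RDD join semantics with plain maps: only keys present in
    // BOTH inputs appear in the result (an inner join).
    static Map<Integer, Map.Entry<String, Integer>> join(
            Map<Integer, String> names, Map<Integer, Integer> scores) {
        Map<Integer, Map.Entry<String, Integer>> out = new HashMap<>();
        for (Map.Entry<Integer, String> e : names.entrySet()) {
            Integer score = scores.get(e.getKey());
            if (score != null) { // key missing on one side -> pair is dropped
                out.put(e.getKey(), Map.entry(e.getValue(), score));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> names = Map.of(1, "leo", 2, "jack", 3, "tom");
        // The score for key 1 has been removed, as in the second run above.
        Map<Integer, Integer> scores = Map.of(2, 90, 3, 60);

        Map<Integer, Map.Entry<String, Integer>> joined = join(names, scores);
        // leo (key 1) has no score, so he is absent from the join result.
        System.out.println(joined.containsKey(1));
        System.out.println(joined.get(2));
        System.out.println(joined.get(3));
    }
}
```

If you want to keep the unmatched students instead of dropping them, Spark's JavaPairRDD also offers leftOuterJoin, whose result values wrap the right-hand side in an Optional so missing scores can be detected.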