All transformations are lazy: submitting a transformation on its own does not run any computation. Computation is triggered only when an action is submitted.
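The lazy behavior can be illustrated without a Spark cluster. The sketch below is only an analogy: it uses `java.util.stream` from the JDK, not Spark, where `map` plays the role of a transformation and `collect` the role of an action. The names `LazySketch`, `visited` and `buildPipeline` are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazySketch {
    // Records every element the mapping function actually touches.
    static final List<Integer> visited = new ArrayList<>();

    // Declares the "transformation" only; no element is processed here.
    static Stream<Integer> buildPipeline(List<Integer> input) {
        return input.stream().map(n -> {
            visited.add(n);
            return n * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> doubled = buildPipeline(Arrays.asList(1, 2, 3));
        // The pipeline has been declared but not run, so visited is still empty.
        System.out.println("after map: visited = " + visited);

        // The terminal operation (the "action" in this analogy) runs the pipeline.
        List<Integer> result = doubled.collect(Collectors.toList());
        System.out.println("after collect: visited = " + visited + ", result = " + result);
    }
}
```

In Spark the same pattern holds for calls such as `mapToPair`, `join`, `union` and `distinct`: they only describe the computation, and nothing executes until an action is called on the resulting RDD.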
Analyzing the join rules
1. join example code snippet
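import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
// tc: the initial edge set, cached because it is reused below (ctx is assumed to be a JavaSparkContext).
JavaPairRDD<Integer, Integer> tc = ctx.parallelizePairs(generateGraph(), slices).cache();
// edges: every pair of tc reversed, so (a,b) becomes (b,a).
JavaPairRDD<Integer, Integer> edges = tc.mapToPair(new PairFunction<Tuple2<Integer, Integer>, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> e) throws Exception {
        return new Tuple2<Integer, Integer>(e._2(), e._1());
    }
});
// Inner join on the key: yields (key, (value from tc, value from edges)).
JavaPairRDD<Integer, Tuple2<Integer, Integer>> tcJoinEdges = tc.join(edges);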
Data trace analysis
tc dataset:
(x,y)
(y,z)
(x,z)
(w,y)
edges data:
(y,x)
(z,y)
(z,x)
(y,w)
tc.join(edges) result:
(y,(z,x))
(y,(z,w))
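The trace above can be reproduced by hand with plain Java collections. This is a minimal sketch of inner-join semantics, not Spark's implementation; `JoinSketch`, `pair`, `reverse` and `join` are hypothetical names, and strings stand in for the integer vertex ids so the output matches the symbols used in the trace.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

class JoinSketch {
    // Hypothetical helper: builds one (key, value) pair.
    static Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<String, String>(key, value);
    }

    // Swaps key and value of every pair, like the mapToPair step that builds edges.
    static List<Entry<String, String>> reverse(List<Entry<String, String>> pairs) {
        List<Entry<String, String>> out = new ArrayList<>();
        for (Entry<String, String> p : pairs) {
            out.add(pair(p.getValue(), p.getKey()));
        }
        return out;
    }

    // Inner join on key: emits (key, (leftValue, rightValue)) for every
    // combination of a left pair and a right pair that share the same key.
    static List<Entry<String, Entry<String, String>>> join(
            List<Entry<String, String>> left, List<Entry<String, String>> right) {
        List<Entry<String, Entry<String, String>>> out = new ArrayList<>();
        for (Entry<String, String> l : left) {
            for (Entry<String, String> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add(new SimpleEntry<String, Entry<String, String>>(
                            l.getKey(), pair(l.getValue(), r.getValue())));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> tc = Arrays.asList(
                pair("x", "y"), pair("y", "z"), pair("x", "z"), pair("w", "y"));
        List<Entry<String, String>> edges = reverse(tc);
        // SimpleEntry prints as key=value, so (y,(z,x)) appears as y=z=x.
        System.out.println(join(tc, edges));
    }
}
```

Only the key y occurs in both datasets: tc contributes (y,z) and edges contributes (y,x) and (y,w), which gives the two joined records. The nested loop is chosen for readability; a real Spark join shuffles data by key and does not guarantee the order of the output.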
2. mapToPair example code snippet
JavaPairRDD<Integer, Integer> mapToPaired = tcJoinEdges.mapToPair(ProjectFn.INSTANCE);
// Maps a joined record (key, (tcValue, edgesValue)) to (edgesValue, tcValue), dropping the join key.
static class ProjectFn implements PairFunction<Tuple2<Integer, Tuple2<Integer, Integer>>, Integer, Integer> {
    static final ProjectFn INSTANCE = new ProjectFn();
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Tuple2<Integer, Integer>> triple) throws Exception {
        return new Tuple2<Integer, Integer>(triple._2()._2(), triple._2()._1());
    }
}
mapToPaired result set:
(x,z)
(w,z)
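The projection is easy to misread because of the nested `_2()` accessors, so here is a small stand-alone sketch of the same mapping in plain Java. `ProjectSketch` and `project` are hypothetical names, and `Map.Entry` stands in for Scala's `Tuple2`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

class ProjectSketch {
    // Takes a joined record (key, (left, right)) and returns (right, left).
    // The join key itself is dropped.
    static Entry<String, String> project(Entry<String, Entry<String, String>> joined) {
        Entry<String, String> values = joined.getValue();
        return new SimpleEntry<String, String>(values.getValue(), values.getKey());
    }

    public static void main(String[] args) {
        Entry<String, Entry<String, String>> joined =
                new SimpleEntry<String, Entry<String, String>>(
                        "y", new SimpleEntry<String, String>("z", "x"));
        // SimpleEntry prints as key=value, so (x,z) appears as x=z.
        System.out.println(project(joined));
    }
}
```

For the record (y,(z,x)), y is the shared middle vertex: there is an edge from x to y and an edge from y to z, so the projection emits the two-step connection (x,z).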
3. union example code snippet
JavaPairRDD<Integer,Integer> unioned = tc.union(mapToPaired);
unioned result set:
(x,y)
(y,z)
(x,z)
(w,y)
(x,z)
(w,z)
4. Code snippet for removing duplicates
JavaPairRDD<Integer,Integer> distincted = unioned.distinct().cache();
distincted result set:
(x,y)
(y,z)
(x,z)
(w,y)
(w,z)
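The union and distinct steps can be checked the same way with plain Java collections. This is a rough sketch of the semantics only, not of how Spark executes them; `UnionDistinctSketch`, `union` and `distinct` are hypothetical names, and each pair is written as a string for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class UnionDistinctSketch {
    // union: plain concatenation, duplicates are kept.
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // distinct: drops repeated elements. LinkedHashSet keeps first-seen order
    // here for readability; Spark's distinct() gives no ordering guarantee.
    static List<String> distinct(List<String> pairs) {
        return new ArrayList<>(new LinkedHashSet<>(pairs));
    }

    public static void main(String[] args) {
        List<String> tc = Arrays.asList("(x,y)", "(y,z)", "(x,z)", "(w,y)");
        List<String> mapToPaired = Arrays.asList("(x,z)", "(w,z)");

        List<String> unioned = union(tc, mapToPaired);
        System.out.println(unioned.size());     // 6, because (x,z) appears twice
        System.out.println(distinct(unioned));  // [(x,y), (y,z), (x,z), (w,y), (w,z)]
    }
}
```

The pair (x,z) is already in tc and is produced again by the projection, which is why the union holds six records and distinct brings it back to five.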