Data format: <user1>,<friend1>,<friend2>,...,<friendN>
Each line holds one user followed by that user's friends, for example:
aa,bb,cc,dd,ee
bb,aa,dd,ee
cc,aa
dd,aa,bb
ee,aa,bb
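flatMapToPair turns each line into one record per friend, in the form <user1friend_i, friend1,friend2,...,friendN>: the key is the user name joined with one friend's name, and the value is the user's whole friend list. For the data above this gives: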
<aabb, bb,cc,dd,ee>
<aacc, bb,cc,dd,ee>
<aadd, bb,cc,dd,ee>
<aaee, bb,cc,dd,ee>
<aabb, aa,dd,ee>
<bbdd, aa,dd,ee>
<bbee, aa,dd,ee>
<aacc, aa>
<aadd, aa,bb>
<bbdd, aa,bb>
<aaee, aa,bb>
<bbee, aa,bb>
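The expansion step can be tried without Spark. The following is a minimal plain-Java sketch, not the Spark job itself; the class name PairExpansionSketch and the helpers pairKey and expand are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PairExpansionSketch {
    // Hypothetical helper: put the smaller name first so both orderings give one key.
    static String pairKey(String a, String b) {
        return a.compareTo(b) < 0 ? a + b : b + a;
    }

    // Expand one line "user,friend1,...,friendN" into {key, friendList} records.
    static List<String[]> expand(String line) {
        List<String[]> out = new ArrayList<>();
        int comma = line.indexOf(',');
        if (comma < 0) {
            return out; // a user with no friends produces no records
        }
        String[] parts = line.split(",");
        String friendList = line.substring(comma + 1);
        for (int i = 1; i < parts.length; i++) {
            out.add(new String[] { pairKey(parts[0], parts[i]), friendList });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : expand("bb,aa,dd,ee")) {
            System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        }
    }
}
```

For the line bb,aa,dd,ee this prints <aabb, aa,dd,ee>, <bbdd, aa,dd,ee> and <bbee, aa,dd,ee>, one record per friend.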
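When two records share a key, the intersection of their values is the set of common friends of those two users.
Note: <aabb> and <bbaa> denote the same pair, so the two names in a key must always be joined in a fixed order (here, the lexicographically smaller name first).
The code is as follows: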
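The grouping and intersection can also be checked locally before running it on Spark. This is a rough sketch under assumptions: it uses plain collections and Set.retainAll instead of an RDD, and the names CommonFriendsSketch, pairKey and commonFriends are hypothetical. A pair that appears in only one friend list (a one-sided friendship) gets an empty result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CommonFriendsSketch {
    // Hypothetical helper: smaller name first, so (aa, bb) and (bb, aa) share a key.
    static String pairKey(String a, String b) {
        return a.compareTo(b) < 0 ? a + b : b + a;
    }

    static Map<String, Set<String>> commonFriends(List<String> lines) {
        // Step 1: group every friend list under each pair key it belongs to.
        Map<String, List<Set<String>>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            Set<String> friends =
                new TreeSet<>(Arrays.asList(parts).subList(1, parts.length));
            for (String friend : friends) {
                grouped.computeIfAbsent(pairKey(parts[0], friend), k -> new ArrayList<>())
                       .add(friends);
            }
        }
        // Step 2: intersect the lists collected for each key.
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Set<String>>> e : grouped.entrySet()) {
            List<Set<String>> lists = e.getValue();
            Set<String> common = new TreeSet<>();
            if (lists.size() > 1) {
                common.addAll(lists.get(0));
                for (Set<String> other : lists.subList(1, lists.size())) {
                    common.retainAll(other);
                }
            }
            result.put(e.getKey(), common);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "aa,bb,cc,dd,ee", "bb,aa,dd,ee", "cc,aa", "dd,aa,bb", "ee,aa,bb");
        commonFriends(lines).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

On the sample data this reports dd and ee as the common friends of aa and bb, bb for the pairs aadd and aaee, aa for bbdd and bbee, and nothing for aacc.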
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class CommonFrends {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("commonFrends").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> javaRDD = sc.textFile("/opt/hadoop/frends.txt");

        // Emit one <userPair, friendList> record per friend on each line.
        JavaPairRDD<String, String> pairRDD = javaRDD.flatMapToPair(x -> {
            List<Tuple2<String, String>> l = new ArrayList<Tuple2<String, String>>();
            String[] frends = x.split(",");
            // Friend list = everything after the first comma.
            int comma = x.indexOf(',');
            String frendList = comma < 0 ? "" : x.substring(comma + 1);
            for (int i = 1; i < frends.length; i++) {
                // Smaller name first, so <aabb> and <bbaa> become the same key.
                if (frends[0].compareTo(frends[i]) < 0) {
                    l.add(new Tuple2<String, String>(frends[0] + frends[i], frendList));
                } else {
                    l.add(new Tuple2<String, String>(frends[i] + frends[0], frendList));
                }
            }
            return l.iterator();
        }).persist(StorageLevel.MEMORY_AND_DISK());

        // A name that occurs in more than one friend list of a pair is a common friend.
        JavaPairRDD<String, List<String>> rdd = pairRDD.groupByKey().mapValues(x -> {
            Map<String, Integer> map = new HashMap<String, Integer>();
            for (String s : x) {
                // Skip empty lists before splitting.
                if (s == null || s.isEmpty()) {
                    continue;
                }
                for (String f : s.split(",")) {
                    map.merge(f, 1, Integer::sum);
                }
            }
            List<String> commonFrends = new ArrayList<String>();
            for (String m : map.keySet()) {
                if (map.get(m) > 1) {
                    commonFrends.add(m);
                }
            }
            return commonFrends;
        });

        pairRDD.saveAsTextFile("/opt/spark/commonFrend");  // intermediate pairs
        rdd.saveAsTextFile("/opt/spark/commonFrends");     // common friends per pair
    }
}