1. mapToPair
- Creates a pair (key-value) RDD. Read the file sample.txt and build a pair RDD, using the first word of each line as the key and 1 as the value.
The file sample.txt contains the following:
aa bb cc aa aa aa dd dd ee ee ee ee
ff aa bb zks
ee kks
ee zz zks
Scala version
Scala has no mapToPair function, but it does have map:
val conf = new SparkConf().setAppName("CartesianScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val lines = sc.textFile("in/sample.txt")
val pairs = lines.map(x => (x.split(" ")(0), 1))
pairs.collect.foreach(println)
Java version
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("mapToPairJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/sample.txt");
// The input is one line as a String; the output is a (String, Integer) pair
JavaPairRDD<String, Integer> pairRDD = lines.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) throws Exception {
        return new Tuple2<String, Integer>(s.split(" ")[0], 1);
    }
});
List<Tuple2<String, Integer>> collect = pairRDD.collect();
for (Tuple2<String, Integer> str : collect) {
    System.out.println(str);
}
The output is as follows (one pair per line of sample.txt, keyed on the first word):
(aa,1)
(ff,1)
(ee,1)
(ee,1)
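Since PairFunction has a single abstract method, the same transformation can also be written with a Java 8 lambda. This is a minimal sketch, assuming a Java 8+ runtime and the lines RDD from the example above:
// Same transformation as above, written as a Java 8 lambda instead of an anonymous class
JavaPairRDD<String, Integer> pairRDD = lines.mapToPair(
        s -> new Tuple2<String, Integer>(s.split(" ")[0], 1));
pairRDD.collect().forEach(System.out::println);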
2. flatMapToPair
Equivalent to a flatMap followed by a mapToPair.
For example: read the contents of the file and build a pair RDD, using each word as the key and 1 as the value.
Scala version
val conf = new SparkConf().setAppName("flatMapToPairScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val lines = sc.textFile("in/sample.txt")
val flatRDD = lines.flatMap(x => x.split(" "))
val pairs = flatRDD.map(x => (x, 1))
pairs.collect.foreach(println)
Java version, for Spark versions below 2.0 (where PairFlatMapFunction.call returns an Iterable)
import java.util.ArrayList;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("flatMapToPairJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/sample.txt");
// Before Spark 2.0, PairFlatMapFunction.call returns an Iterable
JavaPairRDD<String, Integer> wordPairRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
    @Override
    public Iterable<Tuple2<String, Integer>> call(String s) throws Exception {
        ArrayList<Tuple2<String, Integer>> tpLists = new ArrayList<Tuple2<String, Integer>>();
        String[] split = s.split("\\s+");
        for (int i = 0; i < split.length; i++) {
            tpLists.add(new Tuple2<String, Integer>(split[i], 1));
        }
        return tpLists;
    }
});
for (Tuple2<String, Integer> str : wordPairRDD.collect()) {
    System.out.println(str);
}
Java version, for Spark 2.0 and above (where PairFlatMapFunction.call returns an Iterator)
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("flatMapToPairJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/sample.txt");
// From Spark 2.0 on, PairFlatMapFunction.call returns an Iterator
JavaPairRDD<String, Integer> wordPairRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
    @Override
    public Iterator<Tuple2<String, Integer>> call(String s) throws Exception {
        ArrayList<Tuple2<String, Integer>> tpLists = new ArrayList<Tuple2<String, Integer>>();
        String[] split = s.split("\\s+");
        for (int i = 0; i < split.length; i++) {
            tpLists.add(new Tuple2<String, Integer>(split[i], 1));
        }
        return tpLists.iterator();
    }
});
for (Tuple2<String, Integer> str : wordPairRDD.collect()) {
    System.out.println(str);
}
The output pairs every word in sample.txt with 1: (aa,1), (bb,1), (cc,1), (aa,1), and so on.
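For Spark 2.0 and above, the anonymous class can likewise be replaced by a Java 8 lambda. A minimal sketch, assuming the same lines RDD as above:
// Same word-splitting transformation with a Java 8 lambda (Spark 2.0+ Iterator-returning API)
JavaPairRDD<String, Integer> wordPairRDD = lines.flatMapToPair(s -> {
    ArrayList<Tuple2<String, Integer>> pairs = new ArrayList<>();
    for (String word : s.split("\\s+")) {
        pairs.add(new Tuple2<String, Integer>(word, 1));
    }
    return pairs.iterator();
});
wordPairRDD.collect().forEach(System.out::println);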