Spark supports two kinds of RDD operations: a transformation converts the current RDD into a new RDD, while an action is the final operation on an RDD and returns a result to the driver.
Transformations are lazy: they only run when an action is invoked. A program consisting solely of transformations therefore does nothing when executed, since no action ever forces the computation. This design lets Spark optimize the whole execution plan and avoid materializing unnecessary intermediate results.
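As a minimal sketch of this laziness (the method name lazyDemo is just for illustration), the map below only records the lineage; the println inside call() never fires, because no action is ever invoked on the result:
private static void lazyDemo() {
    SparkConf conf = new SparkConf().setAppName("lazyDemo").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3));
    // Only the lineage is recorded here; no computation happens yet.
    JavaRDD<Integer> doubled = numbers.map(new Function<Integer, Integer>() {
        private static final long serialVersionUID = 1L;
        @Override
        public Integer call(Integer v1) throws Exception {
            System.out.println("computing " + v1); // never printed without an action
            return v1 * 2;
        }
    });
    // An action such as doubled.collect() or doubled.foreach(...) would
    // trigger the map above.
    sc.close();
}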
Commonly used transformations include: map, filter, flatMap, groupByKey, reduceByKey, sortByKey, join, and cogroup.
- map: any RDD can call the map operator. map takes a Function, whose two type parameters are the type of the input element and the type of the new element returned. The overridden call() method processes each element of the original RDD into a new element, and the new elements together form the new RDD. Example:
private static void map() {
    SparkConf conf = new SparkConf().setAppName("map").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);
    // Double each element; the new elements form a new RDD.
    JavaRDD<Integer> doubledNumbers = numbersRDD.map(
            new Function<Integer, Integer>() {
                private static final long serialVersionUID = 1L;
                @Override
                public Integer call(Integer v1) throws Exception {
                    return v1 * 2;
                }
            });
    doubledNumbers.foreach(new VoidFunction<Integer>() {
        private static final long serialVersionUID = 1L;
        @Override
        public void call(Integer t) throws Exception {
            System.out.println(t);
        }
    });
    sc.close();
}
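Run with the local master, this should print 2, 4, 6, 8, 10 on separate lines (with a single local thread, element order is preserved).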
- filter: filters elements. It is used the same way as map and also takes a Function, but call() returns a Boolean that decides whether each element is kept: true keeps the element, false filters it out. Example:
private static void filter() {
    SparkConf conf = new SparkConf().setAppName("filter").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);
    // Keep only the even numbers.
    JavaRDD<Integer> filterRDD = numbersRDD.filter(
            new Function<Integer, Boolean>() {
                private static final long serialVersionUID = 1L;
                @Override
                public Boolean call(Integer v1) throws Exception {
                    return v1 % 2 == 0;
                }
            });
    filterRDD.foreach(new VoidFunction<Integer>() {
        private static final long serialVersionUID = 1L;
        @Override
        public void call(Integer t) throws Exception {
            System.out.println(t);
        }
    });
    sc.close();
}
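This should print 2 and 4, the only even elements of the input.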
- flatMap: similar to map, except that each input element can produce zero or more new elements. It takes a FlatMapFunction, whose first type parameter is the input element type and whose second is the element type of the returned Iterable; call() returns that Iterable. In essence, flatMap is a map followed by a flatten. Example:
private static void flatMap() {
    SparkConf conf = new SparkConf()
            .setAppName("flatMap")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<String> lineList = Arrays.asList("hello you", "hello me", "hello world");
    JavaRDD<String> lines = sc.parallelize(lineList);
    // Split each line into words; every line can yield several elements.
    JavaRDD<String> words = lines.flatMap(
            new FlatMapFunction<String, String>() {
                private static final long serialVersionUID = 1L;
                @Override
                public Iterable<String> call(String l) throws Exception {
                    return Arrays.asList(l.split(" "));
                }
            });
    words.foreach(new VoidFunction<String>() {
        private static final long serialVersionUID = 1L;
        @Override
        public void call(String t) throws Exception {
            System.out.println(t);
        }
    });
    sc.close();
}
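Note that the Iterable return type above follows the Spark 1.x Java API. From Spark 2.0 onward, FlatMapFunction#call returns a java.util.Iterator instead, so the anonymous class would be written as:
// Spark 2.x variant: call() returns an Iterator rather than an Iterable.
new FlatMapFunction<String, String>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Iterator<String> call(String l) throws Exception {
        return Arrays.asList(l.split(" ")).iterator();
    }
}
Either way, the example should print each word (hello, you, hello, me, hello, world) on its own line.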
- reduceByKey: reduces the values of each key. It takes a Function2, which has three type parameters: the first two are the types of the input values and the third is the return type. Because the result of one reduce step becomes an input of the next, all three types must be the same. reduceByKey is called on a JavaPairRDD and also returns a JavaPairRDD. Example:
private static void reduceByKey() {
    SparkConf conf = new SparkConf()
            .setAppName("reduceByKey")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Tuple2<String, Integer>> scoreList = Arrays.asList(
            new Tuple2<String, Integer>("class1", 80),
            new Tuple2<String, Integer>("class2", 75),
            new Tuple2<String, Integer>("class1", 90),
            new Tuple2<String, Integer>("class2", 65));
    JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoreList);
    // Sum the scores of each class, key by key.
    JavaPairRDD<String, Integer> totalScores = scores.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                private static final long serialVersionUID = 1L;
                @Override
                public Integer call(Integer v1, Integer v2) throws Exception {
                    return v1 + v2;
                }
            });
    totalScores.foreach(new VoidFunction<Tuple2<String, Integer>>() {
        private static final long serialVersionUID = 1L;
        @Override
        public void call(Tuple2<String, Integer> t) throws Exception {
            System.out.println(t._1 + ": " + t._2);
        }
    });
    sc.close();
}
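Running this should print class1: 170 and class2: 140 (key order is not guaranteed). With Java 8, the same reduce can be written as a lambda:
JavaPairRDD<String, Integer> totalScores = scores.reduceByKey((v1, v2) -> v1 + v2);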
- groupByKey: called on a JavaPairRDD and returns a JavaPairRDD. It groups the values by key, returning an Iterable of values for each key. Example:
private static void groupByKey() {
    SparkConf conf = new SparkConf()
            .setAppName("groupByKey")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Tuple2<String, Integer>> scoreList = Arrays.asList(
            new Tuple2<String, Integer>("class1", 80),
            new Tuple2<String, Integer>("class2", 75),
            new Tuple2<String, Integer>("class1", 90),
            new Tuple2<String, Integer>("class2", 65));
    JavaPairRDD<String, Integer> scores = sc.parallelizePairs(scoreList);
    // Group all scores belonging to the same class key.
    JavaPairRDD<String, Iterable<Integer>> groupedScores = scores.groupByKey();
    groupedScores.foreach(new VoidFunction<Tuple2<String, Iterable<Integer>>>() {
        private static final long serialVersionUID = 1L;
        @Override
        public void call(Tuple2<String, Iterable<Integer>> t)
                throws Exception {
            System.out.println("class: " + t._1);
            Iterator<Integer> ite = t._2.iterator();
            while (ite.hasNext()) {
                System.out.println(ite.next());
            }
            System.out.println("==============================");
        }
    });
    sc.close();
}
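A design note: when the end goal is an aggregate such as a sum, prefer reduceByKey over groupByKey followed by manual aggregation. reduceByKey combines values on the map side before the shuffle, whereas groupByKey ships every individual value across the network.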
- sortByKey: sorts a JavaPairRDD by key and returns the sorted JavaPairRDD. The direction is chosen with a boolean argument: true for ascending, false for descending. Example:
private static void sortByKey() {
    SparkConf conf = new SparkConf()
            .setAppName("sortByKey")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Tuple2<Integer, String>> scoreList = Arrays.asList(
            new Tuple2<Integer, String>(65, "leo"),
            new Tuple2<Integer, String>(50, "tom"),
            new Tuple2<Integer, String>(100, "marry"),
            new Tuple2<Integer, String>(80, "jack"));
    JavaPairRDD<Integer, String> scores = sc.parallelizePairs(scoreList);
    // false = descending order by key.
    JavaPairRDD<Integer, String> sortedScores = scores.sortByKey(false);
    sortedScores.foreach(new VoidFunction<Tuple2<Integer, String>>() {
        private static final long serialVersionUID = 1L;
        @Override
        public void call(Tuple2<Integer, String> t) throws Exception {
            System.out.println(t._1 + ": " + t._2);
        }
    });
    sc.close();
}
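Because sortByKey(false) sorts in descending order, this should print 100: marry, 80: jack, 65: leo, 50: tom, in that order.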
- join: joins two JavaPairRDDs by key and returns a JavaPairRDD. The first type parameter of the result is the key type shared by the two input RDDs; the second is a Tuple2 holding the two value types. Example:
private static void join() {
    SparkConf conf = new SparkConf()
            .setAppName("join")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Tuple2<Integer, String>> studentList = Arrays.asList(
            new Tuple2<Integer, String>(1, "leo"),
            new Tuple2<Integer, String>(1, "zhao"),
            new Tuple2<Integer, String>(2, "jack"),
            new Tuple2<Integer, String>(3, "tom"));
    List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
            new Tuple2<Integer, Integer>(1, 100),
            new Tuple2<Integer, Integer>(1, 10),
            new Tuple2<Integer, Integer>(2, 90),
            new Tuple2<Integer, Integer>(3, 60));
    JavaPairRDD<Integer, String> students = sc.parallelizePairs(studentList);
    JavaPairRDD<Integer, Integer> scores = sc.parallelizePairs(scoreList);
    // Join the two pair RDDs on student id.
    JavaPairRDD<Integer, Tuple2<String, Integer>> studentScores = students.join(scores);
    studentScores.foreach(
            new VoidFunction<Tuple2<Integer, Tuple2<String, Integer>>>() {
                private static final long serialVersionUID = 1L;
                @Override
                public void call(Tuple2<Integer, Tuple2<String, Integer>> t)
                        throws Exception {
                    System.out.println("student id: " + t._1);
                    System.out.println("student name: " + t._2._1);
                    System.out.println("student score: " + t._2._2);
                    System.out.println("===============================");
                }
            });
    sc.close();
}
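join performs an inner join: keys present in only one RDD are dropped, and a key with multiple matches on both sides yields one output record per combination, so student id 1 above produces four records (leo/100, leo/10, zhao/100, zhao/10). When unmatched keys must be kept, JavaPairRDD also provides leftOuterJoin and rightOuterJoin.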
- cogroup: similar to join, except that it produces one record per key, pairing the Iterable of that key's values from the first RDD with the Iterable of its values from the second. Example:
private static void cogroup() {
    // Create the SparkConf and context.
    SparkConf conf = new SparkConf()
            .setAppName("cogroup")
            .setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    List<Tuple2<Integer, String>> studentList = Arrays.asList(
            new Tuple2<Integer, String>(1, "leo"),
            new Tuple2<Integer, String>(2, "jack"),
            new Tuple2<Integer, String>(3, "tom"));
    List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
            new Tuple2<Integer, Integer>(1, 100),
            new Tuple2<Integer, Integer>(2, 90),
            new Tuple2<Integer, Integer>(3, 60),
            new Tuple2<Integer, Integer>(1, 70),
            new Tuple2<Integer, Integer>(2, 80),
            new Tuple2<Integer, Integer>(3, 50));
    JavaPairRDD<Integer, String> students = sc.parallelizePairs(studentList);
    JavaPairRDD<Integer, Integer> scores = sc.parallelizePairs(scoreList);
    // For each student id, gather all names and all scores into two Iterables.
    JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> studentScores =
            students.cogroup(scores);
    studentScores.foreach(
            new VoidFunction<Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>>>() {
                private static final long serialVersionUID = 1L;
                @Override
                public void call(
                        Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> t)
                        throws Exception {
                    System.out.println("student id: " + t._1);
                    System.out.println("student name: " + t._2._1);
                    System.out.println("student score: " + t._2._2);
                    System.out.println("===============================");
                }
            });
    sc.close();
}
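Unlike join, cogroup emits exactly one record per key, with all of that key's values from each RDD gathered into the two Iterables; for student id 1 above, the printed tuple contains something like ([leo], [100, 70]).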