1. map
Applies a function to each element of the RDD and returns a new RDD.
SparkConf conf = new SparkConf().setAppName("RDDDemo").setMaster("local[*]"); // assumed setup; conf was not defined in the original snippet
JavaSparkContext sc = new JavaSparkContext(conf);
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
JavaRDD<Integer> intRDD = sc.parallelize(list);
intRDD.map(x -> x + 1).foreach(f -> System.out.println(f));
2. The filter operation
Evaluates a predicate against each element of the RDD: elements for which it returns true are kept, and the rest are filtered out.
JavaRDD<Integer> filterRDD = intRDD.filter(x -> (x % 2 == 0));
filterRDD.foreach(x -> System.out.println(x));
3. flatMap
In the flatMap() transformation, each element of the source RDD is mapped to one or more elements of the target RDD: a function runs on every source element and produces one or more outputs.
The function returns an iterator (java.util.Iterator).
JavaRDD<String> stringRDD = sc.parallelize(Arrays.asList("Hello Spark", "Hello Java"));
JavaRDD<String> flatRDD = stringRDD.flatMap(t -> Arrays.asList(t.split(" ")).iterator());
flatRDD.foreach(f -> System.out.println(f));
4. mapToPair
Similar to map, except that it produces key-value pairs, yielding a JavaPairRDD<String, Integer>.
List<Integer> intList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
JavaRDD<Integer> intRDD = sc.parallelize(intList, 2);
JavaPairRDD<String, Integer> pairRDD = intRDD.mapToPair(i -> (i % 2 == 0)
        ? new Tuple2<String, Integer>("even", i)
        : new Tuple2<String, Integer>("odd", i));
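Transformations are lazy, so nothing runs until an action is called. Continuing the snippet above, printing the pairs shows what mapToPair produced (output order across partitions is not guaranteed):
pairRDD.foreach(t -> System.out.println(t._1 + ", " + t._2));
This prints pairs such as odd, 1 and even, 2 for each of the ten inputs.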
5. flatMapToPair
Like flatMap, except that it produces key-value pairs; the function returns an Iterator.
JavaPairRDD<String, Integer> wordLengthRDD = stringRDD.flatMapToPair(s -> Arrays.asList(s.split(" ")).stream()
        .map(token -> new Tuple2<String, Integer>(token, token.length()))
        .iterator());
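As a quick check, printing the result (continuing the snippet; output order is not guaranteed):
wordLengthRDD.foreach(t -> System.out.println(t._1 + ", " + t._2));
With the two input strings above this yields Hello, 5 / Spark, 5 / Hello, 5 / Java, 4.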
6. union
Merges two RDDs without removing duplicates.
JavaRDD<Integer> intRDD2 = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> unionRDD = intRDD.union(intRDD2);
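Printing unionRDD (order not guaranteed) shows 13 elements: the union of {1..10} and {1, 2, 3}, with 1, 2, and 3 each appearing twice:
unionRDD.foreach(x -> System.out.println(x));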
7. intersection: the common elements of two RDDs
JavaRDD<Integer> commonRDD = intRDD.intersection(intRDD2);
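Note that intersection also removes duplicates from its output. Printing commonRDD here yields 1, 2, and 3 (in no particular order):
commonRDD.foreach(x -> System.out.println(x));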
8. distinct
Removes duplicate elements from the RDD, keeping a single copy of each.
JavaRDD<Integer> rddWithDupElements = sc.parallelize(Arrays.asList(1, 1, 2, 4, 5, 6, 8, 8, 9, 10, 11, 11));
JavaRDD<Integer> distinctRDD = rddWithDupElements.distinct();
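Printing distinctRDD yields each value once: 1, 2, 4, 5, 6, 8, 9, 10, 11 (order not guaranteed):
distinctRDD.foreach(x -> System.out.println(x));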
9. cartesian (Cartesian product)
JavaRDD<String> rddStrings = sc.parallelize(Arrays.asList("A", "B", "C"));
JavaRDD<Integer> rddIntegers = sc.parallelize(Arrays.asList(1, 4, 5));
JavaPairRDD<String, Integer> cartesianRDD = rddStrings.cartesian(rddIntegers);
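The result pairs every string with every integer, 3 × 3 = 9 pairs in total, from ("A", 1) through ("C", 5):
cartesianRDD.foreach(t -> System.out.println(t._1 + ", " + t._2));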
10. groupByKey
Groups a pair RDD by key:
pairRDD.groupByKey(Partitioner partitioner)
The groupByKey transformation works on pair RDDs, that is, RDDs made up of (key, value) tuples. It groups together all values associated with the same key, turning a pair RDD of <key, value> pairs into one of <key, Iterable<value>> pairs, as in the sketch below.
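A minimal sketch, assuming the pairRDD of even/odd pairs built in the mapToPair example above is still in scope:
JavaPairRDD<String, Iterable<Integer>> groupedRDD = pairRDD.groupByKey();
groupedRDD.foreach(t -> System.out.println(t._1 + ": " + t._2));
This prints one line per key: even with the values 2, 4, 6, 8, 10 and odd with 1, 3, 5, 7, 9.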
11. The aggregation function reduceByKey
Groups key-value pairs by key and reduces the values within each group:
pairRDD.reduceByKey((v1, v2) -> v1 + v2);
A partitioner can also be supplied (see the sketch below):
reduceByKey(Partitioner partitioner, Function2<V, V, V> func)
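A minimal sketch of the partitioned form, again assuming the even/odd pairRDD from the mapToPair example; the two-partition HashPartitioner here is an illustrative choice:
JavaPairRDD<String, Integer> sumRDD = pairRDD.reduceByKey(new HashPartitioner(2), (v1, v2) -> v1 + v2); // requires import org.apache.spark.HashPartitioner
sumRDD.foreach(t -> System.out.println(t._1 + ", " + t._2)); // prints even, 30 and odd, 25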
12. Sorting key-value pairs by key
JavaPairRDD<String, Integer> unsortPairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("B", 2),
        new Tuple2<String, Integer>("B", 5),
        new Tuple2<String, Integer>("A", 7),
        new Tuple2<String, Integer>("A", 8)));
// sort in descending key order
unsortPairRDD.sortByKey(false).foreach(f -> System.out.println(f._1 + ", " + f._2));
A custom comparator can also be supplied, as in the sketch after the signature:
sortByKey(Comparator<K> comp, boolean ascending)
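As a sketch, String.CASE_INSENSITIVE_ORDER works as the comparator here because it is Serializable, which Spark requires of comparators shipped to executors:
unsortPairRDD.sortByKey(String.CASE_INSENSITIVE_ORDER, true).foreach(f -> System.out.println(f._1 + ", " + f._2));
With the keys above this prints the A pairs before the B pairs.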
13. join
Joining RDD1 of <x, y> pairs with RDD2 of <x, z> pairs returns <x, (y, z)> pairs:
// General form: JavaPairRDD<K, Tuple2<V, W>> joinedRDD = pairRDD1.join(pairRDD2);
JavaPairRDD<String, Integer> unsortPairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("B", 2),
        new Tuple2<String, Integer>("A", 5),
        new Tuple2<String, Integer>("B", 7),
        new Tuple2<String, Integer>("A", 8)));
JavaPairRDD<String, String> pairRDD1 = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("B", "A"),
        new Tuple2<String, String>("B", "D"),
        new Tuple2<String, String>("A", "E"),
        new Tuple2<String, String>("A", "B")));
JavaPairRDD<String, Tuple2<String, Integer>> joinRDD = pairRDD1.join(unsortPairRDD);
joinRDD.foreach(f -> System.out.println(f._1 + ", " + f._2._1 + ", " + f._2._2));
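Since join is an inner join on the key, each of the two "B" values on the left pairs with each of the two "B" values on the right, and likewise for "A", giving 8 output records such as B, A, 2 and A, E, 5 (output order is not guaranteed).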