1. first
Returns the first element of the RDD.
Scala version
val conf = new SparkConf().setAppName("FirstScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,3))
println(rdd.first())
Java version
SparkConf conf = new SparkConf().setAppName("FirstJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(5, 4, 3, 2, 1));
Integer first = rdd.first();
System.out.println(first);
2. take
Returns the first N elements of the RDD, where N is the argument passed in.
Scala version
val conf = new SparkConf().setMaster("local[*]").setAppName("TakeScala")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(List(1,2,3,4,6,5))
val take = rdd.take(3)
println(take.toList)
println(take.mkString(","))
Java version
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("TakeJava");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
List<Integer> take = rdd.take(3);
System.out.println(take.toString());
3. count
Returns the number of elements in the RDD.
Scala version
val conf = new SparkConf().setAppName("CountScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,5))
println(rdd.count())
Java version
SparkConf conf = new SparkConf().setAppName("CountJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
System.out.println(rdd.count());
4. countByValue
Counts how many times each element occurs in the RDD, returning {(key1, count), (key2, count), … (keyN, count)}.
Scala version
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("CountByValueScala")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(List(1,1,1,2,3,4,4,4,5,6,6))
println(rdd.countByValue())
//Map(5 -> 1, 1 -> 3, 6 -> 2, 2 -> 1, 3 -> 1, 4 -> 3)
Java version
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("CountByValueJava");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 4, 3, 4, 5,9,5,3,4,2));
System.out.println(rdd.countByValue());
//{5=2, 1=1, 9=1, 2=1, 3=2, 4=3}
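Note that countByValue collects its result as an ordinary Map on the driver rather than returning an RDD, so it is only appropriate when the number of distinct values is small. A minimal Scala sketch of the equivalent distributed computation, using map and reduceByKey (collect is what brings the result back to the driver):
val counts = rdd.map(x => (x, 1L)).reduceByKey(_ + _).collect().toMap
println(counts)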
5. reduce
Aggregates all elements of the RDD in parallel, similar to reduce on a Scala collection.
Scala version
val conf = new SparkConf().setAppName("ReduceScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,4,5,5,5))
println(rdd.reduce((x, y) => x + y))
//25
Java version
SparkConf conf = new SparkConf().setAppName("ReduceJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
Integer reduce = rdd.reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduce);
//15
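The function passed to reduce should be associative and commutative, but it need not be addition. For example, a quick sketch that finds the maximum of the Scala rdd above:
println(rdd.reduce((x, y) => if (x > y) x else y))
//5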
6. aggregate
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
This operator takes two parameter lists. The first passes zeroValue (the initial value). The second passes two functions: the first, seqOp, is applied within each partition; the second, combOp, merges the results of all partitions once seqOp has finished.
Scala version
val conf = new SparkConf().setMaster("local[*]").setAppName("AggregateScala")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,4,5,6),1)
val sum = rdd.aggregate(0)((x, y) => {
  println(x + "," + y)
  x + y
}, (a, b) => {
  println(a + "," + b)
  a + b
})
println(sum)
With a single partition the run is deterministic, and the output is:
0,1
1,2
3,3
6,4
10,5
15,6
0,21
21
Look carefully at the last two lines: seqOp folds the six elements within the one partition (producing 21), and combOp then merges zeroValue with that partition result (hence 0,21) before the final sum 21 is printed.
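A common use of aggregate is to accumulate into a type different from the element type, which reduce and fold cannot do. A minimal sketch that computes the sum and count of the same rdd in one pass to derive the average:
val (total, cnt) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1), // seqOp: fold each element into the partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: merge the partition accumulators
)
println(total.toDouble / cnt)
//3.5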
7. fold
fold(num)(func) works like reduce() but supplies an initial value num. Note that the fold is performed per partition, with the initial value folded into each partition, and the partition results are then folded together, again starting from the initial value.
Scala version
val conf = new SparkConf().setAppName("FoldScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,4,5),1)
println(rdd.fold(0)(_ + _))
//15
println(rdd.fold(10)(_ + _))
//35
Java version
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("FoldJava");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5),1);
Integer fold = rdd.fold(10, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(fold);
//35
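Because the initial value is folded in once per partition and once more when the partition results are merged, the result depends on the number of partitions. A minimal sketch with the same data split into two partitions:
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5), 2)
println(rdd2.fold(10)(_ + _))
//45 = 15 + 2 * 10 (once per partition) + 10 (final merge)
For this reason the initial value should normally be the identity of the operation (0 for addition, 1 for multiplication).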
8. top
Returns the top N elements in descending order, or according to a specified ordering.
Scala version
val conf = new SparkConf().setAppName("TopScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,4,5))
println(rdd.top(2).mkString(","))
//5,4
Java version
SparkConf conf = new SparkConf().setAppName("TopJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
System.out.println(rdd.top(2));
//[5, 4]
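top also accepts an explicit Ordering, and reversing the default ordering makes it return the smallest elements instead. A quick sketch against the Scala rdd above:
println(rdd.top(2)(Ordering[Int].reverse).mkString(","))
//1,2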
9. takeOrdered
Sorts the RDD's elements in ascending order and returns the first N; essentially the opposite of top.
Scala version
val conf = new SparkConf().setMaster("local[*]").setAppName("TakeOrderedScala")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(5,6,9,8,7,3,4,3,6))
println(rdd.takeOrdered(3).mkString(","))
//3,3,4
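takeOrdered likewise accepts an explicit Ordering, so takeOrdered(n)(Ordering[Int].reverse) behaves like top(n). A quick sketch against the same rdd:
println(rdd.takeOrdered(3)(Ordering[Int].reverse).mkString(","))
//9,8,7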
Java version
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("TakeOrderedJava");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(6, 2, 3, 4, 5));
System.out.println(rdd.takeOrdered(3));
//[2, 3, 4]
10. foreach
Applies the given function to each element of the RDD.
Scala version
val conf = new SparkConf().setAppName("ForeachScala").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(List(1,2,3,4,5))
rdd.foreach(println)
Java version
SparkConf conf = new SparkConf().setAppName("ForeachJava").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5),1);
rdd.foreach(new VoidFunction<Integer>() {
    @Override
    public void call(Integer integer) throws Exception {
        System.out.println(integer);
    }
});
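Keep in mind that foreach runs on the executors, so on a real cluster the println output appears in the executor logs rather than on the driver console. When each partition needs one-time setup, such as opening a database connection, foreachPartition is the usual alternative; a minimal sketch against the Scala rdd above:
rdd.foreachPartition { iter =>
  // per-partition setup (e.g. open a connection) would go here
  iter.foreach(println)
  // per-partition cleanup would go here
}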