An example of calling the mapPartitions method through the Spark Java API:
package com.dsinpractice.spark.samples.core;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MapPartitions implements Serializable {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: " + MapPartitions.class.getName() + " <inpath> <outpath>");
            System.out.println("Input files to use at resources/common-text-data");
            System.exit(-1);
        }
        MapPartitions mapPartitions = new MapPartitions();
        mapPartitions.run(args);
    }

    private void run(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("map partitions demo");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = javaSparkContext.textFile(args[0], 2);
        JavaRDD<String> lowerCaseLines = lines.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
            @Override
            public Iterable<String> call(Iterator<String> linesPerPartition) throws Exception {
                List<String> lowerCaseLines = new ArrayList<String>();
                while (linesPerPartition.hasNext()) {
                    String line = linesPerPartition.next();
                    lowerCaseLines.add(line.toLowerCase());
                }
                return lowerCaseLines;
            }
        });
        lowerCaseLines.saveAsTextFile(args[1]);
    }
}
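Note that the example above targets the Spark 1.x Java API, in which FlatMapFunction.call returns an Iterable. Starting with Spark 2.0, FlatMapFunction.call returns a java.util.Iterator instead. As a minimal sketch (reusing the lines RDD from above), the same logic under Spark 2.x would look roughly like this:

        // Spark 2.x: FlatMapFunction.call returns java.util.Iterator<U>, not Iterable<U>
        JavaRDD<String> lowerCaseLines = lines.mapPartitions(
                new FlatMapFunction<Iterator<String>, String>() {
                    @Override
                    public Iterator<String> call(Iterator<String> linesPerPartition) {
                        List<String> result = new ArrayList<>();
                        while (linesPerPartition.hasNext()) {
                            result.add(linesPerPartition.next().toLowerCase());
                        }
                        // Return an iterator over the buffered results
                        return result.iterator();
                    }
                });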
Back in the original example, note the following code:
JavaRDD<String> lines = javaSparkContext.textFile(args[0], 2);
JavaRDD<String> lowerCaseLines = lines.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> linesPerPartition)
Here,
lines has type JavaRDD<String>;
FlatMapFunction is parameterized as <Iterator<String>, String>;
the parameter type of FlatMapFunction's call method is Iterator<String>;
the return type of FlatMapFunction's call method is Iterable<String>;
the return type of the mapPartitions method is JavaRDD<String>.
Is there a connection between these types?
This post works out the answer.
The JavaRDD class is defined as follows:
class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
extends AbstractJavaRDDLike[T, JavaRDD[T]] {
....
}
From this definition we can see that when JavaRDD's type parameter is String, JavaRDD<String> extends AbstractJavaRDDLike with its first type parameter T bound to String.
The AbstractJavaRDDLike abstract class is defined as follows:
/**
* As a workaround for https://issues.scala-lang.org/browse/SI-8905, implementations
* of JavaRDDLike should extend this dummy abstract class instead of directly inheriting
* from the trait. See SPARK-3266 for additional details.
*/
private[spark] abstract class AbstractJavaRDDLike[T, This <: JavaRDDLike[T, This]]
extends JavaRDDLike[T, This]
Likewise, if AbstractJavaRDDLike's first type parameter T is String, then the JavaRDDLike trait it extends also has its first type parameter T bound to String.
The JavaRDDLike trait is defined as follows:
trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
.....
}
The mapPartitions method of JavaRDDLike is defined as follows:
/**
* Return a new RDD by applying a function to each partition of this RDD.
*/
def mapPartitions[U](f: FlatMapFunction[java.util.Iterator[T], U]): JavaRDD[U] = {
  def fn: (Iterator[T]) => Iterator[U] = {
    (x: Iterator[T]) => f.call(x.asJava).iterator().asScala
  }
  JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
From this code we can draw the following conclusions:
mapPartitions is a generic method, declared as mapPartitions[U], and it takes a FlatMapFunction as its only parameter.
If the first type parameter of the JavaRDDLike trait is T, then the first type parameter of the FlatMapFunction must be java.util.Iterator[T].
If the second type parameter of the FlatMapFunction is U, then the return type of mapPartitions is JavaRDD[U].
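To make the relationship between T and U concrete, here is a minimal sketch (Spark 1.x Java API, reusing the lines RDD from the first example; lineLengths is a hypothetical name). Choosing U = Integer makes mapPartitions return a JavaRDD<Integer>:

        // T = String (the element type of lines), U = Integer,
        // so the result type is JavaRDD<Integer>.
        JavaRDD<Integer> lineLengths = lines.mapPartitions(
                new FlatMapFunction<Iterator<String>, Integer>() {
                    @Override
                    public Iterable<Integer> call(Iterator<String> linesPerPartition) throws Exception {
                        List<Integer> lengths = new ArrayList<Integer>();
                        while (linesPerPartition.hasNext()) {
                            lengths.add(linesPerPartition.next().length());
                        }
                        return lengths;
                    }
                });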
===============================================
By the same reasoning, consider the code below:
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, sparkConf);
JavaRDD<String> rtMidRDD = esRDD
        .mapPartitions(new FlatMapFunction<Iterator<Tuple2<String, Map<String, Object>>>, String>() {
            @Override
            public Iterable<String> call(Iterator<Tuple2<String, Map<String, Object>>> iter) throws Exception {
                ..............
            }
        });
JavaPairRDD is defined as follows:
class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
(implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V])
extends AbstractJavaRDDLike[(K, V), JavaPairRDD[K, V]] {
......
}
From this definition we can see that if JavaPairRDD's two type parameters K and V are String and Map<String, Object>, then the first type parameter of AbstractJavaRDDLike is Tuple2<String, Map<String, Object>>. That is why the call method in the snippet above receives an Iterator<Tuple2<String, Map<String, Object>>>, and, since FlatMapFunction's second type parameter U is String, mapPartitions returns a JavaRDD<String>.
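As a minimal self-contained sketch (Spark 1.x API; the data and variable names are hypothetical, and scala.Tuple2 and java.util.Arrays must be imported), the following shows that for a JavaPairRDD<String, Integer> the partition iterator passed to call yields Tuple2<String, Integer> elements, and choosing U = String yields a JavaRDD<String>:

        // Hypothetical data: T = Tuple2<String, Integer>, U = String,
        // so mapPartitions returns a JavaRDD<String>.
        JavaPairRDD<String, Integer> pairs = javaSparkContext.parallelizePairs(
                Arrays.asList(new Tuple2<String, Integer>("a", 1),
                              new Tuple2<String, Integer>("b", 2)));
        JavaRDD<String> keys = pairs.mapPartitions(
                new FlatMapFunction<Iterator<Tuple2<String, Integer>>, String>() {
                    @Override
                    public Iterable<String> call(Iterator<Tuple2<String, Integer>> iter) throws Exception {
                        List<String> result = new ArrayList<String>();
                        while (iter.hasNext()) {
                            // _1() accesses the key of each Tuple2 element
                            result.add(iter.next()._1());
                        }
                        return result;
                    }
                });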