[Spark Java API] Transformation (11): reduceByKey and foldByKey

Latest recommended article published on 2021-03-06 21:27:33

王小我

Tags: reducebykey java

Article link: https://blog.csdn.net/weixin_42504327/article/details/114468683

reduceByKey

Official documentation:

Merge the values for each key using an associative reduce function.

This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
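
The associativity requirement matters because the values of one key may be merged in different groupings depending on how the data is partitioned. The plain-Java sketch below illustrates this without any Spark dependency; the class and method names are hypothetical and the "partitions" are ordinary lists. It reduces the same four values under two different splits: addition gives the same answer both times, subtraction does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class AssociativityDemo {

    // Left-to-right reduction of a non-empty list.
    static int reduceList(List<Integer> values, BinaryOperator<Integer> func) {
        int acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = func.apply(acc, values.get(i));
        }
        return acc;
    }

    // Reduce each group on its own, then reduce the per-group results.
    // This mimics merging inside each partition first and across partitions second.
    static int reduceGrouped(List<List<Integer>> groups, BinaryOperator<Integer> func) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> group : groups) {
            partials.add(reduceList(group, func));
        }
        return reduceList(partials, func);
    }

    public static void main(String[] args) {
        // The same values 1, 2, 3, 4 split in two different ways.
        List<List<Integer>> splitA = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
        List<List<Integer>> splitB = Arrays.asList(Arrays.asList(1), Arrays.asList(2, 3, 4));

        BinaryOperator<Integer> add = (a, b) -> a + b;
        BinaryOperator<Integer> subtract = (a, b) -> a - b;

        System.out.println(reduceGrouped(splitA, add));      // 10
        System.out.println(reduceGrouped(splitB, add));      // 10
        System.out.println(reduceGrouped(splitA, subtract)); // 0
        System.out.println(reduceGrouped(splitB, subtract)); // 6
    }
}
```

With a non-associative function such as subtraction, the result depends on the split, which is why such a function is unsuitable for reduceByKey.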

Function prototypes:

def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]

def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V]

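This function merges the values V belonging to each key K with the supplied reduce function.

The parameters are:

func: the reduce function, defined by the user as needed; it should be associative;

partitioner: the partitioner that decides how the result is partitioned;

numPartitions: the number of partitions; this overload uses a HashPartitioner with that many partitions.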

Source code analysis:

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

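The source shows that reduceByKey() is implemented on top of combineByKey(): createCombiner is the identity conversion (v: V) => v, while mergeValue and mergeCombiners are both the user-supplied function. reduceByKey() corresponds to traditional MapReduce, and the overall data flow is essentially the same as in Hadoop. combineByKey() enables combine() on the map side, so reduceByKey() also performs a map-side combine() by default. Before the shuffle, a mapPartitions operation combines the values and produces a MapPartitionsRDD; the shuffle then produces a ShuffledRDD; finally the reduce step (implemented through aggregate + mapPartitions()) produces another MapPartitionsRDD.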
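
The map-side combine followed by a reduce-side merge can be sketched in plain Java without Spark. This is only a simulation under stated assumptions: partitions are modeled as in-memory lists, and `pair`, `combineLocally` and `reduceByKey` are hypothetical helpers, not Spark APIs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {

    // Hypothetical helper that builds one (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Map-side combine: merge values that share a key inside one partition.
    static <K extends Comparable<K>, V> Map<K, V> combineLocally(
            List<Map.Entry<K, V>> partition, BinaryOperator<V> func) {
        Map<K, V> combined = new TreeMap<>();
        for (Map.Entry<K, V> kv : partition) {
            // The first value of a key is stored unchanged (createCombiner is the identity);
            // every later value is merged with func (mergeValue).
            combined.merge(kv.getKey(), kv.getValue(), func);
        }
        return combined;
    }

    // Reduce side: merge the per-partition results with the same func (mergeCombiners).
    static <K extends Comparable<K>, V> Map<K, V> reduceByKey(
            List<List<Map.Entry<K, V>>> partitions, BinaryOperator<V> func) {
        Map<K, V> result = new TreeMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            combineLocally(partition, func).forEach((k, v) -> result.merge(k, v, func));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> p0 = Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1));
        List<Map.Entry<String, Integer>> p1 = Arrays.asList(pair("a", 1), pair("b", 1));
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(p0, p1);

        System.out.println(reduceByKey(partitions, Integer::sum)); // {a=3, b=2}
    }
}
```

Combining locally first means only one value per key and partition has to cross the shuffle, which is the benefit the "combiner" comparison in the official description refers to.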

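Example:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// javaSparkContext is assumed to be an already initialized JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);

// Convert to (K, V) pairs: each element becomes (element, 1)
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<>(integer, 1);
    }
});

// Sum the values of each key using the default partitioning
JavaPairRDD<Integer, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduceByKeyRDD.collect());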

// Specify numPartitions: the result is hash-partitioned into 2 partitions
JavaPairRDD<Integer, Integer> reduceByKeyRDD2 = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
}, 2);
System.out.println(reduceByKeyRDD2.collect());

// Custom partitioner
JavaPairRDD<Integer, Integer> reduceByKeyRDD4 = javaPairRDD.reduceByKey(new Partitioner() {
    @Override
    public int numPartitions() { return 2; }

    @Override
    public int getPartition(Object o) {
        // Math.floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(o.toString().hashCode(), numPartitions());
    }
}, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduceByKeyRDD4.collect());
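
A custom getPartition must return an index in the range [0, numPartitions). Java's % operator keeps the sign of the dividend, so a plain `hashCode() % numPartitions()` can be negative when the hash is negative. The small plain-Java sketch below compares the two approaches; the class and method names are hypothetical, and the value -7 simply stands in for a negative hash code.

```java
public class PartitionIndexDemo {

    // Naive index: the result takes the sign of the hash, so it can be negative.
    static int naiveIndex(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Safe index: Math.floorMod returns a value in [0, numPartitions)
    // whenever numPartitions is positive.
    static int safeIndex(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int hash = -7; // stands in for a key whose hashCode() is negative
        System.out.println(naiveIndex(hash, 2)); // -1
        System.out.println(safeIndex(hash, 2));  // 1
    }
}
```

The keys in the example above are small positive integers whose string hash codes are positive, so the problem does not show up there, but it can with arbitrary keys.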

foldByKey

Official documentation:

Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).

Function prototypes:

def foldByKey(zeroValue: V, partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def foldByKey(zeroValue: V, numPartitions: Int, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

def foldByKey(zeroValue: V, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

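This function folds and merges the values V of each key K with the supplied function; the zeroValue parameter is the initial value for V.

The parameters are:

zeroValue: the initial value;

numPartitions: the number of partitions; this overload uses a HashPartitioner with that many partitions;

partitioner: the partitioner;

func: the user-defined merge function.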

Source code analysis:

def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  // When deserializing, use a lazy val to create just one instance of the serializer per task
  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

  val cleanedFunc = self.context.clean(func)
  combineByKey[V]((v: V) => cleanedFunc(createZero(), v), cleanedFunc, cleanedFunc, partitioner)
}

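The implementation shows that foldByKey() is also built on combineByKey(): createCombiner merges a fresh copy of zeroValue with the first value of a key, while mergeValue and mergeCombiners are both the user-supplied function. Because createCombiner runs once for each key in every partition that contains the key, zeroValue can enter the result more than once. When aggregating the values of a key, take particular care to choose an initial value that does not change the aggregate result.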
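
The effect of the zero value can be sketched in plain Java without Spark. The simulation below follows a single key whose values are spread over several partitions; `foldOneKey` is a hypothetical helper, not a Spark API, and partitions are modeled as lists. A neutral zero (0 for addition) gives the same result for any split, while a non-neutral zero (10, or the string "X") makes the result depend on the number of partitions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class FoldByKeySketch {

    // Simulates foldByKey for one key. In each partition the first value is merged
    // with the zero value (createCombiner), the remaining values are merged with
    // func (mergeValue), and the per-partition results are merged with func
    // (mergeCombiners).
    static <V> V foldOneKey(List<List<V>> partitions, V zero, BinaryOperator<V> func) {
        V result = null;
        for (List<V> partition : partitions) {
            if (partition.isEmpty()) {
                continue; // the key is absent here, so the zero value is never applied
            }
            V acc = zero;
            for (V value : partition) {
                acc = func.apply(acc, value);
            }
            result = (result == null) ? acc : func.apply(result, acc);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> onePartition = Arrays.asList(Arrays.asList(1, 2, 3));
        List<List<Integer>> twoPartitions = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3));

        // Neutral zero: the split does not matter.
        System.out.println(foldOneKey(onePartition, 0, Integer::sum));   // 6
        System.out.println(foldOneKey(twoPartitions, 0, Integer::sum));  // 6

        // Non-neutral zero: it is added once per partition that holds the key.
        System.out.println(foldOneKey(onePartition, 10, Integer::sum));  // 16
        System.out.println(foldOneKey(twoPartitions, 10, Integer::sum)); // 26

        // String concatenation with "X" as the zero value.
        List<List<String>> strings = Arrays.asList(Arrays.asList("1", "2"), Arrays.asList("3"));
        System.out.println(foldOneKey(strings, "X", (a, b) -> a + ":" + b)); // X:1:2:X:3
    }
}
```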

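Example:

import java.util.Random;

// javaSparkContext is assumed to be an already initialized JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7, 1, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
final Random rand = new Random(10);

// Convert to (K, V) pairs: each element is paired with a random digit as a String
JavaPairRDD<Integer, String> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(Integer integer) throws Exception {
        return new Tuple2<>(integer, Integer.toString(rand.nextInt(10)));
    }
});

// Fold with the default partitioning; "X" is the zero value
JavaPairRDD<Integer, String> foldByKeyRDD = javaPairRDD.foldByKey("X", new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD.collect());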

// Specify numPartitions
JavaPairRDD<Integer, String> foldByKeyRDD1 = javaPairRDD.foldByKey("X", 2, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD1.collect());

// Custom partitioner
JavaPairRDD<Integer, String> foldByKeyRDD2 = javaPairRDD.foldByKey("X", new Partitioner() {
    @Override
    public int numPartitions() { return 3; }

    @Override
    public int getPartition(Object key) {
        // Math.floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
}, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD2.collect());
