1. Creating an accumulator variable
public <T> Accumulator<T> accumulator(T initialValue, AccumulatorParam<T> param)
Create an Accumulator variable of a given type, which tasks can "add" values to using the += method. Only the driver can access the accumulator's value.
Parameters:
initialValue - (undocumented)
param - (undocumented)
Returns:
(undocumented)
2. About AccumulatorParam
Terminology:
initialValue: the initial value of the Accumulator, i.e. the initialValue passed to SparkContext.accumulator.
zeroValue: the initial value of the AccumulatorParam, i.e. the return value of the zero method.
Suppose the sample data set is sample = {1, 2, 3, 4}.
Execution order:
1. zero(initialValue) is called and returns zeroValue.
2. addAccumulator(zeroValue, 1) is called and returns v1;
addAccumulator(v1, 2) returns v2;
addAccumulator(v2, 3) returns v3;
addAccumulator(v3, 4) returns v4.
3. addInPlace(initialValue, v4) is called.
The final result is therefore zeroValue + 1 + 2 + 3 + 4 + initialValue.
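The three-step sequence above can be checked with a small, Spark-free Java sketch. The three static methods below mirror the LongAccumulator implemented in section 3 (zero hard-coded to return 0), and initialValue = 10 is a hypothetical value chosen so the two contributions are easy to tell apart:

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorTrace {
    // Mirrors LongAccumulator.zero: always returns 0, ignoring initialValue.
    static long zero(long initialValue) { return 0L; }

    // Mirrors LongAccumulator.addAccumulator: fold one element in.
    static long addAccumulator(long value, long step) { return value + step; }

    // Mirrors LongAccumulator.addInPlace: merge the folded value back.
    static long addInPlace(long initialValue, long value) { return initialValue + value; }

    public static void main(String[] args) {
        long initialValue = 10L; // hypothetical, for illustration
        List<Long> sample = Arrays.asList(1L, 2L, 3L, 4L);

        long v = zero(initialValue);          // step 1: zeroValue = 0
        for (long x : sample) {
            v = addAccumulator(v, x);         // step 2: 0 -> 1 -> 3 -> 6 -> 10
        }
        long result = addInPlace(initialValue, v); // step 3: 10 + 10
        // zeroValue + 1 + 2 + 3 + 4 + initialValue = 0 + 10 + 10 = 20
        System.out.println(result);
    }
}
```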
3. Implementing AccumulatorParam
package com.spark.api.spark_java_api_learn;

import org.apache.spark.AccumulatorParam;

public class LongAccumulator implements AccumulatorParam<Long> {

    private static final long serialVersionUID = 1L;

    // Called after all addAccumulator calls have finished;
    // merges the accumulated value back into initialValue.
    @Override
    public Long addInPlace(Long initialValue, Long value) {
        System.out.println("addInPlace, " + initialValue + " : " + value);
        return initialValue + value;
    }

    /*
     * initialValue is the argument passed to SparkContext.accumulator(initialValue).
     * The value returned here is the starting point of the accumulation.
     * Note: it does not have to equal initialValue.
     *
     * If initialValue = 10 and zero(initialValue) = 0, the computation runs as:
     *   v1 := 0 + step
     *   v1 := v1 + step
     *   ...
     *   and finally v1 := v1 + initialValue
     */
    @Override
    public Long zero(Long initialValue) {
        System.out.println("zero, initialValue=" + initialValue);
        // Initialize zeroValue to 0
        return 0L;
    }

    @Override
    public Long addAccumulator(Long value, Long step) {
        System.out.println("addAccumulator, " + value + "," + step);
        return value + step;
    }
}
Next, let's use it:
package com.spark.api.spark_java_api_learn;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class AccumulatorDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("AccumulatorDemo")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // initialValue = 0
        final Accumulator<Long> acc = sc.accumulator(0L, new LongAccumulator());
        List<Long> seq = Arrays.asList(1L, 2L, 3L, 4L);
        JavaRDD<Long> rdd = sc.parallelize(seq);
        rdd.foreach(new VoidFunction<Long>() {
            @Override
            public void call(Long arg0) throws Exception {
                acc.add(arg0);
            }
        });
        System.out.println(acc.value());
    }
}
Output:
// zero receives initialValue and returns 0 as zeroValue
zero, initialValue=0
// addAccumulator takes two arguments, value and step; the first call
// receives zeroValue as value, and each call's return value becomes
// the value of the next call
addAccumulator, 0,1
addAccumulator, 1,2
addAccumulator, 3,3
addAccumulator, 6,4
// addInPlace runs after all the addAccumulator calls,
// merging the accumulated value into initialValue
addInPlace, 0 : 10
// acc.value(), the final result
10
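Note that addInPlace appears only once in this trace because master "local" ran the whole RDD as a single partition. With several partitions, each task folds its own slice starting from zero(...), and the driver then merges each task's result into its running value. The sketch below is a plain-Java illustration of that merge order under the assumption of two partitions {1, 2} and {3, 4}; it is not Spark itself, and the partition split is hypothetical:

```java
public class TwoPartitionTrace {
    // Same logic as LongAccumulator, reproduced here so the sketch is self-contained.
    static long zero(long initialValue) { return 0L; }
    static long addAccumulator(long value, long step) { return value + step; }
    static long addInPlace(long initialValue, long value) { return initialValue + value; }

    public static void main(String[] args) {
        long driver = 10L; // hypothetical initialValue, held on the driver

        // Each task starts from zero(...) and folds only its own partition.
        long p1 = addAccumulator(addAccumulator(zero(driver), 1L), 2L); // partition {1, 2} -> 3
        long p2 = addAccumulator(addAccumulator(zero(driver), 3L), 4L); // partition {3, 4} -> 7

        // The driver merges each task's result in turn via addInPlace.
        driver = addInPlace(driver, p1); // 10 + 3 = 13
        driver = addInPlace(driver, p2); // 13 + 7 = 20
        System.out.println(driver);
    }
}
```

The final value is the same as in the single-partition run plus the different initialValue; only the number of addInPlace calls changes.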