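Accumulators
An accumulator solves the following problem: a variable defined on the Driver and modified inside an Executor is never sent back to the Driver, because each task only updates its own copy of the variable.
For example: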
val spark = SparkSession.builder().appName("AccumulatorDemo").master("local").getOrCreate()
val sc = spark.sparkContext
var i = 0
// Define an arbitrary RDD
val rdd1 = sc.parallelize(List("a", "b", "c"))
rdd1.map(s => {
i += 1
s
}).collect()
print(s"i=$i") // i=0: the increments were applied to the tasks' copies of i, not to the Driver's variable
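Why the Driver still sees 0 can be illustrated without Spark. The following is only a minimal sketch, assuming that shipping a closure to an Executor behaves like a Java serialization round trip. `Counter`, `ClosureCopySketch`, and `shipToExecutor` are hypothetical names introduced for illustration and are not part of Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a variable captured by a task closure.
class Counter(var i: Int) extends Serializable

object ClosureCopySketch {
  // Round-trips an object through Java serialization, which approximates
  // (as an assumption) how captured state reaches an Executor.
  def shipToExecutor[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Returns (value seen by the Driver, value seen by the Executor-side copy).
  def run(): (Int, Int) = {
    val driverSide = new Counter(0)
    val executorSide = shipToExecutor(driverSide)
    // The "task" increments only the deserialized copy.
    List("a", "b", "c").foreach(_ => executorSide.i += 1)
    (driverSide.i, executorSide.i)
  }

  def main(args: Array[String]): Unit = {
    val (driverValue, executorValue) = run()
    println(s"driver i=$driverValue, executor copy i=$executorValue")
  }
}
```

The Driver-side counter stays at 0 while the copy reaches 3, which mirrors the result of the Spark snippet above.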
With an accumulator, the values updated in each partition are sent back to the Driver and merged there.
Scala Demo
val spark = SparkSession.builder().appName("AccumulatorDemo").master("local").getOrCreate()
val sc = spark.sparkContext
var i = 0
// Spark 1.6 created accumulators with sc.accumulator(0), which is deprecated since Spark 2.0
// Define the accumulator
val accumulator = sc.longAccumulator
// Define an arbitrary RDD
val rdd1 = sc.parallelize(List("a", "b", "c"))
rdd1.map(s => {
i += 1
// Update the accumulator
accumulator.add(1)
s
}).collect()
println(s"i=$i") //i=0
// Read the accumulated value on the Driver with accumulator.value
println(s"accumulator=${accumulator.value}") //accumulator=3
Custom accumulators
Scala Demo:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.AccumulatorV2
object DefindSelfAccumulator {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("AccumulateDemo").master("local").getOrCreate()
val sc = spark.sparkContext
// Create an RDD[String] of "name age" records, used to compute the number of people and the sum of their ages
val rdd1 = sc.parallelize(List("zhangsan 18", "lisi 20", "wangwu 45"))
// Instantiate the custom accumulator and register it with the SparkContext
val myAccumulator = new SelfAccumulator()
sc.register(myAccumulator, "SelfAccumulatorDemo")
// Parse each record of rdd1 into an Info and add it to the accumulator
val value: RDD[Info] = rdd1.map(p => {
val info = Info(1, p.split(" ")(1).toInt)
myAccumulator.add(info)
info
})
value.collect()
val info = myAccumulator.value
println(s"count = ${info.count},sumAge = ${info.age} ")
}
}
case class Info(var count: Int, var age: Int)
/**
* A custom accumulator extends AccumulatorV2[Info, Info]: the first type parameter is the input type, the second is the output type.
*/
class SelfAccumulator extends AccumulatorV2[Info, Info] {
private var result: Info = new Info(0, 0)
// Whether the accumulator is at its zero value; this must agree with the value set in reset() below
override def isZero: Boolean = {
result.count == 0 && result.age == 0
}
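// Returns a copy of this accumulator; the Info is copied as well so that the two accumulators do not share mutable state
override def copy(): AccumulatorV2[Info, Info] = {
val newAccumulator = new SelfAccumulator()
newAccumulator.result = Info(result.count, result.age)
newAccumulator
}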
// Resets the accumulator to its zero value, which is the starting value for each partition of the RDD
override def reset(): Unit = {
result = new Info(0, 0)
}
// Accumulation logic applied within each partition
override def add(v: Info): Unit = {
result.count += v.count
result.age += v.age
}
// Final merge logic: combines the accumulator values of the individual partitions
override def merge(other: AccumulatorV2[Info, Info]): Unit = other match {
case o: SelfAccumulator => {
result.count += o.result.count
result.age += o.result.age
}
case _ => throw new UnsupportedOperationException(
s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}"
)
}
// The value finally returned to the caller
override def value: Info = result
}
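The interplay of reset, add, and merge in the class above can be traced without a cluster. The following is a rough, Spark-free sketch, assuming the three sample records are split into two partitions. `Stats` and `AccumulatorLifecycleSketch` are hypothetical names that only mimic the role of `Info` and `SelfAccumulator`, and the partition layout is made up for illustration.

```scala
// Hypothetical, Spark-free stand-in for the accumulated value (count and age sum).
final case class Stats(count: Int, ageSum: Int) {
  // Per-partition step: fold one person's age into the running value.
  def add(age: Int): Stats = Stats(count + 1, ageSum + age)
  // Merge step: combine two partial results.
  def merge(other: Stats): Stats = Stats(count + other.count, ageSum + other.ageSum)
}

object AccumulatorLifecycleSketch {
  // Zero value, playing the role of the state after reset().
  val zero: Stats = Stats(0, 0)

  // Every partition starts from zero and adds its own records;
  // the partial results are then merged into one total.
  def accumulate(partitions: Seq[Seq[String]]): Stats = {
    val partials = partitions.map { records =>
      records.foldLeft(zero)((acc, line) => acc.add(line.split(" ")(1).toInt))
    }
    partials.foldLeft(zero)((a, b) => a.merge(b))
  }

  def main(args: Array[String]): Unit = {
    // Assumed split of the sample data into two partitions.
    val partitions = Seq(Seq("zhangsan 18"), Seq("lisi 20", "wangwu 45"))
    val total = accumulate(partitions)
    println(s"count = ${total.count}, sumAge = ${total.ageSum}") // count = 3, sumAge = 83
  }
}
```

Because merge is a plain field-wise sum, the total does not depend on how the records are distributed across partitions.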