Spark's Accumulator and AccumulatorV2
1. Overview
An Accumulator lets you count properties of your data reliably across a distributed job. For example, you can count the sessions that match some condition, or how many purchases happened within a given time window.
def accumulator[T](initialValue: T, name: String)
initialValue: the initial value of the accumulator.
name: an optional name for the accumulator; a named accumulator is shown on the Task pages of the driver's web UI (port 4040), which helps you follow how the job is running.
2. Example
import org.apache.spark.{SparkConf, SparkContext}

object SparkRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkRdd")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0, "test1")
    val data = sc.parallelize(1 to 9)
    // an action is needed to trigger execution
    data.foreach(x => accum += 1)
    println(accum.value) // prints 9
    sc.stop()
  }
}
3. Caveats
- accum.value only reflects updates after an action has triggered execution.
- To guarantee a correct result, run only one action while the accumulator is in use; repeated actions can recompute the lineage and double-count.
object SparkAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkAccumulator")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0, "test2")
    val data = sc.parallelize(1 to 8)
    val data2 = data.map { x =>
      if (x % 2 == 0) {
        accum += 1
        0
      } else 1
    }
    // an action triggers execution
    println(data2.count)  // prints 8
    println(accum.value)  // prints 4
    println(data2.count)  // prints 8
    println(accum.value)  // prints 8
    sc.stop()
  }
}
The second call to data2.count recomputes data2 from its lineage, so accum += 1 runs a second time for every even element.
Fix: break the dependency chain by caching the RDD with cache() or persist() before the first action.
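A minimal sketch of the fix (the object name is made up for illustration, and it uses the newer longAccumulator API described in the next section): caching data2 before the first action means the second count reads cached partitions instead of re-running the map, so the accumulator is incremented only once per element.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedAccumulator {
  // returns the accumulator value after running count() twice
  def run(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CachedAccumulator")
    val sc = new SparkContext(conf)
    try {
      val accum = sc.longAccumulator("test2")
      val data2 = sc.parallelize(1 to 8).map { x =>
        if (x % 2 == 0) { accum.add(1); 0 } else 1
      }
      data2.cache()  // break the lineage dependency: later actions reuse cached partitions
      data2.count()  // triggers the map; accum becomes 4
      data2.count()  // served from the cache; accum stays 4
      accum.value.longValue()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"accum = ${run()}") // prints 4 with caching (8 without)
}
```

Note that caching is best-effort: if a cached partition is evicted, Spark recomputes it from the lineage and the accumulator can over-count again, so accumulators are safest for debugging and rough metrics rather than exact business counts.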
4. AccumulatorV2
Accumulator has been deprecated since Spark 2.0; use AccumulatorV2 instead.
/**
* The base class for accumulators, that can accumulate inputs of type `IN`,
* and produce output of type `OUT`.
*/
abstract class AccumulatorV2[IN, OUT] extends Serializable { }
Changes compared to Accumulator
- No initial-value parameter is required; the accumulator starts from zero by default.
- You can give the accumulator a name when creating it; a named accumulator shows its per-task values on the Task pages of the driver's web UI (port 4040). Unnamed accumulators do not appear in the web UI.
- A new reset method sets the accumulator back to zero.
- There are two ways to obtain an instance:
val accumulator = sc.longAccumulator("test")
or
val accumulator = new LongAccumulator()
sc.register(accumulator, "test")
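Both ways, plus the new reset method, can be exercised in one small program (the object name is made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

object RegisterAccumulator {
  // returns (acc1's sum, acc2's sum, whether acc2 is zero after reset())
  def run(): (Long, Long, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RegisterAccumulator")
    val sc = new SparkContext(conf)
    try {
      // way 1: convenience method on SparkContext (creates and registers in one call)
      val acc1 = sc.longAccumulator("test")

      // way 2: instantiate it yourself, then register it with a name
      val acc2 = new LongAccumulator()
      sc.register(acc2, "test2")

      sc.parallelize(1 to 5).foreach { x => acc1.add(x); acc2.add(x) }
      val sums = (acc1.value.longValue(), acc2.value.longValue())

      acc2.reset() // new in AccumulatorV2: back to the zero state
      (sums._1, sums._2, acc2.isZero)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // (15,15,true)
}
```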
LongAccumulator source code
/**
* An [[AccumulatorV2 accumulator]] for computing sum, count, and averages for 64-bit integers.
*
* @since 2.0.0
*/
class LongAccumulator extends AccumulatorV2[jl.Long, jl.Long] {
  private var _sum = 0L
  private var _count = 0L

  /**
   * Returns false if this accumulator has had any values added to it or the sum is non-zero.
   * @since 2.0.0
   */
  override def isZero: Boolean = _sum == 0L && _count == 0

  override def copy(): LongAccumulator = {
    val newAcc = new LongAccumulator
    newAcc._count = this._count
    newAcc._sum = this._sum
    newAcc
  }

  override def reset(): Unit = {
    _sum = 0L
    _count = 0L
  }

  /**
   * Adds v to the accumulator, i.e. increment sum by v and count by 1.
   * @since 2.0.0
   */
  override def add(v: jl.Long): Unit = {
    _sum += v
    _count += 1
  }

  /**
   * Adds v to the accumulator, i.e. increment sum by v and count by 1.
   * @since 2.0.0
   */
  def add(v: Long): Unit = {
    _sum += v
    _count += 1
  }

  /**
   * Returns the number of elements added to the accumulator.
   * @since 2.0.0
   */
  def count: Long = _count

  /**
   * Returns the sum of elements added to the accumulator.
   * @since 2.0.0
   */
  def sum: Long = _sum

  /**
   * Returns the average of elements added to the accumulator.
   * @since 2.0.0
   */
  def avg: Double = _sum.toDouble / _count

  override def merge(other: AccumulatorV2[jl.Long, jl.Long]): Unit = other match {
    case o: LongAccumulator =>
      _sum += o.sum
      _count += o.count
    case _ =>
      throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  private[spark] def setValue(newValue: Long): Unit = _sum = newValue

  override def value: jl.Long = _sum
}
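To write your own accumulator, subclass AccumulatorV2 and override the same members LongAccumulator does above: isZero, copy, reset, add, merge, and value. A sketch (the class and object names are made up for illustration): an accumulator that collects the distinct strings seen across all tasks into a set.

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2

// Accumulates distinct strings seen across tasks into a set.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val _set = mutable.Set.empty[String]

  override def isZero: Boolean = _set.isEmpty

  override def copy(): StringSetAccumulator = {
    val newAcc = new StringSetAccumulator
    newAcc._set ++= _set
    newAcc
  }

  override def reset(): Unit = _set.clear()

  override def add(v: String): Unit = _set += v

  // called on the driver to combine the per-task copies
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = other match {
    case o: StringSetAccumulator => _set ++= o._set
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  override def value: Set[String] = _set.toSet
}

object StringSetAccumulatorDemo {
  def run(): Set[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StringSetAccumulatorDemo")
    val sc = new SparkContext(conf)
    try {
      val acc = new StringSetAccumulator
      sc.register(acc, "parities")
      sc.parallelize(1 to 8).foreach(x => acc.add(if (x % 2 == 0) "even" else "odd"))
      acc.value
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // Set(even, odd)
}
```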
Example
import org.apache.spark.{SparkConf, SparkContext}

object MyAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MyAccumulator")
    val sc = new SparkContext(conf)
    val accumulator = sc.longAccumulator("count")
    val rdd1 = sc.parallelize(10 to 100).map { x =>
      if (x % 2 == 0) {
        accumulator.add(1)
      }
    }
    println("count = " + rdd1.count())            // count = 91
    println("accumulator = " + accumulator.value) // accumulator = 46
    sc.stop()
  }
}