Spark's Accumulator and AccumulatorV2
1. Overview
An Accumulator lets you count properties of your data reliably across a distributed job. For example, you can count the sessions that match some condition, or how many purchases happened within a given time window.
def accumulator[T](initialValue: T, name: String)
initialValue: the initial value of the accumulator.
name: an optional name for the accumulator; a named accumulator is shown on the Task pages of the driver's web UI (port 4040), which helps you follow how the job is running.
2. Example
import org.apache.spark.{SparkConf, SparkContext}

object SparkRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkRdd")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0, "test1")
    val data = sc.parallelize(1 to 9)
    // an action is needed to trigger execution
    data.foreach(x => accum += 1)
    println(accum.value) // prints 9
    sc.stop()
  }
}
3. Caveats
- accum.value only reflects updates after an action has triggered execution.
- To guarantee a correct result, run only one action while the accumulator is in use; repeated actions can recompute the lineage and double-count.
object SparkAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SparkAccumulator")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0, "test2")
    val data = sc.parallelize(1 to 8)
    val data2 = data.map { x =>
      if (x % 2 == 0) {
        accum += 1
        0
      } else 1
    }
    // an action triggers execution
    println(data2.count)  // prints 8
    println(accum.value)  // prints 4
    println(data2.count)  // prints 8
    println(accum.value)  // prints 8
    sc.stop()
  }
}
The second call to data2.count recomputes data2 from its lineage, so accum += 1 runs a second time for every even element.
Fix: break the dependency chain by caching the RDD with cache() or persist() before the first action.
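A minimal sketch of the fix (the object name is made up for illustration, and it uses the newer longAccumulator API described in the next section): caching data2 before the first action means the second count reads cached partitions instead of re-running the map, so the accumulator is incremented only once per element.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedAccumulator {
  // returns the accumulator value after running count() twice
  def run(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CachedAccumulator")
    val sc = new SparkContext(conf)
    try {
      val accum = sc.longAccumulator("test2")
      val data2 = sc.parallelize(1 to 8).map { x =>
        if (x % 2 == 0) { accum.add(1); 0 } else 1
      }
      data2.cache()  // break the lineage dependency: later actions reuse cached partitions
      data2.count()  // triggers the map; accum becomes 4
      data2.count()  // served from the cache; accum stays 4
      accum.value.longValue()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"accum = ${run()}") // prints 4 with caching (8 without)
}
```

Note that caching is best-effort: if a cached partition is evicted, Spark recomputes it from the lineage and the accumulator can over-count again, so accumulators are safest for debugging and rough metrics rather than exact business counts.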
4. AccumulatorV2
Accumulator has been deprecated since Spark 2.0; use AccumulatorV2 instead.
/**
* The base class for accumulators, that can accumulate inputs of type `IN`,
* and produce output of type `OUT`.
*/
abstract class AccumulatorV2[IN, OUT] extends Serializable { }
Changes compared to Accumulator
- No initial-value parameter is required; the accumulator starts from zero by default.
- You can give the accumulator a name when creating it; a named accumulator shows its per-task values on the Task pages of the driver's web UI (port 4040). Unnamed accumulators do not appear in the web UI.
- A new reset method sets the accumulator back to zero.
- There are two ways to obtain an instance:
val accumulator = sc.longAccumulator("test")
or
val accumulator = new LongAccumulator()
sc.register(accumulator, "test")
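Both ways, plus the new reset method, can be exercised in one small program (the object name is made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

object RegisterAccumulator {
  // returns (acc1's sum, acc2's sum, whether acc2 is zero after reset())
  def run(): (Long, Long, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RegisterAccumulator")
    val sc = new SparkContext(conf)
    try {
      // way 1: convenience method on SparkContext (creates and registers in one call)
      val acc1 = sc.longAccumulator("test")

      // way 2: instantiate it yourself, then register it with a name
      val acc2 = new LongAccumulator()
      sc.register(acc2, "test2")

      sc.parallelize(1 to 5).foreach { x => acc1.add(x); acc2.add(x) }
      val sums = (acc1.value.longValue(), acc2.value.longValue())

      acc2.reset() // new in AccumulatorV2: back to the zero state
      (sums._1, sums._2, acc2.isZero)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // (15,15,true)
}
```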
LongAccumulator source code
/**
* An [[AccumulatorV2 accumulator]] for computing sum, count, and averages for 64-bit integers.
*
* @since 2.0.0
*/
class LongAccumulator extends AccumulatorV2[jl.Long, jl.Long] {
  private var _sum = 0L
  private var _count = 0L

  /**
   * Returns false if this accumulator has had any values added to it or the sum is non-zero.
   * @since 2.0.0
   */
  override def isZero: Boolean = _sum == 0L && _count == 0

  override def copy(): LongAccumulator = {
    val newAcc = new LongAccumulator
    newAcc._count = this._count
    newAcc._sum = this._sum
    newAcc
  }

  override def reset(): Unit = {
    _sum = 0L
    _count = 0L
  }

  /**
   * Adds v to the accumulator, i.e. increment sum by v and count by 1.
   * @since 2.0.0
   */
  override def add(v: jl.Long): Unit = {
    _sum += v
    _count += 1
  }

  /**
   * Adds v to the accumulator, i.e. increment sum by v and count by 1.
   * @since 2.0.0
   */
  def add(v: Long): Unit = {
    _sum += v
    _count += 1
  }

  /**
   * Returns the number of elements added to the accumulator.
   * @since 2.0.0
   */
  def count: Long = _count

  /**
   * Returns the sum of elements added to the accumulator.
   * @since 2.0.0
   */
  def sum: Long = _sum

  /**
   * Returns the average of elements added to the accumulator.
   * @since 2.0.0
   */
  def avg: Double = _sum.toDouble / _count

  override def merge(other: AccumulatorV2[jl.Long, jl.Long]): Unit = other match {
    case o: LongAccumulator =>
      _sum += o.sum
      _count += o.count
    case _ =>
      throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  private[spark] def setValue(newValue: Long): Unit = _sum = newValue

  override def value: jl.Long = _sum
}
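To write your own accumulator, subclass AccumulatorV2 and override the same members LongAccumulator does above: isZero, copy, reset, add, merge, and value. A sketch (the class and object names are made up for illustration): an accumulator that collects the distinct strings seen across all tasks into a set.

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2

// Accumulates distinct strings seen across tasks into a set.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val _set = mutable.Set.empty[String]

  override def isZero: Boolean = _set.isEmpty

  override def copy(): StringSetAccumulator = {
    val newAcc = new StringSetAccumulator
    newAcc._set ++= _set
    newAcc
  }

  override def reset(): Unit = _set.clear()

  override def add(v: String): Unit = _set += v

  // called on the driver to combine the per-task copies
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = other match {
    case o: StringSetAccumulator => _set ++= o._set
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }

  override def value: Set[String] = _set.toSet
}

object StringSetAccumulatorDemo {
  def run(): Set[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StringSetAccumulatorDemo")
    val sc = new SparkContext(conf)
    try {
      val acc = new StringSetAccumulator
      sc.register(acc, "parities")
      sc.parallelize(1 to 8).foreach(x => acc.add(if (x % 2 == 0) "even" else "odd"))
      acc.value
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // Set(even, odd)
}
```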
Example
import org.apache.spark.{SparkConf, SparkContext}

object MyAccumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MyAccumulator")
    val sc = new SparkContext(conf)
    val accumulator = sc.longAccumulator("count")
    val rdd1 = sc.parallelize(10 to 100).map { x =>
      if (x % 2 == 0) {
        accumulator.add(1)
      }
    }
    println("count = " + rdd1.count())            // count = 91
    println("accumulator = " + accumulator.value) // accumulator = 46
    sc.stop()
  }
}