Spark Accumulators

Latest recommended article published on 2022-06-27 21:08:25

尘世壹俗人

Article link: https://blog.csdn.net/dudadudadd/article/details/113759554

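A Spark accumulator is a shared object that the Driver creates and hands to every executor, so that tasks running on different executors can all contribute to one piece of aggregate information, such as how many times a particular record appears.

At this point someone will ask: why not just declare a variable in the Driver code and increment it? That is a common misconception. When Spark runs a job, whatever a task needs from the Driver is shipped to the executor as a copy. The task never works on the Driver's original object.

Here is an example.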

package com.wy

import org.apache.spark.{SparkConf, SparkContext}

object Accumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("accumulator")
    val sc = new SparkContext(conf)
    
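    // A plain Driver-side variable; every task receives its own copy of it
    var sum = 0

    sc.textFile("./words.txt").foreach { x =>
      sum = sum + 1
      println(sum)
    }

    // The Driver's sum is typically still 0 here: task-side increments never come back
    println(sum + "================================")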
    sc.stop()
  }
}

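Run this and sum is still 0 at the end, even though the value printed inside each task increases as expected. Each task increments its own copy of the variable, and nothing sends those changes back to the Driver, which is exactly the problem described above. So this approach cannot be used.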
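The copy behaviour can be reproduced without Spark at all. The following is only a rough analogy, assuming plain Java serialization stands in for the way a task's captured state travels from the Driver to an executor; `Counter`, `CopySemantics` and `shipToExecutor` are hypothetical names made up for this sketch.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for the state captured by a task closure.
class Counter extends Serializable {
  var sum: Int = 0
}

object CopySemantics {
  // Round-trips an object through Java serialization. This is an assumed,
  // simplified model of shipping state from the Driver to an executor.
  def shipToExecutor[T <: Serializable](obj: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverSide = new Counter
    val executorSide = shipToExecutor(driverSide)

    // The "task" increments its own copy three times.
    (1 to 3).foreach(_ => executorSide.sum += 1)

    println(s"executor copy:   ${executorSide.sum}") // 3
    println(s"driver original: ${driverSide.sum}")   // 0
  }
}
```

The deserialized object is a separate instance, so changes made to it are invisible to the original unless something explicitly sends them back. Accumulators add that return path.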

The original accumulator API was very simple and did only one thing, adding up a value, as shown here:

package com.wy

import org.apache.spark.{SparkConf, SparkContext}

object Accumulator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("accumulator")
    val sc = new SparkContext(conf)
    // Old Spark 1.x API: deprecated since Spark 2.0 and removed in Spark 3.0
    val accumulator = sc.accumulator(0, "MyA")

    sc.textFile("./words.txt").foreach { x => accumulator.add(1) }

    println(accumulator.value)
    sc.stop()
  }
}

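This API is no longer recommended, however (it was deprecated in Spark 2.0 and removed in Spark 3.0). It has been reworked and split into three more specific accumulators: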

org.apache.spark.util.CollectionAccumulator;
org.apache.spark.util.DoubleAccumulator;
org.apache.spark.util.LongAccumulator;

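To create one, call the corresponding SparkContext method, whose name is the class name in lowerCamelCase. Note that CollectionAccumulator should be given an explicit type parameter for the elements it stores. Leaving it out can work, but the element type is then left to inference, which sometimes causes puzzling problems.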

import scala.collection.mutable

val accum = sc.collectionAccumulator[mutable.Map[String, String]]("MyCo")
val longAccumulator = sc.longAccumulator("MyLong")
val doubleAccumulator = sc.doubleAccumulator("MyDouble")

After that, use them as you would expect:

longAccumulator.add(1)
doubleAccumulator.add(1.2)
accum.add(mutable.Map("111" -> "222"))
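To put the fragments above into one runnable program, here is a minimal sketch that uses all three built-in accumulators and reads them back on the Driver. The object name `BuiltInAccumulators`, the helper `summarize`, the accumulator names and the sample lines are assumptions made for illustration. The `sum` and `value` members used to read the results are part of the Spark 2.x+ accumulator classes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BuiltInAccumulators {

  // Returns (number of lines, total characters, number of lines starting with "error").
  def summarize(sc: SparkContext, lines: Seq[String]): (Long, Double, Int) = {
    val lineCount = sc.longAccumulator("lineCount")
    val totalLength = sc.doubleAccumulator("totalLength")
    val errorLines = sc.collectionAccumulator[String]("errorLines")

    // foreach is an action, so the updates below are applied when it runs.
    sc.parallelize(lines, numSlices = 2).foreach { line =>
      lineCount.add(1L)
      totalLength.add(line.length.toDouble)
      if (line.startsWith("error")) errorLines.add(line)
    }

    // Results are read on the Driver; errorLines.value is a java.util.List[String].
    (lineCount.sum, totalLength.sum, errorLines.value.size())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("built-in-accumulators")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data
      val sample = Seq("ok", "error: disk", "ok", "error: net")
      val (count, length, errors) = summarize(sc, sample)
      println(s"lines=$count totalLength=$length errorLines=$errors")
    } finally {
      sc.stop()
    }
  }
}
```

The updates here happen inside an action (`foreach`). If an accumulator is updated inside a lazy transformation such as `map`, nothing is added until an action runs, and re-executed tasks may apply the update more than once.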

Spark also supports custom accumulators. Be aware that Spark offers two APIs for this, Accumulator and AccumulatorV2. Extend AccumulatorV2; Accumulator is old and deprecated.

package com.dtdream.driver

import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

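// Extend AccumulatorV2[IN, OUT], fixing the input and output types.
// This accumulator sums the Int values added to it.
class MyCount extends AccumulatorV2[Int, Int] {

  // Running total held by this instance
  var result = 0

  /**
   * isZero reports whether the accumulator is in its initial ("zero") state.
   * Spark sends each task a copy made by copy() followed by reset() and checks
   * that isZero is true for that copy, so every task starts from 0 and only
   * accumulates what it computes itself.
   *
   * @return true if nothing has been accumulated yet
   */
  override def isZero: Boolean = {
    result == 0
  }

  /**
   * Creates a new accumulator carrying the current state. Spark uses it when
   * preparing the copy sent to executors because, as noted at the start,
   * tasks receive copies rather than the Driver's original object.
   *
   * @return a copy of this accumulator
   */
  override def copy(): AccumulatorV2[Int, Int] = {
    val r = new MyCount()
    r.result = this.result
    r
  }

  /**
   * Resets the accumulator to its zero state, so that isZero returns true.
   */
  override def reset(): Unit = {
    result = 0
  }

  /**
   * Defines how one input value is accumulated.
   *
   * @param v the value to add
   */
  override def add(v: Int): Unit = {
    result += v
  }

  /**
   * Merges another accumulator of the same type into this one. Spark calls it
   * on the Driver to fold in the partial result reported by each task.
   *
   * @param other the accumulator to merge in
   */
  override def merge(other: AccumulatorV2[Int, Int]): Unit = {
    result += other.value
  }

  // Returns the current result
  override def value: Int = result
}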

It is used like this:

object TextDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TextDriver").setMaster("local[1]")
    val context = new SparkContext(conf)
    // A custom accumulator must be registered before tasks use it
    val count = new MyCount
    context.register(count)

    val data: RDD[Int] = context.parallelize(Array(1, 1, 1, 1, 1))

    data.foreach(count.add(_))

    println(s"Accumulator result: ${count.value}")

    context.stop()
  }
}
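To make the roles of copy, reset, add and merge more concrete, the sketch below drives a custom accumulator by hand, with no job running. It assumes only that the Spark core library is on the classpath. `WordCountAccumulator` and `WordCountAccumulatorDemo` are hypothetical names, and the manual calls merely imitate, in simplified form, what Spark does internally. `copyAndReset()` is an existing AccumulatorV2 method that by default calls copy() and then reset().

```scala
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Hypothetical custom accumulator: counts how often each word is added.
class WordCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]

  override def isZero: Boolean = counts.isEmpty

  override def copy(): WordCountAccumulator = {
    val other = new WordCountAccumulator
    other.counts ++= counts
    other
  }

  override def reset(): Unit = counts.clear()

  override def add(word: String): Unit =
    counts(word) = counts.getOrElse(word, 0L) + 1L

  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (word, n) =>
      counts(word) = counts.getOrElse(word, 0L) + n
    }

  // Hand out an immutable snapshot so callers cannot modify internal state.
  override def value: Map[String, Long] = counts.toMap
}

object WordCountAccumulatorDemo {
  def main(args: Array[String]): Unit = {
    val driverAcc = new WordCountAccumulator

    // Imitates two tasks: each one gets a zeroed copy of the Driver's accumulator.
    val task1 = driverAcc.copyAndReset()
    val task2 = driverAcc.copyAndReset()
    Seq("a", "b", "a").foreach(task1.add)
    Seq("b", "c").foreach(task2.add)

    // Imitates the Driver folding in each task's partial result.
    driverAcc.merge(task1)
    driverAcc.merge(task2)

    println(driverAcc.value) // contains a -> 2, b -> 2, c -> 1
  }
}
```

In a real job the accumulator would be registered with `SparkContext.register` first, as in the usage example above, and Spark would take care of the copying and merging.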