Spark Accumulators

鸭梨山大哎

Article link: https://blog.csdn.net/u010711495/article/details/109988098

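What is an accumulator?

An accumulator aggregates information from the tasks running on the executors back to the driver.
1 Code inside an operator (for example foreach) cannot change the value of a variable defined in the driver.
2 Each task works on its own copy of any driver variable captured by the operator's closure.
3 To get results into a driver variable, the data normally has to be collected to the driver first.
4 Besides collecting, Spark's accumulators (one of its two kinds of shared variables, the other being broadcast variables) let tasks update a value that the driver can then read.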

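Why are accumulators needed?

An operator works on a copy of every driver variable it captures, so updates made inside the operator never reach the driver: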

import org.apache.spark.{SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // sum lives on the driver
    var sum = 0
    // the function passed to foreach runs inside tasks on the executors
    arr.foreach(x => {
      // each task receives a serialized copy of sum (initially 0) and adds to that copy;
      // nothing is written back to the driver's sum
      sum += x
    })
    // prints the driver's own sum, which is still 0
    println(sum) // 0
    sc.stop()
  }
}
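The effect shown above can be mimicked without a cluster. The following is a minimal sketch, assuming that shipping a captured variable to an executor behaves roughly like a Java serialization round trip. Counter and shipToExecutor are hypothetical names used only for illustration; they are not Spark APIs.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureCopySketch {
  // Hypothetical stand-in for a driver variable captured by a task closure.
  class Counter(var sum: Int) extends Serializable

  // Serialize and deserialize the object, which yields an independent copy
  // (an assumption standing in for what happens when a task is shipped).
  def shipToExecutor[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverCounter = new Counter(0)
    val executorCounter = shipToExecutor(driverCounter)
    // The "task" only ever updates its own copy.
    (1 to 8).foreach(x => executorCounter.sum += x)
    println(executorCounter.sum) // 36
    println(driverCounter.sum)   // 0
  }
}
```

The copy accumulates 36 while the original stays at 0, which is the same outcome as the Spark listing.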

Summing without an accumulator

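import org.apache.spark.{SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // sum lives on the driver
    var sum = 0
    // collect() brings every element back to the driver, so this foreach
    // is an ordinary Scala loop over a local array, not a Spark operator
    arr.collect().foreach(x => {
      sum += x
    })
    println(sum) // 36
    sc.stop()
  }
}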
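Collecting moves every element to the driver, which only suits small data sets. As a hedged alternative that also needs no accumulator, the sum can be computed with the RDD's reduce action, which combines values inside each partition and returns only the final result. The object and method names below are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceSum {
  // Sums the values with reduce; only the combined result returns to the driver.
  def sumWithReduce(values: Array[Int]): Int = {
    val conf = new SparkConf().setAppName("reduce-sum").setMaster("local")
    val sc = new SparkContext(conf)
    try sc.parallelize(values, 2).reduce(_ + _)
    finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    println(sumWithReduce(Array(1, 2, 3, 4, 5, 6, 7, 8))) // 36
  }
}
```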

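Legacy accumulator

The legacy accumulator handles simple aggregations such as sums. This API was deprecated in Spark 2.0 and removed in Spark 3.0, so the listing below only compiles against Spark 1.x or 2.x.
SparkContext has an accumulator method.
Pass an initial value when calling it.
To accumulate, call the accumulator's add method.
To read the result on the driver, call the accumulator's value method.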

import org.apache.spark.{Accumulator, SparkConf, SparkContext}

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("count").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // create a legacy accumulator with initial value 0
    val myacc: Accumulator[Int] = sc.accumulator(0)
    arr.foreach(x => {
      myacc.add(x)
    })
    println(myacc.value) // 36
    sc.stop()
  }
}

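Newer accumulator: AccumulatorV2

AccumulatorV2 is an abstract class.
Spark ships ready-made subclasses such as CollectionAccumulator, DoubleAccumulator and LongAccumulator.
To use one, create an instance of the subclass and register it with the SparkContext.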

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("sum").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // create the accumulator object
    // (from the Spark docs: "An accumulator for computing sum, count, and average of 64-bit integers.")
    val myAcc = new LongAccumulator()
    // register the accumulator with the context under the name "sum"
    sc.register(myAcc, "sum")
    arr.foreach(x => {
      myAcc.add(x)
    })
    println(myAcc.value) // 36
    sc.stop()
  }
}
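As a complement to creating and registering an accumulator by hand, SparkContext also provides the factory methods longAccumulator, doubleAccumulator and collectionAccumulator, which return an accumulator that is already registered. The sketch below uses all three; the object name, the method name run and the accumulator names are chosen for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BuiltInAccumulators {
  // Returns (sum, count, sum of halves, number of even values).
  def run(values: Array[Int]): (Long, Long, Double, Int) = {
    val conf = new SparkConf().setAppName("builtin-acc").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Each factory method creates the accumulator and registers it in one step.
      val total = sc.longAccumulator("total")
      val halves = sc.doubleAccumulator("halves")
      val evens = sc.collectionAccumulator[Int]("evens")

      sc.parallelize(values, 2).foreach { x =>
        total.add(x)
        halves.add(x / 2.0)
        if (x % 2 == 0) evens.add(x)
      }
      // evens.value is a java.util.List; its element order depends on task completion.
      (total.sum, total.count, halves.sum, evens.value.size)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    println(run(Array(1, 2, 3, 4, 5, 6, 7, 8))) // (36,8,18.0,4)
  }
}
```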

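Custom accumulator

1 Extend AccumulatorV2.
2 Specify the two type parameters: the first is the input type (what add accepts), the second is the output type (what value returns).
3 Define a member variable to hold the accumulated state, and implement isZero, copy, reset, add, merge and value against it.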

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2 // used by SumAccumulator

object Test_021 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("sum").setMaster("local")
    val sc = new SparkContext(conf)
    val arr = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
    // create an instance of the custom accumulator
    val myAcc = new SumAccumulator
    // register the accumulator with the context under the name "sum"
    sc.register(myAcc, "sum")
    arr.foreach(x => {
      myAcc.add(x)
    })
    println(myAcc.value) // 36
    sc.stop()
  }
}

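class SumAccumulator extends AccumulatorV2[Long, Long] {
  // holds the accumulated result
  var sum: Long = 0

  // whether the accumulator is in its zero (empty) state; true means empty
  override def isZero: Boolean = {
    // a sum of 0 is treated as "nothing accumulated yet"
    sum == 0
  }

  // create a new accumulator object carrying the same value;
  // Spark uses copies when it ships the accumulator to tasks
  override def copy(): AccumulatorV2[Long, Long] = {
    val other = new SumAccumulator
    // copy the current value into the new accumulator
    other.sum = this.sum
    other
  }

  // reset the accumulator to its initial value
  override def reset(): Unit = {
    sum = 0
  }

  // add one input value to the accumulated result
  override def add(v: Long): Unit = {
    sum += v
  }

  // merge another accumulator's value into this one (used to combine per-task results)
  override def merge(other: AccumulatorV2[Long, Long]): Unit = {
    sum += other.value
  }

  // the current result
  override def value: Long = {
    sum
  }
}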
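How copy, reset, add and merge fit together can be walked through by hand on the driver. The sketch below is a simplification, not Spark's actual scheduling code: it assumes each partition gets a reset copy of the driver's accumulator, adds its own values, and is then merged back. It uses the built-in LongAccumulator so that it is self-contained; the object and method names are illustrative.

```scala
import org.apache.spark.util.LongAccumulator

object MergeSketch {
  // Simulates per-partition accumulation followed by a merge on the driver.
  def mergeSimulated(partitions: Seq[Seq[Long]]): Long = {
    val driverAcc = new LongAccumulator

    val taskAccs = partitions.map { part =>
      // each "task" starts from an empty copy of the driver's accumulator
      val taskAcc = driverAcc.copy()
      taskAcc.reset()
      part.foreach(v => taskAcc.add(v))
      taskAcc
    }

    // the driver folds every task's partial result into its own accumulator
    taskAccs.foreach(acc => driverAcc.merge(acc))
    driverAcc.sum
  }

  def main(args: Array[String]): Unit = {
    println(mergeSimulated(Seq(Seq(1L, 2L, 3L, 4L), Seq(5L, 6L, 7L, 8L)))) // 36
  }
}
```

The two partial sums, 10 and 26, are merged into 36, which matches the result of the Spark job.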

Counting words with a custom accumulator

import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable
// count words with an accumulator
object _TestAcc {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("wordcount").setMaster("local")
    val sc = new SparkContext(conf)
    val words: RDD[String] = sc.parallelize(Array("hello", "word", "hello", "word", "kitty", "word"))
    val myAcc = new WordCountAccumulator
    sc.register(myAcc)
    words.foreach(myAcc.add)
    for (elem <- myAcc.value) {
      println(elem)
    }
  }
}

class WordCountAccumulator extends AccumulatorV2[String, mutable.HashMap[String, Int]] {
  // state held by this accumulator: word -> count
  var map = new mutable.HashMap[String, Int]()

  override def isZero: Boolean = {
    map.isEmpty
  }

  override def copy(): AccumulatorV2[String, mutable.HashMap[String, Int]] = {
    val newAcc = new WordCountAccumulator
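    // copy the entries; assigning this.map directly would make both accumulators
    // share one mutable map, so reset() on the copy would also clear the original
    newAcc.map = this.map.clone()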
    newAcc
  }

  override def reset(): Unit = {
    map.clear()
  }

  override def add(v: String): Unit = {
    // Per-partition accumulation: if the word is not in the map yet, store 1;
    // otherwise take its current count and add 1.
    // Matching on map.get has only two cases: None or Some(value).
    map.get(v) match {
      case None => map.put(v, 1)
      case Some(x) => map.put(v, x + 1)
    }
  }

  override def merge(other: AccumulatorV2[String, mutable.HashMap[String, Int]]): Unit = {
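    // When two accumulators are merged, counts for words present in both are added together;
    // words that only appear in other are inserted with their count unchanged.
    for (elem <- other.value) {
      // elem is one (word, count) pair from other
      // check whether this accumulator's map already contains the word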
      map.get(elem._1) match {
        case Some(e) => map.put(elem._1, e + elem._2)
        case None => map.put(elem._1, elem._2)
      }
    }
  }

  override def value: mutable.HashMap[String, Int] = {
    map
  }
}
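One detail deserves care when the state is a mutable collection: copy() should return an accumulator whose map is independent of the original. Spark's copyAndReset calls copy() and then reset() on the result, so a shared map would be cleared on both sides. The plain Scala sketch below (no Spark needed, names are illustrative) shows the difference between sharing a reference and cloning.

```scala
import scala.collection.mutable

object MapCopySketch {
  // Returns (is the original empty after clearing the alias?, count kept by the clone).
  def demo(): (Boolean, Int) = {
    val original = mutable.HashMap("hello" -> 2)

    val alias = original               // same underlying map as original
    val independent = original.clone() // separate map with the same entries

    alias.clear()                      // also empties original
    (original.isEmpty, independent("hello"))
  }

  def main(args: Array[String]): Unit = {
    println(demo()) // (true,2)
  }
}
```

Clearing the alias wipes the original, while the clone keeps its entry, which is why a copy() that clones the map is the safer choice.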