Big Data: Spark Accumulators

普罗米修斯之火

Category: spark. Tags: spark

Article link: https://blog.csdn.net/WuBoooo/article/details/108858229

Accumulators

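An accumulator is used to count records. It can count every record, or only the records that match a condition, for example the number of dirty records encountered while processing log data. Without an accumulator, getting both the dirty records and the useful data requires triggering two Jobs; with an accumulator, a single Job is enough.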

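An accumulator first accumulates inside each Task. When the Task results return to the Driver, the per-Task values are merged into the final result. Each Task works on its own independent copy (the accumulator is defined as a class rather than a singleton object, so every Task gets its own instance), which avoids thread-safety problems.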
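The "accumulate per Task, then merge on the Driver" behaviour can be imitated by hand. The sketch below is only an illustration: it uses `LongAccumulator` directly, outside of any Spark job, and the `partitions` input and the object name `AccumulatorMergeSketch` are hypothetical. In a real job Spark performs the copy and the merge for you.

```scala
import org.apache.spark.util.LongAccumulator

object AccumulatorMergeSketch {
  // partitions: hypothetical per-Task inputs, one inner Seq per Task
  def run(partitions: Seq[Seq[Long]]): LongAccumulator = {
    val driverAcc = new LongAccumulator        // plays the role of the Driver-side accumulator
    partitions.foreach { part =>
      val taskAcc = new LongAccumulator        // each Task starts from its own zero-valued copy
      part.foreach(v => taskAcc.add(v))        // updates touch only the Task-local copy
      driverAcc.merge(taskAcc)                 // the Driver folds the partial result in
    }
    driverAcc
  }

  def main(args: Array[String]): Unit = {
    val acc = run(Seq(Seq(1L, 2L), Seq(3L), Seq(4L, 5L)))
    println(s"sum=${acc.sum} count=${acc.count}")  // sum=15 count=5
  }
}
```

Because no two Tasks ever write to the same instance, no locking is needed while the data is being processed.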

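Using an accumulator is really a matter of capturing it in a closure. It is defined on the Driver, shipped to the Executors along with each Task, and used inside the Task. There it behaves like a Task-local variable that is updated as the Task processes records. Once an Action has been triggered, the Task values are sent back to the Driver and aggregated.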

Defining accumulators:

    // There are three built-in accumulators: longAccumulator, doubleAccumulator, and
    // collectionAccumulator. The argument in parentheses is the accumulator's name.
    val longAccumulator: LongAccumulator = sc.longAccumulator("longAccumulator")      // holds Long values

    val doubleAccumulator: DoubleAccumulator = sc.doubleAccumulator("doubleAccumulator") // holds Double values

    // Holds values of any type in a collection; the element type must be specified
    val collectionAccumulator: CollectionAccumulator[String] = sc.collectionAccumulator[String]("collectionAccumulator")

package com.doit.spark.day09

import java.lang

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.{CollectionAccumulator, DoubleAccumulator, LongAccumulator}

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("AccumulatorDemo")

    // args(0): pass "true" to run in local mode (the program expects this argument)
    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)


    val rdd1: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

    // There are three built-in accumulators: longAccumulator, doubleAccumulator, collectionAccumulator
    val longAccumulator: LongAccumulator = sc.longAccumulator("longAccumulator")      // holds Long values

    val doubleAccumulator: DoubleAccumulator = sc.doubleAccumulator("doubleAccumulator") // holds Double values

    // Holds values of any type in a collection; the element type must be specified
    val collectionAccumulator: CollectionAccumulator[String] = sc.collectionAccumulator[String]("collectionAccumulator")

    val rdd2: RDD[Int] = rdd1.map(x => {
      if(x % 2 == 0) {
        // Add a value to the accumulator
        longAccumulator.add(10L)
      }
      
      x * 5
    })
    // Trigger an Action
    println(rdd2.collect().toBuffer)  // ArrayBuffer(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)

    // Read the accumulated sum
    val value: lang.Long = longAccumulator.value

    // Number of values that were added to the accumulator
    val count: Long = longAccumulator.count

    // Average of the values that were added
    val avg: Double = longAccumulator.avg

    println(s"value:$value  count:$count  avg:$avg")  // value:50  count:5  avg:10.0

  }
}

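Note: the Action must be triggered before you read the accumulator. Until an Action runs, no real computation takes place, so nothing is sent back to the Driver and the accumulator still holds its initial value.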

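Also, if a program triggers several Jobs, the Tasks in all of those Jobs update the same accumulator, so the counts add up across Jobs instead of being overwritten. To avoid counting more than once, use cache or persist to keep the previous result, so that the transformation is not recomputed and the accumulator is not updated again.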
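Both notes can be checked with a small experiment. This is a minimal sketch, assuming a local master and tiny input so that cached partitions are never evicted; the object name `AccumulatorJobsSketch`, the `run` helper, and the accumulator name `evens` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorJobsSketch {
  // Returns the accumulator value before any Action, after one Action, and after two Actions.
  def run(sc: SparkContext, useCache: Boolean): (Long, Long, Long) = {
    val evens = sc.longAccumulator("evens")
    val mapped = sc.parallelize(1 to 10).map { x =>
      if (x % 2 == 0) evens.add(1L)   // count the even numbers as a side effect
      x * 5
    }
    val rdd = if (useCache) mapped.cache() else mapped

    val before = evens.sum            // no Action yet, so nothing has been computed
    rdd.collect()                     // first Job
    val afterFirst = evens.sum
    rdd.count()                       // second Job: re-runs the map unless the RDD is cached
    val afterSecond = evens.sum
    (before, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumulatorJobsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(run(sc, useCache = false))  // (0,5,10)
      println(run(sc, useCache = true))   // (0,5,5)
    } finally {
      sc.stop()
    }
  }
}
```

Caching is best effort: if a cached partition is evicted and has to be recomputed, the updates made inside the transformation are applied again.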

Using collectionAccumulator

package com.doit.spark.day09

import java.{lang, util}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.{CollectionAccumulator, DoubleAccumulator, LongAccumulator}

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("AccumulatorDemo")

    // args(0): pass "true" to run in local mode (the program expects this argument)
    val setMaster = args(0).toBoolean

    if (setMaster){
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)


    val rdd1: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

    // There are three built-in accumulators: longAccumulator, doubleAccumulator, collectionAccumulator
    val longAccumulator: LongAccumulator = sc.longAccumulator("longAccumulator")      // holds Long values

    val doubleAccumulator: DoubleAccumulator = sc.doubleAccumulator("doubleAccumulator") // holds Double values

    // Holds values of any type in a collection; the element type must be specified
    val collectionAccumulator: CollectionAccumulator[String] = sc.collectionAccumulator[String]("collectionAccumulator")

    val rdd2: RDD[Int] = rdd1.map(x => {
      if(x % 2 == 0) {
        collectionAccumulator.add("数字")
      }
      x * 5
    })
    // Trigger an Action
    println(rdd2.collect().toBuffer)  // ArrayBuffer(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)

    // Wrapping the accumulator in a List does not convert its contents;
    // printing the accumulator itself shows its id, name, and current value
    val list: List[CollectionAccumulator[String]] = List(collectionAccumulator)
    for (elem <- list) {
      println(elem)
      // CollectionAccumulator(id: 2, name: Some(collectionAccumulator), value: [数字, 数字, 数字, 数字, 数字])
    }

    // Read the collected values; the result is a java.util.List
    val value: util.List[String] = collectionAccumulator.value

    // A collection accumulator has no count method; take the size of the returned list instead
    println(value.size())  // 5
    value.forEach(println)  // prints 数字 five times, one per line

    // Iterate over the Java collection with Scala syntax
    import scala.collection.JavaConverters._
    for (e <- value.asScala) {
      println(e)
    }
  }
}

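When an accumulator is used to keep bad log records, for example JSON data, wrap the parsing step in a try block. Any record that fails to parse ends up in the catch block, where it is added to the accumulator, so the bad records can be retrieved afterwards. With a large amount of bad data this can run out of memory, so the better choice is to write the bad records to disk-based storage for error logs.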
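The try/catch pattern described above might look like the following. This is a sketch under stated assumptions: `parseRecord` is a hypothetical stand-in for real JSON parsing (it simply calls `toInt`), and the object name `BadRecordSketch` and the sample input are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object BadRecordSketch {
  // Hypothetical stand-in for JSON parsing: throws NumberFormatException on malformed input
  def parseRecord(line: String): Int = line.trim.toInt

  // Returns the parsed records and the raw lines that failed to parse, using a single Action
  def run(sc: SparkContext, lines: Seq[String]): (List[Int], List[String]) = {
    val badRecords = sc.collectionAccumulator[String]("badRecords")
    val good = sc.parallelize(lines).flatMap { line =>
      try {
        List(parseRecord(line))       // keep the successfully parsed record
      } catch {
        case _: NumberFormatException =>
          badRecords.add(line)        // remember the raw bad record
          Nil                         // and drop it from the output
      }
    }.collect().toList
    (good, badRecords.value.asScala.toList)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BadRecordSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (good, bad) = run(sc, Seq("1", "2", "oops", "4", "{broken"))
      println(s"good: $good")
      println(s"bad (order may vary between runs): $bad")
    } finally {
      sc.stop()
    }
  }
}
```

Every bad record collected this way ends up in Driver memory, which is why the advice above to write bad records to disk matters once their volume grows.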
