Real-Time Stream Processing Study (5) - Spark Streaming Advanced (stateful operators, writing to MySQL, window, blacklist filtering)

More to dig into... Code repository: https://github.com/vicotorz/sparkStreaming

  1. Stateful operators: updateStateByKey (accumulating state across batches)

     If a stateful operator is used, a checkpoint directory must be set (temporary files that hold each batch's state).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Stateful word count with Spark Streaming
  */
object StatefulWordCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // A stateful operator requires a checkpoint directory
    // In production, point the checkpoint at a directory on HDFS
    ssc.checkpoint(".")

    val lines = ssc.socketTextStream("localhost", 6789)

    val result = lines.flatMap(_.split(" ")).map((_, 1))
    val state = result.updateStateByKey[Int](updateFunction _)

    state.print()

    ssc.start()
    ssc.awaitTermination()
  }


  /**
    * Merge the current batch's values into the existing (old) state
    * @param currentValues values for this key in the current batch
    * @param preValues     previously accumulated state for this key
    * @return the updated state
    */
  def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    val current = currentValues.sum
    val pre = preValues.getOrElse(0)

    Some(current + pre)
  }
}
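To try this locally, feed the socket with netcat (nc -lk 6789) and type words across several batches: unlike a plain reduceByKey, the printed counts keep growing, because updateStateByKey folds each 5-second batch into the checkpointed state instead of resetting it.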

 

      2. Writing the results to MySQL (foreachRDD)

Example table: create table wordcount(word varchar(50) default null, wordcount int(10) default null);

result.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // one connection per partition, created on the executor side
    val connection = createConnection()
    partitionOfRecords.foreach(record => {
      val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"
      connection.createStatement().execute(sql)
    })
    connection.close()
  })
})
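The createConnection() helper is not shown in the post; a minimal JDBC sketch, assuming a local MySQL database named spark and the MySQL driver on the classpath (the URL, user, and password are placeholders to adapt):

import java.sql.{Connection, DriverManager}

// Hypothetical helper: adjust the URL and credentials for your environment
def createConnection(): Connection = {
  Class.forName("com.mysql.jdbc.Driver")
  DriverManager.getConnection("jdbc:mysql://localhost:3306/spark", "root", "root")
}

Opening one connection per partition (rather than per record) keeps the connection count manageable; a connection pool and a PreparedStatement instead of string concatenation would be the natural next refinements.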

An exception hit while writing this code:

org.apache.spark.SparkException: Task not serializable

at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
at SparkStreamingDemo.ForeachRDD$$anonfun$main$1.apply(ForeachRDD.scala:31)
at SparkStreamingDemo.ForeachRDD$$anonfun$main$1.apply(ForeachRDD.scala:30)
...

A serialization exception: check whether the closure captures anything that cannot be serialized. A JDBC connection created on the driver is the classic cause, which is why the snippet above creates the connection inside foreachPartition, on the executors.
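For contrast, a sketch of the anti-pattern that raises this exception: the connection is created once on the driver, gets captured by the closure passed to rdd.foreach, and java.sql.Connection is not serializable, so the task fails before it runs.

result.foreachRDD(rdd => {
  val connection = createConnection() // runs on the driver
  rdd.foreach(record => {
    // the closure captures `connection`; Spark cannot serialize it
    connection.createStatement().execute("insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")")
  })
})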

 

     3. Window: running a computation over a time range on a schedule

    

     Core concepts:

  •   Window length: how long the window lasts
  •   Sliding interval: the interval at which the window slides

       How often to compute over what range: every sliding interval, aggregate the values from the previous window length.

       Example: every 10 seconds, compute over the previous 30 seconds:

val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
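For context, a minimal runnable sketch around this line, reusing the socket source and 5-second batch interval from the earlier examples (the object name is my own; note that both the window length and the sliding interval must be multiples of the batch interval):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("WindowWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val pairs = ssc.socketTextStream("localhost", 6789)
      .flatMap(_.split(" "))
      .map((_, 1))

    // Every 10 seconds, sum the counts seen over the previous 30 seconds;
    // both durations are multiples of the 5-second batch interval
    val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}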

 

      4. Blacklist filtering (operating between a DStream and an RDD) - to be revisited; not yet deeply understood

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Blacklist filtering
  */
object TransformApp {


  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    /**
      * Creating a StreamingContext takes two arguments: a SparkConf and the batch interval
      */
    val ssc = new StreamingContext(sparkConf, Seconds(5))


    /**
      * Build the blacklist: transform + leftOuterJoin
      */
    val blacks = List("zs", "ls")
    val blacksRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))

    val lines = ssc.socketTextStream("localhost", 6789)
    val clicklog = lines.map(x => (x.split(",")(1), x)).transform(rdd => {
      rdd.leftOuterJoin(blacksRDD)
        .filter(x=> x._2._2.getOrElse(false) != true)
        .map(x=>x._2._1)
    })

    clicklog.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
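As a quick sanity check: with zs and ls blacklisted, sending 20180808,zs and 20180808,ww through the socket should print only 20180808,ww. The leftOuterJoin produces (user, (logLine, Option[Boolean])) pairs; blacklisted users carry Some(true), so the filter keeps only the pairs whose Option is empty.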

 

      5. Integrating Spark SQL with Spark Streaming

      (1) foreachRDD ==> gives you each batch as a plain RDD

      (2) Convert the RDD to a DataFrame

      (3) Register the DataFrame as a temporary view

      (4) Finally, query the values with SQL (see the sketch below)
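A minimal sketch of these four steps, following the singleton-SparkSession pattern from the Spark Streaming documentation (the object name, socket source, and 5-second batch interval are my assumptions, mirroring the earlier examples):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SqlNetworkWordCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val words = ssc.socketTextStream("localhost", 6789).flatMap(_.split(" "))

    // (1) foreachRDD hands us each batch as a plain RDD
    words.foreachRDD { rdd =>
      // reuse one SparkSession per JVM, built from the RDD's SparkConf
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // (2) RDD -> DataFrame
      val wordsDF = rdd.toDF("word")

      // (3) register the DataFrame as a temporary view
      wordsDF.createOrReplaceTempView("words")

      // (4) query the view with SQL
      spark.sql("select word, count(*) as total from words group by word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}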

 
