Spark Streaming: concepts and examples

Spark Streaming concepts

  • Comparison of Storm, Spark Streaming, and Flink (fault tolerance, throughput, latency, delivery semantics)
    • Storm: very low latency; at-least-once delivery (exactly-once is possible with the Trident API)
    • Spark Streaming: high fault tolerance, high throughput, high latency; exactly-once delivery
    • Flink: high fault tolerance, high throughput, moderate latency; exactly-once delivery
  • Features
    1. Easy to use
      1. Supports multiple languages: Java, Scala, Python
      2. Provides many high-level operators
    2. Fault tolerance: exactly-once semantics
  • The classic streaming-processing architecture:
    Flume -> Kafka -> Spark Streaming/Storm -> Redis/MySQL/HBase
  • The basic idea of Spark Streaming: split the input data stream into slices at a fixed time interval (on the order of seconds), then process each time slice of data in a batch-like fashion.
  • StreamingContext: the programming entry point
  • DStream (discretized stream): a sequence of discrete RDDs
  • Checkpointing: Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that it can recover from failures.
    • Two kinds of data are checkpointed:
      • Metadata: the information defining the streaming computation, saved to fault-tolerant storage (e.g. HDFS).
      • Data: the generated RDDs, saved to reliable storage.
    • When to enable checkpointing:
      • When using stateful transformations: if updateStateByKey or reduceByKeyAndWindow (with an inverse function) is used in the application, a checkpoint directory must be provided to allow periodic RDD checkpointing.
      • To recover from failures of the driver running the application: metadata checkpoints are used to recover with progress information.
    • How to configure checkpointing: streamingContext.checkpoint(checkpointDirectory); see the sketch after this list.
  • References: the official Spark documentation (English), its Chinese translation, and the Chinese community documentation.
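
  A minimal sketch of the recommended checkpoint setup (the HDFS path, host name and app name are placeholders): enable checkpointing inside a factory function and hand that function to StreamingContext.getOrCreate, so that the driver can later be rebuilt from the checkpoint, as the HA example further down does.

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}

     object CheckpointSketch {
       //Placeholder checkpoint directory on a fault-tolerant file system such as HDFS
       val checkpointDirectory = "hdfs://hadoop:9000/sparkstreaming/checkpoint-demo"

       def createContext(): StreamingContext = {
         val sparkConf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]")
         val ssc = new StreamingContext(sparkConf, Seconds(4))
         ssc.checkpoint(checkpointDirectory) //enables metadata and RDD checkpointing
         //Placeholder computation so the context has at least one output operation
         ssc.socketTextStream("hadoop", 9999).count().print()
         ssc
       }

       def main(args: Array[String]): Unit = {
         //First run: create a new context; later runs: rebuild it from the checkpoint data
         val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
         ssc.start()
         ssc.awaitTermination()
       }
     }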

Spark Streaming examples

  1. Preliminary setup

     1. Install netcat: yum install nc
     2. Start netcat: nc -lk 9999, then type data into that session and the program running in IDEA will receive it.
    
  2. A simple word count example

     Start nc -lk 9999 as in the preliminary setup, then run the program below.
     package spark.streaming
     
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     
     object NewWordCount {
       def main(args: Array[String]): Unit = {
         //Create the StreamingContext programming entry point
         val sparkConf: SparkConf = new SparkConf().setAppName("NewWordCount").setMaster("local[2]")
         val streamingContext: StreamingContext = new StreamingContext(sparkConf,Seconds(4))
         //Build the first DStream from the StreamingContext
         val inputDStream: ReceiverInputDStream[String] = 
         streamingContext.socketTextStream("hadoop",9999)
         //Apply transformations to the DStream
         val wordsDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
         val wordAndDStream: DStream[(String, Int)] = wordsDStream.map((word: String) =>(word,1))
         val result: DStream[(String, Int)] = wordAndDStream.reduceByKey((x, y)=>x+y)
         //Run an output operation on the result data
         result.print()
         //Start the Spark Streaming application and wait for it to terminate
         streamingContext.start()
         streamingContext.awaitTermination()
       }
     }
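
     For reference, typing "hello world hello" into the nc session during one 4-second batch produces print() output in the console along these lines (the timestamp is arbitrary):

     -------------------------------------------
     Time: 1563880000000 ms
     -------------------------------------------
     (hello,2)
     (world,1)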
    
  3. Stateful word count with updateStateByKey (results kept across batches)

     package spark.streaming
     
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
     
     object UpDataStateByKey {
       def main(args: Array[String]): Unit = {
         System.setProperty("HADOOP_USER_NAME","root")
         //Create the StreamingContext programming entry point
         val sparkConf: SparkConf = new SparkConf().setAppName("UpDataStateByKey").setMaster("local[2]")
         val streamingContext: StreamingContext = new StreamingContext(sparkConf, Seconds(4))
         //updateStateByKey needs a checkpoint directory, otherwise Spark fails with "The checkpoint directory has not been set"
         streamingContext.checkpoint("hdfs://hadoop:9000/sparkstreaming/checkpoint")
         //Build the first DStream from the StreamingContext
         val inputDStream: ReceiverInputDStream[String] = 
         streamingContext.socketTextStream("hadoop", 9999)
         //Apply transformations to the DStream
         val wordsDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
         val wordAndDStream: DStream[(String, Int)] = wordsDStream.map((word: String) => (word, 1))
         /**
           * updateStateByKey takes updateFunc: (Seq[V], Option[S]) => Option[S]
           * (an overload also accepts numPartitions: Int)
           * values: Seq[Int], the new values for this key in the current batch
           * state: Option[Int], the previously accumulated state for this key
           * updateFunction is the state update function
           */
         def updateFunction(values: Seq[Int], state: Option[Int]): Option[Int] = {
           val new_value: Int = values.sum
           val state_value: Int = state.getOrElse(0)
           Some(new_value + state_value)
         }
         val result: DStream[(String, Int)] = wordAndDStream.updateStateByKey(updateFunction)
         result.print()
         streamingContext.start()
         streamingContext.awaitTermination()
       }
     }
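
     The comment block in the listing mentions the overload that also takes numPartitions: Int. As a small sketch, the updateStateByKey call above could pass an explicit partition count for the state RDDs (the value 4 is arbitrary):

     val result: DStream[(String, Int)] = wordAndDStream.updateStateByKey(updateFunction, 4)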
    
  4. Reading files from HDFS

     package spark.streaming
     
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.dstream.DStream
     
     object FileWordCount {
       def main(args: Array[String]): Unit = {
         System.setProperty("HADOOP_USER_NAME","root")
         //Create the StreamingContext programming entry point
         val sparkConf: SparkConf = new SparkConf().setAppName("FileWordCount").setMaster("local[2]")
         val streamingContext: StreamingContext = new StreamingContext(sparkConf, Seconds(4))
         //updateStateByKey needs a checkpoint directory, otherwise Spark fails with "The checkpoint directory has not been set"
         streamingContext.checkpoint("hdfs://hadoop:9000/sparkstreaming/checkpoint")
         //Build the first DStream from the StreamingContext
         //textFileStream only takes a directory, and only files added to that directory after the job starts produce results
         val inputDStream: DStream[String] = 
         streamingContext.textFileStream("hdfs://hadoop:9000/student")
         //Apply transformations to the DStream
         val wordsDStream: DStream[String] = inputDStream.flatMap(_.split(","))
         val wordAndDStream: DStream[(String, Int)] = wordsDStream.map((word: String) => (word, 1))
         /**
           * updateStateByKey takes updateFunc: (Seq[V], Option[S]) => Option[S]
           * (an overload also accepts numPartitions: Int)
           * values: Seq[Int], the new values for this key in the current batch
           * state: Option[Int], the previously accumulated state for this key
           * updateFunction is the state update function
           */
         def updateFunction(values: Seq[Int], state: Option[Int]): Option[Int] = {
           val new_value: Int = values.sum
           val state_value: Int = state.getOrElse(0)
           Some(new_value + state_value)
         }
         val result: DStream[(String, Int)] = wordAndDStream.updateStateByKey(updateFunction)
         result.print()
         streamingContext.start()
         streamingContext.awaitTermination()
       }
     }
    
  5. A highly available (driver HA) word count

     package spark.streaming
     
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
     
     /**
       * By default the StreamingContext object is not preserved when the program stops.
       * Test procedure:
       * 1. Run the program normally for a while.
       * 2. Stop the program.
       * 3. Start the program again and feed in more data.
       * 4. Check whether the results after the restart continue from the previous run.
       */
     object WordCount_DriverHA {
       def main(args: Array[String]): Unit = {
         val checkpointDirectory="hdfs://hadoop:9000/sparkstreaming/checkpoint1"
         System.setProperty("HADOOP_USER_NAME","root")
         def functionToCreateContext(): StreamingContext = {
            //Create the StreamingContext programming entry point
           val sparkConf: SparkConf = new SparkConf()
           .setAppName("WordCount_DriverHA").setMaster("local[2]")
           val streamingContext: StreamingContext = new StreamingContext(sparkConf, Seconds(4))
            //updateStateByKey needs a checkpoint directory, otherwise Spark fails with "The checkpoint directory has not been set"
           streamingContext.checkpoint(checkpointDirectory)
            //Build the first DStream from the StreamingContext
           val inputDStream: ReceiverInputDStream[String] = 
           streamingContext.socketTextStream("hadoop", 9999)
            //Apply transformations to the DStream
           val wordsDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
           val wordAndDStream: DStream[(String, Int)] = wordsDStream.map((word: String) => (word, 1))
            /**
              * updateStateByKey takes updateFunc: (Seq[V], Option[S]) => Option[S]
              * (an overload also accepts numPartitions: Int)
              * values: Seq[Int], the new values for this key in the current batch
              * state: Option[Int], the previously accumulated state for this key
              * updateFunction is the state update function
              */
           def updateFunction(values: Seq[Int], state: Option[Int]): Option[Int] = {
             val new_value: Int = values.sum
             val state_value: Int = state.getOrElse(0)
             Some(new_value + state_value)
           }
           val result: DStream[(String, Int)] = wordAndDStream.updateStateByKey(updateFunction)
           result.print()
            //Do not call start()/awaitTermination() here; StreamingContext.getOrCreate only needs the fully defined context
            streamingContext
         }
         /**
           * checkpointDirectory is the checkpoint directory: on the first run the context is created
           * by functionToCreateContext; on later runs it is rebuilt from the checkpoint data,
           * which restores the state of the previous run.
           */
         val context = StreamingContext
           .getOrCreate(checkpointDirectory, functionToCreateContext _)
         context.start()
         context.awaitTermination()
       }
     }
    

6. Word count with filtering (blacklist/whitelist)

	package spark.streaming
	
	import org.apache.spark.SparkConf
	import org.apache.spark.broadcast.Broadcast
	import org.apache.spark.rdd.RDD
	import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
	import org.apache.spark.streaming.{Seconds, StreamingContext}
	
	
	object WordCount_BlackList {
	  def main(args: Array[String]): Unit = {
	    System.setProperty("HADOOP_USER_NAME","root")
	    val checkpointDir="hdfs://hadoop:9000/sparkstreaming/checkpoint02"
	    val sparkConf: SparkConf = new SparkConf()
	    .setAppName("WordCount_BlackList").setMaster("local[2]")
	    val streamingContext: StreamingContext = new StreamingContext(sparkConf,Seconds(4))
	    streamingContext.checkpoint(checkpointDir)
	    val blackList=List(".","!",",","@","$","%","^","*")
	    val inputDStream: ReceiverInputDStream[String] = streamingContext.socketTextStream("hadoop",9999)
	    val wordsDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
	    val bc: Broadcast[List[String]] = streamingContext.sparkContext.broadcast(blackList)
	    val transformFunc:RDD[String]=>RDD[String]=(rdd:RDD[String])=>{
	      val newRDD: RDD[String] = rdd.mapPartitions(ptndata => {
	        val bl: List[String] = bc.value
	        val newPtn: Iterator[String] = ptndata.filter(x => !bl.contains(x))
	        newPtn
	      })
	      newRDD
	    }
	    //transform applies an RDD-to-RDD function to every RDD of the DStream
	    val trueWordsDStream: DStream[String] = wordsDStream.transform(transformFunc)
	    val result: DStream[(String, Int)] = trueWordsDStream.map(x => (x, 1))
				.updateStateByKey((values: Seq[Int], state: Option[Int]) => {
	      Some(values.sum + state.getOrElse(0))
	    })
	    result.print()
	    streamingContext.start()
	    streamingContext.awaitTermination()
	  }
	}
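
	transform simply applies an RDD-to-RDD function to every RDD of the DStream, so the explicit transformFunc in the listing can also be written inline. A minimal equivalent sketch, reusing the bc broadcast variable from above:

	val trueWordsDStream: DStream[String] =
	  wordsDStream.transform(rdd => rdd.filter(word => !bc.value.contains(word)))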

7. A reduceByKeyAndWindow example

	package spark.streaming
	
	import org.apache.spark.SparkConf
	import org.apache.spark.streaming.{Seconds, StreamingContext}
	import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
	
	object WordCount_Window {
	  def main(args: Array[String]): Unit = {
	    System.setProperty("HADOOP_USER_NAME","root")
	    //Create the StreamingContext programming entry point
	    val sparkConf: SparkConf = new SparkConf().setAppName("WordCount_Window").setMaster("local[2]")
	    val streamingContext: StreamingContext = new StreamingContext(sparkConf,Seconds(2))
	    //Specify the checkpoint directory
	    streamingContext.checkpoint("hdfs://hadoop:9000/sparkstreaming/checkpoint3")
	    //Build the first DStream from the StreamingContext
	    val inputDStream: ReceiverInputDStream[String] = 
	    streamingContext.socketTextStream("hadoop", 9999)
	    //Apply transformations to the DStream
	    val wordsDStream: DStream[String] = inputDStream.flatMap(_.split(" "))
	    val wordAndDStream: DStream[(String, Int)] = wordsDStream.map((word: String) => (word, 1))
		//Every 4 seconds, compute the data of the past 6 seconds.
		//Both the window length and the slide interval must be multiples of the batch interval (Seconds(2) above), otherwise an error is thrown.
	    val result: DStream[(String, Int)] = wordAndDStream
	      .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(6), Seconds(4))
	    result.print()
	    streamingContext.start()
	    streamingContext.awaitTermination()
	  }
	}
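
	The checkpoint section at the top mentions reduceByKeyAndWindow with an inverse function. That variant computes each window incrementally, adding the batches that enter the window and subtracting the ones that leave, and it does require the checkpoint directory set above. A sketch of how the call in the listing could be rewritten:

	val result: DStream[(String, Int)] = wordAndDStream.reduceByKeyAndWindow(
	  (x: Int, y: Int) => x + y,  //add the counts of data entering the window
	  (x: Int, y: Int) => x - y,  //subtract the counts of data leaving the window
	  Seconds(6),                 //window length
	  Seconds(4)                  //slide interval
	)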
8. Notes on foreachRDD

    Key points:
    foreach iterates over every element of a distributed collection.
    foreachPartition iterates over every partition of a distributed collection.
    foreachRDD iterates over every RDD of a DStream.
     dstream.foreachRDD { rdd =>
       rdd.foreachPartition { partitionOfRecords =>
         // get a connection from a connection pool
         val connection = ConnectionPool.getConnection()
         partitionOfRecords.foreach(record => connection.send(record)) // use the connection to send the records
         ConnectionPool.returnConnection(connection)  // return the connection to the pool
       }
     }
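
     ConnectionPool and connection.send above are placeholders rather than a real Spark API. As a concrete (hypothetical) sketch of the same pattern, the code below writes the (word, count) pairs of a DStream such as the result from the earlier examples into MySQL over plain JDBC; the host, database, credentials and table are assumptions, and the MySQL JDBC driver must be on the classpath.

     import java.sql.DriverManager

     //Hypothetical sink: assumes a MySQL database "test" on host "hadoop"
     //containing a table wordcount(word VARCHAR(100), cnt INT)
     result.foreachRDD { rdd =>
       rdd.foreachPartition { partition =>
         //One connection per partition, created on the executor rather than on the driver
         val connection = DriverManager.getConnection(
           "jdbc:mysql://hadoop:3306/test", "root", "root")
         val statement = connection.prepareStatement(
           "INSERT INTO wordcount (word, cnt) VALUES (?, ?)")
         partition.foreach { case (word, count) =>
           statement.setString(1, word)
           statement.setInt(2, count)
           statement.executeUpdate()
         }
         statement.close()
         connection.close()
       }
     }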
    