SparkStreaming
1. Course Objectives
- Understand the principles and architecture of SparkStreaming
- Master the common DStream operations
- Master integrating SparkStreaming with Flume
- Master integrating SparkStreaming with Kafka
2. SparkStreaming Concepts
- SparkStreaming is used to develop real-time (streaming) data processing applications.
- SparkStreaming can connect to different real-time data sources such as Kafka, Flume, and sockets. Once started, a SparkStreaming application does not stop on its own; it only stops on manual intervention, a machine failure, or a data error.
- SparkStreaming features:
  - Ease of use: supports multiple development languages.
  - Fault tolerance: Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box, without any extra code.
  - Easy integration: SparkStreaming integrates with GraphX and machine learning (MLlib), and can also be combined with Kafka, Flume, file systems, SparkCore, and SparkSQL.
3. SparkStreaming Principles
SparkStreaming is a pseudo-real-time (micro-batch) processing engine. "Micro" refers to a very small time interval; "batch" refers to offline-style batch processing.
1. SparkStreaming shrinks the batch interval down to the second or even millisecond level.
2. SparkStreaming turns the data received in each interval into an RDD, applies transformations to that RDD, and finally submits a job. SparkStreaming submits these jobs automatically, once per batch interval.
- SparkStreaming latency (real-time capability)
SparkStreaming can currently achieve a latency of about 100 milliseconds. For projects with very strict latency requirements, such as securities trading or breaking-news delivery, consider Storm, Flink, or Spark's Structured Streaming during technology selection.
1. The time to split a job into tasks (e.g. 50 tasks per 1-second batch, roughly 20 ms per task).
2. The time to distribute tasks to different worker nodes (network transfer + serialization/deserialization + execution).
Together these two steps take at least 100 ms, so in practice SparkStreaming is typically used with second-level batch intervals.
- SparkStreaming fault tolerance
  - Fault tolerance via the RDD lineage mechanism
  - Fault tolerance via the RDD cache mechanism
  - Fault tolerance via RDD checkpointing
4. The SparkStreaming Programming Entry Point
- Add the pom dependency
```xml
<!-- SparkStreaming dependency -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```
The programming entry point of SparkCore is SparkContext, of SparkSQL it is SparkSession, and of SparkStreaming it is StreamingContext. What StreamingContext does:
1. Connects to the data source and creates a DStream.
2. StreamingContext.start() starts the real-time processing application.
3. context.awaitTermination() starts a thread that watches for failures and calls stop() if an exception occurs.
4. StreamingContext.stop() stops the application and releases its resources.
Creating a StreamingContext object requires a SparkConf and a batch-interval parameter, as the sketch below shows.
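A minimal sketch of this StreamingContext lifecycle, reusing the local socket source (`node-01`, port 9999) that appears in the later examples:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextDemo {
  def main(args: Array[String]): Unit = {
    // 1. A StreamingContext needs a SparkConf and a batch interval
    val conf = new SparkConf().setAppName("StreamingContextDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // 2. Connect to a data source and create a DStream
    val lines = ssc.socketTextStream("node-01", 9999)
    lines.print()

    // 3. Start the application and block until it terminates (or fails)
    ssc.start()
    ssc.awaitTermination()
    // 4. ssc.stop() would stop the application and release its resources
  }
}
```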
5. The SparkStreaming Programming Model: DStream
A DStream (discretized stream) is the basic programming abstraction of SparkStreaming; a DStream is essentially a sequence of RDDs.
DStream transformation operators: map, flatMap, reduceByKey, groupByKey, transform, etc.; every transformation produces a new DStream.
DStream output operators: print, foreachRDD, saveAsTextFiles, etc.
Three key properties of DStreams:
1. Every DStream depends on other DStreams: `def dependencies: List[DStream[_]]`
2. An RDD is generated for every batch interval: `var generatedRDDs = new HashMap[Time, RDD[T]]()`
3. The function passed to a transformation operator is applied to the RDDs inside the DStream; the DStream's compute method applies it to the RDD of each batch interval (see the sketch below):
```scala
override def compute(validTime: Time): Option[RDD[U]] = {
  parent.getOrCompute(validTime).map(_.map[U](mapFunc))
}
```
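A minimal sketch of property 3, assuming the same socket source used in section 6.1: the functions passed to map and transform below run once per batch, on that batch's RDD.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamModelDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("DStreamModelDemo").setMaster("local[2]"), Seconds(5))

    val lines = ssc.socketTextStream("node-01", 9999)

    // map produces a new DStream that depends on `lines`;
    // the upper-casing function runs on the RDD of every 5-second batch
    val upper = lines.map(_.toUpperCase)

    // transform exposes the per-batch RDD directly
    val sorted = upper.transform(rdd => rdd.sortBy(identity))

    sorted.print() // output operator: triggers a job for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```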
6. Hands-on SparkStreaming Programming in IDEA
- pom dependency
```xml
<!-- SparkStreaming dependency -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```
6.1 Reading Socket Data
- Preparation
  - Install the socket data-source tool: yum -y install nc
  - Start it: nc -lk <port> (e.g. nc -lk 9999, the port used in the examples below)
- Write the program
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketSparkStreaming {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object
    val conf = new SparkConf().setAppName("SocketSparkStreaming").setMaster("local[2]")
    // Build the StreamingContext
    val sc: StreamingContext = new StreamingContext(conf, Seconds(5))
    sc.sparkContext.setLogLevel("OFF")
    // Connect to the data source and create a DStream
    val linesDstream: DStream[String] = sc.socketTextStream("node-01", 9999)
    // Input lines are comma-separated; do a word count
    val wordCountDstream: DStream[(String, Int)] =
      linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    // Print the output
    wordCountDstream.print()
    // Start the real-time processing application
    sc.start()
    sc.awaitTermination()
  }
}
```
Note: SparkStreaming only processes the data of the current batch; it does not keep state from previous batches.
- To keep the accumulated state across batches, use updateStateByKey
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateSparkStreaming {

  /**
   * @param newValues    the word-count values of the current batch for a key
   * @param runningCount the accumulated count from all previous batches for that key
   * @return the new accumulated count
   */
  def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val value = newValues.sum + runningCount.getOrElse(0)
    Option(value)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
    // Create the StreamingContext object
    val context = new StreamingContext(conf, Seconds(5))
    context.sparkContext.setLogLevel("OFF")
    // Set the checkpoint path (on HDFS in production)
    context.checkpoint("D:/ck")
    // Connect to the data source
    val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
    val wordcountDStream: DStream[(String, Int)] =
      linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    // Keep the accumulated state and print it to the console
    wordcountDStream.updateStateByKey(updateFunc _).print()
    context.start()
    context.awaitTermination()
  }
}
```
- After a crash the accumulated state is lost.
- Keeping the accumulated state across a restart
```scala
package cn.itcast.sparkStreaming

import cn.itcast.sparkStreaming.UpdateStateSparkStreaming.updateFunc
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object SparkStreamingUpdate {

  def createFunc(): StreamingContext = {
    val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
    // Create the StreamingContext object
    val context = new StreamingContext(conf, Seconds(10))
    context.sparkContext.setLogLevel("OFF")
    // Set the checkpoint path (on HDFS in production)
    context.checkpoint("D:/ck")
    // Connect to the data source
    val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
    val wordcountDStream: DStream[(String, Int)] =
      linesDstream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
    // Keep the accumulated state and print it to the console
    wordcountDStream.updateStateByKey(updateFunc _).print()
    // Return the new context
    context
  }

  def main(args: Array[String]): Unit = {
    // Recreate the context from the checkpoint if one exists; in production the checkpoint path is usually on HDFS
    val streamingContext = StreamingContext.getOrCreate("D:/ck", createFunc _)
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
```
- Requirement: count the hot words produced over the last 20 seconds, computed every 10 seconds, with a SparkStreaming batch interval of 5 seconds; this uses SparkStreaming's window operations.
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UpdateStateSparkStreaming").setMaster("local[2]")
    // Create the StreamingContext object
    val context = new StreamingContext(conf, Seconds(5))
    context.sparkContext.setLogLevel("OFF")
    // Connect to the data source
    val linesDstream: DStream[String] = context.socketTextStream("node-01", 9999)
    val wordcountDStream: DStream[(String, Int)] = linesDstream.flatMap(_.split(",")).map((_, 1))
      // Window length and slide interval must be integer multiples of the batch interval
      .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(15), Seconds(10))
    wordcountDStream.print()
    context.start()
    context.awaitTermination()
  }
}
```
6.2 Other SparkStreaming Data Sources
- SparkStreaming can read files from a file system (textFileStream watches a directory for newly created files)
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingFile {
  def main(args: Array[String]): Unit = {
    val context = new StreamingContext(
      new SparkConf().setAppName("SparkStreamingFile").setMaster("local[*]"), Seconds(5))
    context.sparkContext.setLogLevel("OFF")
    // Read files that appear in the directory
    val dstream = context.textFileStream("D:/data/")
    dstream.print()
    context.start()
    context.awaitTermination()
  }
}
```
6.3 Reading a Data Source with a Custom Receiver
- Create a class that extends Receiver and implement its abstract methods onStart and onStop
```scala
package cn.itcast.sparkStreaming

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Custom receiver that reads socket data
class MyReceiver(val host: String, val port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  var socket: Socket = _

  override def onStart(): Unit = {
    // Connect to the data source
    socket = new Socket(host, port)
    new Thread() {
      override def run(): Unit = {
        // Read the data and store it into Spark's memory
        receiverMsg()
      }
    }.start()
  }

  def receiverMsg(): Unit = {
    // Get the input stream
    val inputStream = socket.getInputStream()
    // Build the reader objects
    val reader = new InputStreamReader(inputStream)
    val bufferedReader = new BufferedReader(reader)
    var line: String = bufferedReader.readLine()
    while (!isStopped() && line != null) {
      // Store the data
      store(line)
      line = bufferedReader.readLine()
    }
  }

  override def onStop(): Unit = {
    // Release resources
    socket.close()
  }
}

object SparkStreamingReceiver {
  def main(args: Array[String]): Unit = {
    val context = new StreamingContext(
      new SparkConf().setAppName("SparkStreamingFile").setMaster("local[*]"), Seconds(5))
    context.sparkContext.setLogLevel("OFF")
    val dstream = context.receiverStream(new MyReceiver("node-01", 9999))
    dstream.print()
    context.start()
    context.awaitTermination()
  }
}
```
7. SparkStreaming Reading Data from Flume
- Flume log collection: preparation
  - Replace spark-streaming-flume-sink*.jar (in Flume's lib directory) with spark-streaming-flume-sink_2.11-2.2.0.jar
  - Replace scala-library-2.10.5.jar with scala-library-2.11.8.jar
- Configure the pom file
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```
- SparkStreaming can read data from Flume in two ways:
  - poll (pull): Flume buffers the collected data behind a SparkSink listening on a specific port of the Flume machine, and SparkStreaming pulls the data from that port.
  - push: Flume pushes the collected data to the machine where the Spark application runs.
- Steps for reading the data (poll mode)
  - Create the flume-poll.properties file
```properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/data
a1.sources.r1.fileHeader = true

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000

#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = node-03
a1.sinks.k1.port = 9999
a1.sinks.k1.batchSize = 2000
```
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingFlume {
  def main(args: Array[String]): Unit = {
    // Create the StreamingContext
    val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
    val context = new StreamingContext(conf, Seconds(10))
    context.sparkContext.setLogLevel("OFF")

    // Read data from Flume (poll mode)
    // createPollingStream(ssc: StreamingContext, hostname: String, port: Int)
    val flumeDstream = FlumeUtils.createPollingStream(context, "node-03", 9999)

    val wordcountDstream = flumeDstream.map((sfe: SparkFlumeEvent) => {
      val body = sfe.event.getBody
      new String(body.array())
    }).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)

    // Sort by count, descending
    val sortDstream = wordcountDstream.transform(rdd => {
      rdd.sortBy(t => t._2, false)
    })

    // An output operation is required, otherwise start() fails
    sortDstream.print()
    // topN
    // sortDstream.count().print()

    context.start()
    context.awaitTermination()
  }
}
```
- Consuming data in push mode
  - Create flume-push.properties
```properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/data/
a1.sources.r1.fileHeader = true

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000

#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
#Note: this IP must be the address of the machine running the Spark program, i.e. the Windows development machine
a1.sinks.k1.hostname = 192.168.23.22
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000
```
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingFlume2 {
  def main(args: Array[String]): Unit = {
    // Create the StreamingContext
    val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
    val context = new StreamingContext(conf, Seconds(10))
    context.sparkContext.setLogLevel("OFF")

    // createStream(ssc: StreamingContext, hostname: String, port: Int)
    // In push mode the stream listens on the Spark side and waits for Flume to push data
    val dstream: DStream[SparkFlumeEvent] = FlumeUtils.createStream(context, "192.168.23.22", 8888)

    dstream.map(sfe => {
      new String(sfe.event.getBody.array())
    }).print()

    context.start()
    context.awaitTermination()
  }
}
```
8. SparkStreaming Consuming Data from Kafka
- Steps
  - Start the Kafka cluster
  - Consume Kafka data through a Receiver
```scala
package cn.itcast.sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * With createStream the Kafka offsets are stored in ZooKeeper and managed automatically.
 * The receiver consumes Kafka with multiple threads; the number of threads per topic is given
 * in the topics map (usually the topic's partition count), e.g. topic1 -> 3 and topic2 -> 3
 * gives 6 threads in total.
 */
object SparkStreamingKakfa_8 {
  def main(args: Array[String]): Unit = {
    // Create the StreamingContext
    val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
    val context = new StreamingContext(conf, Seconds(10))
    context.sparkContext.setLogLevel("OFF")

    // Connect to the Kafka data source
    // createStream(ssc: StreamingContext, zkQuorum: String, groupId: String, topics: Map[String, Int])
    val zkQuorum: String = "node-01:2181,node-02:2181,node-03:2181"
    val groupId: String = "shenzhen"
    val topics: Map[String, Int] = Map("shenzhen_itcast" -> 3)
    val kafkaDstream = KafkaUtils.createStream(context, zkQuorum, groupId, topics)

    // The stream elements are (key, value) pairs; take the value and do a word count
    val wcDstream = kafkaDstream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wcDstream.print()

    context.start()
    context.awaitTermination()
  }
}
```
```scala
package cn.itcast.sparkStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Direct approach: the application tracks the Kafka offsets itself (in checkpoints or an
 * external store) instead of relying on ZooKeeper.
 */
object SparkStreamingKafkaDriect_8 {
  def main(args: Array[String]): Unit = {
    // Create the StreamingContext
    val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
    val context = new StreamingContext(conf, Seconds(10))
    context.sparkContext.setLogLevel("OFF")

    // Consume Kafka data with the direct approach
    // createDirectStream(ssc: StreamingContext, kafkaParams: Map[String, String], topics: Set[String])
    val kafkaParams: Map[String, String] = Map(
      "metadata.broker.list" -> "node-01:9092,node-02:9092,node-03:9092",
      "group.id" -> "shenzhen"
    )
    val topics: Set[String] = Set("shenzhen_itcast")
    val directStream =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](context, kafkaParams, topics)

    var offsetRanges = Array.empty[OffsetRange]
    directStream.transform { rdd =>
      // Capture the offset ranges of this batch
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      for (o <- offsetRanges) {
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
      // The offsets are usually persisted to an external store such as Redis
    }

    context.start()
    context.awaitTermination()
  }
}
```
Based on the Kafka 0.10 integration:
```scala
package cn.itcast.sparkStreaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingKafka10 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingFlume")
    val context = new StreamingContext(conf, Seconds(10))
    context.sparkContext.setLogLevel("OFF")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node-01:9092,node-02:9092,node-03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",                    // where to start consuming
      "enable.auto.commit" -> (false: java.lang.Boolean)  // do not auto-commit offsets
    )
    val topics = Array("shenzhen_itcast")

    val stream = KafkaUtils.createDirectStream[String, String](
      context,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
      // Commit the offsets back to Kafka; in production offsets are often managed in Redis instead
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    context.start()
    context.awaitTermination()
    /*
    Sample output:
    shenzhen_itcast 2 10 10
    shenzhen_itcast 0 11 11
    shenzhen_itcast 1 11 11
    */
  }
}
```
- Consumption semantics
  - At-least-once (messages may be consumed more than once)
    - With the receiver approach the data has been read into SparkStreaming, but the offset has not yet been committed; after a failure the same data is read again.
    - To guarantee that the received data is not lost, enable the WAL (write-ahead log) on HDFS; this prevents data loss, but by itself it still gives at-least-once rather than exactly-once.
  - At-most-once (data may be lost)
    - With the receiver approach the offset has already been committed, but the program exits abnormally before the data is processed.
  - Exactly-once
    - Analogous to a transaction in a relational database.
    - In practice it is built on at-least-once consumption (duplicates possible) plus idempotent or transactional output, e.g. managing the offsets in an external store as in the sketch below.
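A minimal, hedged sketch of this pattern with the Kafka 0.10 direct API: process each batch and only then persist the batch's offsets, so a restart resumes from the stored offsets. The `OffsetStore` trait, `InMemoryOffsetStore`, and `ExactlyOnceSketch` below are hypothetical placeholders (a production store would be Redis or a database); the brokers, topic, and group id reuse the values from the earlier examples.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical offset store; a real implementation would talk to Redis or a database.
trait OffsetStore {
  def load(): Map[TopicPartition, Long]
  def save(offsets: Map[TopicPartition, Long]): Unit
}

object InMemoryOffsetStore extends OffsetStore {
  private var offsets = Map.empty[TopicPartition, Long]
  def load(): Map[TopicPartition, Long] = offsets
  def save(o: Map[TopicPartition, Long]): Unit = { offsets = o }
}

object ExactlyOnceSketch {
  def main(args: Array[String]): Unit = {
    val context = new StreamingContext(
      new SparkConf().setMaster("local[*]").setAppName("ExactlyOnceSketch"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node-01:9092,node-02:9092,node-03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "shenzhen",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Resume from the previously stored offsets (empty on the very first run)
    val fromOffsets = InMemoryOffsetStore.load()
    val stream = KafkaUtils.createDirectStream[String, String](
      context,
      PreferConsistent,
      Subscribe[String, String](Array("shenzhen_itcast"), kafkaParams, fromOffsets)
    )

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // 1. Process the batch idempotently (here: a word count printed on the driver)
      rdd.map(_.value()).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
      // 2. Only after the batch succeeded, persist the new offsets
      InMemoryOffsetStore.save(ranges.map(r => new TopicPartition(r.topic, r.partition) -> r.untilOffset).toMap)
    }

    context.start()
    context.awaitTermination()
  }
}
```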