Spark Streaming: Fundamentals and Practice

贫僧洗头爱飘柔

Category: Spark. Tags: Spark Streaming basics, Spark Streaming architecture, Spark Streaming internals, Spark Streaming in practice, Spark Streaming and Kafka integration

Article link: https://blog.csdn.net/ForgetThatNight/article/details/79766015

(I) Introduction to Spark Streaming

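1. Spark Streaming concepts

Spark Streaming is an extension of the core Spark API that, according to its official documentation, provides high-throughput, fault-tolerant processing of live data streams. Like Apache Storm, it is used for stream processing. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once ingested, the data can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be written to many destinations, such as HDFS or databases. Spark Streaming also integrates with MLlib (machine learning) and GraphX.

Ingesting data from multiple sources:

Spark Streaming receives live input data from sources such as Kafka, Flume, and HDFS, processes it, and stores the results in destinations such as HDFS and databases.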

2. Why learn Spark Streaming

1. Easy to use

2. Fault tolerant

3. Easy to integrate into the Spark ecosystem

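3. Spark Core and Spark Streaming

The relationship between the two is shown in the figure:

• Step 1: build the RDD DAG for small data blocks
• Step 2: slice the continuous data and process it

• Spark Streaming splits the incoming live data stream into batches at a fixed time interval and hands them to the Spark engine, which produces the results batch by batch

• Each batch of data corresponds to one RDD instance in the Spark core
• A DStream can be viewed as a group of RDDs, that is, a sequence of RDDs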
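
To make the batching idea concrete, here is a minimal plain-Scala sketch (no Spark dependency) that groups timestamped records into fixed-length batches, roughly the way a batch interval slices a continuous stream. The names `MicroBatchSketch`, `Event`, and `sliceIntoBatches` are hypothetical and exist only for illustration; real Spark Streaming builds one RDD per interval rather than a `Seq`, and it also emits an empty RDD for intervals with no data, which this sketch does not model.

```scala
object MicroBatchSketch {
  // Hypothetical record type: an arrival time in milliseconds plus a payload.
  final case class Event(timestampMs: Long, payload: String)

  // Assign each event to the batch whose interval contains its timestamp,
  // then return the non-empty batches ordered by batch index.
  def sliceIntoBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, es) => (batchIndex, es.sortBy(_.timestampMs).map(_.payload)) }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(1900, "b"), Event(2100, "c"), Event(4500, "d"))
    // With a 2-second interval, "a" and "b" share batch 0, "c" lands in batch 1, "d" in batch 2.
    sliceIntoBatches(events, 2000L).foreach { case (idx, payloads) =>
      println(s"batch $idx -> ${payloads.mkString(", ")}")
    }
  }
}
```

Each `(batchIndex, payloads)` pair plays the role of one RDD in the DStream: downstream logic only ever sees one finished batch at a time.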

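4. Spark Streaming compared with Storm

Spark Streaming requires a batch interval to be set, so strictly speaking it is still a batch-processing framework; with a small interval it can be regarded as a micro-batch, near-real-time stream processing framework.

Aspect	Spark Streaming	Storm
Implementation language	Scala	Clojure
Programming model	DStream	Spout/Bolt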

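(II) DStream

1. What is a DStream

A Discretized Stream (DStream) is the high-level abstraction that Spark Streaming provides to represent a continuous stream of data. Internally, a DStream is represented as a continuous series of RDDs, each containing the data from one time interval, as shown in the figure below:

Any operation on a DStream is translated into operations on the underlying RDDs

• A Spark Streaming program usually applies several operations to DStreams. The DStreamGraph is formed by the dependencies among these operations

The computation itself is carried out by the Spark engine

• Continuous data is persisted, discretized, and then processed in batches
• What each step means:
    – Persistence: received data is stored temporarily
    – Discretization: data is sliced by time into processing units
    – Sliced processing: the slices are processed batch by batch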

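• Operations on a DStream fall into two categories:
    – Transformation
        • Spark supports many transformations on RDDs. Because a DStream is made up of RDDs, Spark Streaming provides a set of transformations for DStreams that closely mirrors those available on RDDs
        • A transformation is applied to every RDD in the DStream
        • Spark Streaming also offers operators such as reduce and count, but unlike the RDD actions of the same name they return a new DStream and do not trigger computation by themselves
        • Examples: map, flatMap, join, reduceByKey
    – Output: output (action-like) operators
        • print
        • saveAsObjectFiles, saveAsTextFiles, saveAsHadoopFiles: write each batch to a Hadoop-compatible file system, naming the output after the batch timestamp
        • foreachRDD: lets the user run arbitrary operations on the RDD behind each batch of the DStream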
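
The split between transformations and output operators is essentially lazy evaluation. As a loose analogy only (this is plain Scala, not the Spark API), the sketch below uses `Iterator.map` as a stand-in for a transformation and `toList` as a stand-in for an output operation, and counts how often the mapping function has run before and after the output step. `LazyTransformSketch` and `run` are hypothetical names.

```scala
object LazyTransformSketch {
  // Returns (function calls before the "output" step, calls after it, results).
  def run(batch: Seq[String]): (Int, Int, Seq[Int]) = {
    var calls = 0
    // "Transformation": Iterator.map only records the function; nothing is evaluated yet.
    val lengths: Iterator[Int] = batch.iterator.map { s => calls += 1; s.length }
    val callsBeforeOutput = calls
    // "Output operation": consuming the iterator forces the computation.
    val results = lengths.toList
    (callsBeforeOutput, calls, results)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, results) = run(Seq("spark", "storm", "kafka!"))
    println(s"calls before output: $before, after output: $after, results: $results")
  }
}
```

In the same spirit, a streaming program that only chains transformations and never registers an output operator has nothing to execute.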

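• A DStreamGraph is an abstraction over a series of transformation operations
• For example:
– Given c = a.join(b) and d = c.filter(), the logical DAG is a/b → c and c → d, but Spark Streaming physically records the links in the opposite direction, a/b ← c and c ← d: each DStream keeps a reference to the DStreams it depends on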
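
The reversed bookkeeping in the example above can be sketched in a few lines of plain Scala. `Node` is a hypothetical stand-in for a DStream that knows only its parents, and `lineage` walks from an output node back to its sources; the names are illustrative and not part of Spark.

```scala
object DependencySketch {
  // Hypothetical stand-in for a DStream: each node only knows its parents.
  final case class Node(name: String, parents: List[Node] = Nil)

  // Walk from an output node back through its parents and return the names
  // in an order where every parent appears before its children.
  def lineage(output: Node): List[String] = {
    def visit(node: Node, seen: List[String]): List[String] =
      if (seen.contains(node.name)) seen
      else node.parents.foldLeft(seen)((acc, parent) => visit(parent, acc)) :+ node.name
    visit(output, Nil)
  }

  def main(args: Array[String]): Unit = {
    val a = Node("a")
    val b = Node("b")
    val c = Node("c", List(a, b)) // c = a.join(b)
    val d = Node("d", List(c))    // d = c.filter(...)
    println(lineage(d).mkString(" -> "))
  }
}
```

Storing child-to-parent links means that starting from the output streams is enough to discover everything that has to be computed for a batch.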

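• All dependencies created by transformations between DStreams are stored in the DStreamGraph, which is essential for generating the RDD graph later
• The DStreamGraph resembles a simplified DAG scheduler: it is responsible for generating a series of JobSets at the configured time interval and for serialization according to the dependencies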

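2. DStream operations

The primitives on a DStream are similar to those on an RDD and fall into two groups, Transformations and Output Operations. The transformations also include some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.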

2.1. Transformations on DStreams

Transformation	Meaning
map(func)	Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)	Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions)	Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream)	Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count()	Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)	Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue()	When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks])	When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks])	When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks])	When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)	Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func)	Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

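Special Transformations

1. UpdateStateByKey operation
The updateStateByKey primitive keeps a record of state across batches; the cumulative Word Count example later in this article relies on it. Without updateStateByKey, each batch is analyzed and its result output, after which the result is no longer retained.
2. Transform operation
The transform primitive allows an arbitrary RDD-to-RDD function to be applied to a DStream, which makes it easy to extend the Spark API. MLlib (machine learning) and GraphX are also combined with streaming through this function.
3. Window operations

Window operations are somewhat similar to State in Storm: by setting the window size and the sliding interval, you can dynamically obtain the current running state of the stream.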
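
To illustrate what updateStateByKey does from batch to batch, here is a small plain-Scala model (no Spark dependency). `updateCount` has the same shape as the per-key function that updateStateByKey accepts, `(Seq[V], Option[S]) => Option[S]`; `applyBatch` and the `Map`-based state are hypothetical simplifications of what Spark keeps in checkpointed state RDDs.

```scala
object UpdateStateSketch {
  // Same shape as the per-key update function: the new values from the
  // current batch plus the previous state for that key, if any.
  def updateCount(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(newValues.sum + previous.getOrElse(0))

  // Apply one batch of (word, 1) pairs to the running state.
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    // Keys that already have state are updated even when the batch has no new values for them.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { key =>
      updateCount(grouped.getOrElse(key, Seq.empty), state.get(key)).map(key -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterFirst = applyBatch(Map.empty, Seq("hello" -> 1, "spark" -> 1, "hello" -> 1))
    val afterSecond = applyBatch(afterFirst, Seq("hello" -> 1))
    println(afterFirst)
    println(afterSecond)
  }
}
```

Returning `None` from the update function instead of `Some(...)` would drop the key from the state, which is how stale keys can be expired.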

2.2. Output Operations on DStreams

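Output operations write the data of a DStream to external systems such as databases or file systems. Like actions on RDDs, it is only when an output operation is invoked that the streaming program actually starts computing.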

Output Operation	Meaning
print()	Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix])	Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix])	Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix])	Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func)	The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

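(III) Spark Streaming architecture

The architecture consists of three modules:

    – Master: records the dependencies (lineage) between DStreams and schedules tasks to generate new RDDs
    – Worker: receives data from the network, stores it, and executes the RDD computations
    – Client: feeds data into Spark Streaming

Spark Streaming job submission

• Network Input Tracker: tracks every piece of data received from the network and maps it to the corresponding input DStream
• Job Scheduler: periodically visits the DStreamGraph, generates Spark jobs, and hands them to the Job Manager
• Job Manager: takes jobs from the job queue and runs them as Spark jobs

Streaming window operations

• Spark provides a set of window operations that use a sliding window to compute statistics over incrementally updated, large-scale data
• Window operation: at regular intervals, process the data from a given span of time

• Any window-based operation requires two parameters:
– Window length

– Slide interval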
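
How the two window parameters interact can be shown with a plain-Scala model (no Spark dependency). Both parameters are expressed here as a number of batches, reflecting the Spark Streaming requirement that window length and slide interval be multiples of the batch interval. `WindowSketch` and `windowedSums` are hypothetical names; in a real job the equivalent would be an operator such as `window` or `reduceByKeyAndWindow` with `Duration` arguments.

```scala
object WindowSketch {
  // batches(i) holds the records of batch i + 1. windowLen and slide are
  // expressed as a number of batches (multiples of the batch interval).
  // Returns (batch number at which the window fires, sum over the window).
  def windowedSums(batches: Vector[Seq[Int]], windowLen: Int, slide: Int): Vector[(Int, Int)] = {
    require(windowLen > 0 && slide > 0, "window length and slide must be positive")
    (slide to batches.length by slide).toVector.map { t =>
      val from = math.max(0, t - windowLen)
      // The window covers the last windowLen batches up to and including batch t.
      (t, batches.slice(from, t).flatten.sum)
    }
  }

  def main(args: Array[String]): Unit = {
    val batches = Vector(Seq(1), Seq(2), Seq(3), Seq(4), Seq(5))
    // Window of 3 batches, sliding every 2 batches.
    windowedSums(batches, windowLen = 3, slide = 2).foreach { case (t, sum) =>
      println(s"window ending at batch $t -> sum $sum")
    }
  }
}
```

When the window is longer than the slide, consecutive windows overlap, so the same batch contributes to more than one result.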

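Streaming fault tolerance

• A real-time stream processing system must run 24/7 and be able to recover from all kinds of system failures. Spark Streaming was designed from the start to support recovery from both driver and worker node failures.
• Worker fault tolerance: guaranteed by Spark and the RDD model. Spark Streaming is built on Spark, so its worker nodes use the same fault-tolerance mechanism
• Driver fault tolerance: relies on a persistent write-ahead log (WAL)
    – Enabling the WAL requires the following configuration
    – 1: Set a checkpoint directory on the StreamingContext. The directory must be on a Hadoop-compatible file system; it stores the WAL and the streaming checkpoints
    – 2: Set spark.streaming.receiver.writeAheadLog.enable to true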
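
The two configuration steps above might look like the following sketch. It assumes the Spark Streaming DStream API used elsewhere in this article and a job launched through spark-submit (no master is set in code). The checkpoint path `hdfs:///tmp/streaming-checkpoint` and the object name `WalConfigSketch` are placeholders, and the socket source simply reuses the host and port from the WordCount example so that the context has a receiver and an output operation.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalConfigSketch {
  // Placeholder checkpoint location; any Hadoop-compatible path can be used.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("WalConfigSketch")
      // Step 2 from the list above: turn on the receiver write-ahead log.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Step 1 from the list above: the WAL and the checkpoints are stored here.
    ssc.checkpoint(checkpointDir)
    // A receiver-based source and one output operation, so there is something to run.
    ssc.socketTextStream("192.168.10.101", 9999).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from checkpoint data after a driver restart,
    // or create a fresh one on first launch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The key design point is that all stream definitions live inside `createContext`, because on recovery `StreamingContext.getOrCreate` restores them from the checkpoint instead of calling the function again.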

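How the WAL works in Streaming

• Process overview:

• Recovery process after a failed driver restarts:

(IV) Spark Streaming in practice

1. Real-time WordCount with Spark Streaming

Architecture diagram:

1. Install and start the producer
First, install the nc tool with YUM on a Linux machine (IP: 192.168.10.101)
yum install -y nc
Start a server listening on port 9999
nc -lk 9999

2. Write the Spark Streaming program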

package cn.itcast.spark.streaming

import cn.itcast.spark.util.LoggerLevel
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Set the log level (LoggerLevel is a project-local helper)
    LoggerLevel.setStreamingLogLevels()
    // Create the SparkConf and run in local mode
    // Note: local[2] means two threads
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Set the DStream batch interval to 2 seconds
    val ssc = new StreamingContext(conf, Seconds(2))
    // Read data over the network
    val lines = ssc.socketTextStream("192.168.10.101", 9999)
    // Split each line into words on spaces
    val words = lines.flatMap(_.split(" "))
    // Pair each word with the count 1
    val pairs = words.map(word => (word, 1))
    // Group by word and sum the occurrences within the batch
    val wordCounts = pairs.reduceByKey(_ + _)
    // Print the result to the console
    wordCounts.print()
    // Start the computation
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }
}

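3. Start the Spark Streaming program: because it uses local mode, "local[2]", it can be run directly on the local machine

Note: the parallelism must be specified. When running locally, setMaster("local[2]") starts two threads, one for the receiver and one for the computation. When running on a cluster, the number of available cores must be greater than 1

4. Type words on the Linux command line (in the nc session)

5. Check the results in the IDEA console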
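
To see what the program computes for a single batch, the same flatMap, map, and reduce-by-key logic can be run on an in-memory list of lines. This is a plain-Scala sketch with no Spark dependency; `BatchWordCountSketch` and `countWords` are hypothetical names, and the `filter(_.nonEmpty)` step is an addition that the streaming program above does not have.

```scala
object BatchWordCountSketch {
  // Count words within one batch, mirroring flatMap -> map -> reduceByKey.
  def countWords(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))
      .filter(_.nonEmpty) // addition: skip empty tokens produced by repeated spaces
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Two consecutive batches, as if typed into nc two seconds apart.
    println(countWords(Seq("hello spark", "hello")))
    println(countWords(Seq("hello")))
  }
}
```

Each call starts from scratch, so the count for a word in the second batch knows nothing about the first.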

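Problem: the word counts for each batch typed on the Linux side are computed correctly, but they are not accumulated across batches. To accumulate them, use updateStateByKey(func) to maintain state. An example follows: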

package cn.itcast.spark.streaming

import cn.itcast.spark.util.LoggerLevel
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkUpdateStateWordCount {
  /**
    * String: the word, e.g. hello
    * Seq[Int]: the counts of the word in the current batch
    * Option[Int]: the accumulated result from previous batches
    */
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    iter.map { case (word, counts, previous) => (word, counts.sum + previous.getOrElse(0)) }
  }

  def main(args: Array[String]): Unit = {
    LoggerLevel.setStreamingLogLevels()
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkUpdateStateWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Checkpoint directory (use shared storage when running on a cluster)
    ssc.checkpoint("c://aaa")
    // Same host as the nc server started earlier
    val lines = ssc.socketTextStream("192.168.10.101", 9999)
    // reduceByKey does not accumulate results across batches
    // val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // updateStateByKey accumulates results, but needs a custom update function: updateFunc
    val results = lines.flatMap(_.split(" ")).map((_, 1))
      .updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

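2. Real-time website clickstream statistics with Spark Streaming and Kafka

1. Install and configure ZooKeeper

2. Install and configure Kafka

3. Start ZooKeeper
4. Start Kafka

5. Create the topic

bin/kafka-topics.sh --create --zookeeper node1.itcast.cn:2181,node2.itcast.cn:2181 \
--replication-factor 3 --partitions 3 --topic urlcount

6. Write the Spark Streaming application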

package cn.itcast.spark.streaming

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UrlCount {
  // Add the counts from the current batch to the accumulated count for each URL
  val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.map { case (url, counts, previous) => (url, counts.sum + previous.getOrElse(0)) }
  }

  def main(args: Array[String]): Unit = {
    // Read the command-line arguments
    val Array(zkQuorum, groupId, topics, numThreads, hdfs) = args
    // Create the SparkConf and set the application name
    val conf = new SparkConf().setAppName("UrlCount")
    // Create the StreamingContext
    val ssc = new StreamingContext(conf, Seconds(2))
    // Set the checkpoint directory
    ssc.checkpoint(hdfs)
    // Map each topic to the number of consumer threads
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Pull data from Kafka to create the DStream
    // (receiver-based API from the spark-streaming-kafka 0.8 integration)
    val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap, StorageLevel.MEMORY_AND_DISK).map(_._2)
    // Split each record and take the URL the user clicked (7th space-separated field)
    val urls = lines.map(x => (x.split(" ")(6), 1))
    // Accumulate the click count per URL
    val result = urls.updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    // Print the result to the console
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
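
The expression `x.split(" ")(6)` in the program above throws an exception on any record with fewer than seven space-separated fields. A more defensive variant is sketched below in plain Scala. The article does not show the log format, so the sample line is an assumption (an access-log style record in which the request path is the seventh field), and `UrlFieldSketch` and `extractUrl` are hypothetical names.

```scala
object UrlFieldSketch {
  // Return the 7th space-separated field (index 6), or None for short lines.
  def extractUrl(line: String): Option[String] = {
    val fields = line.split(" ")
    if (fields.length > 6) Some(fields(6)) else None
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample record; the real clickstream format may differ.
    val sample = "192.168.10.1 - - [08/May/2024:18:50:24 +0800] \"GET /index.html HTTP/1.1\" 200 1024"
    println(extractUrl(sample))
    println(extractUrl("malformed record"))
  }
}
```

In the streaming job this could replace the `map` with `lines.flatMap(line => extractUrl(line).map(url => (url, 1)))`, so that malformed records are skipped instead of failing the batch.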
