Spark Streaming Programming Guide (updating)

大怀特

Published 2021-10-26 18:23:14


Original link: http://spark.apache.org/docs/2.4.8/streaming-programming-guide.html


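Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams.

Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.

Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine learning and graph processing algorithms on data streams.

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.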

A Quick Example

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two worker threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

Basic Concepts

Adding Dependencies

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.8</version>
    <scope>provided</scope>
</dependency>

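Ingesting data from sources like Kafka, Flume, and Kinesis is no longer part of the Spark Streaming core API. You have to add the corresponding artifact spark-streaming-xyz_2.12 to the dependencies. Some of the common ones are as follows.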

Source	Artifact
Kafka	spark-streaming-kafka-0-10_2.12
Flume	spark-streaming-flume_2.12
Kinesis	spark-streaming-kinesis-asl_2.12 [Amazon Software License]

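Initializing StreamingContext

To initialize a Spark Streaming program, a StreamingContext object has to be created, which is the main entry point of all Spark Streaming functionality.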

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

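appName is the name your application shows on the cluster UI. master is a Spark, Mesos, Kubernetes, or YARN cluster URL, or the special string "local[*]" to run in local mode. In practice the application runs on a cluster; for local testing and unit tests, use local mode, for example .setMaster("local[2]").

Note: internally this creates a SparkContext (the starting point of all Spark functionality), which can be accessed as ssc.sparkContext.

A StreamingContext object can also be created from an existing SparkContext object.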

import org.apache.spark.streaming._

val sc = ...                // existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))

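After a context is defined, you have to do the following.

Define the input sources by creating input DStreams.
Define the streaming computations by applying transformation and output operations to DStreams.
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
The processing can be manually stopped using streamingContext.stop().

Points to remember

Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.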

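Discretized Streams (DStreams)

Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval (illustrated by a figure in the original guide).

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.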
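The batch view described above can be modelled in plain Scala without Spark. The sketch below is only an illustration under stated assumptions: Record, toBatches, and flatMapEachBatch are hypothetical names, ordinary collections stand in for RDDs, and intervals with no records are simply skipped in this model.

```scala
object DStreamModel {
  // A line of text tagged with its arrival time (hypothetical type for this sketch).
  final case class Record(timestampMs: Long, line: String)

  // Split records into consecutive batches of batchMs milliseconds, oldest first.
  def toBatches(records: Seq[Record], batchMs: Long): Seq[Seq[String]] =
    records
      .groupBy(_.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (_, rs) => rs.map(_.line) }

  // A DStream transformation is the same collection operation applied to every batch.
  def flatMapEachBatch(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(_.flatMap(_.split(" ")))
}
```

Each inner sequence plays the role of one RDD, so flatMapEachBatch mirrors how lines.flatMap(_.split(" ")) is applied batch by batch.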

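[Figure: Spark Streaming]

These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and give the developer a higher-level API for convenience.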

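Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. In the quick example, lines was an input DStream, as it represented the stream of data received from the netcat server. Every input DStream (except file streams) is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.

Spark Streaming provides two categories of built-in streaming sources.

Basic sources: sources directly available in the StreamingContext API, such as file systems and socket connections.
Advanced sources: sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.

Some of the sources in each category are discussed later.

Note: if you want to receive multiple streams of data in parallel, you can create multiple input DStreams. This creates multiple receivers that simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, so it occupies one of the cores allocated to the Spark Streaming application. Therefore, a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data as well as to run the receivers.

Points to remember

When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL. Either of these means that only one thread is used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread is used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use "local[n]" as the master URL, where n > number of receivers to run.
Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.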
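The thread-count rule above can be checked before a job is launched. The following is a minimal sketch, not part of Spark: LocalMasterCheck, localThreads, and hasProcessingThreads are hypothetical helper names, and only the plain "local", "local[*]", and "local[n]" forms of the master URL are recognised.

```scala
object LocalMasterCheck {
  private val LocalN = """local\[(\d+)\]""".r

  // Number of threads implied by a local master URL, or None for any other master.
  def localThreads(master: String): Option[Int] = master match {
    case "local"    => Some(1)
    case "local[*]" => Some(Runtime.getRuntime.availableProcessors())
    case LocalN(n)  => Some(n.toInt)
    case _          => None
  }

  // True when at least one thread is left over after every receiver has taken one.
  // Non-local masters are not checked here and are reported as fine.
  def hasProcessingThreads(master: String, receivers: Int): Boolean =
    localThreads(master).forall(_ > receivers)
}
```

For example, hasProcessingThreads("local[2]", 1) holds, while "local" or "local[1]" with one receiver does not.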

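Basic Sources

The quick example used ssc.socketTextStream(...), which creates a DStream from text data received over a TCP socket connection. Besides sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.

File Streams

For reading data from files on a file system (HDFS, S3, NFS, etc.), a DStream can be created via streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass].

File streams do not require running a receiver, so there is no need to allocate any cores for receiving file data.

For simple text files, the easiest method is streamingContext.textFileStream(dataDirectory).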

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

streamingContext.textFileStream(dataDirectory)

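How Directories are Monitored

Spark Streaming monitors the directory and processes any files created in it.

A simple directory can be monitored, such as "hdfs://namenode:8040/logs/". All files directly under such a path are processed as they are discovered.
A POSIX glob pattern can be supplied, such as "hdfs://namenode:8040/logs/2017/*". Here, the DStream consists of all files in the directories matching the pattern. That is, it is a pattern of directories, not of files in directories.
All files must be in the same data format.
A file is considered part of a time period based on its modification time, not its creation time.
Once processed, changes to a file within the current window will not cause the file to be reread. That is, updates are ignored.
The more files under a directory, the longer it takes to scan for changes, even if no files have been modified.
If a wildcard is used to identify directories, such as "hdfs://namenode:8040/logs/2016-*", renaming an entire directory to match the path adds the directory to the list of monitored directories. Only the files in the directory whose modification time is within the current window are included in the stream.
Calling FileSystem.setTimes() to fix the timestamp is a way to have the file picked up in a later window, even if its contents have not changed.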

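Using Object Stores as a Source of Data

"Full" filesystems such as HDFS tend to set the modification time on their files as soon as the output stream is created. When a file is opened, even before data has been completely written, it may be included in the DStream, after which updates to the file within the same window are ignored. That is, changes may be missed and data omitted from the stream.

To guarantee that changes are picked up in a window, write the file to an unmonitored directory, then, immediately after the output stream is closed, rename it into the destination directory. Provided the renamed file appears in the scanned destination directory during the window of its creation, the new data will be picked up.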
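The write-then-rename pattern can be sketched on a local filesystem with java.nio. This is an illustration under stated assumptions: AtomicPublish, stagingDir, and monitoredDir are hypothetical names, both directories are assumed to live on the same filesystem so that an atomic move is possible, and on HDFS the equivalent steps would go through the Hadoop FileSystem API instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicPublish {
  // Write the whole file in stagingDir first, then move it into monitoredDir in one step,
  // so a directory scan never observes a half-written file.
  def publish(stagingDir: Path, monitoredDir: Path, name: String, content: String): Path = {
    val tmp = stagingDir.resolve(name + ".tmp")
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8))
    Files.move(tmp, monitoredDir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }
}
```

The staging directory must not match the path or glob that the stream monitors, otherwise the temporary file itself could be picked up.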

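In contrast, object stores such as Amazon S3 and Azure Storage usually have slow rename operations, as the data is actually copied.
Furthermore, a renamed object may have the time of the rename() operation as its modification time, so it may not be considered part of the window that the original creation time implied.

Careful testing is needed against the target object store to verify that the timestamp behavior of the store is consistent with that expected by Spark Streaming. It may be that writing directly into the destination directory is the appropriate strategy for streaming data via the chosen object store.

For more details, see the Hadoop Filesystem Specification.

Streams based on Custom Receivers

DStreams can be created with data streams received through custom receivers.

Queue of RDDs as a Stream

For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs.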

streamingContext.queueStream(queueOfRDDs)

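Each RDD pushed into the queue is treated as a batch of data in the DStream and processed like a stream.

For more details on streams from sockets and files, see the API documentation of StreamingContext for Scala, JavaStreamingContext for Java, and StreamingContext for Python.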

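Advanced Sources

Python API: as of Spark 2.4.8, the Kafka, Kinesis and Flume sources are available in the Python API.

This category of sources requires interfacing with external non-Spark libraries, some of them with complex dependencies (e.g., Kafka and Flume).
Hence, to minimize issues related to version conflicts of dependencies, the functionality to create DStreams from these sources has been moved to separate libraries that must be linked to explicitly.

Note: these advanced sources are not available in the Spark shell, hence applications based on them cannot be tested in the shell. If you really want to use them in the Spark shell, you have to download the corresponding JAR from the Maven repository and add it to the classpath.

Some of these advanced sources are as follows.

Kafka: Spark Streaming 2.4.8 is compatible with Kafka 0.8.2.1 or higher. See the Kafka Integration Guide for more details.
Flume: Spark Streaming 2.4.8 is compatible with Flume 1.6.0. See the Flume Integration Guide for more details.
Kinesis: Spark Streaming 2.4.8 is compatible with Kinesis Client Library 1.2.1. See the Kinesis Integration Guide for more details.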

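Custom Sources

Input DStreams can also be created out of custom data sources. All you have to do is implement a user-defined receiver that can receive data from the custom source and push it into Spark. See the Custom Receiver Guide for details.

Receiver Reliability

There can be two kinds of data sources based on their reliability. Sources like Kafka and Flume allow the transferred data to be acknowledged. If the system receiving data from these reliable sources acknowledges the received data correctly, it can be ensured that no data is lost due to any kind of failure. This leads to two kinds of receivers.

Reliable Receiver - A reliable receiver sends an acknowledgment to a reliable source when the data has been received and stored in Spark with replication.
Unreliable Receiver - An unreliable receiver does not send an acknowledgment to a source. This can be used for sources that do not support acknowledgment, or for reliable sources when one does not want or need to go into the complexity of acknowledgment.

Transformations on DStreams

Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many transformations.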

Transformation	Meaning
map(func)	Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func)	Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)	Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions)	Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream)	Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count()	Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)	Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
countByValue()	When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks])	When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property `spark.default.parallelism`) to do the grouping. You can pass an optional `numTasks` argument to set a different number of tasks.
join(otherStream, [numTasks])	When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks])	When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)	Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func)	Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

UpdateStateByKey Operation

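The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you have to do two steps.

Define the state - The state can be an arbitrary data type.
Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.

In every batch, Spark applies the state update function to all existing keys, regardless of whether they have new data in the batch or not. If the update function returns None, the key-value pair is eliminated.

Let's illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as: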

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = ...  // add the new values with the previous running count to get the new count
    Some(newCount)
}

This is applied on a DStream containing words (say, the pairs DStream containing (word, 1) pairs in the earlier example).

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)

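The update function is called for each word, with newValues holding a sequence of 1's (from the (word, 1) pairs) and runningCount holding the previous count.

Note: using updateStateByKey requires the checkpoint directory to be configured; see the checkpointing section for details.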
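One possible way to fill in the update function for the running word count is sketched below. It is an assumption about what the elided body could look like, not the guide's own code; the wrapper object RunningCount is a hypothetical name used only to keep the example self-contained.

```scala
object RunningCount {
  // New state = sum of the values seen in this batch plus the previous count.
  // A key that has no previous state starts from 0.
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(newValues.sum + runningCount.getOrElse(0))
}
```

Because the function always returns Some, no key is ever dropped; returning None for some condition (for example a count that should expire) would remove that key from the state.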

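Transform Operation

The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use transform to do this. This enables very powerful possibilities. For example, one can do real-time data cleaning by joining the input data stream with precomputed spam information and then filtering based on it.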

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
  ...
}

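Note: the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches.

Window Operations

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The following figure illustrates this sliding window.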

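[Figure: Spark Streaming]

As shown in the figure, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data, and slides by 2 time units. This shows that any window operation needs to specify two parameters.

window length - The duration of the window.
sliding interval - The interval at which the window operation is performed.

These two parameters must be multiples of the batch interval of the source DStream.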
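The multiple-of-batch-interval rule is simple arithmetic and can be written as a small check. The helper below is a sketch with hypothetical names (WindowParams, isValid, batchesPerWindow) and plain millisecond values; it is not a Spark API.

```scala
object WindowParams {
  // True when both durations are positive whole multiples of the batch interval.
  def isValid(batchMs: Long, windowMs: Long, slideMs: Long): Boolean =
    batchMs > 0 && windowMs > 0 && slideMs > 0 &&
      windowMs % batchMs == 0 && slideMs % batchMs == 0

  // Number of batches (RDDs) combined into one window.
  def batchesPerWindow(batchMs: Long, windowMs: Long): Long = windowMs / batchMs
}
```

With the 1-second batch interval from the quick example, a 30-second window sliding every 10 seconds is valid and combines 30 batches per window.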

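Let's illustrate the window operations with an example. Say you want to extend the earlier example by generating word counts over the last 30 seconds of data, every 10 seconds. To do this, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.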

val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))

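Some of the common window operations are as follows. All of these operations take the two parameters described above, windowLength and slideInterval.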

Transformation	Meaning
window(windowLength, slideInterval)	Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval)	Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval)	Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property `spark.default.parallelism`) to do the grouping. You can pass an optional `numTasks` argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])	A more efficient version of the above `reduceByKeyAndWindow()` where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
countByValueAndWindow(windowLength, slideInterval, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument.
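The "inverse reduce" idea behind the second reduceByKeyAndWindow variant can be shown with plain integers. This sketch is an illustration only: IncrementalWindow, naive, and incremental are hypothetical names, each Int stands for the already-reduced value of one batch, and the window slides by one batch at a time.

```scala
object IncrementalWindow {
  // Recompute every full window from scratch.
  def naive(batches: Seq[Int], window: Int): Seq[Int] =
    batches.sliding(window).map(_.sum).toSeq

  // Reuse the previous window: add the batch that enters, subtract the batch that leaves.
  def incremental(batches: Seq[Int], window: Int): Seq[Int] = {
    require(window > 0 && batches.length >= window, "need at least one full window")
    val first = batches.take(window).sum
    val entering = batches.drop(window)
    val leaving = batches.dropRight(window)
    entering.zip(leaving).scanLeft(first) { case (acc, (in, out)) => acc + in - out }
  }
}
```

Addition with subtraction as its inverse is what makes this work; a reduce function such as max has no inverse, so it can only use the first variant.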

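Join Operations

Finally, it is worth highlighting how easily you can perform joins in Spark Streaming.

Stream-stream joins

Streams can be very easily joined with other streams.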

val stream1: DStream[(String, String)] = ...
val stream2: DStream[(String, String)] = ...
val joinedStream = stream1.join(stream2)

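In each batch interval, the RDD generated by stream1 is joined with the RDD generated by stream2. You can also use leftOuterJoin and rightOuterJoin. Furthermore, it is often useful to do joins over windows of the streams, which is just as easy.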

val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
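What a join does to the two RDDs of one batch interval can be modelled with ordinary collections. The sketch below assumes inner-join semantics (all pairs of values for each shared key) and uses hypothetical names BatchJoin and joinBatch; it does not use Spark.

```scala
object BatchJoin {
  // Inner join of two batches of key-value pairs: every (v, w) combination per shared key.
  def joinBatch[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v)  <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))
}
```

Keys that appear on only one side are dropped, which is the behaviour that leftOuterJoin and rightOuterJoin relax.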

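Stream-dataset joins

This has already been shown earlier while explaining the transform operation. Here is another example, joining a windowed stream with a dataset.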

val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }

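In fact, you can also dynamically change the dataset you want to join against. The function provided to transform is evaluated in every batch interval, so it uses whatever dataset the reference currently points to.

The complete list of DStream transformations is available in the API documentation. For Scala, see DStream and PairDStreamFunctions. For Java, see JavaDStream and JavaPairDStream. For Python, see DStream.

Output Operations on DStreams

Output operations allow a DStream's data to be pushed out to external systems such as a database or a file system. Since the output operations are what let external systems consume the transformed data, they are the point at which data actually leaves the stream. The output operations are as follows.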

Output Operation	Meaning
print()	Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. Python API This is called pprint() in the Python API.
saveAsTextFiles(prefix, [suffix])	Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix])	Save this DStream's contents as `SequenceFiles` of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Python API This is not available in the Python API.
saveAsHadoopFiles(prefix, [suffix])	Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Python API This is not available in the Python API.
foreachRDD(func)	The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

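Design Patterns for using foreachRDD

dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.

Often writing data to an external system requires creating a connection object (e.g. a TCP connection to a remote server) and using it to send data to a remote system. For this purpose, a developer may inadvertently try creating a connection object at the Spark driver, and then try to use it in a Spark worker to save records in the RDDs. For example (in Scala):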

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

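This is incorrect, as it requires the connection object to be serialized and sent from the driver to the worker. Such connection objects are rarely transferable across machines. This error may manifest as serialization errors (connection object not serializable), initialization errors (connection object needs to be initialized at the workers), etc. The correct solution is to create the connection object at the worker.

However, this can lead to another common mistake: creating a new connection for every record. For example: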

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

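Typically, creating a connection object has time and resource overheads. Therefore, creating and destroying a connection object for each record can incur unnecessarily high overheads and can significantly reduce the overall throughput of the system. A better solution is to use rdd.foreachPartition: create a single connection object and send all the records in an RDD partition using that connection.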

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

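This amortizes the connection creation overheads over many records.

Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of multiple batches are pushed to the external system, thus further reducing the overheads.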

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

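Note that the connections in the pool should be lazily created on demand and timed out if not used for a while. This achieves the most efficient sending of data to external systems.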
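The ConnectionPool used in the example above is not defined in this guide. A minimal sketch of what such a pool might look like follows; Connection is a hypothetical stand-in that only records what it was asked to send, createdCount exists purely for inspection, and the idle timeout mentioned above is deliberately left out to keep the example short.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection type; a real one would wrap a socket or a database client.
final class Connection(val id: Int) {
  private val sent = new ConcurrentLinkedQueue[String]()
  def send(record: String): Unit = sent.add(record)
  def sentCount: Int = sent.size
}

// A Scala object is initialized on first use in each JVM, so each executor gets its own pool.
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[Connection]()
  private val created = new AtomicInteger(0)

  // Reuse an idle connection if one exists, otherwise create a new one on demand.
  def getConnection(): Connection = {
    val pooled = idle.poll()
    if (pooled != null) pooled else new Connection(created.incrementAndGet())
  }

  // Hand the connection back so that a later partition or batch can reuse it.
  def returnConnection(connection: Connection): Unit = idle.offer(connection)

  def createdCount: Int = created.get()
}
```

Because the pool lives in a singleton object rather than in the closure, nothing connection-related has to be serialized from the driver to the workers.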

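Other points to remember:

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing gets executed. The system simply receives the data and discards it.
By default, output operations are executed one at a time, in the order they are defined in the application.

DataFrame and SQL Operations

You can easily use DataFrames and SQL operations on streaming data. You have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done such that it can be restarted on driver failures. This is done by creating a lazily instantiated singleton instance of SparkSession, as shown in the following example. It modifies the earlier word count example to generate word counts using DataFrames and SQL. Each RDD is converted to a DataFrame, registered as a temporary table, and then queried using SQL.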

/** DataFrame operations inside your streaming program */

val words: DStream[String] = ...

words.foreachRDD { rdd =>

  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")

  // Create a temporary view
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on DataFrame using SQL and print it
  val wordCountsDataFrame = 
    spark.sql("select word, count(*) as total from words group by word")
  wordCountsDataFrame.show()
}

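You can also run SQL queries on tables defined on streaming data from a different thread (that is, asynchronous to the running StreamingContext). Just make sure that you set the StreamingContext to remember a sufficient amount of streaming data such that the query can run. Otherwise the StreamingContext, which is unaware of any asynchronous SQL queries, will delete old streaming data before the query can complete. For example, if you want to query the last batch, but your query can take 5 minutes to run, then call streamingContext.remember(Minutes(5)).

See the DataFrames and SQL guide to learn more about DataFrames.

MLlib Operations

You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details.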

Caching/Persistence

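Similar to RDDs, DStreams also allow developers to persist the stream's data in memory. That is, using the persist() method on a DStream automatically persists every RDD of that DStream in memory. This is useful if the data in the DStream will be computed multiple times (e.g., multiple operations on the same data). For window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey, this is implicitly true. Hence, DStreams generated by window-based operations are automatically persisted in memory, without the developer calling persist().

For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault tolerance.

Note: unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory. This is further discussed in the Performance Tuning section. More information on different persistence levels can be found in the Spark Programming Guide.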

Checkpointing

Spark Streaming - Spark 2.4.8 Documentation (apache.org)

....
