How Spark Streaming Works

Latest recommended article published on 2021-04-27 18:11:21

chenyanlong_v

Corrections are welcome if anything here falls short.

Article link: https://blog.csdn.net/longyanchen/article/details/100547615

Official introduction

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
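The batching idea behind a DStream can be modeled in a few lines of plain Python. This is only a sketch and does not use the Spark API. The `discretize` function, the millisecond timestamps, and the sample events are hypothetical, chosen to show how a continuous sequence of records is cut into fixed-interval batches.

```python
def discretize(events, batch_interval_ms):
    """Group (timestamp_ms, record) pairs into consecutive fixed-length batches.

    Batch i holds the records whose timestamp falls in
    [i * batch_interval_ms, (i + 1) * batch_interval_ms).
    Intervals with no records still yield an (empty) batch.
    """
    if not events:
        return []
    last_ts = max(ts for ts, _ in events)
    n_batches = last_ts // batch_interval_ms + 1
    batches = [[] for _ in range(n_batches)]
    for ts, record in events:
        batches[ts // batch_interval_ms].append(record)
    return batches


# Hypothetical input: five records arriving over 1.6 seconds.
events = [(0, "a"), (120, "b"), (480, "c"), (510, "d"), (1600, "e")]
batches = discretize(events, 500)
for i, batch in enumerate(batches):
    print(f"batch {i} [{i * 500}, {(i + 1) * 500}) ms: {batch}")
```

With a 500 ms interval, the first three records land in batch 0, the fourth in batch 1, batch 2 is empty, and the last record lands in batch 3.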

Detailed explanation:

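In Spark Streaming, a receiver-based source such as a TCP socket is served by a Receiver component, which runs as a long-running task on an Executor. The Receiver takes in the external data stream and forms the input DStream. The DStream is divided by time interval into a sequence of RDDs, one per batch; once the batch interval is shortened to the order of seconds, this model can handle real-time data streams. The interval is set by a parameter and is usually between 500 milliseconds and a few seconds. Operating on a DStream amounts to operating on its RDDs, and the computed results can be passed to external systems.

As the figure above shows, the Spark Streaming workflow is: receive the live data, divide it into batches, and hand each batch to the Spark engine, which produces the result for that batch.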
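The statement that operating on a DStream means operating on its RDDs can be illustrated with a toy model. This is a plain-Python sketch under stated assumptions, not Spark code: `MiniDStream`, `word_count`, and `external_store` are hypothetical names, and each "RDD" is simply a Python list. In Spark itself the corresponding ideas are exposed through DStream operations such as `transform` and `foreachRDD`.

```python
from collections import Counter


class MiniDStream:
    """Toy stand-in for a DStream: an ordered list of batches (lists of records)."""

    def __init__(self, batches):
        self.batches = batches

    def transform(self, fn):
        # A stream-level operation is fn applied to every batch independently.
        return MiniDStream([fn(batch) for batch in self.batches])

    def foreach_batch(self, sink):
        # Hand each batch's result to an external system, in batch order.
        for batch_id, batch in enumerate(self.batches):
            sink(batch_id, batch)


def word_count(lines):
    """Count words within a single batch and return sorted (word, count) pairs."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items())


# Three batches as a receiver might have produced them; the last one is empty.
received = MiniDStream([["spark streaming", "spark"], ["dstream rdd"], []])

external_store = []  # stands in for a database or dashboard
received.transform(word_count).foreach_batch(
    lambda batch_id, result: external_store.append((batch_id, result))
)
print(external_store)
```

Each batch is counted on its own, so the word totals are per batch rather than cumulative across the stream. The sink receives one result per batch interval, including the empty one.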