Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.


You can express your streaming computation the same way you would express a batch computation on static data.


The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.

That is, Spark SQL is responsible for continuously and incrementally updating the final result set.

You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc.

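Because Structured Streaming is built on Spark SQL, you work with the same Dataset/DataFrame API; these notes use Scala. The same API covers streaming aggregations, event-time windows, and stream-to-batch joins.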

The computation is executed on the same optimized Spark SQL engine.


Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.


In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.


Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.


However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.

Note that Continuous Processing, available since Spark 2.3, can reach end-to-end latencies as low as 1 ms, but it only provides at-least-once guarantees, not exactly-once.

Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.
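
As a rough sketch of how the processing mode can be chosen without touching the transformations, the example below starts a query whose only mode-specific part is the trigger. It assumes the spark-sql dependency is on the classpath; the object name TriggerModesSketch, the helper startQuery, the built-in "rate" test source, and the local[2] master are illustrative choices, not part of the original notes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object TriggerModesSketch {
  /** Starts the same transformed stream under whichever trigger the caller picks. */
  def startQuery(transformed: DataFrame, trigger: Trigger): StreamingQuery =
    transformed.writeStream
      .format("console")
      .outputMode("append")
      .trigger(trigger)
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("TriggerModesSketch")
      .master("local[2]") // assumption: local run for experimentation
      .getOrCreate()

    // The "rate" source generates (timestamp, value) rows for testing
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // A simple projection; the same DataFrame is used for either mode
    val transformed = input.selectExpr("timestamp", "value * 2 AS doubled")

    // Micro-batch mode: start a new batch about once per second.
    // For continuous mode, pass Trigger.Continuous("1 second") instead
    // (the argument is then the checkpoint interval).
    val query = startQuery(transformed, Trigger.ProcessingTime("1 second"))
    query.awaitTermination()
  }
}
```

Continuous mode supports only a subset of operations (projections and filters such as the one above), so the swap is not possible for every query.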


Quick Example


import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._

// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream    // readStream is the entry point for reading a stream
  .format("socket")   // format specifies the data source (socket, kafka, ...)
  .option("host", "localhost")
  .option("port", 9999)   // source parameters are passed as key-value options
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()   // "value" is the single column in the socket source's schema

This lines DataFrame represents an unbounded table containing the streaming text data.


This table contains one column of strings named “value”, and each line in the streaming text data becomes a row in the table.


Note that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a Dataset of String using .as[String], so that we can apply the flatMap operation to split each line into multiple words. The resultant words Dataset contains all the words.


Finally, we have defined the wordCounts DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.


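// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")   // one of three output modes: complete, append, update
  .format("console")
  .start()    // start the streaming computation

query.awaitTermination()   // block until the query terminates, keeping it active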


Programming Model

The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.


This leads to a new stream processing model that is very similar to a batch processing model.

You will express your streaming computation as a standard batch-like query, as you would on a static table, and Spark runs it as an incremental query on the unbounded input table.


Let’s understand this model in more detail.


A query on the input will generate the “Result Table”.



Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.


The “Output” is defined as what gets written out to the external storage. The output can be defined in one of the following modes:

  • Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table. In other words, the full result is written out again after every trigger interval.
  • Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change. In other words, only newly added rows are written out, while the Result Table itself still holds all the data.
  • Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode. For example, suppose the first trigger counts Xiao Ming twice and Da Fei once, and the second trigger brings one more Xiao Ming and one Gou Dan. Update mode then outputs only Xiao Ming (3) and Gou Dan (1), while the Result Table as a whole still accumulates all three counts.
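
To make the difference between Complete and Update mode concrete, here is a small plain-Scala simulation of a running word count over two triggers. It does not use Spark; the object OutputModesSketch, its helper names, and the three sample names are hypothetical and only mimic what the engine would emit. Append mode is left out because the rows of a running count keep changing.

```scala
object OutputModesSketch {
  type Counts = Map[String, Long]

  /** Merge the words of one trigger into the running Result Table. */
  def updateResult(previous: Counts, newWords: Seq[String]): Counts =
    newWords.foldLeft(previous) { (acc, w) =>
      acc.updated(w, acc.getOrElse(w, 0L) + 1L)
    }

  /** Complete mode: the whole Result Table is emitted after every trigger. */
  def completeOutput(current: Counts): Counts = current

  /** Update mode: only rows that are new or whose count changed since the last trigger. */
  def updateOutput(previous: Counts, current: Counts): Counts =
    current.filter { case (w, c) => !previous.get(w).contains(c) }
}
```

With a first trigger of Seq("XiaoMing", "XiaoMing", "DaFei") and a second of Seq("XiaoMing", "GouDan"), completeOutput returns all three counts after the second trigger, whereas updateOutput returns only the rows for XiaoMing and GouDan.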


This model is significantly different from many other stream processing engines.


Many streaming systems require the user to maintain running aggregations themselves, thus having to reason about fault-tolerance, and data consistency (at-least-once, or at-most-once, or exactly-once).


In this model, Spark is responsible for updating the Result Table when there is new data, thus relieving the users from reasoning about it. As an example, let’s see how this model handles event-time based processing and late arriving data.


Handling Event-time and Late Data

Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time.

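Event-time is the time at which each row was actually generated. For stream processing it is usually the most valuable time field, because it tells you which period a record really belongs to. For example, if a record is generated at 7:00 but only reaches the application at 8:00, then 8:00 is its processing time and 7:00 is its event-time. The two must not be confused.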

For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them.

This event-time is very naturally expressed in this model – each event from the devices is a row in the table, and event-time is a column value in the row.


This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column – each time window is a group and each row can belong to multiple windows/groups.
Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
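
The statement that one row can belong to several windows can be checked with a few lines of plain Scala. The sketch below assumes sliding windows aligned to multiples of the slide, with timestamps expressed as minutes since midnight; SlidingWindowSketch, Window, and windowsFor are hypothetical names, and the 10-minute window with a 5-minute slide is just an illustrative choice.

```scala
object SlidingWindowSketch {
  /** A half-open window [start, end), in minutes since midnight. */
  final case class Window(start: Long, end: Long)

  /** All sliding windows of length `size`, advancing by `slide`, that contain `ts`. */
  def windowsFor(ts: Long, size: Long, slide: Long): List[Window] = {
    // Latest window start that is not after ts
    val lastStart = Math.floorDiv(ts, slide) * slide
    Iterator
      .iterate(lastStart)(_ - slide)          // walk back one slide at a time
      .takeWhile(start => start + size > ts)  // keep windows that still cover ts
      .map(start => Window(start, start + size))
      .toList
      .reverse                                // earliest window first
  }
}
```

For an event at 12:04 (minute 724), windowsFor(724L, 10L, 5L) yields the windows 11:55 - 12:05 and 12:00 - 12:10, so the row contributes to two groups.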


Furthermore, this model naturally handles data that has arrived later than expected based on its event-time.


Since Spark is updating the Result Table, it has full control over updating old aggregates when there is late data, as well as cleaning up old aggregates to limit the size of intermediate state data. Since Spark 2.1, we have support for watermarking which allows the user to specify the threshold of late data, and allows the engine to accordingly clean up old state. These are explained later in more detail in the Window Operations section.

Streaming Deduplication


You can deduplicate records in data streams using a unique identifier in the events. This is exactly the same as deduplication on static data using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use deduplication with or without watermarking.

  • With watermark - If there is an upper bound on how late a duplicate record may arrive, then you can define a watermark on an event-time column and deduplicate using both the guid and the event-time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of state the query has to maintain.


  • Without watermark - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state.


val streamingDf = spark.readStream. ...  // columns: guid, eventTime, ...

// Without watermark using guid column
streamingDf.dropDuplicates("guid")

// With watermark using guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime")
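
To illustrate why a watermark bounds the deduplication state, the following plain-Scala sketch simulates the idea batch by batch. It is a simplified model, not the engine's implementation: DedupSketch, Event, State, and processBatch are hypothetical names, times are plain numbers (think seconds), and the watermark is assumed to be the largest event time seen in earlier batches minus the allowed delay.

```scala
object DedupSketch {
  final case class Event(guid: String, eventTime: Long)

  /** Dedup state: the (guid, eventTime) keys still remembered, plus the max event time seen. */
  final case class State(seen: Set[Event], maxEventTime: Option[Long])

  val empty: State = State(Set.empty, None)

  /** Process one batch; returns the new state and the rows that survive deduplication. */
  def processBatch(state: State, batch: Seq[Event], delay: Long): (State, List[Event]) = {
    // Watermark in force for this batch, derived from earlier batches only
    val watermark: Option[Long] = state.maxEventTime.map(_ - delay)
    def tooLate(e: Event): Boolean = watermark.exists(e.eventTime < _)

    // Emit events that are neither too late nor already seen
    val (seen, emittedReversed) =
      batch.foldLeft((state.seen, List.empty[Event])) { case ((seenSoFar, out), e) =>
        if (tooLate(e) || seenSoFar.contains(e)) (seenSoFar, out)
        else (seenSoFar + e, e :: out)
      }

    // Advance the watermark and evict keys that can no longer receive duplicates
    val newMax: Option[Long] =
      (state.maxEventTime.toList ++ batch.map(_.eventTime)).reduceOption(_ max _)
    val newWatermark: Option[Long] = newMax.map(_ - delay)
    val kept = seen.filter(e => !newWatermark.exists(e.eventTime < _))

    (State(kept, newMax), emittedReversed.reverse)
  }
}
```

With a delay of 10, a batch containing the same event twice emits it once; after a later batch pushes the maximum event time to 120, keys older than 110 are evicted, and a very late replay of an old event is dropped as late data instead of being matched against stored state. Without a watermark, the seen set would simply grow forever.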

Window Operations on Event Time


import spark.implicits._
import org.apache.spark.sql.functions.window

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),  // 10-minute windows on timestamp, sliding every 5 minutes
  $"word"
).count()

Handling Late Data and Watermarking

Now consider what happens if one of the events arrives late to the application. For example, say, a word generated at 12:04 (i.e. event time) could be received by the application at 12:11. The application should use the time 12:04 instead of 12:11 to update the older counts for the window 12:00 - 12:10. This occurs naturally in our window-based grouping – Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time such that late data can update aggregates of old windows correctly, as illustrated below.



Because the aggregation is based on event time, the late record is still counted in the window it belongs to.

However, to run this query for days, it’s necessary for the system to bound the amount of intermediate in-memory state it accumulates.


This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more.


To enable this, in Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly.


You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.


For a specific window ending at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees). Let’s understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.

import spark.implicits._
import org.apache.spark.sql.functions.window

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")   // 10-minute watermark: data more than 10 minutes behind the latest event time seen may be dropped
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()
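
The retention rule stated above (keep a window ending at T until max event time minus the late threshold exceeds T) can be written down directly. This is a plain-Scala sketch of that condition as worded in these notes, with times in minutes since midnight; WatermarkSketch and its function names are hypothetical, and the engine's exact boundary handling may differ.

```scala
object WatermarkSketch {
  /** Watermark = max event time seen so far minus the allowed lateness. */
  def watermark(maxEventTime: Long, lateThreshold: Long): Long =
    maxEventTime - lateThreshold

  /** A window ending at `windowEnd` may be dropped once the watermark has passed it. */
  def canDropWindow(windowEnd: Long, maxEventTime: Long, lateThreshold: Long): Boolean =
    watermark(maxEventTime, lateThreshold) > windowEnd
}
```

For the window 12:00 - 12:10 (end = 730) and a 10-minute threshold, the state is still kept when the latest event time seen is 12:14 (watermark 12:04), so a late record from 12:04 can still update the count. Once an event at 12:21 has been seen (watermark 12:11), the window becomes eligible for cleanup.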


Conditions for watermarking to clean aggregation state
It is important to note that the following conditions must be satisfied for the watermarking to clean the state in aggregation queries (as of Spark 2.1.1, subject to change in the future).

  • Output mode must be Append or Update. Complete mode requires all aggregate data to be preserved, and hence cannot use watermarking to drop intermediate state. See the Output Modes section for detailed explanation of the semantics of each output mode.


  • The aggregation must have either the event-time column, or a window on the event-time column.


  • withWatermark must be called on the same column as the timestamp column used in the aggregate. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append output mode, as the watermark is defined on a different column from the aggregation column.

  • withWatermark must be called before the aggregation for the watermark details to be used. For example, df.groupBy("time").count().withWatermark("time", "1 min") is invalid in Append output mode.

Join Operations

Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame as well as another streaming Dataset/DataFrame.

Structured Streaming supports several kinds of joins (stream-static joins, stream-stream joins, and so on).


The result of the streaming join is generated incrementally, similar to the results of streaming aggregations in the previous section. In this section we will explore what type of joins (i.e. inner, outer, etc.) are supported in the above cases. Note that in all the supported join types, the result of the join with a streaming Dataset/DataFrame will be exactly the same as if it was with a static Dataset/DataFrame containing the same data in the stream.

Stream-stream Joins

In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames. The challenge of generating join results between two data streams is that, at any point of time, the view of the dataset is incomplete for both sides of the join making it much harder to find matches between inputs. Any row received from one input stream can match with any future, yet-to-be-received row from the other input stream. Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results. Furthermore, similar to streaming aggregations, we automatically handle late, out-of-order data and can limit the state using watermarks. Let’s discuss the different types of supported stream-stream joins and how to use them.


Stream-stream Joins Code Example


import org.apache.spark.sql.functions.expr

val impressions = spark.readStream. ...
val clicks = spark.readStream. ...

// Apply watermarks on event-time columns
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)

Output Sinks


  • File sink - Stores the output to a directory.
  • Kafka sink - Stores the output to one or more topics in Kafka.
  • Foreach sink - Runs arbitrary computation on the records in the output. See later in the section for more details.
  • Console sink (for debugging) - Prints the output to the console/stdout every time there is a trigger. Both Append and Complete output modes are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory after every trigger.
  • Memory sink (for debugging) - The output is stored in memory as an in-memory table. Both Append and Complete output modes are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory. Hence, use it with caution.

Using Foreach and ForeachBatch

The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch. Let’s understand their usages in more detail.


foreachBatch(…) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.


streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Transform and write batchDF
}.start()



  • By default, foreachBatch provides only at-least-once write guarantees. However, you can use the batchId provided to the function as a way to deduplicate the output and get an exactly-once guarantee.
  • foreachBatch does not work with the continuous processing mode as it fundamentally relies on the micro-batch execution of a streaming query. If you write data in the continuous mode, use foreach instead.
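
One way to picture the batchId-based deduplication mentioned above is a sink that remembers which batch IDs it has already committed and ignores replays. The plain-Scala sketch below uses a hypothetical in-memory Sink class as a stand-in for an external store; it is not a Spark API, and inside foreachBatch you would call something like writeBatch with the collected or transformed batch.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  /** Hypothetical in-memory stand-in for an external store that tracks committed batch IDs. */
  final class Sink {
    private val committed = mutable.Set.empty[Long]
    private val rows = mutable.ArrayBuffer.empty[String]

    /** Writes the batch unless this batchId was already committed; returns true if written. */
    def writeBatch(batchId: Long, batch: Seq[String]): Boolean =
      if (committed.contains(batchId)) {
        false // replayed batch after a failure: skip to avoid duplicate output
      } else {
        rows ++= batch
        committed += batchId
        true
      }

    /** Everything written so far, in write order. */
    def contents: List[String] = rows.toList
  }
}
```

If batch 1 is delivered twice because of a restart, the second call returns false and the stored rows stay unchanged. A real store would need to record the rows and the batch ID atomically (for example in one transaction) for this to hold under failures.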



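With foreach, you express the write logic by implementing ForeachWriter with three methods: open, process, and close.

streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
      true   // return true to process the rows of this partition
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()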




