Big Data Best Practices: Spark Structured Streaming

Overview

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express a streaming computation the same way you would express a batch computation on static data. As streaming data continues to arrive, the Spark SQL engine runs it incrementally and continuously and updates the final result. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins and so on. The computation is executed on the same optimized Spark SQL engine. Finally, the system guarantees end-to-end exactly-once fault tolerance through checkpointing and write-ahead logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streams.

Internally, by default, Structured Streaming queries are processed with a micro-batch engine, which processes the data stream as a series of small batch jobs, achieving end-to-end latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees. However, since Spark 2.3 there is a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. You can choose the mode based on your application requirements without changing the Dataset/DataFrame operations in your query.

This guide walks through the programming model and the APIs. We explain the concepts mostly using the default micro-batch model and then discuss the continuous processing model. Let's start with a simple example of a Structured Streaming query: a streaming word count.

Scalable, fault-tolerant, end-to-end
Latencies as low as 100 milliseconds
Since Spark 2.3, a new low-latency processing mode called Continuous Processing can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees

outputMode("complete") specifies that the complete set of counts is printed to the console every time they are updated. The streaming computation is then started with start().

The query object is a handle to the active streaming query; we use awaitTermination() to keep the process from exiting while the query is active.
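A minimal sketch of the word-count query these notes describe, assuming a SparkSession named spark and a socket source on localhost:9999 (the standard quick-start setup; the full example also appears later in this document):

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode("complete")   // print the complete set of counts on every update
  .format("console")
  .start()

query.awaitTermination()    // block so the process does not exit while the query is active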
Treat the live data stream as a table that is being continuously appended to.

This model is very similar to the batch processing model.
At every trigger interval (say, every 1 second), new rows are appended to the input table and eventually update the result table. Whenever the result table is updated, we want to write the changed result rows to an external sink.

Micro-batch processing: latencies as low as 100 ms, with exactly-once semantics
Continuous processing: latencies as low as 1 ms, with at-least-once semantics

The engine cleverly reuses the result of the previous run to simplify the incremental computation.
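A small sketch of how the two modes are selected, assuming a streaming DataFrame df that already has key/value columns and a Kafka sink; only the trigger changes, not the Dataset/DataFrame operations:

import org.apache.spark.sql.streaming.Trigger

// Default micro-batch mode: a batch is started every 10 seconds
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

// Continuous mode (Spark 2.3+): long-running tasks, ~1 ms latency, at-least-once
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start()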

API

Some unsupported operations:
Chained streaming aggregations
limit / top N rows
Distinct
Sorting (except in complete mode)
Other notes:

Aggregate functions
current_date() / current_time() (their results cannot be guaranteed because of processing latency)
Input sources other than Kafka (for testing, the rate source can be used to generate data; see the sketch after this list)
Output sinks other than Kafka (for testing, output can go to memory or the console)
Failure retry (any failed batch causes the query to exit immediately; it must be restarted manually from the checkpoint)
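A throwaway sketch for local testing along the lines of the notes above, assuming a SparkSession named spark; it uses the built-in rate source to generate rows and the console sink for output (both are intended for testing only):

val testStream = spark.readStream
  .format("rate")                 // generates (timestamp, value) rows
  .option("rowsPerSecond", 10)
  .load()

testStream.writeStream
  .format("console")
  .outputMode("append")
  .option("truncate", false)
  .start()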

Query: a map over the entire table;
Trigger: processing time (e.g. once per minute, or as fast as possible);
Output type: Append (each run writes a new HDFS file);

Query为"select count(*) from visits group by page";
Trigger为processing time(比如说1minute一次,或者越快越好);
Output类型为在mysql表中的实时更新(及时更新变化的记录);
针对相同的查询,我们可以查询已经写入HDFS的文件,而不是再次执行然后输出结果;

A sliding window is defined by the following parameters:
the window size;
the sliding interval;
the window start time;
A tumbling window is a special sliding window whose sliding interval equals the window size (see the sketch after this list). For such a query the treatment is:
Query: a count with a special (multi-key) group-by that maps every event to the window(s) it belongs to (similar to flatMap in Spark);
Trigger: processing time;
Output model: in-place replacement; if an already-completed window receives a late event, we can still update the old record;
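A small sketch of the tumbling-window case mentioned above, assuming a streaming DataFrame named events with eventTime and page columns (both placeholder names); passing only a window size (no slide) to window() gives a tumbling window, equivalent to a slide equal to the window size:

import spark.implicits._
import org.apache.spark.sql.functions.window

// 10-minute tumbling windows: slide == window size
val tumblingCounts = events
  .groupBy(window($"eventTime", "10 minutes"), $"page")
  .count()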

Query on top of a window (find the most popular window in the past hour)
Query: add a top-k operation on top of the previous window-count query;
Trigger: processing time;
Output model: an updatable file in HDFS, or a Kafka stream that the top-k results are written to (Complete mode)

Session statistics (count the number of sessions and the average session length)
Query: assign each record a session ID (carrying the session start/end and the session grouping key) and aggregate; count(*)/max(time)/min(time) can then be computed per session ID (see the sketch below);
Trigger: processing time;
Output model: an updatable file in HDFS
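A minimal sketch of the per-session statistics described above, assuming the records already carry a sessionId column and an event-time column named time (both hypothetical names):

import spark.implicits._
import org.apache.spark.sql.functions.{count, max, min}

val sessionStats = events
  .groupBy($"sessionId")
  .agg(
    count($"sessionId").as("events"),   // number of events in the session
    min($"time").as("sessionStart"),
    max($"time").as("sessionEnd"))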

Benefits of the Repeated Query (RQ) model:

There is no stream concept: everything is a table plus a SQL query;
Unlike Google Dataflow, triggers and outputs are independent of the query itself; in Dataflow a window (which, from a SQL point of view, is just a group-by) must be bound to an output mode and a trigger, whereas in RQ these queries can be used without the notion of a window at all;
Compatibility with batch processing is very good;
Many desirable features (sessions, feedback loops, etc.) are easy to implement;
The main drawback of RQ is that incrementalization of the query is done by the planner, and the planner must support the combinations of queries/output modes/triggers, for example deciding when old data or state can be dropped; the user has no control over this.

RQ supports both processing time and event time;
The RQ API is decoupled from the underlying execution model, so execution models not based on micro-batches are also possible;
RQ offers the query optimizations of a relational database while being able to handle more complex queries;

CQL (Streams + Tables)
CQL, Calcite and some other streaming databases also have the notions of streams and tables, but they all assume a single, monotonic, immutable measure of time; in those systems, once a result has been produced it can never change.
Each incoming line of text is appended as a new row to an unbounded table; the word-count query is then executed on this table, and the resulting counts are written to the result table and output.

Treat the data stream as an unbounded table
The SQL engine computes over it incrementally and continuously, then updates the final result
Through checkpointing and write-ahead logs, the system guarantees end-to-end exactly-once fault tolerance. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing

Internally it uses a micro-batch processing engine by default, which treats the data stream as a series of small batch jobs, achieving end-to-end latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees

Continuous Processing can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. You can choose the mode based on your application requirements without changing the DataSet/DataFrame operations in the query.

watermarking

The difference between a record's event time and the maximum event time seen so far on the stream acts as a Time-To-Live (TTL); if this difference exceeds the configured threshold,
the record no longer participates in the computation

Compare the window's end time with the latest/newest event time observed in the window

The state of the target system at the last moment
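A one-line sketch of how the threshold from these notes is declared in code (the column name eventTime and the 10-minute threshold are placeholders; a full windowed example appears later in this document):

// watermark = max(eventTime seen so far) - 10 minutes; rows older than it are dropped from state
val withLateBound = events.withWatermark("eventTime", "10 minutes")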

Other

package cassandra

import org.apache.spark.sql._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import kafka.KafkaService
import radio.{SimpleSongAggregation, SimpleSongAggregationKafka}
import spark.SparkHelper
import foreachSink._
import log.LazyLogger

object CassandraDriver extends LazyLogger {
  private val spark = SparkHelper.getSparkSession()
  import spark.implicits._

  val connector = CassandraConnector(SparkHelper.getSparkSession().sparkContext.getConf)

  val namespace = "structuredstreaming"
  val foreachTableSink = "radio"
  val StreamProviderTableSink = "radioothersink"
  val kafkaMetadata = "kafkametadata"
  def getTestInfo() = {
    val rdd = spark.sparkContext.cassandraTable(namespace, kafkaMetadata)

    if( !rdd.isEmpty ) {
      log.warn(rdd.count)
      log.warn(rdd.first)
    } else {
      log.warn(s"$namespace, $kafkaMetadata is empty in cassandra")
    }
  }


  /**
    * remove kafka metadata and only focus on business structure
    */
  def getDatasetForCassandra(df: DataFrame) = {
    df.select(KafkaService.radioStructureName + ".*")
      .as[SimpleSongAggregation]
  }

  //https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach
  //https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
  def saveForeach(df: DataFrame ) = {
    val ds = CassandraDriver.getDatasetForCassandra(df)

    ds
      .writeStream
      .queryName("KafkaToCassandraForeach")
      .outputMode("update")
      .foreach(new CassandraSinkForeach())
      .start()
  }

  def saveStreamSinkProvider(ds: Dataset[SimpleSongAggregationKafka]) = {
    ds
      .toDF() //@TODO see if we can use directly the Dataset object
      .writeStream
      .format("cassandra.StreamSinkProvider.CassandraSinkProvider")
      .outputMode("update")
      .queryName("KafkaToCassandraStreamSinkProvider")
      .start()
  }

  /**
    * @TODO handle more topic name, for our example we only use the topic "test"
    *
    *  we can use collect here as kafkameta data is not big at all
    *
    * if no metadata are found, we would use the earliest offsets.
    *
    * @see https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-batch
    *  assign	json string {"topicA":[0,1],"topicB":[2,4]}
    *  Specific TopicPartitions to consume. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for Kafka source.
    */
  def getKafaMetadata() = {
    try {
      val kafkaMetadataRDD = spark.sparkContext.cassandraTable(namespace, kafkaMetadata)

      val output = if (kafkaMetadataRDD.isEmpty) {
        ("startingOffsets", "earliest")
      } else {
        ("startingOffsets", transformKafkaMetadataArrayToJson(kafkaMetadataRDD.collect()))
      }
      log.warn("getKafkaMetadata " + output.toString)

      output
    }
    catch {
      case e: Exception =>
        ("startingOffsets", "earliest")
    }
  }

  /**
    * @param array
    * @return {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}
    */
  def transformKafkaMetadataArrayToJson(array: Array[CassandraRow]) : String = {
      s"""{"${KafkaService.topicName}":
          {
           "${array(0).getLong("partition")}": ${array(0).getLong("offset")}
          }
         }
      """.replaceAll("\n", "").replaceAll(" ", "")
  }

  def debug() = {
   val output = spark.sparkContext.cassandraTable(namespace, foreachTableSink)

    log.warn(output.count)
  }
}
package cassandra

import kafka.KafkaMetadata

object CassandraKafkaMetadata {
  private def cql(metadata: KafkaMetadata): String = s"""
       INSERT INTO ${CassandraDriver.namespace}.${CassandraDriver.kafkaMetadata} (partition, offset)
       VALUES(${metadata.partition}, ${metadata.offset})
    """

  //https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md#connection-pooling
  def save(metadata: KafkaMetadata) = {
    CassandraDriver.connector.withSessionDo(session =>
      session.execute(cql(metadata))
    )
  }
}
package cassandra.StreamSinkProvider

import cassandra.{CassandraDriver, CassandraKafkaMetadata}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.functions.max
import spark.SparkHelper
import cassandra.CassandraDriver
import com.datastax.spark.connector._
import kafka.KafkaMetadata
import log.LazyLogger
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.types.LongType
import radio.SimpleSongAggregation

/**
* must be idempotent and synchronous (@TODO check asynchronous/synchronous from Datastax's Spark connector) sink
*/
class CassandraSink() extends Sink with LazyLogger {
  private val spark = SparkHelper.getSparkSession()
  import spark.implicits._
  import org.apache.spark.sql.functions._

  private def saveToCassandra(df: DataFrame) = {
    val ds = CassandraDriver.getDatasetForCassandra(df)
    ds.show() //Debug only

    ds.rdd.saveToCassandra(CassandraDriver.namespace,
      CassandraDriver.StreamProviderTableSink,
      SomeColumns("title", "artist", "radio", "count")
    )

    saveKafkaMetaData(df)
  }

  /*
   * As per SPARK-16020 arbitrary transformations are not supported, but
   * converting to an RDD allows us to do magic.
   */
  override def addBatch(batchId: Long, df: DataFrame) = {
    log.warn(s"CassandraSink - Datastax's saveToCassandra method -  batchId : ${batchId}")
    saveToCassandra(df)
  }

  /**
    * saving the highest value of offset per partition when checkpointing is not available (application upgrade for example)
    * http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlTransactionsDiffer.html
    * should be done in the same transaction as the data linked to the offsets
    */
  private def saveKafkaMetaData(df: DataFrame) = {
    val kafkaMetadata = df
      .groupBy($"partition")
      .agg(max($"offset").cast(LongType).as("offset"))
      .as[KafkaMetadata]

    log.warn("Saving Kafka Metadata (partition and offset per topic (only one in our example)")
    kafkaMetadata.show()

    kafkaMetadata.rdd.saveToCassandra(CassandraDriver.namespace,
      CassandraDriver.kafkaMetadata,
      SomeColumns("partition", "offset")
    )

    //Otherway to save offset inside Cassandra
    //kafkaMetadata.collect().foreach(CassandraKafkaMetadata.save)
  }
}

package cassandra.foreachSink

import cassandra.CassandraDriver
import log.LazyLogger
import org.apache.spark.sql.ForeachWriter
import radio.SimpleSongAggregation

/**
  * Inspired by
  * https://github.com/ansrivas/spark-structured-streaming/
  * https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach
  */
class CassandraSinkForeach() extends ForeachWriter[SimpleSongAggregation] with LazyLogger {
  private def cqlRadio(record: SimpleSongAggregation): String = s"""
       insert into ${CassandraDriver.namespace}.${CassandraDriver.foreachTableSink} (title, artist, radio, count)
       values('${record.title}', '${record.artist}', '${record.radio}', ${record.count})"""

  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
    //@TODO command to check if cassandra cluster is up
    true
  }

  //https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md#connection-pooling
  def process(record: SimpleSongAggregation) = {
    log.warn(s"Saving record: $record")
    CassandraDriver.connector.withSessionDo(session =>
      session.execute(cqlRadio(record))
    )
  }

  //https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters

  def close(errorOrNull: Throwable): Unit = {
    // close the connection
    //connection.keep_alive_ms	--> 5000ms :	Period of time to keep unused connections open
  }
}

package cassandra.StreamSinkProvider

import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, SQLContext}

/**
  From Holden Karau's High Performance Spark
  https://github.com/holdenk/spark-structured-streaming-ml/blob/master/src/main/scala/com/high-performance-spark-examples/structuredstreaming/CustomSink.scala#L66
  *
  */
class CassandraSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): CassandraSink = {
    new CassandraSink()
  }
}

package elastic

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}
import radio.{SimpleSongAggregation, Song}
import org.elasticsearch.spark.sql.streaming._
import org.elasticsearch.spark.sql._
import org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink

object ElasticSink {
  def writeStream(ds: Dataset[Song] ) : StreamingQuery = {
    ds   //Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
      .writeStream
      .outputMode(OutputMode.Append) //Only mode for ES
      .format("org.elasticsearch.spark.sql") //es
      .queryName("ElasticSink")
      .start("test/broadcast") //ES index
  }

}

package kafka

case class KafkaMetadata(partition: Long, offset: Long)
package kafka

import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types._
import spark.SparkHelper

object KafkaService {
  private val spark = SparkHelper.getSparkSession()

  val radioStructureName = "radioCount"

  val topicName = "test"

  val bootstrapServers = "localhost:9092"

  val schemaOutput = new StructType()
    .add("title", StringType)
    .add("artist", StringType)
    .add("radio", StringType)
    .add("count", LongType)
}
package kafka

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{struct, to_json, _}
import _root_.log.LazyLogger
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.types.{StringType, _}
import radio.{SimpleSongAggregation, SimpleSongAggregationKafka}
import spark.SparkHelper

object KafkaSink extends LazyLogger {
  private val spark = SparkHelper.getSparkSession()

  import spark.implicits._

  def writeStream(staticInputDS: Dataset[SimpleSongAggregation]) : StreamingQuery = {
    log.warn("Writing to Kafka")
    staticInputDS
      .select(to_json(struct($"*")).cast(StringType).alias("value"))
      .writeStream
      .outputMode("update")
      .format("kafka")
      .option("kafka.bootstrap.servers", KafkaService.bootstrapServers)
      .queryName("Kafka - Count number of broadcasts for a title/artist by radio")
      .option("topic", "test")
      .start()
  }

  /**
      Console sink from Kafka's stream
      +----+--------------------+-----+---------+------+--------------------+-------------+--------------------+
      | key|               value|topic|partition|offset|           timestamp|timestampType|          radioCount|
      +----+--------------------+-----+---------+------+--------------------+-------------+--------------------+
      |null|[7B 22 72 61 64 6...| test|        0|    60|2017-11-21 22:56:...|            0|[Feel No Ways,Dra...|
    *
    */
  def debugStream(staticKafkaInputDS: Dataset[SimpleSongAggregationKafka]) = {
    staticKafkaInputDS
      .writeStream
      .queryName("Debug Stream Kafka")
      .format("console")
      .start()
  }
}
package kafka

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{struct, to_json, _}
import _root_.log.LazyLogger
import org.apache.spark.sql.types.{StringType, _}
import radio.{SimpleSongAggregation, SimpleSongAggregationKafka}
import spark.SparkHelper

/**
 @see https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
 */
object KafkaSource extends LazyLogger {
  private val spark = SparkHelper.getSparkSession()

  import spark.implicits._

  /**
    * will return, we keep some kafka metadata for our example, otherwise we would only focus on "radioCount" structure
     |-- key: binary (nullable = true)
     |-- value: binary (nullable = true)
     |-- topic: string (nullable = true) : KEPT
     |-- partition: integer (nullable = true) : KEPT
     |-- offset: long (nullable = true) : KEPT
     |-- timestamp: timestamp (nullable = true) : KEPT
     |-- timestampType: integer (nullable = true)
     |-- radioCount: struct (nullable = true)
     |    |-- title: string (nullable = true)
     |    |-- artist: string (nullable = true)
     |    |-- radio: string (nullable = true)
     |    |-- count: long (nullable = true)

    * @return
    *
    *
    * startingOffsets should use a JSON coming from the lastest offsets saved in our DB (Cassandra here)
    */
    def read(startingOption: String = "startingOffsets", partitionsAndOffsets: String = "earliest") : Dataset[SimpleSongAggregationKafka] = {
      log.warn("Reading from Kafka")

      spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", KafkaService.topicName)
      .option("enable.auto.commit", false) // Cannot be set to true in Spark Strucutured Streaming https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
      .option("group.id", "Structured-Streaming-Examples")
      .option("failOnDataLoss", false) // when starting a fresh kafka (default location is temporary (/tmp) and cassandra is not (var/lib)), we have saved different offsets in Cassandra than real offsets in kafka (that contains nothing)
      .option(startingOption, partitionsAndOffsets) //this only applies when a new query is started and that resuming will always pick up from where the query left off
      .load()
      .withColumn(KafkaService.radioStructureName, // nested structure with our json
        from_json($"value".cast(StringType), KafkaService.schemaOutput) //From binary to JSON object
      ).as[SimpleSongAggregationKafka]
      .filter(_.radioCount != null) //TODO find a better way to filter bad json
  }
}
package log

import org.apache.log4j.LogManager

trait LazyLogger {
  @transient lazy val log = LogManager.getLogger(getClass)
}
package main

import cassandra.CassandraDriver
import elastic.ElasticSink
import kafka.{KafkaSink, KafkaSource}
import mapGroupsWithState.MapGroupsWithState
import parquetHelper.ParquetService
import spark.SparkHelper

object Main {

  def main(args: Array[String]) {
    val spark = SparkHelper.getAndConfigureSparkSession()

    //Classic Batch
    //ParquetService.batchWay()

    //Streaming way
    //Generate a "fake" stream from a parquet file
    val streamDS = ParquetService.streamingWay()

    val songEvent = ParquetService.streamEachEvent

    ElasticSink.writeStream(songEvent)

    //Send it to Kafka for our example
    KafkaSink.writeStream(streamDS)

    //Finally read it from kafka, in case checkpointing is not available we read last offsets saved from Cassandra
    val (startingOption, partitionsAndOffsets) = CassandraDriver.getKafaMetadata()
    val kafkaInputDS = KafkaSource.read(startingOption, partitionsAndOffsets)

    //Just debugging Kafka source into our console
    KafkaSink.debugStream(kafkaInputDS)

    //Saving using Datastax connector's saveToCassandra method
    CassandraDriver.saveStreamSinkProvider(kafkaInputDS)

    //Saving using the foreach method
    //CassandraDriver.saveForeach(kafkaInputDS) //Untype/unsafe method using CQL  --> just here for example

    //Another fun example managing an arbitrary state
    MapGroupsWithState.write(kafkaInputDS)

    //Wait for all streams to finish
    spark.streams.awaitAnyTermination()
  }
}
package mapGroupsWithState

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{struct, to_json, _}
import _root_.log.LazyLogger
import org.apache.spark.sql.types.StringType
import spark.SparkHelper
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import radio.{ArtistAggregationState, SimpleSongAggregation, SimpleSongAggregationKafka}

object MapGroupsWithState extends LazyLogger {
  private val spark = SparkHelper.getSparkSession()

  import spark.implicits._


  def updateArtistStateWithEvent(state: ArtistAggregationState, artistCount : SimpleSongAggregation) = {
    log.warn("MapGroupsWithState - updateArtistStateWithEvent")
    if(state.artist == artistCount.artist) {
      ArtistAggregationState(state.artist, state.count + artistCount.count)
    } else {
      state
    }
  }

  def updateAcrossEvents(artist:String,
                         inputs: Iterator[SimpleSongAggregation],
                         oldState: GroupState[ArtistAggregationState]): ArtistAggregationState = {

    var state: ArtistAggregationState = if (oldState.exists)
      oldState.get
    else
      ArtistAggregationState(artist, 1L)

    // for every rows, let's count by artist the number of broadcast, instead of counting by artist, title and radio
    for (input <- inputs) {
      state = updateArtistStateWithEvent(state, input)
      oldState.update(state)
    }

    state
  }


  /**
    *
    * @return
    *
    * Batch: 4
      -------------------------------------------
      +------+-----+
      |artist|count|
      +------+-----+
      | Drake| 4635|
      +------+-----+

      Batch: 5
      -------------------------------------------
      +------+-----+
      |artist|count|
      +------+-----+
      | Drake| 4710|
      +------+-----+
    */
  def write(ds: Dataset[SimpleSongAggregationKafka] ) = {
    ds.select($"radioCount.title", $"radioCount.artist", $"radioCount.radio", $"radioCount.count")
      .as[SimpleSongAggregation]
      .groupByKey(_.artist)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAcrossEvents) //we can control what should be done with the state when no update is received after a timeout.
      .writeStream
      .outputMode(OutputMode.Update())
      .format("console")
      .queryName("mapGroupsWithState - counting artist broadcast")
      .start()
  }
}
package parquetHelper

import log.LazyLogger
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types._
import radio.{SimpleSongAggregation, Song}
import spark.SparkHelper

object ParquetService extends LazyLogger {
  val pathRadioStationSongs = "data/allRadioPartitionByRadioAndDate.parquet"
  val pathRadioES = "data/broadcast.parquet"

  private val spark = SparkHelper.getSparkSession()
  import spark.implicits._

  val schema = new StructType()
    .add("timestamp", TimestampType)
    .add("title", StringType)
    .add("artist", StringType)
    .add("radio", StringType)
    .add("humanDate", LongType)
    .add("hour", IntegerType)
    .add("minute", IntegerType)
    .add("allArtists", StringType)
    .add("year", IntegerType)
    .add("month", IntegerType)
    .add("day", IntegerType)

  def batchWay() = {
    //Classic  Batch way
    val batchWay =
      spark
        .read
        .schema(ParquetService.schema)
        .parquet(pathRadioStationSongs)
        .where($"artist" === "Drake")
        .groupBy($"radio", $"artist",  $"title")
        .count()
        .orderBy("count")
        .as[Song]

    batchWay.show()

    batchWay
  }

  def streamingWay() : Dataset[SimpleSongAggregation] = {
    log.warn("Starting to stream events from Parquet files....")

    spark
      .readStream
      .schema(ParquetService.schema)
      .option("maxFilesPerTrigger", 1000)  // Treat a sequence of files as a stream by picking one file at a time
      .parquet(pathRadioStationSongs)
      .as[Song]
      .where($"artist" === "Drake")
      .groupBy($"radio", $"artist",  $"title")
      .count()
      .as[SimpleSongAggregation]
  }

  def streamEachEvent : Dataset[Song]  = {
    spark
      .readStream
      .schema(ParquetService.schema)
      .option("maxFilesPerTrigger", 1000)  // Treat a sequence of files as a stream by picking one file at a time
      .parquet(pathRadioES)
      .as[Song]
      .where($"artist" === "Drake")
      .withWatermark("timestamp", "10 minutes")
      .as[Song]
  }

  //Process stream on console to debug only
  def debugStream(staticInputDF: DataFrame) = {
    staticInputDF.writeStream
      .format("console")
      .outputMode("complete")
      .queryName("Console - Count number of broadcasts for a title/artist by radio")
      .start()
  }
}
package radio

import java.sql.Timestamp

case class Song(timestamp: Long, humanDate:Long, year:Int, month:Int, day:Int, hour:Int, minute: Int, artist:String, allArtists: String, title:String, radio:String)

case class SimpleSong(title: String, artist: String, radio: String)

case class SimpleSongAggregation(title: String, artist: String, radio: String, count: Long)

case class SimpleSongAggregationKafka(topic: String, partition: Int, offset: Long, timestamp: Timestamp, radioCount: SimpleSongAggregation)

case class ArtistAggregationState(artist: String, count: Long)

package spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

object SparkHelper {
  def getAndConfigureSparkSession() = {
    val conf = new SparkConf()
      .setAppName("Structured Streaming from Parquet to Cassandra")
      .setMaster("local[2]")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.sql.streaming.checkpointLocation", "checkpoint")
      .set("es.nodes", "localhost") // full config : https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
      .set("es.index.auto.create", "true") //https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
      .set("es.nodes.wan.only", "true")

    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    SparkSession
      .builder()
      .getOrCreate()
  }

  def getSparkSession() = {
    SparkSession
      .builder()
      .getOrCreate()
  }
}

First, notice the phrase "Declarative API" in the paper's title. The term may not mean much on its own, but it becomes clear once you put it next to its opposite, an Imperative API. Declarative means you only express what you want; Imperative means you spell out, step by step, everything that must be done to get it. A simple analogy: ordering a custom cake at a bakery and letting the staff make it is Declarative; buying flour and baking it yourself is Imperative.

  1. Shortcomings of Spark Streaming
    Before formally introducing Structured Streaming, one question needs to be cleared up: what are the shortcomings of Spark Streaming? They mainly come down to the following points:

It uses Processing Time instead of Event Time. Processing Time is the time at which data arrives at Spark and is processed, while Event Time is an attribute carried by the data itself, usually the time at which the data was produced at the source. For example, in IoT a sensor produces a record at 12:00:00 and the record reaches Spark at 12:00:05; the Event Time is 12:00:00 and the Processing Time is 12:00:05. Spark Streaming is a micro-batch system built on the DStream model: it takes the stream data of a tiny time slice, say 1 second, and treats it as the current batch. If we want to compute statistics over some time range, we clearly should use Event Time, but because Spark Streaming slices data by Processing Time, working with Event Time becomes particularly difficult.

Complex, low-level API. The API that DStream (Spark Streaming's data model) exposes is similar to the RDD API and is very low level. Writing a Spark Streaming program is essentially constructing the DAG of RDDs that the Spark engine then executes. As a result, execution efficiency can vary wildly with the skill of the developer, which makes for a poor developer experience and is something no framework wants (the usual promise of a framework is: focus on your business logic and leave the rest to us). This is one reason many systems emphasize being Declarative.

Hard to reason about end-to-end applications. End-to-end here means from input all the way to output, for example Kafka into Spark Streaming and then out to HDFS. DStream can only guarantee exactly-once semantics for its own processing; the semantics of the input into Spark Streaming and of the output to external storage are usually left for the user to guarantee. Providing those guarantees is quite challenging: for example, to make the output exactly-once, the output store must be idempotent or support transactional writes, neither of which is easy for developers.

Batch and streaming code are not unified. Batch and streaming may be two separate systems, but unifying them is genuinely useful: we sometimes need to run our streaming logic over batch data. As early as 2014, when Google announced the Dataflow service, it criticized the streaming/batch terminology and proposed unbounded/bounded data instead. Although DStream wraps RDDs, fully converting DStream code to RDD code still takes real work, and Spark's batch processing has since moved to the DataSet/DataFrame API anyway.

Overview

Structured Streaming was introduced in Spark 2.0 in 2016. Its design borrows ideas from many other systems, for example distinguishing processing time from event time and using a relational execution engine for performance, while also integrating better with the rest of Spark. The main ways Structured Streaming differs from other systems are:

Incremental query model: Structured Streaming continuously runs incremental queries over newly arrived streaming data, with code written exactly like the batch API (based on the DataFrame and Dataset APIs), and these APIs are very simple.

Support for end-to-end applications: Structured Streaming together with its built-in connectors makes end-to-end programs easy to write and "correct by default". Sources and sinks satisfy exactly-once semantics, which makes integrating with external systems much easier.

Reuse of the Spark SQL execution engine: the Spark SQL engine does a great deal of optimization work, such as plan optimization, codegen and memory management. This is one reason Structured Streaming achieves high performance and high throughput.

Core design

Input and Output: Structured Streaming ships with many connectors that guarantee exactly-once semantics for input sources and output sinks. The preconditions for exactly-once are:

The input source must be replayable, like Kafka, so that input data can be re-read after a node crash. Common sources include Amazon Kinesis, Apache Kafka and file systems.

The output sink must support idempotent writes. This is easy to see: if the output is not idempotent, the consistency guarantee degrades to at-least-once. For some sinks, Structured Streaming additionally provides atomic writes to guarantee exactly-once.

API: Structured Streaming code fully reuses Spark SQL's batch API, i.e. you query one or more streams or tables. The result of a query is a result table, which can be written to external storage in several output modes (append, update, complete). In addition, Structured Streaming provides a few streaming-specific APIs: triggers, watermarks, and stateful operators.

Execution: the Spark SQL execution engine is reused. By default Structured Streaming uses a micro-batch mode similar to Spark Streaming, which brings benefits such as dynamic load balancing, rescaling, failure recovery and straggler mitigation (a straggler is a task that runs noticeably slower than the others). Besides micro-batch, Structured Streaming also provides a continuous processing mode based on traditional long-running operators.

Operational features: with the WAL and the state store, developers get centralized rollback and failure recovery. There are other operational features that we will not detail here.

Programming model

Likely influenced by Google Dataflow's unified batch/stream thinking, Structured Streaming treats streaming data as a continuously growing table and uses the same DataSet/DataFrame-based API as batch processing. As the figure below shows, by viewing the stream as an ever-growing table, you can operate on streaming data just as you would on static batch data.

The model has the following components:
Input unbounded table: the abstract representation of the streaming data
Query: an incremental query over the input table
Result table: the table produced by the query
Output: what gets written out from the result table

// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
    .outputMode("complete")
    .format("console")
    .start()

The actual execution flow can be illustrated as follows. The streaming data is treated as a continuously growing table, the "unbounded table of all input" in the figure. A trigger fires, say, every second; at each trigger the query is applied to the rows newly added to the input table, sometimes combined with earlier static data to form the result. The output of the query is the Result Table, which we can choose to write to external storage. There are three output modes:

Complete mode: the entire Result Table is output every time.

Append mode (default): only the rows newly added to the Result Table since the last trigger are output. Because only new rows are emitted, this mode is not suitable if old rows may be modified.

Update mode: every row that has been updated is output; effectively a strengthened version of Append mode.

Compared with batch mode, streaming mode also provides some streaming-specific operators such as window, watermark and stateful operators.

window: the figure below shows an example of counting events per window based on event time.

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()

As shown in the figure below, the window size is 10 minutes and a trigger fires every 5 minutes. At 12:11 a record with event time 12:04 arrives, i.e. late data (late data is data whose Processing Time is later than its Event Time), and the corresponding rows in the Result Table are updated.

watermark: this exists to handle late data. In many cases there is no need to keep state for late data indefinitely; for example, data that is 10 minutes late may still be useful, but data that is an hour late is not. A further benefit is that far less intermediate state needs to be maintained. The watermark is formally defined as max(eventTime) - threshold, and data older than the watermark is simply dropped.

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()

The figure makes this concrete. At the 12:15 trigger the watermark is 12:14 - 10m = 12:04, so the late data (12:08, dog) and (12:13, owl) are still accepted. At the 12:20 trigger the watermark is 12:21 - 10m = 12:11, so the late record (12:04, donkey) is dropped.

Operators that let users define their own stateful computation:
mapGroupsWithState
flatMapGroupsWithState
As the names suggest, mapGroupsWithState is one-to-one while flatMapGroupsWithState is one-to-many. They play the role that updateStateByKey plays in Spark Streaming.
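A minimal sketch of flatMapGroupsWithState, the one-to-many variant, assuming a streaming Dataset[Event] named eventsDS keyed by a user field and a simple running count kept as state (Event, eventsDS and the emitted string are placeholder names for illustration, not something defined earlier in this document):

import spark.implicits._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(user: String, action: String)

def emitRunningCount(user: String,
                     events: Iterator[Event],
                     state: GroupState[Long]): Iterator[String] = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)            // keep the running count as state
  Iterator(s"$user -> $newCount")   // may emit zero or many rows per group
}

val updates = eventsDS
  .groupByKey(_.user)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout)(emitRunningCount)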

Continuous Processing Mode

Now, finally, the "real" stream processing. I say "real" because continuous mode is the traditional streaming model: a long-running operator processes data as it arrives. Spark used to be micro-batch only and was often criticized as not being "true" streaming. In continuous mode, data is processed as soon as it is available, as the figure shows. An epoch is the smallest unit of input data sent to an operator for processing; during processing, each epoch's offsets are recorded in the WAL. Snapshotting in continuous mode uses the Chandy-Lamport consistency algorithm.

Compared with micro-batch mode, the trade-offs of this mode are clear:
the drawback is that it is harder to scale,
the advantage is lower latency.

Consistency semantics

Since Structured Streaming has two modes, we discuss them separately.

Micro-batch mode can provide end-to-end exactly-once semantics, because a lot of work is done at both ends: replayable sources plus the WAL on the input side, and idempotent writes on the output side.

Continuous mode only provides at-least-once semantics. Official discussion of continuous mode is sparse, barely more than a mention. After discussing it with @李呈祥, my take is that continuous mode skips consistency guarantees on the sink side in order to keep latency as low as possible.

  1. Benchmark
    The Structured Streaming paper reports results on the Yahoo! Streaming Benchmark: Structured Streaming's throughput is roughly 2x that of Flink and more than 90x that of Kafka Streams.

To summarize, Structured Streaming offers a high-level declarative API that makes streaming jobs considerably easier to write than with Spark Streaming, and its end-to-end exactly-once semantics make it integrate better with other systems. On the performance side, reusing the Spark SQL engine keeps it fast.

Whichever of these aspects you care about most, these features are persuasive enough for developers to move to Structured Streaming.

  1. Closing thoughts

Finally, a few side notes. When Spark launched the micro-batch based Spark Streaming years ago, it was surely the fastest approach the Spark engine of the time allowed; it was not true stream processing, but in an era where throughput mattered more, it paid off handsomely. Spark's truly continuous processing mode in Structured Streaming did not arrive until Spark 2.3, which gave Flink a couple of very good years (Flink's excellent semantic model deserves much of the credit, of course). In real-time computing, both camps are now moving in the direction of Google Dataflow. Whether Structured Streaming, powered by Spark's outstanding SQL engine, or the red-hot Flink, or some other stream processing engine will end up dominating remains something to look forward to.

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can process a streaming computation the same way you process static data. As streaming data keeps arriving, the Spark SQL engine processes it incrementally and continuously and updates the result. You can use the DataSet/DataFrame API for streaming aggregations, event-time windows, stream-to-batch joins and so on, and the computation runs on the same optimized Spark SQL engine. Through checkpointing and write-ahead logs, the system guarantees end-to-end, exactly-once, fault-tolerant processing.

The input data stream can be thought of as a table; each new record in the stream is like a new row appended to that table.

Example

1. The required dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>

2. An example that uses Kafka as the source and the console as the sink:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession
  .builder()
  .appName("Spark structured streaming Kafka example")
  .master("local")
  .getOrCreate()
import spark.implicits._

val inputstream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "blog")
  .load()
val keyValueString = inputstream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
// flatMap (not map) so that each word becomes its own row before grouping
val wordCounts = keyValueString.flatMap(_._2.split(" ")).groupBy("value").count()
val query = wordCounts.writeStream.trigger(Trigger.ProcessingTime(1000000))
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()

3. Two concepts to focus on: source and sink.

A) Source

Three kinds of sources are currently supported:

File source: reads data from a given directory; the supported formats are text, csv, json and parquet. Fault-tolerant.

Kafka source: pulls data from Kafka. Compatible only with Kafka 0.10.0 or higher. Fault-tolerant.

Socket source (for testing): reads UTF-8 encoded text from a connection. Not fault-tolerant.
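A brief sketch of the file source, assuming CSV files land in an input directory (the path and schema here are placeholders); a schema must be supplied explicitly for file streams:

import org.apache.spark.sql.types._

val csvSchema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val fileStream = spark.readStream
  .schema(csvSchema)            // file sources require an explicit schema
  .option("header", "true")
  .csv("/data/incoming")        // placeholder directory watched for new files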

B) Output modes

1) Append mode (default): only the rows newly added to the result table since the last trigger are output to the sink. It is supported only for queries where rows added to the result table never change afterwards, so every row is output exactly once. For example, queries with only select, where, map, flatMap, filter, join and so on support append mode.

2) Complete mode: the whole result table is output to the sink on every trigger. This is intended for aggregation queries.

3) Update mode: only the rows of the result table that changed since the last trigger are output to the sink. More details will come in later versions.

Different types of streaming queries support different output modes.

Query type: queries with aggregation

- Aggregation on event time with watermark: supports Append, Update, Complete. Append and Update modes use the watermark to drop old aggregation state; Complete mode does not drop old state, as required by its semantics.

- Other aggregations: support Complete and Update. Since no watermark is defined, old aggregation state is never dropped. Append mode is not supported because aggregates may keep changing, which violates that mode's semantics.

Query type: queries with mapGroupsWithState

- Supports Update.

Query type: queries with flatMapGroupsWithState

- Append operation mode: supports Append. Aggregations are allowed after flatMapGroupsWithState.

- Update operation mode: supports Update. Aggregations are not allowed after flatMapGroupsWithState.

Query type: other queries

- Support Append and Update. Complete mode is not supported because keeping all non-aggregated data in the result table is not feasible.

C) Sinks

1) File sink: stores the output to a specified directory

noAggDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .start()

2) Foreach sink: runs arbitrary computation on the output records.

writeStream
.foreach(…)
.start()

3) Console sink (for debugging): prints the output to the console/stdout on every trigger.

aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

4) Memory sink

// Have all the aggregates in an in-memory table
aggDF
  .writeStream
  .queryName("aggregates") // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()
spark.sql("select * from aggregates").show()

5) Kafka sink

Writing both streaming and batch data to Kafka is supported

val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .start()

Output modes supported by each sink:

File sink: Append. Options: path (output directory, must be specified). Fault-tolerant: yes. Supports writing to partitioned tables; partitioning by time may be useful.

Foreach sink: Append, Update, Complete. Options: none. Fault tolerance depends on the ForeachWriter implementation; see the official docs for details.

Console sink: Append, Complete, Update. Options: numRows (rows to print per trigger), truncate (whether to truncate output that is too long, default true). Fault-tolerant: no.

Memory sink: Append, Complete. Options: none. Fault-tolerant: no, but in Complete mode re-running the query recreates the whole table. The table name is the query name.

That covers all of the concepts. Batch writes to Kafka will be described in detail later.

1. Important classes

A) DataSource

Responsible for constructing pluggable data sources. Besides acting as the canonical set of parameters describing a data source, this class is also used to resolve a concrete implementation that can be used in a query plan (batch or streaming) or to write data out with an external library.

B) StreamingQueryManager

The class that manages the behavior of all StreamingQuery instances.

C) StreamExecution

Manages the execution of a streaming Spark SQL query in a single dedicated thread. Unlike a standard query, a streaming query is re-executed every time any source in its query plan has new data. Whenever new input arrives, a QueryExecution is generated and its results are committed transactionally to the given sink.

D) ProcessingTimeExecutor

A TriggerExecutor subclass that runs one batch every `intervalMs` milliseconds.

E) DataStreamWriter

The interface for writing a streaming Dataset to external storage, obtained via Dataset.writeStream.

F) DataStreamReader

Loads a streaming Dataset from external storage, obtained via SparkSession.readStream.

2. Key source code

We walk through the sample code above.

A) Building the streaming Dataset

In the load() method:

val dataSource =
  DataSource(
    sparkSession,
    userSpecifiedSchema = userSpecifiedSchema,
    className = source,
    options = extraOptions.toMap)
Dataset.ofRows(sparkSession, StreamingRelation(dataSource))

Then StreamingRelation(dataSource):

object StreamingRelation {
  def apply(dataSource: DataSource): StreamingRelation = {
    StreamingRelation(
      dataSource, dataSource.sourceInfo.name, dataSource.sourceInfo.schema.toAttributes)
  }
}

The code that builds the Dataset is fairly simple; the two things worth reading closely are the two lazy members of DataSource.

lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
lazy val sourceInfo: SourceInfo = sourceSchema()

providingClass is responsible for constructing the concrete source, e.g. KafkaSource.

sourceInfo mainly describes the table schema.

Resolving the provider class

DataSource.lookupDataSource(className)

The key code loads all DataSourceRegister implementations and matches on shortName. For Kafka the implementation is KafkaSourceProvider, whose shortName is "kafka", which matches the format we specified above, so providingClass resolves to KafkaSourceProvider.

// Load all implementations of DataSourceRegister
val serviceLoader = ServiceLoader.load(classOf[DataSourceRegister], loader)
try {
  serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList

Resolving the schema

The schema is obtained from the providingClass built in the previous step. Kafka's schema is fixed.

providingClass.newInstance() match {
  case s: StreamSourceProvider =>
    val (name, schema) = s.sourceSchema(
      sparkSession.sqlContext, userSpecifiedSchema, className, caseInsensitiveOptions)
    SourceInfo(name, schema, Nil)

B) Writing to the sink

The entry point is the start() method; here the sink type we specified is console.

val (useTempCheckpointLocation, recoverFromCheckpointLocation) =
  if (source == "console") {
    (true, false)
  } else {
    (false, true)
  }
val dataSource =
  DataSource(
    df.sparkSession,
    className = source,
    options = extraOptions.toMap,
    partitionColumns = normalizedParCols.getOrElse(Nil))
df.sparkSession.sessionState.streamingQueryManager.startQuery(
  extraOptions.get("queryName"),
  extraOptions.get("checkpointLocation"),
  df,
  dataSource.createSink(outputMode),
  outputMode,
  useTempCheckpointLocation = useTempCheckpointLocation,
  recoverFromCheckpointLocation = recoverFromCheckpointLocation,
  trigger = trigger)

One of the arguments, dataSource.createSink(outputMode), instantiates the providingClass chosen when the DataSource was built (here ConsoleSinkProvider) to obtain the concrete sink (ConsoleSink):

providingClass.newInstance() match {
  case s: StreamSinkProvider =>
    s.createSink(sparkSession.sqlContext, caseInsensitiveOptions, partitionColumns, outputMode)

In startQuery, the query is built first and then its start method is invoked.

val query = createQuery(
  userSpecifiedName,
  userSpecifiedCheckpointLocation,
  df,
  sink,
  outputMode,
  useTempCheckpointLocation,
  recoverFromCheckpointLocation,
  trigger,
  triggerClock)

createQuery builds a StreamingQueryWrapper and a StreamExecution:

new StreamingQueryWrapper(new StreamExecution(
  sparkSession,
  userSpecifiedName.orNull,
  checkpointLocation,
  analyzedPlan,
  sink,
  trigger,
  triggerClock,
  outputMode,
  deleteCheckpointOnStop))

StreamExecution's start method is then called to begin execution.

query.streamingQuery.start()

This method mainly starts a thread that performs the micro-batch processing.

microBatchThread.setDaemon(true)
microBatchThread.start()

val microBatchThread =
  new StreamExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      // To fix call site like "run at :0", we bridge the call site from the caller
      // thread to this micro batch thread
      sparkSession.sparkContext.setCallSite(callSite)
      runBatches()
    }
  }

The key method is runBatches().

One important variable here is triggerExecutor: based on the trigger we set in the sample, trigger(Trigger.ProcessingTime(1000000)), its type decides whether a run-once OneTimeExecutor or a ProcessingTimeExecutor is built.

This object is constructed and initialized when the StreamExecution is built:

private val triggerExecutor = trigger match {
  case t: ProcessingTime => ProcessingTimeExecutor(t, triggerClock)
  case OneTimeTrigger => OneTimeExecutor()
  case _ => throw new IllegalStateException(s"Unknown type of trigger: $trigger")
}

Executing a micro-batch

Inside runBatches, runBatch is called whenever data is available:

if (dataAvailable) {
  currentStatus = currentStatus.copy(isDataAvailable = true)
  updateStatusMessage("Processing new data")
  runBatch(sparkSessionToRunBatches)
}

runBatch proceeds in three steps: fetch the data, build the execution plan, write the output.

A) Requesting unprocessed data

// Request unprocessed data from all sources.
newData = reportTimeTaken("getBatch") {
  availableOffsets.flatMap {
    case (source, available)
      if committedOffsets.get(source).map(_ != available).getOrElse(true) =>
      val current = committedOffsets.get(source)
      val batch = source.getBatch(current, available)
      logDebug(s"Retrieving data from $source: $current -> $available")
      Some(source -> batch)
    case _ => None
  }
}

B) Building the concrete execution plan

// Replace sources in the logical plan with data that has arrived since the last batch.
val withNewSources = logicalPlan transform {
  case StreamingExecutionRelation(source, output) =>
    newData.get(source).map { data =>
      val newPlan = data.logicalPlan
      assert(output.size == newPlan.output.size,
        s"Invalid batch: ${Utils.truncatedString(output, ",")} != " +
        s"${Utils.truncatedString(newPlan.output, ",")}")
      replacements ++= output.zip(newPlan.output)
      newPlan
    }.getOrElse {
      LocalRelation(output)
    }
}

// Rewire the plan to use the new attributes that were returned by the source.
val replacementMap = AttributeMap(replacements)
val triggerLogicalPlan = withNewSources transformAllExpressions {
  case a: Attribute if replacementMap.contains(a) => replacementMap(a)
  case ct: CurrentTimestamp =>
    CurrentBatchTimestamp(offsetSeqMetadata.batchTimestampMs,
      ct.dataType)
  case cd: CurrentDate =>
    CurrentBatchTimestamp(offsetSeqMetadata.batchTimestampMs,
      cd.dataType, cd.timeZoneId)
}

reportTimeTaken("queryPlanning") {
  lastExecution = new IncrementalExecution(
    sparkSessionToRunBatch,
    triggerLogicalPlan,
    outputMode,
    checkpointFile("state"),
    currentBatchId,
    offsetSeqMetadata)
  lastExecution.executedPlan // Force the lazy generation of execution plan
}

C) Calling the sink's addBatch to produce the output

val nextBatch =
  new Dataset(sparkSessionToRunBatch, lastExecution, RowEncoder(lastExecution.analyzed.schema))

reportTimeTaken("addBatch") {
  sink.addBatch(currentBatchId, nextBatch)
}

For the console sink used in this article, the concrete implementation is:

data.sparkSession.createDataFrame(
  data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
  .show(numRowsToShow, isTruncated)

Structured Streaming provides efficient, scalable, fault-tolerant, end-to-end exactly-once stream processing, and the user never has to think in terms of streams.

Splitting things into three parts makes the concepts clearer:
1. DataSource
2. Sink
3. The execution plan of the DataSet/DataFrame.

Treat the stream as a table, and new data as new rows of that table. Doesn't that make the model click?

Advanced Features of Spark Structured Streaming

With Structured Streaming, sliding-window aggregations over event time are straightforward and look very much like grouped aggregations. In a grouped aggregation, one aggregate value is maintained per unique value of the user-specified grouping column. In a window-based aggregation, one aggregate value is maintained per event-time window that a row falls into.

As in the earlier example, we run a word count that we want computed over 10-minute windows sliding every 5 minutes: i.e. word counts for the windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, and so on. 12:00 - 12:10 means data that arrived after 12:00 and before 12:10; a word received at 12:07, for example, falls into, and therefore affects, the two windows 12:00 - 12:10 and 12:05 - 12:15.

The result table would look like the following.

3. Handling late data and watermarking

Now consider what happens when events reach the application late. For example, a word generated at 12:04 may only be received at 12:11. The application should use 12:04, not 12:11, to update the 12:00 - 12:10 window. This happens naturally in our window-based grouping: Structured Streaming can maintain the intermediate state of partial aggregates for a long time, so that late data can correctly update the aggregates of old windows, as shown below.

However, to run this query for days, the system has to bound how much in-memory intermediate state accumulates. That means the system needs to know when an old aggregate can be dropped from state because the application will no longer receive late data for it. To enable this, Spark 2.1 introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You define the watermark of a query by specifying the event-time column and a threshold on how late the data is expected to be in terms of event time. For a specific window starting at time T, the engine will maintain state and allow late data to update it until (max event time seen by the engine - late threshold) > T. In other words, late data within the threshold is aggregated, while data later than the threshold is dropped. Let's understand this with an example.

val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
  .count()
  .orderBy("window")

In this example we define the watermark of the query on the "timestamp" column and set "10 minutes" as the threshold for how late data is allowed to be. If this query runs in Update output mode (for output modes, see <Spark源码系列之spark2.2的StructuredStreaming使用及源码介绍>), the engine keeps updating a window's counts in the result table until the window is older than the watermark, which lags behind the current event time in the "timestamp" column by 10 minutes.

As the figure shows, the maximum event time tracked by the engine is the blue dashed line, and the watermark set at the beginning of every trigger as (max event time - 10 minutes) is the red line. For example, when the engine sees the record (12:14, dog), it sets the watermark for the next trigger to 12:04. The watermark makes the engine keep state for an extra ten minutes so that late data can still be counted.

For example, the record (12:09, cat) is out of order and late, and it falls into the windows 12:05 - 12:15 and 12:10 - 12:20. Since it is still above the watermark 12:04 at the time of the trigger, the engine keeps the intermediate counts in state and correctly updates the counts of both windows. When the watermark later advances to 12:11, the intermediate state of the 12:00 - 12:10 window is cleared and all subsequent data for it (e.g. (12:04, donkey)) is considered too late and ignored. Note that, as Update mode dictates, the updated counts are written to the sink after every trigger.

Some sinks (e.g. files) may not support the fine-grained updates that Update mode requires. To work with them, we also support Append mode, where only the final counts are written to the sink.

Note that using a watermark on a non-streaming Dataset has no effect. Since a watermark should not affect any batch query in any way, it is simply ignored.

Similar to the Update mode above, the engine maintains intermediate counts for each window. However, in Append mode the partial results are not written to the result table or the sink. The engine waits the "10 minutes" for late counts, then drops the intermediate state of windows older than the watermark and appends the final counts to the result table/sink. For example, the final count of the window 12:00 - 12:10 is appended to the result table only after the watermark has advanced to 12:11.

Conditions for watermarking to clean aggregation state. It is important to note that the following conditions must be satisfied for watermarking to clean the state of an aggregation query (as of Spark 2.1.1, subject to change in the future).

A) The output mode must be Append or Update. Complete mode requires all aggregate data to be preserved and hence cannot use watermarking to drop state.

B) The aggregation must have either the event-time column or a window on the event-time column.

C) withWatermark must be called on the same column as the timestamp column used in the aggregation. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append mode, because the watermark column and the aggregation column differ.

D) withWatermark must be called before the aggregation for the watermark details to be used. For example, df.groupBy("time").count().withWatermark("time", "1 min") is invalid in Append output mode.

4. Join operations

Streaming DataFrames can be joined with static DataFrames to produce new DataFrames. A couple of examples:

val staticDf = spark.read. …
val streamingDf = spark.readStream. …

streamingDf.join(staticDf, "type")                 // inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_outer")  // right outer join with a static DF

5. Streaming deduplication

You can deduplicate records in a data stream using a unique identifier in the events, exactly like static deduplication on a unique-identifier column. The query stores the necessary amount of data from previous records so that duplicate records can be filtered out. As with aggregations, deduplication can be used with or without a watermark.

A) With a watermark: if there is an upper bound on how late a duplicate record may arrive, you can define a watermark on the event-time column and deduplicate using both the guid and the event-time column.

B) Without a watermark: since there is no bound on when a duplicate may arrive, the query stores data from all past records as state.

val streamingDf = spark.readStream. ...  // columns: guid, eventTime, ...

// Without watermark using guid column
streamingDf.dropDuplicates("guid")

// With watermark using guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime")

6. Arbitrary stateful operations

Many use cases require stateful operations more advanced than aggregations. For example, you may have to track sessions from event data streams. For such sessionization, you have to save arbitrary data as state and perform arbitrary operations on the state using the stream events at every trigger. Since Spark 2.2, this can be done with the operator mapGroupsWithState and the more powerful flatMapGroupsWithState. Both allow you to apply user-defined code on grouped Datasets to update user-defined state.

// A mapping function that maintains an integer state for string keys and returns a string.
// Additionally, it sets a timeout to remove the state if it has not received data for an hour.
def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {

  if (state.hasTimedOut) {                // If called when timing out, remove the state
    state.remove()

  } else if (state.exists) {              // If state exists, use it for processing
    val existingState = state.get         // Get the existing state
    val shouldRemove = ...                // Decide whether to remove the state
    if (shouldRemove) {
      state.remove()                      // Remove the state

    } else {
      val newState = ...
      state.update(newState)              // Set the new state
      state.setTimeoutDuration("1 hour")  // Set the timeout
    }

  } else {
    val initialState = ...
    state.update(initialState)            // Set the initial state
    state.setTimeoutDuration("1 hour")    // Set the timeout
  }

  // return something
}

dataset
  .groupByKey(...)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)

7. Unsupported operations

There are a few DataFrame/Dataset operations that are not supported on streaming DataFrames/Datasets. Some of them are as follows.

A) Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not supported on streaming Datasets.

B) Limit and taking the first N rows are not supported on streaming Datasets.

C) Distinct operations on streaming Datasets are not supported.

D) Sorting on streaming Datasets is supported only after an aggregation and in Complete output mode.

E) Outer joins between a streaming and a static Dataset are conditionally supported:

a) Full outer join with a streaming Dataset is not supported

b) Left outer join with a streaming Dataset on the right is not supported

c) Right outer join with a streaming Dataset on the left is not supported

F) Any kind of join between two streaming Datasets is not yet supported.

In addition, some Dataset methods do not work on streaming Datasets. They are actions that immediately run the query and return results, which does not make sense on a streaming Dataset. Instead, those functionalities can be achieved by explicitly starting a streaming query.

A) count() - cannot return a single count from a streaming Dataset. Instead, use ds.groupBy().count(), which returns a streaming Dataset containing a running count (see the sketch after this section).

B) foreach() - use ds.writeStream.foreach(…) instead

C) show() - use the console sink instead

If you try any of these operations, you will see an AnalysisException such as "operation XYZ is not supported with streaming DataFrames/Datasets". While some of them may be supported in future Spark releases, others are fundamentally hard to implement efficiently on streaming data. For example, sorting the input stream is not supported because it would require tracking all the data received in the stream, which is fundamentally hard to do efficiently.
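A minimal sketch of the count() replacement mentioned above, assuming a streaming Dataset named ds: the running count is itself a stream and has to be emitted through writeStream (here to the console, in complete mode):

val runningCount = ds.groupBy().count()   // streaming Dataset with a single running count

runningCount.writeStream
  .outputMode("complete")
  .format("console")
  .start()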

8. Monitoring streaming queries

There are two APIs for monitoring and debugging queries: interactively and asynchronously.

1. Interactive APIs

You can get the current status and metrics of an active query directly with streamingQuery.lastProgress() and streamingQuery.status(). lastProgress() returns a StreamingQueryProgress object in Scala and Java, and a dictionary with the same fields in Python. It has all the information about the progress made in the last trigger of the stream: what data was processed, the processing rates, latencies, and so on. There is also streamingQuery.recentProgress, which returns an array of the last few progress updates.

In addition, streamingQuery.status() returns a StreamingQueryStatus object in Scala and Java, and a dictionary with the same fields in Python. It gives information about what the query is doing right now: whether a trigger is active, whether data is being processed, and so on.

Here are a few examples.

val query: StreamingQuery = ...

println(query.lastProgress)

/* Will print something like the following.

{
  "id" : "ce011fdc-8762-4dcb-84eb-a77333e28109",
  "runId" : "88e2ff94-ede0-45a8-b687-6316fbef529a",
  "name" : "MyQuery",
  "timestamp" : "2016-12-14T18:45:24.873Z",
  "numInputRows" : 10,
  "inputRowsPerSecond" : 120.0,
  "processedRowsPerSecond" : 200.0,
  "durationMs" : {
    "triggerExecution" : 3,
    "getOffset" : 2
  },
  "eventTime" : {
    "watermark" : "2016-12-14T18:45:24.873Z"
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-0]]",
    "startOffset" : {
      "topic-0" : {
        "2" : 0,
        "4" : 1,
        "1" : 1,
        "3" : 1,
        "0" : 1
      }
    },
    "endOffset" : {
      "topic-0" : {
        "2" : 0,
        "4" : 115,
        "1" : 134,
        "3" : 21,
        "0" : 534
      }
    },
    "numInputRows" : 10,
    "inputRowsPerSecond" : 120.0,
    "processedRowsPerSecond" : 200.0
  } ],
  "sink" : {
    "description" : "MemorySink"
  }
}
*/

println(query.status)

/* Will print something like the following.
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}
*/

2. Asynchronous API

You can also asynchronously monitor all the queries associated with a SparkSession by attaching a StreamingQueryListener (Scala/Java docs). Once you attach your custom StreamingQueryListener object with sparkSession.streams.addListener(), you will get callbacks when a query is started and stopped and when progress is made in an active query.

val spark: SparkSession = ...

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
})

9. Recovering from failures with checkpointing

In case of a failure or an intentional shutdown, you can recover the previous progress and state of a query and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all of its progress information (i.e. the range of offsets processed in each trigger) and the running aggregates (e.g. the word counts in the quick example) to the checkpoint location. The checkpoint location must be a path on an HDFS-compatible file system and can be set as an option on the DataStreamWriter when starting the query.

aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()

This article covered several advanced features of Spark Structured Streaming: window operations, late data and watermarking, joins, streaming deduplication, some unsupported operations, the monitoring APIs and failure recovery. Hopefully it helps you understand Structured Streaming a bit better.

This article is best read together with <> and the Flink-related articles, which makes the differences between Spark Streaming, Flink and Structured Streaming clearer. A later article will compare the three in detail.

Kafka integration

Structured Streaming does track offsets; it just does not commit them to ZooKeeper. Kafka itself has the __consumer_offsets topic for storing offsets, and offsets can also be saved to external storage such as MySQL or Redis. The checkpointLocation option sets the path where offsets are stored; you can try it yourself:
.option("checkpointLocation", "./checkpoint")
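A short sketch that puts the note above into a query: a Kafka source written to a Kafka sink with an explicit checkpointLocation, so processed offsets survive restarts (the topic names, servers and path are placeholders, and a SparkSession named spark is assumed):

val in = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "in-topic")
  .load()

in.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-copy")  // offsets + state live here
  .start()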

Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
Structured Streaming integration for Kafka 0.10 to read data from and write data to Kafka.

Linking
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:

groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.1.2
Please note that to use the headers functionality, your Kafka client version should be version 0.11.0.0 or up.

For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below.

For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell. Also, see the Deploying subsection below.

Reading Data from Kafka
Creating a Kafka Source for Streaming Queries
// Subscribe to 1 topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to 1 topic, with headers
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]

// Subscribe to multiple topics
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
Creating a Kafka Source for Batch Queries
If you have a use case that is better suited to batch processing, you can create a Dataset/DataFrame for a defined range of offsets.

// Subscribe to 1 topic, defaults to the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics, specifying explicit Kafka offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern, at the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
Each row in the source has the following schema:

Column              Type
key                 binary
value               binary
topic               string
partition           int
offset              long
timestamp           timestamp
timestampType       int
headers (optional)  array
The following options must be set for the Kafka source for both batch and streaming queries.

assign: JSON string, e.g. {"topicA":[0,1],"topicB":[2,4]}. Specific TopicPartitions to consume. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for the Kafka source.
subscribe: a comma-separated list of topics. The topic list to subscribe to. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for the Kafka source.
subscribePattern: Java regex string. The pattern used to subscribe to topic(s). Only one of "assign", "subscribe" or "subscribePattern" options can be specified for the Kafka source.
kafka.bootstrap.servers: a comma-separated list of host:port. The Kafka "bootstrap.servers" configuration.
The following configurations are optional:

Option value default query type meaning
startingOffsetsByTimestamp json string “”" {“topicA”:{“0”: 1000, “1”: 1000}, “topicB”: {“0”: 2000, “1”: 2000}} “”" none (the value of startingOffsets will apply) streaming and batch The start point of timestamp when a query is started, a json string specifying a starting timestamp for each TopicPartition. The returned offset for each partition is the earliest offset whose timestamp is greater than or equal to the given timestamp in the corresponding partition. If the matched offset doesn’t exist, the query will fail immediately to prevent unintended read from such partition. (This is a kind of limitation as of now, and will be addressed in near future.)
Spark simply passes the timestamp information to KafkaConsumer.offsetsForTimes, and doesn’t interpret or reason about the value.

For more details on KafkaConsumer.offsetsForTimes, please refer to the javadoc.

Also, the meaning of timestamp here can vary according to the Kafka configuration (log.message.timestamp.type); please refer to the Kafka documentation for further details.

Note: This option requires Kafka 0.10.1.0 or higher.

Note2: startingOffsetsByTimestamp takes precedence over startingOffsets.

Note3: For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.

startingOffsets “earliest”, “latest” (streaming only), or json string “”" {“topicA”:{“0”:23,“1”:-1},“topicB”:{“0”:-2}} “”" “latest” for streaming, “earliest” for batch streaming and batch The start point when a query is started, either “earliest” which is from the earliest offsets, “latest” which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
endingOffsetsByTimestamp json string “”" {“topicA”:{“0”: 1000, “1”: 1000}, “topicB”: {“0”: 2000, “1”: 2000}} “”" latest batch query The end point when a batch query is ended, a json string specifying an ending timestamp for each TopicPartition. The returned offset for each partition is the earliest offset whose timestamp is greater than or equal to the given timestamp in the corresponding partition. If the matched offset doesn’t exist, the offset will be set to latest.
Spark simply passes the timestamp information to KafkaConsumer.offsetsForTimes, and doesn’t interpret or reason about the value.

For more details on KafkaConsumer.offsetsForTimes, please refer to the javadoc.

Also, the meaning of timestamp here can vary according to the Kafka configuration (log.message.timestamp.type); please refer to the Kafka documentation for further details.

Note: This option requires Kafka 0.10.1.0 or higher.

Note2: endingOffsetsByTimestamp takes precedence over endingOffsets.

endingOffsets latest or json string {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}} latest batch query The end point when a batch query is ended, either "latest", which refers to the latest offsets, or a json string specifying an ending offset for each TopicPartition. In the json, -1 as an offset can be used to refer to latest, and -2 (earliest) as an offset is not allowed.
failOnDataLoss true or false true streaming and batch Whether to fail the query when it’s possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn’t work as you expected.
kafkaConsumer.pollTimeoutMs long 120000 streaming and batch The timeout in milliseconds to poll data from Kafka in executors. When not defined it falls back to spark.network.timeout.
fetchOffset.numRetries int 3 streaming and batch Number of times to retry before giving up fetching Kafka offsets.
fetchOffset.retryIntervalMs long 10 streaming and batch milliseconds to wait before retrying to fetch Kafka offsets
maxOffsetsPerTrigger long none streaming and batch Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
minPartitions int none streaming and batch Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn’t receive any new data.
groupIdPrefix string spark-kafka-source streaming and batch Prefix of consumer group identifiers (group.id) that are generated by structured streaming queries. If “kafka.group.id” is set, this option will be ignored.
kafka.group.id string none streaming and batch The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use a specific authorized group id to read data. You can optionally set the group id. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both batch and streaming) or sources with the same group id are likely to interfere with each other, causing each query to read only part of the data. This may also occur when queries are started/restarted in quick succession. To minimize such issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to be very small. When this is set, option "groupIdPrefix" will be ignored.
includeHeaders boolean false streaming and batch Whether to include the Kafka headers in the row.
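As a hedged sketch of the timestamp-based start option described above (the topic name and epoch-millisecond timestamps are made up for illustration; Kafka 0.10.1.0 or higher is required):

val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
// each partition starts at the first offset whose timestamp >= the given epoch millis
.option("startingOffsetsByTimestamp", """{"topic1":{"0":1609459200000,"1":1609459200000}}""")
.load()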
Offset fetching
In Spark 3.0 and earlier, Spark uses KafkaConsumer for offset fetching, which could cause an infinite wait in the driver. In Spark 3.1 a new configuration option was added, spark.sql.streaming.kafka.useDeprecatedOffsetFetching (default: true), which can be set to false to let Spark use a new offset fetching mechanism based on AdminClient. When the new mechanism is used, the following applies.

First of all, the new approach supports Kafka brokers 0.11.0.0 or newer.

In Spark 3.0 and below, secure Kafka processing needed the following ACLs from the driver's perspective:

Topic resource describe operation
Topic resource read operation
Group resource read operation
Since Spark 3.1, offsets can be obtained with AdminClient instead of KafkaConsumer, and for that the following ACLs are needed from the driver's perspective:

Topic resource describe operation
Since AdminClient in the driver does not connect to a consumer group, group.id-based authorization will not work anymore (executors have never done group-based authorization). It is worth mentioning that the executor side behaves exactly the same way as before (group prefix and override still work).
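A minimal spark-submit sketch of opting in to the AdminClient-based mechanism (the rest of the command line is elided):

./bin/spark-submit \
  --conf spark.sql.streaming.kafka.useDeprecatedOffsetFetching=false \
  …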

Consumer Caching
It’s time-consuming to initialize Kafka consumers, especially in streaming scenarios where processing time is a key factor. Because of this, Spark pools Kafka consumers on executors, by leveraging Apache Commons Pool.

The caching key is built up from the following information:

Topic name
Topic partition
Group ID
The following properties are available to configure the consumer pool:

Property Name Default Meaning Since Version
spark.kafka.consumer.cache.capacity 64 The maximum number of consumers cached. Please note that it’s a soft limit. 3.0.0
spark.kafka.consumer.cache.timeout 5m (5 minutes) The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor. 3.0.0
spark.kafka.consumer.cache.evictorThreadRunInterval 1m (1 minute) The interval of time between runs of the idle evictor thread for consumer pool. When non-positive, no idle evictor thread will be run. 3.0.0
spark.kafka.consumer.cache.jmx.enable false Enable or disable JMX for pools created with this configuration instance. Statistics of the pool are available via JMX instance. The prefix of JMX name is set to “kafka010-cached-simple-kafka-consumer-pool”. 3.0.0
The size of the pool is limited by spark.kafka.consumer.cache.capacity, but it works as a "soft limit" so that Spark tasks are not blocked.

The idle eviction thread periodically removes consumers which have not been used for longer than the given timeout. If this threshold is reached when borrowing, it tries to remove the least-used entry that is currently not in use.

If it cannot be removed, then the pool will keep growing. In the worst case, the pool will grow to the maximum number of concurrent tasks that can run in the executor (that is, the number of task slots).

If a task fails for any reason, the new task is executed with a newly created Kafka consumer for safety reasons. At the same time, all consumers in the pool which have the same caching key are invalidated, to remove the consumer which was used in the failed execution. Consumers which other tasks are using will not be closed, but will be invalidated as well when they are returned to the pool.

Along with consumers, Spark separately pools the records fetched from Kafka, to keep Kafka consumers stateless from Spark's point of view and to maximize the efficiency of pooling. It uses the same cache key as the Kafka consumer pool. Note that it does not leverage Apache Commons Pool due to the difference in characteristics.

The following properties are available to configure the fetched data pool:

Property Name Default Meaning Since Version
spark.kafka.consumer.fetchedData.cache.timeout 5m (5 minutes) The minimum amount of time a fetched data may sit idle in the pool before it is eligible for eviction by the evictor. 3.0.0
spark.kafka.consumer.fetchedData.cache.evictorThreadRunInterval 1m (1 minute) The interval of time between runs of the idle evictor thread for fetched data pool. When non-positive, no idle evictor thread will be run. 3.0.0
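As a hedged sketch, both pools can be tuned at submit time with --conf; the values below are purely illustrative, not recommendations:

./bin/spark-submit \
  --conf spark.kafka.consumer.cache.capacity=128 \
  --conf spark.kafka.consumer.cache.timeout=10m \
  --conf spark.kafka.consumer.fetchedData.cache.timeout=10m \
  …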
Writing Data to Kafka
Here, we describe the support for writing Streaming Queries and Batch Queries to Apache Kafka. Take note that Apache Kafka only supports at least once write semantics. Consequently, when writing—either Streaming Queries or Batch Queries—to Kafka, some records may be duplicated; this can happen, for example, if Kafka needs to retry a message that was not acknowledged by a Broker, even though that Broker received and wrote the message record. Structured Streaming cannot prevent such duplicates from occurring due to these Kafka write semantics. However, if writing the query is successful, then you can assume that the query output was written at least once. A possible solution to remove duplicates when reading the written data could be to introduce a primary (unique) key that can be used to perform de-duplication when reading.

The Dataframe being written to Kafka should have the following columns in schema:

Column Type
key (optional) string or binary
value (required) string or binary
headers (optional) array
topic (*optional) string
partition (optional) int

  • The topic column is required if the “topic” configuration option is not specified.

The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added (see Kafka semantics on how null valued key values are handled). If a topic column exists then its value is used as the topic when writing the given row to Kafka, unless the “topic” configuration option is set i.e., the “topic” configuration option overrides the topic column. If a “partition” column is not specified (or its value is null) then the partition is calculated by the Kafka producer. A Kafka partitioner can be specified in Spark by setting the kafka.partitioner.class option. If not present, Kafka default partitioner will be used.
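A hedged sketch of the de-duplication idea mentioned above, assuming each record's JSON value carries a unique eventId field (the field name and schema are made up for illustration):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical schema: every record carries a unique eventId
val schema = new StructType()
  .add("eventId", StringType)
  .add("payload", StringType)

val deduped = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("v"))
  .select("v.*")
  .dropDuplicates("eventId") // rows duplicated by at-least-once writes collapse to one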

The following options must be set for the Kafka sink for both batch and streaming queries.

Option value meaning
kafka.bootstrap.servers A comma-separated list of host:port The Kafka “bootstrap.servers” configuration.
The following configurations are optional:

Option value default query type meaning
topic string none streaming and batch Sets the topic that all rows will be written to in Kafka. This option overrides any topic column that may exist in the data.
includeHeaders boolean false streaming and batch Whether to include the Kafka headers in the row.
Creating a Kafka Sink for Streaming Queries
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()

// Write key-value data from a DataFrame to Kafka using a topic specified in the data
val ds = df
.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.start()
Writing the output of Batch Queries to Kafka
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()

// Write key-value data from a DataFrame to Kafka using a topic specified in the data
df.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.save()
Producer Caching
Since the Kafka producer instance is designed to be thread-safe, Spark initializes one Kafka producer instance per caching key and shares it across tasks.

The caching key is built up from the following information:

Kafka producer configuration
This includes the configuration for authorization, which Spark will automatically include when a delegation token is being used. Even taking authorization into account, you can expect the same Kafka producer instance to be reused for the same Kafka producer configuration. A different Kafka producer will be used when the delegation token is renewed; the Kafka producer instance for the old delegation token will be evicted according to the cache policy.

The following properties are available to configure the producer pool:

Property Name Default Meaning Since Version
spark.kafka.producer.cache.timeout 10m (10 minutes) The minimum amount of time a producer may sit idle in the pool before it is eligible for eviction by the evictor. 2.2.1
spark.kafka.producer.cache.evictorThreadRunInterval 1m (1 minute) The interval of time between runs of the idle evictor thread for producer pool. When non-positive, no idle evictor thread will be run. 3.0.0
The idle eviction thread periodically removes producers which have not been used for longer than the given timeout. Note that the producer is shared and used concurrently, so the last-used timestamp is determined by the moment the producer instance is returned and the reference count is 0.

Kafka Specific Configurations
Kafka's own configurations can be set via DataStreamReader.option with the kafka. prefix, e.g., stream.option("kafka.bootstrap.servers", "host:port"). For possible Kafka parameters, see the Kafka consumer config docs for parameters related to reading data, and the Kafka producer config docs for parameters related to writing data.

Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:

group.id: Kafka source will create a unique group id for each query automatically. The user can set the prefix of the automatically generated group.id’s via the optional source option groupIdPrefix, default value is “spark-kafka-source”. You can also set “kafka.group.id” to force Spark to use a special group id, however, please read warnings for this option and use it with caution.
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
key.serializer: Keys are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the keys into either strings or byte arrays.
value.serializer: Values are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the values into either strings or byte arrays.
enable.auto.commit: Kafka source doesn't commit any offset.
interceptor.classes: Kafka source always reads keys and values as byte arrays. It's not safe to use ConsumerInterceptor as it may break the query.
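As an illustrative sketch, options that are not on the forbidden list above can be passed through to the underlying consumer with the kafka. prefix (the values shown are placeholders):

val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
// forwarded as-is to the Kafka consumer
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.session.timeout.ms", "30000")
.load()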
Deploying
As with any Spark application, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.12 and its dependencies can be directly added to spark-submit using --packages, such as,

./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 …
For experimenting on spark-shell, you can also use --packages to add spark-sql-kafka-0-10_2.12 and its dependencies directly,

./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 …
See Application Submission Guide for more details about submitting applications with external dependencies.

Security
Kafka 0.9.0.0 introduced several features that increase security in a cluster. For a detailed description of these possibilities, see the Kafka security docs.

It’s worth noting that security is optional and turned off by default.

Spark supports the following ways to authenticate against Kafka cluster:

Delegation token (introduced in Kafka broker 1.1.0)
JAAS login configuration
Delegation token
This way the application can be configured via Spark parameters and may not need JAAS login configuration (Spark can use Kafka’s dynamic JAAS configuration feature). For further information about delegation tokens, see Kafka delegation token docs.

The process is initiated by Spark’s Kafka delegation token provider. When spark.kafka.clusters.${cluster}.auth.bootstrap.servers is set, Spark considers the following log in options, in order of preference:

JAAS login configuration, please see example below.
Keytab file, such as,

./bin/spark-submit \
  --keytab <KEYTAB_FILE> \
  --principal <PRINCIPAL> \
  --conf spark.kafka.clusters.${cluster}.auth.bootstrap.servers=<KAFKA_SERVERS>

Kerberos credential cache, such as,

./bin/spark-submit \
  --conf spark.kafka.clusters.${cluster}.auth.bootstrap.servers=<KAFKA_SERVERS>

The Kafka delegation token provider can be turned off by setting spark.security.credentials.kafka.enabled to false (default: true).

Spark can be configured to use the following authentication protocols to obtain token (it must match with Kafka broker configuration):

SASL SSL (default)
SSL
SASL PLAINTEXT (for testing)
After obtaining delegation token successfully, Spark distributes it across nodes and renews it accordingly. Delegation token uses SCRAM login module for authentication and because of that the appropriate spark.kafka.clusters.${cluster}.sasl.token.mechanism (default: SCRAM-SHA-512) has to be configured. Also, this parameter must match with Kafka broker configuration.

When delegation token is available on an executor Spark considers the following log in options, in order of preference:

JAAS login configuration, please see example below.
Delegation token, please see spark.kafka.clusters.${cluster}.target.bootstrap.servers.regex parameter for further details.
When none of the above applies, an unsecured connection is assumed.

Configuration
Delegation tokens can be obtained from multiple clusters and ${cluster} is an arbitrary unique identifier which helps to group different configurations.

Property Name Default Meaning Since Version
spark.kafka.clusters.${cluster}.auth.bootstrap.servers None A list of comma-separated host/port pairs to use for establishing the initial connection to the Kafka cluster. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.target.bootstrap.servers.regex .* Regular expression to match against the bootstrap.servers config for sources and sinks in the application. If a server address matches this regex, the delegation token obtained from the respective bootstrap servers will be used when connecting. If multiple clusters match the address, an exception will be thrown and the query won't be started. Kafka's secure and unsecure listeners are bound to different ports. When both are used, the secure listener port has to be part of the regular expression. 3.0.0
spark.kafka.clusters.${cluster}.security.protocol SASL_SSL Protocol used to communicate with brokers. For further details please see Kafka documentation. The protocol is applied by default on all sources and sinks whose bootstrap.servers config matches (see spark.kafka.clusters.${cluster}.target.bootstrap.servers.regex), and can be overridden by setting kafka.security.protocol on the source or sink. 3.0.0
spark.kafka.clusters.${cluster}.sasl.kerberos.service.name kafka The Kerberos principal name that Kafka runs as. This can be defined either in Kafka's JAAS config or in Kafka's config. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.ssl.truststore.location None The location of the trust store file. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.ssl.truststore.password None The store password for the trust store file. This is optional and only needed if spark.kafka.clusters.${cluster}.ssl.truststore.location is configured. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.ssl.keystore.location None The location of the key store file. This is optional for the client and can be used for two-way authentication for the client. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.ssl.keystore.password None The store password for the key store file. This is optional and only needed if spark.kafka.clusters.${cluster}.ssl.keystore.location is configured. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.ssl.key.password None The password of the private key in the key store file. This is optional for the client. For further details please see Kafka documentation. Only used to obtain delegation token. 3.0.0
spark.kafka.clusters.${cluster}.sasl.token.mechanism SCRAM-SHA-512 SASL mechanism used for client connections with delegation token. Because the SCRAM login module is used for authentication, a compatible mechanism has to be set here. For further details please see Kafka documentation (sasl.mechanism). Only used to authenticate against Kafka broker with delegation token. 3.0.0
Kafka Specific Configurations
Kafka's own configurations can be set with the kafka. prefix, e.g., --conf spark.kafka.clusters.${cluster}.kafka.retries=1. For possible Kafka parameters, see Kafka adminclient config docs.

Caveats
Obtaining delegation token for proxy user is not yet supported (KAFKA-6945).
JAAS login configuration
JAAS login configuration must be placed on all nodes where Spark tries to access the Kafka cluster. This provides the possibility to apply any custom authentication logic, at a higher maintenance cost. This can be done several ways. One possibility is to provide additional JVM parameters, such as,

./bin/spark-submit \
  --driver-java-options "-Djava.security.auth.login.config=/path/to/custom_jaas.conf" \
  --conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/path/to/custom_jaas.conf

Question 1: Ways to read from Kafka
// Read a single topic
val inputTable = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribe", "topic_1")
.load()

inputTable
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]

// Read multiple topics
val inputTable = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribe", "topic_1,topic_2")
.load()

inputTable
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]

// Read multiple topics via a pattern
val inputTable = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribePattern", "topic_[1-2]{1}")
.load()

inputTable
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]

Question 2: Reading Kafka with batch queries
Batch processing is suitable for one-off jobs.

// Read a single topic
// Defaults to the earliest and latest offsets
val inputTable = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribe", "topic_1")
.load()

val resultTable = inputTable
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]

resultTable
.write
.format("console")
.save()

// Read multiple topics
// Topic-partition offsets can be specified via startingOffsets and endingOffsets
// Note: with this form, the offsets of all topic partitions must be specified. -1: latest, -2: earliest
val inputTable = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribe", "topic_1,topic_2")
.option("startingOffsets", """{"topic_1":{"0":13624,"1":-2,"2":-2},"topic_2":{"0":-2,"1":-2,"2":-2}}""")
.option("endingOffsets", """{"topic_1":{"0":13626,"1":13675,"2":-1},"topic_2":{"0":1,"1":-1,"2":-1}}""")
.load()

// Read multiple topics via a pattern
// startingOffsets and endingOffsets can also be set to earliest/latest
val inputTable = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribePattern", "topic_.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()

Question 3: Schema of the DataFrame produced by reading Kafka
Column Column Type Meaning
key binary Key of the message
value binary Value of the message
topic string Topic of the message
partition integer Partition of the message
offset long Offset of the message
timestamp timestamp Timestamp of the message
timestampType integer Timestamp type of the message. 0: CreateTime, 1: LogAppendTime.
Question 4: Reading Kafka and deserialization
//1) Plain strings
//Convert the byte data to plain strings
inputTable.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

//2) JSON and Avro
When the data in Kafka is in JSON or Avro format, from_json / from_avro can be used to extract the required fields.
Taking JSON as an example:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DataTypes, StructType}
import spark.implicits._

val schema = new StructType()
.add("name", DataTypes.StringType)
.add("age", DataTypes.IntegerType)

val resultTable = inputTable.select(
col("key").cast("string"),
from_json(col("value").cast("string"), schema).as("value")
).select($"value.*")

resultTable.printSchema()
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
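For Avro, a hedged sketch along the same lines, assuming Spark 3.x with the org.apache.spark:spark-avro artifact on the classpath (the Avro schema below is made up to mirror the JSON example):

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Hypothetical Avro schema for the message value
val avroSchema =
  """{"type":"record","name":"Person","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}]}""".stripMargin

val avroResult = inputTable
  .select(from_avro(col("value"), avroSchema).as("value"))
  .select("value.*")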

Question 5: Dynamic discovery of topics and partitions when reading Kafka
Relying on the KafkaConsumerCoordinator, newly added topics or partitions are, by default, discovered dynamically and picked up automatically.

Question 6: Reading Kafka transactional messages
Spark Streaming does not provide good support for reading Kafka transactional messages.

In Structured Streaming, setting option("kafka.isolation.level", "read_committed") reads only messages from successfully committed transactions. The default is read_uncommitted, which reads all messages, including those from aborted transactions.
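A minimal sketch (broker addresses and the topic name are placeholders):

val committedOnly = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribe", "topic_1")
// only messages from committed transactions are returned
.option("kafka.isolation.level", "read_committed")
.load()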

Question 7: Kafka offsets when reading
When consuming Kafka with Structured Streaming, there is no need to manage offsets yourself; with checkpointing enabled, offsets are saved in the checkpoint.
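A sketch of what this looks like (the path and sink are placeholders); offsets and other progress information are persisted under the checkpoint directory, so a restarted query resumes from where the previous run stopped:

val query = inputTable
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("console")
// offsets are persisted under this directory
.option("checkpointLocation", "/path/to/checkpoints/my_query")
.start()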

Question 8: Rate limiting when reading Kafka
The maxOffsetsPerTrigger option controls the maximum number of records pulled per trigger.
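For example (the cap of 10000 is purely illustrative):

val throttled = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("subscribe", "topic_1")
// at most 10000 offsets across all topic partitions per trigger
.option("maxOffsetsPerTrigger", "10000")
.load()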

Question 9: Writing to Kafka
The Dataframe to be written to Kafka should contain the following columns:

key: optional column, string or binary type; defaults to null.

value: required column, string or binary type.

topic: optional column, string type.

// Write to a single topic
// The topic is specified via an option
df
.select($"value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("topic", "topic_1")
.option("checkpointLocation", "…")
.outputMode("update")
.trigger(Trigger.ProcessingTime("2 seconds"))
.start()

// Write to multiple topics
// The topic is taken from the topic column of the data
df
.select($"value", $"topic")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka01:9092,kafka02:9092,kafka03:9092")
.option("checkpointLocation", "…")
.outputMode("update")
.trigger(Trigger.ProcessingTime("2 seconds"))
.start()

Question 10: Reading/writing Kafka and exactly-once semantics
Combining checkpointing with the replayable Kafka source, Structured Streaming processing can guarantee exactly-once (EOS) semantics. Writing to Kafka, however, only provides at-least-once semantics.

Consuming Kafka data with Structured Streaming
Spark provides a nicely unified batch/stream API, and the recently released Delta does the same. This way, stream processing also benefits from the optimizations made for DataFrames.

To use the Kafka data source, the corresponding jar has to be loaded, so when starting pyspark or submitting via spark-submit, the dependency must be added:

$ pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
When using Maven for project management, spark-sql-kafka can be added to the dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>

Use readStream to read the stream, specify the format as kafka, the Kafka broker configuration, and the subscribed topic(s) (multiple topics can be subscribed). The offset position can also be specified (latest, earliest, or concrete offsets for each partition of each topic).

df = (spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .load())
The resulting dataframe supports all the usual DataFrame operations. Once the data has been cleaned, it can be written to a sink. Parquet is the file format Spark most commonly uses; see the linked material later for an introduction to parquet.

Writing to parquet files:

(df.writeStream
  .format("parquet")
  .option("path", "save_path")
  .option("checkpointLocation", "save_path/checkpoints")
  .start())
You might wonder how Structured Streaming manages Kafka offsets. We saw that an offset has to be set when reading the stream, so what happens if the program is interrupted and then restarted?

Note that when writing the stream to a sink, a checkpointLocation must be set, and it is under this directory that Structured Streaming manages offsets. If the program is interrupted and then restarted, even though a specific offset was set when reading the stream, as long as the checkpointLocation already exists when writing the stream, the stream will continue from where it previously stopped. In other words, the offset setting on the read side only takes effect the first time the checkpointLocation is initialized.

When using Structured Streaming in practice, we have also run into some issues:

How to use resources dynamically for a long-running Structured Streaming job
First evaluate whether a long-running streaming job is really necessary; if the latency requirements are not that strict, consider running periodic streaming jobs instead. If it does need to run long term, consider Spark's dynamic resource allocation options (reportedly fairly buggy):
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=2 \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=5
Writing parquet files with Structured Streaming produces many small parquet files, which puts a lot of pressure on the HDFS namenode
You can refer to this article; the two main solutions are:
Reduce the number of partitions, and hence the number of parquet files written, with coalesce inside the Structured Streaming job
Run a batch job that reads the many small parquet files, repartitions them to a specified number, and writes them back out as parquet files (see the sketch below)
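A hedged sketch of both approaches, written in Scala for consistency with the earlier examples (the DataFrame name, paths and partition counts are placeholders, not recommendations):

// 1) Fewer files per micro-batch: coalesce before the parquet sink
cleanedDf
  .coalesce(4)
  .writeStream
  .format("parquet")
  .option("path", "save_path")
  .option("checkpointLocation", "save_path/checkpoints")
  .start()

// 2) Periodic batch compaction: read the small files, repartition, write back out
spark.read.parquet("save_path")
  .repartition(16)
  .write
  .mode("overwrite")
  .parquet("save_path_compacted")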
