Spark Structured Streaming Source Code Analysis (1): Creating the Source and Sink, and Custom Inputs and Outputs

LS_ice

Category: spark structured streaming source code

Article link: https://blog.csdn.net/LS_ice/article/details/82226828

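This article analyzes in detail how Spark Structured Streaming creates a Source and a DataFrame through DataStreamReader.load(), including looking up the DataSource and building the resolved logical plan. It also covers how DataStreamWriter.start() creates a Sink and starts the streaming query thread, and how to implement a custom Source and Sink.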

Part 1: Streaming example
Part 2: Using DataStreamReader.load() to look up the Source and create a DataFrame
Part 3: Using DataStreamWriter.start() to create the Sink and start the continuous streaming query thread

Part 1: Streaming example

Read records from a Kafka topic, count the occurrences of each word, and write the result to the console.

    val dataStreamReader: DataStreamReader = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option(subscribeType, topics)

    val lines: Dataset[String] = dataStreamReader
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // Generate running word count
    val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val dataStreamWriter: DataStreamWriter[Row] = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", checkpointLocation)

    val query = dataStreamWriter.start()

    query.awaitTermination()

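The example breaks down into the following steps:
① dataStreamReader.load() looks up the source and creates a DataFrame.
② The data (DataFrame) is transformed.
③ dataStreamWriter creates the sink and starts the continuous streaming query thread.
dataStreamReader.load() and dataStreamWriter.start() are the focus of this article.
The Dataset transformations are ordinary Spark SQL API calls (selectExpr, select, map, flatMap, and so on) and are not analyzed separately here.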

Part 2: Using DataStreamReader.load() to look up the Source and create a DataFrame

1. Setting DataStreamReader parameters:

The format() and option() methods provided by DataStreamReader set the reader's parameters.
Values passed to option() are stored in extraOptions: HashMap[String, String].

final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
   
  def format(source: String): DataStreamReader = {
    this.source = source
    this
  }
  def option(key: String, value: String): DataStreamReader = {
    this.extraOptions += (key -> value)
    this
  }

  private var source: String = sparkSession.sessionState.conf.defaultDataSourceName

  private var userSpecifiedSchema: Option[StructType] = None

  private var extraOptions = new scala.collection.mutable.HashMap[String, String]
}
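
As a rough illustration of the fluent pattern in the listing above, here is a minimal, self-contained sketch that does not depend on Spark. ReaderSketch and its describe method are hypothetical names introduced only for this example, and the "parquet" default is an assumption standing in for defaultDataSourceName.

```scala
import scala.collection.mutable

// Hypothetical stand-in for DataStreamReader; it only records the settings.
final class ReaderSketch {
  // Assumed default, standing in for sessionState.conf.defaultDataSourceName.
  private var source: String = "parquet"
  private val extraOptions = mutable.HashMap.empty[String, String]

  // Each setter returns this, so calls can be chained.
  def format(source: String): ReaderSketch = { this.source = source; this }
  def option(key: String, value: String): ReaderSketch = { extraOptions += (key -> value); this }

  // Snapshot of the state that a later load() call would read.
  def describe: (String, Map[String, String]) = (source, extraOptions.toMap)
}
```

Because the options live in a map keyed by option name, setting the same key twice keeps only the last value.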

2. DataStreamReader.load() looks up the source and creates a DataFrame

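Main steps of load():
· Call DataSource.lookupDataSource(source, conf) to find the provider class for a name such as kafka, file, or com.it.provider.source.HbaseSource, and instantiate it.
· Build v1DataSource and v1Relation. v1Relation is mainly used to construct StreamingRelationV2; batch processing does not go through the StreamingRelationV2 cases.
· Call Dataset.ofRows() to create the DataFrame. The V2 branches wrap a StreamingRelationV2, and the v1 branch wraps StreamingRelation(v1DataSource).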

  def load(): DataFrame = {
    if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
      throw new AnalysisException("Hive data source can only be used with tables, you can not " +
        "read files of Hive data source directly.")
    }

    val ds = DataSource.lookupDataSource(source, sparkSession.sqlContext.conf).newInstance()
    val options = new DataSourceOptions(extraOptions.asJava)

    val v1DataSource = DataSource(
      sparkSession,
      userSpecifiedSchema = userSpecifiedSchema,
      className = source,
      options = extraOptions.toMap)
    val v1Relation = ds match {
      case _: StreamSourceProvider => Some(StreamingRelation(v1DataSource))
      case _ => None
    }
    ds match {
      case s: MicroBatchReadSupport =>
        val tempReader = s.createMicroBatchReader(
          Optional.ofNullable(userSpecifiedSchema.orNull),
          Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
          options)
        Dataset.ofRows(
          sparkSession,
          StreamingRelationV2(
            s, source, extraOptions.toMap,
            tempReader.readSchema().toAttributes, v1Relation)(sparkSession))
      case s: ContinuousReadSupport =>
        val tempReader = s.createContinuousReader(
          Optional.ofNullable(userSpecifiedSchema.orNull),
          Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
          options)
        Dataset.ofRows(
          sparkSession,
          StreamingRelationV2(
            s, source, extraOptions.toMap,
            tempReader.readSchema().toAttributes, v1Relation)(sparkSession))
      case _ =>
        // Code path for data source v1.
        Dataset.ofRows(sparkSession, StreamingRelation(v1DataSource))
    }
  }
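
The branching in load() can be reduced to a small, Spark-free sketch. The traits and case classes below reuse names from the listing but are hypothetical, heavily simplified stand-ins (no schema, options, or session), and LoadSketch.plan is a name invented for this illustration.

```scala
// Simplified stand-ins for the Spark interfaces named in load(); not the real types.
trait StreamSourceProvider
trait MicroBatchReadSupport
trait ContinuousReadSupport

sealed trait Plan
case class StreamingRelation(source: String) extends Plan
case class StreamingRelationV2(source: String, v1Relation: Option[StreamingRelation]) extends Plan

object LoadSketch {
  def plan(ds: AnyRef, source: String): Plan = {
    // A v1 relation is kept as a fallback only when the provider also implements the v1 API.
    val v1Relation = ds match {
      case _: StreamSourceProvider => Some(StreamingRelation(source))
      case _ => None
    }
    ds match {
      // Either V2 read interface leads to a StreamingRelationV2 node.
      case _: MicroBatchReadSupport | _: ContinuousReadSupport =>
        StreamingRelationV2(source, v1Relation)
      // Code path for data source v1.
      case _ => StreamingRelation(source)
    }
  }
}
```

Under these assumptions, a provider that mixes in both StreamSourceProvider and a V2 read trait yields a StreamingRelationV2 that carries the v1 relation, while a v1-only provider yields a plain StreamingRelation.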

I) DataSource.lookupDataSource() finds the source provider class from a fully qualified class name or a predefined short name

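DataSource.lookupDataSource() resolves the provider class in three ways:
· Predefined sources such as the Kafka source are matched with _.shortName().equalsIgnoreCase(provider1); the input here is "kafka".
· For the json, csv, and parquet formats, the provider mapping is defined in backwardCompatibilityMap.
· A custom provider is found with loader.loadClass(provider1), which looks up the class for the fully qualified name through the context class loader; for example, the source can be set to org.apache.spark.sql.usersource.HbaseSourceProvider.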

  def lookupDataSource(provider: String, conf: S
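
The three lookup paths listed above can be approximated by the following self-contained sketch. It is not Spark's implementation: Register, KafkaLikeProvider, CsvLikeProvider, LookupSketch, and the com.example.legacy.csv entry are all hypothetical, and the hand-written registered list stands in for however the real code discovers providers that expose a short name. The ".DefaultSource" fallback is included as an assumption about a common naming convention.

```scala
import scala.util.Try

// Hypothetical stand-in for a provider that exposes a short alias.
trait Register { def shortName: String }
class KafkaLikeProvider extends Register { val shortName = "kafka" }
class CsvLikeProvider extends Register { val shortName = "csv" }

object LookupSketch {
  // Listed by hand here; a real system would discover these at runtime.
  val registered: Seq[Register] = Seq(new KafkaLikeProvider, new CsvLikeProvider)

  // Made-up legacy name mapped to a current class name.
  val backwardCompatibility: Map[String, String] =
    Map("com.example.legacy.csv" -> classOf[CsvLikeProvider].getName)

  def lookup(provider: String): Class[_] = {
    // Step 1: translate legacy names first.
    val provider1 = backwardCompatibility.getOrElse(provider, provider)
    // Step 2: try a case-insensitive short-name match.
    registered.filter(_.shortName.equalsIgnoreCase(provider1)).toList match {
      case head :: Nil => head.getClass
      case Nil =>
        // Step 3: treat the name as a fully qualified class name.
        val loader = getClass.getClassLoader
        Try(loader.loadClass(provider1))
          .orElse(Try(loader.loadClass(s"$provider1.DefaultSource")))
          .getOrElse(throw new ClassNotFoundException(s"Failed to find data source: $provider"))
      case many =>
        throw new IllegalStateException(s"Multiple sources found for $provider: $many")
    }
  }
}
```

With this sketch, lookup("KAFKA") resolves through the short name, lookup("com.example.legacy.csv") through the compatibility map, and a fully qualified name such as a custom provider class goes straight to the class loader.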
