Part I. Streaming example
Read lines from a Kafka topic, count the words, and write the running counts to the console.
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.streaming.{DataStreamReader, DataStreamWriter}
import spark.implicits._  // spark: SparkSession; bootstrapServers, subscribeType, topics and checkpointLocation are assumed to be defined

val dataStreamReader: DataStreamReader = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option(subscribeType, topics)
val lines: Dataset[String] = dataStreamReader
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
// Generate running word count
val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()
// Start running the query that prints the running counts to the console
val dataStreamWriter: DataStreamWriter[Row] = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", checkpointLocation)
val query = dataStreamWriter.start()
query.awaitTermination()
The example breaks down into the following steps:
① DataStreamReader.load() looks up the source and creates a DataFrame
② transformations on the data (DataFrame)
③ DataStreamWriter creates the sink and starts the long-running streaming query thread
DataStreamReader.load() and DataStreamWriter.start() are the focus of this article.
The transformation step on the data (Dataset) uses the ordinary Spark SQL APIs (selectExpr, select, map, flatMap, etc.) and is not analyzed further here.
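For completeness, the same running word count can also be written with the untyped DataFrame API instead of flatMap over a typed Dataset; a minimal sketch (it reuses lines from the example above and assumes spark.implicits._ is imported):

import org.apache.spark.sql.functions.{explode, split}

// Split each Kafka value into words, produce one row per word, then aggregate.
// Equivalent to lines.flatMap(_.split(" ")).groupBy("value").count() above.
val wordCountsDf = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word")
  .count()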
Part II. Finding the Source and creating the DataFrame via DataStreamReader.load()
1. Setting DataStreamReader parameters
The reader's parameters are set through the format() and option() methods that DataStreamReader exposes.
The options set this way are kept in extraOptions: HashMap[String, String] (a short usage sketch follows the class excerpt below).
final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {

  // Overwrites the default source name (spark.sql.sources.default) with the user-specified format.
  def format(source: String): DataStreamReader = {
    this.source = source
    this
  }

  // Each option is accumulated into the extraOptions map.
  def option(key: String, value: String): DataStreamReader = {
    this.extraOptions += (key -> value)
    this
  }

  private var source: String = sparkSession.sessionState.conf.defaultDataSourceName

  private var userSpecifiedSchema: Option[StructType] = None

  private var extraOptions = new scala.collection.mutable.HashMap[String, String]
}
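For illustration, here is a minimal sketch of how the builder calls in Part I map onto this internal state (the broker address and topic name are placeholders):

// format() overwrites the default source name with "kafka";
// each option() call appends one key/value pair to extraOptions.
// Nothing is validated or connected yet; that only happens later in load()/start().
val reader = spark.readStream
  .format("kafka")                                    // source = "kafka"
  .option("kafka.bootstrap.servers", "host1:9092")    // extraOptions += ("kafka.bootstrap.servers" -> "host1:9092")
  .option("subscribe", "topic1")                      // extraOptions += ("subscribe" -> "topic1")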
2. DataStreamReader.load(): looking up the source and creating the DataFrame
The main steps of load() are:
· DataSource.lookupDataSource(source, conf) looks up the provider class for the source (e.g. kafka, file, or a custom class such as com.it.provider.source.HbaseSource) and instantiates it
· v1DataSource and v1Relation are built; v1Relation is only used to construct StreamingRelationV2, and the batch code path never hits the StreamingRelationV2-related cases
· Dataset.ofRows() is called to create the DataFrame, wrapping either StreamingRelation(v1DataSource) or a StreamingRelationV2 built on top of it
def load(): DataFrame = {
  if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
    throw new AnalysisException("Hive data source can only be used with tables, you can not " +
      "read files of Hive data source directly.")
  }

  val ds = DataSource.lookupDataSource(source, sparkSession.sqlContext.conf).newInstance()
  val options = new DataSourceOptions(extraOptions.asJava)
  val v1DataSource = DataSource(
    sparkSession,
    userSpecifiedSchema = userSpecifiedSchema,
    className = source,
    options = extraOptions.toMap)
  val v1Relation = ds match {
    case _: StreamSourceProvider => Some(StreamingRelation(v1DataSource))
    case _ => None
  }
  ds match {
    case s: MicroBatchReadSupport =>
      val tempReader = s.createMicroBatchReader(
        Optional.ofNullable(userSpecifiedSchema.orNull),
        Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
        options)
      Dataset.ofRows(
        sparkSession,
        StreamingRelationV2(
          s, source, extraOptions.toMap,
          tempReader.readSchema().toAttributes, v1Relation)(sparkSession))
    case s: ContinuousReadSupport =>
      val tempReader = s.createContinuousReader(
        Optional.ofNullable(userSpecifiedSchema.orNull),
        Utils.createTempDir(namePrefix = s"temporaryReader").getCanonicalPath,
        options)
      Dataset.ofRows(
        sparkSession,
        StreamingRelationV2(
          s, source, extraOptions.toMap,
          tempReader.readSchema().toAttributes, v1Relation)(sparkSession))
    case _ =>
      // Code path for data source v1.
      Dataset.ofRows(sparkSession, StreamingRelation(v1DataSource))
  }
}
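A simple way to observe which branch was taken, before any query starts, is to print the Dataset's logical plan; a small sketch (broker and topic are placeholders, and the exact plan text varies with the Spark version):

// For a v2-capable source such as the built-in Kafka provider the plan contains a
// StreamingRelationV2 node; for a v1-only StreamSourceProvider it contains StreamingRelation.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .load()
df.explain(extended = true)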
I) DataSource.lookupDataSource(): finding the source provider class by its fully qualified class name or a predefined short name
DataSource.lookupDataSource() resolves the provider class in three ways:
· Built-in sources such as the Kafka source are matched by _.shortName().equalsIgnoreCase(provider1); here the input is "kafka".
· For the json, csv and parquet formats, the provider mapping is defined in backwardCompatibilityMap.
· A custom provider is resolved by loader.loadClass(provider1), i.e. by its fully qualified class name on the classpath, for example specifying the source as org.apache.spark.sql.usersource.HbaseSourceProvider (a minimal provider sketch follows this list).
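As an illustration of the first and third cases, below is a minimal sketch of a custom v1 provider (all class, option and column names are hypothetical). Implementing DataSourceRegister and listing the class in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister lets lookupDataSource() match it by short name, just like the built-in "kafka" source; without that registration it can still be used by passing its fully qualified class name to format().

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.StructType

class HbaseSourceProvider extends StreamSourceProvider with DataSourceRegister {

  // Enables spark.readStream.format("hbase") once the class is registered as a service.
  override def shortName(): String = "hbase"

  // Reports the schema of the source before the query starts.
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), schema.getOrElse(new StructType().add("value", "string")))

  // Builds the actual streaming Source; the implementation is omitted here.
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = ???
}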
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {