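1. Based on the official demo (cdh2.4.0)
2. Demonstration
The computed result grows incrementally with each iteration.
After the session starts, it stays in a listening state and periodically updates its state; when the source has no data, it goes to sleep.
Code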
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local")
      .appName("StructuredNetworkWordCount")
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    import spark.implicits._

    // Create a DataFrame representing the stream of input lines
    // read from the socket connection (host and port are set below)
    val lines = spark.readStream
      .format("socket")
      .option("host", "192.168.50.135")
      .option("port", 9999)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    // Three output modes:
    // 1 complete: the whole result table is output
    // 2 append: only newly added rows are output
    // 3 update: only updated rows are output
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
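There are three output modes for writing the result:
- Complete mode - The entire updated result table is written to external storage. It is up to the storage connector to decide how to handle writing the entire table.
- Append mode - Only the new rows appended to the result table since the last trigger are written to external storage. This applies only to queries where existing rows in the result table are not expected to change.
- Update mode - Only the rows that were updated in the result table since the last trigger are written to external storage (available since Spark 2.1.1). Note that this differs from Complete mode in that it outputs only the rows that have changed since the last trigger. If the query contains no aggregations, it is equivalent to Append mode.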
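The difference between Complete mode and Update mode can be sketched without Spark at all. The block below is only a plain-Scala simulation of a running word count over two micro-batches; the object name, the helper functions, and the sample batches are all made up for illustration and are not part of the Spark API.

```scala
// Plain-Scala simulation (no Spark involved) of what a sink would receive
// per trigger for a running word count. All names here are illustrative.
object OutputModeSketch {
  type Counts = Map[String, Long]

  // Fold one micro-batch of words into the running result table.
  def updateCounts(state: Counts, batch: Seq[String]): Counts =
    batch.foldLeft(state) { (acc, w) => acc.updated(w, acc.getOrElse(w, 0L) + 1L) }

  // Complete mode: the whole result table is emitted on every trigger.
  def completeOutput(after: Counts): Counts = after

  // Update mode: only rows that are new or whose count changed are emitted.
  def updateOutput(before: Counts, after: Counts): Counts =
    after.filter { case (w, c) => !before.get(w).contains(c) }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b", "a"), Seq("b", "c")) // assumed sample input
    var state: Counts = Map.empty
    for ((batch, i) <- batches.zipWithIndex) {
      val next = updateCounts(state, batch)
      println(s"trigger $i complete: ${completeOutput(next).toSeq.sorted}")
      println(s"trigger $i update:   ${updateOutput(state, next).toSeq.sorted}")
      state = next
    }
  }
}
```

On the second trigger the simulated Complete output still contains the unchanged row for "a", while the simulated Update output contains only the rows for "b" and "c".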
Append mode supports queries that use only: select, where, map, flatMap, filter, join
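As a rough illustration of a query that fits Append mode, the sketch below uses only flatMap and filter, with no aggregation, so rows already written never change. The object name, the longWords helper, the minimum word length, and the host "localhost" are assumptions made for this example; something must be listening on that port (for instance a netcat server) before the query starts.

```scala
import org.apache.spark.sql.SparkSession

object AppendModeSketch {
  // Hypothetical helper: split a line and keep words longer than 3 characters.
  def longWords(line: String): Seq[String] =
    line.split(" ").filter(_.length > 3).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local")
      .appName("AppendModeSketch")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    import spark.implicits._

    // Same socket source as the word count demo; host and port are assumed.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Only flatMap/filter are used, so each result row is final once emitted.
    val words = lines.as[String].flatMap(line => longWords(line))

    // Append mode: each trigger prints only the rows added since the last one.
    val query = words.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```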
Input Sources
There are a few built-in sources.
- File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
- Kafka source - Reads data from Kafka. It’s compatible with Kafka broker versions 0.10.0 or higher. See the Kafka Integration Guide for more details.
- Socket source (for testing) - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees.
- Rate source (for testing) - Generates data at the specified number of rows per second, each output row contains a timestamp and value. Where timestamp is a Timestamp type containing the time of message dispatch, and value is of Long type containing the message count, starting from 0 as the first row. This source is intended for testing and benchmarking.
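Of the sources listed above, the rate source is the easiest to try because it needs no external system. The following is a minimal sketch under stated assumptions: the object name, the rate of 5 rows per second, the even-value filter, and the 10 second run time are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RateSourceSketch {
  // Assumed values chosen only for this illustration.
  val RowsPerSecond: Long = 5L
  val EvenValues: String = "value % 2 = 0"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local")
      .appName("RateSourceSketch")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // The rate source emits rows with two columns: timestamp and value.
    val rate = spark.readStream
      .format("rate")
      .option("rowsPerSecond", RowsPerSecond)
      .load()

    rate.printSchema()

    // A plain where clause has no aggregation, so Append mode applies.
    val evens = rate.where(EvenValues)

    val query = evens.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .start()

    // Let the query run for about 10 seconds, then shut everything down.
    query.awaitTermination(10000L)
    query.stop()
    spark.stop()
  }
}
```

Swapping format("rate") for format("socket") with host and port options gives the same pipeline over the socket source used in the word count demo.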