10.2 Running WordCount with Spark 2 Structured Streaming

Latest recommended article published 2024-05-03 17:09:47

我的海_

Article link: https://blog.csdn.net/kk25114/article/details/98770576

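1. Reference: the official demo (cdh2.4.0)

2. Demonstration

The result is computed incrementally: each new batch of input updates the running counts rather than recomputing them from scratch.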

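After the session starts, the query stays in a listening state and periodically updates its state. When no source data arrives, it goes idle (sleeps) until the next trigger.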
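
The incremental behaviour described above can be illustrated without Spark. The following is only a plain-Scala sketch of the idea, not how Spark implements its state store: `RunningCountSketch`, `updateState`, and the sample batches are hypothetical names and data chosen for illustration.

```scala
// Plain-Scala illustration (no Spark): fold micro-batches into a running word count.
object RunningCountSketch {
  // Merge one batch of input lines into the existing state (word -> total count).
  def updateState(state: Map[String, Long], batch: Seq[String]): Map[String, Long] =
    batch
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .foldLeft(state) { (acc, word) =>
        acc.updated(word, acc.getOrElse(word, 0L) + 1L)
      }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: three triggers, the second one without any new data.
    val batches = Seq(
      Seq("hello spark", "hello"),
      Seq.empty[String],
      Seq("spark streaming")
    )
    // scanLeft keeps the state after every trigger, so the growth is visible.
    val states = batches.scanLeft(Map.empty[String, Long])(updateState)
    states.tail.zipWithIndex.foreach { case (state, i) =>
      println(s"after batch $i: ${state.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

An empty batch leaves the state unchanged, which mirrors the idle period when the source delivers no data.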

Code

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {


    val spark = SparkSession
      .builder
      .master("local")
      .appName("StructuredNetworkWordCount")
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    import spark.implicits._
    // Create a DataFrame representing the stream of input lines read from the socket at 192.168.50.135:9999
    val lines = spark.readStream
      .format("socket")
      .option("host", "192.168.50.135")
      .option("port", 9999)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
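    // Three output modes:
    // 1 complete  the entire result table is output on every trigger
    // 2 append    only newly added rows are output
    //             (rejected for this query: an aggregation without a watermark)
    // 3 update    only rows that changed are output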
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}

There are three output modes, which control what is written to the sink on each trigger:

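Complete mode - The entire updated result table is written to external storage. It is up to the storage connector to decide how to handle writing the entire table.
Append mode - Only the new rows appended to the result table since the last trigger are written to external storage. This applies only to queries where existing rows in the result table are not expected to change.
Update mode - Only the rows that were updated in the result table since the last trigger are written to external storage (available since Spark 2.1.1). Note that this differs from complete mode in that it outputs only the rows that have changed since the last trigger. If the query contains no aggregations, it is equivalent to append mode.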
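
To make the difference between the three modes concrete, here is a simplified plain-Scala model of what each mode would emit for a single trigger. It is a sketch under stated assumptions, not Spark code: `OutputModeSketch` and its functions are hypothetical, and the `append` function ignores watermarks and simply treats "appended" as "key did not exist before".

```scala
// Plain-Scala illustration (no Spark) of what each output mode emits for one trigger.
object OutputModeSketch {
  type Table = Map[String, Long]

  // Complete: the whole result table as it stands after the trigger.
  def complete(before: Table, after: Table): Table = after

  // Update: rows that are new or whose value changed since the last trigger.
  def update(before: Table, after: Table): Table =
    after.filter { case (word, count) => !before.get(word).contains(count) }

  // Append (simplified): only rows whose key did not exist before.
  // This is only meaningful when existing rows never change.
  def append(before: Table, after: Table): Table =
    after.filter { case (word, _) => !before.contains(word) }

  def main(args: Array[String]): Unit = {
    // Hypothetical result table before and after one trigger.
    val before: Table = Map("hello" -> 2L, "spark" -> 1L)
    val after: Table  = Map("hello" -> 3L, "spark" -> 1L, "streaming" -> 1L)

    println(s"complete: ${complete(before, after)}")
    println(s"update:   ${update(before, after)}")
    println(s"append:   ${append(before, after)}")
  }
}
```

In this model the changed row for `hello` shows up under update but not under append, which is why append mode does not suit a running count whose existing rows keep changing.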

Append mode supports queries that use only select, where, map, flatMap, filter, join, and similar operations.
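
As an illustration, a query built only from flatMap, filter, and map never changes rows it has already emitted, so it can be written in append mode. The sketch below assumes a text server is listening on `localhost:9999`. The object name, the `longWords` helper, and the "longer than three characters" rule are hypothetical choices for the example.

```scala
import org.apache.spark.sql.SparkSession

object AppendModeSketch {
  // Hypothetical transformation: keep words longer than 3 characters, upper-cased.
  def longWords(line: String): Seq[String] =
    line.split(" ").filter(_.length > 3).map(_.toUpperCase).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("AppendModeSketch")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    import spark.implicits._

    // Assumed test endpoint; replace host and port with your own.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // No aggregation: every output row is final once it has been produced.
    val words = lines.as[String].flatMap(line => longWords(line))

    val query = words.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Switching this query to `complete` would fail, because complete mode requires an aggregation. Conversely, the word count above cannot use `append` without a watermark.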

Input Sources

There are a few built-in sources.

File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Kafka source - Reads data from Kafka. It’s compatible with Kafka broker versions 0.10.0 or higher. See the Kafka Integration Guide for more details.
Socket source (for testing) - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees.
Rate source (for testing) - Generates data at the specified number of rows per second. Each output row contains a timestamp and a value, where timestamp is a Timestamp type containing the time of message dispatch, and value is of Long type containing the message count, starting from 0 as the first row. This source is intended for testing and benchmarking.
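
The rate source is a convenient way to try a streaming query without setting up a socket server. The following is a minimal sketch. The object name, the `rateStream` helper, the rate of 5 rows per second, the even-value filter, and the 10-second run time are all arbitrary choices for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RateSourceSketch {
  // Build a streaming DataFrame with columns `timestamp` and `value`.
  def rateStream(spark: SparkSession, rowsPerSecond: Long): DataFrame =
    spark.readStream
      .format("rate")
      .option("rowsPerSecond", rowsPerSecond)
      .load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("RateSourceSketch")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // Keep only even counter values; no aggregation, so append mode applies.
    val evens = rateStream(spark, 5L).where("value % 2 = 0")

    val query = evens.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .start()

    // Run for roughly 10 seconds, then shut down cleanly.
    query.awaitTermination(10000L)
    query.stop()
    spark.stop()
  }
}
```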