Spark 2.2.0 Study Notes 5: SparkStreamingWordCountDemo

undergrowth

Article link: https://blog.csdn.net/undergrowth/article/details/78887863

Info

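Spark Streaming: the Spark component for stream processing of real-time data, built on a micro-batch architecture.

Spark Streaming uses the discretized stream, called a DStream, as its abstraction.
A DStream is the sequence of data received over time. It supports two kinds of operations:
- Transformations, which produce a new DStream
  - Stateless: processing each batch does not depend on data from earlier batches
  - Stateful: computing the current batch requires data or intermediate results from earlier batches
- Output operations, which write data to an external system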
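To make the stateless/stateful distinction concrete, here is a minimal sketch that imitates micro-batches with plain Scala collections. It does not use the Spark API: the object name `MicroBatchSketch`, the sample `batches`, and the helpers `countBatch` and `merge` are all made up for illustration.

```scala
object MicroBatchSketch {
  // Assumption: each inner Seq stands for the lines received in one batch interval.
  val batches: Seq[Seq[String]] = Seq(
    Seq("hello world"),
    Seq("hello spark", "hello"),
    Seq("world")
  )

  // Stateless: count the words of one batch, ignoring every other batch.
  def countBatch(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Stateful: merge one batch's counts into the totals carried over from earlier batches.
  def merge(totals: Map[String, Int], batch: Map[String, Int]): Map[String, Int] =
    batch.foldLeft(totals) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  def main(args: Array[String]): Unit = {
    val perBatch = batches.map(countBatch)
    // scanLeft threads the running totals through the batches; tail drops the empty start state.
    val running = perBatch.scanLeft(Map.empty[String, Int])(merge).tail
    // Output operation: here the "external system" is simply the console.
    perBatch.zip(running).zipWithIndex.foreach { case ((b, r), i) =>
      println(s"batch $i stateless=$b stateful=$r")
    }
  }
}
```

The stateless result for a batch depends only on that batch, while the stateful result for the same batch also depends on everything that arrived before it.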

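Example

Extract nc.rar, then run nc -L -p 9999 -v in cmd to start a Netcat server listening on port 9999.
nc.rar is located at doc\software\nc.rar in this code directory.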
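If nc is not available, a small TCP server can stand in for it. The following is only a sketch, not part of the original project: the object `LineFeeder`, the helper `serveOnce`, and the sample lines are hypothetical, and port 9999 is assumed to match the nc command above.

```scala
import java.io.PrintWriter
import java.net.ServerSocket

object LineFeeder {
  // Accepts one client on `server` and sends it each line, terminated by '\n'.
  def serveOnce(server: ServerSocket, lines: Seq[String]): Unit = {
    val client = server.accept()
    try {
      val out = new PrintWriter(client.getOutputStream)
      lines.foreach { line =>
        out.write(line + "\n")
        out.flush() // push each line to the client immediately
      }
    } finally client.close()
  }

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999) // assumed port, same as the nc command
    try serveOnce(server, Seq("hello world", "hello spark"))
    finally server.close()
  }
}
```

Unlike an interactive nc session, this sketch sends a fixed set of lines and then closes the connection, so it suits a quick smoke test more than manual experimentation.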

Code

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package spark30.streaming

import org.apache.spark.storage.StorageLevel
import spark30.basic.{SparkContextUtil, StreamingExamples}

/**
  * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
  *
  * Usage: SparkStreamingWordCountDemo <hostname> <port>
  * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
  *
  * To run this on your local machine, you need to first run a Netcat server
  * `$ nc -lk 9999`
  * and then run the example
  * `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
  */
object SparkStreamingWordCountDemo {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: SparkStreamingWordCountDemo <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val ssc = SparkContextUtil.getStreamingContext("NetworkWordCount")

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note: a non-replicated storage level is acceptable only when running locally.
    // In a distributed deployment, replication is needed for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

// scalastyle:on println
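The demo above is stateless: each one-second batch is counted on its own. For comparison, here is a hedged sketch of a stateful variant that keeps running totals with updateStateByKey. The object name `StatefulWordCountSketch`, the helper `updateCount`, the checkpoint directory "checkpoint", and the hard-coded localhost:9999 are assumptions for illustration, not part of the original project.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCountSketch {
  // Adds the counts seen for one word in the current batch to its running total.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(newValues.sum + running.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Stateful transformations require a checkpoint directory; this path is an assumption.
    ssc.checkpoint("checkpoint")

    val lines = ssc.socketTextStream("localhost", 9999) // assumed host and port
    val totals = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateCount _)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this variant, typing the same word in two different seconds should print an accumulated count, whereas the original demo starts again from zero for every batch.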

Part of the code in SparkContextUtil

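import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: a receiver-based stream needs at least two local threads,
// one for the receiver and one for processing the received batches.
val master2: String = "local[2]"

// Builds a StreamingContext with a 1 second batch interval.
def getStreamingContext(appName: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName(appName).setMaster(master2)
  new StreamingContext(sparkConf, Seconds(1))
}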