Spark Components: Spark Streaming Study 1 -- NetworkWordCount

Article link: https://blog.csdn.net/xubo245/article/details/51251970

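This article shows how to use Apache Spark Streaming for real-time text processing: the job receives data over the network and prints a fresh word count every second. It covers two ways to run the job and includes the example code.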

For more code, see: https://github.com/xubo245/SparkLearning

NetworkWordCount: every 1 second, runs a word count over the data received in that batch; counts are not accumulated across batches.
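
To make "not accumulated" concrete, here is a minimal sketch in plain Scala collections (no Spark required) that mimics what happens to each batch. The names `BatchWordCount` and `countBatch` and the sample batches are hypothetical, chosen only for illustration; the real job does the same thing on DStreams.

```scala
// Hypothetical illustration in plain Scala; this is not the Spark Streaming API.
object BatchWordCount {
  // Count the words in the lines of one batch the same way the job does:
  // split each line on a single space, then count occurrences of each word.
  def countBatch(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    // Each inner Seq stands for the lines received during one 1-second batch.
    val batches = Seq(
      Seq("a", "hello", "world"),
      Seq("a a a a a")
    )
    // Every batch is counted on its own; nothing carries over to the next one.
    batches.foreach(batch => println(countBatch(batch)))
  }
}
```

The "a" seen in the first batch does not add to the count of "a" in the second one, which is why the same word can show up with a count of 1 in several consecutive batches.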

Usage

1. Method 1: launch the bundled example on the cluster

In one terminal:

nc -lk 9999

Type data into this terminal; the other terminal, which runs the Spark job, prints the word counts.

In another terminal:

./bin/run-example streaming.NetworkWordCount localhost 9999

2. Method 2: package the code as a jar, upload it, and run it:

Run script:

    #!/usr/bin/env bash
    spark-submit --name WordCountSpark \
      --class org.apache.spark.Streaming.learning.NetworkWordCount \
      --master spark://Master:7077 \
      --executor-memory 512M \
      --total-executor-cores 10 Streaming.jar localhost 9999

Then run nc in one terminal and this script in another, as in method 1.

Input data:

hadoop@Master:~$ sudo nc -lk 9999
a
hello
world

a
hello
world
hello
hw^Hello
word
a
a
a
a
a
a
a

Output:

hadoop@Master:~/cloud/testByXubo/spark/Streaming$ ./submitJob.sh 
-------------------------------------------                                     
Time: 1461661853000 ms
-------------------------------------------

-------------------------------------------
Time: 1461661854000 ms
-------------------------------------------
(,1)
(hello,1)
(world,1)
(a,1)

-------------------------------------------
Time: 1461661855000 ms
-------------------------------------------
(a,1)

-------------------------------------------
Time: 1461661856000 ms
-------------------------------------------

-------------------------------------------
Time: 1461661857000 ms
-------------------------------------------
(hello,1)

-------------------------------------------
Time: 1461661858000 ms
-------------------------------------------
(world,1)

-------------------------------------------
Time: 1461661859000 ms
-------------------------------------------

-------------------------------------------
Time: 1461661860000 ms
-------------------------------------------
(hello,1)

-------------------------------------------
Time: 1461661861000 ms
-------------------------------------------

-------------------------------------------
Time: 1461661862000 ms
-------------------------------------------
(hello,1)

-------------------------------------------
Time: 1461661863000 ms
-------------------------------------------
(word,1)

-------------------------------------------                                     
Time: 1461661864000 ms
-------------------------------------------
(a,5)

-------------------------------------------                                     
Time: 1461661865000 ms
-------------------------------------------
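
The `(,1)` entry in the second batch comes from the blank line in the input: splitting an empty string on `" "` yields one empty token, which is then counted like any other word. The sketch below demonstrates this in plain Scala; `BlankLineSplit`, `tokens`, and the `dropEmpty` flag are hypothetical names, and filtering out empty tokens is a suggested variation, not part of the original example.

```scala
// Hypothetical helper illustrating how String.split treats a blank line.
object BlankLineSplit {
  // Split on a single space, as the job does; optionally drop empty tokens.
  def tokens(line: String, dropEmpty: Boolean): List[String] = {
    val parts = line.split(" ").toList
    if (dropEmpty) parts.filter(_.nonEmpty) else parts
  }

  def main(args: Array[String]): Unit = {
    // A blank line still produces one (empty) token.
    println(tokens("", dropEmpty = false).size)
    // With filtering, the blank line contributes nothing.
    println(tokens("", dropEmpty = true).size)
  }
}
```

In the streaming job, the equivalent change would be adding `.filter(_.nonEmpty)` after the `flatMap` that splits the lines.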

Code:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.Streaming.learning

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println

Reference:

[1] http://spark.apache.org/docs/1.5.2/streaming-programming-guide.html