First, we import the Spark Streaming classes and some implicit conversions from StreamingContext into our environment, so that useful methods are added to other classes we need (such as DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with four worker threads and a batch interval of 10 seconds.
val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(10)) // compute word counts every 10 seconds
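As a plain-Scala sketch of what a 10-second batch interval means (no Spark required; the timestamps and records below are made up for illustration): incoming records are grouped by the 10-second window their arrival time falls into, and each group is then processed as one small batch.

```scala
// Hypothetical arrival times (seconds since start) paired with records.
val arrivals = Seq((1L, "a"), (4L, "b"), (12L, "c"), (19L, "d"), (23L, "e"))

// Group records into 10-second batches, as a StreamingContext created with
// Seconds(10) conceptually does before running a job on each batch.
val batches: Map[Long, Seq[String]] =
  arrivals.groupBy { case (t, _) => t / 10 }          // batch index = t / 10
          .map { case (k, v) => k -> v.map(_._2) }

// batches(0) == Seq("a", "b"); batches(1) == Seq("c", "d"); batches(2) == Seq("e")
```

Each such batch is turned into one RDD, which is why a longer interval trades latency for larger, more efficient batches.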
Using this context, we can create a DStream that represents streaming data from a TCP source, specified as a hostname (e.g. localhost) and a port (e.g. 9998).
// create a DStream connected to 127.0.0.1:9998
val lines = ssc.socketTextStream("127.0.0.1", 9998, StorageLevel.MEMORY_AND_DISK_SER)
This lines DStream represents the stream of data that will be received from the data server. Each record in this DStream is a line of text. Next, we want to split each line into words by the space character.
val words = lines.flatMap(_.split(" "))
flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line is split into multiple words, and the stream of words is represented as the words DStream. Next, we want to count these words.
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
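To see what these transformations compute on a single batch, here is the same pipeline run on a plain Scala collection (a hypothetical batch of two lines): flatMap splits lines into words, map pairs each word with 1, and the reduce step sums counts per key, which is what reduceByKey does within each batch. Collections have no reduceByKey, so groupBy plus a sum stands in for it here.

```scala
val lines = Seq("A B A", "B C")                      // one hypothetical batch

val words = lines.flatMap(_.split(" "))              // Seq("A", "B", "A", "B", "C")
val pairs = words.map(x => (x, 1))                   // Seq(("A",1), ("B",1), ...)
// groupBy + sum over values is the collections equivalent of reduceByKey(_ + _)
val wordCounts = pairs.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }

// wordCounts: Map("A" -> 2, "B" -> 2, "C" -> 1)
```

In the streaming job, this computation runs once per 10-second batch, so the printed counts cover only the words received in that interval.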
At this point Spark Streaming has only set up the computation it will perform; nothing has actually started. To begin processing after all the transformations have been set up, we call the following methods:
ssc.start() // start the computation
ssc.awaitTermination() // wait for manual termination; otherwise the job runs indefinitely
Next, we write data to socket port 9998:
package streaming.helloworld

import java.io.PrintWriter
import java.net.ServerSocket

/**
 * Created by Administrator on 2017/5/26.
 * A simple socket server that writes a random uppercase letter (A-G)
 * to every connected client once every 500 ms.
 */
object GenerateChar {
  // build the array 'A'..'Z' and return the letter at the given index
  def generateContext(index: Int): String = {
    import scala.collection.mutable.ListBuffer
    val charList = ListBuffer[Char]()
    for (i <- 65 to 90)
      charList += i.toChar
    val charArray = charList.toArray
    charArray(index).toString
  }

  // a random index in [0, 7), i.e. one of the first seven letters A-G
  def index = {
    import java.util.Random
    val rdm = new Random
    rdm.nextInt(7)
  }

  def main(args: Array[String]) {
    val listener = new ServerSocket(9998)
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run() = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(500)
            val context = generateContext(index) // a random letter from the first seven letters of the alphabet
            println(context)
            out.write(context + '\n')
            out.flush()
          }
        }
      }.start()
    }
  }
}
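The generator above builds the full A-Z array but only ever indexes the first seven entries, because index is drawn with nextInt(7). A quick standalone check of that logic (same two pieces, minus the socket code; scala.util.Random stands in for java.util.Random here):

```scala
import scala.util.Random

// same construction as in generateContext: codes 65..90 are 'A'..'Z'
val charArray = (65 to 90).map(_.toChar).toArray

def generateContext(index: Int): String = charArray(index).toString

// nextInt(7) yields 0..6, so only the letters A-G can ever be produced
val samples = Seq.fill(1000)(Random.nextInt(7)).map(generateContext)
val distinct = samples.toSet
// distinct is always a subset of Set("A", "B", "C", "D", "E", "F", "G")
```

This also means the word-count output of the streaming job will only ever contain these seven letters as keys.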
The source of NetworkWordCount.scala:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// scalastyle:off println
package org.apache.spark.examples.streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
object NetworkWordCount {
  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels()
    // create a local StreamingContext with four worker threads
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(10)) // compute word counts every 10 seconds
    // create a DStream connected to 127.0.0.1:9998
    val lines = ssc.socketTextStream("127.0.0.1", 9998, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start() // start the computation
    ssc.awaitTermination() // wait for manual termination; otherwise the job runs indefinitely
  }
}
// scalastyle:on println