Preface
The previous post covered the basic concepts of Spark Streaming. Here we use IDEA to write a Spark Streaming WordCount that reads its data from a socket stream.
1. Writing NetWordCount in IDEA
On top of the existing Spark Core project, add the Spark Streaming dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
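The ${spark.version} placeholder assumes a corresponding property is already defined in the pom, for example (the version number below is illustrative; it should match your cluster and the _2.11 Scala suffix of the artifact):

<properties>
    <spark.version>2.2.0</spark.version>
</properties>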
The complete NetWordCount code is as follows:
package com.m.jd.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream(args(0), args(1).toInt)

    // Split each line into words
    val words = lines.flatMap(_.split(" "))

    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
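Each one-second micro-batch applies the same flatMap → map → reduceByKey pipeline to whatever lines arrived in that interval. The per-batch logic can be sketched on plain Scala collections, with no Spark required (the object and method names here are illustrative, not part of the job above):

```scala
object BatchWordCountSketch {
  // Same pipeline as the DStream version, but on an in-memory Seq:
  // groupBy + sum plays the role of reduceByKey for a single batch.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))      // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(countWords(Seq("hadoop kkkk", "hadoop spark")))
  }
}
```

Running this on the two sample lines yields counts of 2 for "hadoop" and 1 each for "kkkk" and "spark", mirroring what a single Spark batch would print.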
2. Running the job
Package the project the same way as in the Spark Core posts, upload the jar to the Spark machine, and run:
bin/spark-submit --class com.m.jd.streaming.NetWordCount /opt/spark-jar/networdcount-jar-with-dependencies.jar hadoop0 9999
The command above connects to hadoop0:9999 on startup. If you have not already started a listener there with
$ nc -lk 9999
the log will show connection-refused errors for hadoop0:9999, so it is best to run the nc command on hadoop0 first. If the job produces too much log output at runtime, change the log level in the log4j file under Spark's conf directory to WARN.
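Concretely, in conf/log4j.properties (copy it from log4j.properties.template if it does not exist yet), change the root logger line to:

log4j.rootCategory=WARN, console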
The output looks like this:
[root@hadoop0 ~]# nc -lk 9999
hadoop kkkk
-------------------------------------------
Time: 1532682169000 ms
-------------------------------------------
(kkkk,1)
(hadoop,1)
For more details, see the official documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example