Spark Streaming example: counting words in text files read from a watched directory
Development and runtime environment
IDEA 2018.2
jdk1.8.0_151
scala-2.11.12
spark_2.4.0
Linux centos 3.10.0-327.el7.x86_64 GNU/Linux
Spark version
In pom.xml, configure Spark version 2.4.0:
<groupId>SparkLearn</groupId>
<artifactId>spark-learn</artifactId>
<version>1.0</version>

<properties>
  <spark.version>2.4.0</spark.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>
Scala class
Create a Scala class, FileWordCount.scala, that watches a specified directory: whenever a new file is created in, copied into, or moved into that directory, it counts the words in the file's text.
Directory structure
FileWordCount.scala:
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileWordCount {
  // Directory to watch for new text files
  val path = "/opt/spark-2.4.0/out"

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FileWordCount")
    // StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create a file input DStream on the directory and use the
    // stream to count words in newly created files
    val lines1 = ssc.textFileStream(path)
    val words1 = lines1.flatMap(line => line.split(" "))
    val count1 = words1.map((_, 1)).reduceByKey(_ + _)
    count1.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
In this code, a StreamingContext with a 5-second batch interval reads new text files from the /opt/spark-2.4.0/out directory. flatMap splits each line's content on spaces, map pairs each word with a count of 1, reduceByKey sums the counts of identical words, and print writes the results to the console.
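The flatMap → map → reduce pipeline above can be sketched on a plain Scala collection, with no Spark dependency; here groupBy plus a per-group sum plays the role of reduceByKey, and the sample lines are made up for illustration:

```scala
// Word count over an in-memory collection, mirroring the DStream pipeline:
// flatMap splits lines into words, map pairs each word with 1,
// and groupBy + sum is the local analogue of reduceByKey.
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" "))
    .map((_, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// Sample lines standing in for one batch read from the watched directory
val sample = Seq("hello spark streaming", "hello spark")
println(wordCount(sample))
```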
Build the jar with Maven
Package the Scala class source into a jar with Maven (e.g. mvn clean package).
Upload the jar
Upload the built spark-learn-1.0.jar to a directory on the server where Spark runs; here it goes to /opt/spark-2.4.0/lib:
[root@centos lib]# pwd
/opt/spark-2.4.0/lib
[root@centos lib]# ll
total 316
-rw-r--r--. 1 root root 321620 Dec 20 16:14 spark-learn-1.0.jar
Start Spark
On the server where Spark is installed, change into Spark's sbin directory and start Spark:
[root@centos sbin]# pwd
/opt/spark-2.4.0/sbin
[root@centos sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out
failed to launch: nice -n 0 /opt/spark-2.4.0/bin/spark-class org.apache.spark.deploy.master.Master --host centos --port 7077 --webui-port 8080
full log in /opt/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out
.....
Submit the job
Compose the command to submit the job:
spark-submit \
--class org.apache.spark.examples.streaming.FileWordCount \
--master spark://centos:7077 \
--executor-memory 4G \
--total-executor-cores 2 \
/opt/spark-2.4.0/lib/spark-learn-1.0.jar
spark://centos:7077 is the address of the Spark master service.
Running the command above from /opt/spark-2.4.0/sbin starts the Spark job and prints logs like:
2018-12-20 16:50:43 INFO SparkContext:54 - Running Spark version 2.4.0
2018-12-20 16:50:43 INFO SparkContext:54 - Submitted application: FileWordCount
2018-12-20 16:50:43 INFO SecurityManager:54 - Changing view acls to: root
2018-12-20 16:50:43 INFO SecurityManager:54 - Changing modify acls to: root
2018-12-20 16:50:43 INFO SecurityManager:54 - Changing view acls groups to:
2018-12-20 16:50:43 INFO SecurityManager:54 - Changing modify acls groups to:
2018-12-20 16:50:43 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-12-20 16:50:43 INFO Utils:54 - Successfully started service 'sparkDriver' on port 50780.
...
Test
Open a new terminal window and copy or create a text file in the /opt/spark-2.4.0/out directory:
[root@centos spark-2.4.0]# cp test.txt /opt/spark-2.4.0/out
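One caveat worth knowing: textFileStream only picks up files that appear in the watched directory within a batch window, and the file should appear atomically. The usual practice is to write the file somewhere else first and then mv it in, since a plain cp can briefly expose a partially written file. A minimal sketch, using stand-in temporary directories rather than the real /opt/spark-2.4.0/out:

```shell
# Stand-in directories for illustration; in the example above the
# watched directory is /opt/spark-2.4.0/out
WATCH_DIR=$(mktemp -d)
STAGING=$(mktemp -d)

# Write the file outside the watched directory first...
echo "hello spark streaming hello" > "$STAGING/test.txt"

# ...then move it in: mv within the same filesystem is a rename, so the
# stream never observes a half-written file
mv "$STAGING/test.txt" "$WATCH_DIR/test.txt"
cat "$WATCH_DIR/test.txt"
```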
Watch Spark's log output:
2018-12-20 16:50:11 INFO DAGScheduler:54 - ResultStage 97 (print at FileWordCount.scala:20) finished in 0.125 s
2018-12-20 16:50:11 INFO DAGScheduler:54 - Job 48 finished: print at FileWordCount.scala:20, took 0.992444 s
-------------------------------------------
Time: 1545295810000 ms
-------------------------------------------
(stream,2)
(scenario,1)
((eg.,1)
(level,1)
(locally.,1)
(Create,1)
(only,1)
(duplication,1)
(Note,1)
(\n,1)
...
The log shows the word counts computed from the text file.