Spark Streaming: Word Count on Text Files in a Watched Directory

An example of using Spark Streaming to read text files from a specified directory and count the words they contain.

Development and runtime environment

IntelliJ IDEA 2018.2
JDK 1.8.0_151
Scala 2.11.12
Spark 2.4.0
CentOS Linux, kernel 3.10.0-327.el7.x86_64 GNU/Linux

 

Spark version

In pom.xml, configure the project to use Spark 2.4.0:

    <groupId>SparkLearn</groupId>
    <artifactId>spark-learn</artifactId>
    <version>1.0</version>
    
    <properties>
        <spark.version>2.4.0</spark.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
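One refinement worth considering (my suggestion, not part of the original pom): because the job is launched with spark-submit, the Spark classes are already on the classpath at runtime, so the Spark dependencies are often given provided scope to keep them out of the application jar. A sketch for one of the two dependencies:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
    <!-- supplied by the Spark runtime when launched via spark-submit -->
    <scope>provided</scope>
</dependency>
```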

Scala class

Create a Scala class, FileWordCount.scala, that monitors a directory: whenever a new text file is created in, copied into, or moved into that directory, the words in the file are counted.

Directory structure (project layout screenshot omitted)

FileWordCount.scala:


package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileWordCount {
  // Directory monitored for new text files
  val path = "/opt/spark-2.4.0/out"

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FileWordCount")
    // Batch interval: a new micro-batch every 5 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create a file-based DStream on the directory and use the
    // stream to count words in files that newly appear there
    val lines = ssc.textFileStream(path)
    val words = lines.flatMap(line => line.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}


In this source, a StreamingContext with a 5-second batch interval reads newly appearing text files from /opt/spark-2.4.0/out. flatMap splits each line on spaces, map pairs each word with a count of 1, and reduceByKey sums the counts for each word; the result is printed to the console.
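The flatMap → map → reduceByKey shape can be sketched with ordinary shell tools on a couple of sample lines, which makes each step easy to see (a toy illustration of the logic only, not how Spark executes it):

```shell
# Split lines into words (flatMap), sort so identical words are adjacent,
# then count occurrences per word (map to 1 + reduceByKey):
printf 'a b a\nb c\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print "(" $2 "," $1 ")"}'
# prints:
# (a,2)
# (b,2)
# (c,1)
```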

 

Building the jar with Maven

Package the Scala source into a jar with Maven.
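For reference, a sketch of the packaging step under the usual assumptions (Maven installed, command run from the project root next to pom.xml); the mvn invocation itself is my assumption, not shown in the original:

```shell
# Typical packaging command (assumed): mvn clean package
# Maven names the artifact <artifactId>-<version>.jar, so with the pom
# above the expected output path is:
artifactId="spark-learn"; version="1.0"
echo "target/${artifactId}-${version}.jar"   # prints target/spark-learn-1.0.jar
```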

 

Upload the jar

Upload the built spark-learn-1.0.jar to a directory on the Spark server; here it goes to /opt/spark-2.4.0/lib:
[root@centos lib]# pwd
/opt/spark-2.4.0/lib
[root@centos lib]# ll
total 316
-rw-r--r--. 1 root root 321620 Dec 20 16:14 spark-learn-1.0.jar

 

Start Spark

On the server where Spark is installed, go to its sbin directory and start Spark:
[root@centos sbin]# pwd
/opt/spark-2.4.0/sbin
[root@centos sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out
failed to launch: nice -n 0 /opt/spark-2.4.0/bin/spark-class org.apache.spark.deploy.master.Master --host centos --port 7077 --webui-port 8080
full log in /opt/spark-2.4.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out

.....

 

Submit the job

Compose the command to submit the job:

spark-submit \
  --class org.apache.spark.examples.streaming.FileWordCount \
  --master spark://centos:7077  \
  --executor-memory 4G \
  --total-executor-cores 2 \
  /opt/spark-2.4.0/lib/spark-learn-1.0.jar

spark://centos:7077 is the URL of the Spark master.

Run the command above from /opt/spark-2.4.0/sbin; Spark starts the job and logs output such as:
2018-12-20 16:50:43 INFO  SparkContext:54 - Running Spark version 2.4.0
2018-12-20 16:50:43 INFO  SparkContext:54 - Submitted application: FileWordCount
2018-12-20 16:50:43 INFO  SecurityManager:54 - Changing view acls to: root
2018-12-20 16:50:43 INFO  SecurityManager:54 - Changing modify acls to: root
2018-12-20 16:50:43 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-12-20 16:50:43 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-12-20 16:50:43 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-12-20 16:50:43 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 50780.
...

 

Testing

Open a new terminal window and copy or create a text file in /opt/spark-2.4.0/out:
[root@centos spark-2.4.0]# cp test.txt /opt/spark-2.4.0/out


Watch Spark's log output:
2018-12-20 16:50:11 INFO  DAGScheduler:54 - ResultStage 97 (print at FileWordCount.scala:20) finished in 0.125 s
2018-12-20 16:50:11 INFO  DAGScheduler:54 - Job 48 finished: print at FileWordCount.scala:20, took 0.992444 s
-------------------------------------------
Time: 1545295810000 ms
-------------------------------------------
(stream,2)
(scenario,1)
((eg.,1)
(level,1)
(locally.,1)
(Create,1)
(only,1)
(duplication,1)
(Note,1)
(\n,1)
...

The log shows the word counts computed from the text file.
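A caveat that follows from textFileStream's semantics (my addition, not in the original post): files are picked up based on when they appear in the monitored directory, so a large file still being copied when a batch fires can be missed or read incompletely. The safer pattern is to finish writing the file elsewhere on the same filesystem and then mv it in, because a rename is atomic. A sketch using temporary directories as stand-ins for a staging directory and the watched /opt/spark-2.4.0/out:

```shell
# Stand-ins for a staging directory and the watched directory
# (in real use the watched one would be /opt/spark-2.4.0/out):
staging=$(mktemp -d)
watched=$(mktemp -d)

# Finish writing the file in the staging directory first...
printf 'hello world hello\n' > "$staging/test.txt"

# ...then move it in; on a single filesystem mv is an atomic rename,
# so the stream never observes a half-written file
mv "$staging/test.txt" "$watched/test.txt"

ls "$watched"   # prints test.txt
```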
