This example shows how to use Spark Streaming to process, in a streaming fashion, data received from a TCP socket.
1. Create a Maven project and add the required dependencies
<project>
  <properties>
    <scala.version>2.11.8</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>repos</id>
      <name>Repository</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </repository>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>repos</id>
      <name>Repository</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </pluginRepository>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <!-- Spark Core -->
    <dependency>
      <!-- The _2.11 suffix of the Spark artifacts must match the project's Scala
           version (scala.version above); a mismatch fails at runtime with:
           java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object; -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope><!-- supplied by the Spark cluster at runtime; not packaged -->
    </dependency>
    <!-- Spark SQL -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope><!-- supplied by the Spark cluster at runtime; not packaged -->
    </dependency>
    <!-- Spark Streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.4.0</version>
      <scope>provided</scope><!-- supplied by the Spark cluster at runtime; not packaged -->
    </dependency>
    <!-- The Scala library version must match the _2.11 suffix of the Spark artifacts -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <!-- Mixed Scala/Java compilation -->
      <plugin><!-- Scala compiler plugin -->
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <id>scala-compile-first</id>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <includes>
                <include>**/*.scala</include>
              </includes>
            </configuration>
          </execution>
          <execution>
            <id>scala-test-compile</id>
            <goals>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source><!-- Java source level -->
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin><!-- Bundle all dependencies into a single jar -->
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <appendAssemblyId>false</appendAssemblyId>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordsFrep</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin><!-- Maven jar plugin -->
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <!-- Entry-point class of the program -->
              <mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordsFrep</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
2. The Scala code is as follows:
package org.jy.data.yh.bigdata.drools.scala.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming word-frequency count over a data stream.
 */
object SparkStreamingWordsFrep {
  def main(args: Array[String]): Unit = {
    // Spark configuration
    val sparkConf = new SparkConf()
      .setAppName("SparkStreamingWordsFrep")
      .setMaster("spark://centoshadoop1:7077,centoshadoop2:7077")
    // Create the streaming context with a 2-second batch interval
    val sparkStreamContext = new StreamingContext(sparkConf, Seconds(2))
    // Create a DStream connected to the given hostname:port, e.g. localhost:9999
    val lines = sparkStreamContext.socketTextStream("centoshadoop1", 9999)
    // Split each received line into words
    val words = lines.flatMap(line => line.split(" "))
    // Count the words in each batch
    val pairs = words.map(word => (word, 1))
    // Sum the counts: accumulate the values of tuples sharing the same key
    val wordCounts = pairs.reduceByKey(_ + _)
    // Print up to the first 10000 elements of each generated RDD to the console
    wordCounts.print(10000)
    sparkStreamContext.start()            // start the computation
    sparkStreamContext.awaitTermination() // wait for the computation to terminate
  }
}
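The heart of the job is the flatMap → map → reduceByKey chain. Its per-batch behavior can be sketched with ordinary Scala collections, no Spark required; the `foldLeft` below is a hypothetical stand-in that accumulates counts the way `reduceByKey(_ + _)` merges values per key:

```scala
object WordCountSketch {
  // Sum per-word counts for one micro-batch of lines, mirroring reduceByKey(_ + _)
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))                 // split each line into words
      .map(word => (word, 1))                   // pair each word with an initial count of 1
      .foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
        case (acc, (word, one)) => acc.updated(word, acc(word) + one) // merge counts per key
      }

  def main(args: Array[String]): Unit = {
    // One simulated micro-batch of input lines
    println(countWords(Seq("hello spark", "hello streaming")))
  }
}
```

In the real job the same merge runs distributed across partitions, which is why the combining function must be associative.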
3. Install the nmap-ncat (nc) utility on Linux
yum install nc
4. Build the assembly jar (mvn clean package, bound to the assembly plugin above) and run it on the Spark cluster with the following command:
bin/spark-submit \
--class org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordsFrep \
--num-executors 4 \
--driver-memory 2G \
--executor-memory 1g \
--executor-cores 1 \
--conf spark.default.parallelism=1000 \
/home/hadoop/tools/SSO-Scala-SparkStreaming-1.0-SNAPSHOT.jar
5. Open two Linux terminals; in one of them, type the text to be counted,
as shown below:
[hadoop@centoshadoop1 ~]$ nc -lk 9999
Spark streaming is an extension of the core Spark API
The output looks as follows:
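For the sample line sent through nc above, the per-batch counts can be recomputed with plain Scala collections. This is a sketch of the expected result only; the actual console output from `print()` is additionally wrapped in a per-batch `Time: ...` header, and the element order is not guaranteed:

```scala
object ExpectedBatchCounts {
  // Recompute the counts the streaming job would emit for a single input line
  def countWords(line: String): Map[String, Int] =
    line.split(" ").toSeq
      .map(word => (word, 1))
      .groupBy(_._1)                                  // group pairs by word
      .map { case (word, ones) => (word, ones.size) } // count occurrences

  def main(args: Array[String]): Unit = {
    val sample = "Spark streaming is an extension of the core Spark API"
    // "Spark" appears twice; every other word appears once
    countWords(sample).toSeq.sortBy { case (w, n) => (-n, w) }.foreach(println)
  }
}
```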