Counting words sent from a client program with Spark Streaming

1. How the Spark Streaming application and the client program connect: the client program (ClientApp, step 5) opens a ServerSocket on port 8999 and waits for a connection; the Spark Streaming application (WordCount, step 4) connects to localhost:8999 via socketTextStream and receives every line the client writes to the socket.

2. In MyEclipse, create a Maven project DStreamTest, create a package WordStream, and create the class WordCount.scala in that package (for the detailed procedure see my earlier post: https://blog.csdn.net/weixin_40393128/article/details/102669873, "Packaging and running Spark Scala programs with Maven in MyEclipse").

3. pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com2</groupId>
  <artifactId>DStreamTest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
		<dependency> <!-- Spark -->
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.11</artifactId>
			<version>1.6.2</version>
			<scope>provided</scope>
		</dependency>
		<dependency> <!-- Spark Streaming -->
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-streaming_2.11</artifactId> <!-- must match the Scala version (2.11) of spark-core above -->
			<version>1.6.2</version>
			<scope>provided</scope>
		</dependency>
		<dependency><!-- Log -->
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
			<version>1.2.17</version>
		</dependency>
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
			<version>1.7.12</version>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<!-- mixed scala/java compile -->
			<plugin>
				<groupId>org.scala-tools</groupId>
				<artifactId>maven-scala-plugin</artifactId>
				<executions>
					<execution>
						<id>compile</id>
						<goals>
							<goal>compile</goal>
						</goals>
						<phase>compile</phase>
					</execution>
					<execution>
						<id>test-compile</id>
						<goals>
							<goal>testCompile</goal>
						</goals>
						<phase>test-compile</phase>
					</execution>
					<execution>
						<phase>process-resources</phase>
						<goals>
							<goal>compile</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<configuration>
					<source>1.7</source>
					<target>1.7</target>
				</configuration>
			</plugin>
			<!-- for fatjar -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-assembly-plugin</artifactId>
				<version>2.4</version>
				<configuration>
					<descriptorRefs>
						<descriptorRef>jar-with-dependencies</descriptorRef>
					</descriptorRefs>
				</configuration>
				<executions>
					<execution>
						<id>assemble-all</id>
						<phase>package</phase>
						<goals>
							<goal>single</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-jar-plugin</artifactId>
				<configuration>
					<archive>
						<manifest>
							<addClasspath>true</addClasspath>
							<mainClass>WordStream.WordCount</mainClass>
						</manifest>
					</archive>
				</configuration>
			</plugin>
		</plugins>
	</build>
	<repositories>
		<repository>
			<id>alimaven</id>
			<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
		</repository>
	</repositories>
</project>
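Note that the Spark dependencies are marked provided, so the jar-with-dependencies built by the assembly plugin does not bundle Spark itself; spark-submit supplies those classes at run time (step 8). ClientApp uses only JDK classes, which is why the same jar can also be run with a plain java command.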

4. WordCount.scala

package WordStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.{SparkConf, SparkContext}

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master needs at least 2 cores to prevent a starvation scenario (one core for the receiver, one for processing).
object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SocketWordFreq")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Create a DStream that will connect to hostname:port, like localhost:8999
    val lines = ssc.socketTextStream("localhost", 8999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
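As a side note, any program that listens on port 8999 and writes lines to the accepted connection can feed this stream; for a quick test without the ClientApp from step 5 below, netcat works too (assuming it is installed):

nc -lk 8999

Each line typed into the nc terminal then arrives at socketTextStream and is counted in the next batch.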

5. Create another Java class in the WordStream package, ClientApp.java; it listens on port 8999 and forwards each line typed on standard input to the connected streaming program. Code:

package WordStream;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class ClientApp
{
  public static void main(String[] args)
  {
    try
    {
      // Listen on port 8999 and wait for the streaming program to connect.
      System.out.println("Defining new Socket");
      ServerSocket soc = new ServerSocket(8999);
      System.out.println("Waiting for Incoming Connection");
      Socket clientSocket = soc.accept();
      System.out.println("Connection Received");
      OutputStream outputStream = clientSocket.getOutputStream();
      // Create the socket writer and the stdin reader once, outside the loop.
      PrintWriter out = new PrintWriter(outputStream, true);
      BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
      for (;;)
      {
        System.out.println("Waiting for user to input some words");
        String words = read.readLine();
        if (words == null) break; // standard input was closed (EOF)
        System.out.println("words are received and now writing them to Socket");
        out.println(words);
      }
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

6. Run maven install to package the whole project.

7. In the MyEclipse workspace, find DStreamTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar in the target directory of the DStreamTest project and copy it to the 下载 (Downloads) folder.

8. Now run the client program and the Spark Streaming program in two separate terminals.

        1) Open a terminal in the 下载 directory and run the ClientApp class with the following command:

java -classpath DStreamTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar WordStream.ClientApp
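Given the println calls in ClientApp, this terminal should print "Defining new Socket" and "Waiting for Incoming Connection", then block until the streaming program from 2) connects; after that it prints "Connection Received" and "Waiting for user to input some words" and starts reading lines from standard input.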

        2) Open a terminal in any directory and submit the streaming program with the following command:

/usr/local/spark/bin/spark-submit --class WordStream.WordCount /home/hadoop/下载/DStreamTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar

The words typed into the terminal from 1) now show up, counted per batch, in this terminal.
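For example, after typing "hello spark hello" in the client terminal, the batch output (the format produced by DStream.print(); the timestamp depends on when the batch ran) looks roughly like:

-------------------------------------------
Time: 1571500800000 ms
-------------------------------------------
(hello,2)
(spark,1)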

9. Stopping the processes

        List the listening processes and their PIDs with the following command:

netstat -nultp
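In the output, the PID/Program name column identifies each process; ClientApp is the java process listening on port 8999. In my run the relevant line looked roughly like this (addresses and PIDs will differ per run):

tcp6       0      0 :::8999                 :::*                    LISTEN      7947/java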

Then kill the two Java processes (the PIDs below come from the netstat output of my run; substitute your own). Both processes shut down, and the terminals from 8.1) and 8.2) each report "Killed":

kill -9 7947
kill -9 7969

10. Using updateStateByKey to accumulate word counts across batches

The reduceByKey version in step 4 counts each one-second batch independently; updateStateByKey keeps per-key state between batches, which requires setting a checkpoint directory. The modified WordCount.scala:

package WordStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.{SparkConf, SparkContext}

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master needs at least 2 cores to prevent a starvation scenario (one core for the receiver, one for processing).
object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SocketWordFreq")
      .setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // updateStateByKey needs a checkpoint directory to persist the per-key state.
    ssc.checkpoint("file:///home/hadoop/下载/checkpoint")
    // Create a DStream that will connect to hostname:port, like localhost:8999
    val lines = ssc.socketTextStream("localhost", 8999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.updateStateByKey(updateFunction)
    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }

  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // Add this batch's occurrences of the key to its running total (0 if the key is new).
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
  }

}
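To see what updateFunction computes for one key: if "hello" occurs twice in the current batch (newValues = Seq(1, 1)) and its accumulated count so far is Some(3), the new state is Some(5). A minimal standalone check, using the names from the code above:

val updated = WordCount.updateFunction(Seq(1, 1), Some(3))
println(updated) // Some(5): 3 accumulated + 2 new occurrences

Spark applies this function to every key in every batch, so the printed counts now grow over time instead of resetting every second.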

 
