Big Data - Spark Streaming (Part 2)
Data Sources
- Socket data source
Requirement: Spark Streaming receives socket data in real time and performs a word count.
Processing flow diagram
Install the socket service
First, install the nc tool with yum on the Linux server node01. nc is short for netcat, a general-purpose utility for reading from and writing to network connections; here we use it to listen on a port and send data to whatever client connects.
yum -y install nc
# Listen on port 9999 (-l listen, -k keep the listener alive across connections); every line typed in this terminal is sent to the connected client
nc -lk 9999
Code development
pom.xml configuration
<properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.3.3</spark.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass></mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
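Note: the _2.11 suffix of the spark-streaming artifact must match the major version of scala.version (2.11.8 here), otherwise the job fails at runtime with binary-incompatibility errors. The maven-shade-plugin builds a fat jar during the package phase so the program can be submitted to a cluster; mainClass is left empty above and can either be filled in with the class you want to run or supplied to spark-submit via --class.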
Develop the Spark Streaming program
package com.kaikeba.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming program that receives socket data and performs a word count
  */
object SocketWordCount {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    // todo: 1. create the SparkConf object
    val sparkConf: SparkConf = new SparkConf().setAppName("TcpWordCount").setMaster("local[2]")
    // todo: 2. create the StreamingContext object
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // todo: 3. receive the socket data
    val socketTextStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01", 9999)
    // todo: 4. process the data
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // todo: 5. print the result
    result.print()
    // todo: 6. start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
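To test the program, first start nc -lk 9999 on node01, then run SocketWordCount and type a few lines of words into the nc terminal. The master is set to local[2] because the socket receiver permanently occupies one thread, so at least one more thread is needed to process the batches. Every 2-second batch interval, result.print() writes that batch's word counts to the console in a form similar to the following (the timestamp and words are illustrative); note that reduceByKey only counts within the current batch, nothing is accumulated across batches:
-------------------------------------------
Time: 1588000000000 ms
-------------------------------------------
(hello,2)
(spark,1)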
- HDFS data source
Requirement: use Spark Streaming to monitor a directory on HDFS; whenever a new file appears there, pull the data in and process it.
Processing flow diagram
Code development
package com.kaikeba.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

/**
  * HDFS data source
  */
object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    // todo: 1. create the SparkConf object
    val sparkConf: SparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local[2]")
    // todo: 2. create the StreamingContext object
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // todo: 3. monitor the HDFS directory
    val textFileStream: DStream[String] = ssc.textFileStream("hdfs://node01:8020/data")
    // todo: 4. process the data
    val result: DStream[(String, Int)] = textFileStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // todo: 5. print the result
    result.print()
    // todo: 6. start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
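textFileStream only picks up files that appear in the monitored directory after the stream has started, and it selects them by modification time, so in production files should be written elsewhere and then moved or renamed into hdfs://node01:8020/data atomically; a file that is still being written may be read incompletely or skipped. For a quick test, copying a small, already-complete file into the directory while the program is running usually works, for example hdfs dfs -put words.txt hdfs://node01:8020/data (the file name is illustrative). Unlike the socket source, the file stream does not use a receiver, so it does not tie up a core for receiving data.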
- Custom data source
Code development
A custom data source is written by extending the Receiver abstract class: onStart() begins receiving data (it must return quickly, so the actual receiving runs on its own thread), onStop() cleans up, and each record received is handed to Spark Streaming by calling store(). The receiver below reads lines from a socket and is consumed through ssc.receiverStream(...).
package com.kaikeba.streaming

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.receiver.Receiver

/**
  * Custom data source: defines a Receiver that reads data from a socket.
  * Usage: start the data source first with nc -lk 8888
  */
object CustomReceiver {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    // todo: 1. create the SparkConf object
    val sparkConf: SparkConf = new SparkConf()
      .setAppName("CustomReceiver")
      .setMaster("local[2]")
    // todo: 2. create the StreamingContext object
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // todo: 3. call the receiverStream api, passing in the custom Receiver
    val receiverStream = ssc.receiverStream(new CustomReceiver("node01", 8888))
    // todo: 4. process the data
    val result: DStream[(String, Int)] = receiverStream
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    // todo: 5. print the result
    result.print()
    // todo: 6. start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
/**
  * Custom source: receives lines of text from the given host and port
  * @param host hostname to connect to
  * @param port port to connect to
  */
class CustomReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) with Logging {

  override def onStart(): Unit = {
    // start a thread that receives the data, so that onStart() itself returns immediately
    new Thread("socket receiver") {
      override def run(): Unit = {
        receive()
      }
    }.start()
  }

  /** Create a socket connection and receive data until the receiver is stopped */
  private def receive(): Unit = {
    var socket: Socket = null
    var userInput: String = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      userInput = reader.readLine()
      while (!isStopped && userInput != null) {
        // hand each received line over to Spark Streaming
        store(userInput)
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      logInfo("Stopped receiving")
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }

  override def onStop(): Unit = {
    // nothing to clean up: the receiving thread exits on its own once isStopped returns true
  }
}
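To run this example, start nc -lk 8888 on node01 first, then run CustomReceiver and type words into the nc terminal; the output is the same per-batch word count as in the socket example. A few points worth noting: onStart() must not block, which is why the receiving loop runs on its own thread; store() pushes each record into Spark's buffers, where it is grouped into blocks and turned into the RDDs of the DStream; and restart() asks the framework to stop and, after a short delay, re-launch the receiver, which is how the example recovers from a dropped connection. The StorageLevel passed to the Receiver constructor (MEMORY_AND_DISK_SER here) controls how the received blocks are stored.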