Import the spark and spark-streaming dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
Example 1: Spark Streaming receives socket data and implements a word count (WordCount)
Scala version
Continuously receive a newline-delimited text stream from port 7777 and count the words.
package cn.kgc.kb09.Spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object SparkStreamDemo1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkStreamDemo1").setMaster("local[*]")
    // Batch interval: each micro-batch collects 3 seconds of data
    val streamingContext = new StreamingContext(conf, Seconds(3))
    // Specify the input source: a TCP socket
    val socketLineStream: ReceiverInputDStream[String] =
      streamingContext.socketTextStream("192.168.247.201", 7777)
    // Process the received lines and compute the word count
    val wordStream = socketLineStream.flatMap(line => line.split("\\s+"))
    val mapStream = wordStream.map(x => (x, 1))
    val wordcountStream = mapStream.reduceByKey(_ + _)
    // Print each batch's result
    wordcountStream.print()
    // Start the receiver
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Now anything typed into the socket on the Linux side is word-counted and printed to the console.
Per the configured batch interval, data is collected every 3 seconds; Spark Streaming is, in essence, micro-batch processing.
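The transformation chain above (flatMap → map → reduceByKey) is ordinary word-count logic applied to each 3-second batch. As a rough illustration of what happens inside a single micro-batch, here is the same logic sketched with plain Java collections (no Spark dependency; the class name and batch content are made up for this sketch):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchWordCountSketch {
    // Mirrors flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(_ + _)
    static Map<String, Long> wordCount(List<String> batchLines) {
        return batchLines.stream()
                // flatMap: break each line into words on whitespace
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // grouping + counting plays the role of map-to-pair + reduceByKey
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // One hypothetical 3-second batch of socket lines
        List<String> batch = Arrays.asList("hello spark", "hello streaming");
        System.out.println(wordCount(batch));
    }
}
```

In the real job this computation reruns on every batch; the sketch only shows one batch's worth of data.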
Spark Streaming has two classes of data sources:
- (1) Basic sources: available directly in the StreamingContext API, e.g. file systems, socket connections, and Akka actors.
- (2) Advanced sources: Kafka, Flume, Kinesis, Twitter, and so on.
Basic input sources (from the source code):
For external data input, Spark Streaming provides the following methods:
- (1) User-defined sources: receiverStream
- (2) TCP-based sources: socketTextStream, socketStream
- (3) Raw network source: rawSocketStream
- (4) Hadoop filesystem sources: fileStream, textFileStream, binaryRecordsStream
- (5) Other sources (a queue of RDDs): queueStream
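queueStream is mainly used for testing: you push RDDs into a queue and each batch interval consumes one. The micro-batch model it exposes can be sketched without Spark as a loop that drains one batch per tick (class and method names here are illustrative, not Spark API):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class QueueStreamSketch {
    // Count all whitespace-separated words in one batch of lines
    static long countWords(List<String> batch) {
        return batch.stream()
                    .mapToLong(line -> line.split("\\s+").length)
                    .sum();
    }

    public static void main(String[] args) {
        // Each list stands in for the RDD of one batch interval
        Queue<List<String>> queue = new ArrayDeque<>();
        queue.add(Arrays.asList("hello spark"));
        queue.add(Arrays.asList("hello streaming", "bye"));

        int tick = 0;
        while (!queue.isEmpty()) {
            List<String> batch = queue.poll(); // one micro-batch per interval
            System.out.println("batch " + tick++ + ": " + countWords(batch) + " words");
        }
    }
}
```

With real queueStream, Spark performs this draining on the driver's schedule; the loop above only conveys the one-batch-per-interval idea.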
Java version
package cn.kgc.kb09.Spark;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.