Overview
Each of the data sources below is demonstrated with a word count job, as shown in the figure below:
Data abstraction
Whatever the data source, it is ultimately abstracted as a ReceiverInputDStream:
/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 * Specific implementations of ReceiverInputDStream must
 * define [[getReceiver]] function that gets the receiver object of type
 * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent
 * to the workers to receive data.
 * @param _ssc Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {
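As the scaladoc says, a concrete implementation only has to supply getReceiver(). A minimal sketch (the class name MySocketInputDStream is hypothetical; it reuses the customReceiver defined near the end of this post):

// Hypothetical subclass: the only required piece is getReceiver(),
// which returns the Receiver that Spark ships to a worker node.
class MySocketInputDStream(ssc: StreamingContext, host: String, port: Int)
  extends ReceiverInputDStream[String](ssc) {
  override def getReceiver(): Receiver[String] = new customReceiver(host, port)
}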
Socket data source
Note: the counts only cover the data within each batch; they are not accumulated across batches.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object socket {
  def main(args: Array[String]): Unit = {
    // Receiver-based (non-direct) mode needs local[n] with n >= 2:
    // one thread for the receiver, at least one for processing
    val conf: SparkConf = new SparkConf().setAppName("socketWD").setMaster("local[2]")
    // Optionally tune how often received data is chunked into blocks
    //.set("spark.streaming.blockInterval", "50")
    // Batch interval of 2 seconds
    val sc = new StreamingContext(conf, Seconds(2))
    // Connect to the given hostname and port
    val socketTextStream: ReceiverInputDStream[String] = sc.socketTextStream("node01", 9999)
    // Word count within the batch
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }
}
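To try it out, start a text server on node01 first, e.g. with netcat (nc -lk 9999), and type words into it while the job runs.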
Input:
Result:
The word counts of the two batches are not added together; each batch is counted independently.
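If you do want totals to accumulate across batches, updateStateByKey can carry state from batch to batch. A minimal sketch, reusing sc and socketTextStream from the example above (the checkpoint path is just a placeholder; stateful operators require checkpointing):

// Stateful word count: fold each batch's counts into the running total
sc.checkpoint("./checkpoint") // placeholder path; required for stateful ops
val totals: DStream[(String, Int)] = socketTextStream
  .flatMap(_.split(" "))
  .map((_, 1))
  .updateStateByKey((newValues: Seq[Int], running: Option[Int]) =>
    Some(newValues.sum + running.getOrElse(0)))
totals.print()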
HDFS data source
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object hdfs {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("hdfsWD").setMaster("local[2]")
    val sc = new StreamingContext(conf, Seconds(2))
    // Monitor an HDFS directory for newly added files
    val textFS: DStream[String] = sc.textFileStream("hdfs://node01:8020/data")
    val result: DStream[(String, Int)] = textFS.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }
}
Result:
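Note that textFileStream is a file-based source with no receiver: it only picks up files that appear in the monitored directory after the stream starts (typically moved in atomically), so pre-existing files are ignored and no receiver thread is needed.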
flume数据源
sparkstreaming很少直接整合flume,sparkstreaming整合flume有两种方式:
- sparkstreaming作为sink端
- flume将数据缓存到sink端后,sparkstreaming去拉取
前置要求:
- 至少一个spark的worker必须在启动flume的服务器上
pom file
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>${spark.version}</version>
    <!-- if the property above does not resolve, pin a concrete version instead: -->
    <!-- <version>2.2.0</version> -->
</dependency>
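A minimal sketch of both modes using the FlumeUtils API from this dependency (the flumeStreams helper, host node01, and port 8888 are placeholders of my own; ssc is a StreamingContext like the sc in the examples above):

import java.nio.charset.StandardCharsets
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.flume.FlumeUtils

def flumeStreams(ssc: StreamingContext): Unit = {
  // Push-based: Flume's avro sink sends events to this host:port, which must
  // be a machine running one of the Spark workers (see the prerequisite above)
  val pushed = FlumeUtils.createStream(ssc, "node01", 8888)
  // Pull-based: Flume buffers events in a SparkSink and Spark Streaming polls it
  val pulled = FlumeUtils.createPollingStream(ssc, "node01", 8888)
  // Both yield SparkFlumeEvent records; the payload is the Avro event body
  val lines = pushed.map(e => new String(e.event.getBody.array(), StandardCharsets.UTF_8))
  lines.print()
}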
For more details, see the following:
- a separate article on integrating Spark Streaming with Flume
- the official Flume integration guide
Custom data source
A custom data source extends Receiver and implements the following two methods:
- onStart(): called to start receiving data
// Typically starts a thread that receives the data; the actual
// receive-and-store loop lives in a separate receive() method
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    override def run() { receive() }
  }.start()
}
- onStop(): called to stop receiving data
def onStop() {
  // Nothing much to do here: the thread calling receive()
  // is designed to stop by itself once isStopped() returns true
}
Example: a custom receiver that reads data sent over a socket

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
// note: org.apache.spark.internal.Logging is private[spark]; outside the
// org.apache.spark package, use your own logger instead of logInfo
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.receiver.Receiver

// Custom receiver producing records of type String
object custom {
  def main(args: Array[String]): Unit = {
    // Receiver-based (non-direct) mode needs local[n] with n >= 2
    val conf: SparkConf = new SparkConf().setAppName("customerWD").setMaster("local[2]")
    // Batch interval of 2 seconds
    val sc = new StreamingContext(conf, Seconds(2))
    // Plug in the custom receiver with the hostname and port to read from
    val socketTextStream: ReceiverInputDStream[String] = sc.receiverStream(new customReceiver("node01", 8888))
    // Word count within the batch
    val result: DStream[(String, Int)] = socketTextStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming computation
    sc.start()
    sc.awaitTermination()
  }

  class customReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
    override def onStart(): Unit = {
      // Start the thread that receives data over a connection
      new Thread("Socket Receiver") {
        override def run() { receive() }
      }.start()
    }

    override def onStop(): Unit = {
      // Nothing much to do here: the thread calling receive()
      // stops by itself once isStopped() returns true
    }

    /** Create a socket connection and receive data until the receiver is stopped */
    private def receive(): Unit = {
      var socket: Socket = null
      var userInput: String = null
      try {
        logInfo("Connecting to " + host + ":" + port)
        socket = new Socket(host, port)
        logInfo("Connected to " + host + ":" + port)
        val reader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
        userInput = reader.readLine()
        // Keep reading lines until the receiver is stopped or the stream ends
        while (!isStopped() && userInput != null) {
          // store() hands the record to Spark, which buffers it in memory
          store(userInput)
          userInput = reader.readLine()
        }
        reader.close()
        socket.close()
        logInfo("Stopped receiving")
        // Restart in an attempt to connect again when the server is active again
        restart("Trying to reconnect")
      } catch {
        case e: java.net.ConnectException =>
          // restart if we could not connect to the server
          restart("Error connecting to " + host + ":" + port, e)
        case t: Throwable =>
          // restart on any other error
          restart("Error receiving data", t)
      }
    }
  }
}
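As with the built-in socket source, you can test this by running a netcat server on node01 (nc -lk 8888) and typing words into it. Note that restart() only schedules the restart: it returns immediately and asynchronously calls onStop() and then onStart(), so the receiver reconnects on its own when the server comes back.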
Kafka data source
Covered in the next post.