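Take the following code as an example (SocketInputDStream):
Spark Streaming reads data from a socket in the receive method of SocketReceiver. Leaving exception handling aside (the Receiver has a reconnect mechanism, the restart method; by default, after the Receiver fails, it waits two seconds and then re-establishes the socket connection), each record that is read gets stored by calling store(textRead), which appears as store(iterator.next) in the code. To follow how the data flows, keep these questions in mind:
1. Where is the data stored?
2. What structure is the data stored in?
3. When is the data read back?
4. How is the data that was read (per batch interval) converted into an RDD?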
1. SocketReceiver#receive
/** Create a socket connection and receive data until receiver is stopped */
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    logInfo("Stopped receiving")
    restart("Retrying connecting to " + host + ":" + port)
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case t: Throwable =>
      restart("Error receiving data", t)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
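The read loop at the heart of receive can be tried without Spark. The following is only a minimal sketch: ReceiveLoopSketch and drain are hypothetical names, an in-memory Iterator stands in for the socket stream, and the stop flag and store callback are passed in as plain functions rather than inherited from Receiver.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the receive loop; this is not Spark code.
object ReceiveLoopSketch {

  /** Feeds records to `store` until the source is exhausted or `isStopped()` is true.
    * Returns how many records were stored. */
  def drain[T](records: Iterator[T], isStopped: () => Boolean)(store: T => Unit): Int = {
    var stored = 0
    // Same guard as in receive: check the stop flag before asking for more data.
    while (!isStopped() && records.hasNext) {
      store(records.next())
      stored += 1
    }
    stored
  }

  def main(args: Array[String]): Unit = {
    val sink = ArrayBuffer.empty[String]
    // Pretend the receiver is stopped once two records have arrived.
    val n = drain(Iterator("a", "b", "c"), () => sink.size >= 2) { item =>
      sink += item
      ()
    }
    println(s"stored $n records: ${sink.mkString(",")}")
  }
}
```

Because the stop flag is checked before hasNext, a stopped receiver never blocks waiting for another record. In the real method, leaving the loop for either reason falls through to restart.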
2. SocketReceiver#receive=>SocketReceiver#store
/**
 * Store a single item of received data to Spark's memory.
 * These single items will be aggregated together into data blocks before
 * being pushed into Spark's memory.
 */
def store(dataItem: T) {
  executor.pushSingle(dataItem)
}
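Storing data is one of the executor's responsibilities: the store method simply calls pushSingle on executor (the receiver's supervisor, implemented by ReceiverSupervisorImpl). Here "single" refers to one read, and dataItem is the object produced by that read.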
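The delegation described above can be modeled in a few lines. This is a sketch under stated assumptions: ToySupervisor, ToyReceiver and StoreChainSketch are hypothetical classes invented for illustration, and the supervisor here just appends to an in-memory buffer instead of doing whatever Spark's supervisor does with the record.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical toy classes that mirror the call chain; they are not Spark's classes.
class ToySupervisor {
  private val buffer = ArrayBuffer.empty[Any]

  /** Accepts one record, playing the role of pushSingle in the walkthrough. */
  def pushSingle(data: Any): Unit = {
    buffer += data
  }

  /** Snapshot of everything pushed so far, in arrival order. */
  def received: List[Any] = buffer.toList
}

class ToyReceiver[T](executor: ToySupervisor) {
  /** Stores one record by handing it straight to the supervisor. */
  def store(dataItem: T): Unit = executor.pushSingle(dataItem)
}

object StoreChainSketch {
  def main(args: Array[String]): Unit = {
    val supervisor = new ToySupervisor
    val receiver = new ToyReceiver[String](supervisor)
    Seq("hello", "world").foreach(item => receiver.store(item))
    println(supervisor.received)
  }
}
```

The receiver holds no data of its own in this model; every record crosses to the supervisor one item at a time, which is the point the "single" in pushSingle makes.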
3. SocketReceiver#store=>executor.pushSingle(ReceiverSupervisorImpl.pushSingle)
/** Push a single record of received data into block generator. */
def pushSingle(data: Any) {
  blockGenerator.addData(data)
}
The data has now been put into the blockGenerator data structure; blockGener