2021SC@SDUSC
Preface

The previous post analyzed the execution flow of StreamingContext. Next, we analyze how Spark Streaming receives data.
Spark Streaming Data Receiving

Data receiving involves the following: Receivers are started on Executors to continuously receive streaming data. The received records are discrete, so they must be collected into Blocks, handed to the BlockManager for storage, and the metadata of each stored Block must be reported to the Driver.
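The receive → block → store → report pipeline above can be illustrated with a self-contained toy model. All names here (BlockMeta, DriverRegistry, ExecutorSide) are made up for this sketch; real Spark uses BlockGenerator, BlockManager and ReceivedBlockTracker for these roles.

```scala
// Toy model of the pipeline: discrete records are grouped into blocks,
// and the metadata of each block is reported to a driver-side registry.
case class BlockMeta(blockId: Long, numRecords: Int)

// Stands in for the Driver-side bookkeeping of stored blocks.
class DriverRegistry {
  private var metas = Vector.empty[BlockMeta]
  def report(meta: BlockMeta): Unit = metas :+= meta // "notify the Driver"
  def all: Seq[BlockMeta] = metas
}

// Stands in for the Executor side: collect records into blocks, "store"
// them, and report each block's metadata to the driver registry.
class ExecutorSide(driver: DriverRegistry, blockSize: Int) {
  private var nextId = 0L
  def ingest(records: Seq[String]): Seq[Seq[String]] = {
    val blocks = records.grouped(blockSize).toSeq
    blocks.foreach { b =>
      driver.report(BlockMeta(nextId, b.length))
      nextId += 1
    }
    blocks
  }
}
```

Ingesting five records with blockSize = 2 yields three blocks, and the registry ends up holding the per-block record counts (2, 2, 1) that the Driver-side scheduling would rely on.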
First, let's look at the Receiver class:
// Excerpt from Receiver
/**
 * :: DeveloperApi ::
 * Abstract class of a receiver that can be run on worker nodes to receive external data. A
 * custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
 * should define the setup steps necessary to start receiving data,
 * and `onStop()` should define the cleanup steps necessary to stop receiving data.
 * Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
 * or stopped completely by `stop(...)`.
 *
 * A custom receiver in Scala would look like this.
 *
 * {{{
 *  class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) {
 *      def onStart() {
 *          // Setup stuff (start threads, open sockets, etc.) to start receiving data.
 *          // Must start new thread to receive data, as onStart() must be non-blocking.
 *
 *          // Call store(...) in those threads to store received data into Spark's memory.
 *
 *          // Call stop(...), restart(...) or reportError(...) on any thread based on how
 *          // different errors need to be handled.
 *
 *          // See corresponding method documentation for more details
 *      }
 *
 *      def onStop() {
 *          // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
 *      }
 *  }
 * }}}
 *
 * ...
 */
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
  ...
The long comment above explains how to implement a custom Receiver. Receiver is an abstract class that extends Serializable, because a Receiver must be serialized on the Driver and shipped to an Executor before it can receive data there. Spark already ships 19 Receiver subclasses, which we will not list one by one.
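The onStart()/onStop() contract described in that comment can be sketched with a self-contained toy that has no Spark dependency. MiniReceiver and ConstantReceiver are made-up names; store here just buffers records, whereas in real Spark store hands data to the ReceiverSupervisor.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.jdk.CollectionConverters._

// Mimics the Receiver contract: onStart() must be non-blocking, so the actual
// receiving happens on a background thread that calls store(...) per record.
abstract class MiniReceiver[T] extends Serializable {
  private val buffer = new ConcurrentLinkedQueue[T]()
  def store(item: T): Unit = buffer.add(item) // real Spark forwards to the supervisor
  def stored: Seq[T] = buffer.asScala.toSeq

  def onStart(): Unit // spawn the receiving thread here
  def onStop(): Unit  // stop threads / close sockets here
}

// A toy receiver that "receives" a fixed list of lines on its own thread.
class ConstantReceiver(lines: Seq[String]) extends MiniReceiver[String] {
  @volatile private var thread: Thread = _
  override def onStart(): Unit = {
    thread = new Thread(() => lines.foreach(store))
    thread.start()
  }
  override def onStop(): Unit = if (thread != null) thread.join()
}
```

A real receiver (e.g. one reading from a socket) would loop in the background thread until stopped, but the threading shape is the same.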
Data receiving on the Executor side is managed centrally by the ReceiverTracker on the Driver side.

Since the ReceiverTracker startup process is fairly long, here is a borrowed flow chart:

[Figure: main flow of a normal ReceiverTracker startup (first part)]

In the figure, the steps below the thick arrow at SparkContext.submitJob run on the Executors; the remaining steps run on the Driver.

The source of ReceiverTracker.start is shown below:
// ReceiverTracker.start
/** Start the endpoint and receiver execution thread */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  // Receivers are only launched if there are input streams to feed
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
The start method of ReceiverTracker sets up the RPC endpoint because ReceiverTracker monitors every Receiver in the cluster; in turn, each Receiver reports its own state to the ReceiverTrackerEndpoint, including the data it has received and its lifecycle events.
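The register-an-endpoint-then-message-it pattern can be illustrated with a minimal stand-in. MiniRpcEnv, Endpoint and TrackerEndpoint are invented for this sketch and are far simpler than Spark's RpcEnv/RpcEndpoint machinery.

```scala
import scala.collection.mutable

// A message receiver registered under a name in the "RPC environment".
trait Endpoint { def receive(msg: Any): Unit }

// setupEndpoint registers the endpoint and returns a reference that
// callers use to deliver messages, as ReceiverTracker.start does.
class MiniRpcEnv {
  private val endpoints = mutable.Map.empty[String, Endpoint]
  def setupEndpoint(name: String, ep: Endpoint): Endpoint = {
    endpoints(name) = ep
    ep
  }
}

case class StartAllReceivers(receiverIds: Seq[Int])

// Plays the role of ReceiverTrackerEndpoint: reacts to messages such as
// StartAllReceivers.
class TrackerEndpoint extends Endpoint {
  var started: Seq[Int] = Nil
  def receive(msg: Any): Unit = msg match {
    case StartAllReceivers(ids) => started = ids
    case _                      => // ignore other messages in this sketch
  }
}
```

Sending StartAllReceivers to the registered endpoint mirrors the endpoint.send(StartAllReceivers(receivers)) call we see next in launchReceivers.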
// ReceiverTracker.launchReceivers
/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    // One ReceiverInputDStream corresponds to exactly one Receiver
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  // endpoint here is the ReceiverTrackerEndpoint constructed in ReceiverTracker.start above
  endpoint.send(StartAllReceivers(receivers))
}
The source of runDummySparkJob is shown below:
// ReceiverTracker.runDummySparkJob
/**
 * Run the dummy Spark job to ensure that all slaves have registered. This avoids all the
 * receivers to be scheduled on the same node.
 *
 * TODO Should poll the executor number and wait for executors according to
 * "spark.scheduler.minRegisteredResourcesRatio" and
 * "spark.scheduler.maxRegisteredResourcesWaitingTime" rather than running a dummy job.
 */
private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}
Here makeRDD creates an RDD, followed by map, reduceByKey and collect. collect is an action, so it triggers execution of a Spark job. ReceiverTracker.runDummySparkJob runs this trivial job to ensure that all slave nodes have registered, i.e. that all nodes are alive, so that when Receivers are assigned later they will not all end up on a single node. The TODO comment shows that the authors would like to move away from this dummy-job approach.
Within collect, SparkContext.runJob is called, which in turn goes through RDD.partitions, ParallelCollectionRDD.getPartitions and ParallelCollectionRDD.slice, splitting one collection into multiple sub-collections so that multiple Receivers can run effectively on Spark. We will not list that Spark Core source here.
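The effect of that slicing can be sketched stand-alone. This is a simplified version of the boundary arithmetic, not Spark's exact ParallelCollectionRDD.slice (which also special-cases ranges for efficiency):

```scala
// Split a sequence into numSlices contiguous sub-sequences of near-equal
// size, using the (i * n / numSlices) boundary computation that
// ParallelCollectionRDD uses to position partitions.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  require(numSlices >= 1, "need at least one slice")
  val n = seq.length
  (0 until numSlices).map { i =>
    val start = (i * n) / numSlices
    val end = ((i + 1) * n) / numSlices
    seq.slice(start, end)
  }
}
```

slice(1 to 50, 50), the shape of the dummy job's makeRDD(1 to 50, 50), yields 50 single-element partitions, one per task, which is what spreads the job across every registered executor.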
Now back to getReceiver() in ReceiverTracker.launchReceivers:
// ReceiverInputDStream.getReceiver
/**
 * Gets the receiver object that will be sent to the worker nodes
 * to receive data. This method needs to be defined by any specific implementation
 * of a ReceiverInputDStream.
 */
def getReceiver(): Receiver[T] // returns a Receiver object
The getReceiver method of ReceiverInputDStream returns a Receiver object; the method is actually implemented by subclasses of ReceiverInputDStream.
Each ReceiverInputDStream subclass must also define its own Receiver subclass, because getReceiver creates an instance of that Receiver subclass. Understanding the ReceiverInputDStream subclasses and their corresponding Receiver subclasses matters, because for special data sources developers often need to implement their own.
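The subclass pairing can be sketched like this. InputDStreamSketch, ReceiverSketch, SocketLineInputDStream and SocketLineReceiver are illustrative names; the real example of such a pair in Spark is SocketInputDStream with its SocketReceiver.

```scala
// Each ReceiverInputDStream subclass pairs with a Receiver subclass and
// hands out an instance of it via getReceiver().
abstract class ReceiverSketch[T] extends Serializable

abstract class InputDStreamSketch[T] {
  def getReceiver(): ReceiverSketch[T] // implemented by each concrete subclass
}

// A concrete pair, modeled after Spark's SocketInputDStream / SocketReceiver.
class SocketLineReceiver(val host: String, val port: Int)
  extends ReceiverSketch[String]

class SocketLineInputDStream(host: String, port: Int)
  extends InputDStreamSketch[String] {
  override def getReceiver(): ReceiverSketch[String] =
    new SocketLineReceiver(host, port)
}
```

This is why a custom data source usually means writing both classes: the DStream subclass describes the stream inside the DAG, while the Receiver subclass is what actually gets serialized to an Executor and runs.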