Spark Analysis (8): Spark Streaming Execution Flow in Detail (3)

2021SC@SDUSC

Preface

The previous post analyzed the execution flow of StreamingContext; this post moves on to how Spark receives streaming data.

Spark Streaming Data Reception

Data reception involves the following work: a Receiver is launched on an Executor to continuously receive streaming data. The received records are discrete, so they must be collected into Blocks, handed to the BlockManager for storage, and the metadata of each saved Block must be reported to the Driver.

Let's first examine the Receiver class:

// Excerpt from Receiver
/**
 * ::DeveloperApi::
 * Abstract class of a receiver that can be run on worker nodes to receive external data. A
 * custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
 * should define the setup steps necessary to start receiving data,
 * and `onStop()` should define the cleanup steps necessary to stop receiving data.
 * Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
 * or stopped completely by `stop(...)`.
 *
 * A custom receiver in Scala would look like this.
 *
 * {{{
 *  class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) {
 *      def onStart() {
 *          // Setup stuff (start threads, open sockets, etc.) to start receiving data.
 *          // Must start new thread to receive data, as onStart() must be non-blocking.
 *
 *          // Call store(...) in those threads to store received data into Spark's memory.
 *
 *          // Call stop(...), restart(...) or reportError(...) on any thread based on how
 *          // different errors need to be handled.
 *
 *          // See corresponding method documentation for more details
 *      }
 *
 *      def onStop() {
 *          // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
 *      }
 *  }
 * }}}
 *  
...
*/

@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
...

The long doc comment above explains how to write a custom Receiver. Receiver is an abstract class and extends Serializable, because a Receiver is serialized on the Driver and shipped to an Executor before it starts receiving data. Spark already ships as many as 19 Receiver subclasses, which are not enumerated here.
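To make the comment above concrete, here is a minimal sketch of a custom Receiver, assuming a plain TCP text source; the class name SocketLineReceiver and the host/port parameters are made up for illustration and are not part of Spark:

// Illustrative custom Receiver: reads text lines from a TCP socket and stores them in Spark
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // onStart() must return quickly, so the blocking read loop runs in its own thread
  def onStart(): Unit = {
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up: isStopped() makes the receiving thread exit on its own
  def onStop(): Unit = {}

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)              // hand each record to Spark for block storage
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to reconnect")            // server closed the connection
    } catch {
      case e: java.net.ConnectException =>
        restart(s"Error connecting to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}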
Data reception on the Executor side is managed centrally by the ReceiverTracker on the Driver side.
As for the flow chart of ReceiverTracker startup, the process is fairly long, so a diagram is referenced here:

Main flow chart of a normal ReceiverTracker startup (first part)

In the figure, the steps below the bold arrow at SparkContext.submitJob run on the Executors, while all the other steps run on the Driver.
The source of ReceiverTracker.start is shown below:

//ReceiverTracker.start
/** Start the endpoint and receiver execution thread */
def start(): Unit = synchronized {
	if (isTrackerStarted) {
		throw new SparkException("ReceiverTracker already started")
	}


	// Receivers are launched only if there is at least one receiver input stream
	if (!receiverInputStreams.isEmpty) {
		endpoint = ssc.env.rpcEnv.setupEndpoint(
			"ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
		if (!skipReceiverLaunch) launchReceivers()
		logInfo("ReceiverTracker started")
		trackerState = Started
	}
}

ReceiverTracker.start has to set up the RPC endpoint, because ReceiverTracker monitors every Receiver in the cluster, and the Receivers in turn report their status, the data they have received, lifecycle events and so on back to the ReceiverTrackerEndpoint.
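The status that Receivers report back to the Driver is also exposed to user code through the public StreamingListener API. Below is a minimal sketch; the listener class name is made up, and the ReceiverInfo fields used here are recalled from memory, so treat it as an illustration rather than a verified listing:

// Illustrative listener that logs receiver lifecycle events reported back to the Driver
import org.apache.spark.streaming.scheduler.{StreamingListener,
  StreamingListenerReceiverError, StreamingListenerReceiverStarted,
  StreamingListenerReceiverStopped}

class ReceiverStatusListener extends StreamingListener {
  override def onReceiverStarted(started: StreamingListenerReceiverStarted): Unit = {
    val info = started.receiverInfo
    println(s"Receiver ${info.name} (stream ${info.streamId}) started on ${info.location}")
  }

  override def onReceiverError(error: StreamingListenerReceiverError): Unit = {
    val info = error.receiverInfo
    println(s"Receiver ${info.name} failed: ${info.lastErrorMessage}")
  }

  override def onReceiverStopped(stopped: StreamingListenerReceiverStopped): Unit = {
    println(s"Receiver ${stopped.receiverInfo.name} stopped")
  }
}

// Register it on the StreamingContext before ssc.start():
// ssc.addStreamingListener(new ReceiverStatusListener)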

// ReceiverTracker.launchReceivers
/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
 private def launchReceivers(): Unit = {
 	val receivers = receiverInputStreams.map(nis => {
 		// Each ReceiverInputDStream corresponds to exactly one Receiver
 		val rcvr = nis.getReceiver()
 		rcvr.setReceiverId(nis.id)
 		rcvr
 	})

	runDummySparkJob()

	logInfo("Starting " + receivers.length + " receivers")
	// endpoint here is the ReceiverTrackerEndpoint constructed in ReceiverTracker.start above
	endpoint.send(StartAllReceivers(receivers))

}

The source of runDummySparkJob is as follows:

// ReceiverTracker.runDummySparkJob

/**
 * Run the dummy Spark job to ensure that all slaves have registered. This avoids all the
 * receivers to be scheduled on the same node.
 *  
 * TODO Should poll the executor number and wait for executors according to
 * "spark.scheduler.minRegisteredResourcesRatio" and
 * "spark.scheduler.maxRegisteredResourcesWaitingTime" rather than running a dummy job.
 */
private def runDummySparkJob(): Unit = {
	if (!ssc.sparkContext.isLocal) {
		ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
	}
	 
	assert(getExecutors.nonEmpty)

}

Here makeRDD creates an RDD, and map, reduceByKey and collect are applied to it. collect is an action, so it triggers the execution of a Spark job. ReceiverTracker.runDummySparkJob therefore runs a trivial job simply to ensure that all slave nodes have registered, i.e. that they are all alive, so that when Receivers are scheduled later they will not all end up on the same node. The TODO in the comment shows that the developers would like to replace this dummy-job approach.
The collect call goes through SparkContext.runJob, which in turn invokes RDD.partitions, ParallelCollectionRDD.getPartitions and ParallelCollectionRDD.slice, splitting one collection into multiple sub-collections so that multiple Receivers can be run efficiently on Spark. The relevant Spark Core source is not listed one by one here; a simplified sketch of the slicing idea is given below.
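The following is only a simplified sketch of that slicing idea, not Spark's actual ParallelCollectionRDD.slice code: each of numSlices partitions gets a contiguous piece of the input sequence, so the 50-element dummy RDD ends up with one element per partition:

// Simplified illustration of cutting a sequence into numSlices contiguous sub-collections
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  require(numSlices >= 1, "numSlices must be positive")
  (0 until numSlices).map { i =>
    val start = (i * seq.length.toLong / numSlices).toInt
    val end = ((i + 1) * seq.length.toLong / numSlices).toInt
    seq.slice(start, end)
  }
}

// For the dummy job: 50 elements in 50 slices gives one element per partition/task,
// so the resulting tasks can be spread across all registered executors.
// slice(1 to 50, 50) => Seq(Seq(1), Seq(2), ..., Seq(50))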
Now go back to the getReceiver() call in ReceiverTracker.launchReceivers:

// ReceiverInputDStream.getReceiver
/**
 * Gets the receiver object that will be sent to the worker nodes
 * to receive data. This method needs to be defined by any specific implementation
 * of a ReceiverInputDStream.
 */
def getReceiver(): Receiver[T] // returns a Receiver object

ReceiverInputDStream.getReceiver returns a Receiver object. The method actually has to be implemented by the subclasses of ReceiverInputDStream.
A ReceiverInputDStream subclass must also define its own Receiver subclass, because getReceiver is where an instance of that Receiver subclass is created. Understanding the ReceiverInputDStream subclasses and their corresponding Receiver subclasses is important, because for special data sources developers often have to write their own, as sketched below.
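For a pluggable custom source the simplest route is StreamingContext.receiverStream, which wraps a Receiver in a ReceiverInputDStream (internally a PluggableInputDStream whose getReceiver simply returns the receiver you pass in). A minimal usage sketch, reusing the hypothetical SocketLineReceiver from earlier and a made-up host/port:

// Illustrative driver program wiring a custom Receiver into a DStream
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CustomReceiverApp")
    val ssc = new StreamingContext(conf, Seconds(2))

    // receiverStream wraps the Receiver in a ReceiverInputDStream; its getReceiver()
    // returns the receiver instance passed in here
    val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
    lines.count().print()

    ssc.start()             // ReceiverTracker.start() runs as part of this call
    ssc.awaitTermination()
  }
}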
