Lesson 10: Spark Streaming Source Code Analysis: A Thorough Study of the Full Lifecycle of Continuous Data Reception
This lesson continues our study of the lifecycle of data reception in Spark Streaming. We first consider the architectural pattern for receiving data, and then study the source code on that basis.
Big data is unlike other IT systems: in J2EE or other software development, architecture and design are the first concern, whereas for a big data application performance comes first and architectural design second.
Spark Streaming (when built on Receivers) receives data continuously. The Receiver and the Driver generally do not run in the same process, so the Receiver must keep reporting to the Driver while it receives; even without a Receiver, received data is reported to the Driver (the Direct API approach). If received data were not reported to the Driver, the Driver could not bring that data into its scheduling system. What is reported is metadata: RDDs are continuously constructed on top of the received data (receiving nothing is also data, and an empty RDD is constructed).
Receiving data implies a loop that keeps pulling data in, and a store that persists what was received; only after storing does the Receiver report to the Driver. Receiving and storing naturally belong to two separate modules.
The overall flow of receiving data in Spark Streaming resembles the MVC pattern: the Model (M) is the Receiver, the View (V) is the Driver, and the Controller (C) is the ReceiverSupervisor.
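Before diving into the source, the receive-store-report split is easy to see in miniature. Below is a minimal sketch of a hypothetical custom receiver (CounterReceiver is a made-up name) built on Spark's public Receiver API: the receive loop runs on its own thread, and each store() call hands the record to the supervisor, which buffers it into blocks and reports block metadata to the Driver.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A hypothetical minimal receiver: a loop thread produces records, and
// store() delegates storage and Driver reporting to the supervisor.
class CounterReceiver extends Receiver[Long](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = {
    new Thread("Counter Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        var n = 0L
        while (!isStopped()) { // the receive loop
          store(n)             // storage + reporting happen behind this call
          n += 1
          Thread.sleep(10)
        }
      }
    }.start()
  }
  def onStop(): Unit = { }     // the loop exits once isStopped() is true
}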
We should look both at the facts (the source code) and at the intent behind them. Let us start with ReceiverTracker.startReceiver, which launches a Receiver on a worker node:
// Function to start the receiver on the worker node.
// If the receiver fails after starting, it is not retried in place; it can
// only be rescheduled as a new job.
// ReceiverSupervisorImpl supervises the receiver and also performs the
// writes of the data the receiver hands over.
// The function takes an Iterator; in fact it contains exactly one Receiver.
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    // The iterator must contain an element.
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    // The attempt number must be 0: a failed Receiver is restarted by
    // rescheduling a new job, never by task-level retry.
    if (TaskContext.get().attemptNumber() == 0) {
      // Get the receiver, which comes from the InputDStream that defines the
      // data source. For SocketInputDStream, for example, it is a
      // SocketReceiver. At this point the receiver has not been started yet;
      // it is passed into ReceiverSupervisorImpl as a constructor argument.
      val receiver = iterator.next()
      assert(iterator.hasNext == false)
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }
// Create the RDD using the scheduledLocations to run the receiver in a Spark job
// Wrap the single receiver into an RDD (with preferred locations if any were scheduled).
val receiverRDD: RDD[Receiver[_]] =
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }
receiverRDD.setName(s"Receiver $receiverId")
ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
// A dedicated Spark job is submitted to start each Receiver; Receivers are
// started one by one. Starting all Receivers as different tasks of a single
// job would have several weaknesses:
// 1. One Receiver failing to start could fail the whole application.
// 2. Task skew: a single job's tasks are placed by Spark Core's scheduler,
//    and in an unlucky case all Receivers could land on one node. Since a
//    Receiver keeps receiving data and consumes a lot of resources, that
//    node's load would become very heavy.
// Running each Receiver as its own job gives the best possible load
// balancing. Such a job can still fail, but a failed job is not retried;
// instead a new job is rescheduled and submitted to run the Receiver, and it
// will not be started on the executor where it previously ran. As long as
// the Spark Streaming application keeps running, a failed Receiver is
// rescheduled and restarted again and again, so the Receiver is guaranteed
// to come up eventually.
// Also important: when a Receiver is restarted, the new job is submitted on
// a separate thread from a thread pool (submitJobThreadPool below).
val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
  receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
// We will keep restarting the receiver job until ReceiverTracker is stopped
future.onComplete {
  case Success(_) =>
    if (!shouldStartReceiver) {
      onReceiverJobFinish(receiverId)
    } else {
      logInfo(s"Restarting Receiver $receiverId")
      self.send(RestartReceiver(receiver))
    }
  case Failure(e) =>
    if (!shouldStartReceiver) {
      onReceiverJobFinish(receiverId)
    } else {
      logError("Receiver has been stopped. Try to restart it.", e)
      logInfo(s"Restarting Receiver $receiverId")
      self.send(RestartReceiver(receiver))
    }
}(submitJobThreadPool)
logInfo(s"Receiver ${receiver.streamId} started")
}
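SparkContext.submitJob, used above, is a public (developer) API, so the job-per-receiver pattern can be tried in isolation. The following standalone toy (not Spark's code; all names are made up) mirrors the essentials: a one-partition RDD, a function submitted against partition 0 only, and an onComplete callback where the real ReceiverTracker decides whether to reschedule.

import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("SubmitJobDemo"))
    // A one-partition RDD, analogous to receiverRDD above.
    val rdd = sc.makeRDD(Seq("payload"), 1)

    // Submit a job that runs only on partition 0, mirroring Seq(0) above.
    val future = sc.submitJob[String, Unit, Unit](
      rdd,
      (it: Iterator[String]) => it.foreach(x => println(s"task got: $x")),
      Seq(0),       // partitions to run on
      (_, _) => (), // per-partition result handler
      ())           // overall result

    future.onComplete {
      case Success(_) => println("job finished; a tracker could reschedule here")
      case Failure(e) => println(s"job failed: $e")
    }
    Await.ready(future, 1.minute)
    sc.stop()
  }
}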
ReceiverSupervisorImpl is responsible for handling the data the Receiver receives and for reporting to the ReceiverTracker after storing it, so ReceiverSupervisorImpl holds endpoints for communicating with the ReceiverTracker.
This reference is used to send messages to the ReceiverTracker:
private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)
And this endpoint receives messages sent by the ReceiverTracker: CleanupOldBlocks removes the blocks of batches that have finished running, and UpdateRateLimit adjusts the rate limit at any time (the rate limit actually limits the speed at which data is stored):
/** RpcEndpointRef for receiving messages from the ReceiverTracker in the driver */
private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv

    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })
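UpdateRateLimit ultimately reaches the BlockGenerators, whose RateLimiter superclass throttles how fast records may be pushed into the buffer, i.e. the storage speed mentioned above. Below is a minimal sketch of that idea, assuming Guava's RateLimiter (which Spark's own implementation also builds on); the class and method shapes here are made up for illustration:

import com.google.common.util.concurrent.{RateLimiter => GuavaRateLimiter}

// Sketch: every record must acquire a permit before it is buffered, and
// the driver may raise or lower the permitted records-per-second at any time.
class SimpleRateLimiter(initialRecordsPerSec: Double) {
  private val limiter = GuavaRateLimiter.create(initialRecordsPerSec)

  // Called on the data path before buffering a record (cf. waitToPush()).
  def waitToPush(): Unit = limiter.acquire()

  // Called when an UpdateRateLimit message arrives from the driver.
  def updateRate(newRecordsPerSec: Double): Unit =
    if (newRecordsPerSec > 0) limiter.setRate(newRecordsPerSec)
}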
Next is the start method of ReceiverSupervisor:
def start() {
  onStart()
  startReceiver()
}
onStart starts the BlockGenerators. A BlockGenerator assembles the records received one at a time into blocks and stores them; one BlockGenerator serves exactly one Receiver, which is why it must be started before the Receiver itself:
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
BlockGenerator contains a timer that fires at a fixed interval (200 ms by default, set by spark.streaming.blockInterval; it is unrelated to the configured batch duration) and executes the method below. Records are received one at a time and appended to a buffer, and at each interval this method merges the accumulated buffer into a block. Besides the timer there is another thread that continuously hands the generated blocks to the BlockManager for storage.
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      blocksForPushing.put(newBlock) // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
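The two-thread structure, one timer that cuts the buffer into blocks and one push thread that drains the queue into storage, can be sketched independently of Spark's classes. Everything below (MiniBlockGenerator and its members) is made up for illustration:

import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

class MiniBlockGenerator(intervalMs: Long, storeBlock: Seq[Any] => Unit) {
  private var currentBuffer = new ArrayBuffer[Any]
  private val blocksForPushing = new ArrayBlockingQueue[Seq[Any]](10)
  private val timer = Executors.newSingleThreadScheduledExecutor()

  // Records arrive one at a time (cf. BlockGenerator.addData).
  def addData(record: Any): Unit = synchronized { currentBuffer += record }

  // Swap the buffer out under the same lock addData uses
  // (cf. updateCurrentBuffer above).
  private def swapBuffer(): Seq[Any] = synchronized {
    if (currentBuffer.isEmpty) Nil
    else { val block = currentBuffer; currentBuffer = new ArrayBuffer[Any]; block }
  }

  def start(): Unit = {
    // Timer thread: every intervalMs, turn the buffer into a block.
    timer.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        val block = swapBuffer()
        if (block.nonEmpty) blocksForPushing.put(block) // blocks when queue is full
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)

    // Push thread: continuously drain finished blocks to storage
    // (cf. the thread that hands blocks to the BlockManager).
    val pusher = new Thread("block-pushing-thread") {
      override def run(): Unit =
        try { while (true) storeBlock(blocksForPushing.take()) }
        catch { case _: InterruptedException => } // exit on interrupt
    }
    pusher.setDaemon(true)
    pusher.start()
  }
}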
Now let's look at the startReceiver method:
/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
Before starting the Receiver, the supervisor asks the ReceiverTracker whether the Receiver may start; the Receiver is started only if the answer is true. On receiving this message, the ReceiverTracker records the Receiver's registration information:
override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}
Starting the Receiver is just the call receiver.onStart(); after that the Receiver is running on the worker node.
Taking SocketReceiver as an example, let's look at its onStart method:
private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do as the thread calling receive()
    // is designed to stop by itself when isStopped() returns true
  }

  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }
}
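To see the whole pipeline end to end, wiring a SocketReceiver into an application only takes ssc.socketTextStream, which creates the SocketInputDStream whose getReceiver() returns the SocketReceiver shown above. A standard word-count driver (host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketStreamDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: the Receiver occupies one core, so at least two are needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketStreamDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The ReceiverTracker will submit a dedicated job to start this
    // stream's SocketReceiver on an executor.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}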