Spark Analysis (10): A Detailed Look at the Spark Streaming Execution Flow (5)

2021SC@SDUSC

Preface

The previous post finished analyzing the initial data-receiving flow of Spark Streaming. This post moves on to Spark Streaming's data cleanup.

Spark Streaming Data Cleanup

A Spark Streaming application runs continuously. Without effective memory management, memory could be exhausted quickly, so Spark Streaming has its own cleanup mechanism for objects, data, and metadata.
These objects, data, and metadata accumulate batch after batch as DStreams are operated on.
First, the overall flow chart of data cleanup:

Figure 1: overall flow of Spark Streaming data cleanup
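To make concrete what gets cleaned up, here is a minimal sketch (my own illustrative example, not code from the Spark source) of a streaming word-count application; every 5-second batch of such an app produces a JobSet, RDDs, and block metadata that the flow below must eventually reclaim:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CleanupDemo {
	def main(args: Array[String]): Unit = {
		val conf = new SparkConf().setMaster("local[2]").setAppName("CleanupDemo")
		// Each 5-second batch generates jobs plus per-batch RDDs and metadata
		val ssc = new StreamingContext(conf, Seconds(5))
		val lines = ssc.socketTextStream("localhost", 9999)
		lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
		ssc.start()
		ssc.awaitTermination()
	}
}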

The flow begins one step earlier, in JobScheduler.JobHandler.run, where the job produced by JobGenerator actually runs:

// JobScheduler.JobHandler.run

def run() {
	try {
		...
		var _eventLoop = eventLoop
		if (_eventLoop != null) {
			_eventLoop.post(JobStarted(job, clock.getTimeMillis()))
			// Disable checks for existing output directories in jobs launched by the
			// streaming scheduler, since we may need to write output to an existing
			// directory during checkpoint recovery; see SPARK-4835 for more details.
			PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
				job.run()
			}
			_eventLoop = eventLoop
			if (_eventLoop != null) {
				_eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
			}
		} else {
			// JobScheduler has been stopped.
		}
	} finally {
		ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
		ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
	}
}

run posts a JobCompleted message once the job finishes. JobScheduler.processEvent defines how such messages are handled:

// JobScheduler.processEvent

private def processEvent(event: JobSchedulerEvent) {
	try {
		event match {
			case JobStarted(job, startTime) => handleJobStart(job, startTime)
			case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
			case ErrorReported(m, e) => handleError(m, e)
		}
	} catch {
		case e: Throwable =>
			reportError("Error in job scheduler", e)
	}
}
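For reference, the three event types matched here are simple case classes. Sketched roughly from Spark's JobScheduler.scala (exact modifiers may differ across Spark versions):

private[scheduler] sealed trait JobSchedulerEvent
private[scheduler] case class JobStarted(job: Job, startTime: Long) extends JobSchedulerEvent
private[scheduler] case class JobCompleted(job: Job, completedTime: Long) extends JobSchedulerEvent
private[scheduler] case class ErrorReported(msg: String, e: Throwable) extends JobSchedulerEvent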

Handling a JobCompleted event means calling handleJobCompletion:

// JobScheduler.handleJobCompletion

private def handleJobCompletion(job: Job, completedTime: Long) {
	val jobSet = jobSets.get(job.time)
	jobSet.handleJobCompletion(job)
	job.setEndTime(completedTime)
	listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
	logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
	if (jobSet.hasCompleted) {
		jobSets.remove(jobSet.time)
		jobGenerator.onBatchCompletion(jobSet.time)
		logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
			jobSet.totalDelay / 1000.0, jobSet.time.toString,
			jobSet.processingDelay / 1000.0
		))
		listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
	}
	job.result match {
		case Failure(e) =>
			reportError("Error running job " + job, e)
		case _ =>
	}
}
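The delays logged above come from JobSet: totalDelay measures from the batch's scheduled time to the end of processing, while processingDelay covers only the execution itself. A rough sketch with made-up timestamps (the field names mirror JobSet, but the values are hypothetical):

object DelayDemo {
	def main(args: Array[String]): Unit = {
		val batchTime           = 1000L // ms: when the batch was due
		val processingStartTime = 1200L // ms: first job started 200 ms late
		val processingEndTime   = 1700L // ms: last job of the JobSet finished

		val processingDelay = processingEndTime - processingStartTime // 500 ms of execution
		val totalDelay      = processingEndTime - batchTime           // 700 ms end to end

		println(f"Total delay: ${totalDelay / 1000.0}%.3f s (execution: ${processingDelay / 1000.0}%.3f s)")
	}
}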

handleJobCompletion removes the completed JobSet from jobSets, and also calls jobGenerator.onBatchCompletion:

// JobGenerator.onBatchCompletion

/**
 * Callback called when a batch has been completely processed.
 */
def onBatchCompletion(time: Time) {
	eventLoop.post(ClearMetadata(time))
}

onBatchCompletion posts a ClearMetadata message. Now look at the definition of eventLoop in JobGenerator.start:

// JobGenerator.start (fragment)

eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
	override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

	override protected def onError(e: Throwable): Unit = {
		jobScheduler.reportError("Error in job generator", e)
	}
}
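EventLoop is an internal Spark utility: post enqueues an event and returns immediately, and a dedicated thread drains the queue and calls onReceive for each event. A minimal sketch of the same pattern (my own simplified re-implementation, not Spark's org.apache.spark.util.EventLoop):

import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
	private val queue = new LinkedBlockingDeque[E]()
	private val thread = new Thread(name) {
		override def run(): Unit = {
			try {
				while (true) {
					val event = queue.take() // block until an event is posted
					try {
						onReceive(event)
					} catch {
						case e: Throwable => onError(e) // per-event errors do not kill the loop
					}
				}
			} catch {
				case _: InterruptedException => // stop() interrupts the thread; exit quietly
			}
		}
	}
	def start(): Unit = thread.start()
	def stop(): Unit = thread.interrupt()
	def post(event: E): Unit = queue.put(event) // callers never block on processing
	protected def onReceive(event: E): Unit
	protected def onError(e: Throwable): Unit
}

This is why onBatchCompletion can post ClearMetadata cheaply from the job-completion path: the actual cleanup work happens later, on the event-loop thread.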

From this we can see that messages are handled by JobGenerator.processEvent, as designated by eventLoop.onReceive:

// JobGenerator.processEvent

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
	logDebug("Got event " + event)
	event match {
		case GenerateJobs(time) => generateJobs(time)
		case ClearMetadata(time) => clearMetadata(time)
		case DoCheckpoint(time, clearCheckpointDataLater) =>
			doCheckpoint(time, clearCheckpointDataLater)
		case ClearCheckpointData(time) => clearCheckpointData(time)
	}
}
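As with JobSchedulerEvent, these generator events are plain case classes. Sketched roughly from Spark's JobGenerator.scala (verify against your Spark version):

private[scheduler] sealed trait JobGeneratorEvent
private[scheduler] case class GenerateJobs(time: Time) extends JobGeneratorEvent
private[scheduler] case class ClearMetadata(time: Time) extends JobGeneratorEvent
private[scheduler] case class DoCheckpoint(time: Time, clearCheckpointDataLater: Boolean) extends JobGeneratorEvent
private[scheduler] case class ClearCheckpointData(time: Time) extends JobGeneratorEvent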

Among these, the ClearMetadata (clear metadata) message is handled by clearMetadata:

// JobGenerator.clearMetadata

/** Clear DStream metadata for the given `time`. */
private def clearMetadata(time: Time) {
	ssc.graph.clearMetadata(time)

	// If checkpointing is enabled, then checkpoint,
	// else mark batch to be fully processed
	if (shouldCheckpoint) {
		eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
	} else {
		// If checkpointing is not enabled, then delete metadata information about
		// received blocks (block data not saved in any case). Otherwise, wait for
		// checkpointing of this batch to complete.
		val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
		jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
		jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
		markBatchFullyProcessed(time)
	}
}

As the code shows, several cleanup tasks run here. The cleanup of receiverTracker and inputInfoTracker has a precondition, however: checkpointing is not enabled. With checkpointing enabled, their cleanup is deferred until the checkpoint of this batch completes.
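A small sketch (assumed durations, using the public Time and Duration classes) of the cutoff computed above: anything older than time - maxRememberDuration can be deleted, because no input DStream still remembers RDDs that far back:

import org.apache.spark.streaming.{Duration, Time}

object CutoffDemo {
	def main(args: Array[String]): Unit = {
		val batchTime = Time(60000L)               // the batch just completed, t = 60 s
		val maxRememberDuration = Duration(20000L) // longest rememberDuration among input streams
		val cutoff = batchTime - maxRememberDuration
		println(s"cleanup blocks and batch metadata before $cutoff") // before 40000 ms
	}
}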
