In this installment:
1 Handling empty RDDs in Spark Streaming
2 Stopping a Spark Streaming application
1 Handling empty RDDs in Spark Streaming
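In a Spark Streaming application, whichever DStream you use, the underlying work is always performed on RDDs.
Let's start from a fragment of an application and analyze it: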
...
val lines = ssc.socketTextStream("Master", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach { record =>
      val sql = "insert into streaming_itemcount(item,count) values('" + record._1 + "'," + record._2 + ")"
      val stmt = connection.createStatement()
      stmt.executeUpdate(sql)
      stmt.close() // release the statement once the insert has run
    }
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
...
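This program has a problem: inside wordCounts.foreachRDD, each rdd is processed without first checking whether it is empty.
Even when the rdd is empty, the job still acquires compute resources such as CPU cores and runs the computation inside it. That is clearly wasteful.
Spark does define EmptyRDD, whose compute method throws an exception, but in practice Spark applications do not use EmptyRDD.
So before processing each rdd, the program should check whether it is empty.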
Now look at RDD.isEmpty:
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
So the earlier application code can be fixed by adding one line:
wordCounts.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    rdd.foreachPartition { partitionOfRecords =>
      ...
    }
...
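To make the behavior of RDD.isEmpty concrete, here is a minimal sketch that checks four RDD shapes in local mode. The object name IsEmptyDemo, the method check, and the local[2] master are assumptions made for illustration; they are not part of the original program. Note that when an RDD has partitions, isEmpty falls through to take(1), which itself runs a small job, so the guard is cheap but not free.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: IsEmptyDemo and check are hypothetical names.
object IsEmptyDemo {
  /** Returns isEmpty for: no partitions, empty partitions, filtered to nothing, non-empty. */
  def check(sc: SparkContext): (Boolean, Boolean, Boolean, Boolean) = {
    val noPartitions    = sc.emptyRDD[Int]                         // zero partitions: first clause is true
    val emptyPartitions = sc.parallelize(Seq.empty[Int], 4)        // 4 partitions, no records: take(1) is empty
    val filteredAway    = sc.parallelize(1 to 10, 2).filter(_ > 100) // records exist upstream, none survive
    val nonEmpty        = sc.parallelize(1 to 10, 2)
    (noPartitions.isEmpty(), emptyPartitions.isEmpty(), filteredAway.isEmpty(), nonEmpty.isEmpty())
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster
    val conf = new SparkConf().setMaster("local[2]").setAppName("IsEmptyDemo")
    val sc = new SparkContext(conf)
    try println(check(sc)) finally sc.stop()
  }
}
```

The second and third cases are the ones that matter for streaming: a batch interval with no input typically still yields an RDD object, and only the take(1) clause reveals that it holds no records.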
2 Stopping a Spark Streaming application
First look at StreamingContext.stop:
def stop(
    stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
  ): Unit = synchronized {
  stop(stopSparkContext, false)
}
To stop a Spark Streaming application properly, use the other stop overload:
def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = {
  var shutdownHookRefToRemove: AnyRef = null
  if (AsynchronousListenerBus.withinListenerThread.value) {
    throw new SparkException("Cannot stop StreamingContext within listener thread of" +
      " AsynchronousListenerBus")
  }
  synchronized {
    try {
      state match {
        case INITIALIZED =>
          logWarning("StreamingContext has not been started yet")
        case STOPPED =>
          logWarning("StreamingContext has already been stopped")
        case ACTIVE =>
          scheduler.stop(stopGracefully)
          // Removing the streamingSource to de-register the metrics on stop()
          env.metricsSystem.removeSource(streamingSource)
          uiTab.foreach(_.detach())
          StreamingContext.setActiveContext(null)
          waiter.notifyStop()
          if (shutdownHookRef != null) {
            shutdownHookRefToRemove = shutdownHookRef
            shutdownHookRef = null
          }
          logInfo("StreamingContext stopped successfully")
      }
    } finally {
      // The state should always be Stopped after calling `stop()`, even if we haven't started yet
      state = STOPPED
    }
  }
  if (shutdownHookRefToRemove != null) {
    ShutdownHookManager.removeShutdownHook(shutdownHookRefToRemove)
  }
  // Even if we have already stopped, we still need to attempt to stop the SparkContext because
  // a user might stop(stopSparkContext = false) and then call stop(stopSparkContext = true).
  if (stopSparkContext) sc.stop()
}
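The stopGracefully parameter defaults to false; in production it should be set to true. To do so, set spark.streaming.stopGracefullyOnShutdown to true in the configuration. This ensures that work already in progress runs to completion before the application stops, which preserves the integrity of data processing.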
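As a minimal sketch of the two ways to request a graceful stop, the snippet below sets the configuration key on a SparkConf and also shows the explicit two-argument stop call. The object name GracefulStopConfig, its method names, and the application name are assumptions for illustration; the same key could equally be supplied through spark-defaults.conf or a --conf option of spark-submit.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext

// Illustrative sketch only: GracefulStopConfig and its methods are hypothetical names.
object GracefulStopConfig {
  /** Builds a conf whose shutdown hook will stop the streaming context gracefully. */
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("GracefulStopSketch") // assumed application name
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

  /** Explicit alternative: ask for a graceful stop directly, also stopping the SparkContext. */
  def stopExplicitly(ssc: StreamingContext): Unit =
    ssc.stop(stopSparkContext = true, stopGracefully = true)
}
```

The configuration key only affects the stop issued from the JVM shutdown hook; a direct call to the one-argument stop still passes false for stopGracefully, as the source above shows.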
How does a Spark Streaming program achieve this? StreamingContext.stopOnShutdown calls the stop method above.
StreamingContext.stopOnShutdown:
private def stopOnShutdown(): Unit = {
  val stopGracefully = conf.getBoolean("spark.streaming.stopGracefullyOnShutdown", false)
  logInfo(s"Invoking stop(stopGracefully=$stopGracefully) from shutdown hook")
  // Do not stop SparkContext, let its own shutdown hook stop it
  stop(stopSparkContext = false, stopGracefully = stopGracefully)
}
StreamingContext.start registers a hook that calls stopOnShutdown:
StreamingContext.start:
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
So when the StreamingContext starts, it registers a shutdown hook which ensures that, at shutdown, the stop method that takes the stopGracefully parameter is called.
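The hook pattern can be sketched in plain Scala without Spark. The MiniContext class below is a hypothetical stand-in, not Spark's implementation: it uses the JVM's Runtime shutdown hooks directly rather than Spark's ShutdownHookManager, and it only records whether the stop was graceful instead of draining any real work.

```scala
// Illustrative sketch only: every name here is hypothetical and merely mirrors
// the start / stopOnShutdown / stop relationship described above.
object ShutdownHookSketch {
  sealed trait State
  case object Initialized extends State
  case object Active extends State
  case object Stopped extends State

  class MiniContext(gracefulOnShutdown: Boolean) {
    private var state: State = Initialized
    private var hook: Option[Thread] = None
    @volatile var lastStopWasGraceful: Option[Boolean] = None

    /** Registers the shutdown hook at start, as StreamingContext.start does. */
    def start(): Unit = synchronized {
      if (state == Initialized) {
        val t = new Thread(new Runnable { def run(): Unit = stopOnShutdown() })
        Runtime.getRuntime.addShutdownHook(t)
        hook = Some(t)
        state = Active
      }
    }

    /** What the hook runs: the graceful flag comes from configuration, not from the caller. */
    def stopOnShutdown(): Unit = stop(graceful = gracefulOnShutdown, fromHook = true)

    def stop(graceful: Boolean, fromHook: Boolean = false): Unit = synchronized {
      if (state == Active) {
        lastStopWasGraceful = Some(graceful)
        // The JVM rejects hook removal while shutdown is in progress, so only
        // deregister when stop was called by the user rather than by the hook.
        if (!fromHook) hook.foreach(t => Runtime.getRuntime.removeShutdownHook(t))
        hook = None
      }
      state = Stopped // always end in Stopped, even if never started
    }

    def currentState: State = synchronized(state)
  }
}
```

A user-initiated stop removes the hook so it does not fire later, while a stop that originates from the hook leaves registration alone; this is the same division of labor as shutdownHookRefToRemove in the real stop method.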