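Spark Streaming's updateStateByKey maintains a piece of state for each key as time passes, relying on checkpointing, and updates that state through a user-supplied update function. For each new batch of data, updateStateByKey updates the state of the keys that already exist (a key seen for the first time starts with no previous state). If the update function returns None, the state for that key is deleted. Because the state of every key is updated continuously, it has to be saved and made fault tolerant, so the checkpoint mechanism must be enabled. Checkpoint data can be stored on a file system (local disk or HDFS).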
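The update semantics described above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the object name UpdateStateSketch, the step helper, and the rule "drop a key when a batch brings no new values for it" are all hypothetical and are not part of Spark or of the original code.

```scala
object UpdateStateSketch {
  // Same shape as the function passed to updateStateByKey:
  // the new values for a key in this batch, plus the key's previous state.
  // Hypothetical rule for illustration: return None (delete the state)
  // when the batch contains no new values for the key.
  def updateFunc(values: Seq[Int], state: Option[Int]): Option[Int] =
    if (values.isEmpty) None
    else Some(values.sum + state.getOrElse(0))

  // Plain-Scala stand-in for one batch of state updates (no Spark involved).
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    // Group the batch by key, keeping only the values.
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Every key with existing state or new data is passed to the update function.
    val keys = state.keySet ++ grouped.keySet
    keys.toSeq.flatMap { k =>
      // A None result drops the key from the new state.
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val s1 = step(Map.empty, Seq("a" -> 1, "b" -> 1, "a" -> 1))
    val s2 = step(s1, Seq("a" -> 1))
    println(s1)
    println(s2)
  }
}
```

After the first batch the state holds a = 2 and b = 1. The second batch carries no value for b, so the update function returns None for it and only a = 3 remains.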
Source code of Spark Streaming updateStateByKey:
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
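The key step in the source above is wrapping the per-key update function into a function over an iterator of (key, new values, previous state) triples. A minimal sketch of that lifting in plain Scala follows. The names LiftUpdateFunc, lift, and sumOrDrop are hypothetical and chosen for illustration only; the flatMap expression is the one from the source.

```scala
object LiftUpdateFunc {
  // Lift a per-key update function into the iterator form.
  // flatMap over the Option result means a key whose update returns None
  // produces no output element, which is how its state gets dropped.
  def lift[K, V, S](updateFunc: (Seq[V], Option[S]) => Option[S])
      : Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator => iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

  def main(args: Array[String]): Unit = {
    // Hypothetical update rule: keep a running sum, drop the key if the sum is 0.
    val sumOrDrop = (values: Seq[Int], state: Option[Int]) => {
      val total = values.sum + state.getOrElse(0)
      if (total > 0) Some(total) else None
    }
    val lifted = lift[String, Int, Int](sumOrDrop)
    val input = Iterator(
      ("a", Seq(1, 2), Option(3)),              // 1 + 2 + 3 = 6, kept
      ("b", Seq.empty[Int], Option.empty[Int])  // sum is 0, dropped
    )
    println(lifted(input).toList)
  }
}
```

Here only the key a survives, with the value 6; the key b yields None and therefore does not appear in the result.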
Hands-on code (with detailed comments):
import com.typesafe.config.ConfigFactory
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object updateBykey {
  def main(args: Array[String]): Unit = {
    // Load the configuration file (provides spark.appName and spark.master)
    implicit val conf = ConfigFactory.load()

    // Initialize SparkConf
    val sparkConf = new SparkConf()
      .setAppName(conf.getString("spark.appName"))
      .setMaster(conf.getString("spark.master"))

    // Initialize StreamingContext with a 20-second batch interval
    val IntervalBatch = Milliseconds(20000)
    val ssc = new StreamingContext(sparkConf, IntervalBatch)

    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      // Sum of the counts that arrived for this key in the current batch
      val currentCount = values.sum
      // Previous count stored in the state (0 if the key is new)
      val previousCount = state.getOrElse(0)
      // New state: current count plus previous count
      Some(currentCount + previousCount)
    }

    // Lift the per-key update function into the iterator form
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val partitioner = new HashPartitioner(ssc.sparkContext.defaultParallelism)

    // Checkpointing is required by updateStateByKey
    ssc.checkpoint("/usr/local/src/checkpoint")

    // Initial state for the keys "key" and "value"
    val initialRDD = ssc.sparkContext.parallelize(List(("key", 1), ("value", 1)))

    // Read lines from a socket, split them into words, and map each word to (word, 1)
    val input = ssc.socketTextStream("localhost", 8888)
    val pair = input.flatMap(_.split(" ")).map(word => (word, 1))

    // Running word count; true means the partitioner is remembered in the generated RDDs
    val wordCount = pair.updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
    wordCount.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Code on GitHub: https://github.com/DragonTong/Streaming/blob/master/src/main/scala/streaming/updateBykey.scala
Test input stream: nc -lk (the code above reads from localhost, port 8888)