A Detailed Look at the updateStateByKey Method

lucasmaluping

Category: Spark

Article link: https://blog.csdn.net/lucasmaluping/article/details/103236580

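The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps:
1. Define the state. The state can be of any data type.
2. Define the state update function. Use a function to specify how to compute the new state from the previous state and the new values from the input stream.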

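Spark's documentation is vague about how to use this function, and the material online is not detailed enough, so I read through the source code, summarized it here, and give a complete example at the end.
This post walks through six overloads of updateStateByKey:
1. Passing only an update function, which is the simplest form.
The update function takes two parameters, Seq[V] and Option[S]. The former is the collection of new values for a key in the current batch, and the latter is the currently saved state for that key.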
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
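For example, for word count we can define the update function like this:
(values: Seq[Int], state: Option[Int]) => {
  // Start from the saved count for this word, or 0 if there is no state yet
  var newValue = state.getOrElse(0)
  for (value <- values) {
    newValue += value // add up the new occurrences of the word
  }
  Option(newValue)
}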
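To see how this update function behaves from batch to batch, the sketch below calls it directly in plain Scala, with no Spark involved. The object name UpdateCountDemo, the name updateCount, and the sample values are illustrative assumptions, not part of the original post.

```scala
object UpdateCountDemo {
  // Hypothetical stand-alone copy of the word-count update function above.
  val updateCount: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(state.getOrElse(0) + values.sum)

  // Batch 1: the key is new, so there is no previous state.
  val afterBatch1: Option[Int] = updateCount(Seq(1, 1), None)
  // Batch 2: one more occurrence is added to the saved count.
  val afterBatch2: Option[Int] = updateCount(Seq(1), afterBatch1)
  // Batch 3: no new values for this key, so the count is carried over.
  val afterBatch3: Option[Int] = updateCount(Seq.empty, afterBatch2)

  def main(args: Array[String]): Unit =
    println(s"$afterBatch1 $afterBatch2 $afterBatch3")
}
```

The third call shows that a key with no new values in a batch still keeps its state, because the function is handed an empty Seq. Returning None instead of Some(...) removes the key from the state, which is how stale keys can be dropped.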
2. Passing an update function and the number of partitions
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
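As far as I can tell from the Spark source, defaultPartitioner(numPartitions) builds a HashPartitioner, so numPartitions decides how many partitions the state is spread over. The sketch below imitates that key-to-partition mapping in plain Scala. partitionFor is a hypothetical helper written for illustration, not a Spark API.

```scala
object HashPartitionSketch {
  // Hypothetical helper: maps a key to a partition index in [0, numPartitions)
  // using a non-negative modulo of the key's hashCode.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("hello", "world", "spark")
    // The same key always maps to the same partition, so its state stays together.
    words.foreach(w => println(s"$w -> partition ${partitionFor(w, 4)}"))
  }
}
```

Because the mapping depends only on the key's hash and the partition count, changing numPartitions changes where each key's state lives but not the result of the update function.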
3. Passing an update function and a custom partitioner
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
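4. Passing a full state update function
The update functions passed to the previous overloads are partial in the sense that they only handle a single key. When they run, Spark wraps them into a full state update function, as the third overload shows.
The full function has type (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]. Its input is an iterator of tuples: element 1 is the key, element 2 is the collection of new values for that key in the current batch, and element 3 is the current state. It returns (key, new state) pairs.
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}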
For example, for word count (function1 is the per-key update function defined in the complete example at the end):

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
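The wrapping step can be tried without Spark. The sketch below lifts a per-key function into the iterator form and runs it on a few hand-written tuples. The names LiftUpdateSketch, FullUpdate, lift, and perKey, as well as the sample data, are assumptions made for illustration. Here perKey returns None when the total is zero, to show that flatMap then drops that key from the output.

```scala
object LiftUpdateSketch {
  // Shape of the full state update function described above.
  type FullUpdate[K, V, S] = Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]

  // Hypothetical per-key update: sum the counts, and drop the key when the total is 0.
  def perKey(values: Seq[Int], state: Option[Int]): Option[Int] = {
    val total = values.sum + state.getOrElse(0)
    if (total == 0) None else Some(total)
  }

  // Lifts a per-key function into the full form, mirroring the flatMap
  // used in the overloads above.
  def lift[K, V, S](f: (Seq[V], Option[S]) => Option[S]): FullUpdate[K, V, S] =
    it => it.flatMap { case (k, vs, s) => f(vs, s).map(ns => (k, ns)) }

  val input: List[(String, Seq[Int], Option[Int])] = List(
    ("hello", Seq(1, 1), Some(3)), // existing key with two new values
    ("world", Seq(1), None),       // brand-new key
    ("gone", Seq.empty, None)      // nothing to count, so perKey returns None
  )

  val output: List[(String, Int)] =
    lift[String, Int, Int](perKey)(input.iterator).toList

  def main(args: Array[String]): Unit = println(output)
}
```

Writing the full function by hand is mainly useful when the update logic needs the key itself, which the per-key form never sees.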

5. Adding an initial state
initialRDD: RDD[(K, S)] is the set of initial states, one (key, state) pair per key.
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
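6. Controlling whether the partitioner is remembered
This overload takes the full update function together with an initial state. rememberPartitioner controls whether the generated state RDDs keep the partitioner.
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner,
    rememberPartitioner, Some(initialRDD))
}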
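The effect of the initial state can be traced with a small simulation in plain Scala. This is only a model of the behavior, not how StateDStream is implemented: step is a hypothetical helper that groups one batch by key and calls the update function once for every key that appears either in the previous state or in the batch. The initial map plays the role of initialRDD.

```scala
object InitialStateSketch {
  def update(values: Seq[Int], state: Option[Int]): Option[Int] =
    Some(values.sum + state.getOrElse(0))

  // Hypothetical model of one batch interval: every key seen so far, plus every
  // key in the batch, gets one call to the update function.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (state.keySet ++ newValues.keySet).flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  // Plays the role of initialRDD: counts that exist before the first batch.
  val initial: Map[String, Int] = Map("hello" -> 1, "world" -> 1)
  val afterBatch1: Map[String, Int] = step(initial, Seq("hello" -> 1, "spark" -> 1))
  val afterBatch2: Map[String, Int] = step(afterBatch1, Seq("spark" -> 2))

  def main(args: Array[String]): Unit = {
    println(afterBatch1)
    println(afterBatch2)
  }
}
```

In this model "world" never appears in a batch, yet it stays in the state with its initial count, while "hello" starts from 1 rather than 0.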

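Complete example:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// inputDir is a placeholder for the directory to monitor; the original path is not shown.
def testUpdate(inputDir: String): Unit = {
  // SparkUtils is the author's own helper (not shown), assumed to return a SparkSession
  val sc = SparkUtils.getSpark("test", "db01").sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))
  // updateStateByKey requires a checkpoint directory
  ssc.checkpoint("hdfs://ns1/config/checkpoint")
  val initialRDD = sc.parallelize(List(("hello", 1), ("world", 1)))
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](inputDir)
  // each record is (byte offset, line); split the line text on commas
  val words = lines.flatMap(x => x._2.toString.split(","))
  val wordDstream: DStream[(String, Int)] = words.map(x => (x, 1))
  val result = wordDstream.reduceByKey(_ + _)

  def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // add the new values to the previous running count to get the new count
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }
  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
  }
  val stateDS = result.updateStateByKey(newUpdateFunc, new HashPartitioner(sc.defaultParallelism), true, initialRDD)
  stateDS.print()
  ssc.start()
  ssc.awaitTermination()
}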
