updateStateByKey explained in detail, with a WordCount example

The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. To use it you need to:
1. Define the state - the state can be any data type.
2. Define the state update function - specify, with a function, how to update the state from the previous state and the new values arriving on the input stream.

The Spark documentation is rather vague about how to use this function, and online material is not very thorough either, so I went through the source code, summarized what I found, and put together a complete example.
updateStateByKey has six overloads:
1. Pass in only an update function - the simplest form.
The update function takes two parameters, Seq[V] and Option[S]: the former is the collection of values that arrived for each key in the current batch, the latter is the state currently saved for that key.
```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
```

For example, for WordCount we can define the update function like this:

```scala
(values: Seq[Int], state: Option[Int]) => {
  // start from the previously saved count for this word; getOrElse(0) covers the
  // case where no state exists yet (it works like an if ... else ...)
  var newValue = state.getOrElse(0)
  for (value <- values) {
    newValue += value // accumulate the occurrences from the current batch
  }
  Option(newValue)
}
```
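As a minimal sketch of how this simplest overload might be wired into a stream (the StreamingContext `ssc`, the socket source, and the checkpoint path here are assumptions for illustration, not from the original post):

```scala
// Minimal sketch: assumes an existing StreamingContext `ssc`; updateStateByKey
// requires a checkpoint directory to be set.
ssc.checkpoint("checkpoint")
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(",")).map(word => (word, 1))

// Apply the one-argument overload with the update function defined above.
val counts = pairs.updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) => {
  Option(state.getOrElse(0) + values.sum) // add this batch's occurrences to the saved count
})
counts.print()
```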

2. Pass in an update function and a number of partitions

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
```
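A hedged sketch of the call (reusing the hypothetical `pairs` DStream from above); this overload only differs in that you also pass the desired number of partitions:

```scala
// Same update logic, but the state RDDs are hash-partitioned into 8 partitions
// (8 is an arbitrary choice for illustration).
val counts = pairs.updateStateByKey[Int](
  (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum),
  8
)
```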
3. Pass in an update function and a custom partitioner

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
```
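A sketch of the same call with an explicit partitioner (again using the hypothetical `pairs` and `ssc` from the earlier sketch):

```scala
import org.apache.spark.HashPartitioner

// Control how the state is partitioned; any custom Partitioner works the same way.
val counts = pairs.updateStateByKey[Int](
  (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum),
  new HashPartitioner(ssc.sparkContext.defaultParallelism)
)
```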
4. Pass in the complete state update function
The overloads above all take a partial update function that handles a single key; when they run, a complete update function is generated from it. Its signature is Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]: the input is an iterator whose elements carry the key, the collection of values that arrived for that key in this batch, and the current state, and the result maps each key to its new state.

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}
```

For example, for WordCount (function1 is the per-key update function shown in the full listing further below):

```scala
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
```
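A hedged sketch of invoking this overload directly with the wrapper above (`pairs` and `ssc` are the same hypothetical names as before); passing true for rememberPartitioner keeps the same partitioner for the state RDDs across batches:

```scala
import org.apache.spark.HashPartitioner

// newUpdateFunc is the iterator-based wrapper shown above;
// the Boolean argument is the rememberPartitioner flag.
val stateDStream = pairs.updateStateByKey[Int](
  newUpdateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  true
)
```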

5. Add an initial state
initialRDD: RDD[(K, S)] is the collection of initial per-key states.

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
```
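A sketch of seeding the state with an initial RDD (hypothetical names as in the earlier sketches), so some keys start from a non-zero count before the first batch arrives:

```scala
import org.apache.spark.HashPartitioner

// Initial state: "hello" and "world" already have a count of 1.
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

val stateDStream = pairs.updateStateByKey[Int](
  (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum),
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialRDD
)
```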
6. The full form: complete update function, custom partitioner, whether to remember the current partitioning, and an initial state

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner,
    rememberPartitioner, Some(initialRDD))
}
```

A complete example:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

def testUpdate(): Unit = {
  // SparkUtils.getSpark is the author's own helper for obtaining a Spark session
  val sc = SparkUtils.getSpark("test", "db01").sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))
  ssc.checkpoint("hdfs://ns1/config/checkpoint") // updateStateByKey requires a checkpoint directory
  // initial state: seed the counts for "hello" and "world"
  val initialRDD = sc.parallelize(List(("hello", 1), ("world", 1)))
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://ns1/config/data/")
  val words = lines.flatMap(x => x._2.toString.split(","))
  val wordDstream: DStream[(String, Int)] = words.map(x => (x, 1))
  // pre-aggregate within each batch before updating the state
  val result = wordDstream.reduceByKey(_ + _)

  // per-key update function: add the new values to the previous running count
  def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }
  // wrap it into the complete, iterator-based update function (overload 6)
  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
  }
  val stateDS = result.updateStateByKey(newUpdateFunc,
    new HashPartitioner(sc.defaultParallelism), true, initialRDD)
  stateDS.print()
  ssc.start()
  ssc.awaitTermination()
}
```
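To illustrate what this prints, suppose (purely hypothetically) that the first batch picks up a file containing the single line hello,hello,spark. With the initial state above, the output of stateDS.print() for that batch would look roughly like:

```
-------------------------------------------
Time: ... ms
-------------------------------------------
(hello,3)
(world,1)
(spark,1)
```

hello combines the seeded count of 1 with the two new occurrences, world keeps its seeded count because keys with no new values still have their state carried forward, and spark starts from zero.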
updateStateByKey is a Spark Streaming operation that lets you maintain state across batches of data: it updates the state of a DStream by applying a state update function to every batch in the stream. Two things are needed to use it: the state update function itself, and a checkpoint directory (set via ssc.checkpoint) where the state information is persisted between batches.

The state update function takes two arguments, the new values for a key in the current batch and the current state for that key, and returns the updated state for that key. For example, if you have a DStream of (key, value) pairs and want to maintain a running count of the values for each key, updateStateByKey can update that count across batches:

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sparkContext, 1)
ssc.checkpoint("checkpoint")  # required so updateStateByKey can store state between batches

# Create a DStream of (key, value) pairs
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda x: (x.split(" ")[0], int(x.split(" ")[1])))

# Define the update function: sum the new values onto the current running sum
def updateFunc(newValues, currentSum):
    if currentSum is None:
        currentSum = 0
    return sum(newValues, currentSum)

# Use updateStateByKey to update the state
stateDstream = pairs.updateStateByKey(updateFunc)

# Print the state
stateDstream.pprint()

ssc.start()
ssc.awaitTermination()
```

In this example we create a DStream of (key, value) pairs from a socket connection, define the update function to sum the new values for each key with the current sum, and finally use updateStateByKey to update the state and print the result.
