updateStateByKey explained in detail, with a WordCount example

The updateStateByKey operation lets you maintain arbitrary state while continuously updating it with new information. To use it you need to:
1. Define the state - the state can be any data type.
2. Define the state update function - specify, with a function, how to update the state from the previous state and the new values arriving on the input stream.

The Spark documentation is rather vague about how to use this function, and online material is not very thorough either, so I went through the source code, summarized what I found, and put together a complete example.
updateStateByKey has six overloads:
1. Pass in only an update function - the simplest form.
The update function takes two parameters, Seq[V] and Option[S]: the former is the collection of values that arrived for each key in the current batch, the latter is the state currently saved for that key.
```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
```

For example, for WordCount we can define the update function like this:

```scala
(values: Seq[Int], state: Option[Int]) => {
  // start from the previously saved count for this word; getOrElse(0) covers the
  // case where no state exists yet (it works like an if ... else ...)
  var newValue = state.getOrElse(0)
  for (value <- values) {
    newValue += value // accumulate the occurrences from the current batch
  }
  Option(newValue)
}
```
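As a minimal sketch of how this simplest overload might be wired into a stream (the StreamingContext `ssc`, the socket source, and the checkpoint path here are assumptions for illustration, not from the original post):

```scala
// Minimal sketch: assumes an existing StreamingContext `ssc`; updateStateByKey
// requires a checkpoint directory to be set.
ssc.checkpoint("checkpoint")
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(",")).map(word => (word, 1))

// Apply the one-argument overload with the update function defined above.
val counts = pairs.updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) => {
  Option(state.getOrElse(0) + values.sum) // add this batch's occurrences to the saved count
})
counts.print()
```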

2. Pass in an update function and a number of partitions

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    numPartitions: Int
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
}
```
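A hedged sketch of the call (reusing the hypothetical `pairs` DStream from above); this overload only differs in that you also pass the desired number of partitions:

```scala
// Same update logic, but the state RDDs are hash-partitioned into 8 partitions
// (8 is an arbitrary choice for illustration).
val counts = pairs.updateStateByKey[Int](
  (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum),
  8
)
```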
3. Pass in an update function and a custom partitioner

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
```
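A sketch of the same call with an explicit partitioner (again using the hypothetical `pairs` and `ssc` from the earlier sketch):

```scala
import org.apache.spark.HashPartitioner

// Control how the state is partitioned; any custom Partitioner works the same way.
val counts = pairs.updateStateByKey[Int](
  (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum),
  new HashPartitioner(ssc.sparkContext.defaultParallelism)
)
```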
4. Pass in the complete state update function
The overloads above all take a partial update function that handles a single key; when they run, a complete update function is generated from it. Its signature is Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]: the input is an iterator whose elements carry the key, the collection of values that arrived for that key in this batch, and the current state, and the result maps each key to its new state.

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}
```

For example, for WordCount (function1 is the per-key update function shown in the full listing further below):

```scala
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
}
```
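A hedged sketch of invoking this overload directly with the wrapper above (`pairs` and `ssc` are the same hypothetical names as before); passing true for rememberPartitioner keeps the same partitioner for the state RDDs across batches:

```scala
import org.apache.spark.HashPartitioner

// newUpdateFunc is the iterator-based wrapper shown above;
// the Boolean argument is the rememberPartitioner flag.
val stateDStream = pairs.updateStateByKey[Int](
  newUpdateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  true
)
```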

5. Add an initial state
initialRDD: RDD[(K, S)] is the collection of initial per-key states.

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
```
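A sketch of seeding the state with an initial RDD (hypothetical names as in the earlier sketches), so some keys start from a non-zero count before the first batch arrives:

```scala
import org.apache.spark.HashPartitioner

// Initial state: "hello" and "world" already have a count of 1.
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

val stateDStream = pairs.updateStateByKey[Int](
  (values: Seq[Int], state: Option[Int]) => Option(state.getOrElse(0) + values.sum),
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialRDD
)
```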
6. The full form: complete update function, custom partitioner, whether to remember the current partitioning, and an initial state

```scala
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner,
    rememberPartitioner, Some(initialRDD))
}
```

A complete example:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

def testUpdate(): Unit = {
  // SparkUtils.getSpark is the author's own helper for obtaining a Spark session
  val sc = SparkUtils.getSpark("test", "db01").sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))
  ssc.checkpoint("hdfs://ns1/config/checkpoint") // updateStateByKey requires a checkpoint directory
  // initial state: seed the counts for "hello" and "world"
  val initialRDD = sc.parallelize(List(("hello", 1), ("world", 1)))
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://ns1/config/data/")
  val words = lines.flatMap(x => x._2.toString.split(","))
  val wordDstream: DStream[(String, Int)] = words.map(x => (x, 1))
  // pre-aggregate within each batch before updating the state
  val result = wordDstream.reduceByKey(_ + _)

  // per-key update function: add the new values to the previous running count
  def function1(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }
  // wrap it into the complete, iterator-based update function (overload 6)
  val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap(t => function1(t._2, t._3).map(s => (t._1, s)))
  }
  val stateDS = result.updateStateByKey(newUpdateFunc,
    new HashPartitioner(sc.defaultParallelism), true, initialRDD)
  stateDS.print()
  ssc.start()
  ssc.awaitTermination()
}
```
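To illustrate what this prints, suppose (purely hypothetically) that the first batch picks up a file containing the single line hello,hello,spark. With the initial state above, the output of stateDS.print() for that batch would look roughly like:

```
-------------------------------------------
Time: ... ms
-------------------------------------------
(hello,3)
(world,1)
(spark,1)
```

hello combines the seeded count of 1 with the two new occurrences, world keeps its seeded count because keys with no new values still have their state carried forward, and spark starts from zero.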
updateStateByKey is a Spark Streaming operation that lets you maintain state across batches of data: it updates the state of a DStream by applying a state update function to every batch in the stream. Two things are needed to use it: the state update function itself, and a checkpoint directory (set via ssc.checkpoint) where the state information is persisted between batches.

The state update function takes two arguments, the new values for a key in the current batch and the current state for that key, and returns the updated state for that key. For example, if you have a DStream of (key, value) pairs and want to maintain a running count of the values for each key, updateStateByKey can update that count across batches:

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sparkContext, 1)
ssc.checkpoint("checkpoint")  # required so updateStateByKey can store state between batches

# Create a DStream of (key, value) pairs
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda x: (x.split(" ")[0], int(x.split(" ")[1])))

# Define the update function: sum the new values onto the current running sum
def updateFunc(newValues, currentSum):
    if currentSum is None:
        currentSum = 0
    return sum(newValues, currentSum)

# Use updateStateByKey to update the state
stateDstream = pairs.updateStateByKey(updateFunc)

# Print the state
stateDstream.pprint()

ssc.start()
ssc.awaitTermination()
```

In this example we create a DStream of (key, value) pairs from a socket connection, define the update function to sum the new values for each key with the current sum, and finally use updateStateByKey to update the state and print the result.
