1. On the parameter of the updateStateByKey operator:
dataDS.updateStateByKey(updateFunction).print()
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * In every batch the updateFunc will be called for each state even if there are no new values.
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
As the doc comment shows, the operator takes a function as its argument. This can be a named function defined explicitly with def:
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = newValues.sum + runningCount.getOrElse(0)
  Some(newCount)
}
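To make the semantics concrete, here is a minimal Spark-free sketch that exercises the same update logic (the sample inputs are made up for illustration): on the first batch there is no prior state, so runningCount is None and the result is just the batch sum; on later batches the batch sum is added to the stored count.

```scala
object UpdateFnDemo {
  // Same shape updateStateByKey expects for V = Int, S = Int:
  // newValues   = all values seen for a key in the current batch,
  // runningCount = the state stored for that key so far (None on first sight).
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }

  def main(args: Array[String]): Unit = {
    println(updateFunction(Seq(1, 1, 1), None))  // first batch: Some(3)
    println(updateFunction(Seq(1, 1), Some(3)))  // later batch: Some(5)
    // Returning None instead of Some(...) would remove the key's state entirely.
  }
}
```

Note that returning None is how a key's state is dropped, matching the "corresponding state key-value pair will be eliminated" remark in the doc comment above.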
It can also be an anonymous function, which can be written in two styles:

Style 1:

dataDS.updateStateByKey(
  (newValues: Seq[Int], runningCount: Option[Int]) => {
    val newCount = newValues.sum + runningCount.getOrElse(0)
    Some(newCount)
  }
).print()

Here the anonymous function is written inline as the argument, following the parameter's type signature.
Style 2:

val updateFun = (newValues: Seq[Int], runningCount: Option[Int]) => {
  val newCount = newValues.sum + runningCount.getOrElse(0)
  Some(newCount)
}
dataDS.updateStateByKey(updateFun).print()

Here the anonymous function is first assigned to a variable, and that variable is then passed as the argument.
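Whichever style is used, updateStateByKey only runs with checkpointing enabled, since Spark must persist the per-key state across batches. The following driver sketch shows where the snippets above fit; the socket host/port, checkpoint path, and app name are placeholder assumptions, not from the original:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StateDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StateDemo")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Stateful operators require a checkpoint directory (placeholder path).
    ssc.checkpoint("./ckpt")

    // Placeholder source: whitespace-separated words arriving on a local socket.
    val dataDS = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    val updateFun = (newValues: Seq[Int], runningCount: Option[Int]) => {
      Some(newValues.sum + runningCount.getOrElse(0))
    }
    dataDS.updateStateByKey(updateFun).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Without the ssc.checkpoint(...) call, the job fails at start-up with an error stating that the checkpoint directory has not been set.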