State & Fault Tolerance (Key Topic)
Flink is a stream-processing service built around stateful computation. Flink divides all state into two broad categories: keyed state and operator state. Keyed state refers to state that Flink binds to each individual key under the hood; it applies specifically to operations on a KeyedStream. Operator state refers to state used in non-keyed streams; every piece of operator state is bound to a specific operator instance.
Whether keyed or operator state, Flink manages the underlying storage in one of two forms: Managed State and Raw State.
- Managed State: Flink controls the storage structure of the state, e.g. its data structures and data types. Because Flink manages the state itself, it can optimize memory usage and perform failure recovery on the state's behalf.
- Raw State: Flink knows nothing about the state's content or structure; it sees only an opaque byte array, and the user must implement serialization and deserialization themselves. Consequently, Flink can neither apply targeted memory optimizations to Raw State nor restore it after a failure. For this reason, Raw State is almost never used in real-world Flink projects.
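To make the Managed vs. Raw distinction concrete, the sketch below shows the kind of manual byte-level (de)serialization that Raw State pushes onto the user. This is plain, framework-independent Scala for illustration only (the method names are ours, not a Flink API); with Managed State, Flink's own type serializers do this work and can optimize layout and recovery.

```scala
import java.io._

// Sketch: the serialization burden Raw State places on the user — the state
// value must be converted to and from raw bytes by hand. Managed State lets
// Flink's serializers handle this instead.
object RawStateSketch {
  def serialize(count: Int): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    out.writeInt(count)   // write the state as 4 big-endian bytes
    out.flush()
    bos.toByteArray
  }

  def deserialize(bytes: Array[Byte]): Int =
    new DataInputStream(new ByteArrayInputStream(bytes)).readInt()
}
```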
Using Managed Keyed State
The managed keyed state interface provides access to different types of state, all of which are scoped to the key of the current input element. This means this kind of state can only be used on a KeyedStream, which is created via stream.keyBy(…).
We will first look at the different types of state available and then see how they are used in a program. The available state primitives are:
- ValueState<T>: keeps a single value that can be updated and retrieved (scoped, as mentioned above, to the key of the input element, so there will possibly be one value for each key the operation sees). The value can be set with update(T) and retrieved with T value().
- ListState<T>: keeps a list of elements. You can append elements and retrieve an Iterable over all currently stored elements. Elements are added with add(T) or addAll(List<T>); the Iterable is retrieved with Iterable<T> get(). You can also override the existing list with update(List<T>).
- ReducingState<T>: keeps a single value that represents the aggregation of all values added to the state. The interface is similar to ListState, but elements added with add(T) are reduced to an aggregate using a specified ReduceFunction.
- AggregatingState<IN, OUT>: keeps a single value that represents the aggregation of all values added to the state. In contrast to ReducingState, the aggregate type may differ from the type of the elements added to the state. The interface is the same as for ListState, but elements added with add(IN) are aggregated using a specified AggregateFunction.
- FoldingState<T, ACC>: keeps a single value that represents the aggregation of all values added to the state. In contrast to ReducingState, the aggregate type may differ from the type of the elements added to the state. The interface is similar to ListState, but elements added with add(T) are folded into an aggregate using a specified FoldFunction.
- MapState<UK, UV>: keeps a list of mappings. You can put key-value pairs into the state and retrieve an Iterable over all currently stored mappings. Mappings are added with put(UK, UV) or putAll(Map<UK, UV>); the value associated with a user key can be retrieved with get(UK). Iterable views over the mappings, keys, and values can be retrieved with entries(), keys(), and values() respectively. You can also use isEmpty() to check whether the map contains any key-value mappings.
All types of state also have a method clear() that clears the state for the currently active key, i.e. the key of the input element.
Note: FoldingState and FoldingStateDescriptor have been deprecated since Flink 1.4 and will be removed entirely in a future release. Please use AggregatingState and AggregatingStateDescriptor instead.
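As a sketch of what that migration looks like, a FoldFunction that sums Doubles from an initial value of 0.0 corresponds to an AggregateFunction whose accumulator is the running sum. The object below mirrors the AggregateFunction method contract in plain Scala; the names follow org.apache.flink.api.common.functions.AggregateFunction, but this is an illustration, not the Flink interface itself.

```scala
// Sketch: a summing FoldFunction expressed with AggregateFunction semantics.
// Method names mirror Flink's AggregateFunction; plain-Scala illustration only.
object SumAggregate {
  type ACC = Double
  def createAccumulator(): ACC = 0.0            // the fold's initial value
  def add(in: Double, acc: ACC): ACC = acc + in // the fold step
  def getResult(acc: ACC): Double = acc         // identity for a plain sum
  def merge(a: ACC, b: ACC): ACC = a + b        // required by AggregateFunction; folds have no equivalent
}
```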
It is important to keep in mind that these state objects are only used for interfacing with the state. The state is not necessarily stored inside the objects but might reside on disk or somewhere else. The second thing to keep in mind is that the value you get from the state depends on the key of the input element, so the value obtained in one invocation of your user function can differ from the value in another invocation if the keys involved are different.
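The per-key scoping can be pictured as a map from key to state slot: each invocation reads and writes only the slot for the current key, and clear() removes only that slot. A minimal plain-Scala model of this behavior (illustrative only, not Flink's implementation):

```scala
import scala.collection.mutable

// Illustrative model of keyed ValueState scoping: each key owns a separate
// slot; clear() for one key leaves every other key's state untouched.
class KeyedSlots[K, V](default: V) {
  private val slots = mutable.Map.empty[K, V]
  def value(key: K): V = slots.getOrElse(key, default)
  def update(key: K, v: V): Unit = slots(key) = v
  def clear(key: K): Unit = slots.remove(key)
}

// Usage: clearing one key does not affect another key's state.
// val counts = new KeyedSlots[String, Int](0)
// counts.update("flink", 2); counts.update("spark", 5); counts.clear("flink")
// counts.value("flink") is back to the default 0; counts.value("spark") is still 5
```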
To get a state handle, you have to create a StateDescriptor. This holds the name of the state (as we will see later, you can create several states, and they must have unique names so that you can reference them), the type of the values that the state holds, and possibly a user-specified function such as a ReduceFunction. Depending on what type of state you want to retrieve, you create a ValueStateDescriptor, a ListStateDescriptor, a ReducingStateDescriptor, a FoldingStateDescriptor, or a MapStateDescriptor.
State is accessed through the RuntimeContext, so it is only available in rich functions. Please see here for more information, and we will see an example shortly. The RuntimeContext that is available in a RichFunction provides these methods for accessing state:
- ValueState<T> getState(ValueStateDescriptor<T>)
- ReducingState<T> getReducingState(ReducingStateDescriptor<T>)
- ListState<T> getListState(ListStateDescriptor<T>)
- AggregatingState<IN, OUT> getAggregatingState(AggregatingStateDescriptor<IN, ACC, OUT>)
- FoldingState<T, ACC> getFoldingState(FoldingStateDescriptor<T, ACC>)
- MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV>)
ValueState
package com.baizhi.jsy.keyedState

import org.apache.flink.api.common.functions.{RichMapFunction, RuntimeContext}
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object FlinkWordCountKeyedValueState {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //env.disableOperatorChaining()
    //2. Create the DataStream
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map((_, 1))
      .keyBy(0)
      .map(new WordCountMapFunction)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

class WordCountMapFunction extends RichMapFunction[(String, Int), (String, Int)] {
  var vs: ValueState[Int] = _

  override def open(parameters: Configuration): Unit = {
    //Create the state descriptor
    val vsd = new ValueStateDescriptor[Int]("wordCount", createTypeInformation[Int])
    //Get the RuntimeContext
    val context: RuntimeContext = getRuntimeContext
    //Get the state handle of the specified type
    vs = context.getState(vsd)
  }

  override def map(in: (String, Int)): (String, Int) = {
    //Get the previous value
    val historyState = vs.value()
    //Update the state
    vs.update(historyState + in._2)
    //Return the latest value
    (in._1, vs.value())
  }
}
ListState
package com.baizhi.jsy.keyedState

import org.apache.flink.api.common.functions.{RichMapFunction, RuntimeContext}
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._

object FlinkUserKeyedListState {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    //Sample input: "001 zhansan electronics 1000", "001 zhansan machinery 1500"
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.map(line => line.split("\\s+"))
      .map(ts => (ts(0) + ":" + ts(1), ts(2)))
      .keyBy(0)
      .map(new UserListMapFunction)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

class UserListMapFunction extends RichMapFunction[(String, String), (String, String)] {
  var userList: ListState[String] = _

  override def open(parameters: Configuration): Unit = {
    //Create the state descriptor
    val lsd = new ListStateDescriptor[String]("user", createTypeInformation[String])
    //Get the RuntimeContext
    val context: RuntimeContext = getRuntimeContext
    //Get the state handle of the specified type
    userList = context.getListState(lsd)
  }

  override def map(in: (String, String)): (String, String) = {
    //Get the previously stored elements
    var historyState = userList.get().asScala.toList
    //Update the state: prepend the new element and drop duplicates
    historyState = historyState.::(in._2).distinct
    userList.update(historyState.asJava)
    //Return the latest list
    (in._1, historyState.mkString("|"))
  }
}
MapState
package com.baizhi.jsy.keyedState

import org.apache.flink.api.common.functions.{RichMapFunction, RuntimeContext}
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._

object FlinkUserKeyedMapState {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    //Sample input: "001 zhansan electronics 1000", "001 zhansan machinery 1500"
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.map(line => line.split("\\s+"))
      .map(ts => (ts(0) + ":" + ts(1), ts(2)))
      .keyBy(0)
      .map(new UserMapFunction)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

class UserMapFunction extends RichMapFunction[(String, String), (String, String)] {
  var userMap: MapState[String, Int] = _

  override def open(parameters: Configuration): Unit = {
    //Create the state descriptor
    val msd = new MapStateDescriptor[String, Int]("userMap", createTypeInformation[String], createTypeInformation[Int])
    //Get the RuntimeContext
    val context: RuntimeContext = getRuntimeContext
    //Get the state handle of the specified type
    userMap = context.getMapState(msd)
  }

  override def map(in: (String, String)): (String, String) = {
    //Look up the previous count for this category, defaulting to 0
    var count = 0
    if (userMap.contains(in._2)) {
      count = userMap.get(in._2)
    }
    userMap.put(in._2, count + 1)
    val list = userMap.entries().asScala.map(entry => entry.getKey + ":" + entry.getValue).toList
    //Return the latest mappings
    (in._1, list.mkString("|"))
  }
}
ReducingState
package com.baizhi.jsy.keyedState

import org.apache.flink.api.common.functions.{ReduceFunction, RichMapFunction, RuntimeContext}
import org.apache.flink.api.common.state.{ReducingState, ReducingStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object FlinkWordCountKeyedReduceState {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map((_, 1))
      .keyBy(0)
      .map(new UserReduceMapFunction)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

class UserReduceMapFunction extends RichMapFunction[(String, Int), (String, Int)] {
  var rs: ReducingState[Int] = _

  override def open(parameters: Configuration): Unit = {
    //Create the state descriptor with a ReduceFunction that sums the counts
    val rsd = new ReducingStateDescriptor[Int]("WordCountReduce", new ReduceFunction[Int]() {
      override def reduce(t: Int, t1: Int): Int = t + t1
    }, createTypeInformation[Int])
    //Get the RuntimeContext
    val context: RuntimeContext = getRuntimeContext
    //Get the state handle of the specified type
    rs = context.getReducingState(rsd)
  }

  override def map(in: (String, Int)): (String, Int) = {
    rs.add(in._2)
    //Return the latest aggregate
    (in._1, rs.get())
  }
}
AggregatingState
package com.baizhi.jsy.keyedState

import org.apache.flink.api.common.functions.{AggregateFunction, RichMapFunction, RuntimeContext}
import org.apache.flink.api.common.state.{AggregatingState, AggregatingStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object FlinkUserKeyedAggregatingState {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    //Sample input: "001 zhansan 1000", "001 zhansan 800"
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.map(line => line.split("\\s+"))
      .map(ts => (ts(0) + ":" + ts(1), ts(2).toDouble))
      .keyBy(0)
      .map(new UserAggregatingState)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

class UserAggregatingState extends RichMapFunction[(String, Double), (String, Double)] {
  var userAggregate: AggregatingState[Double, Double] = _

  override def open(parameters: Configuration): Unit = {
    //Create the state descriptor: computes a running average,
    //using an accumulator of (count, sum)
    val asd = new AggregatingStateDescriptor[Double, (Int, Double), Double]("userAggregate",
      new AggregateFunction[Double, (Int, Double), Double] {
        override def createAccumulator(): (Int, Double) = (0, 0.0)
        override def add(in: Double, acc: (Int, Double)): (Int, Double) = (acc._1 + 1, acc._2 + in)
        override def getResult(acc: (Int, Double)): Double = acc._2 / acc._1
        override def merge(acc: (Int, Double), acc1: (Int, Double)): (Int, Double) =
          (acc._1 + acc1._1, acc._2 + acc1._2)
      }, createTypeInformation[(Int, Double)]
    )
    //Get the RuntimeContext
    val context: RuntimeContext = getRuntimeContext
    //Get the state handle of the specified type
    userAggregate = context.getAggregatingState(asd)
  }

  override def map(in: (String, Double)): (String, Double) = {
    userAggregate.add(in._2)
    //Return the latest average
    (in._1, userAggregate.get())
  }
}
ReduceAndFoldState
package com.baizhi.jsy.keyedState

import org.apache.flink.api.common.functions.{FoldFunction, ReduceFunction, RichMapFunction, RuntimeContext}
import org.apache.flink.api.common.state.{FoldingState, FoldingStateDescriptor, ReducingState, ReducingStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object FlinkUserKeyedReduceAndFoldState {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    //Sample input: "001 zhansan 1000", "001 zhansan 800"
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.map(line => line.split("\\s+"))
      .map(ts => (ts(0) + ":" + ts(1), ts(2).toDouble))
      .keyBy(0)
      .map(new UserReduceAndFoldState)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

class UserReduceAndFoldState extends RichMapFunction[(String, Double), (String, Double)] {
  var rs: ReducingState[Int] = _
  var fs: FoldingState[Double, Double] = _

  override def open(parameters: Configuration): Unit = {
    //ReducingState counts how many records were seen for the current key
    val reduceState = new ReducingStateDescriptor[Int]("ReducedState", new ReduceFunction[Int] {
      override def reduce(t: Int, t1: Int): Int = t + t1
    }, createTypeInformation[Int])
    //FoldingState sums the amounts, starting from the initial value 0.0
    val foldState = new FoldingStateDescriptor[Double, Double]("FoldState", 0.0, new FoldFunction[Double, Double] {
      override def fold(t: Double, o: Double): Double = t + o
    }, createTypeInformation[Double])
    val context: RuntimeContext = getRuntimeContext
    rs = context.getReducingState(reduceState)
    fs = context.getFoldingState(foldState)
  }

  override def map(in: (String, Double)): (String, Double) = {
    rs.add(1)
    fs.add(in._2)
    //Return the latest average: sum divided by count
    (in._1, fs.get() / rs.get())
  }
}