Flink State and the Checkpoint Mechanism (Part 1)

不清不慎

Article link: https://blog.csdn.net/qq_37142346/article/details/90667283

1. State

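Flink builds its fault tolerance on two mechanisms, state and checkpoints, and it is worth keeping them apart. State is the data held by each operator inside the tasks of a Flink job. A checkpoint is a global snapshot of the whole job taken at a particular point in time; after a failure or a restart, the job can be restored from that snapshot.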

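State in Flink falls mainly into Operator State and Keyed State. Since Flink 1.5 there is also Broadcast State, which is typically used when two streams are connected: one stream carries data that rarely changes, such as configuration rules, and the other stream has to be processed against that broadcast DataStream.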

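What state is used for

Aggregations, and iterative model training in machine learning.
Recovering a job after a failure, restarting a job, and similar scenarios.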

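Forms of state

Managed state: state structures managed by Flink, such as ValueState, ListState and MapState. They are read and updated through interfaces provided by the framework, and Flink takes care of serialization, so you do not serialize them yourself.
Raw state: data structures created and managed by the user. During a checkpoint Flink reads them only as byte[], so the user is responsible for serializing them.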

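Operator State

With Operator State, each operator in a task of a job has its own state. For example, if a job contains map, filter and sink operations, each of these operators can hold one state when it runs with a parallelism of one. With a higher parallelism, every parallel instance holds its own state. The main structure available for Operator State is ListState.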

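How do we use Operator State? Either by implementing the CheckpointedFunction interface or by implementing the ListCheckpointed<T extends Serializable> interface. The main difference is this: with CheckpointedFunction, two ListState accessors are available on the operator state store, getListState and getUnionListState. Both return a ListState, but they behave differently when state is redistributed, which is covered in detail later. With ListCheckpointed, the state is fixed to a list with even-split redistribution, and we do not initialize it ourselves because Flink handles that internally. Note that newer Flink releases deprecate ListCheckpointed in favor of CheckpointedFunction.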

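Here is a small example. Given a sequence of numbers, we want to collect the numbers that lie between two consecutive 1s. Each output record has the form <count,String>.

For the input 1 2 3 4 7 5 1 5 4 6 1 7 8 9 1, the output is:

(5,2 3 4 7 5)			// there are 5 numbers between the two 1s: 2, 3, 4, 7, 5
(3,5 4 6)
(3,7 8 9)

The complete program: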

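import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Operator State Demo
 *
 * Input: 1 2 3 4 7 5 1 5 4 6 1 7 8 9 1
 *
 * Output:
 * (5,2 3 4 7 5)
 * (3,5 4 6)
 * (3,7 8 9)
 */
public class OperatorStateDemo {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> input = env.fromElements(1L, 2L, 3L, 4L, 7L, 5L, 1L, 5L, 4L, 6L, 1L, 7L, 8L, 9L, 1L);

        input.flatMap(new OperatorStateMap()).setParallelism(1).print();

        System.out.println(env.getExecutionPlan());

        env.execute();
    }

    public static class OperatorStateMap extends RichFlatMapFunction<Long, Tuple2<Integer, String>>
            implements CheckpointedFunction {

        // Managed operator state, rewritten on every checkpoint
        private transient ListState<Long> listState;
        // Local in-memory buffer holding the elements seen since the last 1
        private transient List<Long> listElements;

        @Override
        public void flatMap(Long value, Collector<Tuple2<Integer, String>> collector) throws Exception {
            if (value == 1L) {
                if (!listElements.isEmpty()) {
                    StringBuilder buffer = new StringBuilder();
                    for (Long ele : listElements) {
                        buffer.append(ele).append(' ');
                    }
                    int count = listElements.size();
                    // trim() drops the trailing space so the output matches the sample above
                    collector.collect(new Tuple2<>(count, buffer.toString().trim()));
                    listElements.clear();
                }
            } else {
                listElements.add(value);
            }
        }

        /**
         * Called when a checkpoint is taken: copy the local buffer into managed state.
         */
        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            listState.clear();
            for (Long ele : listElements) {
                listState.add(ele);
            }
        }

        /**
         * Called when the function is initialized, including on recovery from a failure.
         */
        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            // initializeState runs before open(), so the buffer must be created here;
            // creating it in open() would fail on restore and discard the restored elements.
            listElements = new ArrayList<>();

            ListStateDescriptor<Long> listStateDescriptor = new ListStateDescriptor<>("checkPointedList",
                    TypeInformation.of(new TypeHint<Long>() {}));
            listState = context.getOperatorStateStore().getListState(listStateDescriptor);

            // On recovery, copy the elements from managed state back into the local buffer
            if (context.isRestored()) {
                for (Long ele : listState.get()) {
                    listElements.add(ele);
                }
            }
        }
    }

}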

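The example above implements initializeState and snapshotState from CheckpointedFunction. With the ListCheckpointed interface we would not initialize the state ourselves; the previous state is handed back to us directly on recovery, and only the following two methods need to be implemented: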

List<T> snapshotState(long checkpointId, long timestamp) throws Exception;

void restoreState(List<T> state) throws Exception;
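
To see why the buffer has to go through a snapshot, the checkpoint cycle can be modeled without Flink at all. The following is a minimal plain-Java sketch under stated assumptions: the class BufferingOperatorSketch and its methods snapshot and restore are hypothetical names that only mimic the roles of snapshotState and restoreState; they are not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the buffering operator; no Flink classes are involved. */
class BufferingOperatorSketch {

    // Elements seen since the last 1 (the role of listElements in the demo)
    private final List<Long> buffer = new ArrayList<>();
    // Stands in for the Collector: every emitted record is appended here
    final List<String> output = new ArrayList<>();

    void process(long value) {
        if (value != 1L) {
            buffer.add(value);
            return;
        }
        if (buffer.isEmpty()) {
            return;
        }
        StringBuilder sb = new StringBuilder();
        for (Long ele : buffer) {
            sb.append(ele).append(' ');
        }
        output.add("(" + buffer.size() + "," + sb.toString().trim() + ")");
        buffer.clear();
    }

    /** Models taking a checkpoint: hand out a copy of the buffered elements. */
    List<Long> snapshot() {
        return new ArrayList<>(buffer);
    }

    /** Models recovery: a fresh instance starts from the snapshotted elements. */
    static BufferingOperatorSketch restore(List<Long> snapshot) {
        BufferingOperatorSketch op = new BufferingOperatorSketch();
        op.buffer.addAll(snapshot);
        return op;
    }
}
```

If one instance processes 1 2 3 4, is snapshotted, and a restored instance then processes 7 5 1, the restored instance emits (5,2 3 4 7 5), the same record an uninterrupted run would produce. Without the snapshot, the elements 2 3 4 would be lost.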

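Redistribution

When we change the parallelism of an operator in a job, how is the existing state assigned to the new parallel instances? The cases below cover ListState, union ListState and Broadcast State.

ListState
Suppose the operator initially runs with a parallelism of 3. After rescaling, all elements of the old states are pooled and divided evenly among the new states.
[Figure: ListState redistribution]
Union ListState
Again starting from a parallelism of 3, after rescaling every new state receives all elements of all the old states.

Broadcast State
If an operator holds Broadcast State, every parallel instance holds the same state, so rescaling only adds or removes copies of it.
[Figure: Broadcast State redistribution]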
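
The three redistribution modes can be compared side by side in a small plain-Java model. This is a sketch under assumptions, not Flink code: RedistributionSketch and its methods are hypothetical names, and the round-robin order used for the even split is a choice made for the example; the text above only states that the elements are divided evenly.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of how list-style operator state is redistributed on rescaling. */
class RedistributionSketch {

    /** Even split (getListState): pool all elements and deal them out across the new subtasks. */
    static List<List<Integer>> evenSplit(List<List<Integer>> old, int newParallelism) {
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        int next = 0;
        for (List<Integer> part : old) {
            for (Integer element : part) {
                result.get(next % newParallelism).add(element);
                next++;
            }
        }
        return result;
    }

    /** Union (getUnionListState): every new subtask receives all elements of all old subtasks. */
    static List<List<Integer>> union(List<List<Integer>> old, int newParallelism) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> part : old) {
            all.addAll(part);
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>(all));
        }
        return result;
    }

    /** Broadcast: all old copies are identical, so each new subtask gets one copy. */
    static List<List<Integer>> broadcast(List<List<Integer>> old, int newParallelism) {
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>(old.get(0)));
        }
        return result;
    }
}
```

Scaling the states [1, 2], [3, 4], [5, 6] from parallelism 3 down to 2 gives two lists of three elements each under the even split, and two full lists of six elements each under union. Union is therefore the more expensive mode and is usually followed by each subtask picking out the part it needs.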

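Keyed State

Keyed State is used on a KeyedStream. With Operator State, each parallel instance holds one state; with Keyed State, each key holds one state. This means that an application that has to maintain a very large number of keys will pay a corresponding cost for storing their state.

The following kinds of keyed state are provided:

Value State: ValueState<T>, a single value per key.
Map State: MapState<UK,UV>, a key-value map per key.
List State: ListState<T>, a list per key.
Reducing State: ReducingState<T>. Each call to add(T) adds an element and aggregates it with a ReduceFunction. The input type and the result type are the same.
Aggregating State: AggregatingState<IN,OUT>. Each call to add(IN) adds an element and aggregates it with an AggregateFunction. The input type and the result type may differ.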

Below is a simple example that, for each key, outputs the average of every three values. For example, given:

(1,2),(1,3),(1,4),(2,1),(2,2),(2,3)
the output is:
(1,3)
(2,2)

The complete code:

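import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * KeyedState Demo
 * For each key, emits the average of every three values.
 *
 * For the input below, key 1 yields (1,4) and key 2 yields (2,3) (integer division of 11 by 3);
 * key 3 never reaches three values and emits nothing.
 */
public class KeyedStateDemo {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<Long, Long>> input = env.fromElements(
                Tuple2.of(1L, 4L),
                Tuple2.of(1L, 2L),
                Tuple2.of(1L, 6L),
                Tuple2.of(2L, 4L),
                Tuple2.of(2L, 4L),
                Tuple2.of(3L, 5L),
                Tuple2.of(2L, 3L),
                Tuple2.of(1L, 4L)
        );

        // Key by the first tuple field (a key selector replaces the positional keyBy(0))
        input.keyBy(value -> value.f0)
                .flatMap(new KeyedStateAvgFlatMap())
                .setParallelism(10)
                .print();

        env.execute();
    }

    public static class KeyedStateAvgFlatMap
            extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

        // Per-key state: f0 = number of values seen, f1 = their sum
        private transient ValueState<Tuple2<Long, Long>> valueState;

        @Override
        public void flatMap(Tuple2<Long, Long> value, Collector<Tuple2<Long, Long>> collector) throws Exception {
            Tuple2<Long, Long> currentValue = valueState.value();
            if (currentValue == null) {
                currentValue = Tuple2.of(0L, 0L);
            }
            currentValue.f0 += 1;
            currentValue.f1 += value.f1;
            valueState.update(currentValue);
            // Once three values have been accumulated, emit their average and reset
            if (currentValue.f0 >= 3) {
                collector.collect(Tuple2.of(value.f0, currentValue.f1 / currentValue.f0));
                valueState.clear();
            }
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);

            // Keyed state can be given a TTL (time to live)
            StateTtlConfig config = StateTtlConfig
                    .newBuilder(Time.seconds(30))
                    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .build();

            ValueStateDescriptor<Tuple2<Long, Long>> valueStateDescriptor = new ValueStateDescriptor<>(
                    "avgKeyedState",
                    TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {}));

            // Enable TTL on this state
            valueStateDescriptor.enableTimeToLive(config);

            valueState = getRuntimeContext().getState(valueStateDescriptor);
        }
    }
}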
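
The per-key bookkeeping in flatMap can be checked in isolation. The following plain-Java sketch replaces the ValueState with an ordinary HashMap from key to {count, sum}; KeyedAverageSketch and averages are hypothetical names used only for this illustration, and nothing here is Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the keyed averaging logic: one {count, sum} entry per key. */
class KeyedAverageSketch {

    /** Each input row is {key, value}; each output row is {key, average of three values}. */
    static List<long[]> averages(long[][] input) {
        Map<Long, long[]> state = new HashMap<>();   // stands in for the keyed ValueState
        List<long[]> out = new ArrayList<>();
        for (long[] record : input) {
            long key = record[0];
            long[] acc = state.computeIfAbsent(key, k -> new long[2]);
            acc[0] += 1;            // count
            acc[1] += record[1];    // sum
            if (acc[0] >= 3) {
                out.add(new long[]{key, acc[1] / acc[0]});   // integer division, as in the demo
                state.remove(key);                            // same effect as valueState.clear()
            }
        }
        return out;
    }
}
```

Feeding it (1,2),(1,3),(1,4),(2,1),(2,2),(2,3) gives (1,3) and (2,2), matching the sample output above. Because the average uses integer division, a sum of 11 over three values gives 3, not 3.67.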

TTL (time to live)

Any keyed state can also be given a TTL, after which the expired state is removed. TTL is configured as follows:

StateTtlConfig retainOneDay = StateTtlConfig  
    .newBuilder(Time.days(1)) // ①
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) // ②
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) // ③
    .build();

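① The TTL duration.

② The update type, which determines when the last-access timestamp used for TTL is refreshed:
on creation and write (StateTtlConfig.UpdateType.OnCreateAndWrite), or
on read and write (StateTtlConfig.UpdateType.OnReadAndWrite).

③ The state visibility: whether expired state may still be returned.

Enable TTL: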

stateDescriptor.enableTimeToLive(retainOneDay);
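
The difference between the two update types is easiest to see with an explicit clock. The sketch below is a plain-Java model under assumptions: TtlValueSketch is a hypothetical class, time is passed in as a plain long, and expired values are never returned (the NeverReturnExpired behavior). It is meant to illustrate the semantics described above, not to reproduce Flink's internal implementation or its cleanup strategy.

```java
/** Plain-Java model of a single TTL-guarded value with an explicit clock. */
class TtlValueSketch {

    enum UpdateType { ON_CREATE_AND_WRITE, ON_READ_AND_WRITE }

    private final long ttlMillis;
    private final UpdateType updateType;
    private String value;        // null means "no state"
    private long lastAccess;     // timestamp that the TTL is measured from

    TtlValueSketch(long ttlMillis, UpdateType updateType) {
        this.ttlMillis = ttlMillis;
        this.updateType = updateType;
    }

    /** Creating or writing the value always refreshes the timestamp. */
    void update(String newValue, long now) {
        value = newValue;
        lastAccess = now;
    }

    /** Reading returns null once the TTL has elapsed; reads refresh the timestamp only in ON_READ_AND_WRITE. */
    String value(long now) {
        if (value == null) {
            return null;
        }
        if (now - lastAccess >= ttlMillis) {
            value = null;        // expired: drop it and never return it
            return null;
        }
        if (updateType == UpdateType.ON_READ_AND_WRITE) {
            lastAccess = now;
        }
        return value;
    }
}
```

With a TTL of 100 ms and a write at time 0, a read at time 60 succeeds in both modes. A second read at time 120 returns null under ON_CREATE_AND_WRITE, but still succeeds under ON_READ_AND_WRITE because the read at 60 restarted the timer.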

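Broadcast State

Broadcast State is a kind of state introduced in Flink 1.5, generally used when two streams need to be connected. There are two variants, depending on whether the non-broadcast stream is keyed or non-keyed. Broadcast State is kept in memory at runtime rather than in a state backend such as RocksDB. To use it, extend one of the following two abstract classes:

KeyedBroadcastProcessFunction
BroadcastProcessFunction

Each of these abstract classes declares two abstract methods that we have to implement ourselves: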

public abstract class BroadcastProcessFunction<IN1, IN2, OUT> extends BaseBroadcastProcessFunction {
    // ReadOnlyContext gives read-only access to the Broadcast State
    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;
    // Context gives read-write access
    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
}

public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT> extends BaseBroadcastProcessFunction {

    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;

    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;

    // Timers can be registered to trigger computation; the default implementation does nothing
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception {
    }
}

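processElement handles elements of the regular stream, and processBroadcastElement handles elements of the broadcast DataStream. In the code below we use a MapStateDescriptor to describe the Broadcast State and apply it to the stream being broadcast; the two streams are then joined with connect, and the actual logic is passed to the process method that follows: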

// key the items by color
KeyedStream<Item, Color> colorPartitionedStream = itemStream
                        .keyBy(new KeySelector<Item, Color>(){...});

// a map descriptor to store the name of the rule (string) and the rule itself.
MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(
            "RulesBroadcastState",
            BasicTypeInfo.STRING_TYPE_INFO,
            TypeInformation.of(new TypeHint<Rule>() {}));

// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
                        .broadcast(ruleStateDescriptor);

DataStream<String> output = colorPartitionedStream
                 .connect(ruleBroadcastStream)
                 .process(

                     // type arguments in our KeyedBroadcastProcessFunction represent:
                     //   1. the key of the keyed stream
                     //   2. the type of elements in the non-broadcast side
                     //   3. the type of elements in the broadcast side
                     //   4. the type of the result, here a string

                     new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
                         // my matching logic
                     }
                 );
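
The division of labor between the two callbacks can be modeled in plain Java. In the sketch below, each BroadcastStateSketch object stands for one parallel instance holding its own copy of the rules. The class, the integer thresholds used as "rules", and the rulesView method are all assumptions made for this illustration; they are unrelated to the Item, Rule and Color types in the snippet above and are not Flink API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of one parallel instance that holds a copy of the broadcast rules. */
class BroadcastStateSketch {

    // Rule name -> threshold; a TreeMap keeps iteration order deterministic
    private final Map<String, Integer> rules = new TreeMap<>();

    /** Broadcast side: the only place where the rules are modified. */
    void processBroadcastElement(String ruleName, int threshold) {
        rules.put(ruleName, threshold);
    }

    /** Regular side: reads the rules but never changes them. */
    List<String> processElement(int value) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> rule : rules.entrySet()) {
            if (value >= rule.getValue()) {
                matches.add(rule.getKey());
            }
        }
        return matches;
    }

    /** Read-only view, comparable across instances. */
    Map<String, Integer> rulesView() {
        return Collections.unmodifiableMap(rules);
    }
}
```

Because every instance receives every broadcast element and regular elements cannot write, all instances end up with the same rules as long as processBroadcastElement is deterministic. That is the property that lets Flink rescale Broadcast State simply by adding or dropping copies.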