Big Data with Flink: State (Part 1)

I. Concepts

1. State

To tolerate failures during computation, a Flink streaming job must persist its intermediate results. This intermediate data is called State.

State can take many forms. By default it is kept in the JobManager's memory, but it can also be stored in the TaskManager's local file system or in a distributed file system such as HDFS.

2. StateBackend

The storage backend used to hold State is called the StateBackend. By default State is kept in the JobManager's memory, but it can also be saved to the local file system or to a distributed file system such as HDFS.

Note: writing to HDFS requires importing the relevant dependencies; see video 53.
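As a minimal sketch of configuring a StateBackend (the HDFS address below is a placeholder; `FsStateBackend` is the same Flink 1.x class this post uses later):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //keep checkpointed State on a distributed file system instead of JobManager memory
        //(hdfs://node1:9000/flink/ck is a placeholder address)
        env.setStateBackend(new FsStateBackend("hdfs://node1:9000/flink/ck"));
    }
}
```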

3. Checkpointing

To be fault-tolerant, a Flink streaming job can periodically persist its intermediate data; this mechanism of periodically saving intermediate results is called checkpointing, and it runs on a fixed cycle. Concretely, the JobManager periodically sends an RPC message to the subtasks in the TaskManagers, each subtask saves its computed State to the StateBackend, and each subtask reports back to the JobManager whether its checkpoint succeeded. If the job fails or restarts, the subtasks in the TaskManagers can recover from the State of the last successful checkpoint.

4. CheckpointingMode

exactly-once: exactly-once semantics, which guarantee each record is consumed and takes effect exactly once; this requires a cooperating data source, e.g. Kafka supports exactly-once.

at-least-once: each record is consumed at least once and may be consumed repeatedly, but it is more efficient than exactly-once.
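A minimal sketch of choosing the mode when enabling checkpointing, using the classic `CheckpointingMode` enum from the Flink 1.x API used throughout this post:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingModeSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //checkpoint every 10 seconds with exactly-once semantics (the default mode)
        env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE);
        //or trade exactness for throughput:
        //env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
    }
}
```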

II. Restart Strategies

To be fault-tolerant, a Flink streaming job needs checkpointing enabled. Once checkpointing is enabled and no restart strategy is configured, the default strategy is to restart indefinitely; other restart strategies can also be set.

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

//set the restart strategy
public class RestartStrategyDemo1 {

    public static void main(String[] args) throws Exception{

        //create the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //set the restart strategy
        //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));  //restart at most 3 times, with a 5-second delay before each restart
        //env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.seconds(30), Time.seconds(3)));  //at most 3 failures within a 30-second window, with a 3-second delay between restarts
        //enable checkpointing
        env.enableCheckpointing(10000); //the default restart strategy is to restart indefinitely

        //Source
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //apply Transformations
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = line.split(" ");
                for (String word : words) {
                    if("error".equals(word)) {
                        throw new RuntimeException("An exception occurred!!!!!");
                    }
                    //new Tuple2<String, Integer>(word, 1)
                    collector.collect(Tuple2.of(word, 1));
                }
            }
        });

        //group by key
        KeyedStream<Tuple2<String, Integer>, String> keyed = wordAndOne.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> tp) throws Exception {
                return tp.f0;
            }
        });

        //aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> summed = keyed.sum(1);
        
        summed.print();
        env.execute("StreamingWordCount");

    }
}

III. State Categories

Keyed state: after calling keyBy, each partition can hold one or more independent pieces of state; you do not manage the key yourself, only the value. Keyed state includes: ValueState (the value is a primitive, collection, or custom type), MapState (the value is a k-v pair), and ListState (the value is a list).

Non-keyed state: Operator State (no grouping; each subtask maintains its own state), Broadcast State, and Queryable State.
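The three keyed-state kinds above are each created from a state descriptor inside a rich function's open() method; a compact sketch (the descriptor names are illustrative, and full working examples follow in the next sections):

```java
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueStateDescriptor;

public class KeyedStateDescriptorsSketch {
    public static void main(String[] args) {
        //one descriptor per state kind: (state name, type of the stored value)
        ValueStateDescriptor<Integer> valueDesc =
                new ValueStateDescriptor<>("wc-desc", Integer.class);
        MapStateDescriptor<String, Double> mapDesc =
                new MapStateDescriptor<>("kv-state", String.class, Double.class);
        ListStateDescriptor<String> listDesc =
                new ListStateDescriptor<>("lst-state", String.class);
        //inside open(): getRuntimeContext().getState(valueDesc),
        //getMapState(mapDesc), or getListState(listDesc)
    }
}
```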

IV. Custom Keyed State

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import java.io.*;
import java.util.HashMap;

/**
 * Keep a HashMap in each post-keyBy subtask to hold intermediate results
 * Periodically persist the HashMap's contents to disk
 * If the subtask fails and restarts, open() can read the file back from disk and restore the historical state
 * (note: each timer runs periodically inside its own subtask)
 */
public class MyKeyedState02 {

    public static void main(String[] args) throws Exception{

        //create the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //env.enableCheckpointing(10000);
        //set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

        //create the DataStream
        //Source
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //apply Transformations
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = line.split(" ");
                for (String word : words) {
                    if("error".equals(word)) {
                        throw new RuntimeException("An exception occurred!!!!!");
                    }
                    //new Tuple2<String, Integer>(word, 1)
                    collector.collect(Tuple2.of(word, 1));
                }
            }
        });

        //group by key
        KeyedStream<Tuple2<String, Integer>, String> keyed = wordAndOne.keyBy(t -> t.f0);

        keyed.map(new RichMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {

            private HashMap<String, Integer> counter;

            @Override
            public void open(Configuration parameters) throws Exception {
                //initialize the HashMap or restore historical data
                //get the index of the current subtask
                int indexOfThisSubtask = getRuntimeContext().getIndexOfThisSubtask();
                File ckFile = new File("/Users/xing/Desktop/myck/" + indexOfThisSubtask);
                if(ckFile.exists()) {
                    FileInputStream fileInputStream = new FileInputStream(ckFile);
                    ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
                    counter = (HashMap<String, Integer>) objectInputStream.readObject();
                } else {
                   counter = new HashMap<>();
                }
                //simplification: start a timer thread directly inside the current subtask
                new Thread(new Runnable() {
                    @Override
                    public void run() {
                       while (true) {
                           try {
                               Thread.sleep(10000);
                               if (!ckFile.exists()) {
                                   ckFile.createNewFile();
                               }
                               //persist the HashMap's contents to the file
                               ObjectOutputStream objectOutputStream = new ObjectOutputStream(new FileOutputStream(ckFile));
                               objectOutputStream.writeObject(counter);
                               objectOutputStream.flush();
                               objectOutputStream.close();
                           } catch (Exception e) {
                               e.printStackTrace();
                           }
                       }
                    }
                }).start();
            }

            @Override
            public Tuple2<String, Integer> map(Tuple2<String, Integer> input) throws Exception {
                String word = input.f0;
                Integer count = input.f1;
                //look up the historical count in the map
                Integer historyCount = counter.get(word);
                if(historyCount == null) {
                    historyCount = 0;
                }
                int sum = historyCount + count; //add the current count to the historical count
                //update the map
                counter.put(word, sum);
                //emit the result
                return Tuple2.of(word, sum);
            }
        }).print();
        //launch the job
        env.execute("StreamingWordCount");
    }
}

V. Basic Use of ValueState

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import java.io.*;
import java.util.HashMap;

/**
 * State stored per key after keyBy is called KeyedState
 * The kinds of KeyedState are: ValueState<T> (the value is a primitive, collection, or custom type),
 * MapState<K2, V> (stores k-v pairs)  (outer key -> (inner key, inner value)),
 * ListState (the value is a list)
 *
 */
public class KeyedStateDemo01 {

    public static void main(String[] args) throws Exception{

        //create the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10000);
        //set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

        //create the DataStream
        //Source
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //apply Transformations
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = line.split(" ");
                for (String word : words) {
                    if("error".equals(word)) {
                        throw new RuntimeException("An exception occurred!!!!!");
                    }
                    //new Tuple2<String, Integer>(word, 1)
                    collector.collect(Tuple2.of(word, 1));
                }
            }
        });

        //group by key
        KeyedStream<Tuple2<String, Integer>, String> keyed = wordAndOne.keyBy(t -> t.f0);

        keyed.map(new RichMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {

            private transient ValueState<Integer> counter;

            @Override
            public void open(Configuration parameters) throws Exception {
                //to use state, first define a state descriptor (the state's type and name)
                ValueStateDescriptor<Integer> stateDescriptor = new ValueStateDescriptor<>("wc-desc", Integer.class);
                //initialize or restore the historical state
                counter = getRuntimeContext().getState(stateDescriptor);
            }

            @Override
            public Tuple2<String, Integer> map(Tuple2<String, Integer> input) throws Exception {
                //String word = input.f0;
                Integer currentCount = input.f1;
                //read the historical count from the ValueState
                Integer historyCount = counter.value(); //gets the value for the current key
                if(historyCount == null) {
                    historyCount = 0;
                }
                Integer total = historyCount + currentCount; //accumulate
                //update the state (in memory)
                counter.update(total);
                input.f1 = total; //the accumulated count
                return input;
            }
        }).print();

        //launch the job
        env.execute("StreamingWordCount");

    }
}

VI. Basic Use of MapState

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * State stored per key after keyBy is called KeyedState
 * The kinds of KeyedState are: ValueState<T> (the value is a primitive, collection, or custom type),
 * MapState<K2, V> (stores k-v pairs)  (outer key -> (inner key, inner value)),
 * ListState (the value is a list)
 *
 */
public class MapStateDemo01 {

    public static void main(String[] args) throws Exception{

        //create the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10000);
        //set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

        //sample input lines in the form province,city,amount:
        //辽宁省,沈阳市,1000
        //辽宁省,铁岭市,2000
        //河北省,廊坊市,1000
        //河北省,保定市,2000
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //apply Transformations
        SingleOutputStreamOperator<Tuple3<String, String, Double>> tpDataStream = lines.map(new MapFunction<String, Tuple3<String, String, Double>>() {
            @Override
            public Tuple3<String, String, Double> map(String value) throws Exception {
                String[] fields = value.split(",");
                return Tuple3.of(fields[0], fields[1], Double.parseDouble(fields[2]));
            }
        });
        
        //key by province first
        KeyedStream<Tuple3<String, String, Double>, String> keyedStream = tpDataStream.keyBy(t -> t.f0);

        //the value is (city, amount)
        SingleOutputStreamOperator<Tuple3<String, String, Double>> result = keyedStream.process(new KeyedProcessFunction<String, Tuple3<String, String, Double>, Tuple3<String, String, Double>>() {

            private transient MapState<String, Double> mapState;

            @Override
            public void open(Configuration parameters) throws Exception {
                //define a state descriptor
                MapStateDescriptor<String, Double> stateDescriptor = new MapStateDescriptor<String, Double>("kv-state", String.class, Double.class);
                //initialize or restore the historical state
                mapState = getRuntimeContext().getMapState(stateDescriptor);
            }

            @Override
            public void processElement(Tuple3<String, String, Double> value, Context ctx, Collector<Tuple3<String, String, Double>> out) throws Exception {
                String city = value.f1;
                Double money = value.f2;
                Double historyMoney = mapState.get(city);
                if (historyMoney == null) {
                    historyMoney = 0.0;
                }
                Double totalMoney = historyMoney + money; //accumulate
                //update the state (in memory)
                mapState.put(city, totalMoney);
                //emit the result
                value.f2 = totalMoney;
                out.collect(value);
            }
        });

        result.print();
        //launch the job
        env.execute("StreamingWordCount");
    }
}

VII. Using ListState

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.List;

/**
 * State stored per key after keyBy is called KeyedState
 * ListState (the value is a list)
 */
public class ListStateDemo01 {

    public static void main(String[] args) throws Exception{

        //create the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10000);
        //set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

        //sample input lines in the form province,city:
        //辽宁省,沈阳市
        //辽宁省,铁岭市
        //河北省,保定市
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //apply Transformations
        SingleOutputStreamOperator<Tuple2<String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) throws Exception {
                String[] fields = value.split(",");
                return Tuple2.of(fields[0], fields[1]);
            }
        });

        KeyedStream<Tuple2<String, String>, String> keyedStream = tpDataStream.keyBy(t -> t.f0);

        keyedStream.process(new KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, List<String>>>() {

            private transient ListState<String> listState;

            @Override
            public void open(Configuration parameters) throws Exception {
                //define a state descriptor
                ListStateDescriptor<String> stateDescriptor = new ListStateDescriptor<>("lst-state", String.class);
                //initialize or restore the state
                listState = getRuntimeContext().getListState(stateDescriptor);
            }

            @Override
            public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, List<String>>> out) throws Exception {
                String action = value.f1;
                listState.add(action);
                Iterable<String> iterator = listState.get();
                ArrayList<String> events = new ArrayList<>();
                for (String name : iterator) {
                    events.add(name);
                }
                out.collect(Tuple2.of(value.f0, events));
            }
        }).print();

        //launch the job
        env.execute("StreamingWordCount");

    }
}

VIII. Implementing ListState with ValueState

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

/**
 * State stored per key after keyBy is called KeyedState
 * ListState (the value is a list)
 * Here we use ValueState rather than ListState to implement ListState's functionality:
 * ValueState<List<String>>
 *
 */
public class ListStateDemo02 {

    public static void main(String[] args) throws Exception{

        //create the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10000);
        //set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

        //create the DataStream
        //Source
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //apply Transformations
        SingleOutputStreamOperator<Tuple2<String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) throws Exception {
                String[] fields = value.split(",");
                return Tuple2.of(fields[0], fields[1]);
            }
        });

        KeyedStream<Tuple2<String, String>, String> keyedStream = tpDataStream.keyBy(t -> t.f0);

        keyedStream.process(new KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, List<String>>>() {

            private transient ValueState<List<String>> listState;

            @Override
            public void open(Configuration parameters) throws Exception {
                //define a state descriptor
                ValueStateDescriptor<List<String>> listStateDescriptor = new ValueStateDescriptor<>("lst-state", TypeInformation.of(new TypeHint<List<String>>() {}));
                listState = getRuntimeContext().getState(listStateDescriptor);
            }

            @Override
            public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, List<String>>> out) throws Exception {
                String action = value.f1;
                List<String> lst = listState.value();
                if(lst == null) {
                    lst = new ArrayList<String>();
                }
                lst.add(action);
                //update the state
                listState.update(lst);
                out.collect(Tuple2.of(value.f0, lst));
            }
        }).print();

        //launch the job
        env.execute("StreamingWordCount");

    }
}

IX. Using OperatorState

Operator state is also called non-keyed state: there is no grouping, and each subtask maintains its own state.

package cn._51doit.flink.day06;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

//use OperatorState to implement an at-least-once source example
public class MyAtLeastOnceSourceDemo {

    public static void main(String[] args) throws Exception{

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //enable checkpointing
        env.enableCheckpointing(30000);
        //periodically snapshot the state to the StateBackend
        env.setStateBackend(new FsStateBackend("file:///Users/xing/Documents/dev/doit17/flink-java/ck"));
        //set the restart strategy
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

        DataStreamSource<String> lines1 = env.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<String> errorData = lines1.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                if (value.startsWith("error")) {
                    int i = 10 / 0;
                }
                return value;
            }
        });

        DataStreamSource<String> lines2 = env.addSource(new MyAtLeastOnceSource("/Users/xing/Desktop/data"));

        DataStream<String> union = errorData.union(lines2);

        union.print();
        env.execute();
    }
}

The custom MyAtLeastOnceSource:

package cn._51doit.flink.day06;

import com.google.common.base.Charsets;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import java.io.RandomAccessFile;

public class MyAtLeastOnceSource extends RichParallelSourceFunction<String> implements CheckpointedFunction {

    private String path;

    public MyAtLeastOnceSource(String path) {
        this.path = path;
    }

    private boolean flag = true;
    private Long offset = 0L;
    private transient ListState<Long> listState;
    
    /*
     * Initializes or restores the state; runs once, before run() is invoked
     */
    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Long> stateDescriptor = new ListStateDescriptor<>("offset-state", Long.class);
        listState = context.getOperatorStateStore().getListState(stateDescriptor);
        //check whether the state has already been restored
        if(context.isRestored()) {
            //restore the offset from the ListState
            Iterable<Long> iterable = listState.get();
            for (Long l : iterable) {
                offset = l;
            }
        }
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        int indexOfThisSubtask = getRuntimeContext().getIndexOfThisSubtask();
        RandomAccessFile randomAccessFile = new RandomAccessFile(path + "/" + indexOfThisSubtask + ".txt", "r");
        randomAccessFile.seek(offset); //start reading from the saved offset
        while (flag) {
            String line = randomAccessFile.readLine();
            if(line != null) {
                line = new String(line.getBytes(Charsets.ISO_8859_1), Charsets.UTF_8);
                synchronized (ctx.getCheckpointLock()) {
                    offset = randomAccessFile.getFilePointer();
                    ctx.collect(indexOfThisSubtask + ".txt : " + line);
                }
            } else {
                Thread.sleep(1000);
            }
        }
    }

    /**
     * Executed once per checkpoint; this method is called periodically
     */
    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        //periodically update the OperatorState
        listState.clear();
        listState.add(offset);
    }

    @Override
    public void cancel() {
        flag = false;
    }
}