A highly available wordCount, built up over several versions

A fragile wordCount

        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        env.addSource(new RichSourceFunction<Tuple2<String, Integer>>() {


            // volatile: cancel() is called from a different thread than run()
            private transient volatile boolean isRunning;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                isRunning  = true;
            }

            @Override
            public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
                int speed = 1;
                while(isRunning) {
                    TimeUnit.SECONDS.sleep(3);
                    // simulate an exception in the source
                    if (speed % 5 == 0) {
                        int a = 1/0;
                    }
                    ctx.collect(Tuple2.of("key1", speed++));
                }
            }

            @Override
            public void cancel() {
                this.isRunning = false;
            }
        })
                .keyBy(e -> e.f0)
                .reduce((e, ee) -> Tuple2.of(e.f0, e.f1 + ee.f1))
                .print();
        env.execute("customer source");

Output:

WARN [main] - Log file environment variable 'log.file' is not set.
WARN [main] - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'web.log.path'.
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
WARN [Source: Custom Source (1/1)#0] - Source: Custom Source (1/1)#0 (e611362dc032837ed97f9eb062e07f1c) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)

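Cause

When the source value reaches a multiple of 5, the division by zero throws and the job fails outright. The author observes, however, that a later value such as 11 would be processed normally,
so restarting the job and letting it continue is a reasonable remedy.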
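To see why the printed results stop at (key1,10), the pipeline can be traced by hand in plain Java, without Flink. This is only an illustrative sketch: the class FragileTrace and its method are hypothetical names, and the loop merely mimics what the source and the reduce do for the single key key1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone trace of the fragile job; not Flink code. */
final class FragileTrace {

    /** Returns the running sums that get printed before the simulated source fails. */
    static List<Integer> runUntilFailure() {
        List<Integer> printed = new ArrayList<>();
        int sum = 0;     // what reduce() has accumulated for key1
        int speed = 1;   // the counter inside the source
        try {
            while (true) {
                if (speed % 5 == 0) {
                    // stands in for the 1/0 in the source
                    throw new ArithmeticException("/ by zero");
                }
                sum += speed++;      // reduce: previous sum + new value
                printed.add(sum);
            }
        } catch (ArithmeticException e) {
            // the source dies here; nothing after this point is printed
        }
        return printed;
    }

    public static void main(String[] args) {
        System.out.println(runUntilFailure()); // [1, 3, 6, 10]
    }
}
```

The values 1, 2, 3 and 4 are emitted and summed; the fifth iteration throws before anything is collected, which matches the four result lines in the log.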

***wordCount*** with a restart strategy

        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        // allow up to three restarts after failures
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3,
                        Time.of(2, TimeUnit.SECONDS))
        );
        env.addSource(new RichSourceFunction<Tuple2<String, Integer>>() {

            // volatile: cancel() is called from a different thread than run()
            private transient volatile boolean isRunning;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                isRunning  = true;
            }

            @Override
            public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
                int speed = 1;
                while(isRunning) {
                    TimeUnit.SECONDS.sleep(3);
                    if (speed % 5 == 0) {
                        int a = 1/0;
                    }
                    ctx.collect(Tuple2.of("key1", speed++));
                }
            }

            @Override
            public void cancel() {
                this.isRunning = false;
            }
        })
                .keyBy(e -> e.f0)
                .reduce((e, ee) -> Tuple2.of(e.f0, e.f1 + ee.f1))
                .print();

        env.execute("customer source");

Output:

WARN [main] - Log file environment variable 'log.file' is not set.
  WARN [main] - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'web.log.path'.
 3> (key1,1)
 3> (key1,3)
 3> (key1,6)
 3> (key1,10)
  WARN [Source: Custom Source (1/1)#0] - Source: Custom Source (1/1)#0 (e611362dc032837ed97f9eb062e07f1c) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
 	at demos.source.test.Test$1.run(Test.java:36)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
 	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
 
 3> (key1,1)
 3> (key1,3)
 3> (key1,6)
 3> (key1,10)
  WARN [Source: Custom Source (1/1)#1] - Source: Custom Source (1/1)#1 (e2715e993fdb59f07151ffe999a78a5c) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
 	at demos.source.test.Test$1.run(Test.java:36)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
 	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
 
 3> (key1,1)
 3> (key1,3)
 3> (key1,6)
 3> (key1,10)
  WARN [Source: Custom Source (1/1)#2] - Source: Custom Source (1/1)#2 (c9651273984af23eabf17d4bd3286902) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
 	at demos.source.test.Test$1.run(Test.java:36)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
 	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
 
 3> (key1,1)
 ERROR [flink-rest-server-netty-worker-thread-1] - Unhandled exception.
 org.apache.flink.runtime.resourcemanager.exceptions.UnknownTaskExecutorException: No TaskExecutor registered under faa23095-fa8c-44b0-a1d0-c5e0ca848194.
 	at org.apache.flink.runtime.resourcemanager.ResourceManager.requestTaskManagerDetailsInfo(ResourceManager.java:624)
 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 3> (key1,3)
 3> (key1,6)
 3> (key1,10)
  WARN [Source: Custom Source (1/1)#3] - Source: Custom Source (1/1)#3 (9f74a9bb38748859130572c132493eb9) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
 	at demos.source.test.Test$1.run(Test.java:36)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
 Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
 	at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
 	at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
 Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=3, backoffTimeMS=2000)
 	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
 	... 4 more
 Caused by: java.lang.ArithmeticException: / by zero
 	at demos.source.test.Test$1.run(Test.java:36)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
 	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
 	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)

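Cause

When the source value reaches a multiple of 5, the division by zero throws. Because a fixed-delay restart strategy with 3 attempts is configured, the failed job is restarted and keeps running. Once the three restarts are used up, the next failure fails the job for good.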
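The restart budget seen in the log (attempts #0 to #3, then a final failure) can be reproduced with a small plain-Java simulation. It is a sketch under stated assumptions, not Flink code: FixedDelayTrace and its methods are hypothetical names, and the delay between restarts is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of a fixed restart budget; not Flink code. */
final class FixedDelayTrace {

    /** Runs the stateless source, restarting it at most maxRestarts times; returns every emitted value. */
    static List<Integer> run(int maxRestarts) {
        List<Integer> emitted = new ArrayList<>();
        int restartsUsed = 0;
        while (true) {
            try {
                runSourceOnce(emitted);
            } catch (ArithmeticException e) {
                if (restartsUsed == maxRestarts) {
                    return emitted;   // budget exhausted: the job fails for good
                }
                restartsUsed++;       // otherwise restart (the delay is omitted here)
            }
        }
    }

    /** One attempt of the source: speed always starts at 1 because nothing is saved. */
    private static void runSourceOnce(List<Integer> emitted) {
        int speed = 1;
        while (true) {
            if (speed % 5 == 0) {
                throw new ArithmeticException("/ by zero");
            }
            emitted.add(speed++);
        }
    }

    public static void main(String[] args) {
        // initial attempt + 3 restarts = 4 attempts, each emitting 1, 2, 3, 4
        System.out.println(run(3));
    }
}
```

With a budget of 3 the source runs four times in total, and every attempt starts again at 1, which is exactly the repetition visible in the log.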

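Restart strategies on failure

env.setRestartStrategy(RestartStrategies.noRestart());
No restart.
The job fails immediately.

RestartStrategies.fixedDelayRestart(3, Time.of(5, TimeUnit.SECONDS))
Fixed-delay restart strategy.
Up to 3 restart attempts, with 5 seconds between attempts; a failure after the third restart fails the job.

RestartStrategies.failureRateRestart(3, Time.of(5, TimeUnit.SECONDS), Time.of(3, TimeUnit.SECONDS))
Failure-rate restart strategy (failureRate: 3, failureInterval: 5 s, delayInterval: 3 s); after a failure the job waits for the delay before restarting.
Restarts continue as long as the rate of 3 failures per 5-second interval is not exceeded, and each restart is delayed by 3 seconds.

FallbackRestartStrategyConfiguration
Falls back to the restart strategy defined in the cluster configuration (flink-conf.yaml).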
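The idea behind a failure-rate strategy, counting failures inside a time window rather than over the whole lifetime of the job, can be sketched in plain Java. This is not Flink's implementation: FailureRateWindow is a hypothetical class, timestamps are assumed to be non-decreasing, and the exact boundary behaviour (here a restart is refused once more than maxFailures failures fall inside the window) is an assumption that may differ from Flink's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window check in the spirit of a failure-rate strategy; not Flink's implementation. */
final class FailureRateWindow {
    private final int maxFailures;
    private final long intervalMs;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureRateWindow(int maxFailures, long intervalMs) {
        this.maxFailures = maxFailures;
        this.intervalMs = intervalMs;
    }

    /** Records a failure at nowMs and reports whether a restart is still allowed. */
    boolean recordFailureAndCheck(long nowMs) {
        failureTimes.addLast(nowMs);
        // forget failures that have dropped out of the window
        while (!failureTimes.isEmpty() && nowMs - failureTimes.peekFirst() >= intervalMs) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() <= maxFailures;
    }

    public static void main(String[] args) {
        FailureRateWindow window = new FailureRateWindow(3, 5000);
        for (long t : new long[] {0, 1000, 2000, 3000}) {
            System.out.println("failure at " + t + " ms, restart allowed: " + window.recordFailureAndCheck(t));
        }
    }
}
```

Unlike the fixed-delay strategy, failures that are spread far enough apart never exhaust the budget, because old failures leave the window.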

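Remaining problem

The logs above show that after each restart the computation starts again from the beginning. It would be more useful to resume from the data at which the job failed, as with a Kafka offset.
Flink's state can be used to store the current value (speed in the source), which is periodically checkpointed to the state backend (not necessarily durable storage, for example when memory is used as the backend).
On restart, the last stored value is read back from state and the computation continues from there.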

# A restartable, stateful wordCount
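A custom source that implements the checkpoint interface.
checkpoint

The checkpoint interface (CheckpointedFunction) has two methods, snapshotState(…) and initializeState(…).
snapshotState() is triggered when the checkpoint coordinator signals a checkpoint, and persists the state to the state backend.
initializeState() initializes the state on first initialization and again on restart after a failure.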
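The interplay of the two callbacks can be illustrated with a toy model in plain Java. It is a sketch only: CheckpointTrace and its methods are hypothetical names that stand in for snapshotState() and initializeState(), and a single field plays the role of the state backend.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of snapshot and restore for a counting source; not Flink code. */
final class CheckpointTrace {
    private int currentValue;       // live value inside the running source
    private Integer lastSnapshot;   // what the "state backend" holds; null before the first checkpoint

    /** Plays the role of snapshotState(): copy the live value into durable storage. */
    void snapshot() {
        lastSnapshot = currentValue;
    }

    /** Plays the role of initializeState(): start fresh, or resume from the last snapshot. */
    void initialize() {
        currentValue = (lastSnapshot == null) ? 0 : lastSnapshot;
    }

    int emit() {
        return currentValue++;
    }

    static List<Integer> demo() {
        CheckpointTrace source = new CheckpointTrace();
        List<Integer> emitted = new ArrayList<>();
        source.initialize();           // first start: no snapshot yet
        emitted.add(source.emit());    // 0
        emitted.add(source.emit());    // 1
        source.snapshot();             // checkpoint stores currentValue == 2
        emitted.add(source.emit());    // 2, emitted after the checkpoint
        source.initialize();           // simulated crash and restart
        emitted.add(source.emit());    // 2 again: work since the checkpoint is replayed
        emitted.add(source.emit());    // 3
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [0, 1, 2, 2, 3]
    }
}
```

The restarted source resumes from the last snapshot rather than from 0, and whatever was emitted after that snapshot is emitted again. The toy model ignores downstream operators entirely.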

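import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HaveStateSource extends RichSourceFunction<Tuple2<String, Integer>> implements CheckpointedFunction {

    private static final Logger LOGGER = LoggerFactory.getLogger(HaveStateSource.class);

    // per-instance and volatile: cancel() is called from a different thread than run()
    private volatile boolean hasRunning = true;
    private transient int currentValue;
    private transient ListState<Integer> listState;

    /**
     * Called on every checkpoint: store the current value in operator state,
     * which Flink then persists to the state backend.
     */
    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        listState.update(Collections.singletonList(currentValue));
    }

    /**
     * Called on first initialization and on recovery: obtain the state and,
     * when recovering, restore the last checkpointed value.
     */
    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        listState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<Integer>("listState", TypeInformation.of(Integer.class)));
        for (Integer restored : listState.get()) {
            currentValue = restored;
            LOGGER.info("Restored currentValue = {}", currentValue);
        }
    }

    @Override
    public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
        while (hasRunning) {
            // emit and advance under the checkpoint lock so a snapshot never sees a half-updated value
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(Tuple2.of("key1", currentValue++));
            }
            TimeUnit.SECONDS.sleep(3);
            if (currentValue % 5 == 0) {
                try {
                    int a = currentValue / 0;
                } catch (Exception e) {
                    // the simulated failure is caught here, so the task itself keeps running
                    e.printStackTrace();
                }
            }
        }
    }

    @Override
    public void cancel() {
        this.hasRunning = false;
    }
}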

wordCount

        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
        // checkpoint every 2000 ms
        env.enableCheckpointing(2000);
        // keep state in memory (for testing)
        env.setStateBackend(new MemoryStateBackend());
        // failure-rate restart: restart 2 s after a failure; the job fails once the rate of 3 failures per 5 s is exceeded
        env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.of(5, TimeUnit.SECONDS), Time.of(2, TimeUnit.SECONDS)));
        // use the custom source that implements the checkpoint interface
        env.addSource(new HaveStateSource()).uid("001").name("customerSource")
            .keyBy(e -> e.f0)
            .reduce((e, ee) -> Tuple2.of(e.f0, e.f1 + ee.f1))
            .print();
        env.execute("customer source");

Output

 WARN [main] - Log file environment variable 'log.file' is not set.
 WARN [main] - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'web.log.path'.
3> (key1,0)
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
java.lang.ArithmeticException: / by zero
	at demos.source.HaveStateSource.run(HaveStateSource.java:58)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,15)
3> (key1,21)
3> (key1,28)
3> (key1,36)
3> (key1,45)
java.lang.ArithmeticException: / by zero
	at demos.source.HaveStateSource.run(HaveStateSource.java:58)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,55)
3> (key1,66)
3> (key1,78)
3> (key1,91)
3> (key1,105)
...

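The output shows that the job keeps running past the exception and continues from the previous value. Note that in this version the ArithmeticException is caught and printed inside run(), so the task itself never switches to FAILED here; on an actual restart, the value would be restored from the last checkpoint through initializeState().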
