A fragile wordCount
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
env.addSource(new RichSourceFunction<Tuple2<String, Integer>>() {
        // volatile: cancel() is called from a different thread than run()
        private transient volatile boolean isRunning;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            isRunning = true;
        }

        @Override
        public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
            int speed = 1;
            while (isRunning) {
                TimeUnit.SECONDS.sleep(3);
                // Simulate an exception in the source on every multiple of 5
                if (speed % 5 == 0) {
                    int a = 1 / 0;
                }
                ctx.collect(new Tuple2<>("key1", speed++));
            }
        }

        @Override
        public void cancel() {
            this.isRunning = false;
        }
    })
    .keyBy(e -> e.f0)
    .reduce((e, ee) -> new Tuple2<>(e.f0, e.f1 + ee.f1))
    .print();
env.execute("customer source");
Output:
WARN [main] - Log file environment variable 'log.file' is not set.
WARN [main] - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'web.log.path'.
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
WARN [Source: Custom Source (1/1)#0] - Source: Custom Source (1/1)#0 (e611362dc032837ed97f9eb062e07f1c) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)
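Cause
When the source value reaches a multiple of 5, a division by zero occurs and the job fails outright. However, only multiples of 5 trigger the error; a later value such as 11 would be processed normally.
So it makes sense to restart the job and let it continue.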
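To see why the output stops at (key1,10), here is a minimal plain-Java sketch of the same loop with no Flink involved. The class name FragileSourceModel and the explicit throw are assumptions made for illustration; the running sum stands in for the reduce step.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the failing source plus the running-sum reduce. */
final class FragileSourceModel {

    /**
     * Replays the source loop: values 1, 2, 3, ... are emitted and the running
     * sum (the reduce step) is recorded for each one. The loop throws
     * ArithmeticException as soon as the next value is a multiple of 5.
     */
    static void run(List<Integer> printedSums) {
        int speed = 1;
        int sum = 0;
        while (true) {
            if (speed % 5 == 0) {
                // Stands in for the "1 / 0" in the listing.
                throw new ArithmeticException("/ by zero");
            }
            sum += speed++;
            printedSums.add(sum);
        }
    }

    public static void main(String[] args) {
        List<Integer> printed = new ArrayList<>();
        try {
            run(printed);
        } catch (ArithmeticException e) {
            // prints: failed after printing [1, 3, 6, 10]: / by zero
            System.out.println("failed after printing " + printed + ": " + e.getMessage());
        }
    }
}
```

The four sums come from the values 1 to 4; the value 5 is never emitted because the check runs before the collect call.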
wordCount with a restart strategy
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
// Allow up to three restarts after failures, 2 seconds apart
env.setRestartStrategy(
    RestartStrategies.fixedDelayRestart(3,
        Time.of(2, TimeUnit.SECONDS))
);
env.addSource(new RichSourceFunction<Tuple2<String, Integer>>() {
        // volatile: cancel() is called from a different thread than run()
        private transient volatile boolean isRunning;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            isRunning = true;
        }

        @Override
        public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
            int speed = 1;
            while (isRunning) {
                TimeUnit.SECONDS.sleep(3);
                // Simulate an exception in the source on every multiple of 5
                if (speed % 5 == 0) {
                    int a = 1 / 0;
                }
                ctx.collect(new Tuple2<>("key1", speed++));
            }
        }

        @Override
        public void cancel() {
            this.isRunning = false;
        }
    })
    .keyBy(e -> e.f0)
    .reduce((e, ee) -> new Tuple2<>(e.f0, e.f1 + ee.f1))
    .print();
env.execute("customer source");
Output:
WARN [main] - Log file environment variable 'log.file' is not set.
WARN [main] - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'web.log.path'.
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
WARN [Source: Custom Source (1/1)#0] - Source: Custom Source (1/1)#0 (e611362dc032837ed97f9eb062e07f1c) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
WARN [Source: Custom Source (1/1)#1] - Source: Custom Source (1/1)#1 (e2715e993fdb59f07151ffe999a78a5c) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
WARN [Source: Custom Source (1/1)#2] - Source: Custom Source (1/1)#2 (c9651273984af23eabf17d4bd3286902) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,1)
ERROR [flink-rest-server-netty-worker-thread-1] - Unhandled exception.
org.apache.flink.runtime.resourcemanager.exceptions.UnknownTaskExecutorException: No TaskExecutor registered under faa23095-fa8c-44b0-a1d0-c5e0ca848194.
at org.apache.flink.runtime.resourcemanager.ResourceManager.requestTaskManagerDetailsInfo(ResourceManager.java:624)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
3> (key1,3)
3> (key1,6)
3> (key1,10)
WARN [Source: Custom Source (1/1)#3] - Source: Custom Source (1/1)#3 (9f74a9bb38748859130572c132493eb9) switched from RUNNING to FAILED with failure cause: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=3, backoffTimeMS=2000)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
... 4 more
Caused by: java.lang.ArithmeticException: / by zero
at demos.source.test.Test$1.run(Test.java:36)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
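Cause
When the source value reaches a multiple of 5, a division by zero occurs and the task fails. Because a fixed-delay strategy with 3 restart attempts is configured, the job restarts and continues after each of the first three failures. The fourth failure exhausts the restart budget and the job fails.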
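The attempt counting in the log (attempts #0 to #3) can be reproduced with a small plain-Java model. This is only a sketch of the counting logic, not Flink's implementation; FixedDelayRestartModel, runAttempt and runJob are hypothetical names, and the delay between restarts is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a fixed number of restart attempts (delay omitted). */
final class FixedDelayRestartModel {

    /** One task attempt: emits 1..4, then fails at 5, because speed is a local that starts at 1. */
    static void runAttempt(List<Integer> emitted) {
        for (int speed = 1; ; speed++) {
            if (speed % 5 == 0) {
                throw new ArithmeticException("/ by zero");
            }
            emitted.add(speed);
        }
    }

    /** Runs attempts until the restart budget is used up; returns the number of attempts made. */
    static int runJob(int maxRestartAttempts, List<Integer> emitted) {
        int restarts = 0;
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                runAttempt(emitted);
            } catch (ArithmeticException e) {
                if (restarts == maxRestartAttempts) {
                    return attempts; // restart budget exhausted: the job fails
                }
                restarts++; // a real strategy would wait for the configured delay here
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> emitted = new ArrayList<>();
        int attempts = runJob(3, emitted);
        // prints: 4 attempts, 16 records emitted
        System.out.println(attempts + " attempts, " + emitted.size() + " records emitted");
    }
}
```

Three allowed restarts mean four attempts in total, and each attempt emits the same four values again.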
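Restart strategies
env.setRestartStrategy(RestartStrategies.noRestart());
No restart
The job fails immediately on the first failure.
RestartStrategies.fixedDelayRestart(3, Time.of(5, TimeUnit.SECONDS))
Fixed-delay restart strategy
Up to 3 restarts are attempted, 5 seconds apart; a failure after the third restart fails the job.
RestartStrategies.failureRateRestart(3, Time.of(5, TimeUnit.SECONDS), Time.of(3, TimeUnit.SECONDS))
Failure-rate restart strategy (failureRate: 3, failureInterval: 5 s, delayInterval: 3 s)
The job keeps restarting as long as failures stay within the limit of 3 per 5-second interval; each restart is delayed by 3 seconds.
RestartStrategies.fallBackRestart() (FallbackRestartStrategyConfiguration)
Uses the restart strategy from the cluster configuration (flink-conf.yaml).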
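The failure-rate strategy is the least intuitive of the strategies listed above, so here is a simplified sliding-window model in plain Java. It is a sketch under assumptions, not Flink's implementation: FailureRateModel and onFailure are hypothetical names, and treating exactly 3 failures in the window as still allowed is an assumption about the boundary.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified sliding-window model of a failure-rate limit. */
final class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    private final Deque<Long> recentFailures = new ArrayDeque<>();

    FailureRateModel(int maxFailuresPerInterval, long intervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
    }

    /** Records a failure at time nowMs and reports whether a restart is still allowed. */
    boolean onFailure(long nowMs) {
        recentFailures.addLast(nowMs);
        // Drop failures that have aged out of the window.
        while (nowMs - recentFailures.peekFirst() >= intervalMs) {
            recentFailures.removeFirst();
        }
        return recentFailures.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Failures 3 s apart: at most two fall into any 5 s window, so restarts continue.
        FailureRateModel sparse = new FailureRateModel(3, 5_000);
        for (long t = 0; t <= 30_000; t += 3_000) {
            System.out.println("sparse t=" + t + " restart allowed: " + sparse.onFailure(t));
        }
        // Four failures within 3 s: the fourth one exceeds the budget.
        FailureRateModel burst = new FailureRateModel(3, 5_000);
        for (long t = 0; t <= 3_000; t += 1_000) {
            System.out.println("burst t=" + t + " restart allowed: " + burst.onFailure(t));
        }
    }
}
```

Unlike the fixed-delay strategy, the budget here is never used up for good: a job that fails rarely enough can restart indefinitely.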
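Remaining problem
The logs above show that every restart begins computing from scratch. It would be more useful to resume from the data at the point of failure, as with a Kafka offset.
Flink's state can hold the current value (speed in the source), and periodic checkpoints persist it to the state backend (not necessarily durably, for example when memory is used as the backend).
On restart, the last stored value is read back from state and the computation continues from there.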
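The difference between restarting from scratch and restarting from stored state can be sketched in plain Java. Everything here is an assumption for illustration: CheckpointedCounterModel is a hypothetical class, the checkpointed variable stands in for the state backend, a "checkpoint" is taken after every record, and the failure is transient (it happens only once).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a counter source that optionally resumes from a checkpoint. */
final class CheckpointedCounterModel {

    /** Emits 0..7 with one transient failure at value 5; returns everything emitted across restarts. */
    static List<Integer> runJob(boolean useCheckpoint) {
        List<Integer> emitted = new ArrayList<>();
        int checkpointed = 0;          // stands in for state held by the state backend
        boolean alreadyFailed = false; // makes the failure transient
        while (true) {
            // On (re)start: restore from the checkpoint, or begin from scratch.
            int current = useCheckpoint ? checkpointed : 0;
            try {
                while (current < 8) {
                    if (current == 5 && !alreadyFailed) {
                        alreadyFailed = true;
                        throw new IllegalStateException("simulated transient failure");
                    }
                    emitted.add(current++);
                    checkpointed = current; // "checkpoint" after every record, for simplicity
                }
                return emitted;
            } catch (IllegalStateException e) {
                // Restart: loop around and run the restore step again.
            }
        }
    }

    public static void main(String[] args) {
        // prints: without state: [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7]
        System.out.println("without state: " + runJob(false));
        // prints: with state:    [0, 1, 2, 3, 4, 5, 6, 7]
        System.out.println("with state:    " + runJob(true));
    }
}
```

In a real job checkpoints are periodic rather than per record, so a restart may replay the records emitted since the last completed checkpoint.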
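A restartable, stateful wordCount
The custom source implements the CheckpointedFunction interface.
checkpoint
CheckpointedFunction declares two methods, snapshotState(…) and initializeState(…).
snapshotState() is invoked each time the checkpoint coordinator triggers a checkpoint, and stores the function's state in the state backend.
initializeState() is invoked when the function is first initialized and again when it is restarted after a failure; it sets up (or recovers) the state.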
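import java.util.Collections;
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class HaveStateSource extends RichSourceFunction<Tuple2<String, Integer>> implements CheckpointedFunction {
    // Instance field (not static) and volatile, because cancel() runs on another thread.
    private volatile boolean isRunning = true;
    private transient int currentValue;
    private transient ListState<Integer> listState;

    /** Called on every checkpoint: copies the current value into operator state for the state backend. */
    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        listState.update(Collections.singletonList(currentValue));
    }

    /** Called on first start and on recovery after a failure: registers the state and restores the last value. */
    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        listState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("listState", TypeInformation.of(Integer.class)));
        if (context.isRestored()) {
            for (Integer value : listState.get()) {
                currentValue = value;
            }
        }
    }

    @Override
    public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
        while (isRunning) {
            // Emit and advance under the checkpoint lock so a snapshot never sees a half-updated value.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(new Tuple2<>("key1", currentValue++));
            }
            TimeUnit.SECONDS.sleep(3);
            if (currentValue % 5 == 0) {
                try {
                    int a = currentValue / 0;
                } catch (Exception e) {
                    // The exception is caught here, so the task keeps running and only the stack trace is printed.
                    e.printStackTrace();
                }
            }
        }
    }

    @Override
    public void cancel() {
        this.isRunning = false;
    }
}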
wordCount
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
// Take a checkpoint every 2000 ms
env.enableCheckpointing(2000);
// Keep state in memory (for testing)
env.setStateBackend(new MemoryStateBackend());
// Failure-rate restart: at most 3 failures per 5-second interval, with a 2-second delay before each restart; beyond that the job fails
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.of(5, TimeUnit.SECONDS), Time.of(2, TimeUnit.SECONDS)));
// Use the custom source that implements CheckpointedFunction
env.addSource(new HaveStateSource()).uid("001").name("customerSource")
    .keyBy(e -> e.f0)
    .reduce((e, ee) -> new Tuple2<>(e.f0, e.f1 + ee.f1))
    .print();
env.execute("customer source");
Output
WARN [main] - Log file environment variable 'log.file' is not set.
WARN [main] - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'web.log.path'.
3> (key1,0)
3> (key1,1)
3> (key1,3)
3> (key1,6)
3> (key1,10)
java.lang.ArithmeticException: / by zero
at demos.source.HaveStateSource.run(HaveStateSource.java:58)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,15)
3> (key1,21)
3> (key1,28)
3> (key1,36)
3> (key1,45)
java.lang.ArithmeticException: / by zero
at demos.source.HaveStateSource.run(HaveStateSource.java:58)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:116)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:73)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:323)
3> (key1,55)
3> (key1,66)
3> (key1,78)
3> (key1,91)
3> (key1,105)
...
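The output shows that the job keeps going after each exception and continues from its previous value instead of starting over. Note that in this listing the ArithmeticException is caught inside run(), so the stack traces come from printStackTrace() and the task itself does not fail. The value restored in initializeState() comes into play when a failure propagates out of run() and the restart strategy restarts the job.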