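I have a Flink application (Flink version 1.9.2) with checkpointing enabled. When I run it on the Apache Flink platform, checkpoints always fail with the message: Checkpoint expired before completing. I then checked the TaskManager thread dumps taken while a checkpoint was in progress and found that the thread running the two operators that call external services is always in the RUNNABLE state. My operator design and checkpoint configuration are below. How can I resolve this?
Operator design: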
public class OperatorA extends RichMapFunction<POJOA, POJOA> {
    private Connection connection;
    private String getCusipSourceIdPairsQuery;
    private String getCusipListQuery;
    private MapState> modifiedCusipState;
    private MapState> bwicMatchedModifiedCusipState;

    @Override
    public POJOA map(POJOA value) throws Exception {
        // create a new local PreparedStatement on every invocation of map()
        // update/clear the two MapStates
    }

    @Override
    public void open(Configuration parameters) {
        // initialize the JDBC connection and the TTL MapStates using GlobalJobParameters
    }

    @Override
    public void close() {
        // close the JDBC connection
    }
}
public class OperatorB extends RichMapFunction<POJOA, POJOA> {
    private MyServiceA serviceA;
    private MyServiceB serviceB;

    @Override
    public POJOA map(POJOA value) throws Exception {
        // call a RESTful GET API of ServiceB; the XML response has about 500 fields
        // use serviceA's function to extract the XML document, then populate the fields of value
    }

    @Override
    public void open(Configuration parameters) {
        // initialize a local JDBC connection and PreparedStatement using GlobalJobParameters,
        // then use the query results to initialize serviceA
        // initialize serviceB
    }
}
Checkpoint configuration:
Checkpointing Mode: Exactly Once
Interval: 15m 0s
Timeout: 10m 0s
Minimum Pause Between Checkpoints: 5m 0s
Maximum Concurrent Checkpoints: 1
Persist Checkpoints Externally: Disabled
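Expressed in code, the settings listed above would look roughly like the sketch below. It assumes checkpointing is configured programmatically on the StreamExecutionEnvironment; the class name CheckpointSetup is hypothetical, and the actual job may set these values elsewhere (for example in the cluster configuration).

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the checkpoint settings mirror the values listed above.
class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Interval 15m, mode Exactly Once
        env.enableCheckpointing(TimeUnit.MINUTES.toMillis(15), CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Timeout 10m: a checkpoint still running after this long is expired
        config.setCheckpointTimeout(TimeUnit.MINUTES.toMillis(10));
        // Minimum pause between checkpoints 5m
        config.setMinPauseBetweenCheckpoints(TimeUnit.MINUTES.toMillis(5));
        // Maximum concurrent checkpoints 1
        config.setMaxConcurrentCheckpoints(1);
        // "Persist Checkpoints Externally: Disabled" is the default,
        // so enableExternalizedCheckpoints(...) is not called.

        // The sources, OperatorA, OperatorB and sinks would be added here,
        // followed by env.execute(...).
    }
}
```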
Sample checkpoint history:
ID  Status       Acknowledged  Trigger Time  Latest Acknowledgement  End to End Duration  State Size  Buffered During Alignment
20  In Progress  3/12 (25%)    15:03:13      15:04:14                1m 1s                5.65 KB     0 B
19  Failed       3/12          14:48:13      14:50:12                10m 0s               5.65 KB     0 B
18  Failed       3/12          14:33:13      14:34:50                10m 0s               5.65 KB     0 B
17  Failed       4/12          14:18:13      14:27:04                9m 59s               2.91 MB     64.0 KB
16  Failed       3/12          14:03:13      14:05:18                10m 0s               5.65 KB     0 B
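The timing in this history can be checked with a little arithmetic. The sketch below assumes the timeout is counted from the trigger time and that all times fall on the same day; the class and method names are hypothetical. It shows that each failed checkpoint runs for the full 10-minute timeout, and that the next trigger lands 15 minutes after the previous one, which matches the trigger times in the table.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: plain time arithmetic on the values from the checkpoint history.
class CheckpointTimeline {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("HH:mm:ss");
    static final Duration INTERVAL = Duration.ofMinutes(15); // checkpoint interval
    static final Duration TIMEOUT = Duration.ofMinutes(10);  // checkpoint timeout

    // Time at which a checkpoint triggered at triggerTime expires if it has not completed.
    static String expiresAt(String triggerTime) {
        return LocalTime.parse(triggerTime, FMT).plus(TIMEOUT).format(FMT);
    }

    // Time at which the following checkpoint is due, one interval after triggerTime.
    static String nextTriggerAt(String triggerTime) {
        return LocalTime.parse(triggerTime, FMT).plus(INTERVAL).format(FMT);
    }

    public static void main(String[] args) {
        // Trigger times of the failed checkpoints 16 to 19 from the history.
        String[] triggers = {"14:03:13", "14:18:13", "14:33:13", "14:48:13"};
        for (String t : triggers) {
            System.out.println("triggered " + t
                    + ", expires " + expiresAt(t)
                    + ", next trigger " + nextTriggerAt(t));
        }
    }
}
```

For checkpoint 19, triggered at 14:48:13, this gives an expiry at 14:58:13 and a next trigger at 15:03:13, which is when checkpoint 20 started. The last acknowledgement for checkpoint 19 arrived at 14:50:12, so the remaining subtasks did not acknowledge during the rest of the timeout window.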