Official configuration reference: Configuration | Apache Flink
1. The TaskManager (TM) process stops after running for a while
Error message: org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.11-1.14.4.jar:1.14.4]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
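Cause: task cancellation timed out, and the cancellation watchdog then treated it as a fatal TaskManager error.
Fix: in the TM config file ${FLINK_HOME}/conf/flink-conf.yaml
# Disable the task cancellation watchdog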
task.cancellation.timeout: 0
Parameter description: Timeout in milliseconds after which a task cancellation times out and leads to a fatal TaskManager error. A value of 0 deactivates the watch dog. Notice that a task cancellation is different from both a task failure and a clean shutdown. Task cancellation timeout only applies to task cancellation and does not apply to task closing/clean-up caused by a task failure or a clean shutdown.
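2. JARs uploaded through the web UI are all lost after the standalone cluster restarts
Cause: uploaded files are stored under /tmp by default, and that directory gets cleaned up.
Fix: in the JobManager (JM) config file ${FLINK_HOME}/conf/flink-conf.yaml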
web.upload.dir: /usr/local/flink/upload
web.tmpdir: /usr/local/flink/tmpdir
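It helps to create these directories up front so that ownership and permissions are explicit. A minimal sketch, assuming the paths configured above; `ensure_dirs` is a hypothetical helper name, not a Flink command:

```shell
#!/usr/bin/env bash
# ensure_dirs DIR...: create each directory and check that the current user can write to it.
# Hypothetical helper for illustration; run it as the user that starts the JobManager.
ensure_dirs() {
  local dir
  for dir in "$@"; do
    mkdir -p "$dir" || return 1
    if [ ! -w "$dir" ]; then
      echo "not writable: $dir" >&2
      return 1
    fi
  done
}

# Example, using the paths from the config above:
# ensure_dirs /usr/local/flink/upload /usr/local/flink/tmpdir
```

The JobManager has to be restarted for the new settings to take effect, and JARs uploaded before the change must be uploaded again.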
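3. Running stop-cluster.sh on the JM does not stop the standalone cluster
Cause: PID files are stored under /tmp by default and get cleaned up, so the script cannot find the PIDs of the processes it should stop.
Fix: in the JM config file ${FLINK_HOME}/conf/flink-conf.yaml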
env.pid.dir: /usr/local/flink/piddir
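To confirm that the PID files now survive, the new directory can be inspected after the cluster has been running for a while. A sketch, assuming Flink's scripts write files named flink-<user>-<daemon>.pid with one PID per line; `list_flink_pids` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# list_flink_pids DIR: print every PID recorded in DIR/flink-*.pid and whether it is alive.
# Hypothetical helper; the file naming is an assumption about Flink's daemon scripts.
list_flink_pids() {
  local dir="$1" f pid
  for f in "$dir"/flink-*.pid; do
    # If the glob matched nothing, $f is the literal pattern and does not exist.
    [ -e "$f" ] || { echo "no pid files in $dir"; return 1; }
    while read -r pid; do
      [ -n "$pid" ] || continue
      if kill -0 "$pid" 2>/dev/null; then
        echo "$f: $pid running"
      else
        echo "$f: $pid not running"
      fi
    done < "$f"
  done
}

# Example, using the directory from the config above:
# list_flink_pids /usr/local/flink/piddir
```

If the PID files are already gone, the running processes can still be located with `jps` (main classes StandaloneSessionClusterEntrypoint and TaskManagerRunner) and stopped by hand before restarting with the new setting.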
4. A value stored in ZooKeeper was too long; the ZooKeeper cluster went down, which took all TMs down with it. ZooKeeper error message:
Unexpected exception causing shutdown while sock still open
java.io.IOException: Unreasonable length = 1970218037
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:95)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:249)
Zookeeper server went down in HA cluster. Please replay if there is any work around.
You can attempt to increase your jute.maxbuffer Java System Property on the ZK servers to a value higher than 2-3 GB (in bytes) to overcome this. It appears a very large record was somehow placed into your ZK by an application, which appears to have then caused this issue.
Fix: set ZooKeeper's jute.maxbuffer parameter to a suitable size.
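jute.maxbuffer is a Java system property, so it is passed to the ZooKeeper JVM as a -D flag. A sketch, assuming the stock zkServer.sh/zkEnv.sh scripts, which source conf/java.env when it exists; `zk_java_env_line` is a hypothetical helper and 10485760 (10 MiB) is only an illustrative value:

```shell
#!/usr/bin/env bash
# zk_java_env_line BYTES: print the line to put into ZooKeeper's conf/java.env.
# Hypothetical helper; SERVER_JVMFLAGS is the variable the stock zkServer.sh passes to the JVM.
zk_java_env_line() {
  local max_bytes="$1"
  # Single quotes keep $SERVER_JVMFLAGS literal so it is expanded when java.env is sourced.
  printf 'export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=%s"\n' "$max_bytes"
}

# Example (illustrative size; run on every ZooKeeper server, then restart it):
# zk_java_env_line 10485760 >> "$ZOOKEEPER_HOME/conf/java.env"
```

The ZooKeeper documentation notes that this property should be changed on all servers and clients, so the Flink JVMs probably need the same flag (for example through env.java.opts in flink-conf.yaml). Raising the limit only hides the symptom; it is still worth finding out which znode grew so large.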
5. java.lang.OutOfMemoryError: Metaspace. Full error message:
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_291]
at java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[?:1.8.0_291]
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_291]
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_291]
at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_291]
at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_291]
at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_291]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_291]
at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_291]
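Cause: no specific root cause found yet; still under observation. Online searches suggest two explanations: blocking code and backpressure.
Short-term workaround: in the TM config file ${FLINK_HOME}/conf/flink-conf.yaml
# Raise the metaspace limit (default 256m)
taskmanager.memory.jvm-metaspace.size: 512m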
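Since the root cause is still open, it can help to watch whether metaspace keeps growing across job resubmissions, which would point towards the class loading leak mentioned in the error text. A sketch, assuming the JDK's `jstat -gc` output with MC (metaspace capacity, KB) and MU (metaspace used, KB) columns; `metaspace_usage` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# metaspace_usage: read `jstat -gc <pid>` output on stdin and report metaspace usage.
# Columns are located by header name rather than by position.
metaspace_usage() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "MC") mc = i
        if ($i == "MU") mu = i
      }
      next
    }
    mc && mu { printf "metaspace used %s KB of %s KB capacity\n", $mu, $mc }
  '
}

# Example (replace <pid> with the TaskManagerRunner PID shown by `jps`):
# jstat -gc <pid> | metaspace_usage
```

If usage climbs after every resubmission and never drops, raising taskmanager.memory.jvm-metaspace.size only delays the error.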
6. Error when viewing checkpoints in the Flink UI
ERROR org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler [] - Unhandled exception.
org.apache.commons.math3.exception.NullArgumentException: input array
at org.apache.commons.math3.util.MathArrays.verifyValues(MathArrays.java:1650) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:158) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:272) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:241) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics$CommonMetricsSnapshot.getPercentile(DescriptiveStatisticsHistogramStatistics.java:158) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics.getQuantile(DescriptiveStatisticsHistogramStatistics.java:52) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.checkpoint.StatsSummarySnapshot.getQuantile(StatsSummarySnapshot.java:108) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.messages.checkpoints.StatsSummaryDto.valueOf(StatsSummaryDto.java:81) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:129) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:84) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:58) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.AbstractAccessExecutionGraphHandler.handleRequest(AbstractAccessExecutionGraphHandler.java:68) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:87) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) [?:1.8.0_291]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) [?:1.8.0_291]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_291]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_291]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_291]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
Cause: a serialization bug in Flink versions ≤ 1.14.4.
Fix: upgrade to 1.14.5 or 1.15.0, although as of 2022-05-24 the release was not yet out. See [FLINK-25904] NullArgumentException when accessing checkpoint stats on standby JobManager (ASF JIRA).
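To check whether an installation falls in the affected range, the version can be read from the flink-dist JAR name that appears in the stack trace. A sketch, assuming GNU `sort -V` is available; `flink_dist_version` and `version_le` are hypothetical helpers:

```shell
#!/usr/bin/env bash
# flink_dist_version JAR: extract the Flink version from a flink-dist JAR file name.
flink_dist_version() {
  basename "$1" | sed 's/.*-\([0-9.]*\)\.jar$/\1/'
}

# version_le A B: succeed if version A is lower than or equal to version B.
# Relies on GNU `sort -V` for version-aware ordering.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example:
# v="$(flink_dist_version "$FLINK_HOME"/lib/flink-dist_2.11-1.14.4.jar)"
# version_le "$v" 1.14.4 && echo "Flink $v is in the affected range"
```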