Flink Standalone Cluster: Troubleshooting Notes

LeoGanlin

Last modified: 2022-05-24 19:34:32


Tags: flink, big data

First published: 2022-05-11 23:00:00

Article link: https://blog.csdn.net/LeoGanlin/article/details/124692129


Official configuration reference: Configuration | Apache Flink

1. The TaskManager (TM) process stops after running for a while

Error message: org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.11-1.14.4.jar:1.14.4]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]

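Cause: task cancellation timed out. The cancellation watchdog waited 180 seconds for the task to exit, then raised a fatal TaskManager error.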

Solution: in the TM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

# Disable the task cancellation watchdog

task.cancellation.timeout: 0

Parameter description: Timeout in milliseconds after which a task cancellation times out and leads to a fatal TaskManager error. A value of 0 deactivates the watch dog. Notice that a task cancellation is different from both a task failure and a clean shutdown. Task cancellation timeout only applies to task cancellation and does not apply to task closing/clean-up caused by a task failure or a clean shutdown.
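To make the parameter's behavior concrete, here is a simplified model of a cancellation watchdog. It is not Flink's implementation: the function `cancel_with_watchdog` and the helper `slow_stop` are hypothetical names used only to illustrate a timeout that turns a slow cancellation into a fatal error, and a value of 0 that switches the check off.

```python
import threading
import time


def cancel_with_watchdog(stop_task, timeout_ms):
    """Cancel a task and report whether it exited within timeout_ms.

    Returns "cancelled" if stop_task() finishes in time, "fatal" otherwise.
    A timeout_ms of 0 disables the watchdog, so the call waits indefinitely.
    """
    worker = threading.Thread(target=stop_task, daemon=True)
    worker.start()
    if timeout_ms == 0:
        worker.join()  # no watchdog: wait as long as the task needs
        return "cancelled"
    worker.join(timeout_ms / 1000.0)
    return "fatal" if worker.is_alive() else "cancelled"


def slow_stop():
    # Stands in for user code that is slow to react to cancellation.
    time.sleep(0.3)


print(cancel_with_watchdog(slow_stop, timeout_ms=50))  # watchdog fires
print(cancel_with_watchdog(slow_stop, timeout_ms=0))   # watchdog disabled
```

In this model, turning the watchdog off trades a TM crash for a task that may stay stuck if it never exits, so it is worth checking why cancellation is slow in the first place.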

2. JARs uploaded through the web UI are all lost after the standalone cluster restarts

Cause: uploaded files are saved under /tmp by default, and that directory gets cleaned out.

Solution: in the JM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

web.upload.dir: /usr/local/flink/upload
web.tmpdir: /usr/local/flink/tmpdir

3. Running stop-cluster.sh on the JobManager (JM) does not stop the standalone cluster

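Cause: PID files are saved under /tmp by default. Once they are cleaned out, the script can no longer find the PIDs of the processes it needs to kill.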

Solution: in the JM configuration file ${FLINK_HOME}/conf/flink-conf.yaml

env.pid.dir: /usr/local/flink/piddir
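Issues 2 and 3 share one root cause: paths that end up under /tmp. A small check along the following lines can flag them before a restart. This is a sketch only: the parser handles just flat `key: value` lines, the names `parse_flat_conf` and `keys_at_risk` are hypothetical, and it assumes that an unset key falls back to /tmp, as described in the causes above.

```python
from pathlib import PurePosixPath

WATCHED_KEYS = ("web.upload.dir", "web.tmpdir", "env.pid.dir")


def parse_flat_conf(text):
    """Parse flat `key: value` lines; text after '#' is treated as a comment."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def keys_at_risk(conf, tmp_root="/tmp"):
    """Return watched keys that are unset or point at tmp_root or below it."""
    root = PurePosixPath(tmp_root)
    risky = []
    for key in WATCHED_KEYS:
        value = conf.get(key)
        if value is None:
            risky.append(key)  # assumed to fall back to a /tmp location
            continue
        path = PurePosixPath(value)
        if root == path or root in path.parents:
            risky.append(key)
    return risky


# Illustrative input: the last path is a deliberately bad, made-up value.
CONF = """
web.upload.dir: /usr/local/flink/upload
web.tmpdir: /usr/local/flink/tmpdir
env.pid.dir: /tmp/flink-pids
"""
print(keys_at_risk(parse_flat_conf(CONF)))
```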

4. A value stored in ZooKeeper is too long; the ZooKeeper cluster goes down and takes every TM down with it. ZooKeeper error message:

Unexpected exception causing shutdown while sock still open
java.io.IOException: Unreasonable length = 1970218037

at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:95)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:249)

Zookeeper server went down in HA cluster. Please replay if there is any work around.

You can attempt to increase your jute.maxbuffer Java System Property on the ZK servers to a value higher than 2-3 GB (in bytes) to overcome this. It appears a very large record was somehow placed into your ZK by an application, which appears to have then caused this issue.

Solution: raise ZooKeeper's jute.maxbuffer parameter to a size large enough for the stored values.
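To get a feel for the numbers, the sketch below compares the length from the error with ZooKeeper's documented default for jute.maxbuffer (0xfffff bytes) and formats the system property for the server JVM. The helper names `fits` and `jvm_flag` are hypothetical, the length check is a simplification of what ZooKeeper does, and the cap at the Java int maximum is an assumption about how the property is parsed.

```python
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF  # 1048575 bytes, ZooKeeper's documented default
JAVA_INT_MAX = 2**31 - 1          # assumption: the property is parsed as a Java int


def fits(length, max_buffer=DEFAULT_JUTE_MAXBUFFER):
    """True if a record of `length` bytes passes a simple length <= max_buffer check."""
    return 0 <= length <= max_buffer


def jvm_flag(max_buffer):
    """Format the system property as it would appear on the server's JVM command line."""
    if not 0 < max_buffer <= JAVA_INT_MAX:
        raise ValueError("jute.maxbuffer must fit in a positive Java int")
    return f"-Djute.maxbuffer={max_buffer}"


reported = 1970218037  # from "Unreasonable length = 1970218037"
print(fits(reported))                 # does the record pass the default limit?
print(f"{reported / 2**30:.2f} GiB")  # size of the rejected record
print(jvm_flag(4 * 1024 * 1024))      # 4 MiB is an arbitrary illustrative value
```

The reported length is close to the largest value a Java int can hold, so raising the limit that far leaves almost no headroom. As the quoted answer suggests, it is worth finding out which application wrote such a large record. The ZooKeeper documentation also advises setting the property consistently on all servers and clients.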

5. java.lang.OutOfMemoryError: Metaspace. Full error message:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_291]
at java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[?:1.8.0_291]
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_291]
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_291]
at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_291]
at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_291]
at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_291]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_291]
at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_291]

Cause: no specific root cause found yet; still monitoring. Online searches turn up two explanations: blocking code and backpressure.

Short-term workaround: in the TM configuration file ${FLINK_HOME}/conf/flink-conf.yaml, raise the metaspace size (default 256m):

taskmanager.memory.jvm-metaspace.size: 512m
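Since the root cause is still under observation, one way to watch metaspace over time is to sample `jstat -gc <pid>` on the TM and track the MU (metaspace used) and MC (metaspace capacity) columns, both reported in KB. The parser below is a minimal sketch: `metaspace_usage` is a hypothetical helper, and the sample text is made up in the JDK 8 column layout purely for illustration.

```python
def metaspace_usage(jstat_output):
    """Return (used_kb, capacity_kb) from the text printed by `jstat -gc <pid>`."""
    rows = [ln.split() for ln in jstat_output.strip().splitlines() if ln.strip()]
    header, values = rows[0], rows[-1]  # first line is the header, last line the newest sample
    row = dict(zip(header, values))
    return float(row["MU"]), float(row["MC"])


# Made-up sample in the JDK 8 `jstat -gc` column layout, for illustration only.
SAMPLE = """\
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0   512.0  8192.0   4096.0   20480.0    10240.0  262144.0 250000.0 32768.0 30000.0     10    0.100   2      0.200    0.300
"""

used_kb, capacity_kb = metaspace_usage(SAMPLE)
print(f"metaspace used: {used_kb / 1024:.1f} MiB of {capacity_kb / 1024:.1f} MiB committed")
```

If the used value climbs after every job resubmission and never drops, that points toward the class loading leak mentioned in the error message rather than a metaspace that is merely too small.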

6. The Flink UI reports an error when querying checkpoints

ERROR org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler [] - Unhandled exception.
org.apache.commons.math3.exception.NullArgumentException: input array
at org.apache.commons.math3.util.MathArrays.verifyValues(MathArrays.java:1650) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:158) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:272) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:241) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics$CommonMetricsSnapshot.getPercentile(DescriptiveStatisticsHistogramStatistics.java:158) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics.getQuantile(DescriptiveStatisticsHistogramStatistics.java:52) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.checkpoint.StatsSummarySnapshot.getQuantile(StatsSummarySnapshot.java:108) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.messages.checkpoints.StatsSummaryDto.valueOf(StatsSummaryDto.java:81) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:129) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:84) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:58) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.AbstractAccessExecutionGraphHandler.handleRequest(AbstractAccessExecutionGraphHandler.java:68) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:87) ~[flink-dist_2.11-1.14.4.jar:1.14.4]
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) [?:1.8.0_291]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) [?:1.8.0_291]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_291]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_291]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_291]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]