Problems encountered while operating Flink

Mumunu-

Last modified 2023-04-11 16:20:46

First published 2021-03-03 18:10:17

Article link: https://blog.csdn.net/h952520296/article/details/114327232

1. java.lang.IllegalStateException: No Executor found. Please make sure to export the HADOOP_CLASSPATH environment variable or have hadoop in your classpath. For more information refer to the "Deployment" section of the official Apache Flink documentation.

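There is no need to assemble this environment variable by hand. Most answers online misread the official documentation.

Just run

export HADOOP_CLASSPATH=`hadoop classpath`

and that is enough.

The hadoop classpath command simply prints every Hadoop-related classpath entry, so the export above picks up the whole Hadoop environment.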

2. [] - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

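The client cannot connect to port 8032. In my case Flink on YARN could only find port 8032 when it was run on the master node. I could not find the corresponding setting in the configuration files; if anyone has found it, please share.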
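One possible explanation, offered as an assumption rather than a confirmed diagnosis: 0.0.0.0:8032 is what yarn.resourcemanager.address resolves to when yarn.resourcemanager.hostname is left at its default, so a node whose yarn-site.xml does not name the ResourceManager ends up trying to connect to itself. A minimal sketch of the setting follows; the hostname "master" is a placeholder, not a value from this cluster.

```xml
<!-- yarn-site.xml on the node that submits the Flink job -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <!-- "master" is a placeholder: use the real ResourceManager hostname -->
    <value>master</value>
</property>
```

The directory holding this file also has to be visible to the Flink client, for example through HADOOP_CONF_DIR.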

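3. Web UI address in Flink on YARN mode

After the command runs, the address is printed on the last line of the output. I have not yet found a way to make it permanent or bind it to a fixed IP.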

4. Exception in thread "Thread-5" java.lang.IllegalStateException: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'.

In conf/flink-conf.yaml (around line 192), add:

classloader.check-leaked-classloader: false

5. Flink job submission fails because of insufficient virtual memory

Container [pid=3007,containerID=container_1599018748796_0004_01_000004] is
running 342252032B beyond the 'VIRTUAL' memory limit. Current usage: 416.0 MB
of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used.
Killing container.

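Cause

YARN enforces a check that each container's virtual memory stays within the configured limit. When the memory of the server or virtual machine falls short of what the configuration expects, this error can appear.

Fix

In yarn-site.xml, set the property that controls the virtual memory check to false: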

    <property>  
        <name>yarn.nodemanager.vmem-check-enabled</name>  
        <value>false</value>  
    </property>

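5. java.lang.NumberFormatException: For input string: "30s"

Investigation showed the cause:

The Hadoop configuration (hdfs-site.xml, under conf in the Hadoop directory) picked up a variable that is not needed at the moment, and the value of that variable is "30s". The job does not use the setting directly, but the client that reads the configuration expects a plain number there, so loading the configuration fails.

Add dfs.client.datanode-restart.timeout with the value 30 to hdfs-site.xml.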
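The exception text itself can be reproduced with the JDK alone. The sketch below assumes the client parses the value as a plain long, which is what the message suggests; the class and method names are made up for illustration and are not Hadoop code.

```java
public class TimeoutParseDemo {

    /** Returns the parse error message for raw, or null if raw is a plain number. */
    static String parseError(String raw) {
        try {
            Long.parseLong(raw);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A value with a unit suffix cannot be parsed as a long.
        System.out.println(parseError("30s")); // For input string: "30s"
        // The same value without the suffix parses fine.
        System.out.println(parseError("30"));  // null
    }
}
```

This is why replacing "30s" with 30 makes the error go away.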
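Written out as a property block, in the same form as the yarn-site.xml snippet earlier, the override would look roughly like this (a sketch; place it inside the existing configuration element of hdfs-site.xml):

```xml
<property>
    <name>dfs.client.datanode-restart.timeout</name>
    <!-- plain number, no "s" suffix -->
    <value>30</value>
</property>
```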

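6. Flink MySQL CDC error: Caused by: java.sql.SQLSyntaxErrorException: Access denied; you need (at least one of) the RELOAD privilege(s) for this operation

If possible, use an account with root-level privileges. Many of the operations require a superuser, or privileges granted at the *.* level.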

7. Flink MySQL CDC in sql-client reports unexpected block data

This is a class loading order problem. Flink defaults to child-first. Add classloader.resolve-order: parent-first to Flink's flink-conf.yaml to switch to parent-first, then restart the cluster.

8. Flink CDC: [ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'mysql-cdc' that implements 'org.apache.flink.table.f

Cause: a jar is missing.

Flink CDC dependencies can be downloaded from:

Welcome to Flink CDC — Flink CDC 2.0.0 documentation

9. Running Flink SQL fails with `[ERROR] Could not execute SQL statement. Reason: java.net.ConnectException: Connection refused`

Check the log flink-root-sql-client-xxxx.log:

Connection refused: localhost/127.0.0.1:8081

flink-conf.yaml has not been changed to the real Flink address. Update it:

rest.address: xxxxxxx

rest.bind-port: 8173

Confirm the actual port yourself; it is printed to the screen after the Flink session starts.

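10. Flink 1.15.3: Flink CDC reading from MySQL stops updating/syncing after a while

This is probably a checkpoint problem, although I could not confirm it. After setting the parameters below it worked. In theory these parameters should not make a difference, but in practice they did.

// Trigger a checkpoint every 120 seconds
env.enableCheckpointing(120000);
// Guarantee EXACTLY_ONCE inside the Flink framework
env.getCheckpointConfig().setCheckpointMode(CheckpointMode.EXACTLY_ONCE);
// Leave at least 120 s between two checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(120000);
// Checkpoint timeout: 600 s
env.getCheckpointConfig().setCheckpointTimeout(600000);
// Only one checkpoint may run at a time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// Retain the checkpoint when the job is cancelled: a savepoint is sometimes unusable, and the job can then be restarted directly from the checkpoint
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// Do not fail the task when a checkpoint fails: an occasional HDFS write failure does not affect the running job
// An occasional checkpoint failure caused by network jitter is acceptable, but frequent failures need to be investigated
env.getCheckpointConfig().setFailOnCheckpointingErrors(false);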

DataStream<Row> userStream = env

11. The heartbeat of TaskManager with id container timed out

A heartbeat timeout may be caused by heavy load on YARN. Raise this value in conf/flink-conf.yaml:

#Timeout for requesting and receiving heartbeat for both sender and receiver sides.
heartbeat.timeout: 180000

12. Flink: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

YARN has run out of resources.

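13. Container exited with a non-zero exit code 13. Error file: prelaunch.err
java.lang.ClassNotFoundException: org.apache.hadoop.mapred.JobConf

The jar that provides the class above is missing from lib (org.apache.hadoop.mapred.JobConf ships in hadoop-mapreduce-client-core). Put it into lib.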

14. Could not start rest endpoint on any port in port range 32446

Cause: the port cannot be bound. Configure rest.bind-port as a range instead, for example 10000-20000.
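Why a range helps can be shown with a small JDK-only sketch: with a single port there is exactly one chance to bind, while a range allows falling through to the next candidate. This illustrates the idea only and is not Flink's implementation; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortRangeProbe {

    /** Returns the first port in [from, to] that can be bound, or -1 if none can. */
    static int firstFreePort(int from, int to) {
        for (int port = from; port <= to; port++) {
            try (ServerSocket socket = new ServerSocket(port)) {
                // Bind succeeded; the socket is closed again when we return.
                return port;
            } catch (IOException e) {
                // Port is taken or not permitted; try the next candidate.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A single-port "range" fails as soon as that one port is busy.
        System.out.println(firstFreePort(32446, 32446));
        // A wider range gives many more candidates.
        System.out.println(firstFreePort(10000, 20000));
    }
}
```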

15. Using NULL in IF(condition, true_value, false_value) fails with org.apache.calcite.sql.validate.SqlValidatorException: Illegal use of 'NULL'

For example, IF(5>3,'12321',NULL) raises the error. Write it as IF(5>3,'12321',CAST(NULL AS STRING)) instead.

16. org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'jdbc' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.

Fix: flink-connector-jdbc_xxx.jar is missing. Download it and place it under lib.

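17. Flink InvalidTypesException: The return type of function could not be determined automatically...

Fix: roughly, the message says that the lambda does not carry enough type information for Flink to infer the correct type. It suggests either switching to an anonymous class or supplying the type information explicitly. We can call returns(...) right after the transformation operator to state the returned type explicitly.

For example: map((MapFunction<String, Tuple2<String, Integer>>) filterRecord -> {
return new Tuple2<>(filterRecord, 1);
}).returns(Types.TUPLE(Types.STRING, Types.INT))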
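The difference between a lambda and an anonymous class can be seen with plain JDK reflection. The sketch below is a related illustration, not Flink's type extraction logic (which is more involved); it uses java.util.function.Function as a stand-in for MapFunction, and the class and method names are hypothetical.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

public class LambdaErasureDemo {

    /** True if the object's class still records generic type arguments on its interfaces. */
    static boolean keepsTypeArguments(Function<?, ?> f) {
        for (Type t : f.getClass().getGenericInterfaces()) {
            if (t instanceof ParameterizedType) {
                return true;
            }
        }
        return false;
    }

    /** Anonymous class: the compiler stores Function<String, Integer> in the class file. */
    static Function<String, Integer> anonymousFn() {
        return new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };
    }

    /** Lambda: the runtime-generated class implements the raw Function interface. */
    static Function<String, Integer> lambdaFn() {
        return s -> s.length();
    }

    public static void main(String[] args) {
        System.out.println(keepsTypeArguments(anonymousFn())); // true
        System.out.println(keepsTypeArguments(lambdaFn()));    // false
    }
}
```

Because the lambda carries no recorded type arguments, a framework that wants them has to be told, which is the role returns(...) plays in the snippet above.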

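18. In IDEA, after env.execute() starts Flink, the program appears to be running but hangs indefinitely

Fix: most likely an error occurred, but the restart strategy in effect for StreamExecutionEnvironment (the default once checkpointing is enabled) is fixed-delay with an unlimited number of restarts, so the error never surfaces. Set a no-restart strategy first to see what happens: env.setRestartStrategy(RestartStrategies.noRestart());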

19. Creating the TableEnvironment object fails in Flink SQL

Symptom: No factory implements 'org.apache.flink.table.delegation.ExecutorFactory'

Fix: in addition to this dependency:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge_2.11</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>

the following one is also required:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_2.11</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>

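20. Table sink doesn't support consuming update changes which is produced

This happens when a Table is converted to a DataStream with toDataStream or toAppendStream, but the SQL contains something like count(xxx) with group by. The result of such a query is not an append-only stream, so it cannot be converted that way. Use toChangelogStream or toRetractStream instead.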

21. Exception in thread "Thread-6" java.lang.IllegalStateException: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields

This is a bug that arises from the combination of Hadoop 3 and Flink. For details see:

https://issues.apache.org/jira/browse/FLINK-19916

It does not affect functionality, so it can be ignored for now.

22. java.lang.ClassNotFoundException: org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$BlockingInterface

Put hbase-protocol-shaded-2.2.2.jar into the lib directory. You can copy it from the HBase home directory.

On CDH the jar should be hbase-protocol-shaded-2.1.0-cdh6.3.2.jar.