1. The cluster suddenly went down
- Open the Master node's log in vi, jump to the end of the file with Shift+G, then search backwards for the job name (Shift+N repeats the search) to find the corresponding job ID.
- In HDFS, run hadoop fs -ls /flink-checkpoints | grep <job ID> to locate that job's checkpoint directory, enter it, and note the path, e.g. /flink-checkpoints/a1cb4cadb79c74ac8d3c7a11b6029ec2/chk-4863
- Restart the job from that checkpoint; if the restart fails, tick Allow Non Restored State.
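The restart in the last step can also be done from the CLI. A sketch (the main class and jar name are placeholders; -s accepts an externalized checkpoint path the same way as a savepoint path):

```shell
# Restart from the checkpoint directory found above; --allowNonRestoredState
# is the CLI equivalent of the "Allow Non Restored State" checkbox in the UI.
./bin/flink run \
  -s hdfs:///flink-checkpoints/a1cb4cadb79c74ac8d3c7a11b6029ec2/chk-4863 \
  --allowNonRestoredState \
  -c com.example.MainJob job.jar
```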
2. org.apache.flink.table.api.StreamQueryConfig; local class incompatible: stream classdesc serialVersionUID = XX, local class serialVersionUID = -XXX
Check whether jar versions conflict. In my case two copies of the Flink Table jar with different versions were on the classpath, which caused the serialization error.
3. Flink: Address already in use
The cluster was started in Flink standalone mode. A cluster was already running on master2, so starting it again on master1 triggered this error. You can kill the master process on master2.
4. A running job suddenly throws Caused by: java.lang.NullPointerException
org.apache.flink.types.NullFieldException: Field 3 is null, but expected to hold a value.
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:127)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:30)
at org.apache.flink.contrib.streaming.state.RocksDBKeySerializationUtils.writeKey(RocksDBKeySerializationUtils.java:108)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.writeKeyWithGroupAndNamespace(AbstractRocksDBState.java:217)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.writeKeyWithGroupAndNamespace(AbstractRocksDBState.java:192)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.writeCurrentKeyWithGroupAndNamespace(AbstractRocksDBState.java:179)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.getKeyBytes(AbstractRocksDBState.java:161)
at org.apache.flink.contrib.streaming.state.RocksDBReducingState.add(RocksDBReducingState.java:96)
at org.apache.flink.runtime.state.ttl.TtlReducingState.add(TtlReducingState.java:52)
at com.yjp.stream.stat.business.crm.function.ReturnOrderFlatMapFunction.flatMap(ReturnOrderFlatMapFunction.java:101)
at com.yjp.stream.stat.business.crm.function.ReturnOrderFlatMapFunction.flatMap(ReturnOrderFlatMapFunction.java:24)
at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
Solution:
This exception is thrown when keyBy groups on multiple keys and one of the key fields is null. When preparing the data, null-check every field used for grouping and assign it a default value. "Field 3 is null" means the fourth key was null (indices start at 0).
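A minimal sketch of the null-defaulting described above (the helper name and the sentinel value are made up; a real job would apply this to each key field before keyBy):

```java
// Hypothetical helper: replace null key fields with a sentinel so the
// TupleSerializer never encounters a null when writing the key.
public class KeyNormalizer {
    static final String DEFAULT_KEY = "N/A"; // assumed sentinel, pick your own

    public static String[] normalizeKeys(String... keys) {
        String[] out = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = keys[i] == null ? DEFAULT_KEY : keys[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // "Field 3 is null" means index 3, i.e. the fourth key (0-based)
        String[] k = normalizeKeys("warehouse", "sku", "city", null);
        System.out.println(k[3]); // prints N/A instead of failing later in keyBy
    }
}
```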
4b. After changing the number of keyBy fields (from three keys to four), restarting from a savepoint fails
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:192)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:227)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for StreamFlatMap_16b166ecf5d9fd813aab48502efdb6f5_(1/1) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:279)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:133)
... 5 more
Caused by: org.apache.flink.util.StateMigrationException: The new key serializer is not compatible to read previous keys. Aborting now since state migration is currently not available
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullRestoreOperation.restoreKVStateMetaData(RocksDBKeyedStateBackend.java:689)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullRestoreOperation.restoreKeyGroupsInStateHandle(RocksDBKeyedStateBackend.java:652)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullRestoreOperation.doRestore(RocksDBKeyedStateBackend.java:638)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:166)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
... 7 more
Solution:
Change the uid of the operator holding this keyed state. Savepoint restore depends strongly on operator uids, so after the uid is changed the old state is discarded.
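In the DataStream API the uid is set on the operator. A sketch in the document's Java style (the new uid string is a placeholder; ReturnOrderFlatMapFunction is the operator from the stack trace above):

```java
// Changing the uid detaches the operator from its old savepoint state;
// restoring with Allow Non Restored State then drops the orphaned state.
stream
    .flatMap(new ReturnOrderFlatMapFunction())
    .uid("return-order-flatmap-v2"); // new uid -> old keyed state is abandoned
```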
5. Exception when submitting a job from the command line:
./flink run -yid application_X_X -c <main class> '<jar path>' <args>
org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar. This message alone reveals little; look at the JobManager log instead:
java.nio.file.NoSuchFileException: /tmp/flink-web-0188545c-dc3d-46bb-923b-6b3f2f7fc61b/flink-web-upload/599d2006-a636-4339-8229-7d256270dd2f
Solution: recreate the directory and grant permissions.
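The missing path in the exception is the web upload directory. A sketch of the fix (the UUID path is taken from the log above; the permission mode is an assumption):

```shell
# Recreate the upload directory the JobManager expects and make it writable
mkdir -p /tmp/flink-web-0188545c-dc3d-46bb-923b-6b3f2f7fc61b/flink-web-upload
chmod -R 755 /tmp/flink-web-0188545c-dc3d-46bb-923b-6b3f2f7fc61b
```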
6. Batch SQL joining a dimension table throws: Too many duplicate keys
Solution: filter the dimension table with a WHERE condition instead of joining it.
7. A jar submitted with flink run was accidentally rm'd while the job was still running. The jar's cached copy lives here:
1. With HA configured: under the directory set by high-availability.storageDir
2. Without HA: under io.tmp.dirs, in a folder whose name starts with blobStore; the job ID appears inside, so it is easy to find.
8. Configuring Flink's character set
Add the following line to flink-conf.yaml:
env.java.opts: "-Dfile.encoding=UTF-8"
9. There are N servers in one Hadoop cluster. The first x servers should be dedicated to one kind of Flink job and the remaining y servers to another, with YARN doing the scheduling. How do you configure this so different jobs are scheduled onto different servers?
Solution: use YARN's label-based scheduling feature (see the reference article).
10. Flink 1.8.0, errors when using 袋鼠云 (DTStack) StreamSql:
Caused by: java.lang.ClassNotFoundException: org.apache.flink.table.sinks.TableSink
Put flink-table-common-1.8.0.jar into flink/lib
Caused by: java.lang.ClassNotFoundException: org.apache.flink.table.sinks.RetractStreamTableSink
Put flink-table-planner_2.11-1.8.0.jar into flink/lib
Caused by: java.lang.ClassNotFoundException: org.apache.flink.table.api.StreamQueryConfig
Put flink-table-api-java-1.8.0.jar into flink/lib
11: org.apache.flink.util.FlinkException: The assigned slot container_e42_1571624624393_14453_01_000012_0 was removed.
The JobManager log shows: Closing TaskExecutor connection container_e42_1571624624393_14453_01_000012 because: The heartbeat of TaskManager with id container_e42_1571624624393_14453_01_000012 timed out. The root cause is the heartbeat timeout.
A container heartbeat timeout usually has one of two causes:
1. A physical machine lost network connectivity. The job normally recovers after failover; if this happens only rarely, it can be ignored.
2. The TaskManager on the failed-over node has too little memory, and heavy GC makes the heartbeat time out. Increase the memory for that node.
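For case 2 the memory is raised in flink-conf.yaml. A sketch (the first key is the unified memory option of Flink 1.10+; both values are only examples to adapt):

```yaml
# Give each TaskManager more headroom so GC pauses stay under the heartbeat timeout
taskmanager.memory.process.size: 4096m
# Optionally relax the heartbeat timeout itself (default 50000 ms)
heartbeat.timeout: 120000
```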
12: Caused by: org.apache.flink.util.StateMigrationException: The new state serializer cannot be incompatible.
The job failed when restarting from a savepoint. Cause: an existing state's type had been changed from MapState<String, Long> to MapState<Long, Long>; changing it back fixed the restore.
13: Flink on YARN: Could not start rest endpoint on any port in port range 8081
This error means the port is already in use. Looking at the source:
Iterator<Integer> portsIterator;
try {
    portsIterator = NetUtils.getPortRangeFromString(restBindPortRange);
} catch (IllegalConfigurationException e) {
    throw e;
} catch (Exception e) {
    throw new IllegalArgumentException("Invalid port range definition: " + restBindPortRange);
}
The corresponding setting is rest.bind-port in flink-conf.yaml.
If rest.bind-port is not set, the REST server binds to the rest.port port (8081 by default).
rest.bind-port accepts a list such as 50100,50101 or a range such as 50100-50200. The range form is recommended to avoid port conflicts.
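For example:

```yaml
# flink-conf.yaml: bind the REST server to the first free port in the range
rest.bind-port: 50100-50200
```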
14: Fields lost across threads
Problem: fields are assigned in the constructor when the object is created, but when they are used inside asyncInvoke(), asyncClient and table are both null. The assignment did not take effect:
public AsyncKuduLoginRealUser(FlinkKuduConfig flinkKuduConfig, List<String> queryFields, LRUCacheConfig lruCacheConfig) {
    super(flinkKuduConfig, queryFields, lruCacheConfig);
    this.asyncClient = AsyncQueryHelper.getAsyncKuduClientBuilder(flinkKuduConfig);
    this.table = AsyncQueryHelper.getKuduTable(flinkKuduConfig);
    System.out.println(Thread.currentThread().getName());
}
Cause:
The object is constructed on the main thread, but asyncInvoke() runs on "Source: Custom Source -> Flat Map -> Flat Map -> Process -> insert_async_bizuserid (1/1)". Because the thread that initializes the object is not the thread that executes it, the field values are lost. The open() method, however, is called on that Source thread, so moving the extra initialization into open() solves the problem.
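The effect can be reproduced with plain Java serialization. A minimal sketch (all names are made up) of why a constructor-initialized resource is gone once the function instance has been shipped to the task, while open() runs on the task side:

```java
import java.io.*;

// Stand-in for a rich/async function: the client field is initialized in the
// constructor on the submitting side, but Flink serializes the function and
// ships it to the TaskManager, so the transient field arrives as null.
public class OpenInitDemo implements Serializable {
    transient String asyncClient; // stands in for the real Kudu client

    public OpenInitDemo() {
        asyncClient = "client-from-constructor"; // only lives in the main thread's copy
    }

    public void open() {
        asyncClient = "client-from-open"; // runs task-side, after deserialization
    }

    // Simulate shipping the function to a TaskManager
    public static OpenInitDemo ship(OpenInitDemo fn) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(fn);
        return (OpenInitDemo) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        OpenInitDemo shipped = ship(new OpenInitDemo());
        System.out.println(shipped.asyncClient); // null: constructor value was lost
        shipped.open();
        System.out.println(shipped.asyncClient); // client-from-open
    }
}
```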
15: Producing to Kafka from Flink keeps failing with invalid transactions, restarting the job
See the reference article. Add these producer settings:
//the transaction timeout must be larger than the checkpoint interval, but smaller than the broker transaction.max.timeout.ms.
properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 1000 * 60 * 3 + "");
properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
16: Flink 1.12: the job restarts because a checkpoint failed
Set the number of tolerable checkpoint failures; the default is 0.
sEnv.getCheckpointConfig().setTolerableCheckpointFailureNumber(Integer.MAX_VALUE);
Reference:
http://apache-flink.147419.n8.nabble.com/tolerableCheckpointFailureNumber-td10599.html
17: Flink 1.12
org.codehaus.janino.CompilerFactory cannot be cast to org.codehaus.commons.compiler.ICompilerFactory
Add classloader.resolve-order: parent-first to flink-conf.yaml.
18: Flink 1.12: Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
The job runs locally but fails on the Flink cluster.
With Gradle, add mergeServiceFiles() to the shadow-jar configuration.
With Maven (shade plugin), add:
<configuration>
    <transformers>
        <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
</configuration>
19: Flink 1.12: a Flink SQL left join against a dimension table needs FOR SYSTEM_TIME AS OF d.proctime
Without it the dimension table (test_side) is read only once; with it, the table is queried on every lookup.
20. Flink 1.13.0
Exception in thread "main" java.lang.IllegalStateException: No operators defined in streaming topology. Cannot generate StreamGraph.
The source runs in batch mode and the sink is a print(), so removing env.execute("") fixes it.
21. Flink 1.13.0
Exception in thread "main" org.apache.flink.table.api.NoMatchingTableFactoryException: Could not find a suitable table factory for 'org.apache.flink.table.delegation.ExecutorFactory' in the classpath.
Add the dependency: flinkShadowJar 'org.apache.flink:flink-table-planner-blink_2.11:1.13.0'
22. Flink 1.13.0
Caused by: java.lang.ClassNotFoundException: org.apache.flink.table.sources.TableSource
Add the dependency: compileOnly("org.apache.flink:flink-table:1.13.0")
23. Exception: Connection refused: localhost/127.0.0.1:8081
Thrown when submitting locally from the command line; it turned out the Flink cluster was not running.
./start-cluster.sh
24. Flink 1.14.0
Restoring a job that uses HybridSource from a checkpoint fails:
[03:31:13:35:56:654] ERROR [SourceCoordinator-Source: file-source -> Sink: Print to Std. Out] [] [] [] @@SourceCoordinator@@ | Uncaught exception in the SplitEnumerator for Source Source: file-source -> Sink: Print to Std. Out while handling operator event SourceEventWrapper[SourceReaderFinishedEvent{sourceIndex=-1}] from subtask 3. Triggering job failover.
java.lang.NullPointerException: Source for index=0 not available
at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:104) ~[flink-core-1.14.0.jar:1.14.0]
at org.apache.flink.connector.base.source.hybrid.SwitchedSources.sourceOf(SwitchedSources.java:36) ~[flink-connector-base-1.14.0.jar:1.14.0]
at org.apache.flink.connector.base.source.hybrid.HybridSourceSplitEnumerator.sendSwitchSourceEvent(HybridSourceSplitEnumerator.java:148) ~[flink-connector-base-1.14.0.jar:1.14.0]
at org.apache.flink.connector.base.source.hybrid.HybridSourceSplitEnumerator.handleSourceEvent(HybridSourceSplitEnumerator.java:222) ~[flink-connector-base-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$handleEventFromOperator$1(SourceCoordinator.java:175) ~[flink-runtime-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$runInEventLoop$8(SourceCoordinator.java:331) ~[flink-runtime-1.14.0.jar:1.14.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_281]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_281]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_281]
25. Flink 1.14.0
A Flink SQL left join ... on throws:
Temporal table join requires an equality condition on fields of table
Check that the two x fields in the a.x = b.x condition after ON have the same type.
26. Flink 1.14.0
Context: one table is written by two Flink SQL jobs. Problem: by default, count(distinct a) loses the messages where count(distinct a) = 0:
select c, d, count(distinct a) from tmp where b = 0 group by c, d;
When the last row of a (c, d) group changes from b = 0 to b = -1, the WHERE clause is translated into a filter, so only a delete message is sent downstream with no insert message, and the count() = 0 result is lost.
Rewrite the SQL as:
select c, d, count(distinct case when b = 0 then a else null end) from tmp group by c, d;
After the change, the count() = 0 message is emitted.