Flink SQL 1.12 to Hive: troubleshooting notes on a job that went down under high cluster load and whose automatic recovery failed

I. Reviewing the yarn logs and pulling out the relevant error entries:

1. Logs where the job's checkpoints start to fail:

2023-09-13 19:57:28,125 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 61275 (type=CHECKPOINT) @ 1694606247915 for job 9a6f6c003e8eb3edf8cea8b3b0966456.
2023-09-13 19:57:29,391 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 61275 for job 9a6f6c003e8eb3edf8cea8b3b0966456 (27026 bytes in 1018 ms).
2023-09-13 19:59:28,064 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 61276 (type=CHECKPOINT) @ 1694606367915 for job 9a6f6c003e8eb3edf8cea8b3b0966456.
2023-09-13 19:59:48,200 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - PartitionCommitter -> Sink: end (1/1) (cf1974ce54d8116af731fb3838552db6) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000010 @ dn05xxx-xxx16.xxx.com (dataPort=43355).
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_e38_1686722180292_14272_01_000010(dn05xxx-xxx16.xxx.com:8041) timed out.

One machine in the cluster, dn05, had been temporarily decommissioned by the NameNode because it was overloaded. Since dn05 then stayed unreachable, the job was eventually CANCELED.
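For context, the TimeoutException above comes from Flink's TaskManager heartbeat mechanism. A minimal flink-conf.yaml sketch of the settings involved; the heartbeat values are the Flink 1.12 defaults and the checkpoint interval is inferred from the 19:57:28 / 19:59:28 trigger times above, not read from this cluster:

# JobManager <-> TaskManager heartbeats (Flink 1.12 defaults: 10 s interval, 50 s timeout).
heartbeat.interval: 10000
heartbeat.timeout: 50000
# Checkpoint trigger interval; the two-minute gap between checkpoints 61275 and 61276 suggests 2 min.
execution.checkpointing.interval: 2min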

2. The job starts RESTARTING from the most recent checkpoint

# The restart begins

Job default: INSERT INTO ......(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RESTARTING to RUNNING.

2023-09-13 20:00:11,061 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 9a6f6c003e8eb3edf8cea8b3b0966456 from Checkpoint 61275 @ 1694606247915 for 9a6f6c003e8eb3edf8cea8b3b0966456 located at hdfs://nameservicexxx/user/xxx/realdb/flink/checkpoint/9a6f6c003e8eb3edf8cea8b3b0966456/chk-61275
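The restore path in the log shows the job keeps its checkpoints on HDFS under a fixed directory. A hedged sketch of the corresponding configuration; only the directory prefix is taken from the log, while the state backend and retention policy are assumptions:

state.backend: filesystem
state.checkpoints.dir: hdfs://nameservicexxx/user/xxx/realdb/flink/checkpoint
# Retain externalized checkpoints so a failed or cancelled job can later be restored from chk-N.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION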

3. TM-container DEPLOYING logs; the attempt number identifies how many times the whole job has been restarted

streaming-writer (1/6) (07c19cfa57ae78887325b3fecc4547e4) switched from SCHEDULED to DEPLOYING.

.......

streaming-writer (1/6) (attempt #1) with attempt id 9bce97d536faba6a676755deeaa718c7 to container_e38_1686722180292_14272_01_000009 @ dn05xxx-xxx16.xxx.com (dataPort=43151) with allocation id 95f6802610ca3892f9543af43c04901d

streaming-writer (2/6) (attempt #1) with attempt id 9b73205a06f56376b8902462202cb34d to container_e38_1686722180292_14272_01_000012 @ dn06xxx-xxx17.xxx.com (dataPort=43436) with allocation id a14b93ef41a0d5b2746c2a5a6523cf89

streaming-writer (3/6) (attempt #1) with attempt id 4f1e2772ed77566a31a20fc93c51b2da to container_e38_1686722180292_14272_01_000013 @ dn06xxx-xxx17.xxx.com (dataPort=36781) with allocation id ddbe6ee511eba735866c63ff5599556a 

streaming-writer (4/6) (attempt #1) with attempt id 655f537c478fd2524671d686630880bc to container_e38_1686722180292_14272_01_000011 @ dn05xxx-xxx16.xxx.com (dataPort=42400) with allocation id b3b7cecc550fa8597f79e42479b82fd

streaming-writer (5/6) (attempt #1) with attempt id dec418b041d5123110ee03fa338d5694 to container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219) with allocation id f3d715629bdf509204f45a96dd0e6e09

streaming-writer (6/6) (attempt #1) with attempt id 1d074d7cf6c33330fddb3691b7bc365d to container_e38_1686722180292_14272_01_000014 @ dn06xxx-xxx17.xxx.com (dataPort=43549) with allocation id 0c3d867bb2ebafac0bc492ce467162f1

The six TaskManager containers were deployed on only three machines [dn05, dn06, dn08] even though the cluster has 9 DataNodes, which shows how heavily loaded the cluster was at that point. (The NodeManager logs later showed that dn05 had been overloaded since 19:59 and was temporarily decommissioned; in that state the DataNode could not serve reads or writes, yet from YARN's point of view its compute resources could still be allocated. Decommissioning protects against data loss if the DataNode crashes, and the node is brought back once its load drops, but this also caused the machine to flap on and off line repeatedly.)

4. TM-container logs: DEPLOYING to RUNNING [exception] to CANCELING to CANCELED

streaming-writer (1/6) (9bce97d536faba6a676755deeaa718c7) switched from DEPLOYING to RUNNING

......

streaming-writer (5/6) (dec418b041d5123110ee03fa338d5694) switched from DEPLOYING to RUNNING.

# After this, dn05 was taken offline again (and came back online quickly; this happened very frequently, presumably because cluster resources were so tight that there were no other machines to allocate from).

2023-09-13 20:00:18,266 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: dn05xxx-xxx16.xxx.com/172.40.13.17:34930 [Note: no "TM terminated" error appeared, i.e. this was not a fatal failure and dn05 came back online in time]

....

# Resources are re-requested to recover the two TaskManagers lost on dn05: one is supplied directly by the redundant mechanism (priority 1); the other is allocated by YARN, and unfortunately the new worker request lands on dn05 again (resources were tight and machines kept flapping on and off line). See the configuration sketch after the log excerpts below.

First TM [org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemSize=1.340gb (1438814063 bytes)}, current pending count: 1]

Second TM [org.apache.flink.yarn.YarnResourceManagerDriver              [] - Requesting new TaskExecutor container with resource TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes)}, priority 1]

# The first TM is again placed on dn05.

2023-09-13 20:00:30,697 INFO  org.apache.flink.yarn.YarnResourceManagerDriver              [] - TaskExecutor container_e40_1686722180292_14272_01_000001(dn05xxx-xxx16.xxx.com:8041) will be started on dn05xxx-xxx16.xxx.com with TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes)}.
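The worker supplied "directly by the redundant mechanism" refers to the standby TaskManager feature added in Flink 1.12. A minimal sketch of how it is enabled; the count of 1 is an assumption based on the single spare worker seen here:

# Keep N spare TaskManagers registered so that a lost container can be replaced immediately
# instead of waiting for a fresh YARN allocation (default 0).
slotmanager.redundant-taskmanager-num: 1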

......

# dn08 is at its load limit and short of memory, so creating the Hive RecordWriter fails; this is a fatal, unrecoverable failure and the job moves towards a restart.

 streaming-writer (5/6) (dec418b041d5123110ee03fa338d5694) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219).
org.apache.flink.connectors.hive.FlinkHiveException: org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive RecordWriter

....Caused by: java.lang.OutOfMemoryError: Java heap space
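The java.lang.OutOfMemoryError is thrown from the task heap while the Hive RecordWriter is being created. The TaskExecutorProcessSpec logged above adds up to roughly a 4 GB TaskManager process; a hedged sketch of the memory knobs involved (the enlarged figures below are illustrative assumptions, not the fix that was actually applied):

# Sizing implied by the logged spec (~4 GB process, ~1.4 GB task heap, ~1.3 GB managed memory).
taskmanager.memory.process.size: 4096m
# Possible mitigations: a larger process, or a smaller managed-memory fraction so that more of the
# process budget goes to the task heap used by the Hive writer (illustrative values only).
# taskmanager.memory.process.size: 6144m
# taskmanager.memory.managed.fraction: 0.3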

# The tasks are cancelled

Job default: INSERT INTO ...... (9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RUNNING to RESTARTING. ......

(4/6) (4842bb76778fd3cfe11805b5a3e033cb) switched from RUNNING to CANCELING. ...... multiple similar entries ......

(5/6) (34c88ea306caca716e8a161184128f48) switched from CANCELING to CANCELED. ...... multiple similar entries ......

5. Logs of the second restart, after the job hits the fatal OutOfMemoryError. (The allocation id and container id are the same before and after the restart; the attempt id of each TM container changes on every attempt and is used to count TM restarts.)

Job default: INSERT INTO ......(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RESTARTING to RUNNING.

.....chk-61275....

...(3/6) (58ab717d3d93694434ab3c4a5d31b4ab) switched from CREATED to SCHEDULED....

...streaming-writer (3/6) (58ab717d3d93694434ab3c4a5d31b4ab) switched from SCHEDULED to DEPLOYING....

...streaming-writer (3/6) (attempt #2) with attempt id 58ab717d3d93694434ab3c4a5d31b4ab to container_e38_1686722180292_14272_01_000013@ dn06xxx-xxx17.xxx.com (dataPort=36781) with allocation id ddbe6ee511eba735866c63ff5599556a...

...switched from DEPLOYING to RUNNING...

# Deployment and startup on dn08 hit the same fatal error again, and another restart is prepared

streaming-writer (5/6) (8312252753df1bfe91bb3d8165fff4ec) switched from DEPLOYING to RUNNING

compact-operator (5/6) (83b87a0cbda2991a416d7c423f82be94) switched from DEPLOYING to RUNNING

【streaming-writer (5/6) (8312252753df1bfe91bb3d8165fff4ec) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219).】

org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive RecordWriter

Caused by: java.lang.OutOfMemoryError: Java heap space

Job default: INSERT INTO ...(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RUNNING to RESTARTING.

The tasks are cancelled;

# The third restart begins; deployment again lands on the same three machines dn05, dn06, dn08

switched from state RESTARTING to RUNNING;

chk-61275 restore;

switched from CREATED to SCHEDULED;

switched from SCHEDULED to DEPLOYING;

switched from DEPLOYING to RUNNING.

When it reaches dn08, it runs out of memory again;

# From here on, the job goes through essentially the same restart cycle for the remaining configured attempts, failing each time the TM deployed on dn08 starts running;
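The "configured attempts" come from the job's restart strategy; the behaviour in these logs, a fixed number of restarts that is finally exhausted at attempt 30 as noted in the summary below, matches a fixed-delay strategy along the following lines (the attempt count is taken from that summary, the delay is an assumption):

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 30
restart-strategy.fixed-delay.delay: 10 s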

# At attempt #24 the deployment still lands on the same three machines dn05, dn06, dn08; the cluster is still under heavy load (20:38)

# At attempt #25 the deployment still lands on the same three machines dn05, dn06, dn08; the cluster is still under heavy load (20:39);

During this attempt dn06 is taken offline right after (3/6) (2ffc15ce8f313ff5d65595ef01b28443) switched from DEPLOYING to RUNNING;

Association with remote system [akka.tcp://flink@dn06xxx-xxx17.xxx.com:46827] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2023-09-13 20:39:57,921 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_e38_1686722180292_14272_01_000013 is terminated. Diagnostics: [2023-09-13 20:39:57.166]Container exited with a non-zero exit code 239
[2023-09-13 20:39:57.166]Container exited with a non-zero exit code 239

# A fatal "TM terminated" exception occurs and a restart is prepared. A likely reason the dn06 TM was not recovered in place: dn06 stayed offline, so the latest state could not be recovered from the running TM and was only available from checkpoints, which leaves a full job restart as the only recovery path.

# Approximate abnormal window for dn08: 19:59 ~ 20:48 (exact recovery time not verified);

# Approximate abnormal window for dn06: 20:39 (uncertain) ~ 20:48 (exact recovery time not verified); the node flapped on and off line repeatedly during this period;

# Approximate abnormal window for dn05: 20:00 ~ 20:48 (exact recovery time not verified); the node flapped on and off line repeatedly during this period;

II. Summary of the restart sequence:

Restart 1:

Cause: checkpoint failure; the fatal exception triggered a restart;

Observation: dn05 was briefly taken offline and then came back; dn06 normal; dn08 abnormal.

Restart 2:

Cause: OOM on dn08; the fatal exception triggered a restart;

Observation: dn05 and dn06 normal; dn08 abnormal.

Restarts 3 through 24: [20:06 ~ 20:38]

Cause: OOM on dn08; the fatal exception triggered each restart;

Observation: dn05 and dn06 normal; dn08 abnormal.

Restart 25:

Cause: OOM on dn08 during restart 24; the fatal exception triggered another restart;

Observation: dn05 normal; dn08 abnormal; dn06 went offline and stayed offline for a long time, exceeding the TM recovery deadline.

Restart 26:

Cause: dn06 stayed offline too long during restart 25; the fatal exception triggered another restart;

Observation: dn05 normal; dn08 abnormal; dn06 possibly abnormal: during one DEPLOYING-to-RUNNING transition two workers were re-requested (redundant: 1 [container_e40_1686722180292_14272_01_000002]; dn05: 1);

Restart 27:

Cause: OOM on dn08 during restart 26; the fatal exception triggered another restart;

Observation: dn05 and dn06 normal; dn08 abnormal;

Restart 28:

Cause: OOM on dn08 during restart 27; the fatal exception triggered another restart;

Observation: dn05 and dn06 normal; dn08 abnormal;

Restart 29:

Cause: OOM on dn08 during restart 28; the fatal exception triggered another restart;

Observation: dn05 went offline and stayed lost for a long time (a fatal failure), so its TM could not be recovered and the next restart began; the state of dn06 and dn08 is unknown;

Final restart: [the configured limit of 30 attempts has been reached]

Cause: dn05 stayed offline too long during restart 29; the fatal exception triggered another restart;

Observation: dn08 abnormal (OOM); dn05 possibly normal; dn06 abnormal or offline;

At one point two TMs (for dn06) were re-requested: container_e40_1686722180292_14272_01_000003 (dn10xxx-xxx20.xxx.com:8041), allocated by YARN.

Outcome:

Job default: INSERT INTO ... (9a6f6c003e8eb3edf8cea8b3b0966456) switched from state FAILING to FAILED

Shutting down...
 
