Flink SQL 1.12 to Hive: troubleshooting notes on a job that went down under high cluster load and whose automatic recovery failed

I. Reviewing the yarn logs and pulling out the relevant error entries:

1. Logs where the job's checkpoints start to fail:

2023-09-13 19:57:28,125 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 61275 (type=CHECKPOINT) @ 1694606247915 for job 9a6f6c003e8eb3edf8cea8b3b0966456.
2023-09-13 19:57:29,391 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 61275 for job 9a6f6c003e8eb3edf8cea8b3b0966456 (27026 bytes in 1018 ms).
2023-09-13 19:59:28,064 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 61276 (type=CHECKPOINT) @ 1694606367915 for job 9a6f6c003e8eb3edf8cea8b3b0966456.
2023-09-13 19:59:48,200 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - PartitionCommitter -> Sink: end (1/1) (cf1974ce54d8116af731fb3838552db6) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000010 @ dn05xxx-xxx16.xxx.com (dataPort=43355).
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_e38_1686722180292_14272_01_000010(dn05xxx-xxx16.xxx.com:8041) timed out.

One machine in the cluster, dn05, had been temporarily decommissioned by the NameNode because it was overloaded. Since dn05 then stayed unreachable, the job was eventually CANCELED.
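For context, the TimeoutException above comes from Flink's TaskManager heartbeat mechanism. A minimal flink-conf.yaml sketch of the settings involved; the heartbeat values are the Flink 1.12 defaults and the checkpoint interval is inferred from the 19:57:28 / 19:59:28 trigger times above, not read from this cluster:

# JobManager <-> TaskManager heartbeats (Flink 1.12 defaults: 10 s interval, 50 s timeout).
heartbeat.interval: 10000
heartbeat.timeout: 50000
# Checkpoint trigger interval; the two-minute gap between checkpoints 61275 and 61276 suggests 2 min.
execution.checkpointing.interval: 2min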

2. The job starts RESTARTING from the most recent checkpoint

# The restart begins

Job default: INSERT INTO ......(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RESTARTING to RUNNING.

2023-09-13 20:00:11,061 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 9a6f6c003e8eb3edf8cea8b3b0966456 from Checkpoint 61275 @ 1694606247915 for 9a6f6c003e8eb3edf8cea8b3b0966456 located at hdfs://nameservicexxx/user/xxx/realdb/flink/checkpoint/9a6f6c003e8eb3edf8cea8b3b0966456/chk-61275
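The restore path in the log shows the job keeps its checkpoints on HDFS under a fixed directory. A hedged sketch of the corresponding configuration; only the directory prefix is taken from the log, while the state backend and retention policy are assumptions:

state.backend: filesystem
state.checkpoints.dir: hdfs://nameservicexxx/user/xxx/realdb/flink/checkpoint
# Retain externalized checkpoints so a failed or cancelled job can later be restored from chk-N.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION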

3. TM-container DEPLOYING logs; the attempt number identifies how many times the whole job has been restarted

streaming-writer (1/6) (07c19cfa57ae78887325b3fecc4547e4) switched from SCHEDULED to DEPLOYING.

.......

streaming-writer (1/6) (attempt #1) with attempt id 9bce97d536faba6a676755deeaa718c7 to container_e38_1686722180292_14272_01_000009 @ dn05xxx-xxx16.xxx.com (dataPort=43151) with allocation id 95f6802610ca3892f9543af43c04901d

streaming-writer (2/6) (attempt #1) with attempt id 9b73205a06f56376b8902462202cb34d to container_e38_1686722180292_14272_01_000012 @ dn06xxx-xxx17.xxx.com (dataPort=43436) with allocation id a14b93ef41a0d5b2746c2a5a6523cf89

streaming-writer (3/6) (attempt #1) with attempt id 4f1e2772ed77566a31a20fc93c51b2da to container_e38_1686722180292_14272_01_000013 @ dn06xxx-xxx17.xxx.com (dataPort=36781) with allocation id ddbe6ee511eba735866c63ff5599556a 

streaming-writer (4/6) (attempt #1) with attempt id 655f537c478fd2524671d686630880bc to container_e38_1686722180292_14272_01_000011 @ dn05xxx-xxx16.xxx.com (dataPort=42400) with allocation id b3b7cecc550fa8597f79e42479b82fd

streaming-writer (5/6) (attempt #1) with attempt id dec418b041d5123110ee03fa338d5694 to container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219) with allocation id f3d715629bdf509204f45a96dd0e6e09

streaming-writer (6/6) (attempt #1) with attempt id 1d074d7cf6c33330fddb3691b7bc365d to container_e38_1686722180292_14272_01_000014 @ dn06xxx-xxx17.xxx.com (dataPort=43549) with allocation id 0c3d867bb2ebafac0bc492ce467162f1

The six TaskManager containers were deployed on only three machines [dn05, dn06, dn08] even though the cluster has 9 DataNodes, which shows how heavily loaded the cluster was at that point. (The NodeManager logs later showed that dn05 had been overloaded since 19:59 and was temporarily decommissioned; in that state the DataNode could not serve reads or writes, yet from YARN's point of view its compute resources could still be allocated. Decommissioning protects against data loss if the DataNode crashes, and the node is brought back once its load drops, but this also caused the machine to flap on and off line repeatedly.)

4. TM-container logs: DEPLOYING to RUNNING [exception] to CANCELING to CANCELED

streaming-writer (1/6) (9bce97d536faba6a676755deeaa718c7) switched from DEPLOYING to RUNNING

......

streaming-writer (5/6) (dec418b041d5123110ee03fa338d5694) switched from DEPLOYING to RUNNING.

# After this, dn05 was taken offline again (and came back online quickly; this happened very frequently, presumably because cluster resources were so tight that there were no other machines to allocate from).

2023-09-13 20:00:18,266 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: dn05xxx-xxx16.xxx.com/172.40.13.17:34930 [Note: no "TM terminated" error appeared, i.e. this was not a fatal failure and dn05 came back online in time]

....

# Resources are re-requested to recover the two TaskManagers lost on dn05: one is supplied directly by the redundant mechanism (priority 1); the other is allocated by YARN, and unfortunately the new worker request lands on dn05 again (resources were tight and machines kept flapping on and off line). See the configuration sketch after the log excerpts below.

First TM [org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemSize=1.340gb (1438814063 bytes)}, current pending count: 1]

Second TM [org.apache.flink.yarn.YarnResourceManagerDriver              [] - Requesting new TaskExecutor container with resource TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes)}, priority 1]

# The first TM is again placed on dn05.

2023-09-13 20:00:30,697 INFO  org.apache.flink.yarn.YarnResourceManagerDriver              [] - TaskExecutor container_e40_1686722180292_14272_01_000001(dn05xxx-xxx16.xxx.com:8041) will be started on dn05xxx-xxx16.xxx.com with TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes)}.
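The worker supplied "directly by the redundant mechanism" refers to the standby TaskManager feature added in Flink 1.12. A minimal sketch of how it is enabled; the count of 1 is an assumption based on the single spare worker seen here:

# Keep N spare TaskManagers registered so that a lost container can be replaced immediately
# instead of waiting for a fresh YARN allocation (default 0).
slotmanager.redundant-taskmanager-num: 1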

......

# dn08 is at its load limit and short of memory, so creating the Hive RecordWriter fails; this is a fatal, unrecoverable failure and the job moves towards a restart.

 streaming-writer (5/6) (dec418b041d5123110ee03fa338d5694) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219).
org.apache.flink.connectors.hive.FlinkHiveException: org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive RecordWriter

....Caused by: java.lang.OutOfMemoryError: Java heap space
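The java.lang.OutOfMemoryError is thrown from the task heap while the Hive RecordWriter is being created. The TaskExecutorProcessSpec logged above adds up to roughly a 4 GB TaskManager process; a hedged sketch of the memory knobs involved (the enlarged figures below are illustrative assumptions, not the fix that was actually applied):

# Sizing implied by the logged spec (~4 GB process, ~1.4 GB task heap, ~1.3 GB managed memory).
taskmanager.memory.process.size: 4096m
# Possible mitigations: a larger process, or a smaller managed-memory fraction so that more of the
# process budget goes to the task heap used by the Hive writer (illustrative values only).
# taskmanager.memory.process.size: 6144m
# taskmanager.memory.managed.fraction: 0.3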

# The tasks are cancelled

Job default: INSERT INTO ...... (9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RUNNING to RESTARTING. ......

(4/6) (4842bb76778fd3cfe11805b5a3e033cb) switched from RUNNING to CANCELING. ...... multiple similar entries ......

(5/6) (34c88ea306caca716e8a161184128f48) switched from CANCELING to CANCELED. ...... multiple similar entries ......

5. Logs of the second restart, after the job hits the fatal OutOfMemoryError. (The allocation id and container id are the same before and after the restart; the attempt id of each TM container changes on every attempt and is used to count TM restarts.)

Job default: INSERT INTO ......(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RESTARTING to RUNNING.

.....chk-61275....

...(3/6) (58ab717d3d93694434ab3c4a5d31b4ab) switched from CREATED to SCHEDULED....

...streaming-writer (3/6) (58ab717d3d93694434ab3c4a5d31b4ab) switched from SCHEDULED to DEPLOYING....

...streaming-writer (3/6) (attempt #2) with attempt id 58ab717d3d93694434ab3c4a5d31b4ab to container_e38_1686722180292_14272_01_000013@ dn06xxx-xxx17.xxx.com (dataPort=36781) with allocation id ddbe6ee511eba735866c63ff5599556a...

...switched from DEPLOYING to RUNNING...

# Deployment and startup on dn08 hit the same fatal error again, and another restart is prepared

streaming-writer (5/6) (8312252753df1bfe91bb3d8165fff4ec) switched from DEPLOYING to RUNNING

compact-operator (5/6) (83b87a0cbda2991a416d7c423f82be94) switched from DEPLOYING to RUNNING

【streaming-writer (5/6) (8312252753df1bfe91bb3d8165fff4ec) switched from RUNNING to FAILED on container_e38_1686722180292_14272_01_000016 @ dn08xxx-xxx19.xxx.com (dataPort=36219).】

org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive RecordWriter

Caused by: java.lang.OutOfMemoryError: Java heap space

Job default: INSERT INTO ...(9a6f6c003e8eb3edf8cea8b3b0966456) switched from state RUNNING to RESTARTING.

The tasks are cancelled;

# The third restart begins; deployment again lands on the same three machines dn05, dn06, dn08

switched from state RESTARTING to RUNNING;

chk-61275 restore;

switched from CREATED to SCHEDULED;

switched from SCHEDULED to DEPLOYING;

switched from DEPLOYING to RUNNING.

When it reaches dn08, it runs out of memory again;

# From here on, the job goes through essentially the same restart cycle for the remaining configured attempts, failing each time the TM deployed on dn08 starts running;
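The "configured attempts" come from the job's restart strategy; the behaviour in these logs, a fixed number of restarts that is finally exhausted at attempt 30 as noted in the summary below, matches a fixed-delay strategy along the following lines (the attempt count is taken from that summary, the delay is an assumption):

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 30
restart-strategy.fixed-delay.delay: 10 s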

# At attempt #24 the deployment still lands on the same three machines dn05, dn06, dn08; the cluster is still under heavy load (20:38)

# At attempt #25 the deployment still lands on the same three machines dn05, dn06, dn08; the cluster is still under heavy load (20:39);

During this attempt dn06 is taken offline right after (3/6) (2ffc15ce8f313ff5d65595ef01b28443) switched from DEPLOYING to RUNNING;

Association with remote system [akka.tcp://flink@dn06xxx-xxx17.xxx.com:46827] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2023-09-13 20:39:57,921 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_e38_1686722180292_14272_01_000013 is terminated. Diagnostics: [2023-09-13 20:39:57.166]Container exited with a non-zero exit code 239
[2023-09-13 20:39:57.166]Container exited with a non-zero exit code 239

# A fatal "TM terminated" exception occurs and a restart is prepared. A likely reason the dn06 TM was not recovered in place: dn06 stayed offline, so the latest state could not be recovered from the running TM and was only available from checkpoints, which leaves a full job restart as the only recovery path.

# Approximate abnormal window for dn08: 19:59 ~ 20:48 (exact recovery time not verified);

# Approximate abnormal window for dn06: 20:39 (uncertain) ~ 20:48 (exact recovery time not verified); the node flapped on and off line repeatedly during this period;

# Approximate abnormal window for dn05: 20:00 ~ 20:48 (exact recovery time not verified); the node flapped on and off line repeatedly during this period;

II. Summary of the restart sequence:

Restart 1:

Cause: checkpoint failure; the fatal exception triggered a restart;

Observation: dn05 was briefly taken offline and then came back; dn06 normal; dn08 abnormal.

Restart 2:

Cause: OOM on dn08; the fatal exception triggered a restart;

Observation: dn05 and dn06 normal; dn08 abnormal.

Restarts 3 through 24: [20:06 ~ 20:38]

Cause: OOM on dn08; the fatal exception triggered each restart;

Observation: dn05 and dn06 normal; dn08 abnormal.

Restart 25:

Cause: OOM on dn08 during restart 24; the fatal exception triggered another restart;

Observation: dn05 normal; dn08 abnormal; dn06 went offline and stayed offline for a long time, exceeding the TM recovery deadline.

Restart 26:

Cause: dn06 stayed offline too long during restart 25; the fatal exception triggered another restart;

Observation: dn05 normal; dn08 abnormal; dn06 possibly abnormal: during one DEPLOYING-to-RUNNING transition two workers were re-requested (redundant: 1 [container_e40_1686722180292_14272_01_000002]; dn05: 1);

Restart 27:

Cause: OOM on dn08 during restart 26; the fatal exception triggered another restart;

Observation: dn05 and dn06 normal; dn08 abnormal;

Restart 28:

Cause: OOM on dn08 during restart 27; the fatal exception triggered another restart;

Observation: dn05 and dn06 normal; dn08 abnormal;

Restart 29:

Cause: OOM on dn08 during restart 28; the fatal exception triggered another restart;

Observation: dn05 went offline and stayed lost for a long time (a fatal failure), so its TM could not be recovered and the next restart began; the state of dn06 and dn08 is unknown;

Final restart: [the configured limit of 30 attempts has been reached]

Cause: dn05 stayed offline too long during restart 29; the fatal exception triggered another restart;

Observation: dn08 abnormal (OOM); dn05 possibly normal; dn06 abnormal or offline;

At one point two TMs (for dn06) were re-requested: container_e40_1686722180292_14272_01_000003 (dn10xxx-xxx20.xxx.com:8041), allocated by YARN.

Outcome:

Job default: INSERT INTO ... (9a6f6c003e8eb3edf8cea8b3b0966456) switched from state FAILING to FAILED

Shutting down...
 
