Symptoms
- Jobs submitted to the Hadoop cluster would not go through; submissions kept failing
- The cluster showed no shortage of resources
Checking the logs
The ResourceManager reported ZooKeeper-related errors
The active ResourceManager's log showed failures while persisting application state to ZooKeeper, after which it waited for the dispatcher to drain the related events:
2021-08-26 14:53:13 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:857 - State store operation failed
java.io.IOException: Wait for ZKClient creation timed out
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1200)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:1067)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:812)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:189)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:175)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:844)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:904)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:899)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2021-08-26 14:53:13 WARN org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:859 - State-store fenced ! Transitioning RM to standby
2021-08-26 14:53:13 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:187 - Removing info for app: application_1606379848739_775542
2021-08-26 14:53:13 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:921 - RMStateStore has been fenced
2021-08-26 14:53:13 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:757 - Transitioning RM to Standby mode
2021-08-26 14:53:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:151 - Waiting for AsyncDispatcher to drain.
... (the "Waiting for AsyncDispatcher to drain." line repeated once per second until 14:54:13) ...
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:151 - Waiting for AsyncDispatcher to drain.
2021-08-26 14:54:13 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:191 - Error removing app: application_1606379848739_775542
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1195)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:1067)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:812)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:189)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:175)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:844)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:904)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:899)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2021-08-26 14:54:13 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:857 - State store operation failed
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1195)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:1067)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:812)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:189)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:175)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:844)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:904)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:899)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2021-08-26 14:54:13 WARN org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:859 - State-store fenced ! Transitioning RM to standby
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:921 - RMStateStore has been fenced
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:757 - Transitioning RM to Standby mode
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:75 - NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:77 - ContainerTokenKeyRollingInterval: 86400000ms and ContainerTokenKeyActivationDelay: 900000ms
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:94 - AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 ms
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory:33 - Using RMStateStore implementation - class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType for class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:298 - Using Scheduler: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
2021-08-26 14:54:13 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:111 - loaded properties from hadoop-metrics2.properties
2021-08-26 14:54:13 INFO com.ucar.spacex.yarn.sink.HadoopTimelineMetricsSink:58 - Initializing Timeline metrics sink.
2021-08-26 14:54:13 INFO com.ucar.spacex.yarn.sink.HadoopTimelineMetricsSink:83 - Identified hostname = namenode01.bi, serviceName = resourcemanager
2021-08-26 14:54:13 INFO com.ucar.spacex.yarn.common.timeline.availability.MetricSinkWriteShardHostnameHashingStrategy:43 - Calculated collector shard spacex05-prod-rg1-bj2 based on hostname: namenode01.bi
2021-08-26 14:54:13 INFO com.ucar.spacex.yarn.sink.HadoopTimelineMetricsSink:109 - Collector Uri: http://spacex05-prod-rg1-bj2:8080/spacex/sinkConsumer/metrics
2021-08-26 14:54:13 INFO com.ucar.spacex.yarn.sink.HadoopTimelineMetricsSink:110 - Container Metrics Uri: http://spacex05-prod-rg1-bj2:8080/spacex/sinkConsumer/containermetrics
2021-08-26 14:54:13 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter:195 - Sink timeline started
2021-08-26 14:54:13 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:376 - Scheduled snapshot period at 10 second(s).
2021-08-26 14:54:13 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:191 - ResourceManager metrics system started
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:207 - Registering class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType for class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2021-08-26 14:54:13 WARN org.apache.hadoop.metrics2.util.MBeans:64 - Failed to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already exists.
2021-08-26 14:54:13 INFO org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo:63 - Registered RMNMInfo MBean
2021-08-26 14:54:13 INFO org.apache.hadoop.util.HostsFileReader:129 - Refreshing hosts (include/exclude) list
Root-cause analysis
Why does the RM persist application state?
After a job is submitted to the ResourceManager, the RM saves the job's state so that if the active RM dies, the standby RM can restore the pre-failure cluster state after becoming active. Jobs that had already finished when the RM died are unaffected; only running jobs are impacted.
When does application state need to be stored?
Mainly in the following cases:
- When application state is added or updated, via the storeApplicationStateInternal() and updateApplicationStateInternal() methods
- When application-attempt state is added or updated, via the storeApplicationAttemptStateInternal() and updateApplicationAttemptStateInternal() methods
Where is the application state stored?
The comment below, from the ZKRMStateStore class, summarizes the znode layout:
/**
*
* ROOT_DIR_PATH
* |--- VERSION_INFO
* |--- EPOCH_NODE
* |--- RM_ZK_FENCING_LOCK
* |--- RM_APP_ROOT
* | |----- (#ApplicationId1)
* | | |----- (#ApplicationAttemptIds)
* | |
* | |----- (#ApplicationId2)
* | | |----- (#ApplicationAttemptIds)
* | ....
* |
* |--- RM_DT_SECRET_MANAGER_ROOT
* |----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
* |----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
* | |----- Token_1
* | |----- Token_2
* | ....
* |
* |----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
* | |----- Key_1
* | |----- Key_2
* ....
* |--- AMRMTOKEN_SECRET_MANAGER_ROOT
* |----- currentMasterKey
* |----- nextMasterKey
*
*/
This involves the yarn.resourcemanager.zk-state-store.parent-path setting in yarn-site.xml. The relevant code is:
zkRootNodePath = getNodePath(znodeWorkingPath, ROOT_ZNODE_NAME);
rmAppRoot = getNodePath(zkRootNodePath, RM_APP_ROOT);
In production, znodeWorkingPath is /bi-rmstore-20200425; ROOT_ZNODE_NAME is fixed to ZKRMStateRoot, and RM_APP_ROOT is fixed to RMAppRoot.
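For reference, the production parent path above could be expressed in yarn-site.xml roughly as follows (a sketch; only the path value is taken from this cluster):

```xml
<property>
  <name>yarn.resourcemanager.zk-state-store.parent-path</name>
  <value>/bi-rmstore-20200425</value>
</property>
```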
The application state actually stored on ZooKeeper therefore lives under:
/bi-rmstore-20200425/ZKRMStateRoot/RMAppRoot
|----- (#ApplicationId1)
|      |----- (#ApplicationAttemptIds)
|
|----- (#ApplicationId2)
|      |----- (#ApplicationAttemptIds)
....
How is the application state on ZooKeeper updated?
How often is it updated?
With YARN HA enabled, the retry mechanism works as follows. It is controlled by yarn.resourcemanager.zk-timeout-ms (the ZK session timeout, 60000 ms i.e. 1 minute in production) and yarn.resourcemanager.zk-num-retries (the number of retries after a failed operation, originally 1000 in production), with the formula:
retry interval (yarn.resourcemanager.zk-retry-interval-ms) = yarn.resourcemanager.zk-timeout-ms (ZK session timeout) / yarn.resourcemanager.zk-num-retries (retry count)
In the original production setup the retry interval was therefore 60000 ms / 1000 = 60 ms, i.e. a failed operation would be retried 1000 times at 60 ms intervals. yarn.resourcemanager.zk-num-retries has since been lowered to 100 in production, so a failed ZK write is now retried roughly every 600 ms (about 0.6 s).
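The arithmetic can be checked with a quick shell calculation (values taken from the production config described above):

```shell
# Effective ZK retry interval = session timeout / retry count.
zk_timeout_ms=60000     # yarn.resourcemanager.zk-timeout-ms
zk_num_retries=100      # yarn.resourcemanager.zk-num-retries (lowered from 1000)
retry_interval_ms=$(( zk_timeout_ms / zk_num_retries ))
echo "$retry_interval_ms"
```

With the original value of 1000 retries, the same formula yields 60 ms.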
Is there a cap on stored applications?
The maximum number of completed applications kept in the ZK state store (yarn.resourcemanager.state-store.max-completed-applications) matches the maximum kept in memory (yarn.resourcemanager.max-completed-applications); both default to 10000 (which was also the production default), and the number of applications kept in the ZK state store may not exceed the number kept in memory.
Both parameters have now been raised to 25000 in production.
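A sketch of how those two settings would look in yarn-site.xml (the 25000 value is the production tuning described above):

```xml
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>25000</value>
</property>
<property>
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>25000</value>
</property>
```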
Error analysis
Error log when submitting jobs
2021-08-26 15:03:04 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager:258 - Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 25000, removing app application_1606379848739_775651 from memory:
2021-08-26 15:03:04 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:1179 - org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting!
2021-08-26 15:03:04 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:857 - State store operation failed
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
Errors while removing application znodes during state recovery after an RM restart
at java.lang.Thread.run(Thread.java:745)
2021-08-26 15:03:04 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter:135 - timeline thread interrupted.
2021-08-26 15:03:04 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore:1264 - Maxed out ZK retries. Giving up!
2021-08-26 15:03:04 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:191 - Error removing app: application_1606379848739_775514
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:1064)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:1061)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1203)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:1067)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:812)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:189)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:175)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:844)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:904)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:899)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2021-08-26 15:03:04 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager:258 - Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 25000, removing app application_1606379848739_775637 from memory:
2021-08-26 15:03:04 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:857 - State store operation failed
zkdoctor raised an alert for an excessive watcher count
The configured maximum is 25000 applications; with both RMs watching the znodes, that comes to roughly 50000 watchers. Presumably some znodes beyond the limit had not yet been removed by the RM, or their removal had failed.
Cause analysis
- Clearly, the number of completed applications had exceeded the maximum number of completed apps that can be kept in memory
- The state already held in memory and on ZK was too large; raising the limits further would increase ZK latency and thus the likelihood of job failures
- Application and application-attempt znodes are persistent nodes, and the RM places a watch on every one of them
- If never removed, the large number of watches and the application data inflate ZK's memory usage, indirectly affecting every cluster activity that involves ZK
Solution
- Promptly delete the stale application state from ZooKeeper
- Then restart the RM to relieve the load on both ZK and the RM
How do we delete such a large number of application znodes from ZooKeeper?
Deleting them one by one through the ZK API seemed too slow. A search turned up the following approach for bulk-deleting child znodes under a node:
1. Export the names of all child znodes under /bi-rmstore-20200425/ZKRMStateRoot/RMAppRoot.
2. Generate a multi-line file in which each line is an rmr command for one znode, e.g.:
rmr /bi-rmstore-20200425/ZKRMStateRoot/RMAppRoot/application_1628183929631_170947
3. Use a Linux pipe to feed the commands into the ZooKeeper client:
cat waitForDelete_application_1628183929631_170947.txt | /usr/hdp/current/zookeeper-client/bin/zkCli.sh
Replace /usr/hdp/current/zookeeper-client/bin/zkCli.sh with the actual path to your zkCli.sh.
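The file-generation step (step 2) can be sketched in shell as follows. The two application IDs below are made-up placeholders; only the RMAppRoot path is from this cluster, and in reality the list comes from exporting the children of that znode:

```shell
# Parent znode whose children are to be deleted (from this cluster's config).
PARENT=/bi-rmstore-20200425/ZKRMStateRoot/RMAppRoot

# Step 1 (simulated here): the exported child-znode names, one per line.
cat > apps.txt <<'EOF'
application_1628183929631_170947
application_1628183929631_170948
EOF

# Step 2: prefix every name with "rmr <parent>/" to build the delete script.
sed "s|^|rmr $PARENT/|" apps.txt > waitForDelete.txt
cat waitForDelete.txt

# Step 3 would then be:
#   cat waitForDelete.txt | /usr/hdp/current/zookeeper-client/bin/zkCli.sh
```

Feeding all the rmr commands through a single zkCli session avoids paying the client startup and session-establishment cost once per znode.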
The script was handed to the ZK ops team to run the deletion: 22638 application records in total (more counting the attempt znodes under them). The deletion reportedly took about 3 minutes.
Restarting the standby RM
Since the active RM logged no notable errors after the deletion, it was not restarted. On the standby RM, however, the removal of the znodes it was watching triggered errors for the affected nodes, roughly as follows.
Restart log
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bi-rmstore-20200425/ZKRMStateRoot/RMAppRoot/application_1628183929631_170795
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1184)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$6.run(ZKRMStateStore.java:1103)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$6.run(ZKRMStateStore.java:1100)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1203)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1236)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getDataWithRetries(ZKRMStateStore.java:1105)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:576)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:461)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:581)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 13 more
2021-08-27 14:30:26 INFO org.apache.hadoop.ha.ActiveStandbyElector:670 - Trying to re-establish ZK session
2021-08-27 14:30:26 INFO org.apache.zookeeper.ZooKeeper:684 - Session: 0x271b1d0e5900a93 closed
2021-08-27 14:30:27 INFO org.apache.zookeeper.ZooKeeper:438 - Initiating client connection, connectString=xxxxxxxxxxxxxxx sessionTimeout=60000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@60c96f97
2021-08-27 14:30:27 INFO org.apache.zookeeper.ClientCnxn:975 - Opening socket connection to server 10.212.5.114/10.212.5.114:5181. Will not attempt to authenticate using SASL (unknown error)
2021-08-27 14:30:27 INFO org.apache.zookeeper.ClientCnxn:852 - Socket connection established to xxxxxxxxxxxxxxx , initiating session
2021-08-27 14:30:27 INFO org.apache.zookeeper.ClientCnxn:1235 - Session establishment complete on server xxxxxxxxxxxxxxx, sessionid = 0x17b8499ebbd000b, negotiated timeout = 40000
2021-08-27 14:30:27 INFO org.apache.zookeeper.ClientCnxn:512 - EventThread shut down
2021-08-27 14:30:27 INFO org.apache.hadoop.ha.ActiveStandbyElector:547 - Session connected.
2021-08-27 14:39:46 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:407 - Release request cache is cleaned up
2021-08-27 14:39:55 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:407 - Release request cache is cleaned up
2021-08-27 14:40:09 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:407 - Release request cache is cleaned up
2021-08-27 14:40:15 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:407 - Release request cache is cleaned up
2021-08-27 14:40:18 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:407 - Release request cache is cleaned up
2021-08-27 14:40:26 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:407 - Release request cache is cleaned up
2021-08-27 14:45:05 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:60 - RECEIVED SIGNAL 15: SIGTERM
2021-08-27 14:45:05 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:659 - ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2021-08-27 14:45:05 INFO org.mortbay.log:67 - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@namenode01.bi.10101111.com:8080
2021-08-27 14:45:05 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:659 - ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2021-08-27 14:45:05 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:659 - ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2021-08-27 14:45:05 INFO org.apache.hadoop.ipc.Server:2438 - Stopping server on 8033
2021-08-27 14:45:05 INFO org.apache.hadoop.ha.ActiveStandbyElector:354 - Yielding from election
2021-08-27 14:45:05 INFO org.apache.hadoop.ipc.Server:707 - Stopping IPC Server listener on 8033
2021-08-27 14:45:05 INFO org.apache.hadoop.ipc.Server:833 - Stopping IPC Server Responder
2021-08-27 14:45:05 INFO org.apache.zookeeper.ZooKeeper:684 - Session: 0x17b8499ebbd000b closed
2021-08-27 14:45:05 INFO org.apache.zookeeper.ClientCnxn:512 - EventThread shut down
2021-08-27 14:45:05 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:1065 - Already in standby state
2021-08-27 14:45:05 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:659 - SHUTDOWN_MSG:
After restarting the standby RM, it essentially stopped logging errors about storing application state to ZooKeeper.
RM errors before the znode deletion
Active RM errors after the znode deletion
After the ZK state was deleted, errors like the following were no longer observed:
ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:857 - State store operation failed
Incidentally, an odd RM error also turned up, presumably caused by high job concurrency; it appears to be a Hadoop bug and was ignored for now. With that, the problem was resolved.
2021-08-27 15:38:29 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:216 - Released container container_e137_1630045826451_0135_01_000486 of capacity <memory:35840, vCores:2> on host datanode95.bi:45454, which currently has 1 containers, <memory:35840, vCores:2> used and <memory:138240, vCores:30> available, release resources=true
2021-08-27 15:38:29 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:813 - Application attempt appattempt_1630045826451_0135_000001 released container container_e137_1630045826451_0135_01_000486 on node: host: datanode95.bi:45454 #containers=1 available=138240 used=35840 with event: RELEASED
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:30 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:31 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:784 - Null container completed...
2021-08-27 15:38:31 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler:98 - INTERNAL_SERVER_ERROR
java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerQueueInfo.<init>(FairSchedulerQueueInfo.java:94)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.FairSchedulerInfo.<init>(FairSchedulerInfo.java:47)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getSchedulerInfo(RMWebServices.java:224)
at sun.reflect.GeneratedMethodAccessor237.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:142)
at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:574)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:269)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:544)
at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:84)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1225)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
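The ConcurrentModificationException above comes from FairSchedulerQueueInfo iterating over the scheduler's queue list while the scheduler mutates it concurrently. The failure mode itself is easy to reproduce outside the RM: Java's ArrayList iterator is fail-fast, so any structural change during iteration throws. A minimal standalone sketch (queue names are made up, this is not RM code):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Returns true if mutating the list while iterating over it
    // raises ConcurrentModificationException.
    static boolean mutateWhileIterating() {
        List<String> queues = new ArrayList<>();
        queues.add("root.queueA");
        queues.add("root.queueB");
        try {
            for (String q : queues) {          // iterator records modCount here
                queues.add("root.queueC");     // structural change invalidates it
            }
        } catch (ConcurrentModificationException e) {
            return true;                       // same exception the RM web service logged
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(mutateWhileIterating());
    }
}
```

In the RM the writer is the scheduler thread and the reader is the web-services thread, so the window is timing-dependent; under heavy application churn (many containers completing at once, as in the log above) the race fires more often.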
Problem summary:
- ZooKeeper is suited to small volumes of small records and to fast distributed coordination; it is not a good place to accumulate large amounts of application state.
- When a bottleneck is hit, beyond analyzing its cause and tuning the server- and client-side parameters that avoid it, you also need a one-time action that relieves the pressure, not just measures that prevent recurrence. It is like a full disk: besides curbing oversized reads and writes, you still have to free up space.
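One concrete knob for the first point: YARN can cap how many completed applications the RM retains, and separately how many of those it persists to the state store, which bounds the number of znodes accumulating under the RM state root. A sketch of the relevant yarn-site.xml properties, with placeholder values to tune for your cluster:

```xml
<!-- yarn-site.xml: limit completed applications kept by the RM -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>1000</value>
</property>
<!-- and, separately, how many of those are written to the state store (ZK) -->
<property>
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>1000</value>
</property>
```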