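Problem description:
Two YARN clusters, both configured for HA, share the same ZooKeeper ensemble; their cluster names differ. After a few days of running, we noticed that YARN switched its master node at a fixed time every day, and the logs contained the following error: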
2020-01-12 18:41:26,831 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error While Removing RMDTMasterKey.
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
at org.apache.zookeeper.KeeperException.create(KeeperException.java:116)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1015)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:919)
at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
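Analysis shows the cause: by default, YARN HA stores its state in ZooKeeper under the path /rmstore. When two clusters share one ZooKeeper ensemble, both of them write to that same path. The master key expires after 86400 seconds (one day) by default, so after each daily master key rollover the two clusters overwrite the same key. The key becomes invalid, and YARN goes through automatic recovery and restarts.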
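To make the collision concrete, here is a sketch of the defaults that both clusters effectively run with when nothing is overridden. The /rmstore value comes from the analysis above. The name yarn.resourcemanager.delegation.key.update-interval and its millisecond unit are my reading of yarn-default.xml, so check them against your Hadoop version.

```xml
<!-- Effective defaults on BOTH clusters when yarn-site.xml does not override them -->
<property>
  <!-- ZooKeeper path for the RM state store; identical on both clusters -->
  <name>yarn.resourcemanager.zk-state-store.parent-path</name>
  <value>/rmstore</value>
</property>
<property>
  <!-- Assumed property name: master key rollover interval in milliseconds (86400 s = 1 day) -->
  <name>yarn.resourcemanager.delegation.key.update-interval</name>
  <value>86400000</value>
</property>
```

With identical paths and a daily rollover, each cluster touches the other's master key znodes once a day, which matches the timing of the failovers.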
Solution:
Configure a separate ZooKeeper store directory for each cluster:
<property>
<name>yarn.resourcemanager.zk-state-store.parent-path</name>
<value>/yarncluster1</value>
</property>
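The second cluster needs its own value as well. A minimal sketch of its yarn-site.xml entry might look like this; /yarncluster2 is a hypothetical name chosen for illustration, and any path works as long as it differs from the first cluster's.

```xml
<!-- yarn-site.xml on the second cluster; /yarncluster2 is a hypothetical path -->
<property>
  <name>yarn.resourcemanager.zk-state-store.parent-path</name>
  <value>/yarncluster2</value>
</property>
```

The setting should be the same on every ResourceManager within a cluster, and the ResourceManagers most likely need a restart before the new path takes effect.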