NameNode Optimization Notes [RPC & FBR & Monitoring]

1、Background

We have seen many incidents of overloaded HDFS NameNodes caused by 1) misconfigurations or 2) “bad” MR jobs or Hive queries that generate a large number of RPC requests in a short period of time. Quite a few features have been introduced in HDP 2.3/2.4 to protect the HDFS NameNode. This article summarizes the deployment steps for these features, together with an incomplete list of known issues and possible solutions.

2、Optimization Checklist

  • Enable Async Audit Logging (configuration covered in this article)
  • Dedicated Service RPC Port (configuration covered in this article)
  • Dedicated Lifeline RPC Port for HA (configuration covered in this article)
  • Enable FairCallQueue on Client RPC Port (configuration covered in this article)
  • Enable RPC Client Backoff on Client RPC Port
  • Enable RPC Caller Context to track the “bad” jobs
  • Enable Response time based backoff with DecayedRpcScheduler
  • Check JMX for NameNode client RPC call queue length and average queue time
  • Check JMX for NameNode DecayRpcScheduler when FCQ is enabled
  • NNtop (HDFS-6982)
  • Configuration tuning for slow deletion of large directories (configuration covered in this article)
  • Backport patches to improve FBR after NameNode restart

3、Enable Async Audit Logging

Enable async audit logging by setting dfs.namenode.audit.log.async to true in hdfs-site.xml. This can minimize the impact of audit log I/O on NameNode performance.

<property>  
  <name>dfs.namenode.audit.log.async</name>  
  <value>true</value>
</property>

4、Dedicated Service RPC Port

Configuring a separate service RPC port can improve the responsiveness of the NameNode by allowing DataNode and client requests to be processed via separate RPC queues. DataNodes and all other internal services should connect to the new service RPC address, while clients continue to connect to the well-known address specified by dfs.namenode.rpc-address.

Adding a service RPC port to an HA cluster with automatic failover via ZKFCs (with or without Kerberos) requires the following additional steps:

1. Add the following settings to hdfs-site.xml.

<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8040</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8040</value>
</property>

2. If the cluster is not Kerberos enabled, skip this step.

If the cluster is Kerberos enabled, create two new hdfs_jaas.conf files for nn1 and nn2 and copy them to /etc/hadoop/conf/hdfs_jaas.conf on the respective hosts.

nn1:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/nn.service.keytab"
  principal="nn/c6401.ambari.apache.org@EXAMPLE.COM";
};

nn2:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/nn.service.keytab"
  principal="nn/c6402.ambari.apache.org@EXAMPLE.COM";
};

Add the following to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-Dzookeeper.sasl.client=true \
-Dzookeeper.sasl.client.username=zookeeper \
-Djava.security.auth.login.config=/etc/hadoop/conf/hdfs_jaas.conf \
-Dzookeeper.sasl.clientconfig=Client ${HADOOP_NAMENODE_OPTS}"

3. Restart NameNodes

4. Restart DataNodes

Restart DataNodes so that they connect to the new NameNode service RPC port instead of the NameNode client RPC port.

5. Stop the ZKFC

Stop the ZKFC processes on both NameNodes.

6. -formatZK

Run the following command to reset the ZKFC state in ZooKeeper:

hdfs zkfc -formatZK

Known issues:

  • 1. Without step 6, you will see the following exception after ZKFC restart:

java.lang.RuntimeException: Mismatched address stored in ZK for NameNode
  • 2. Without step 2 in a Kerberos-enabled HA cluster, you will see the following exception when running step 6:

16/03/23 03:30:53 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/hdp64ha from ZK...
16/03/23 03:30:53 ERROR ha.ZKFailoverController: Unable to clear zk parent znode
java.io.IOException: Couldn't clear parent znode /hadoop-ha/hdp64ha
    at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:380)
    at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:267)
    at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:212)
    at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:442)
    at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
    at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:183)

Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /hadoop-ha/hdp64ha
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:54)
    at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:375)
    at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:372)
    at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1041)
    at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:372)
    ... 11 more

5、Dedicated Lifeline RPC Port for HA

HDFS-9311 allows using a separate RPC address to isolate health checks and liveness reporting from the client RPC port, which could be exhausted by “bad” jobs. Here is an example of configuring this feature in an HA cluster.

<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8050</value>
</property>

<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8050</value>
</property>

To summarize, the parameters for the RPC port split (patch) are as follows (cluster name "gaofeng" in this example):

dfs.namenode.servicerpc-address.gaofeng.nn1=gaofeng-nn-01:8022
dfs.namenode.servicerpc-address.gaofeng.nn2=gaofeng-nn-02:8022
dfs.namenode.lifeline.rpc-address.gaofeng.nn1=gaofeng-nn-01:8023
dfs.namenode.lifeline.rpc-address.gaofeng.nn2=gaofeng-nn-02:8023
dfs.namenode.service.handler.count=50
dfs.namenode.lifeline.handler.count=50

After applying the configuration above, restart the affected components and then run -formatZK.

However, this RPC split has a known bug in Hadoop 3: a sendLifeline NPE.
The NameNode hits an NPE while processing lifeline messages sent by DataNodes, which corrupts the maxLoad value computed by the NN.
Because DataNodes are then flagged as busy during DataNode selection and no available node can be allocated, the resulting retry loop drives CPU usage high and degrades cluster throughput.
Fix: apply HDFS-15556.

The duplicate issue is HDFS-14042.

6、Enable FairCallQueue on Client RPC Port

Further reading on FairCallQueue (a minimal configuration sketch follows the list):

《聊聊RPC的拥塞控制》 (On RPC congestion control)
《RPC Congestion Control with FairCallQueue》
《FairCallQueue.html官方文档》 (the official FairCallQueue documentation)
《FairCallQueue滴滴技术文摘》 (DiDi tech digest on FairCallQueue)
《Quality of Service in Hadoop性能测试图》 (Quality of Service in Hadoop benchmark charts)
《华为FairCallQueue配置说明》 (Huawei FairCallQueue configuration notes)
《唯品会 HDFS 性能挑战和优化实践》 (Vipshop: HDFS performance challenges and optimization practices)
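
As a minimal sketch (not a definitive recipe), FairCallQueue can be switched on for the client RPC port in core-site.xml. Port 8020 is assumed here to match the examples above; the optional priority-level property is shown with its documented default:

<!-- core-site.xml: minimal sketch, assuming the NameNode client RPC port is 8020 -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>

<!-- Optional: number of priority levels used by the call queue scheduler (default 4) -->
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>4</value>
</property>

A NameNode restart (or, on recent versions, hdfs dfsadmin -refreshCallQueue) is typically needed for the new call queue implementation to take effect.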

7、Enable RPC Client Backoff on Client RPC port

Enable client backoff so that, when the NameNode is overloaded, clients are told to retry later instead of having their calls pile up in the queue.
TODO…
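
A minimal sketch of what enabling backoff could look like, again assuming the client RPC port is 8020; the property follows the ipc.[port].backoff.enable naming convention described in the FairCallQueue documentation:

<!-- core-site.xml: minimal sketch, assuming the NameNode client RPC port is 8020 -->
<!-- When the call queue is full, respond with a retriable exception so clients back off
     instead of piling up inside the NameNode -->
<property>
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>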

8、Enable RPC Caller Context to track the “bad” jobs

TODO…
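
A minimal sketch: hadoop.caller.context.enabled in core-site.xml turns on caller-context propagation so that the originating job or query can be recorded alongside each RPC in the NameNode audit log:

<!-- core-site.xml: minimal sketch for RPC caller context -->
<!-- When enabled, callers (e.g. MapReduce / Hive jobs) can attach a context string that is
     written into hdfs-audit.log, making it possible to trace an RPC spike back to the offending job -->
<property>
  <name>hadoop.caller.context.enabled</name>
  <value>true</value>
</property>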

9、Enable Response time based backoff with DecayedRpcScheduler

TODO…
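
A minimal sketch, assuming port 8020 again; these properties follow the ipc.[port].decay-scheduler.* naming from the FairCallQueue documentation, and the threshold values are illustrative only:

<!-- core-site.xml: minimal sketch, assuming the NameNode client RPC port is 8020 -->
<property>
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>
<property>
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>
<!-- Back off callers when average response times exceed the per-priority-level thresholds -->
<property>
  <name>ipc.8020.decay-scheduler.backoff.responsetime.enable</name>
  <value>true</value>
</property>
<!-- Illustrative thresholds, one per priority level -->
<property>
  <name>ipc.8020.decay-scheduler.backoff.responsetime.thresholds</name>
  <value>10s,20s,30s,40s</value>
</property>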

10、HDFS JMX & Health Checks

JMX

The HDFS monitoring commands I often use in production are summarized below:

【NN audit cmd count】

Count the audit log entries per minute in hdfs-audit.log:
cat /var/log/hadoop/ocdp/hdfs-audit.log | awk '{print $2}' | awk -F ':' '{print $1":"$2}' | sort | uniq -c

Quick health check from the shell:

hdfs dfsadmin -report | head

Estimate the block count (assuming 3x replication):

hdfs dfsadmin -report | grep 'Num of Blocks' | awk -F ':' '{print $2}' | awk '{sum += $1}; END {print sum/3}'
(this is an approximate value)

curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem | grep -i "blocktotal"
(this is the value shown on the 50070 web UI)

Check PendingDeletionBlocks

  • (via Ambari)
curl -u admin:admin -X GET http://192.168.1.1:8080/api/v1/clusters/testqjcluster/hosts/host-192-168-1-1/host_components/NAMENODE?fields=metrics/rpc
PendingDeletionBlocks
  • (via the NameNode JMX endpoint)
curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem | grep -i "PendingDeletionBlocks"

Check RPC metrics

  • (via Ambari)
curl -u admin:admin -X GET http://192.168.1.1:8080/api/v1/clusters/cluster1/hosts/host-192-168-1-1/host_components/NAMENODE?fields=metrics/rpc
  • (via the NameNode JMX endpoint)
    • Based on the port layout configured above, monitor the following:

    • (client RPC = client ↔ NN traffic)
      curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020

    • (service RPC = DataNode and other internal service ↔ NN traffic)
      curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8040

    • (lifeline RPC = DN ↔ NN health/lifeline traffic)
      curl --silent http://192.168.1.1:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8050

11、Delete Optimization

Reference: Vipshop (唯品会) – Lin Yiqun (林意群)

HDFS-13831 patch

Apply the HDFS-13831 patch and lower dfs.namenode.block.deletion.increment (default 1000) to 100.

FoldedTreeSet defragmentation threshold

dfs.namenode.storageinfo.defragment.ratio=0.75 -> 0.9
ipc.8020.callqueue.impl=org.apache.hadoop.ipc.FairCallQueue

Following the committer's advice, raise the defragmentation threshold of FoldedTreeSet (the data structure Hadoop 3 uses to store block info).

ipc.server.read.threadpool.size

Number of Reader threads, raised from the default 1 to 100.

dfs.namenode.service.handler.count

Number of Handler threads, raised from the default 10 to 361.

ipc.server.handler.queue.size

Maximum call queue length per Handler, raised from the default 100 to 1000. A consolidated configuration sketch of these settings follows.
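
Putting the values above together, a rough sketch of where these settings live (dfs.* in hdfs-site.xml, ipc.* in core-site.xml); the numbers are the ones quoted above and should be tuned to your own cluster:

<!-- hdfs-site.xml (values as quoted above; tune per cluster) -->
<property>
  <name>dfs.namenode.block.deletion.increment</name>
  <value>100</value>
</property>
<property>
  <name>dfs.namenode.storageinfo.defragment.ratio</name>
  <value>0.9</value>
</property>
<property>
  <name>dfs.namenode.service.handler.count</name>
  <value>361</value>
</property>

<!-- core-site.xml -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.server.read.threadpool.size</name>
  <value>100</value>
</property>
<property>
  <name>ipc.server.handler.queue.size</name>
  <value>1000</value>
</property>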

12、Backport patches to improve FBR after NameNode restart

On a production cluster of about 1,400 nodes, the full block reports (FBR) for 200+ million blocks used to complete within an hour of a NameNode restart. After upgrading from Hadoop 2.7 to 3.1, block reporting took around 4 hours, which seriously affected the production environment.
Backporting the following patches resolves the problem:
HDFS-14366
HDFS-14859
HDFS-14632
HDFS-14171
