【Background】
Next week the big-data open platform's servers move to a new machine room. The platform has 90 physical machines, 24 of which were added during a later capacity expansion; their IP segment is 19.126.66.*, shared with another cluster. Per the machine-room deployment plan, the migration is carried out in batches by network segment, so the IPs of these 24 servers must be changed before the move.
The IP change goes live this Thursday, so today we validated the procedure in the test environment by changing the IP of one compute node. Source IP: 146.32.19.25; target IP: 146.32.18.100.
【0. Stop the node's roles in CM】
【1. Modify the IP configuration file】
Move the old IP configuration file to the /tmp directory:
d0305001:/etc/sysconfig/network # cat ifcfg-vlan119
BOOTPROTO='static'
BROADCAST=''
ETHERDEVICE='bond0'
ETHTOOL_OPTIONS=''
IPADDR='146.32.19.25/24'
MTU=''
NAME=''
NETWORK=''
REMOTE_IPADDR=''
STARTMODE='auto'
USERCONTROL='no'
VLAN_ID='119'
d0305001:/etc/sysconfig/network # mv ifcfg-vlan119 /tmp/
Create the new IP configuration file ifcfg-vlan118:
d0305001:/etc/sysconfig/network # cat ifcfg-vlan118
BOOTPROTO='static'
BROADCAST=''
ETHERDEVICE='bond0'
ETHTOOL_OPTIONS=''
IPADDR='146.32.18.100/24'
MTU=''
NAME=''
NETWORK=''
REMOTE_IPADDR=''
STARTMODE='auto'
USERCONTROL='no'
VLAN_ID='118'
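The new file differs from the old one only in the IP address and the VLAN ID, so it can be derived mechanically. A minimal sketch — in production the input is /tmp/ifcfg-vlan119 and the output is /etc/sysconfig/network/ifcfg-vlan118; temp files are used here so the sketch is self-contained:

```shell
# Derive the new VLAN config from the old one by rewriting the IP
# address and the VLAN ID; all other keys are carried over unchanged.
old=$(mktemp); new=$(mktemp)
printf "IPADDR='146.32.19.25/24'\nVLAN_ID='119'\n" > "$old"
sed -e "s/146\.32\.19\.25/146.32.18.100/" \
    -e "s/VLAN_ID='119'/VLAN_ID='118'/" "$old" > "$new"
result=$(cat "$new")
echo "$result"
rm -f "$old" "$new"
```

Scripting the rewrite this way keeps the per-host diff down to exactly two values, which matters when the same change has to be repeated on 24 servers.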
【2. Modify the route configuration file】
d0305001:/etc/sysconfig/network # vi routes
default 146.32.19.254 - -
Change it to:
default 146.32.18.254 - -
【3. Restart the network service】
service network restart
d0305001:/etc/sysconfig/network # ip a|grep global
inet 146.32.18.100/24 brd 146.32.18.255 scope global vlan118
inet 146.33.18.100/24 brd 146.33.18.255 scope global vlan218
【4. Update the NTP configuration】
In /etc/ntp.conf, replace the old gateway address 146.32.19.254 with the new segment's gateway 146.32.18.254, then restart the NTP service:
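The edit itself can be scripted the same way as the other substitutions. A hedged sketch against a temp copy (the real target is /etc/ntp.conf, and the `server` line format is an assumption):

```shell
# Rewrite the NTP server entry from the old gateway to the new one.
# Demonstrated on a temp file; point this at /etc/ntp.conf in production.
conf=$(mktemp)
printf 'server 146.32.19.254\n' > "$conf"
sed -i 's/146\.32\.19\.254/146.32.18.254/' "$conf"
ntp_line=$(cat "$conf")
echo "$ntp_line"
rm -f "$conf"
```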
d0305001:/etc/sysconfig/network # service ntp restart
Shutting down network time protocol daemon (NTPD) done
Starting network time protocol daemon (NTPD) done
【5. Update the /etc/hosts file on every cluster node and every client】
cp /etc/hosts /etc/hosts.0107
sed -i 's/19.25/18.100/g' /etc/hosts
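One caveat with the pattern above: in `19.25` the dots are regex wildcards and the match is unanchored, so it would also rewrite addresses such as 146.32.19.250 or the gateway 146.32.19.254 if they appear in /etc/hosts. A safer variant, demonstrated on a temp file with assumed sample hosts content (the hostname `other-host` is hypothetical):

```shell
# Escape the dots and anchor on word boundaries (GNU sed \b) so only
# the exact address 146.32.19.25 is rewritten.
hosts=$(mktemp)
printf '146.32.19.25 d0305001\n146.32.19.250 other-host\n' > "$hosts"
sed -i 's/\b146\.32\.19\.25\b/146.32.18.100/g' "$hosts"
hosts_out=$(cat "$hosts")
echo "$hosts_out"
rm -f "$hosts"
```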
【6. Restart the node's agent service】
service cloudera-scm-agent restart
Then start the node's roles in CM.
【7. Verification】
After the node's roles came up, they reported losing the connection to the NameNode.
Checking the log: Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(146.32.18.100……
d0305001:/var/log/hadoop-hdfs # tail -100 hadoop-cmf-hdfs-DATANODE-d0305001.log.out
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
2019-01-07 10:50:21,141 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1060838331-146.249.31.13-1489136106065 (Datanode Uuid 5538360a-f138-42f2-b219-2b4993c6de2a) service to d0305004/146.32.19.28:8022 beginning handshake with NN
2019-01-07 10:50:21,143 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-1060838331-146.249.31.13-1489136106065 (Datanode Uuid 5538360a-f138-42f2-b219-2b4993c6de2a) service to d0305004/146.32.19.28:8022 Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(146.32.18.100, datanodeUuid=5538360a-f138-42f2-b219-2b4993c6de2a, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=cluster14;nsid=314642609;c=0)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:915)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:5143)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1162)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:100)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:29184)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
2019-01-07 10:50:21,151 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1060838331-146.249.31.13-1489136106065 (Datanode Uuid 5538360a-f138-42f2-b219-2b4993c6de2a) service to d0305005/146.32.19.29:8022 beginning handshake with NN
2019-01-07 10:50:21,152 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-1060838331-146.249.31.13-1489136106065 (Datanode Uuid 5538360a-f138-42f2-b219-2b4993c6de2a) service to d0305005/146.32.19.29:8022 Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(146.32.18.100, datanodeUuid=5538360a-f138-42f2-b219-2b4993c6de2a, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=cluster14;nsid=314642609;c=0)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:915)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:5143)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1162)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:100)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:29184)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
Root cause: in most cases this error comes down to a host IP problem. The message above shows the active NameNode refusing the DataNode's connection because the new IP is not in its include list.
Log in to the NameNode host:
d0305004:~ # find / -name *allow.txt
find: `/proc/29004': No such file or directory
/var/run/cloudera-scm-agent/process/6938-yarn-RESOURCEMANAGER-refresh/nodes_allow.txt
/var/run/cloudera-scm-agent/process/6930-namenodes-failover/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6929-hdfs-NAMENODE-safemode-wait/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6927-hdfs-NAMENODE-nnRpcWait/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6926-hdfs-NAMENODE/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6924-hdfs-NAMENODE-jnSyncWait/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6920-hdfs-NAMENODE-jnSyncWait/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6916-hdfs-NAMENODE-jnSyncWait/dfs_hosts_allow.txt
/var/run/cloudera-scm-agent/process/6416-yarn-RESOURCEMANAGER/nodes_allow.txt
/var/run/cloudera-scm-agent/process/6368-hdfs-NAMENODE/dfs_hosts_allow.txt
d0305004:~ # cat /var/run/cloudera-scm-agent/process/6368-hdfs-NAMENODE/dfs_hosts_allow.txt
146.33.19.13
146.32.19.14
146.32.19.15
146.32.19.16
146.32.19.17
146.32.19.18
146.32.19.20
146.32.19.22
146.32.19.23
146.32.19.24
146.32.19.25
146.32.19.26
146.32.19.27
146.32.19.28
146.32.19.30
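This generated file is the one the NameNode reads through the `dfs.hosts` property; the wiring in the generated hdfs-site.xml looks roughly like this (illustrative excerpt — the process directory number changes on every restart):

```xml
<!-- Excerpt from the NameNode's generated hdfs-site.xml (illustrative). -->
<property>
  <name>dfs.hosts</name>
  <value>/var/run/cloudera-scm-agent/process/6368-hdfs-NAMENODE/dfs_hosts_allow.txt</value>
</property>
```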
Since the error shows the active NameNode refusing the connection, manually refresh the host list on the NameNode:
hadoop dfsadmin -fs hdfs://146.32.19.28:8020 -refreshNodes    # 146.32.19.28 is the active NameNode's IP
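Because this is an HA setup with two NameNodes (d0305004/146.32.19.28 and d0305005/146.32.19.29, per the DataNode log above) and each maintains its own include list, it may be safer to refresh both. A hedged sketch with a dry-run guard, so the commands can be reviewed before running against a live cluster:

```shell
# Refresh the include list on both NameNodes of the HA pair.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}
refresh_cmds=""
for nn in 146.32.19.28 146.32.19.29; do
    cmd="hadoop dfsadmin -fs hdfs://${nn}:8020 -refreshNodes"
    refresh_cmds="${refresh_cmds}${cmd}
"
    if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else $cmd; fi
done
```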
After the refresh, the newly added address is also visible on the active NameNode:
d0305004:~ # cd /var/run/cloudera-scm-agent/process/
d0305004:/var/run/cloudera-scm-agent/process # ls -ltr
total 0
drwxr-x--x 3 zookeeper zookeeper 280 Nov 1 15:58 6346-zookeeper-server
drwxr-x--x 3 hdfs hdfs 360 Nov 1 15:59 6353-hdfs-DATANODE
drwxr-x--x 3 hbase hbase 360 Nov 1 16:00 6372-hbase-MASTER
drwxr-x--x 3 yarn hadoop 440 Nov 1 16:00 6407-yarn-NODEMANAGER
drwxr-x--x 3 yarn hadoop 500 Nov 1 16:00 6416-yarn-RESOURCEMANAGER
drwxr-x--x 5 solr solr 280 Nov 1 16:00 6400-solr-SOLR_SERVER
drwxr-x--x 4 hive hive 340 Nov 1 16:00 6417-hive-HIVESERVER2
drwxr-x--x 4 hive hive 300 Nov 1 16:00 6418-hive-HIVEMETASTORE
drwxr-xr-x 4 root root 100 Nov 12 11:01 ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_6266238222486433408
drwxr-xr-x 4 root root 100 Nov 12 11:02 ccdeploy_hive-conf_etchiveconf.cloudera.hive_-1465732137655581486
drwxr-x--x 3 root root 140 Nov 14 12:11 6533-host-inspector
drwxr-x--x 4 root root 140 Nov 14 12:11 6511-collect-host-statistics
drwxr-x--x 3 root root 140 Nov 21 12:12 6581-host-inspector
drwxr-x--x 4 root root 140 Nov 21 12:12 6559-collect-host-statistics
drwxr-x--x 3 root root 140 Nov 28 12:13 6629-host-inspector
drwxr-x--x 4 root root 140 Nov 28 12:13 6607-collect-host-statistics
drwxr-x--x 3 root root 140 Dec 5 12:14 6677-host-inspector
drwxr-x--x 4 root root 140 Dec 5 12:14 6655-collect-host-statistics
drwxr-x--x 3 root root 140 Dec 12 12:15 6726-host-inspector
drwxr-x--x 4 root root 140 Dec 12 12:15 6704-collect-host-statistics
drwxr-x--x 3 root root 140 Dec 19 12:16 6774-host-inspector
drwxr-x--x 4 root root 140 Dec 19 12:16 6752-collect-host-statistics
drwxr-x--x 3 root root 140 Dec 26 12:17 6822-host-inspector
drwxr-x--x 4 root root 140 Dec 26 12:17 6800-collect-host-statistics
drwxr-x--x 3 root root 140 Jan 2 12:18 6870-host-inspector
drwxr-x--x 4 root root 140 Jan 2 12:18 6848-collect-host-statistics
drwxr-x--x 3 hdfs hdfs 340 Jan 7 10:54 6355-hdfs-JOURNALNODE
drwxr-x--x 3 hdfs hdfs 320 Jan 7 10:54 6917-hdfs-JOURNALNODE
drwxr-x--x 3 hdfs hdfs 500 Jan 7 10:54 6916-hdfs-NAMENODE-jnSyncWait
drwxr-x--x 3 hdfs hdfs 500 Jan 7 10:55 6920-hdfs-NAMENODE-jnSyncWait
drwxr-x--x 3 hdfs hdfs 500 Jan 7 10:55 6924-hdfs-NAMENODE-jnSyncWait
drwxr-x--x 3 hdfs hdfs 500 Jan 7 10:55 6368-hdfs-NAMENODE
drwxr-x--x 3 hdfs hdfs 480 Jan 7 10:55 6926-hdfs-NAMENODE
drwxr-x--x 3 hdfs hdfs 480 Jan 7 10:56 6927-hdfs-NAMENODE-nnRpcWait
drwxr-x--x 3 hdfs hdfs 380 Jan 7 10:56 6362-hdfs-FAILOVERCONTROLLER
drwxr-x--x 3 hdfs hdfs 360 Jan 7 10:56 6928-hdfs-FAILOVERCONTROLLER
drwxr-x--x 3 hdfs hdfs 480 Jan 7 10:56 6929-hdfs-NAMENODE-safemode-wait
drwxr-x--x 3 hdfs hdfs 480 Jan 7 10:57 6930-namenodes-failover
drwxr-xr-x 4 root root 100 Jan 7 10:57 ccdeploy_hadoop-conf_etchadoopconf.cloudera.hdfs_1239954674294922633
drwxr-xr-x 4 root root 120 Jan 7 10:57 ccdeploy_hadoop-conf_etchadoopconf.cloudera.hdfs_2490906708984413108
drwxr-x--x 3 yarn hadoop 500 Jan 7 11:00 6938-yarn-RESOURCEMANAGER-refresh
d0305004:/var/run/cloudera-scm-agent/process # cd 6926-hdfs-NAMENODE
d0305004:/var/run/cloudera-scm-agent/process/6926-hdfs-NAMENODE # ls
cloudera-monitor.properties dfs_hosts_exclude.txt http-auth-signature-secret ssl-server.xml
cloudera-stack-monitor.properties event-filter-rules.json log4j.properties supervisor.conf
cloudera_manager_agent_fencer.py hadoop-metrics2.properties logs topology.map
cloudera_manager_agent_fencer_secret_key.txt hadoop-policy.xml navigator.client.properties topology.py
core-site.xml hdfs-site.xml redaction-rules.json
dfs_hosts_allow.txt hdfs.keytab ssl-client.xml
d0305004:/var/run/cloudera-scm-agent/process/6926-hdfs-NAMENODE # cat dfs_hosts_allow.txt
146.33.19.13
146.32.19.14
146.32.19.15
146.32.19.16
146.32.19.17
146.32.19.18
146.32.19.20
146.32.19.22
146.32.19.23
146.32.19.24
146.32.18.100
146.32.19.26
146.32.19.27
146.32.19.28
146.32.19.30
d0305004:/var/run/cloudera-scm-agent/process/6926-hdfs-NAMENODE #
As the final step, run the cluster refresh action from the CM UI and the change is complete.
Of course, there is also a simpler option: rolling-restart the HDFS service, provided no workload is running.