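Preface
Running into errors and problems is nothing to fear. What matters is whether you can quickly track down where things actually went wrong and fix it.
Here is an example.
Start three servers: hadoop001, hadoop002, and hadoop003.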
[root@hadoop001 ~]# jps
1479 Jps
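Right after the server boots, jps lists nothing but itself, so no Java processes are running yet.
Now we want to start the ZooKeeper cluster.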
[hadoop@hadoop001 bin]$ ./zkServer.sh start
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... already running as process 1503.
[hadoop@hadoop001 bin]$ jps
1514 Jps
[hadoop@hadoop001 bin]$ ./zkServer.sh status
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
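The output above shows that ZooKeeper did not come up properly. So how do we locate the problem, and how do we fix it?
① Check the logs
(You can use tail -200f to view the last 200 lines of the log, or more, copy them into a local text editor, and go through them carefully. In production you may prefer to simply download the log and analyze it locally.)
Go to the directory /home/hadoop/app/zookeeper/bin (the directory involved will differ from case to case).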
[hadoop@hadoop001 bin]$ tail -200f zookeeper.out
The last 200 lines include:
2019-04-16 19:59:52,408 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@849] - Notification time out: 60000
2019-04-16 20:00:52,409 [myid:1] - WARN [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@382] - Cannot open channel to 2 at election address hadoop002/172.19.12.133:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2019-04-16 20:00:52,410 [myid:1] - WARN [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@382] - Cannot open channel to 3 at election address hadoop003/172.19.12.135:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2019-04-16 20:00:52,410 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@849] - Notification time out: 60000
The log is clearly full of entries like this:
Cannot open channel to 2 at election address hadoop002/172.19.12.133:3888
java.net.ConnectException: Connection refused
This shows that connections among the three machines hadoop001, hadoop002, and hadoop003 are failing.
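Rather than reading every stack trace, the repeated warning can be counted per peer. The following is a minimal sketch, assuming the log format shown above, where the peer address is the last field of each "Cannot open channel" line. The function name summarize_refused is made up for illustration.

```shell
#!/usr/bin/env bash
# summarize_refused LOG_FILE
# Count "Cannot open channel" warnings per peer address.
# summarize_refused is a hypothetical helper name, not part of ZooKeeper.
summarize_refused() {
  local log_file="$1"
  # The peer address (host/ip:port) is the last whitespace-separated field
  # of each warning line; sort and count the duplicates.
  grep 'Cannot open channel to' "$log_file" \
    | awk '{print $NF}' \
    | sort | uniq -c
}

# Usage (log path as used in this article):
# summarize_refused /home/hadoop/app/zookeeper/bin/zookeeper.out
```

Each output line is a count followed by a peer address, which makes it easy to see whether one peer or all peers are unreachable.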
Try ping:
[hadoop@hadoop001 bin]$ ping hadoop002
PING hadoop002 (172.19.12.133) 56(84) bytes of data.
64 bytes from hadoop002 (172.19.12.133): icmp_seq=1 ttl=64 time=0.220 ms
[hadoop@hadoop001 bin]$ ping hadoop003
PING hadoop003 (172.19.12.135) 56(84) bytes of data.
64 bytes from hadoop003 (172.19.12.135): icmp_seq=1 ttl=64 time=0.282 ms
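Both hosts respond to ping, but ping only proves that the host is reachable, not that anything is listening on a given port. Looking again at "Cannot open channel to 2 at election address hadoop002/172.19.12.133:3888", the failure is in communication on port 3888.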
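The port itself can be probed directly. This is a rough sketch, assuming bash was built with /dev/tcp support and that the coreutils timeout command is installed. The helper name port_open is made up for illustration.

```shell
#!/usr/bin/env bash
# port_open HOST PORT
# Succeed (exit 0) if a TCP connection can be opened within 2 seconds.
# port_open is a hypothetical helper name. It relies on bash's /dev/tcp
# pseudo-device and on the coreutils "timeout" command.
port_open() {
  local host="$1" port="$2"
  # Host and port are passed as positional arguments to the inner shell
  # so they are never interpolated into the command string.
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (hosts and election port taken from the log in this article):
# for h in hadoop002 hadoop003; do
#   if port_open "$h" 3888; then echo "$h:3888 open"; else echo "$h:3888 closed"; fi
# done
```

A host that answers ping but reports the port as closed points to the service on that host rather than to the network path.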
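(Searching online turns up many suggested causes: the firewall, myid, a restart, the server entries in zoo.cfg, and so on.)
The actual cause is simple: ZooKeeper had not been started on the other two machines, so nothing was listening on port 3888 there and the connections were refused. A single node out of three also cannot form a quorum on its own. Running ./zkServer.sh start on the other two machines fixes it.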
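Starting the remaining nodes can be done from one terminal. The sketch below assumes that passwordless SSH works between the machines and that ZooKeeper is installed under the same path on every node; neither is stated in this article, and the function name run_on_nodes is made up.

```shell
#!/usr/bin/env bash
# ASSUMPTIONS: passwordless SSH is set up, and every node uses this same path.
ZK_BIN=/home/hadoop/app/zookeeper/bin

# run_on_nodes ACTION HOST...
# Run "zkServer.sh ACTION" on each listed host, one after another.
# run_on_nodes is a hypothetical helper name.
run_on_nodes() {
  local action="$1"; shift
  local host
  for host in "$@"; do
    echo "== $host: zkServer.sh $action =="
    # cd first so the script runs from its own bin directory on the remote host.
    ssh "$host" "cd $ZK_BIN && ./zkServer.sh $action"
  done
}

# Usage:
# run_on_nodes start  hadoop002 hadoop003
# run_on_nodes status hadoop001 hadoop002 hadoop003
```

A non-interactive SSH session may not load the same environment as a login shell, so if java is not found remotely, check how JAVA_HOME and PATH are set on those hosts.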
Check again:
[hadoop@hadoop001 bin]$ ./zkServer.sh status
JMX enabled by default
Using config: /home/hadoop/app/zookeeper/bin/../conf/zoo.cfg
Mode: follower
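② Use the shell's -x debug mode to find the cause
Edit the shell script behind the startup command and append -x to the end of its first line (the shebang):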
[hadoop@hadoop001 bin]$ vi zkServer.sh
#!/usr/bin/bash -x
........
Then run the script and study the trace:
[hadoop@hadoop001 bin]$ ./zkServer.sh start
+ '[' x = x ']'
+ JMXLOCALONLY=false
+ '[' x = x ']'
+ echo 'JMX enabled by default'
JMX enabled by default
.......
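Each line starting with + is a command that was executed, so you can follow the run step by step. (Nothing stands out in this case, but the technique is sometimes useful.)
For example, the trace shows: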
+ _ZOO_DAEMON_OUT=./zookeeper.out
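This is the path of the log file, which is the directory the command is run from. To put the log file somewhere else, this is the place to change it.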
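The same trace can be obtained without editing the script at all, by passing -x to bash on the command line. The sketch below demonstrates this on a tiny stand-in script written to a temporary file; the stand-in is hypothetical and only loosely modeled on the assignment seen in the trace, it is not the real zkServer.sh.

```shell
#!/usr/bin/env bash
# Write a tiny stand-in script (hypothetical, for demonstration only).
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
ZOO_LOG_DIR=.
_ZOO_DAEMON_OUT="$ZOO_LOG_DIR/zookeeper.out"
echo "log goes to $_ZOO_DAEMON_OUT"
EOF

# "bash -x SCRIPT" traces every executed command to stderr,
# so the shebang line of the script does not need to be changed.
# 2>&1 merges the trace into stdout so it can be captured or paged.
trace_output=$(bash -x "$demo_script" 2>&1)
echo "$trace_output"

rm -f "$demo_script"
```

For the real script the equivalent would be bash -x ./zkServer.sh start 2> trace.log, which keeps the trace in a file for later reading. As for relocating the log, the ZooKeeper scripts of this era appear to derive the directory from a ZOO_LOG_DIR variable, so exporting it before starting may be enough; treat that as an assumption and confirm it against the zkEnv.sh in your own installation.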
More to be added later.