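A problem you may run into when setting up a distributed cluster on Linux
This one comes from a nasty pit I fell into today while installing zookeeper, and it cost me a whole day. When setting up a zookeeper cluster, you usually end up with a configuration like this: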
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888
At first I followed exactly this configuration, but it simply would not work no matter what I tried. zookeeper.out contained the following:
2017-04-21 06:05:34,385 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@400] - Cannot open channel to 3 at election address hadoop05/192.168.31.155:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:381)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:426)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:843)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:822)
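Never mind the timestamp in the log (not that you would care anyway; the clock on the VM was never set, which is why it shows six in the morning. What idiot gets up at 6?).
For this error, most of what you find online says it is caused by the zookeeper startup order: the election is unstable at first, don't worry, it sorts itself out after a while. That did nothing for me. Next comes the usual advice about hostname mappings, but even with the mappings configured and all three VMs able to ping each other, it made no damn difference. I kept digging and found yet another style of configuration: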
server.1=192.168.31.151:2888:3888
server.2=192.168.31.152:2888:3888
server.3=192.168.31.153:2888:3888
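A side note on the ping test mentioned above: ping only shows that the machines can reach each other over ICMP. It says nothing about whether anything is accepting TCP connections on the election port, and "Connection refused" in the log is about exactly that. Below is a minimal sketch of a port check in Python. The function name can_connect and the 2-second timeout are my own choices for illustration; the hostnames and port 3888 come from the configuration above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration. Any failure (name resolution error,
    connection refused, timeout) is reported as False.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError, socket.timeout and socket.gaierror are all OSError subclasses.
        return False


# Example usage, run from one node against the others (hostnames assumed from the config above):
#   for host in ("hadoop01", "hadoop02", "hadoop03"):
#       print(host, can_connect(host, 3888))
```

If this returns False from another node while zookeeper is running on the target, the target is probably not listening on an address that the other node can reach.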
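So I redid the configuration this way, and to my surprise it worked: a leader and the followers were elected. Once the excitement wore off, though, I had to know where the hell the problem actually was and what the difference between the two configurations is. It was keeping me up at night. Baidu turned up absolutely nothing, so I went to Google, and in some obscure corner I found a Q&A where someone else had hit the same problem. Here is the link: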
https://unix.stackexchange.com/questions/240506/zookeeper-dns-name-problems-with-leader-elections-when-migrating-from-windows-to
The title of the question is:
Zookeeper DNS name problems with leader elections when migrating from Windows to Debian
And the person who answered said:
The "smoking gun" was this line in my zookeeper log:
2015-11-26 20:48:31,439 [myid:1] - INFO [Thread-2:QuorumCnxManager$Listener@504] - My election bind port: spring-xd-1/127.0.0.1:3888
So, why was Zookeeper binding the election port on the loopback interface? Well...
My /etc/hosts on one of the VMs looked like this:
127.0.0.1 spring-xd-1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
## vagrant-hostmanager-start
172.28.128.3 spring-xd-1
172.28.128.4 spring-xd-2
172.28.128.7 spring-xd-3
## vagrant-hostmanager-end
I removed the hostname from the 127.0.0.1 line in /etc/hosts and bounced the zookeeper service on all 3 nodes, and BAM! everything came up roses. So, now the host file on each machine looks like this:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
## vagrant-hostmanager-start
172.28.128.3 spring-xd-1
172.28.128.4 spring-xd-2
172.28.128.7 spring-xd-3
## vagrant-hostmanager-end
The last paragraph is rather enlightening:
EDIT: According to
http://ccl.cse.nd.edu/operations/condor/hostname.shtml, this seems to
be a fairly common problem with clustered apps on Linux, and
recommends editing the hosts file as I've described above. However,
the Zookeeper documentation on cluster setup doesn't mention it.
I wanted to read the page that explains this problem, but sadly it would not load. Damn it!
P.S. Setting up zookeeper is going to be the death of me.
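To check your own machines for the situation described in the quoted answer (the machine's hostname sitting on a loopback line of /etc/hosts), here is a rough sketch in Python. The function name loopback_aliases and the rule "anything not containing localhost or loopback is suspicious" are my own assumptions for illustration, not something from the answer or from the zookeeper docs.

```python
import ipaddress


def loopback_aliases(hosts_text):
    """Return {name: address} for names mapped to a loopback address.

    Hypothetical helper for illustration. Names containing "localhost" or
    "loopback" are treated as expected and skipped (a heuristic, not a standard).
    """
    flagged = {}
    for raw in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            # Not a plain IPv4/IPv6 address; skip the line.
            continue
        if not addr.is_loopback:
            continue
        for name in fields[1:]:
            if "localhost" not in name and "loopback" not in name:
                flagged[name] = fields[0]
    return flagged


# Example usage on a real machine:
#   with open("/etc/hosts") as f:
#       print(loopback_aliases(f.read()))
```

Run against the first hosts file in the quoted answer, this flags spring-xd-1 on 127.0.0.1; run against the fixed one, it flags nothing.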