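If a live or backup server becomes isolated on the network, failover may occur and leave two live servers serving messages in the cluster at the same time (both the live and its backup are active). This is called split brain. Two different configurations, described below, can mitigate the problem.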
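1. Quorum voting
Both live and backup servers use quorum voting to decide what to do when they lose their replication connection. The server asks every live server in the cluster to vote on whether it thinks the server it is replicating to or from is still alive. For this to work, the cluster needs at least 3 live/backup pairs. With fewer than 3 pairs, the only option is the network check described later.
1.1 Backup voting
By default, if a backup loses its replication connection to the live broker, it decides what to do based on the result of a quorum vote. This requires at least 3 live/backup pairs in the cluster. In a 3 node cluster the backup starts if it gets 2 votes back saying that its live server is no longer available; with 4 nodes this would be 3 votes, and so on. After losing the connection, the backup keeps voting until it either receives a vote that allows it to start as live or detects that the live server is still active. In the latter case it restarts as a backup. The number of voting attempts and the wait between attempts (in milliseconds) are configured as follows: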
<ha-policy>
  <replication>
    <slave>
      <vote-retries>12</vote-retries>
      <vote-retry-wait>5000</vote-retry-wait>
    </slave>
  </replication>
</ha-policy>
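To make the numbers concrete, here is a minimal Python sketch of the arithmetic. It assumes the "2 of 3, 3 of 4" pattern from the text is a simple majority and that every retry waits the full vote-retry-wait. The function names are made up for illustration and this is not the broker's actual code.

```python
def required_votes(cluster_size: int) -> int:
    """Votes a backup needs before starting, assuming a simple majority.

    Matches the examples in the text: 2 votes for 3 nodes, 3 votes for 4.
    """
    if cluster_size < 3:
        # The text says quorum voting needs at least 3 live/backup pairs.
        raise ValueError("quorum voting needs a cluster of at least 3")
    return cluster_size // 2 + 1


def max_voting_window_ms(vote_retries: int, vote_retry_wait_ms: int) -> int:
    """Upper bound on the time spent voting, assuming one fixed wait per retry."""
    return vote_retries * vote_retry_wait_ms


print(required_votes(3), required_votes(4))
print(max_voting_window_ms(12, 5000))
```

Under these assumptions, the configuration above lets a backup keep voting for up to about one minute.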
If the cluster size is known in advance, the quorum size can be set statically, as follows:
<ha-policy>
  <replication>
    <slave>
      <quorum-size>2</quorum-size>
    </slave>
  </replication>
</ha-policy>
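If you want to double check the values you have put into such a fragment, a small script can read them back. The sketch below uses only Python's standard xml.etree.ElementTree. It assumes the fragment is parsed on its own, without the XML namespace that a complete broker.xml declares, and it puts the three backup settings into one slave element purely for illustration. read_slave_settings is a hypothetical helper, not part of Artemis.

```python
import xml.etree.ElementTree as ET

# Stand-alone fragment for illustration; a real broker.xml wraps this in
# namespaced elements, which would need namespace-aware lookups.
FRAGMENT = """
<ha-policy>
  <replication>
    <slave>
      <vote-retries>12</vote-retries>
      <vote-retry-wait>5000</vote-retry-wait>
      <quorum-size>2</quorum-size>
    </slave>
  </replication>
</ha-policy>
"""


def read_slave_settings(xml_text: str) -> dict:
    """Return the integer settings found under <replication><slave>."""
    root = ET.fromstring(xml_text)
    slave = root.find("./replication/slave")
    if slave is None:
        raise ValueError("no <replication><slave> element found")
    return {child.tag: int(child.text) for child in slave}


settings = read_slave_settings(FRAGMENT)
print(settings)
```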
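1.2 Live voting
By default, if a live server loses its replication connection, it simply carries on and waits for a backup to reconnect and start replicating again. In a split brain situation this can mean that the live server stays live even though the backup has already been activated. To avoid this, the live server can be configured to vote for a quorum when replication fails: if it does not receive a majority vote, it shuts down. This is done by setting vote-on-replication-failure to true.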
<ha-policy>
  <replication>
    <master>
      <vote-on-replication-failure>true</vote-on-replication-failure>
      <quorum-size>2</quorum-size>
    </master>
  </replication>
</ha-policy>
As with the backup policy, a static quorum size can also be configured.
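The live-side rule can be written as a tiny decision function. This is only a sketch of the behaviour described in this section, assuming "majority" means receiving at least quorum-size votes. live_should_shut_down is a hypothetical name, and how the broker actually collects and counts votes is not covered here.

```python
def live_should_shut_down(votes_received: int,
                          quorum_size: int,
                          vote_on_replication_failure: bool = True) -> bool:
    """Return True if the live should stop after losing its replication connection."""
    if not vote_on_replication_failure:
        # Default behaviour: carry on and wait for the backup to reconnect.
        return False
    # With voting enabled, fewer votes than the quorum means no majority.
    return votes_received < quorum_size


print(live_should_shut_down(1, 2))
print(live_should_shut_down(2, 2))
```

With quorum-size set to 2 as in the configuration above, a live server that gets only one vote would shut down, and one that gets two would keep running.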
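2. Network check
You can configure one or more addresses from your network topology in broker.xml, and the server will ping them throughout its life cycle.
If you pass --ping with an IP address to the create command, a default XML ready for network checks is generated.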
./artemis create /myDir/myServer --ping 10.0.0.1
The following XML is added to broker.xml:
<!--
You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
<network-check-NIC>theNicName</network-check-NIC>
-->
<!--
Use this to use an HTTP server to validate the network
<network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
<network-check-period>10000</network-check-period>
<network-check-timeout>1000</network-check-timeout>
<!-- this is a comma separated list, no spaces, just DNS or IPs
it should accept IPV6
Warning: Make sure you understand your network topology as this is meant to check if your network is up. Using IPs that could eventually disappear or be partially visible may defeat the purpose. You can use a list of multiple IPs, any successful ping will make the server OK to continue running -->
<network-check-list>10.0.0.1</network-check-list>
<!-- use this to customize the ping used for ipv4 addresses -->
<network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command>
<!-- use this to customize the ping used for ipv6 addresses -->
<network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>
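The rule in the warning above, that any single successful ping keeps the server running, can be sketched in a few lines of Python. The reachability test is passed in as a function so the example runs without touching the network. parse_check_list, network_is_healthy and the second address 10.0.0.2 are assumptions made up for illustration, not Artemis APIs or values.

```python
from typing import Callable, Iterable, List


def parse_check_list(value: str) -> List[str]:
    """Split a comma separated network-check-list value (no spaces) into addresses."""
    return [item for item in value.split(",") if item]


def network_is_healthy(addresses: Iterable[str],
                       is_reachable: Callable[[str], bool]) -> bool:
    """The network counts as healthy if at least one address answers."""
    return any(is_reachable(address) for address in addresses)


# Pretend only 10.0.0.2 currently answers pings.
reachable_now = {"10.0.0.2"}
addresses = parse_check_list("10.0.0.1,10.0.0.2")

print(network_is_healthy(addresses, lambda a: a in reachable_now))
print(network_is_healthy(["10.0.0.1"], lambda a: a in reachable_now))
```

This also shows why the choice of addresses matters: if the list contains a host that stays visible from both sides of a partition, the check never fails and does not help against split brain.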
Once the server loses connectivity to the configured IP (10.0.0.1 in this example), it logs output like the following.
09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]
Once the check detects that the connection is back, the server resumes service.
09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
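Note: this series is an abridged translation of the official Apache Artemis V2.6.2 user documentation. It does not follow the official layout exactly and some parts are omitted. My ability is limited, so please forgive any mistakes. Source document: http://activemq.apache.org/artemis/docs/latest/index.html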