Failover Drill
- Client-side high-availability observation
- Server-side log analysis: data nodes and Sentinel nodes
Failover Drill (Client Side)
Create a Maven project
Add the pom dependencies:
<dependencies>
<!-- logging -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.6</version>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.1.1</version>
</dependency>
<!-- jedis -->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
</dependencies>
Create RedisSentinelFailoverTest.java:
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

/**
 * Client-side behavior during a failover.
 */
public class RedisSentinelFailoverTest {
    private static Logger logger = LoggerFactory.getLogger(RedisSentinelFailoverTest.class);

    public static void main(String[] args) {
        String masterName = "mymaster";
        Set<String> sentinels = new HashSet<String>();
        sentinels.add("127.0.0.1:26379");
        sentinels.add("127.0.0.1:26380");
        sentinels.add("127.0.0.1:26381");
        JedisSentinelPool jedisSentinelPool = new JedisSentinelPool(masterName, sentinels);
        Random random = new Random();
        int counter = 0;
        while (true) {
            counter++;
            Jedis jedis = null;
            try {
                jedis = jedisSentinelPool.getResource();
                int index = random.nextInt(100000);
                String key = "k-" + index;
                String value = "v-" + index;
                jedis.set(key, value);
                // throttle log output: log every 100th write
                if (counter % 100 == 0) {
                    logger.info("{} value is {}", key, jedis.get(key));
                }
                TimeUnit.MILLISECONDS.sleep(10);
            } catch (Exception e) {
                logger.error(e.getMessage(), e);
            } finally {
                if (jedis != null) {
                    jedis.close();
                }
            }
        }
    }
}
Run output:
...
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-51558 value is v-51558
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-1747 value is v-1747
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-3276 value is v-3276
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-71138 value is v-71138
...
Now force the node on port 7000 down:
127.0.0.1:7000> info server
# Server
...
process_id:96599
...
127.0.0.1:7000> exit
[root@localhost config]# kill -9 96599
[root@localhost config]# ps -ef | grep redis-server | grep 700
501 93371 1 0 3:20PM ?? 0:00.01 redis-server *:7001
501 93378 1 0 3:20PM ?? 0:00.01 redis-server *:7002
The client logs a burst of connection errors; once the failover completes (roughly 30 seconds later), writes return to normal:
...
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.ConnectException: ...
...
Caused by: java.net.ConnectException: Connection refused (Connection refused)
...
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-5155 value is v-5155
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-2747 value is v-2747
23:26:49.278 [main] INFO RedisSentinelFailoverTest - k-1138 value is v-1138
...
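During that window, each write fails fast with a connection exception until Sentinel finishes promoting a slave. Below is a minimal sketch of riding out such a window with bounded retries and backoff; the class name, helper names, and backoff values are illustrative (not part of the drill), and a real Jedis client would catch JedisConnectionException specifically rather than a generic Exception.

```java
import java.util.concurrent.Callable;

public class FailoverRetry {
    // Exponential backoff capped at 2s; a dozen-plus attempts comfortably
    // covers the ~30s a Sentinel failover can take with default timeouts.
    static long backoffMillis(int attempt) {
        long base = 100L << Math.min(attempt, 5); // 100, 200, 400, ... 3200
        return Math.min(base, 2000L);
    }

    // Retries the operation, sleeping between attempts; rethrows the last
    // failure if maxAttempts is exhausted.
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) { // in real code: JedisConnectionException
                last = e;
                Thread.sleep(backoffMillis(attempt));
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky operation: fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("Connection refused");
            return "OK";
        }, 10);
        System.out.println(result + " after " + calls[0] + " calls"); // OK after 3 calls
    }
}
```

The write loop in RedisSentinelFailoverTest could wrap its `jedis.set(key, value)` call in `withRetry` instead of merely logging the error, turning the 30-second outage into slower-but-successful writes.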
Failover Drill (Log Analysis)
Check the old master's (7000) log; the last entries show it still performing ordinary master-slave replication:
[root@localhost data]# tail -f 7000.log
96602:C 21 Oct 21:34:41.063 * DB saved on disk
96599:M 21 Oct 21:34:41.195 * Background saving terminated with success
96599:M 21 Oct 21:34:41.160 * Synchronization with slave 127.0.0.1:7002 succeeded
96599:M 21 Oct 21:34:42.023 * Slave 127.0.0.1:7001 asks for synchronization
96599:M 21 Oct 21:34:42.023 * Partial resynchronization not accepted: Runid mismatch (Client asked for runid '42821ea0a85b0f84b9847ad3171ec0ae67381c9', my runid is 'd97bc7e44538d54ff9fd6cc6a48185879e25fdd3')
96599:M 21 Oct 21:34:42.023 * Starting BGSAVE for SYNC with target: disk
96599:M 21 Oct 21:34:42.024 * Background saving started by pid 96605
96605:C 21 Oct 21:34:42.025 * DB saved on disk
96599:M 21 Oct 21:34:42.068 * Background saving terminated with success
96599:M 21 Oct 21:34:42.069 * Synchronization with slave 127.0.0.1:7002 succeeded
Check the 7001 node's log: it lost its connection to the old master and kept retrying; after the retries timed out, it received the Sentinel leader's command and was promoted to the new master (config rewrite, the remaining slave repointed to the new master, replication completed).
[root@localhost data]# tail -200 7001.log
...
96568:S 21 Oct 23:34:32.303 * Connecting to MASTER 127.0.0.1:7000
96568:S 21 Oct 23:34:32.304 * MASTER <-> SLAVE sync started
96568:S 21 Oct 23:34:35.870 # Error condition on socket for SYNC: Connection refused
...
96568:M 21 Oct 23:35:02.122 * Discarding previously cached master state
96568:M 21 Oct 23:35:02.123 * MASTER MODE enabled (user request from 'id=2 addr
=127.0.0.1:49237 fd=8 name=sentinel-a02e8c95-cmd age=7269 idle=0 flags=x db=0
sub=0 psub=0 multi=3 qbuf=14 qbuf-free=32754 obl=36 oll=0 omem=0 events=rw cmd=exec')
96568:M 21 Oct 23:35:02.124 * CONFIG REWRITE executed with success
96568:M 21 Oct 23:35:02.456 * Slave 127.0.0.1:7002 asks for synchronization
96568:M 21 Oct 23:35:02.456 * Full resync requested by slave 127.0.0.1:7002
96568:M 21 Oct 23:35:02.457 * Starting BGSAVE for SYNC with target: disk
96568:M 21 Oct 23:35:02.457 * Background saving started by pid 1908
96568:M 21 Oct 23:35:02.460 * DB saved on disk
96568:M 21 Oct 23:35:02.497 * Background saving terminated with success
96568:M 21 Oct 23:35:02.498 * Synchronization with slave 127.0.0.1:7002 succeeded
Check the 7002 node's log: it likewise lost its connection to the old master and kept retrying; after the retries timed out, it received the Sentinel leader's command to replicate the new master (config rewrite, repoint to the new master, replication completed).
[root@localhost data]# tail -200 7002.log
...
96585:S 21 Oct 23:34:32.303 * Connecting to MASTER 127.0.0.1:7000
96585:S 21 Oct 23:34:32.304 * MASTER <-> SLAVE sync started
96585:S 21 Oct 23:34:35.870 # Error condition on socket for SYNC: Connection refused
...
96585:S 21 Oct 23:35:02.122 * Discarding previously cached master state
96585:S 21 Oct 23:35:02.123 * SLAVE OF 127.0.0.1:7001 (user request from 'id=2 addr
=127.0.0.1:50767 fd=8 name=sentinel-a02e8c95-cmd age=7246 idle=0 flags=x db=0
sub=0 psub=0 multi=3 qbuf=14 qbuf-free=32754 obl=36 oll=0 omem=0 events=rw cmd=exec')
96585:S 21 Oct 23:35:02.124 * CONFIG REWRITE executed with success
96585:S 21 Oct 23:35:02.456 * Connecting to MASTER 127.0.0.1:7001
96585:S 21 Oct 23:35:02.456 * MASTER <-> SLAVE sync started
96585:S 21 Oct 23:35:02.457 * Non blocking connect for SYNC fired the event
96585:S 21 Oct 23:35:02.457 * Master replied to PING, replication can continue . . .
96585:S 21 Oct 23:35:02.460 * Partial resynchronization not possible (no cached master)
96585:S 21 Oct 23:35:02.497 * Full resync from master: 359c5d4f2e7a68880746e677f1dc0498eea859a3:1
96585:S 21 Oct 23:35:02.498 * MASTER <-> SLAVE sync: receiving 64690 bytes from master
96585:S 21 Oct 23:35:02.499 * MASTER <-> SLAVE sync: Flushing old data
96585:S 21 Oct 23:35:02.498 * MASTER <-> SLAVE sync: Loading DB in memory
96585:S 21 Oct 23:35:02.498 * MASTER <-> SLAVE sync: Finished with success
Check the replication info on the new master (7001): its role has changed to master and a slave entry has appeared:
[root@localhost data]# redis-cli -p 7001
127.0.0.1:7001> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=7002,state=online,offset=259536,lag=1
...
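Monitoring tools often parse these INFO lines to track the topology. A minimal sketch of pulling the fields out of one slaveN line follows; the class and method names are illustrative, not a real Jedis or Redis API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplicationInfoParser {
    // Parses the value part of a "slaveN:" line from INFO replication, e.g.
    //   ip=127.0.0.1,port=7002,state=online,offset=259536,lag=1
    // into an ordered field map.
    static Map<String, String> parseSlaveLine(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.split(",")) {
            String[] kv = pair.split("=", 2);
            fields.put(kv[0], kv[1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f =
            parseSlaveLine("ip=127.0.0.1,port=7002,state=online,offset=259536,lag=1");
        System.out.println(f.get("port") + " " + f.get("state")); // 7002 online
    }
}
```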
Check the info on the 26379 Sentinel node:
[root@localhost data]# redis-cli -p 26379
127.0.0.1:26379> info
...
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=127.0.0.1:7001,slaves=2,sentinels=3
Check the 26379 Sentinel node's log. It shows (+sdown) this sentinel subjectively marking the master down, (+odown) the objective-down event (a quorum of sentinels agreeing the master is down), (+try-failover) the failover attempt starting, (+selected-slave) a slave being chosen, (+failover-state-send-slaveof-noone) the promotion command being sent, (+failover-state-wait-promotion) waiting for the promotion to complete, and so on:
...
93821:X 21 Oct 21:35:01.861 * +sdown master mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:01.917 * +odown master mymaster 127.0.0.1 7000 #quorum 3/2
93821:X 21 Oct 21:35:01.917 * +new-epoch 1
93821:X 21 Oct 21:35:01.918 * +try-failover master mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:01.919 * +vote-for-leader a02e8c95073b12b7a7e9199de7c754adebd45a42 1
93821:X 21 Oct 21:35:01.922 * 127.0.0.1:26380 vote for a02e8c95073b12b7a7e9199de7c754adebd45a42 1
93821:X 21 Oct 21:35:01.922 * 127.0.0.1:26381 vote for a02e8c95073b12b7a7e9199de7c754adebd45a42 1
93821:X 21 Oct 21:35:01.991 * +elected-leader master mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:01.992 * +failover-state-select-slave master mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:02.064 * +selected-slave slave 127.0.0.1:7001 127.0.0.1 7001 @ 127.0.0.1 7000
93821:X 21 Oct 21:35:02.064 * +failover-state-send-slaveof-noone slave 127.0.0.1:7001 127.0.0.1 7001 @ mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:02.122 * +failover-state-wait-promotion slave 127.0.0.1:7001 127.0.0.1 7001 @ mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:02.125 * +promoted-slave slave 127.0.0.1:7001 127.0.0.1 7001 @ mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:02.125 * +failover-state-reconf-slaves master mymaster 127.0.0.1 7000
93821:X 21 Oct 21:35:02.125 * +slave-reconf-sent slave 127.0.0.1:7002 127.0.0.1 7002 @ mymaster 127.0.0.1 7000
...
Check the 26380 Sentinel node's log for the (+switch-master) master-switch message:
...
93996:X 21 Oct 21:35:02.272 * +switch-master mymaster 127.0.0.1 7000 127.0.0.1 7001
...
Sentinel pub/sub channels
+reset-master <instance details> | the master was reset
+slave <instance details> | a new slave was detected and attached
+failover-state-reconf-slaves <instance details> | failover entered the reconf-slaves state
+slave-reconf-sent <instance details> | the leader Sentinel told another slave to replicate the new master
+slave-reconf-inprog <instance details> | the slave is reconfiguring to the new master, but synchronization is not yet complete
+slave-reconf-done <instance details> | the slave finished synchronizing with the new master
+sentinel <instance details> | a new Sentinel node was detected and attached
+sdown <instance details> | a node was marked subjectively down
-sdown <instance details> | the subjective-down mark on a node was cleared
+odown <instance details> | a node was marked objectively down
-odown <instance details> | the objective-down mark on a node was cleared
+new-epoch <instance details> | the current epoch was updated
+try-failover <instance details> | failover started
+elected-leader <instance details> | the Sentinel that will run the failover was elected
+failover-state-select-slave <instance details> | failover entered the select-slave state (looking for a suitable slave)
+no-good-slave <instance details> | no suitable slave was found
+selected-slave <instance details> | a suitable slave was found
+failover-state-send-slaveof-noone <instance details> | failover entered the send-slaveof-noone state (slaveof no one sent to the selected slave)
+failover-end-for-timeout <instance details> | failover terminated due to timeout
+failover-end <instance details> | failover completed successfully
+switch-master <masterName> <oldip> <oldport> <newip> <newport> | the master address changed; this is the event many clients care most about
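Clients learn the new master address either by re-querying Sentinel or by subscribing to the +switch-master channel on a Sentinel node (with Jedis, via jedis.subscribe(listener, "+switch-master")). The message payload carries the fields listed in the last row above, minus the event name. A minimal parsing sketch, with illustrative class and method names:

```java
import java.util.Arrays;

public class SwitchMasterParser {
    // Payload published on the +switch-master channel:
    //   "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
    // Returns {new-ip, new-port}.
    static String[] newMasterHostAndPort(String message) {
        String[] parts = message.split(" ");
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected payload: " + message);
        }
        return new String[] { parts[3], parts[4] };
    }

    public static void main(String[] args) {
        // Payload matching the 26380 log line in the drill above.
        String payload = "mymaster 127.0.0.1 7000 127.0.0.1 7001";
        System.out.println(Arrays.toString(newMasterHostAndPort(payload)));
        // → [127.0.0.1, 7001]
    }
}
```

This is essentially what JedisSentinelPool does internally: on a master switch for its masterName it re-initializes the pool against the new address, which is why the test client recovers without a restart.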