Purpose:
1. Verify the redis-sentinel environment and its failover behavior;
2. Prepare for implementing failover later, either via a jedis sentinel patch or via sentinel/twemproxy/Twemproxy-sentinel-agent;
References:
http://www.redisdoc.com/en/latest/topic/sentinel.html
http://blog.163.com/a12333a_li/blog/static/87594285201304103257837/
Items to verify:
1. Environment setup
2. Failover: shut down the master
3. Failover: slave behavior
4. Multi-sentinel environment
5. Summarize sentinel's messages and write status-check scripts, as preparation for later maintenance and monitoring;
Environment:
192.168.0.11: redis master:6379 / slave:6380 / sentinel1:26379
192.168.0.12: redis slave:6379
OS: RHEL 6.4
Test log:
Environment setup:
1. The software environment is already installed on both machines.
2. Configure the redis nodes listed above;
2.1 Configure master/slave on machine 11; after SETting some data, verify replication works;
2.2
3. Install sentinel. It was already built earlier, but install did not copy it onto the PATH; the binary redis-sentinel lives in $redis-source-dir/src/, so add that directory to PATH. There is also a sentinel.conf template next to src/ that can be used as a reference.
The configuration follows http://blog.163.com/a12333a_li/blog/static/87594285201304103257837/
# Change the IP address; it can be any node in the cluster.
sentinel monitor mymaster 192.168.1.11 6379 1
# Checked once per second by default; after a 5000 ms timeout the node is considered down.
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 900000
sentinel can-failover mymaster yes
sentinel parallel-syncs mymaster 1
Start sentinel:
[root@soa1 sentinel-env]# redis-server /root/devzone/redis/sentinel-env/sentinel.conf --sentinel &
[1] 29795
[root@soa1 sentinel-env]# [29795] 26 Nov 16:57:53.056 * Max number of open files set to 10032
[31193] 26 Nov 17:12:28.234 * +slave slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:12:28.234 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
Log into the sentinel console (redis-cli -p 26379) and query the status:
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=127.0.0.1:6379,slaves=2,sentinels=1
This shows the nodes are working normally.
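Goal 5 above calls for status-check scripts. A minimal sketch of one, parsing the "# Sentinel" section of the INFO output shown above (as produced by redis-cli -p 26379 info) and checking that every monitored master reports status=ok; the function and variable names are my own, not a redis API:

```python
def parse_sentinel_info(text):
    """Turn 'key:value' INFO lines into a dict; masterN lines get their
    comma-separated 'k=v' payload parsed into a nested dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip section headers and blanks
        key, value = line.split(":", 1)
        if key.startswith("master") and "=" in value:
            result[key] = dict(item.split("=", 1) for item in value.split(","))
        else:
            result[key] = value
    return result

def all_masters_ok(info):
    """True only if at least one master is monitored and all report status=ok."""
    masters = [v for k, v in info.items()
               if k.startswith("master") and isinstance(v, dict)]
    return bool(masters) and all(m.get("status") == "ok" for m in masters)

# Sample taken from the INFO output recorded above.
sample = """# Sentinel
sentinel_masters:1
sentinel_tilt:0
master0:name=mymaster,status=ok,address=127.0.0.1:6379,slaves=2,sentinels=1
"""
info = parse_sentinel_info(sample)
```

A cron job could run redis-cli -p 26379 info, feed the output through this, and alert when all_masters_ok is False.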
Run MONITOR on the slave's console; it shows a lot of traffic, including sentinel's messages:
"PUBLISH" "__sentinel__:hello" "127.0.0.1:26379:5af67fe4818cd8c1e1dc679a
The same queries also appear on the master;
----------- After stopping the master, query sentinel:
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=2,sentinels=1
The master has already switched over to machine 12; how the slave is chosen is described in the sentinel documentation;
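After such a switch, a client (e.g. a jedis sentinel patch) rediscovers the current master by sending SENTINEL get-master-addr-by-name mymaster to any sentinel on port 26379. The sketch below only shows the RESP (redis wire protocol) encoding of that request; opening the socket and reading the reply is left out, and the helper name is my own:

```python
def encode_resp_command(*args):
    """Encode a command as a RESP multi-bulk request, the format every
    redis/sentinel command travels in on the wire."""
    parts = ["*%d\r\n" % len(args)]
    for arg in args:
        parts.append("$%d\r\n%s\r\n" % (len(arg), arg))
    return "".join(parts).encode("ascii")

# The discovery request a failover-aware client would send to a sentinel.
request = encode_resp_command("SENTINEL", "get-master-addr-by-name", "mymaster")
```

The reply is a two-element array with the master's ip and port, so the client can reconnect without any hardcoded address.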
The sentinel console shows the failover sequence:
[31193] 26 Nov 17:18:24.228 # +sdown master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:24.228 # +odown master mymaster 127.0.0.1 6379 #quorum 1/1
[31193] 26 Nov 17:18:24.328 # +failover-triggered master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:24.328 # +failover-state-wait-start master mymaster 127.0.0.1 6379 #starting in 9168 milliseconds
[31193] 26 Nov 17:18:33.565 # +failover-state-select-slave master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:33.666 # +selected-slave slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:33.666 * +failover-state-send-slaveof-noone slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:33.766 * +failover-state-wait-promotion slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:34.269 # +promoted-slave slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:34.269 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:34.369 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.273 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.273 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.373 # +failover-end master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.373 #
[31193] 26 Nov 17:18:35.475 * +slave slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:19:05.455 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
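The console lines above share one shape: "[pid] date #|* event type name ip port [@ master ip port]". A small parser over that shape (my own helper, assuming this log format stays stable across the events seen here) makes it easy to script alerts on events such as +sdown or +promoted-slave:

```python
import re

# Matches lines like:
#   [31193] 26 Nov 17:18:24.228 # +sdown master mymaster 127.0.0.1 6379
LINE_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<date>\d+ \w+ [\d:.]+)\s+[#*]\s+"
    r"(?P<event>[+-][\w-]+)\s+(?P<rest>.*)")

def parse_event_line(line):
    """Return a dict with pid, date, event, instance_type and name,
    or None if the line does not look like a sentinel event."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    fields = d.pop("rest").split()
    if len(fields) >= 2:
        d["instance_type"] = fields[0]  # master / slave / sentinel
        d["name"] = fields[1]           # instance name, or ip:port for slaves
    return d

event = parse_event_line(
    "[31193] 26 Nov 17:18:24.228 # +sdown master mymaster 127.0.0.1 6379")
```

Piping the sentinel console output through this, line by line, gives structured events for the monitoring goal listed at the top.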
Query the sentinel info on 11:
redis 127.0.0.1:26379> sentinel slaves mymaster
1)
2)
It shows two slaves are still listed. One of them is 127.0.0.1 itself, with an empty runid and inconsistent master info, which is very odd and may be a bug. The slave entry for machine 11 looks normal;
On machine 12's redis console, INFO shows replication is healthy with exactly one slave, and the view on 11 is also correct; so it is the sentinel view that is wrong, presumably a bug:
# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6380,online
------------------ Shut down the slave on machine 11:
INFO on host 12 shows the master has noticed the slave shutdown:
# Replication
role:master
connected_slaves:0
The sentinel console on 11 shows: [31193] 26 Nov 17:42:34.520 # +sdown slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379
But INFO via sentinel's redis-cli still reports two slaves, which is strange;
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=2,sentinels=1
Querying sentinel slaves mymaster on 11's sentinel again:
redis 127.0.0.1:26379> sentinel slaves mymaster
1)
2)
The slave on host 11 now shows s_down,slave,disconnected, so sentinel did record the state change of both slaves; perhaps it is not a bug after all.
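Each entry in the sentinel slaves reply carries a comma-separated "flags" field, which is where the s_down,slave,disconnected state above comes from. A tiny classifier for a status-check script (my own helper, not a redis API) that treats any down/disconnect flag as unhealthy:

```python
# Flags that indicate a slave should be alerted on; s_down/o_down are
# sentinel's subjective/objective down states, disconnected means the
# sentinel has lost its link to the instance.
BAD_FLAGS = {"s_down", "o_down", "disconnected"}

def slave_is_healthy(flags_value):
    """True when none of the down/disconnect flags are present."""
    return not (set(flags_value.split(",")) & BAD_FLAGS)

# The two states seen in this test run:
ok_before_shutdown = slave_is_healthy("slave")
ok_after_shutdown = slave_is_healthy("s_down,slave,disconnected")
```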
------------- Launch the old master on machine 11 again:
Host 12 immediately sees the slave on 11:
# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6379,online
After 11:6379 starts, it has turned itself into a slave:
# Replication
role:slave
master_host:192.168.0.12
master_port:6379
The sentinel console shows the slave on 11 recovering:
[31193] 26 Nov 17:50:06.531 * +demote-old-slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:50:06.731 # -sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:50:11.448 * +slave slave 192.168.0.11:6379 192.168.0.11 6379 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:50:16.562 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
Sentinel's management info now unexpectedly contains one extra slave record!
redis 127.0.0.1:26379> sentinel slaves mymaster
1)
2)
3)
Querying sentinel's INFO: it indeed shows three slaves:
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=3,sentinels=1
So sentinel can already manage master/slave state, but the slave bookkeeping seems buggy. Could the problem be that the instances were bound to 127.0.0.1 at startup?
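The duplicate records fit that suspicion: one instance known under two addresses (127.0.0.1:6379 as seen on the sentinel's own host, 192.168.0.11:6379 as seen via the master) produces two slave entries. A sketch of deduplicating such a list by run_id when present, falling back to the address otherwise; the data and helper below are illustrative, not taken from this test run:

```python
def dedupe_slaves(slaves):
    """Keep one record per run_id; a record with an empty run_id is kept
    per-address, since nothing proves it aliases another entry."""
    seen, result = set(), []
    for s in slaves:
        key = s["run_id"] or ("addr", s["ip"], s["port"])
        if key in seen:
            continue  # same instance already recorded under another address
        seen.add(key)
        result.append(s)
    return result

# Made-up run ids; the point is two addresses sharing one run_id.
slaves = [
    {"ip": "127.0.0.1",    "port": "6379", "run_id": "runid-A"},
    {"ip": "192.168.0.11", "port": "6379", "run_id": "runid-A"},  # same instance
    {"ip": "192.168.0.11", "port": "6380", "run_id": "runid-B"},
]
deduped = dedupe_slaves(slaves)
```

In the run above the duplicate 127.0.0.1 entry had an empty runid, so this heuristic would not collapse it; the real fix is to avoid binding the instances to 127.0.0.1 in the first place.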
---------------------- Bring the 6380 slave on machine 11 back:
INFO on 11:6380 shows that instead of rejoining as a slave of 12, it came back as a master:
# Replication
role:master
connected_slaves:0
Looking at the sentinel console, it appears sentinel performed a master/slave switch from 12:6379 to 11:6380, which is why 11:6380 ended up as a master:
[31193] 26 Nov 18:04:36.244 * +reboot slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 18:04:36.244 # -slave-restart-as-master slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379 #removing it from the attached slaves
But in fact 12:6379 has not moved: it is still the master and still has 11:6379 as a slave, so no switchover was completed:
# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6379,online
Sentinel's INFO shows 12:6379 is still the master, but the slave count is back to 2:
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=2,sentinels=1
Query the two slaves:
redis 127.0.0.1:26379> sentinel slaves mymaster
1)
2)
The remaining issues will be tried again tomorrow.