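1. Prerequisite: a 1-master, 2-replica Redis setup, with a snapshot taken. Redis here already runs as 1 master and 2 replicas.
2. Configure three sentinels (160, 163, 164)
Taking the sentinel on the master node as an example, copy the sentinel configuration file into the installation directory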
cp /usr/local/src/redis-6.0.16/sentinel.conf /usr/local/src/redis160/bin/
vi /usr/local/src/redis160/bin/sentinel.conf
:set nu
i
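17 === Enable protected mode (keep it consistent with the Redis configuration); uncomment the line if it is commented out
protected-mode yes
21 === Port (change it if you prefer a different one)
port 26379
26 === Run in the background as a daemon
daemonize yes
36 === Log file location (optional); placing it under the redis160 folder is recommended
logfile "/usr/local/src/redis160/sentinel_26379.log"
84 === Master host and port monitored by the sentinel; the trailing 2 is the quorum, i.e. 2 sentinels must agree that the master is down
sentinel monitor mymaster 192.168.109.160 6379 2
86 === Password the sentinel uses to authenticate with the master; uncomment the line if it is commented out
sentinel auth-pass mymaster root
125 === Time without a valid reply before the master is considered down, in milliseconds; the default is 30 s, changed here to 10 s
sentinel down-after-milliseconds mymaster 10000
127 === Password for this sentinel (protected-mode password)
requirepass root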
Esc
:wq
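With three sentinels and a quorum of 2, it may help to keep two thresholds apart: the quorum only decides when the master is flagged objectively down, while the sentinel that leads the failover needs votes from a majority of all sentinels, and never fewer than the quorum. A small sketch of that arithmetic follows; the function names are made up for illustration and are not Redis commands.

```shell
# Hypothetical helpers, not part of Redis: votes needed with N sentinels.

# Majority of n sentinels.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Votes a sentinel needs to become failover leader:
# the larger of the majority and the configured quorum.
leader_votes_needed() {
    n=$1
    quorum=$2
    m=$(majority "$n")
    if [ "$quorum" -gt "$m" ]; then
        echo "$quorum"
    else
        echo "$m"
    fi
}

leader_votes_needed 3 2   # three sentinels, quorum 2 as configured here; prints 2
```

With only two of the three sentinels reachable, a failover can still be authorized; with one, it cannot.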
Create an empty log file (alternatively, open it with vi and save with :wq)
touch /usr/local/src/redis160/sentinel_26379.log
Open port 26379 in the firewall (the first command only checks whether it is already open)
firewall-cmd --query-port=26379/tcp
firewall-cmd --zone=public --add-port=26379/tcp --permanent
firewall-cmd --reload
Create the sentinel startup script
vi sentinel-start.sh
i
Script content:
cd /usr/local/src/redis160/bin
./redis-sentinel sentinel.conf
Esc
:wq
sudo chmod -R 777 sentinel-start.sh
Complete the sentinel configuration on 163 and 164 in the same way
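Before starting anything, one way to double-check the edits on each node is to list only the active (uncommented) directives that were touched. The helper below is a sketch written for this note; show_sentinel_settings is a hypothetical name, and the path in the usage comment is the one used on node 160.

```shell
# Hypothetical helper: list the active (uncommented) directives edited in this guide.
show_sentinel_settings() {
    grep -E '^(protected-mode|port|daemonize|logfile|sentinel (monitor|auth-pass|down-after-milliseconds)|requirepass) ' "$1"
}

# Usage on node 160:
# show_sentinel_settings /usr/local/src/redis160/bin/sentinel.conf
```

A directive that is still commented out simply does not appear in the output, which makes a forgotten line easy to spot.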
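3. Start the sentinels and test high availability
First make sure all 3 Redis servers are running, then start the 3 sentinels one after another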
./sentinel-start.sh
3.1 Open the Redis client on the master
/usr/local/src/redis160/bin/redis-cli
Enter the password, then crash the server on purpose with a debug command
auth root
debug segfault
exit
List the running redis-related processes; only the sentinel should be left
ps -ef | grep redis
3.2 Wait a moment, then open the Redis client on replica 163
/usr/local/src/redis163/bin/redis-cli
Enter the password and query the replication status
auth root
info replication
3.3 Open the Redis client on replica 164
/usr/local/src/redis164/bin/redis-cli
Enter the password and query the replication status
auth root
info replication
3.4 Start the Redis server on 160 again and open its client
/usr/local/src/redis160/bin/redis-cli
Enter the password and query the replication status
auth root
info replication
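The three checks in 3.2 to 3.4 can also be run without an interactive session. The sketch below trims the info replication output to the fields that matter for the failover test; replication_summary is a hypothetical helper, and the password root is the one configured in this guide. Going by the logs in section 4, 164 would be expected to report role:master, with 163 and 160 reporting role:slave.

```shell
# Hypothetical helper: keep only the role-related fields of INFO replication.
# Reads the INFO output on stdin; tr strips the carriage returns Redis sends.
replication_summary() {
    tr -d '\r' | grep -E '^(role|master_host|master_link_status|connected_slaves):'
}

# Usage on node 163:
# /usr/local/src/redis163/bin/redis-cli -a root info replication | replication_summary
```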
4. Log analysis
4.1 After all 3 sentinels have started
Log on 160:
22:46:15.667 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Redis is starting
22:46:15.667 # Redis version=6.0.16, bits=64, commit=00000000, modified=0, pid=93243, just started
Redis version and word size (64-bit)
22:46:15.667 # Configuration loaded
Configuration loaded
22:46:15.668 * Increased maximum number of open files to 10032 (it was originally set to 1024).
The maximum number of open files was raised to 10032 (previously 1024)
22:46:15.668 * Running mode=sentinel, port=26379.
Running in sentinel mode on port 26379
22:46:15.668 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
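Warning: the TCP backlog (the socket listen queue) of 511 cannot be enforced because the kernel parameter /proc/sys/net/core/somaxconn is set to the lower value of 128
(❁´◡`❁) This warning can be resolved by adding net.core.somaxconn=1024 to /etc/sysctl.conf (vi /etc/sysctl.conf, save with :wq) and then running sysctl -p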
22:46:15.669 # Sentinel ID is 89d9d89c66f30370b1d045ae21cfac9ee18e8f2e
ID of this sentinel
22:46:15.669 # +monitor master mymaster 192.168.109.160 6379 quorum 2
Monitoring the master with a quorum of 2
22:46:15.669 * +slave slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.160 6379
+slave: replica 163 @ master 160
22:46:15.670 * +slave slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
+slave: replica 164 @ master 160
22:53:33.296 * +sentinel sentinel c5fdae5bac20a8e836e52442b5d43b03418522b0 192.168.109.163 26379 @ mymaster 192.168.109.160 6379
+sentinel: the sentinel on 163 (c5fdae5bac20a8e836e52442b5d43b03418522b0) is monitoring master 160
22:58:37.751 * +sentinel sentinel 09f5f5062918b2c609124f359825290a883d1abd 192.168.109.164 26379 @ mymaster 192.168.109.160 6379
+sentinel: the sentinel on 164 (09f5f5062918b2c609124f359825290a883d1abd) is monitoring master 160
Log on 163:
22:53:31.296 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
22:53:31.296 # Redis version=6.0.16, bits=64, commit=00000000, modified=0, pid=104788, just started
22:53:31.296 # Configuration loaded
22:53:31.297 * Increased maximum number of open files to 10032 (it was originally set to 1024).
22:53:31.297 * Running mode=sentinel, port=26379.
22:53:31.297 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
22:53:31.299 # Sentinel ID is c5fdae5bac20a8e836e52442b5d43b03418522b0
22:53:31.299 # +monitor master mymaster 192.168.109.160 6379 quorum 2
22:53:31.300 * +slave slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.160 6379
22:53:31.301 * +slave slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
22:53:31.953 * +sentinel sentinel 89d9d89c66f30370b1d045ae21cfac9ee18e8f2e 192.168.109.160 26379 @ mymaster 192.168.109.160 6379
+sentinel: the sentinel on 160 (89d9d89c66f30370b1d045ae21cfac9ee18e8f2e) is monitoring master 160
22:58:37.754 * +sentinel sentinel 09f5f5062918b2c609124f359825290a883d1abd 192.168.109.164 26379 @ mymaster 192.168.109.160 6379
+sentinel: the sentinel on 164 (09f5f5062918b2c609124f359825290a883d1abd) is monitoring master 160
Log on 164:
22:58:35.682 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
22:58:35.682 # Redis version=6.0.16, bits=64, commit=00000000, modified=0, pid=113039, just started
22:58:35.682 # Configuration loaded
22:58:35.683 * Increased maximum number of open files to 10032 (it was originally set to 1024).
22:58:35.683 * Running mode=sentinel, port=26379.
22:58:35.683 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
22:58:35.702 # Sentinel ID is 09f5f5062918b2c609124f359825290a883d1abd
22:58:35.702 # +monitor master mymaster 192.168.109.160 6379 quorum 2
22:58:35.706 * +slave slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.160 6379
22:58:35.711 * +slave slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
22:58:36.850 * +sentinel sentinel c5fdae5bac20a8e836e52442b5d43b03418522b0 192.168.109.163 26379 @ mymaster 192.168.109.160 6379
+sentinel: the sentinel on 163 (c5fdae5bac20a8e836e52442b5d43b03418522b0) is monitoring master 160
22:58:37.571 * +sentinel sentinel 89d9d89c66f30370b1d045ae21cfac9ee18e8f2e 192.168.109.160 26379 @ mymaster 192.168.109.160 6379
+sentinel: the sentinel on 160 (89d9d89c66f30370b1d045ae21cfac9ee18e8f2e) is monitoring master 160
4.2 After master node 160 goes down
Log on 160:
23:02:42.470 # +sdown master mymaster 192.168.109.160 6379
+sdown: the Redis server on 160 is marked subjectively down
164 = at 23:02:42.603 the sentinel on 164 marks 160 as objectively down
23:02:42.606 # +new-epoch 1
The epoch (version number) is incremented
164 = at 23:02:42.603 the sentinel on 164 attempts a failover and votes for itself as leader
23:02:42.607 # +vote-for-leader 09f5f5062918b2c609124f359825290a883d1abd 1
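Votes for the sentinel on 164 as failover leader
164 = the Redis server on 164 is chosen as the new master, and the sentinel on 164 tells all replicas to follow it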
23:02:42.973 # +config-update-from sentinel 09f5f5062918b2c609124f359825290a883d1abd 192.168.109.164 26379 @ mymaster 192.168.109.160 6379
Received the configuration update from the sentinel on 164
23:02:42.973 # +switch-master mymaster 192.168.109.160 6379 192.168.109.164 6379
23:02:42.973 * +slave slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.164 6379
23:02:42.973 * +slave slave 192.168.109.160:6379 192.168.109.160 6379 @ mymaster 192.168.109.164 6379
The master switches to 164, and 163 and 160 are added as its 2 replicas
23:02:53.017 # +sdown slave 192.168.109.160:6379 192.168.109.160 6379 @ mymaster 192.168.109.164 6379
+sdown: 160, now registered as a replica of 164, is detected as down
Log on 163 (annotations as for 160):
23:02:42.571 # +sdown master mymaster 192.168.109.160 6379
23:02:42.609 # +new-epoch 1
23:02:42.610 # +vote-for-leader 09f5f5062918b2c609124f359825290a883d1abd 1
23:02:42.662 # +odown master mymaster 192.168.109.160 6379 #quorum 3/2
+odown: the sentinel on 163 also marks 160 objectively down (3 votes against a quorum of 2)
23:02:42.662 # Next failover delay: I will not start a failover before Wed Nov 3 23:08:43 2021
Next failover delay: this sentinel will not start a failover before 23:08:43
23:02:42.978 # +config-update-from sentinel 09f5f5062918b2c609124f359825290a883d1abd 192.168.109.164 26379 @ mymaster 192.168.109.160 6379
23:02:42.978 # +switch-master mymaster 192.168.109.160 6379 192.168.109.164 6379
23:02:42.978 * +slave slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.164 6379
23:02:42.978 * +slave slave 192.168.109.160:6379 192.168.109.160 6379 @ mymaster 192.168.109.164 6379
23:02:52.992 # +sdown slave 192.168.109.160:6379 192.168.109.160 6379 @ mymaster 192.168.109.164 6379
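+sdown: 160, now registered as a replica of 164, is detected as down
I have not read the Redis source, so the following is only a guess about the logged message Next failover delay: I will not start a failover before xxx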
160 = 23:02:42.470 # +sdown master mymaster 192.168.109.160 6379
164 = 23:02:42.527 # +sdown master mymaster 192.168.109.160 6379
163 = 23:02:42.571 # +sdown master mymaster 192.168.109.160 6379
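Any sentinel can mark the master objectively down. Voting and objective-down detection run in parallel, but only one objective-down should trigger a failover, and Redis appears to make the sentinel that first reaches 2 votes (the quorum value)
the failover leader. So this sentinel reached objective down on its own as well, found that the count was already 3, declined to act as leader, and pushed back the time at which it may start
a failover by 6 minutes to avoid overstepping (judging by the timestamps, 164 reached 2 votes right after its subjective-down and became the leader)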
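For what it is worth, the 6 minutes are consistent with twice the failover-timeout setting, whose default in sentinel.conf is 180000 ms: a sentinel that has voted for another leader waits that long before it may start its own failover for the same master. A rough check of the arithmetic, assuming the default was left unchanged (add_seconds is a helper written for this note, not a Redis tool):

```shell
# Hypothetical helper: add seconds to an HH:MM:SS time of day (wraps at midnight).
add_seconds() {
    echo "$1" | awk -F: -v add="$2" '{
        t = ($1 * 3600 + $2 * 60 + $3 + add) % 86400
        printf "%02d:%02d:%02d\n", int(t / 3600), int(t % 3600 / 60), t % 60
    }'
}

# 163 voted for 164 at 23:02:42; default failover-timeout is 180000 ms (180 s).
add_seconds 23:02:42 $(( 2 * 180 ))   # prints 23:08:42
```

The logged 23:08:43 is one second later, which would fit the small random offset (under one second) that Sentinel adds to the start time.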
Think: under high concurrency, the stock check for a flash-sale item should be number <= X, not number == X
Log on 164:
23:02:42.527 # +sdown master mymaster 192.168.109.160 6379
23:02:42.603 # +odown master mymaster 192.168.109.160 6379 #quorum 2/2
+odown: the sentinel marks 160 objectively down; the quorum is met (2/2)
23:02:42.603 # +new-epoch 1
The epoch (version number) is incremented
23:02:42.603 # +try-failover master mymaster 192.168.109.160 6379
The sentinel on 164 attempts a failover of 160, and the vote begins
23:02:42.604 # +vote-for-leader 09f5f5062918b2c609124f359825290a883d1abd 1
The sentinel on 164 votes for itself as failover leader
23:02:42.606 # c5fdae5bac20a8e836e52442b5d43b03418522b0 voted for 09f5f5062918b2c609124f359825290a883d1abd 1
The sentinel on 163 voted for the sentinel on 164
23:02:42.607 # 89d9d89c66f30370b1d045ae21cfac9ee18e8f2e voted for 09f5f5062918b2c609124f359825290a883d1abd 1
The sentinel on 160 voted for the sentinel on 164
23:02:42.671 # +elected-leader master mymaster 192.168.109.160 6379
The sentinel on 164 wins the election as leader and starts selecting a new master
23:02:42.671 # +failover-state-select-slave master mymaster 192.168.109.160 6379
Looks through the replicas of the old master 160
23:02:42.724 # +selected-slave slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
The Redis server on 164 is selected as the new master
23:02:42.724 * +failover-state-send-slaveof-noone slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
The sentinel sends the slaveof no one command to the Redis server on 164
23:02:42.802 * +failover-state-wait-promotion slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
Waits for the promotion to take effect, i.e. for 164 to report itself as master
23:02:42.917 # +promoted-slave slave 192.168.109.164:6379 192.168.109.164 6379 @ mymaster 192.168.109.160 6379
The Redis server on 164 is confirmed as the new master
23:02:42.917 # +failover-state-reconf-slaves master mymaster 192.168.109.160 6379
Starts updating the configuration of all remaining replicas
23:02:42.972 * +slave-reconf-sent slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.160 6379
Sends the Redis server on 163 the command to follow the new master
23:02:43.719 # -odown master mymaster 192.168.109.160 6379
-odown: the objectively-down state of 160 is cleared; since this sentinel is the leader, there is no Next failover delay message like the one on 163
23:02:43.934 * +slave-reconf-inprog slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.160 6379
The Redis server on 163 is being reconfigured
23:02:43.934 * +slave-reconf-done slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.160 6379
The Redis server on 163 has finished reconfiguring
23:02:43.999 # +failover-end master mymaster 192.168.109.160 6379
The failover of 160 led by the sentinel on 164 is complete
23:02:43.999 # +switch-master mymaster 192.168.109.160 6379 192.168.109.164 6379
23:02:43.999 * +slave slave 192.168.109.163:6379 192.168.109.163 6379 @ mymaster 192.168.109.164 6379
23:02:43.999 * +slave slave 192.168.109.160:6379 192.168.109.160 6379 @ mymaster 192.168.109.164 6379
The master switches to 164, and 163 and 160 are added as its 2 replicas
23:02:54.036 # +sdown slave 192.168.109.160:6379 192.168.109.160 6379 @ mymaster 192.168.109.164 6379
+sdown: 160, now registered as a replica of 164, is detected as down
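When comparing the three logs side by side, it can be easier to look only at the main failover milestones. The filter below is a sketch; failover_events is a hypothetical helper, the event list is a selection of the events annotated above, and the log path in the usage comment is the one configured for node 160.

```shell
# Hypothetical helper: keep only the main failover milestones of a sentinel log.
failover_events() {
    grep -E '[+-](sdown|odown|try-failover|vote-for-leader|elected-leader|selected-slave|promoted-slave|failover-end|switch-master) ' "$1"
}

# Usage with the log file configured earlier:
# failover_events /usr/local/src/redis160/sentinel_26379.log
```

Running it on each node and lining up the timestamps shows which sentinel reached +odown first and therefore led the failover.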