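RabbitMQ cluster troubleshooting
Symptoms
rabbitmq fails to start.
Manually killing the rabbit background process does not help: the process is restarted automatically after it is killed.
Resolution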
(1) On the failed rabbitmq node, manually rename erl_crash.dump to erl_crash.dump_bak and the mnesia directory to mnesia_bak.
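The rename in step (1) can be sketched as a small shell function. The note does not say where the two items live, so the directory is passed in as an argument; /var/lib/rabbitmq in the example call is an assumption, and backup_rabbit_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: move aside the crash dump and the Mnesia directory on the failed node.
# ASSUMPTION: both items live under the directory passed as $1; the original
# note does not state the path, so check it on your node first.
backup_rabbit_state() {
    dir="$1"
    for name in erl_crash.dump mnesia; do
        if [ -e "$dir/$name" ]; then
            # Refuse to overwrite an earlier backup.
            if [ -e "$dir/${name}_bak" ]; then
                echo "backup already exists: $dir/${name}_bak" >&2
                return 1
            fi
            mv -- "$dir/$name" "$dir/${name}_bak"
            echo "renamed $name -> ${name}_bak"
        else
            echo "skipped $name (not found)"
        fi
    done
}

# Example call (the path is an assumption):
# backup_rabbit_state /var/lib/rabbitmq
```

Keeping the old data under a _bak name, rather than deleting it, leaves a way back if the node turns out to hold data that is needed later.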
(2) Start the rabbitmq-server service
systemctl start rabbitmq-server
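After startup, a fresh mnesia directory is created automatically (a new erl_crash.dump appears only if the node crashes again). The service is now running, but the cluster may still hold inconsistent data about this node, so remove the failed node from the cluster manually from one of the other nodes.
(3) Remove the failed node from the cluster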
[root@controller5422 ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@controller5422
[{nodes,[{disc,[rabbit@controller5422,rabbit@controller5423,
rabbit@controller5424]}]},
{running_nodes,[rabbit@controller5424,rabbit@controller5422]},
{cluster_name,<<"rabbitmq_cluster_neocu">>},
{partitions,[]},
{alarms,[{rabbit@controller5424,[]},{rabbit@controller5422,[]}]}]
[root@controller5422 ~]#
[root@controller5422 ~]# rabbitmqctl -n rabbit@controller5422 forget_cluster_node rabbit@controller5423
Removing node rabbit@controller5423 from cluster
[root@controller5422 ~]#
[root@controller5422 ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@controller5422
[{nodes,[{disc,[rabbit@controller5422,rabbit@controller5424]}]},
{running_nodes,[rabbit@controller5424,rabbit@controller5422]},
{cluster_name,<<"rabbitmq_cluster_neocu">>},
{partitions,[]},
{alarms,[{rabbit@controller5424,[]},{rabbit@controller5422,[]}]}]
[root@controller5422 ~]#
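In the first cluster_status output above, rabbit@controller5423 is listed under nodes but not under running_nodes, which is what identifies it as the node to forget. A minimal sketch that extracts this difference follows. It assumes the Erlang-term output format shown in this note (newer RabbitMQ releases print a different layout) and node names that start with rabbit@; stopped_nodes is a hypothetical helper, not a rabbitmqctl command.

```shell
#!/bin/sh
# Sketch: print cluster members that do not appear in running_nodes.
# ASSUMPTIONS: stdin is cluster_status output in the Erlang-term format shown
# in this note, and every node name starts with "rabbit@".
stopped_nodes() {
    # Join wrapped lines so each list can be cut out with sed.
    flat=$(tr -d '\n ')
    # Everything between "{nodes," and "{running_nodes" holds the member list.
    members=$(printf '%s\n' "$flat" |
        sed 's/.*{nodes,//; s/{running_nodes.*//' |
        grep -o 'rabbit@[A-Za-z0-9_.-]*' | sort -u)
    # The bracketed list right after "{running_nodes," holds the live nodes.
    running=$(printf '%s\n' "$flat" |
        sed 's/.*{running_nodes,\[//; s/].*//' |
        grep -o 'rabbit@[A-Za-z0-9_.-]*' | sort -u)
    for node in $members; do
        printf '%s\n' "$running" | grep -qxF "$node" || printf '%s\n' "$node"
    done
}

# Example: rabbitmqctl cluster_status | stopped_nodes
```

A node printed here is only a candidate: confirm that it is really the failed node before passing it to forget_cluster_node.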
(4) Manually rejoin the failed node to the cluster
Run the following on the failed node:
[root@controller5423 rabbitmq]# rabbitmqctl stop_app
Stopping rabbit application on node rabbit@controller5423
[root@controller5423 rabbitmq]#
[root@controller5423 rabbitmq]#
[root@controller5423 rabbitmq]# rabbitmqctl reset
Resetting node rabbit@controller5423
[root@controller5423 rabbitmq]#
[root@controller5423 rabbitmq]# rabbitmqctl join_cluster "rabbit@controller5422"
Clustering node rabbit@controller5423 with rabbit@controller5422
[root@controller5423 rabbitmq]#
[root@controller5423 rabbitmq]# rabbitmqctl start_app
Starting node rabbit@controller5423
[root@controller5423 rabbitmq]#
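The four commands of step (4) can be wrapped in one function that stops at the first failure. This is a sketch, not a tested recovery tool. The RABBITMQCTL variable is a hypothetical override added here so the sequence can be dry-run (for example with RABBITMQCTL=echo); it is not a rabbitmqctl feature.

```shell
#!/bin/sh
# Sketch of step (4) as a single function. Run it on the failed node.
# RABBITMQCTL is a hypothetical override for dry runs, e.g. RABBITMQCTL=echo.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

rejoin_cluster() {
    seed="$1"   # a healthy cluster member, e.g. rabbit@controller5422
    if [ -z "$seed" ]; then
        echo "usage: rejoin_cluster rabbit@<healthy-node>" >&2
        return 2
    fi
    # Chain with && so a failed reset or join never reaches start_app.
    $RABBITMQCTL stop_app &&
    $RABBITMQCTL reset &&
    $RABBITMQCTL join_cluster "$seed" &&
    $RABBITMQCTL start_app
}

# Example: rejoin_cluster rabbit@controller5422
```

Note that reset wipes the node's local state, so run this only on the node that was just removed from the cluster in step (3).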
Other operations commands
cat /var/lib/rabbitmq/.erlang.cookie    # view the cluster cookie
rabbitmqctl stop_app
rabbitmqctl start_app
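Nodes can only form a cluster when they share the same Erlang cookie, so a mismatched cookie is worth ruling out when a node refuses to join. The sketch below prints a hash of the cookie instead of the cookie itself, so the value can be compared across nodes without exposing the secret. cookie_fingerprint is a hypothetical helper, and it assumes sha256sum is installed.

```shell
#!/bin/sh
# Sketch: print a SHA-256 fingerprint of the Erlang cookie file.
# cookie_fingerprint is a hypothetical helper; the default path is the one
# used by the cat command in this note.
cookie_fingerprint() {
    file="${1:-/var/lib/rabbitmq/.erlang.cookie}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # Hash the raw bytes, so even a trailing-newline difference shows up.
    sha256sum < "$file" | cut -d' ' -f1
}

# Example: run on every node and compare the printed values.
# cookie_fingerprint
```

If the fingerprints differ between nodes, the cookie files are not byte-identical and should be inspected before retrying join_cluster.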