Installing and Configuring a corosync Cluster, with a Detailed Guide to crmsh

Latest recommended article published on 2023-08-17 11:39:42

weixin_34292402

Tags: operations, shell, operating system

Original link: http://blog.51cto.com/lidefu/1399158

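A simple corosync + pacemaker setup

Prerequisites:
1) The setup uses two test nodes, lidefu3 and lidefu4, with IP addresses 172.16.21.3 and 172.16.21.4 respectively;
2) The clustered service is Apache httpd;
3) The web service is offered on 172.16.21.21, which is the VIP;
4) The operating system is CentOS 6.5, 64-bit.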

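1. Preparation

To turn a Linux host into an HA node, the following preparation is usually needed:

1) Name resolution between every node's host name and IP address must work, and each node's host name must match the output of "uname -n". To that end, make sure /etc/hosts on both nodes contains the following:
172.16.21.3 lidefu3
172.16.21.4 lidefu4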

To keep these host names after a reboot, also run commands like the following on each node:

Node1:
# sed -i 's@\(HOSTNAME=\).*@\1lidefu3@g' /etc/sysconfig/network
# hostname lidefu3

Node2:
# sed -i 's@\(HOSTNAME=\).*@\1lidefu4@g' /etc/sysconfig/network
# hostname lidefu4
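
The name resolution requirement can be checked with a few lines of shell. The following is a minimal sketch: the helper name hosts_has_entry and its optional file argument are illustrative and not part of any cluster tool.

```shell
# Return success if the hosts file maps the given IP address to the given name.
# Usage: hosts_has_entry IP NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_has_entry() {
    awk -v ip="$1" -v name="$2" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        $1 == ip {
            for (i = 2; i <= NF; i++)          # any name or alias on the line may match
                if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${3:-/etc/hosts}"
}

# Example (run on each node):
#   hosts_has_entry 172.16.21.3 lidefu3 && hosts_has_entry 172.16.21.4 lidefu4
```

Running the same check on both nodes catches the case where only one hosts file was updated.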

2) Set up key-based ssh between the two nodes, for example with commands like these:
Node1:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@lidefu4

Node2:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@lidefu3

Install corosync; this must be done on both nodes:

# yum install -y corosync

List the files installed by the package:

# rpm -ql corosync

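Configure corosync; the configuration must be identical on both nodes:

# cd /etc/corosync/
# cp corosync.conf.example corosync.conf
# vim corosync.conf

compatibility: whitetank        # whether to stay compatible with versions before 0.8
totem {                         # settings for corosync's totem protocol
        version: 2              # protocol version
        secauth: on             # enable authentication; if it is off, anyone who knows the multicast address can easily join the cluster. It costs more CPU, but enabling it is still recommended
        threads: 0              # number of parallel threads used for authentication; 0 means the default
        interface {             # sub-section
                ringnumber: 0   # ring number; on a host with several NICs it keeps heartbeat messages sent by this host from being picked up by its other NICs
                bindnetaddr: 172.16.21.0  # network address the cluster binds to
                mcastaddr: 226.16.21.1     # multicast address
                mcastport: 5405           # multicast port
                ttl: 1                    # multicast TTL of 1, so packets travel only one hop
        }
}

logging {                      # logging settings
        fileline: off
        to_stderr: no          # whether to send log messages to standard error
        to_logfile: yes        # whether to write to a log file
        to_syslog: no          # whether to send to syslog
        logfile: /var/log/cluster/corosync.log    # log file
        debug: off             # whether to enable debug output
        timestamp: on          # whether to timestamp log entries (costs some I/O)
        logger_subsys {        # per-subsystem logging
                subsys: AMF    # the AMF subsystem configured below
                debug: off     # debug output for this subsystem
        }
}

amf {                      # settings for the AMF programming interface
        mode: disabled
}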
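
The bindnetaddr value is the network address of the interface corosync binds to, that is, the node's IP address ANDed with its netmask. The sketch below works that out in shell. The 255.255.255.0 netmask in the example is an assumption, since the article does not state the netmask, and network_address is an illustrative helper name.

```shell
# Print the network address for a dotted-quad IP address and netmask.
# Usage: network_address IP NETMASK
network_address() {
    addr=$1
    mask=$2
    saved_ifs=$IFS
    IFS=.
    # Split both dotted quads into eight positional parameters.
    set -- $addr $mask
    IFS=$saved_ifs
    # AND each address octet with the matching netmask octet.
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

# Example, assuming a /24 netmask on the cluster network:
#   network_address 172.16.21.3 255.255.255.0
```

With a wider netmask the result changes, so the value in corosync.conf should follow the netmask actually configured on the interface.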

Install pacemaker:

# yum install pacemaker -y  # note: it does not seem able to coexist with heartbeat
# rpm -q pacemaker
# rpm -ql pacemaker

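Edit corosync.conf so that pacemaker starts together with corosync, by appending the following to the end of the file:

service {                 # the service to load
        ver: 0            # version
        name: pacemaker   # service name; pacemaker is started whenever corosync starts
}
aisexec {                 # the identity the service above runs as; root is the default, so this section can be omitted
        user: root        # run as user root
        group: root       # run as group root
}

pacemaker can in fact be started as a service of its own; it is only because the corosync version used here is fairly old that it has to be started together with corosync.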
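
A misspelled section name in corosync.conf is easy to miss, so it can help to check that the file really contains a service section naming pacemaker before starting the cluster. This is a minimal sketch: has_pacemaker_service is a hypothetical helper, not a corosync command, and it assumes the one-setting-per-line layout used in this article.

```shell
# Return success if the config file has a "service { ... }" section
# containing "name: pacemaker".
# Usage: has_pacemaker_service [FILE]   (FILE defaults to /etc/corosync/corosync.conf)
has_pacemaker_service() {
    awk '
        /^[[:space:]]*service[[:space:]]*[{]/ { in_service = 1 }   # section opens
        in_service && /name:[[:space:]]*pacemaker/ { found = 1 }
        /[}]/ { in_service = 0 }                                   # any section closes
        END { exit (found ? 0 : 1) }
    ' "${1:-/etc/corosync/corosync.conf}"
}

# Example:
#   has_pacemaker_service || echo "no pacemaker service section found"
```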

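Generate the key file used to secure cluster communication:

# corosync-keygen     # generates the authentication key; type on the keyboard to supply random data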
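
The key should stay readable by root only, including after it has been copied to the other node. The sketch below checks the file mode. The expected mode 0400 and the default path /etc/corosync/authkey are assumptions about what corosync-keygen normally produces, and key_mode_ok is an illustrative name.

```shell
# Return success if the key file exists as a regular file with mode 0400.
# Usage: key_mode_ok [FILE]   (FILE defaults to /etc/corosync/authkey, an assumed path)
key_mode_ok() {
    key=${1:-/etc/corosync/authkey}
    # find prints the path only when the permission bits are exactly 400.
    [ -n "$(find "$key" -prune -type f -perm 400 2>/dev/null)" ]
}

# Example:
#   key_mode_ok || echo "authkey is missing or has unexpected permissions"
```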

Install crmsh and pssh on both nodes. crmsh is the command-line interface to pacemaker, and installing it requires pssh:

# yum install crmsh-1.2.6-4.el6.x86_64.rpm pssh-2.3.1-2.el6.x86_64.rpm

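Copy the key and the configuration file to the other node (run from /etc/corosync on lidefu3), then start corosync:

# scp -p authkey corosync.conf lidefu4:/etc/corosync
# service corosync start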

Check the log to see whether corosync started successfully:

# less /var/log/cluster/corosync.log

# grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
Apr 20 17:39:36 corosync [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Apr 20 17:39:36 corosync [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'

Check that the membership initialization notifications were sent:

# grep  TOTEM  /var/log/cluster/corosync.log
Apr 20 17:39:36 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Apr 20 17:39:36 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Apr 20 17:39:36 corosync [TOTEM ] The network interface [172.16.21.3] is now up.
Apr 20 17:39:37 corosync [TOTEM ] Type of received message is wrong...  ignoring 25.
Apr 20 17:39:37 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
...

Check whether any errors occurred during startup. The error messages below mean that pacemaker will soon no longer run as a corosync plugin, so cman is recommended as the cluster infrastructure service instead; they can be safely ignored here.

# grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
Apr 20 17:39:37 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Apr 20 17:39:37 corosync [pcmk  ] ERROR: process_ais_conf:  Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN

Check that pacemaker started normally:

# grep pcmk_startup /var/log/cluster/corosync.log
Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: CRM: Initialized
Apr 20 17:39:37 corosync [pcmk  ] Logging: Initialized pcmk_startup
Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: Service: 9
Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: Local hostname: lidefu3
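
The separate grep checks can be combined into one pass over the log. The sketch below only looks for message fragments quoted in this article. check_corosync_log is an illustrative name, and the default path is the logfile set in corosync.conf.

```shell
# Report which of the expected startup messages appear in the corosync log.
# Usage: check_corosync_log [FILE]
# Returns non-zero if any expected message is missing.
check_corosync_log() {
    log=${1:-/var/log/cluster/corosync.log}
    status=0
    for pattern in \
        "Corosync Cluster Engine" \
        "Successfully read main configuration file" \
        "The network interface .* is now up" \
        "pcmk_startup: CRM: Initialized"
    do
        if grep -q -e "$pattern" "$log"; then
            echo "ok:      $pattern"
        else
            echo "missing: $pattern"
            status=1
        fi
    done
    return $status
}

# Example:
#   check_corosync_log || echo "corosync or pacemaker did not start cleanly"
```

The ERROR lines about the pacemaker plugin are deliberately not treated as failures here, since the article notes they can be ignored.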

If crmsh is installed, the following command shows the startup state of the cluster nodes:

# crm status   # show the cluster status
Last updated: Sun Apr 20 18:14:01 2014
Last change: Sun Apr 20 18:00:25 2014 via crmd on lidefu3
Stack: classic openais (with plugin)
Current DC: NONE
1 Nodes configured, 2 expected votes
0 Resources configured


Node lidefu3: UNCLEAN (offline)
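
When scripting around crm status, the node lines are the part worth checking. The sketch below reads status text on standard input and prints every node reported as UNCLEAN or offline. It assumes the "Node NAME: STATE" line format shown in the output above, and list_bad_nodes is a hypothetical helper.

```shell
# Print the names of nodes whose status line mentions UNCLEAN or offline.
# Reads "crm status" output on standard input.
list_bad_nodes() {
    # Matching lines look like: "Node lidefu3: UNCLEAN (offline)"
    awk '$1 == "Node" && /UNCLEAN|offline/ { sub(/:$/, "", $2); print $2 }'
}

# Example:
#   crm status | list_bad_nodes
```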

# crm   # run crm with no arguments to enter the interactive shell
crm(live)# help      # "live" means commands act directly on the running configuration, so changes take effect as soon as they are made

This is crm shell, a Pacemaker command line interface.

Available commands:

        cib              manage shadow CIBs
        resource         resources management
        configure        CRM cluster configuration
        node             nodes management
        options          user preferences
        history          CRM cluster history
        site             Geo-cluster support
        ra               resource agents information center
        status           show cluster status
        help,?           show help (help topics for list of topics)
        end,cd,up        go back one level
        quit,bye,exit    exit the program

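Disable stonith at the configure level:

crm(live)configure# property stonith-enabled=false # disable stonith
crm(live)configure# verify                         # check the configuration for errors
crm(live)configure# commit                         # commit the changes
crm(live)configure# show                           # show the whole cluster configuration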

Note: when a cluster does not have quorum, resources are not moved even if a node fails.

crm(live)configure# help primitive
primitive <rsc> {[<class>:[<provider>:]]<type>|@<template>}
          [params attr_list]
          [meta attr_list]
          [utilization attr_list]
          [operations id_spec]
            [op op_type [<attribute>=<value>...] ...]

        attr_list :: [$id=<id>] <attr>=<val> [<attr>=<val>...] | $id-ref=<id>
        id_spec :: $id=<id> | $id-ref=<id>
        op_type :: start | stop | monitor

List the resource agent classes, and the agents available in a class, with classes and list:

crm(live)ra# classes   # list the resource agent classes
lsb
ocf / heartbeat pacemaker
service
stonith
crm(live)ra# list lsb  # list the resources this class can manage
NetworkManager     abrt-ccpp          abrt-oops          abrtd              acpid              atd
auditd             autofs             blk-availability   bluetooth          corosync           corosync-notifyd
...

Show the parameters a resource agent accepts with meta:

crm(live)ra# classes
crm(live)ra# list ocf
crm(live)ra# meta ocf:heartbeat:IPaddr2

Parameters (* denotes required, [] the default):

ip* (string): IPv4 or IPv6 address
nic (string): Network interface
broadcast (string): Broadcast address
iflabel (string): Interface label
lvs_support (boolean, [false]): Enable support for LVS DR

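Define the resources with primitive:

crm(live)configure# primitive webip ocf:heartbeat:IPaddr2 params ip=172.16.21.21
crm(live)configure# primitive webserver lsb:httpd
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show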

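Define a group. With corosync the group is defined after its resources, unlike heartbeat, where the group has to be defined first:

crm(live)configure# group webservice webip webserver
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# show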

Request the web page:

# curl http://172.16.21.21       # a browser on a Windows machine works as well
this is fours page33333333333333

Put node lidefu3 into standby:

crm(live)# node
crm(live)node# standby lidefu3
crm(live)node# cd ..
crm(live)# status

Request the web page again:

# curl http://172.16.21.21
this is four page444444444444
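
To watch a failover from the client side, the curl check can be repeated until the page served from the VIP changes. This is a rough sketch: fetch_page and wait_for_text are illustrative names, and the one-second pause and the default of 30 attempts are arbitrary choices.

```shell
# Fetch a URL quietly; kept as a separate function so it can be swapped out.
fetch_page() {
    curl -s --max-time 2 "$1"
}

# Poll URL until the response contains TEXT, trying at most TRIES times.
# Usage: wait_for_text URL TEXT [TRIES]
wait_for_text() {
    url=$1
    text=$2
    tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        case "$(fetch_page "$url")" in
            *"$text"*) return 0 ;;      # expected text found
        esac
        tries=$((tries - 1))
        sleep 1
    done
    return 1                            # gave up without seeing the text
}

# Example: after putting lidefu3 into standby, wait for the other node's page.
#   wait_for_text http://172.16.21.21 444 && echo "page now served by the other node"
```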

Bring lidefu3 back online:

# crm node online lidefu3
# crm status       # the resources do not move back, because no location preference has been defined in the steps so far

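One way to delete a resource:

# crm
crm(live)# configure
crm(live)configure# edit  # resources can be deleted directly in the editor, but this is not recommended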

Stop the resource with stop first, and only then delete it:

# crm
crm(live)# resource
crm(live)resource# stop webservice
crm(live)resource# status webservice

Delete the resource with delete:

# crm
crm(live)# configure
crm(live)configure# delete webservice # note: this deletes only the webservice group, not the resources inside it

Colocation constraint syntax:

Usage:
...............
        colocation <id> <score>: <rsc>[:<role>] <rsc>[:<role>] ...
          [node-attribute=<node_attr>]
...............
Example:
...............
        colocation dummy_and_apache -inf: apache dummy
        colocation c1 inf: A ( B C )
...............

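crm(live)# configure
crm(live)configure# colocation webserver_and_webip inf: webserver webip # the first resource follows the second: webserver is placed wherever webip runs
crm(live)configure# verify
crm(live)configure# commit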

Show the details of all resources as XML:

crm(live)# configure
crm(live)configure# show xml

Define an order constraint, assuming the resources start one after another:

Usage:
...............
        order <id> {kind|<score>}: <rsc>[:<action>] <rsc>[:<action>] ...
          [symmetrical=<bool>]

        kind :: Mandatory | Optional | Serialize
...............
Example:
...............
        order c_apache_1 Mandatory: apache:start ip_1
        order o1 Serialize: A ( B C )
        order order_2 Mandatory: [ A B ] C
...............

Mandatory is always enforced, Optional applies only when both resources are being started or stopped, and Serialize only keeps the actions from running at the same time.

crm(live)configure# order webip_before_webserver Mandatory: webip webserver

Location constraint syntax:

 location <id> <rsc> {node_pref|rules}

        node_pref :: <score>: <node>

        rules ::
          rule [id_spec] [$role=<role>] <score>: <expression>
          [rule [id_spec] [$role=<role>] <score>: <expression> ...]

        id_spec :: $id=<id> | $id-ref=<id>
        score :: <number> | <attribute> | [-]inf
        expression :: <simple_exp> [bool_op <simple_exp> ...]
        bool_op :: or | and
        simple_exp :: <attribute> [type:]<binary_op> <value>
                      | <unary_op> <attribute>
                      | date <date_expr>
        type :: string | version | number
        binary_op :: lt | gt | lte | gte | eq | ne
        unary_op :: defined | not_defined

        date_expr :: lt <end>
                     | gt <start>
                     | in_range start=<start> end=<end>
                     | in_range start=<start> <duration>
                     | date_spec <date_spec>
        duration|date_spec ::
                     hours=<value>
                     | monthdays=<value>
                     | weekdays=<value>
                     | yearsdays=<value>
                     | months=<value>
                     | weeks=<value>
                     | years=<value>
                     | weekyears=<value>
                     | moon=<value>
...............
Examples:
...............
        location conn_1 internal_www 100: node1

        location conn_1 internal_www \
          rule 50: #uname eq node1 \
          rule pingd: defined pingd

        location conn_2 dummy_float \
          rule -inf: not_defined pingd or pingd number:lte 0

crm(live)configure# location webip_on_lidefu4 webip 200: lidefu4

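Tell the cluster to ignore loss of quorum:

crm(live)# configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# verify
crm(live)configure# commit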
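
The interactive steps can also be collected into a single file of crm configure statements. The sketch below writes such a file from two variables. It uses only the resource names, constraints, and properties defined in this article, with the resource agent written out in full as ocf:heartbeat:IPaddr2. The function name write_web_config and the output path are hypothetical.

```shell
# Write the web service configuration from this article to a file.
# Usage: write_web_config VIP PREFERRED_NODE OUTPUT_FILE
write_web_config() {
    vip=$1
    preferred=$2
    out=$3
    cat > "$out" <<EOF
primitive webip ocf:heartbeat:IPaddr2 params ip=$vip
primitive webserver lsb:httpd
colocation webserver_and_webip inf: webserver webip
order webip_before_webserver Mandatory: webip webserver
location webip_on_$preferred webip 200: $preferred
property stonith-enabled=false
property no-quorum-policy=ignore
EOF
}

# Example:
#   write_web_config 172.16.21.21 lidefu4 /tmp/web.crm
#   crm configure load update /tmp/web.crm   # assumes crmsh's "load update" subcommand
```

Keeping the configuration in a file makes it easy to review before loading and to reapply on a rebuilt test cluster.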

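Set default resource attributes with rsc_defaults (tab completion lists the available attributes):

crm(live)configure# rsc_defaults
allow-migrate=        interval-origin=      multiple-active=      restart-type=
description=          is-managed=           priority=             target-role=
failure-timeout=      migration-threshold=  resource-stickiness=

target-role appears to control whether a resource is started as soon as it is defined. resource-stickiness is how strongly a resource prefers to stay where it is, and it counts only for the node the resource is currently running on.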
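
Resource stickiness is easiest to see with numbers. The sketch below is a deliberately simplified model, not Pacemaker's actual placement algorithm: each node's total is its location score, plus the stickiness if the resource already runs there, and the highest total wins. It ignores colocation, groups, and failure counts, and preferred_node is a hypothetical helper.

```shell
# Simplified placement model: pick the node with the highest total score.
# Usage: preferred_node CURRENT_NODE STICKINESS NODE:SCORE...
preferred_node() {
    current=$1
    stickiness=$2
    shift 2
    best_node=
    best_score=
    for pair in "$@"; do
        node=${pair%%:*}
        score=${pair#*:}
        # Stickiness counts only for the node the resource is running on.
        if [ "$node" = "$current" ]; then
            score=$((score + stickiness))
        fi
        if [ -z "$best_node" ] || [ "$score" -gt "$best_score" ]; then
            best_node=$node
            best_score=$score
        fi
    done
    echo "$best_node"
}

# Examples, using the 200-point preference for lidefu4 from this article:
#   preferred_node lidefu3 0   lidefu3:0 lidefu4:200   # resource moves to lidefu4
#   preferred_node lidefu3 300 lidefu3:0 lidefu4:200   # resource stays on lidefu3
```

In this model a stickiness above 200 outweighs the location preference, which is the kind of trade-off resource-stickiness is meant to express.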

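The configuration so far only moves resources when a node fails; it does not say what to do when a resource itself fails. Resource monitoring is therefore defined next.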

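# killall httpd                          # terminate the running service abnormally
# crm
crm(live)# status                        # check the status: it still reports everything as running normally
crm(live)# resource                      # enter the resource level
crm(live)resource# stop webserver        # stop the resource
crm(live)resource# stop webip            # stop the resource
crm(live)resource# cleanup webserver     # clean up the state of the abnormally stopped resource
crm(live)resource# cleanup webip         # clean up the state of the abnormally stopped resource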

Monitor syntax:

Usage:
...............
        monitor <rsc>[:<role>] <interval>[:<timeout>]
...............
Example:
...............
        monitor apcfence 60m:60s
...............

crm(live)configure# monitor webserver 20s:15s
crm(live)configure# verify
crm(live)configure# commit
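
The monitor argument packs two values into one token: the interval between checks and, after the colon, the timeout for each check. A small sketch that splits such a token, using the hypothetical helper name split_monitor_spec:

```shell
# Split an "<interval>[:<timeout>]" token into its two parts.
# Usage: split_monitor_spec SPEC
split_monitor_spec() {
    spec=$1
    interval=${spec%%:*}              # everything before the first colon
    case "$spec" in
        *:*) timeout=${spec#*:} ;;    # everything after the first colon
        *)   timeout= ;;              # the timeout is optional in the syntax
    esac
    echo "interval=$interval timeout=$timeout"
}

# Example:
#   split_monitor_spec 20s:15s
```

For the webserver resource this means a check every 20 seconds, with each check allowed up to 15 seconds to finish.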

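With monitoring defined, try stopping httpd on the node that is currently running it, wait 15 seconds or more (the monitor runs every 20 seconds), and check whether the service has been started again. In the author's test it was restarted successfully.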

Reposted from: https://blog.51cto.com/lidefu/1399158
