A Simple corosync + pacemaker Setup

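Prerequisites:
1) This setup uses two test nodes, lidefu3 and lidefu4, with IP addresses 172.16.21.3 and 172.16.21.4 respectively;
2) The clustered service is Apache httpd;
3) The web service is offered on 172.16.21.21, the virtual IP (VIP);
4) The operating system is CentOS 6.5, 64-bit.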

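1. Preparation

Before a Linux host can act as an HA node, the following preparation is usually needed:

1) Hostname resolution must work for every node, and each node's hostname must match the output of "uname -n". Make sure /etc/hosts on both nodes contains the following entries:
172.16.21.3 lidefu3
172.16.21.4 lidefu4

To keep these hostnames across reboots, run commands like the following on each node: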

Node1:
# sed -i 's@\(HOSTNAME=\).*@\1lidefu3@g'  /etc/sysconfig/network
# hostname lidefu3

Node2:
# sed -i 's@\(HOSTNAME=\).*@\1lidefu4@g' /etc/sysconfig/network
# hostname lidefu4
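
The name resolution requirement can be checked with a small helper. The sketch below assumes bash and awk; the function name has_host_entry is hypothetical and not part of any cluster tool.

```shell
#!/bin/bash
# has_host_entry FILE IP NAME
# Succeeds when FILE (hosts format) maps IP to NAME; comments are ignored.
has_host_entry() {
    local file=$1 ip=$2 name=$3
    awk -v ip="$ip" -v name="$name" '
        { sub(/#.*/, "") }                  # strip comments
        $1 == ip {
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit !found }
    ' "$file"
}

# Example: report entries missing from the local hosts file.
# has_host_entry /etc/hosts 172.16.21.3 lidefu3 || echo "lidefu3 missing"
# has_host_entry /etc/hosts 172.16.21.4 lidefu4 || echo "lidefu4 missing"
# [ "$(hostname)" = "$(uname -n)" ] || echo "hostname mismatch"
```

Running the commented checks on both nodes before installing anything catches a mistyped name early.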

2) Set up key-based SSH between the two nodes, for example:
Node1:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@lidefu4

Node2:
# ssh-keygen -t rsa -P ''
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@lidefu3

Install corosync on both nodes:

    yum install -y corosync

List the files installed by the package:

    rpm -ql corosync

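Configure corosync. The configuration must be identical on both nodes. The trailing # comments below are annotations for this article; it is safer to leave them out of the real file, as corosync may not treat text after a value as a comment.

    cd /etc/corosync/
    cp corosync.conf.example corosync.conf
    vim corosync.conf

    compatibility: whitetank        # stay backward compatible with openais 0.80.z (whitetank)
    totem {                         # the totem protocol used by corosync
            version: 2              # protocol version
            secauth: on             # enable authentication; if it is off, anyone who knows the multicast address can easily join the cluster. It costs extra CPU but is still recommended
            threads: 0              # number of threads used for authentication; 0 is the default
            interface {             # interface sub-section
                    ringnumber: 0   # ring number; on a host with several NICs it keeps the other NICs from picking up heartbeats sent by the same host
                    bindnetaddr: 172.16.21.0  # network address the cluster binds to
                    mcastaddr: 226.16.21.1    # multicast address
                    mcastport: 5405           # multicast port
                    ttl: 1                    # TTL of multicast packets; 1 keeps them on the local segment
            }
    }

    logging {                      # logging settings
            fileline: off
            to_stderr: no          # send log messages to standard error?
            to_logfile: yes        # write to a log file?
            to_syslog: no          # send to syslog?
            logfile: /var/log/cluster/corosync.log    # log file path
            debug: off             # enable debug output?
            timestamp: on          # timestamp each log entry (costs some I/O)
            logger_subsys {        # per-subsystem logging
                    subsys: AMF    # the AMF subsystem, see below
                    debug: off     # debug output for this subsystem
            }
    }

    amf {                      # settings for the AMF programming interface
            mode: disabled
    }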
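
The bindnetaddr value is the network address of the interface, that is, the node IP ANDed with its netmask. The article does not state the netmask; the sketch below assumes 255.255.255.0, which gives the 172.16.21.0 used above. The function name network_addr is hypothetical.

```shell
#!/bin/bash
# network_addr IP NETMASK
# Prints the network address (bitwise AND of each octet).
network_addr() {
    local IFS=.
    local -a ip=($1) mask=($2)   # split the dotted quads on "."
    echo "$((ip[0] & mask[0])).$((ip[1] & mask[1])).$((ip[2] & mask[2])).$((ip[3] & mask[3]))"
}

network_addr 172.16.21.3 255.255.255.0   # prints 172.16.21.0
```

If the real netmask is different, substitute it and use the printed value for bindnetaddr.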
 
  
Install pacemaker and check the result:

    yum install -y pacemaker  # note: it does not seem to coexist with heartbeat
    rpm -q pacemaker
    rpm -ql pacemaker

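Edit corosync.conf so that pacemaker starts together with corosync, by appending the following to the end of the file:

    service {                 # service definition
            ver: 0            # version
            name: pacemaker   # service name; pacemaker is started whenever corosync starts
    }
    aisexec {                 # identity used to start the service above; root is the default, so this block can be omitted
            user: root        # run as user root
            group: root       # run as group root
    }

Pacemaker can actually be started as a service of its own; it is started together with corosync here only because this corosync version is fairly old.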
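
A misspelled section name is easy to miss, because the block is then simply ignored. The following sketch checks that a file really contains a service block naming pacemaker. It assumes bash and awk, and has_pacemaker_service is a hypothetical helper name.

```shell
#!/bin/bash
# has_pacemaker_service FILE
# Succeeds when FILE contains a "service { ... name: pacemaker ... }" block.
has_pacemaker_service() {
    awk '
        { sub(/#.*/, "") }                                   # strip comments
        /^[ \t]*service[ \t]*[{]/ { in_svc = 1; next }       # block opens
        in_svc && /[}]/           { in_svc = 0 }             # block closes
        in_svc && /^[ \t]*name:[ \t]*pacemaker[ \t]*$/ { found = 1 }
        END { exit !found }
    ' "$1"
}

# Example:
# has_pacemaker_service /etc/corosync/corosync.conf || echo "pacemaker service block not found"
```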

Generate the key file that secures cluster communication:

    corosync-keygen     # generates the authentication key; type on the keyboard to supply randomness
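
The generated key should be readable by root only. As far as I know corosync-keygen writes /etc/corosync/authkey with mode 0400, so a quick permission check after generating or copying the key can catch a careless copy. The sketch assumes GNU stat; key_mode_ok is a hypothetical helper name.

```shell
#!/bin/bash
# key_mode_ok FILE
# Succeeds when FILE exists and its permission bits are exactly 400.
key_mode_ok() {
    [ -f "$1" ] && [ "$(stat -c %a "$1")" = "400" ]
}

# Example (as root, on each node):
# key_mode_ok /etc/corosync/authkey || echo "check authkey permissions"
```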

Install crmsh and pssh on both nodes. crmsh is the command-line interface to pacemaker, and installing it requires pssh:

    yum install crmsh-1.2.6-4.el6.x86_64.rpm pssh-2.3.1-2.el6.x86_64.rpm
 
  
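Copy the key and the configuration file to the other node (run from /etc/corosync on lidefu3):

    scp -p authkey corosync.conf lidefu4:/etc/corosync/

Start corosync:

    service corosync start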

Check the log to see whether corosync started successfully:

    less /var/log/cluster/corosync.log

    grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
    Apr 20 17:39:36 corosync [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
    Apr 20 17:39:36 corosync [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'


Check that the initial membership notifications were sent:

    grep  TOTEM  /var/log/cluster/corosync.log
    Apr 20 17:39:36 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
    Apr 20 17:39:36 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
    Apr 20 17:39:36 corosync [TOTEM ] The network interface [172.16.21.3] is now up.
    Apr 20 17:39:37 corosync [TOTEM ] Type of received message is wrong...  ignoring 25.
    Apr 20 17:39:37 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    ...
Check whether any errors occurred during startup. The error messages below mean that pacemaker will soon no longer run as a corosync plugin, so cman is recommended as the cluster infrastructure service; they can be safely ignored here.

    grep ERROR: /var/log/cluster/corosync.log | grep -v unpack_resources
    Apr 20 17:39:37 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
    Apr 20 17:39:37 corosync [pcmk  ] ERROR: process_ais_conf:  Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN


Check that pacemaker started normally:

    grep pcmk_startup /var/log/cluster/corosync.log
    Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: CRM: Initialized
    Apr 20 17:39:37 corosync [pcmk  ] Logging: Initialized pcmk_startup
    Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
    Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: Service: 9
    Apr 20 17:39:37 corosync [pcmk  ] info: pcmk_startup: Local hostname: lidefu3
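
The separate grep checks above can be rolled into one pass over the log. This is a minimal sketch that only looks for message texts appearing in the sample output of this article; check_corosync_log is a hypothetical name, and other corosync versions may word these messages differently.

```shell
#!/bin/bash
# check_corosync_log FILE
# Prints one line per missing startup message; returns non-zero if any is missing.
check_corosync_log() {
    local log=$1 rc=0 pattern
    for pattern in \
        "Corosync Cluster Engine" \
        "Successfully read main configuration file" \
        "The network interface .* is now up" \
        "pcmk_startup: CRM: Initialized"
    do
        if ! grep -q -- "$pattern" "$log"; then
            echo "missing: $pattern"
            rc=1
        fi
    done
    return $rc
}

# Example:
# check_corosync_log /var/log/cluster/corosync.log && echo "startup messages found"
```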


If crmsh is installed, the following command shows the startup status of the cluster nodes:

    crm status   # show cluster status
    Last updated: Sun Apr 20 18:14:01 2014
    Last change: Sun Apr 20 18:00:25 2014 via crmd on lidefu3
    Stack: classic openais (with plugin)
    Current DC: NONE
    1 Nodes configured, 2 expected votes
    0 Resources configured


    Node lidefu3: UNCLEAN (offline)
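
When scripting around crm status, the "Current DC:" line is a convenient health indicator. The sketch below extracts that value from status text on standard input; current_dc is a hypothetical helper, and it relies only on the line format visible in the sample output above, where the value is NONE.

```shell
#!/bin/bash
# current_dc
# Reads `crm status` text on stdin and prints the value of the "Current DC:" line.
current_dc() {
    awk -F': ' '/^Current DC:/ { print $2; exit }'
}

# Example:
# crm status | current_dc
```

A value of NONE, as in the sample, suggests the cluster had not yet elected a coordinator when the status was captured.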
 
  
Enter the crm shell and look at its help:

    crm   # run crm with no arguments
    crm(live)# help      # "live" means you are working directly on the running configuration, so changes apply to the live cluster

    This is crm shell, a Pacemaker command line interface.

    Available commands:

    	cib              manage shadow CIBs
    	resource         resources management
    	configure        CRM cluster configuration
    	node             nodes management
    	options          user preferences
    	history          CRM cluster history
    	site             Geo-cluster support
    	ra               resource agents information center
    	status           show cluster status
    	help,?           show help (help topics for list of topics)
    	end,cd,up        go back one level
    	quit,bye,exit    exit the program
 
  
 
  
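Disable STONITH (at the configure level):

    property stonith-enabled=false # disable STONITH
    verify                         # check the configuration for errors
    commit                         # commit the change
    show                           # show the whole cluster configuration
Note: when a cluster does not have quorum, resources are not moved even if a node fails.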
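
The quorum remark is easier to see with numbers. Assuming the usual majority rule (quorum means more than half of the expected votes), a two-node cluster with 2 expected votes needs both nodes, so a single surviving node has no quorum. The helper names below are hypothetical.

```shell
#!/bin/bash
# quorum_needed EXPECTED_VOTES
# Prints the smallest number of votes that is a strict majority.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum VOTES EXPECTED_VOTES
# Succeeds when VOTES reaches the majority of EXPECTED_VOTES.
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

quorum_needed 2    # prints 2: both nodes are needed
has_quorum 1 2 || echo "one surviving node out of two has no quorum"
```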
 
  
View the syntax for defining a resource:

    help primitive
    primitive <rsc> {[<class>:[<provider>:]]<type>|@<template>}
              [params attr_list]
              [meta attr_list]
              [utilization attr_list]
              [operations id_spec]
                [op op_type [<attribute>=<value>...] ...]

            attr_list :: [$id=<id>] <attr>=<val> [<attr>=<val>...] | $id-ref=<id>
            id_spec :: $id=<id> | $id-ref=<id>
            op_type :: start | stop | monitor
At the ra level, classes lists the resource agent classes and list shows the agents in a class:

    classes   # list resource agent classes
    lsb
    ocf / heartbeat pacemaker
    service
    stonith
    crm(live)ra# list lsb  # list the resources this class can manage
    NetworkManager     abrt-ccpp          abrt-oops          abrtd              acpid              atd
    auditd             autofs             blk-availability   bluetooth          corosync           corosync-notifyd
    ...
View the parameters a resource agent accepts with meta:

    classes
    list ocf
    meta ocf:heartbeat:IPaddr2

    Parameters (* denotes required, [] the default):

    ip* (string): IPv4 or IPv6 address
    nic (string): Network interface
    broadcast (string): Broadcast address
    iflabel (string): Interface label
    lvs_support (boolean, [false]): Enable support for LVS DR
Define the resources with primitive:

    primitive webip ocf:heartbeat:IPaddr2 params ip=172.16.21.21
    primitive webserver lsb:httpd
    verify
    commit
    show

Define a group. With corosync and pacemaker the group is defined after its resources, unlike heartbeat, where the group has to be defined first:

    group webservice webip webserver
    verify
    commit
    show

Access the web page:

    curl http://172.16.21.21       # a browser on a Windows host works as well
    this is fours page33333333333333

Put node lidefu3 into standby:

    node
    standby lidefu3
    cd ..
    status
Access the page again:

    curl http://172.16.21.21
    this is four page444444444444

Bring lidefu3 back online:

    crm node online lidefu3
    crm status       # the resources have not moved back, because no location preference has been defined so far
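
To watch a failover as it happens, the VIP can be polled until the page served by the other node appears. The sketch below is one way to do that with curl; wait_for_text and fetch are hypothetical helper names, and the text to look for depends on the test pages on each node.

```shell
#!/bin/bash
# fetch URL: print the page body (kept separate so it can be swapped out).
fetch() { curl -s --max-time 2 "$1"; }

# wait_for_text URL TEXT [TRIES] [DELAY]
# Polls URL until the body contains TEXT; fails after TRIES attempts.
wait_for_text() {
    local url=$1 text=$2 tries=${3:-10} delay=${4:-1} i
    for ((i = 1; i <= tries; i++)); do
        if fetch "$url" | grep -qF -- "$text"; then
            echo "matched after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example: after putting lidefu3 into standby, wait for the page from lidefu4.
# wait_for_text http://172.16.21.21 444 30 1
```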

One way to delete a resource:

    crm
    configure
    edit  # the entry can be deleted directly in the editor, but this is not recommended

Stop the resource with stop first, and only then delete it:

    crm
    resource
    stop webservice
    status webservice

Delete the resource with delete:

    crm
    configure
    delete webservice # note: this deletes only the webservice group, not the resources inside it
 
  
Define a colocation constraint. The syntax is:

    Usage:
    ...............
            colocation <id> <score>: <rsc>[:<role>] <rsc>[:<role>] ...
              [node-attribute=<node_attr>]
    ...............
    Example:
    ...............
            colocation dummy_and_apache -inf: apache dummy
            colocation c1 inf: A ( B C )
    ...............

Keep webserver on the same node as webip:

    configure
    colocation webserver_and_webip inf: webserver webip # the first resource follows the second: webserver goes wherever webip runs
    verify
    commit

View the full details of all resources:

    configure
    show xml

Define an order constraint, assuming the resources start one after another:

    Usage:
    ...............
            order <id> {kind|<score>}: <rsc>[:<action>] <rsc>[:<action>] ...
              [symmetrical=<bool>]

            kind :: Mandatory (required) | Optional (advisory) | Serialize (never run concurrently)
    ...............
    Example:
    ...............
            order c_apache_1 Mandatory: apache:start ip_1
            order o1 Serialize: A ( B C )
            order order_2 Mandatory: [ A B ] C
    ...............

Start webip before webserver:

    order webip_before_webserver Mandatory: webip webserver

 
 
  
Define a location constraint. The syntax is:

    location <id> <rsc> {node_pref|rules}

            node_pref :: <score>: <node>

            rules ::
              rule [id_spec] [$role=<role>] <score>: <expression>
              [rule [id_spec] [$role=<role>] <score>: <expression> ...]

            id_spec :: $id=<id> | $id-ref=<id>
            score :: <number> | <attribute> | [-]inf
            expression :: <simple_exp> [bool_op <simple_exp> ...]
            bool_op :: or | and
            simple_exp :: <attribute> [type:]<binary_op> <value>
                          | <unary_op> <attribute>
                          | date <date_expr>
            type :: string | version | number
            binary_op :: lt | gt | lte | gte | eq | ne
            unary_op :: defined | not_defined

            date_expr :: lt <end>
                         | gt <start>
                         | in_range start=<start> end=<end>
                         | in_range start=<start> <duration>
                         | date_spec <date_spec>
            duration|date_spec ::
                         hours=<value>
                         | monthdays=<value>
                         | weekdays=<value>
                         | yearsdays=<value>
                         | months=<value>
                         | weeks=<value>
                         | years=<value>
                         | weekyears=<value>
                         | moon=<value>
    ...............
    Examples:
    ...............
            location conn_1 internal_www 100: node1

            location conn_1 internal_www \
              rule 50: #uname eq node1 \
              rule pingd: defined pingd

            location conn_2 dummy_float \
              rule -inf: not_defined pingd or pingd number:lte 0

Give webip a preference of 200 for lidefu4:

    location webip_on_lidefu4 webip 200: lidefu4
 
 
  
 
  
Tell the cluster to ignore loss of quorum. With only two nodes, a single node failure already loses quorum:

    configure
    property no-quorum-policy=ignore
    verify
    commit
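
The interactive steps so far can also be collected into one file and loaded in a single step. The sketch below writes the configuration from this article (without the group, which was deleted again) to a temporary file and only prints it by default. It assumes that crmsh's `configure load update` accepts such a file; APPLY is a hypothetical environment variable used here as a safety switch, and write_cluster_config is a hypothetical helper name.

```shell
#!/bin/bash
# write_cluster_config FILE
# Writes the configuration built up in this article to FILE in crm syntax.
write_cluster_config() {
    cat > "$1" <<'EOF'
primitive webip ocf:heartbeat:IPaddr2 params ip=172.16.21.21
primitive webserver lsb:httpd
colocation webserver_and_webip inf: webserver webip
order webip_before_webserver Mandatory: webip webserver
location webip_on_lidefu4 webip 200: lidefu4
property stonith-enabled=false
property no-quorum-policy=ignore
EOF
}

cfg=$(mktemp)
write_cluster_config "$cfg"

# Only touch the cluster when crm is installed and APPLY=yes was set explicitly.
if [ "${APPLY:-no}" = "yes" ] && command -v crm >/dev/null 2>&1; then
    crm configure load update "$cfg"
else
    cat "$cfg"
fi
```

Keeping the configuration in a file makes it easy to review and to reapply on a rebuilt test cluster.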
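Set default attributes for resources. The list below is the tab completion offered after rsc_defaults:

    crm(live)configure# rsc_defaults
    allow-migrate=        interval-origin=      multiple-active=      restart-type=
    description=          is-managed=           priority=             target-role=
    failure-timeout=      migration-threshold=  resource-stickiness=

Author's notes: target-role possibly controls whether a resource starts as soon as it is defined; resource-stickiness is the resource's stickiness, which only applies to the node it is currently on. I am not yet very familiar with these.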

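The configuration so far only moves resources when a node fails; it does not say what should happen when a resource itself fails. Resource failure handling is defined next.

    killall httpd     # terminate the running service abnormally
    crm
    status            # check the status: the resource is still shown as running normally
    resource          # enter the resource level
    stop webserver    # stop the resource
    stop webip        # stop the resource
    cleanup webserver # clean up the state of the abnormally stopped resource
    cleanup webip     # clean up the state of the abnormally stopped resource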
 
  
Define a monitor operation for the resource. The syntax is:

    Usage:
    ...............
            monitor <rsc>[:<role>] <interval>[:<timeout>]
    ...............
    Example:
    ...............
            monitor apcfence 60m:60s
    ...............

Monitor webserver every 20 seconds with a 15-second timeout (at the crm(live)configure# prompt):

    monitor webserver 20s:15s
    verify
    commit
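With the monitor defined, stop httpd on the node that is currently running it, wait a little longer than the monitor interval (20 seconds here), and check whether it has been started again. In the author's test it was restarted successfully.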
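
The monitor shorthand maps onto the op clause shown in the primitive syntax earlier: the part before the colon is the interval and the part after it is the timeout. The sketch below does that conversion as plain string handling; monitor_to_op is a hypothetical helper, not a crmsh command.

```shell
#!/bin/bash
# monitor_to_op SPEC
# Converts "<interval>[:<timeout>]" into an "op monitor ..." clause.
monitor_to_op() {
    local spec=$1 interval timeout
    interval=${spec%%:*}                 # text before the first colon
    if [ "$spec" = "$interval" ]; then   # no colon: interval only
        echo "op monitor interval=$interval"
    else
        timeout=${spec#*:}               # text after the first colon
        echo "op monitor interval=$interval timeout=$timeout"
    fi
}

monitor_to_op 20s:15s    # prints: op monitor interval=20s timeout=15s
```

The printed clause could be appended to the primitive definition, for example `primitive webserver lsb:httpd op monitor interval=20s timeout=15s`, so the monitor is defined together with the resource.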