Outline
- Introduction
- What is a high-availability cluster
- How Heartbeat works
- Two-node hot standby with heartbeat v1
I. Introduction
II. What Is a High-Availability Cluster
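A high-availability cluster (HA cluster) is a server-cluster technology whose goal is to reduce service downtime, for example downtime caused by a server crash. Put simply, a cluster is a group of computers that together, as a single unit, provide a set of network resources to users. Each individual computer in the group is a node of the cluster.
High-availability clusters exist to keep the cluster's overall service available as much as possible, and so to reduce the losses caused by the fallibility of hardware and software. By protecting the services that business applications provide to the outside world, an HA cluster minimizes the impact of software, hardware, and human faults on the business. If a node fails, its standby node takes over its duties within a few seconds, so from the user's point of view the cluster appears never to go down. The main job of HA cluster software is to automate failure detection and service switchover.
An HA cluster with only two nodes is also called a two-node hot standby: two servers back each other up. When one server fails, the other takes over the service, so the system keeps serving without manual intervention. Two-node hot standby is just one kind of HA cluster; an HA cluster can also have more than two nodes and offer more, and more advanced, features than hot standby, which makes it better suited to changing requirements.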
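III. How Heartbeat Works
Heartbeat (Linux-HA) has two core parts: heartbeat monitoring and resource takeover. Heartbeat monitoring can run over network links and serial ports, and it supports redundant links. The nodes send each other packets reporting their current state; if a node receives no packet from its peer within the configured time, it considers the peer failed and starts the resource-takeover module to take over the resources or services that were running on the peer.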
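The failure-detection rule described above can be sketched roughly as follows. This only illustrates the timing logic and is not Heartbeat's actual implementation; the function name `peer_state` and its arguments are made up for this example, and the values 3 and 8 are the `warntime` and `deadtime` settings used in the ha.cf later in this article.

```shell
#!/bin/sh
# Illustrative sketch only: classify a peer from the age of its last heartbeat.
# peer_state LAST_SEEN NOW WARNTIME DEADTIME   (all in seconds; names are hypothetical)
peer_state() {
    last_seen=$1
    now=$2
    warntime=$3
    deadtime=$4
    age=$((now - last_seen))
    if [ "$age" -ge "$deadtime" ]; then
        echo dead     # nothing heard within deadtime: resource takeover would start
    elif [ "$age" -ge "$warntime" ]; then
        echo late     # heartbeat is overdue, but the peer is not declared dead yet
    else
        echo alive
    fi
}

# Last packet seen at t=100, evaluated at three later moments.
peer_state 100 102 3 8
peer_state 100 104 3 8
peer_state 100 109 3 8
```

Whether the real daemon treats the boundary as "greater than" or "greater than or equal" is not something this sketch tries to settle.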
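IV. Two-Node Hot Standby with heartbeat v1
Note: prerequisites for configuring high availability:
- Every node's host name must match the output of uname -n
- The clocks of all nodes must be synchronized
- The nodes must be able to reach each other over SSH with key-based authentication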
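The first two prerequisites lend themselves to a quick pre-flight check. The helpers below are a minimal sketch: the function names and the idea of a fixed tolerance in seconds are assumptions for illustration, not part of heartbeat.

```shell
#!/bin/sh
# Sketch of pre-flight checks; both helper names are hypothetical.

# Succeeds if the name given for this node equals the output of `uname -n`.
node_name_ok() {
    [ "$1" = "$(uname -n)" ]
}

# Succeeds if two epoch timestamps differ by at most MAX seconds.
# clock_skew_ok LOCAL_EPOCH REMOTE_EPOCH MAX
clock_skew_ok() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$3" ]
}

if node_name_ok "$(uname -n)"; then
    echo "node name matches uname -n"
fi
```

On a real node one might call `node_name_ok essun.node1.com`, and `clock_skew_ok "$(date +%s)" "$(ssh essun.node2.com date +%s)" 2`; the second call relies on the key-based SSH access listed as the third prerequisite.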
NFS server: www.directory.com, 192.168.1.118 (also used as the ping node that the cluster nodes check for status)
2. Configuring node1
    # vim /etc/hosts
    # uname -n
    # vim /etc/sysconfig/network
Every node needs settings like these.
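The article does not show the contents of the hosts file, so the entries below are a reconstruction, not the author's file. The addresses are inferred from later steps (node1 serves 192.168.1.109, node2 serves 192.168.1.123, the NFS host is 192.168.1.118), and the short aliases `node1` and `node2` are the ones used by the later `ssh` and `scp` commands. `add_cluster_hosts` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: append the cluster name mappings to a hosts file.
# Addresses are inferred from later steps of this article; adjust to your network.
add_cluster_hosts() {
    hosts_file=$1
    cat >> "$hosts_file" <<'EOF'
192.168.1.109   essun.node1.com   node1
192.168.1.123   essun.node2.com   node2
192.168.1.118   www.directory.com
EOF
}

# Demo against a scratch file rather than the real /etc/hosts.
tmp=$(mktemp)
add_cluster_hosts "$tmp"
cat "$tmp"
rm -f "$tmp"
```

The matching setting in /etc/sysconfig/network would presumably be a `HOSTNAME=` line carrying the same full name that `uname -n` should report.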
    # ssh-keygen -t rsa -P ''
    # ssh-copy-id -i .ssh/id_rsa.pub root@essun.node2.com
Copy the public key to every other node (repeat this on each node).
After that, packages from EPEL can be installed with yum.
    # rpm -ivh heartbeat-2.1.4-12.el6.x86_64.rpm heartbeat-stonith-2.1.4-12.el6.x86_64.rpm heartbeat-pils-2.1.4-12.el6.x86_64.rpm
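- Messaging layer
- Resource manager
- Resource agents
There are actually no configuration files here yet; the package only provides configuration templates, stored in: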
    [root@essun ha.d]# rpm -ql heartbeat | grep ha.cf
    /usr/share/doc/heartbeat-2.1.4/ha.cf
We will use three files from this directory:
    [root@essun heartbeat-2.1.4]# pwd
    /usr/share/doc/heartbeat-2.1.4
    [root@essun heartbeat-2.1.4]# ls
    apphbd.cf            HardwareGuide.txt
    authkeys             haresources
    AUTHORS              hb_report.html
    ChangeLog            hb_report.txt
    COPYING              heartbeat_api.html
    COPYING.LGPL         heartbeat_api.txt
    DirectoryMap.txt     logd.cf
    faqntips.html        README
    faqntips.txt         Requirements.html
    GettingStarted.html  Requirements.txt
    GettingStarted.txt   rsync.html
    ha.cf                rsync.txt
    HardwareGuide.html   startstop
    # cp -a /usr/share/doc/heartbeat-2.1.4/{authkeys,haresources,ha.cf} /etc/ha.d/
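Configure the key and algorithm used to authenticate heartbeat messages:
    # vim /etc/ha.d/authkeys
    # selects which of the key entries below is used
    auth 2
    #1 crc
    # key index, authentication algorithm, and shared secret
    2 sha1 HI!
    #3 md5 Hello!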
Alternatively, generate a random string with openssl to use as the key:
    [root@essun ha.d]# openssl rand -hex 8
    7d4d4401f3d151dc
Restrict the permissions of this file:
    [root@essun ha.d]# chmod 600 authkeys
    [root@essun ha.d]# ll authkeys
    -rw------- 1 root root 643 Apr 18 00:01 authkeys
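The three steps above (pick an algorithm, generate a random key, lock down the file) can be combined. The following is a minimal sketch, not the author's procedure: `write_authkeys` is a hypothetical helper, it uses key index 1 instead of the index 2 shown earlier, and the `/dev/urandom` fallback is an addition for systems without openssl.

```shell
#!/bin/sh
# Sketch: write an authkeys file with a random sha1 key and owner-only permissions.
# write_authkeys is a hypothetical helper; pass /etc/ha.d/authkeys on a real node.
write_authkeys() {
    target=$1
    if command -v openssl >/dev/null 2>&1; then
        key=$(openssl rand -hex 8)
    else
        # Fallback when openssl is unavailable: 8 random bytes printed as hex.
        key=$(od -An -tx1 -N8 /dev/urandom | tr -d ' \n')
    fi
    # Create the file inside a subshell with a restrictive umask, so the key is
    # never readable by other users and the caller's umask is left untouched.
    ( umask 077; printf 'auth 1\n1 sha1 %s\n' "$key" > "$target" )
    chmod 600 "$target"
}

# Demo in a scratch directory.
tmp=$(mktemp -d)
write_authkeys "$tmp/authkeys"
ls -l "$tmp/authkeys"
rm -rf "$tmp"
```

The same file has to end up on every node, so in practice it would be generated once and then copied, as done with scp later in this article.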
Configure the cluster properties (ha.cf):
    [root@essun ha.d]# grep -v "#" /etc/ha.d/ha.cf | grep -v "^$"
    logfile /var/log/ha-log
    keepalive 1000ms
    deadtime 8
    warntime 3
    udpport 694
    mcast eth0 225.0.32.1 694 1 0
    auto_failback on
    node essun.node1.com
    node essun.node2.com
    ping 192.168.1.118
    compression bz2
    compression_threshold 2
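For readers new to ha.cf, here are the same directives again with a one-line comment on each. The comments are my reading of the directives rather than text from the original article, and `write_ha_cf` is a hypothetical helper that writes the file to whatever path it is given.

```shell
#!/bin/sh
# Sketch: write a commented copy of the settings shown above.
# write_ha_cf is a hypothetical helper; pass /etc/ha.d/ha.cf on a real node.
write_ha_cf() {
    cat > "$1" <<'EOF'
# Where heartbeat writes its log
logfile /var/log/ha-log
# Interval between heartbeat packets
keepalive 1000ms
# Seconds of silence before a peer is declared dead
deadtime 8
# Seconds of silence before a "late heartbeat" warning is logged
warntime 3
# UDP port used for heartbeat traffic
udpport 694
# Multicast link: device, group address, port, TTL, loopback flag
mcast eth0 225.0.32.1 694 1 0
# Move resources back to their preferred node when it returns
auto_failback on
# Cluster members; each name must equal that node's `uname -n`
node essun.node1.com
node essun.node2.com
# Ping node: a reachable address used to judge network connectivity
ping 192.168.1.118
# Compress large messages with bz2; the threshold is in kilobytes
compression bz2
compression_threshold 2
EOF
}

# Demo: write to a scratch file and list only the active directives.
tmp=$(mktemp)
write_ha_cf "$tmp"
grep -v '^#' "$tmp"
rm -f "$tmp"
```

Keeping `warntime` well below `deadtime` means a late-heartbeat warning shows up in the log before a takeover is triggered.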
Install the HTTP service on node1 and node2 and test access to each.
node1
    [root@essun download]# echo "192.168.1.109" > /var/www/html/index.html
    [root@essun download]# service httpd start
    Starting httpd: [ OK ]
    [root@essun download]# curl http://192.168.1.123
    192.168.1.123
node2
    [root@essun ha.d]# echo "192.168.1.123" > /var/www/html/index.html
    [root@essun ha.d]# service httpd start
    Starting httpd: [ OK ]
    [root@essun ha.d]# curl http://192.168.1.109
    192.168.1.109
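Once both pages are reachable, stop httpd on both nodes and disable it at boot, because heartbeat will start and stop it as a cluster resource.
node1
    [root@essun download]# service httpd stop
    Stopping httpd: [ OK ]
    [root@essun download]# chkconfig httpd off
node2
    [root@essun ha.d]# service httpd stop
    Stopping httpd: [ OK ]
    [root@essun ha.d]# chkconfig httpd off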
    [root@essun ha.d]# cd resource.d/
    [root@essun resource.d]# ls
    apache        ICP        LinuxSCSI          Raid1
    AudibleAlarm  ids        LVM                SendArp
    db2           IPaddr     LVSSyncDaemonSwap  ServeRAID
    Delay         IPaddr2    MailTo             WAS
    Filesystem    IPsrcaddr  OCF                WinPopup
    hto-mapfuncs  IPv6addr   portblock          Xinetd
    [root@essun resource.d]# pwd
    /etc/ha.d/resource.d
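The agents defined in /etc/ha.d/resource.d are the legacy (heartbeat v1) type. If no matching resource agent is found in this directory, heartbeat falls back to an LSB-type RA, that is, an init script under /etc/rc.d/init.d/.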
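The lookup order just described can be illustrated with a small helper. This is a sketch of the search order only, not Heartbeat's own code; `find_ra` and its optional directory arguments are invented for the example, and the demo uses scratch directories so it is safe to run anywhere.

```shell
#!/bin/sh
# Sketch of the lookup order described above (not Heartbeat's own code).
# find_ra NAME [LEGACY_DIR] [LSB_DIR] prints the path of the first executable match.
find_ra() {
    name=$1
    legacy_dir=${2:-/etc/ha.d/resource.d}
    lsb_dir=${3:-/etc/rc.d/init.d}
    for dir in "$legacy_dir" "$lsb_dir"; do
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Demo: IPaddr exists as a legacy agent, httpd only as an init script.
demo=$(mktemp -d)
mkdir "$demo/resource.d" "$demo/init.d"
printf '#!/bin/sh\n' > "$demo/resource.d/IPaddr"
printf '#!/bin/sh\n' > "$demo/init.d/httpd"
chmod +x "$demo/resource.d/IPaddr" "$demo/init.d/httpd"
find_ra IPaddr "$demo/resource.d" "$demo/init.d"
find_ra httpd "$demo/resource.d" "$demo/init.d"
rm -rf "$demo"
```

This matches what the log later in this article shows: the IP address is handled by /etc/ha.d/resource.d/IPaddr, while httpd is started through /etc/init.d/httpd.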
    # vim /etc/ha.d/haresources
    essun.node1.com 192.168.1.100/24/eth0 httpd
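A haresources line names the preferred node first and then the resources of the group; a bare address such as 192.168.1.100/24/eth0 is shorthand for the IPaddr agent. Resources are started left to right and stopped right to left. The helper below is only a sketch to make that ordering visible; `resource_order` is a hypothetical name, and the sample line uses the cluster address that appears in the heartbeat log.

```shell
#!/bin/sh
# Sketch: show the order in which a haresources group is started and stopped.
# resource_order "<haresources line>" start|stop   (hypothetical helper)
resource_order() {
    line=$1
    action=$2
    set -- $line        # split on whitespace: $1 = preferred node, rest = resources
    shift               # drop the node name
    if [ "$action" = start ]; then
        printf '%s\n' "$@"
    else
        reversed=
        for r in "$@"; do
            reversed="$r${reversed:+ $reversed}"    # prepend to reverse the list
        done
        printf '%s\n' $reversed
    fi
}

group='essun.node1.com 192.168.1.100/24/eth0 httpd'
echo "start order:"; resource_order "$group" start
echo "stop order:";  resource_order "$group" stop
```

Bringing the address up before httpd and taking it down after httpd is what lets the group move between nodes as one unit.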
Copy the three configuration files to all nodes (here there are only node1 and node2):
    [root@essun ha.d]# scp -p ha.cf authkeys haresources node1:/etc/ha.d/
    ha.cf                 100%   10KB  10.4KB/s   00:00
    authkeys              100%  643     0.6KB/s   00:00
    haresources           100% 5905     5.8KB/s   00:00
Check on node1:
    [root@essun ha.d]# ls
    authkeys  harc         rc.d           resource.d
    ha.cf     haresources  README.config  shellfuncs
Start the cluster manager on both nodes:
    [root@essun ha.d]# service heartbeat start
    Starting High-Availability services:
    2014/04/18_01:25:32 INFO: Resource is stopped
    Done.
    [root@essun ha.d]# ssh essun.node2.com 'service heartbeat start'
    Starting High-Availability services:
    2014/04/18_01:26:45 INFO: Resource is stopped
    Done.
    # tail -40 /var/log/ha-log
    heartbeat[3647]: 2014/04/18_01:25:32 info: Configuration validated. Starting heartbeat 2.1.4
    heartbeat[3648]: 2014/04/18_01:25:32 info: heartbeat: version 2.1.4
    heartbeat[3648]: 2014/04/18_01:25:32 info: Heartbeat generation: 1397754809
    heartbeat[3648]: 2014/04/18_01:25:32 info: glib: UDP multicast heartbeat started for group 225.0.32.1 port 694 interface eth0 (ttl=1 loop=0)
    heartbeat[3648]: 2014/04/18_01:25:32 info: glib: ping heartbeat started.
    heartbeat[3648]: 2014/04/18_01:25:32 info: G_main_add_TriggerHandler: Added signal manual handler
    heartbeat[3648]: 2014/04/18_01:25:32 info: G_main_add_TriggerHandler: Added signal manual handler
    heartbeat[3648]: 2014/04/18_01:25:32 info: G_main_add_SignalHandler: Added signal handler for signal 17
    heartbeat[3648]: 2014/04/18_01:25:32 info: Local status now set to: 'up'
    heartbeat[3648]: 2014/04/18_01:25:33 info: Link 192.168.1.118:192.168.1.118 up.
    heartbeat[3648]: 2014/04/18_01:25:33 info: Status update for node 192.168.1.118: status ping
    heartbeat[3648]: 2014/04/18_01:25:39 info: Link essun.node2.com:eth0 up.
    heartbeat[3648]: 2014/04/18_01:25:39 info: Status update for node essun.node2.com: status up
    harc[3660]: 2014/04/18_01:25:39 info: Running /etc/ha.d/rc.d/status status
    heartbeat[3648]: 2014/04/18_01:25:39 info: Comm_now_up(): updating status to active
    heartbeat[3648]: 2014/04/18_01:25:39 info: Local status now set to: 'active'
    heartbeat[3648]: 2014/04/18_01:25:40 info: Status update for node essun.node2.com: status active
    harc[3678]: 2014/04/18_01:25:40 info: Running /etc/ha.d/rc.d/status status
    heartbeat[3648]: 2014/04/18_01:25:50 info: remote resource transition completed.
    heartbeat[3648]: 2014/04/18_01:25:50 info: remote resource transition completed.
    heartbeat[3648]: 2014/04/18_01:25:50 info: Initial resource acquisition complete (T_RESOURCES(us))
    IPaddr[3729]: 2014/04/18_01:25:50 INFO: Resource is stopped
    heartbeat[3693]: 2014/04/18_01:25:50 info: Local Resource acquisition completed.
    harc[3779]: 2014/04/18_01:25:50 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
    ip-request-resp[3779]: 2014/04/18_01:25:50 received ip-request-resp 192.168.1.100/24/eth0 OK yes
    ResourceManager[3798]: 2014/04/18_01:25:50 info: Acquiring resource group: essun.node1.com 192.168.1.100/24/eth0 httpd
    IPaddr[3824]: 2014/04/18_01:25:50 INFO: Resource is stopped
    ResourceManager[3798]: 2014/04/18_01:25:50 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.100/24/eth0 start
    IPaddr[3921]: 2014/04/18_01:25:50 INFO: Using calculated netmask for 192.168.1.100: 255.255.255.0
    IPaddr[3921]: 2014/04/18_01:25:50 INFO: eval ifconfig eth0:0 192.168.1.100 netmask 255.255.255.0 broadcast 192.168.1.255
    IPaddr[3892]: 2014/04/18_01:25:50 INFO: Success
    ResourceManager[3798]: 2014/04/18_01:25:51 info: Running /etc/init.d/httpd start
Check that the cluster IP was added as an interface alias on node1:
    [root@essun heartbeat2]# ifconfig
    eth0      Link encap:Ethernet  HWaddr 00:0C:29:1E:F8:F9
              inet addr:192.168.1.109  Bcast:255.255.255.255  Mask:255.255.255.0
              inet6 addr: fe80::20c:29ff:fe1e:f8f9/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:52073 errors:0 dropped:0 overruns:0 frame:0
              TX packets:25502 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:35207636 (33.5 MiB)  TX bytes:8188830 (7.8 MiB)
    eth0:0    Link encap:Ethernet  HWaddr 00:0C:29:1E:F8:F9
              inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:16436  Metric:1
              RX packets:708 errors:0 dropped:0 overruns:0 frame:0
              TX packets:708 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:62304 (60.8 KiB)  TX bytes:62304 (60.8 KiB)
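Visit the cluster IP address and check whether the page served is node1's (the one showing node1's IP address).
From node2, stop the heartbeat service on node1 (that is, stop one node from the other, healthy node to simulate a failure), then visit the page again to see what happens: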
    [root@essun ha.d]# ssh node1 'service heartbeat stop'
    Stopping High-Availability services:
    Done.
Refresh the page and check the result.
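Instead of refreshing a browser, the switchover can also be watched from a third machine by fetching the cluster address in a loop. The sketch below assumes curl is installed; `poll_page` is a hypothetical helper, and its optional fourth argument (the fetch command) exists only so that the function can be exercised without a network.

```shell
#!/bin/sh
# Sketch: fetch a page repeatedly and print a timestamp with each answer.
# poll_page URL COUNT [INTERVAL] [FETCH_CMD]   (hypothetical helper)
poll_page() {
    url=$1
    count=$2
    interval=${3:-1}
    fetch=${4:-curl -s --max-time 2}
    i=0
    while [ "$i" -lt "$count" ]; do
        # $fetch is deliberately unquoted so that its options split into words.
        body=$($fetch "$url" 2>/dev/null) || body="(no answer)"
        printf '%s %s\n' "$(date +%T)" "$body"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `poll_page http://192.168.1.100/ 30` run while heartbeat is stopped on node1 should show the page content change from node1's address to node2's, possibly with a few "(no answer)" lines in between while the takeover is in progress.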
Restart the heartbeat service on node1 to simulate the host coming back online:
    [root@essun heartbeat2]# service heartbeat start
    Starting High-Availability services:
    2014/04/18_01:41:17 INFO: Resource is stopped
    Done.
Refresh the page again and check the result.
Step 4: Create a share on the directory host to provide shared files to the two cluster nodes
    [root@www /]# yum install -y nfs*
    [root@www /]# mkdir -pv /www/share
    mkdir: created directory `/www'
    mkdir: created directory `/www/share'
    [root@www /]# echo -e "/www/share\t192.168.1.0/24(rw)" > /etc/exports
    [root@www /]# cat /etc/exports
    /www/share      192.168.1.0/24(rw)
    [root@www /]# setfacl -m u:apache:rwx /www/share/
    [root@www /]# echo "<h1>This site is served from NFS</h1>" > /www/share/index.html
    [root@www /]# cat /www/share/index.html
    <h1>This site is served from NFS</h1>
    [root@www /]# service nfs start
    Starting NFS services: [ OK ]
    Starting NFS quotas: [ OK ]
    Starting NFS mountd: [ OK ]
    Starting NFS daemon: [ OK ]
    Starting RPC idmapd: [ OK ]
Stop heartbeat on all front-end cluster nodes:
    [root@essun heartbeat2]# uname -n
    essun.node1.com
    [root@essun heartbeat2]# service heartbeat stop
    Stopping High-Availability services:
    Done.
    [root@essun heartbeat2]# ssh node2 'service heartbeat stop'
    Stopping High-Availability services:
    Done.
Edit /etc/ha.d/haresources on node1:
    # vim /etc/ha.d/haresources
    essun.node1.com 192.168.1.100/24/eth0 Filesystem::192.168.1.118:/www/share::/var/www/html::nfs httpd
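In a haresources entry, `::` separates the agent name from its arguments, so the Filesystem entry above carries three arguments: the device (here an NFS export), the mount point, and the file system type. The helper below is a sketch of how such an entry maps onto an agent command line; it illustrates the syntax and is not Heartbeat's parser, and `ra_command` is a made-up name.

```shell
#!/bin/sh
# Sketch: turn a haresources entry of the form Agent::arg1::arg2... into the
# command line that would be run for a given action. Illustration only.
ra_command() {
    entry=$1
    action=$2
    printf '%s\n' "$entry" | awk -F'::' -v action="$action" '{
        printf "%s", $1                              # agent name
        for (i = 2; i <= NF; i++) printf " %s", $i   # agent arguments
        printf " %s\n", action                       # start or stop
    }'
}

ra_command 'Filesystem::192.168.1.118:/www/share::/var/www/html::nfs' start
```

The single colon inside `192.168.1.118:/www/share` is left alone because only the double colon acts as a separator. With this group, the address comes up first, then the NFS share is mounted on /var/www/html, and only then is httpd started.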
    [root@essun heartbeat2]# service heartbeat start
    Starting High-Availability services:
    2014/04/18_02:44:50 INFO: Resource is stopped
    Done.
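Refresh the page and try it.
Now start node2 and then stop node1. Because node2 still has the old haresources, the page it serves should still be the one showing its own IP address: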
    [root@essun heartbeat2]# ssh node2 'service heartbeat start'
    Starting High-Availability services:
    2014/04/18_02:50:42 INFO: Resource is stopped
    Done.
    [root@essun heartbeat2]# service heartbeat stop
    Stopping High-Availability services:
    Done.
That is indeed the result. If node1's haresources is synced to node2, node2 will then serve the same page as node1.
Let's sync it and try:
    [root@essun heartbeat2]# ssh node2 'service heartbeat stop'
    Stopping High-Availability services:
    Done.
    [root@essun heartbeat2]# scp /etc/ha.d/haresources node2:/etc/ha.d/
    haresources           100% 6006     5.9KB/s   00:00
    [root@essun heartbeat2]# ssh node2 'service heartbeat start'
    Starting High-Availability services:
    2014/04/18_02:54:00 INFO: Resource is stopped
    Done.
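The experiment above shows what happens when the two nodes run with different haresources files, so it can be worth checking that the copies match before starting heartbeat. The following is a minimal sketch; `same_config` is a hypothetical helper that compares two local files by checksum.

```shell
#!/bin/sh
# Sketch: report whether two copies of a configuration file are identical.
# same_config FILE_A FILE_B   (hypothetical helper; returns 2 if a file is unreadable)
same_config() {
    a=$(cksum < "$1") || return 2
    b=$(cksum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Demo with two scratch files.
demo=$(mktemp -d)
echo 'essun.node1.com 192.168.1.100/24/eth0 httpd' > "$demo/local"
cp "$demo/local" "$demo/remote"
if same_config "$demo/local" "$demo/remote"; then
    echo "haresources copies match"
fi
rm -rf "$demo"
```

On the cluster, the remote copy could first be fetched with something like `ssh node2 cat /etc/ha.d/haresources > /tmp/haresources.node2` and then compared against the local /etc/ha.d/haresources.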
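Refresh and try again.
What the steps above really demonstrate is that the resources of one group are referenced in order: because they belong to the same group, they move together, starting and stopping as a unit, which makes the switchover seamless.
==================== End of the heartbeat v1 two-node cluster demo ====================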