Step 1: The default CentOS yum repositories do not carry heartbeat, so first add Fedora's EPEL repository to the system:
rpm -ivh epel-release-6-8.noarch.rpm
Step 2:
Run yum install heartbeat to install heartbeat on the server.
Among heartbeat's configuration files, the /etc/ha.d/ directory contains a README.config file, which reads:
You need three configuration files to make heartbeat happy,
and they all go in this directory.
They are:
ha.cf Main configuration file
haresources Resource configuration file
authkeys Authentication information
These first two may be readable by everyone, but the authkeys file
must not be.
The good news is that sample versions of these files may be found in
the documentation directory (providing you installed the documentation).
If you installed heartbeat using rpm packages then
this command will show you where they are on your system:
    rpm -q heartbeat -d
If you installed heartbeat using Debian packages then
the documentation should be located in /usr/share/doc/heartbeat
Running the command above lists the relevant files:
[root@TEST-43 log]# rpm -q heartbeat -d
/usr/share/doc/heartbeat-3.0.4/AUTHORS
/usr/share/doc/heartbeat-3.0.4/COPYING
/usr/share/doc/heartbeat-3.0.4/COPYING.LGPL
/usr/share/doc/heartbeat-3.0.4/ChangeLog
/usr/share/doc/heartbeat-3.0.4/README
/usr/share/doc/heartbeat-3.0.4/apphbd.cf
/usr/share/doc/heartbeat-3.0.4/authkeys
/usr/share/doc/heartbeat-3.0.4/ha.cf
/usr/share/doc/heartbeat-3.0.4/haresources
/usr/share/man/man1/cl_status.1.gz
/usr/share/man/man1/hb_addnode.1.gz
/usr/share/man/man1/hb_delnode.1.gz
/usr/share/man/man1/hb_standby.1.gz
/usr/share/man/man1/hb_takeover.1.gz
/usr/share/man/man5/authkeys.5.gz
/usr/share/man/man5/ha.cf.5.gz
/usr/share/man/man8/apphbd.8.gz
/usr/share/man/man8/heartbeat.8.gz
Then copy authkeys, ha.cf, and haresources into the /etc/ha.d/ directory.
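The copy step could be scripted roughly as follows. This is a minimal sketch, not part of heartbeat: the helper name `install_samples` is made up for illustration, and the mode 0600 on authkeys is based on the README's statement that the file must not be readable by everyone.

```python
import os
import shutil

# The three file names come from README.config.
SAMPLE_FILES = ("ha.cf", "haresources", "authkeys")

def install_samples(doc_dir, conf_dir):
    """Copy the three sample files into conf_dir and lock down authkeys."""
    copied = []
    for name in SAMPLE_FILES:
        target = os.path.join(conf_dir, name)
        shutil.copy(os.path.join(doc_dir, name), target)
        copied.append(target)
    # authkeys must not be readable by anyone but its owner.
    os.chmod(os.path.join(conf_dir, "authkeys"), 0o600)
    return copied
```

On the system shown here it would be called, as root, with `install_samples("/usr/share/doc/heartbeat-3.0.4", "/etc/ha.d")`.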
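ha.cf configuration:
Edit the relevant settings. I ran into some problems here. Start with ha.cf, which holds the settings for the heartbeat service itself; its main parts are covered below.
First come the heartbeat interval and the failure-detection timeouts: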
# A note on specifying "how long" times below...
#
# The default time unit is seconds
# 10 means ten seconds
#
# You can also specify them in milliseconds
# 1500ms means 1.5 seconds
#
#
# keepalive: how long between heartbeats? (Note: the interval at which heartbeats are sent)
#
keepalive 2
#
# deadtime: how long-to-declare-host-dead? (Note: how long without a heartbeat before the peer is considered down)
#
# If you set this too low you will get the problematic
# split-brain (or cluster partition) problem.
# See the FAQ for how to use warntime to tune deadtime.
#
deadtime 30
#
# warntime: how long before issuing "late heartbeat" warning? (Note: how long without a heartbeat before a warning is logged)
# See the FAQ for how to use warntime to tune deadtime.
#
warntime 10
#
#
# Very first dead time (initdead) (Note: how long after system startup before heartbeat checking begins)
#
# On some machines/OSes, etc. the network takes a while to come up
# and start working right after you've been rebooted. As a result
# we have a separate dead time for when things first come up.
# It should be at least twice the normal dead time.
#
initdead 120
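The timing rules in the comments above can be checked mechanically. The sketch below assumes only what those comments state (plain numbers are seconds, an `ms` suffix means milliseconds, and initdead should be at least twice deadtime). The ordering check keepalive < warntime < deadtime is my own assumption about sensible values, not a documented rule, and both function names are hypothetical.

```python
def parse_seconds(value):
    """Convert an ha.cf time value such as "2" or "1500ms" to seconds."""
    value = value.strip().lower()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    return float(value)

def check_timings(keepalive, warntime, deadtime, initdead):
    """Return a list of warnings for timing values that look inconsistent."""
    k, w, d, i = (parse_seconds(v)
                  for v in (keepalive, warntime, deadtime, initdead))
    problems = []
    # Assumption: a warning should fire after a missed beat but before death.
    if not k < w < d:
        problems.append("expected keepalive < warntime < deadtime")
    # From the ha.cf comment: initdead should be at least twice deadtime.
    if i < 2 * d:
        problems.append("initdead should be at least twice deadtime")
    return problems
```

With the values used in this article, `check_timings("2", "10", "30", "120")` returns an empty list.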
Network configuration:
#
# What UDP port to use for bcast/ucast communication? (Note: applies to broadcast and unicast)
#
#udpport 3800
#
# Baud rate for serial ports...
#
#baud 19200
#
# serial serialportname ... (Note: set the parameters below if heartbeats travel over a serial line rather than the network)
#serial /dev/ttyS0 # Linux
#serial /dev/cuaa0 # FreeBSD
#serial /dev/cuad0 # FreeBSD 6.x
#serial /dev/cua/a # Solaris
#
# Set up a multicast heartbeat medium (Note: multicast parameters)
# mcast [dev] [mcast group] [port] [ttl] [loop]
#
# [dev] device to send/rcv heartbeats on (Note: the NIC used to send and receive heartbeats, e.g. eth0, eth1, ...)
# [mcast group] multicast group to join (class D multicast address
# 224.0.0.0 - 239.255.255.255) (Note: set a multicast group address, i.e. a class D address)
# [port] udp port to sendto/rcvfrom (set this value to the
# same value as "udpport" above) (Note: the multicast port)
# [ttl] the ttl value for outbound heartbeats. this effects
# how far the multicast packet will propagate. (0-255)
# Must be greater than zero.
# [loop] toggles loopback for outbound multicast heartbeats. (Note: whether outbound multicast packets are looped back; off by default)
# if enabled, an outbound packet will be looped back and
# received by the interface it was sent on. (0 or 1)
# Set this value to zero.
#
#
#mcast eth0 225.0.0.1 694 1 0
#
# Set up a unicast / udp heartbeat medium (Note: unicast, i.e. point-to-point communication; suits a system with only two hosts, and is used together with the UDP port above)
# ucast [dev] [peer-ip-addr]
#
# [dev] device to send/rcv heartbeats on
# [peer-ip-addr] IP address of peer to send packets to
#
ucast eth0 172.17.1.45
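If the multicast medium is used instead of ucast, the field constraints listed in the comments (class D group address, TTL greater than zero and at most 255, loop of 0 or 1) can be checked before starting heartbeat. This is an illustrative sketch; `check_mcast` is a made-up helper, and the port range check is a general UDP constraint rather than something ha.cf states.

```python
import ipaddress

def check_mcast(line):
    """Validate an ha.cf 'mcast dev group port ttl loop' directive."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "mcast":
        return ["expected: mcast dev group port ttl loop"]
    _, dev, group, port, ttl, loop = fields
    problems = []
    # Class D addresses are 224.0.0.0 - 239.255.255.255.
    if not ipaddress.ip_address(group).is_multicast:
        problems.append("group is not a class D (multicast) address")
    if not 0 < int(port) < 65536:
        problems.append("port out of range")
    if not 0 < int(ttl) <= 255:
        problems.append("ttl must be between 1 and 255")
    if loop not in ("0", "1"):
        problems.append("loop must be 0 or 1")
    return problems
```

The commented-out sample line, `mcast eth0 225.0.0.1 694 1 0`, passes all four checks.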
Automatic failback setting:
#
# About boolean values...
#
# Any of the following case-insensitive values will work for true:
# true, on, yes, y, 1
# Any of the following case-insensitive values will work for false:
# false, off, no, n, 0
#
#
#
# auto_failback: determines whether a resource will
# automatically fail back to its "primary" node, or remain
# on whatever node is serving it until that node fails, or
# an administrator intervenes. (Note: when the primary recovers, it takes the resources back from the backup, and the backup returns to standby)
#
# The possible values for auto_failback are:
# on - enable automatic failbacks
# off - disable automatic failbacks
# legacy - enable automatic failbacks in systems
# where all nodes do not yet support
# the auto_failback option.
#
# auto_failback "on" and "off" are backwards compatible with the old
# "nice_failback on" setting.
#
# See the FAQ for information on how to convert
# from "legacy" to "on" without a flash cut.
# (i.e., using a "rolling upgrade" process)
#
# The default value for auto_failback is "legacy", which
# will issue a warning at startup. So, make sure you put
# an auto_failback directive in your ha.cf file.
# (note: auto_failback can be any boolean or "legacy")
#
auto_failback on
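The accepted values for auto_failback follow directly from the two comment blocks above: any of the listed boolean spellings, case-insensitively, or the word legacy. A small sketch of that rule, with a hypothetical function name:

```python
TRUE_VALUES = {"true", "on", "yes", "y", "1"}
FALSE_VALUES = {"false", "off", "no", "n", "0"}

def parse_auto_failback(value):
    """Return True, False, or the string "legacy" for an auto_failback value."""
    v = value.strip().lower()
    if v == "legacy":
        return "legacy"
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    raise ValueError("not a boolean or 'legacy': %r" % value)
```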
Cluster node list:
#
# Tell what machines are in the cluster
# node nodename ... -- must match uname -n (Note: each node name must match what uname -n reports on that host)
node TEST-43
node TEST-45
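Because every node name has to match `uname -n`, a mismatch is easy to introduce by hand. The following sketch collects the names from the node directives and checks whether the local host is among them. Both helper names are hypothetical, and the parsing is simplified (it treats everything after `#` as a comment).

```python
import os

def node_names(ha_cf_text):
    """Collect every name from the 'node' directives in ha.cf text."""
    names = []
    for line in ha_cf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == "node":
            names.extend(fields[1:])
    return names

def local_node_listed(ha_cf_text, hostname=None):
    """True if this host's name (as uname -n reports it) is a listed node."""
    if hostname is None:
        hostname = os.uname().nodename  # same value that `uname -n` prints
    return hostname in node_names(ha_cf_text)
```

Passing the contents of /etc/ha.d/ha.cf to `local_node_listed` on each node gives a quick sanity check before starting the service.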
authkeys configuration:
authkeys configures how the nodes in the cluster authenticate to each other.
#
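# Note: "auth N" selects the authentication method by its number in the list below. crc is the simplest method and sha1 the most complex. The list below must be numbered in order (it cannot contain 2 without 1), and the number chosen here must be present in the list.
auth 1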
1 crc
#2 sha1 HI!
#3 md5 Hello
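The requirement that the number after auth must appear in the key list can be verified with a few lines of parsing. This is a simplified sketch with hypothetical function names. It assumes the layout shown above (an `auth N` line plus `N method [key]` lines) and would mishandle a key that itself contains `#`.

```python
def parse_authkeys(text):
    """Return (active_index, {index: (method, key)}) from authkeys text."""
    active = None
    keys = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        if fields[0] == "auth":
            active = int(fields[1])
        else:
            method = fields[1]
            key = fields[2] if len(fields) > 2 else None  # crc takes no key
            keys[int(fields[0])] = (method, key)
    return active, keys

def authkeys_ok(text):
    """True when the 'auth N' line points at a key that is actually defined."""
    active, keys = parse_authkeys(text)
    return active is not None and active in keys
```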
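haresources configuration:
This file configures the cluster resources, mainly the cluster's virtual IP and the services and resources that heartbeat manages.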
#
# This is a list of resources that move from machine to machine as
# nodes go down and come up in the cluster. Do not include
# "administrative" or fixed IP addresses in this file.
#
# <VERY IMPORTANT NOTE>
# The haresources files MUST BE IDENTICAL on all nodes of the cluster.
#
# The node names listed in front of the resource group information
# is the name of the preferred node to run the service. It is
# not necessarily the name of the current machine. If you are running
# auto_failback ON (or legacy), then these services will be started
# up on the preferred nodes - any time they're up.
#
# If you are running with auto_failback OFF, then the node information
# will be used in the case of a simultaneous start-up, or when using
# the hb_standby {foreign,local} command.
#
# BUT FOR ALL OF THESE CASES, the haresources files MUST BE IDENTICAL.
# If your files are different then almost certainly something
# won't work right.
# </VERY IMPORTANT NOTE>
#
#
# We refer to this file when we're coming up, and when a machine is being
# taken over after going down.
#
# You need to make this right for your installation, then install it in
# /etc/ha.d
#
# Each logical line in the file constitutes a "resource group".
# A resource group is a list of resources which move together from
# one node to another - in the order listed. It is assumed that there
# is no relationship between different resource groups. These
# resource in a resource group are started left-to-right, and stopped
# right-to-left. Long lists of resources can be continued from line
# to line by ending the lines with backslashes ("\").
#
# These resources in this file are either IP addresses, or the name
# of scripts to run to "start" or "stop" the given resource. (Note: a resource is either an IP address or the name of a script that starts or stops it)
#
# The format is like this: (Note: node name, space, resource 1, space, resource 2, and so on; heartbeat passes each resource to its handler script to run)
#
#node-name resource1 resource2 ... resourceN
#
#
# If the resource name contains an :: in the middle of it, the
# part after the :: is passed to the resource script as an argument.
# Multiple arguments are separated by the :: delimeter
#
# In the case of IP addresses, the resource script name IPaddr is
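# implied. (Note: an IP address is also a resource, meaning that a virtual IP is created. A resource is generally written as resource-name::resource-arguments. The resource name for an IP address is IPaddr and may be omitted, so IPaddr::172.17.1.49 is equivalent to 172.17.1.49)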
#
# For example, the IP address 135.9.8.7 could also be represented
# as IPaddr::135.9.8.7
#
# THIS IS IMPORTANT!! vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#
# The given IP address is directed to an interface which has a route
# to the given address. This means you have to have a net route
# set up outside of the High-Availability structure. We don't set it
# up here -- we key off of it.
#
# The broadcast address for the IP alias that is created to support
# an IP address defaults to the highest address on the subnet.
#
# The netmask for the IP alias that is created defaults to the same
# netmask as the route that it selected in in the step above.
#
# The base interface for the IPalias that is created defaults to the
# same netmask as the route that it selected in in the step above.
#
# If you want to specify that this IP address is to be brought up
# on a subnet with a netmask of 255.255.255.0, you would specify
# this as IPaddr::135.9.8.7/24 .
#
# If you wished to tell it that the broadcast address for this subnet
# was 135.9.8.210, then you would specify that this way:
# IPaddr::135.9.8.7/24/135.9.8.210
#
# If you wished to tell it that the interface to add the address to
# is eth0, then you would need to specify it this way:
# IPaddr::135.9.8.7/24/eth0
#
# And this way to specify both the broadcast address and the
# interface:
# IPaddr::135.9.8.7/24/eth0/135.9.8.210
#
# The IP addresses you list in this file are called "service" addresses,
# since they're they're the publicly advertised addresses that clients
# use to get at highly available services.
#
# For a hot/standby (non load-sharing) 2-node system with only
# a single service address,
# you will probably only put one system name and one IP address in here.
# The name you give the address to is the name of the default "hot"
# system.
#
# Where the nodename is the name of the node which "normally" owns the
# resource. If this machine is up, it will always have the resource
# it is shown as owning.
#
# The string you put in for nodename must match the uname -n name
# of your machine. Depending on how you have it administered, it could
# be a short name or a FQDN.
#
#-------------------------------------------------------------------
#
# Simple case: One service address, default subnet and netmask
# No servers that go up and down with the IP address
#
#just.linux-ha.org 135.9.216.110 (Note: this node brings up the virtual IP 135.9.216.110 with the default gateway, default NIC, and default subnet)
#
#-------------------------------------------------------------------
#
# Assuming the adminstrative addresses are on the same subnet...
# A little more complex case: One service address, default subnet
# and netmask, and you want to start and stop http when you get
# the IP address...
#
#just.linux-ha.org 135.9.216.110 http (Note: this node brings up the virtual IP 135.9.216.110 with the default gateway, default NIC, and default subnet, and ties an Apache service to it)
#-------------------------------------------------------------------
#
# A little more complex case: Three service addresses, default subnet
# and netmask, and you want to start and stop http when you get
# the IP address...
#
#just.linux-ha.org 135.9.216.110 135.9.215.111 135.9.216.112 httpd (Note: this node brings up a list of virtual IPs with the default gateway, default NIC, and default subnet, and ties an Apache service to them)
#-------------------------------------------------------------------
#
# One service address, with the subnet, interface and bcast addr
# explicitly defined.
#
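#just.linux-ha.org 135.9.216.3/28/eth0/135.9.216.12 httpd (Note: this node brings up the virtual IP 135.9.216.3 with the default gateway, a 28-bit netmask, the eth0 NIC, and the broadcast address 135.9.216.12, and ties an Apache service to it. The all-ones host address of this /28 would actually be 135.9.216.15.)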
#
#-------------------------------------------------------------------
#
# An example where a shared filesystem is to be used.
# Note that multiple aguments are passed to this script using
# the delimiter '::' to separate each argument.
#
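#node1 10.0.0.170 Filesystem::/dev/sda1::/data1::ext2 (Note: this line uses a shared disk, with the double colon separating the arguments. It assigns node1 the virtual IP 10.0.0.170 with the default gateway, default NIC, and default subnet, and then runs the equivalent of "mount -t ext2 /dev/sda1 /data1")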
#
# Regarding the node-names in this file:
#
# They must match the names of the nodes listed in ha.cf, which in turn
# must match the `uname -n` of some node in the cluster. So they aren't
# virtual in any sense of the word.
#
The configuration above must be exactly identical on the primary and standby nodes; otherwise resource takeover will not work.
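To make the haresources syntax concrete, here is a rough sketch of how one resource-group line breaks down: the first word is the preferred node, each remaining word is a resource, `::` separates a script name from its arguments, and a bare IP address implies the IPaddr script. The function names and the regular expression are my own illustration, not heartbeat's actual parser.

```python
import re

# A bare IPv4 address, optionally followed by /netmask/interface/broadcast.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(/.*)?$")

def parse_resource(token):
    """Split 'script::arg1::arg2' into (script, [args]); bare IPs imply IPaddr."""
    if IPV4.match(token):
        return "IPaddr", [token]
    script, *args = token.split("::")
    return script, args

def parse_haresources_line(line):
    """Return (preferred_node, [(script, args), ...]) for one resource group."""
    node, *resources = line.split()
    return node, [parse_resource(r) for r in resources]
```

For the shared-filesystem example, the result is node1 with an IPaddr resource for 10.0.0.170 followed by a Filesystem resource whose arguments are /dev/sda1, /data1, and ext2, in the left-to-right start order the comments describe.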
Problem:
After installation and configuration, heartbeat failed to start. The error log was:
Nov 20 09:40:44 TEST-43 heartbeat: [6961]: WARN: heartbeat: udp port 3800 reserved for service "pwgpsi".
Nov 20 09:40:44 TEST-43 heartbeat: [6961]: info: Pacemaker support: false
Nov 20 09:40:44 TEST-43 heartbeat: [6961]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Nov 20 09:40:44 TEST-43 heartbeat: [6961]: info: **************************
Nov 20 09:40:44 TEST-43 heartbeat: [6961]: info: Configuration validated. Starting heartbeat 3.0.4
Nov 20 09:40:44 TEST-43 heartbeat: [6962]: info: heartbeat: version 3.0.4
Nov 20 09:40:45 TEST-43 heartbeat: [6962]: info: Heartbeat generation: 1447925863
Nov 20 09:40:45 TEST-43 heartbeat: [6962]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
Nov 20 09:40:45 TEST-43 heartbeat: [6962]: info: glib: ucast: bound send socket to device: eth0
Nov 20 09:40:45 TEST-43 heartbeat: [6962]: ERROR: glib: ucast: error setting option SO_REUSEPORT(w): Protocol not available
Nov 20 09:40:45 TEST-43 heartbeat: [6962]: ERROR: make_io_childpair: cannot open ucast eth0
Nov 20 09:40:46 TEST-43 heartbeat: [6966]: CRIT: Emergency Shutdown: Master Control process died.
Nov 20 09:40:46 TEST-43 heartbeat: [6966]: CRIT: Killing pid 6962 with SIGTERM
Nov 20 09:40:46 TEST-43 heartbeat: [6966]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
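Red Hat's bug tracker has a matching report, but I could not find the exact cause. The bug description and discussion are referenced below; a patch that requires recompiling resolves the problem.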
=======================================================================================
Bug 1028127 - Heartbeat not working on centos6 after last update [NEEDINFO]
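The log shows heartbeat failing when it tries to set SO_REUSEPORT on the ucast socket, and "Protocol not available" is the message for the ENOPROTOOPT error. A quick way to see whether the running kernel accepts that option is to try it on a throwaway UDP socket. This is only a diagnostic sketch with a hypothetical function name; it reports what the kernel does but does not explain the root cause, which the article leaves open.

```python
import socket

def reuseport_supported():
    """Return True if a UDP socket accepts the SO_REUSEPORT option here."""
    opt = getattr(socket, "SO_REUSEPORT", None)
    if opt is None:
        return False  # this Python build does not expose the constant
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, opt, 1)
        return True
    except OSError:
        return False  # e.g. ENOPROTOOPT, "Protocol not available"
```

A result of False on the affected host would be consistent with the error in the log above.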
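Postscript:
heartbeat is no longer maintained as an HA technology; Corosync and Pacemaker are now used instead.
PS:
Red Hat systems have a tool called yumdownloader that downloads rpm packages to the local machine. Together with a yum repository, it can be used to assemble a complete installation bundle.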
The attachment is suitable for