Original post, 2004-09-02 15:51:00

1.What is a Cluster?

VERITAS Cluster Server(VCS) connects, or clusters, multiple, independent systems
into a management framework for increased availability. Each system, or node, runs its
own operating system and cooperates at the software level to form a cluster. VCS links
commodity hardware with intelligent software to provide application failover and
control. When a node or a monitored application fails, other nodes can take predefined
actions to take over and bring up services elsewhere in the cluster.

2.Detecting Failure
VCS can detect application failure and node failure among cluster members.

(1)Detecting Application Failure
At the highest level, VCS is typically deployed to keep business-critical applications
online and available to users. VCS provides a mechanism to detect failure of an
application and any underlying resources or services supporting the application. VCS
issues specific commands, tests, or scripts that monitor the overall health of an
application. VCS also determines the health of underlying system resources supporting
the application, such as file systems and network interfaces.
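
The monitoring idea above can be sketched in a few lines. This is a minimal illustration, not VCS agent code: the `Resource` class and the functions `failed_resources` and `app_is_healthy` are hypothetical names, and the lambda checks stand in for whatever commands, tests, or scripts would actually be issued.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Resource:
    """A monitored item (hypothetical model, not a VCS type)."""
    name: str
    check: Callable[[], bool]  # returns True when the resource is healthy


def failed_resources(resources: List[Resource]) -> List[str]:
    """Run every check and return the names of resources that failed."""
    failed = []
    for res in resources:
        try:
            healthy = res.check()
        except Exception:
            # A check that crashes is treated as a failed resource.
            healthy = False
        if not healthy:
            failed.append(res.name)
    return failed


def app_is_healthy(resources: List[Resource]) -> bool:
    """The application is healthy only if it and everything under it pass."""
    return not failed_resources(resources)


if __name__ == "__main__":
    stack = [
        Resource("file_system", lambda: True),
        Resource("network_interface", lambda: True),
        Resource("application_process", lambda: False),  # simulated failure
    ]
    print(failed_resources(stack))
    print(app_is_healthy(stack))
```

The point of the sketch is that the application's health is the conjunction of its own check and the checks of the resources beneath it, so a failed file system or network interface counts as an application failure.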

(2)Detecting Node Failure
One of the most difficult tasks in clustering is correctly discriminating between loss of a
system and loss of communication between systems. There are several technologies used
for this purpose, including heartbeat networks between servers, quorum disks, and SCSI
reservation. VCS uses a redundant network heartbeat along with SCSI-III-based
membership coordination and data protection to detect node failure.

3.Switchover and Failover

Failover and switchover are the processes of bringing up application services on a
different node in a cluster. In both cases, an application and its network identity are
brought up on a selected node. Client systems access a virtual IP address that moves with
the service. Client systems are unaware of which server they are using.

A virtual IP address is an address brought up in addition to the base address of systems in
the cluster. For example, in a 2-node cluster consisting of db-server1 and db-server2, a
virtual address may be called db-server. Clients will then access db-server and be
unaware of which physical server is actually hosting the db-server. Virtual IP addresses
use a technology known as IP Aliasing.
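
A toy model may make the client's view clearer. The `Cluster` class below is hypothetical and does not configure any real interface; the node names come from the example above, while the 192.0.2.x addresses are placeholders chosen for illustration.

```python
class Cluster:
    """Toy model of a virtual IP that floats between nodes (illustrative only)."""

    def __init__(self, base_addresses, virtual_ip):
        # base_addresses maps node name -> that node's permanent base address
        self.base_addresses = dict(base_addresses)
        self.virtual_ip = virtual_ip
        self.holder = None  # node currently hosting the virtual IP

    def bring_up_virtual_ip(self, node):
        """Move the virtual IP to the given node."""
        if node not in self.base_addresses:
            raise ValueError(f"unknown node: {node}")
        self.holder = node

    def addresses_on(self, node):
        """All addresses answering on a node: its base address plus any alias."""
        addrs = [self.base_addresses[node]]
        if self.holder == node:
            addrs.append(self.virtual_ip)
        return addrs

    def connect(self, address):
        """Return the node that answers for an address."""
        if address == self.virtual_ip:
            if self.holder is None:
                raise ConnectionError("virtual IP is not up on any node")
            return self.holder
        for node, base in self.base_addresses.items():
            if base == address:
                return node
        raise ConnectionError(f"no node answers for {address}")


if __name__ == "__main__":
    cluster = Cluster(
        {"db-server1": "192.0.2.11", "db-server2": "192.0.2.12"},
        virtual_ip="192.0.2.10",  # the address clients know as db-server
    )
    cluster.bring_up_virtual_ip("db-server1")
    print(cluster.connect("192.0.2.10"))
    cluster.bring_up_virtual_ip("db-server2")
    print(cluster.connect("192.0.2.10"))
```

The client always connects to the same address; only the node answering for it changes.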

(1)The Switchover Process
A switchover is an orderly shutdown of an application and its supporting resources on
one server and a controlled startup on another server. Typically this means unassigning
the virtual IP, stopping the application, and deporting shared storage. On the other server,
the process is reversed. Storage is imported, file systems are mounted, the application is
started, and the virtual IP address is brought up.

(2)The Failover Process
A failover is similar to a switchover, except the ordered shutdown of applications on the
original node may not be possible. In this case services are simply started on another
node. The process of starting the application on the node is identical in a failover or
switchover. This means the application must be capable of restarting following a crash of
its original host.
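
The difference between the two processes can be shown as step lists. This is a sketch under stated assumptions: the function names and the `(node, action, resource)` tuples are hypothetical, and the shutdown order is simply the startup order reversed, which adds an unmount step that the switchover description implies but does not list.

```python
from typing import List, Tuple

# Resources in startup order, as listed in the switchover description.
STARTUP_ORDER = ["shared storage", "file systems", "application", "virtual IP"]

Step = Tuple[str, str, str]  # (node, action, resource)


def start_steps(target: str) -> List[Step]:
    """Startup on the target node; identical for switchover and failover."""
    return [(target, "online", res) for res in STARTUP_ORDER]


def switchover(source: str, target: str) -> List[Step]:
    """Orderly shutdown on the source (reverse order), then startup on the target."""
    if source == target:
        raise ValueError("source and target must differ")
    stop = [(source, "offline", res) for res in reversed(STARTUP_ORDER)]
    return stop + start_steps(target)


def failover(failed: str, target: str) -> List[Step]:
    """The failed node cannot be shut down cleanly, so only startup steps run."""
    if failed == target:
        raise ValueError("cannot fail over to the failed node")
    return start_steps(target)


if __name__ == "__main__":
    for step in switchover("db-server1", "db-server2"):
        print(step)
    print(failover("db-server1", "db-server2"))
```

Because `failover` skips the shutdown half entirely, the application starts against whatever state the crash left behind, which is why it must be able to recover on its own.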

4.Cluster Control, Communications, and Membership
(1)High-Availability Daemon (HAD)
The high-availability daemon, or HAD, is the main VCS daemon running on each system.
It is responsible for building the running cluster configuration from the configuration
files, distributing the information when new nodes join the cluster, responding to operator
input, and taking corrective action when something fails. It is typically known as the VCS
engine. The engine uses agents to monitor and manage resources.

(2)Low Latency Transport (LLT)
VCS uses private network communications between cluster nodes for cluster
maintenance. The Low Latency Transport functions as a high-performance, low-latency
replacement for the IP stack, and is used for all cluster communications. VERITAS
requires two completely independent networks between all cluster nodes, which provide
the required redundancy in the communication path and enable VCS to discriminate
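between a network failure and a system failure. LLT has two major functions:
distributing internode traffic across the available private network links, and sending and
receiving the heartbeat traffic that GAB uses to determine cluster membership.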
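
Why two independent networks help can be seen in a small sketch. It does not implement LLT; `PeerState`, `classify_peer`, the link names, and the timeout value are all hypothetical.

```python
from enum import Enum
from typing import Dict


class PeerState(Enum):
    HEALTHY = "heartbeats arrive on every link"
    LINK_FAILURE = "peer is alive, but at least one link is silent"
    NODE_FAILURE = "no heartbeat on any link"


def classify_peer(last_seen: Dict[str, float], now: float, timeout: float) -> PeerState:
    """Classify a peer from the time of the last heartbeat seen on each link."""
    if not last_seen:
        raise ValueError("at least one link is required")
    alive = [link for link, seen in last_seen.items() if now - seen <= timeout]
    if len(alive) == len(last_seen):
        return PeerState.HEALTHY
    if alive:
        # A heartbeat still arrives on some link, so the peer itself is up.
        return PeerState.LINK_FAILURE
    return PeerState.NODE_FAILURE


if __name__ == "__main__":
    print(classify_peer({"link1": 10.0, "link2": 10.0}, now=11.0, timeout=2.0))
    print(classify_peer({"link1": 10.0, "link2": 5.0}, now=11.0, timeout=2.0))
    print(classify_peer({"link1": 5.0, "link2": 5.0}, now=11.0, timeout=2.0))
```

With a single link, the second and third cases would look the same. Even with two links, silence on both is only presumed to be a node failure, since both networks could fail at once.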

(3)Group Membership Services/Atomic Broadcast (GAB)
The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for
cluster membership and cluster communications.
◆ Cluster Membership
GAB maintains cluster membership by receiving input on the status of the heartbeat
from each node via LLT. When a system no longer receives heartbeats from a peer, it
marks the peer as DOWN and excludes the peer from the cluster. In most
configurations, the I/O fencing module is used to prevent network partitions.
◆ Cluster Communications
GAB’s second function is reliable cluster communications. GAB provides guaranteed
delivery of point-to-point and broadcast messages to all nodes.
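
Both functions can be modeled together in a short sketch. It is not the GAB protocol: `MembershipTracker` is a hypothetical class, rejoining after exclusion is not modeled, and the broadcast only shows which nodes receive a message rather than how delivery is guaranteed.

```python
from typing import Dict, Iterable, List, Set


class MembershipTracker:
    """Toy membership view driven by heartbeat timestamps (illustrative only)."""

    def __init__(self, nodes: Iterable[str], timeout: float, now: float = 0.0):
        self.timeout = timeout
        self.last_heartbeat: Dict[str, float] = {n: now for n in nodes}
        self.down: Set[str] = set()
        self.inbox: Dict[str, List[str]] = {n: [] for n in self.last_heartbeat}

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat; excluded peers stay excluded in this sketch."""
        if node in self.last_heartbeat and node not in self.down:
            self.last_heartbeat[node] = now

    def update(self, now: float) -> Set[str]:
        """Mark silent peers DOWN and return the current membership."""
        for node, seen in self.last_heartbeat.items():
            if now - seen > self.timeout:
                self.down.add(node)
        return set(self.last_heartbeat) - self.down

    def broadcast(self, message: str, now: float) -> Set[str]:
        """Deliver a message to every current member and return the recipients."""
        members = self.update(now)
        for node in members:
            self.inbox[node].append(message)
        return members


if __name__ == "__main__":
    tracker = MembershipTracker(["a", "b", "c"], timeout=2.0)
    tracker.heartbeat("a", 3.0)
    tracker.heartbeat("b", 3.0)  # node "c" stays silent
    print(sorted(tracker.update(4.0)))
    print(sorted(tracker.broadcast("config change", 4.0)))
```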

5.I/O Fencing Module
The I/O fencing module implements a quorum-type functionality to ensure only one
cluster survives a split of the private network. I/O fencing also provides the ability to
perform SCSI-III persistent reservations on failover. The shared VERITAS Volume
Manager disk groups offer complete protection against data corruption by nodes assumed
to be excluded from cluster membership.
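
The quorum idea can be sketched as a majority rule. The text does not describe how the fencing module arbitrates, so everything here is an assumption made for illustration: after a split, each partition is imagined to race for a set of "coordination points", and the names `surviving_partition` and `may_write` are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def surviving_partition(point_winners: List[str]) -> Optional[str]:
    """Return the partition that holds a strict majority of coordination points.

    point_winners[i] names the partition that claimed coordination point i.
    A strict majority can be held by at most one partition, so at most one
    partition survives; an even split yields no survivor.
    """
    if not point_winners:
        return None
    partition, count = Counter(point_winners).most_common(1)[0]
    return partition if count > len(point_winners) / 2 else None


def may_write(partition: str, point_winners: List[str]) -> bool:
    """Only the surviving partition keeps access to shared data."""
    return surviving_partition(point_winners) == partition


if __name__ == "__main__":
    winners = ["A", "A", "B"]  # partition A claimed two of three points
    print(surviving_partition(winners))
    print(may_write("A", winners), may_write("B", winners))
```

Using an odd number of points means a two-way split always produces exactly one winner, which is the property the quorum-type behavior depends on.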

