1.1. Viewing cluster status (clustat)
The clustat command displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services. The clustat command displays cluster status only from the viewpoint of the cluster system on which it is running.
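The commonly used -i option sets the refresh interval, which lets you watch the cluster's start and stop state transitions as they happen. For example, clustat -i 2 refreshes the clustat output every 2 seconds.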
 
1.2. cman management tool (cman_tool)
cman_tool is a program that manages the cluster management subsystem CMAN. cman_tool can be used to join the node to a cluster, leave the cluster, kill another cluster node or change the value of expected votes of a cluster. Be careful that you understand the consequences of the commands issued via cman_tool as they can affect all nodes in your cluster. Most of the time the cman_tool will only be invoked from your startup and shutdown scripts.
The output below shows when db1 was last fenced and which fence device was used.
 [root@db1 oradata]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M     96   2010-09-02 15:04:11  db1.fjnet114.com
    Last fenced:   2010-09-02 14:04:11  by ilo1
   2   M    100   2010-09-02 15:04:11  db2.fjnet114.com
-------------------------------------------------------------------------------
[root@db1 home]# cman_tool status
Version: 6.1.0
Config Version: 8
Cluster Name: new_cluster
Cluster Id: 23732
Cluster Member: Yes
Cluster Generation: 104
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1 
Active subsystems: 8
Flags: 2node Dirty
Ports Bound: 0 177 
Node name: db1.fjnet114.com
Node ID: 1
Multicast addresses: 239.192.92.17 // no multicast address was found on Red Hat 4
Node addresses: 192.168.114.102
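The fields in the status output above are easy to pick apart in a script. The following is a minimal sketch, assuming the "Name: value" layout shown above; cman_field and has_quorum are hypothetical helper names, not part of cman, and the quorum test simply compares the "Total votes" and "Quorum" fields.

```shell
#!/bin/sh
# cman_field NAME: read `cman_tool status` text on stdin and print the value
# of the field called NAME (hypothetical helper, not part of cman).
cman_field() {
    # Fields look like "Name: value"; match the name exactly, strip the
    # "Name: " prefix and any trailing whitespace, print the first match.
    awk -v key="$1" -F': ' '
        $1 == key { sub(/^[^:]*: /, ""); sub(/[ \t]+$/, ""); print; exit }'
}

# has_quorum: read `cman_tool status` text on stdin and succeed when
# "Total votes" is at least "Quorum" (assumed reading of the two fields).
has_quorum() {
    status=$(cat)
    total=$(printf '%s\n' "$status" | cman_field "Total votes")
    quorum=$(printf '%s\n' "$status" | cman_field "Quorum")
    [ -n "$total" ] && [ -n "$quorum" ] && [ "$total" -ge "$quorum" ]
}

# On a cluster node:
#   cman_tool status | cman_field "Cluster Name"
#   cman_tool status | has_quorum && echo "quorate"
```

Reading from stdin keeps the helpers usable on saved output as well as on a live node.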
1.3. Viewing fence/dlm status (group_tool)
The group_tool program displays the status of fence, dlm and gfs groups. The information is read from the groupd daemon which controls the fenced, dlm_controld and gfs_controld daemons. group_tool will also dump debug logs from various daemons.
This command is not available on Red Hat 4.
[root@db1 oradata]# group_tool ls
type             level name       id       state      
fence            0     default    00010001 none       
[1 2]
dlm              1     rgmanager  00020001 none       
[1 2]
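In the listing above both groups report the state "none". Assuming that "none" is what a settled group shows (as it does in this output), a small filter can point out any group that is still in transition. This is only a sketch: unsettled_groups is a hypothetical helper name, and it relies on the five-column layout shown above.

```shell
#!/bin/sh
# unsettled_groups: read `group_tool ls` output on stdin and print
# "type:name state" for every group whose state column is not "none".
# Assumes the five-column layout shown above; member lists such as
# "[1 2]" and the header row are skipped.
unsettled_groups() {
    awk 'NF == 5 && $1 != "type" && $5 != "none" { print $1 ":" $3 " " $5 }'
}

# On a cluster node:
#   group_tool ls | unsettled_groups
```

Empty output means every listed group is in the "none" state.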
1.4. Testing rgmanager resources (rg_test)
To see how the cluster monitors its resources, run rg_test rules; /usr/share/cluster holds the default monitoring scripts for a number of applications.
1. Display the resource rules that rg_test understands:
rg_test rules
Test a configuration (and /usr/share/cluster) for errors or redundant resource agents:
rg_test test /etc/cluster/cluster.conf
2. Display the start and stop ordering of a service. Display start order:
rg_test noop /etc/cluster/cluster.conf start service servicename
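This command is very useful when testing resource dependencies. Note that rg_test --help does not list the noop option. In my environment the output is as follows: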
[root@db1 oradata]# rg_test noop /etc/cluster/cluster.conf start service wbdb_service
Running in test mode.
Starting wbdb_service...
[start] service:wbdb_service
[start] fs:oradata
[start] fs:orabackup
[start] ip:192.168.114.108
[start] script:oracle
Start of wbdb_service complete
Display stop order:
rg_test noop /etc/cluster/cluster.conf stop service servicename
3. Explicitly start or stop a service.
Important: Only do this on one node, and always disable the service in rgmanager first.
Start a service:
rg_test test /etc/cluster/cluster.conf start service servicename
Stop a service:
rg_test test /etc/cluster/cluster.conf stop service servicename
4. Calculate and display the resource tree delta between two cluster.conf files. This shows the resource tree structure and the start/stop ordering of the two cluster configuration files.
rg_test delta <cluster.conf file 1> <cluster.conf file 2>
For example:
rg_test delta /etc/cluster/cluster.conf.bak /etc/cluster/cluster.conf
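The noop output shown in item 2 can be reduced to a plain ordered list of resources, which is handy when comparing the ordering before and after a configuration change. A minimal sketch, assuming the "[start] type:name" line format from the output above; noop_order is a hypothetical helper name, and the "[stop]" tag for the stop ordering is an assumption made by analogy, since no stop output is shown here.

```shell
#!/bin/sh
# noop_order [start|stop]: read `rg_test noop ...` output on stdin and print
# one "type:name" resource per line, in the order rg_test lists them.
# The "[start]" tag comes from the sample output; "[stop]" is assumed.
noop_order() {
    awk -v tag="[${1:-start}]" '$1 == tag { print $2 }'
}

# On a cluster node:
#   rg_test noop /etc/cluster/cluster.conf start service wbdb_service | noop_order start
```

For the sample output above this prints the service first, then the two file systems, the IP address and the oracle script, one per line.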
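1.5. Following the log live (tail -f)
This command is especially useful for watching the cluster log: it shows when the cluster mounts disks, moves the IP address, starts services, and so on.
Common command:
tail -f /var/log/messages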
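On a busy system the cluster messages can be hard to spot among everything else in the log. The sketch below narrows the stream to lines that mention cluster daemons. The daemon names in the pattern (clurgmgrd, fenced, openais, ccsd) are assumptions about what this release logs under, so adjust them to what your own log shows; cluster_log_filter is a hypothetical helper name.

```shell
#!/bin/sh
# cluster_log_filter: pass through only the log lines on stdin that mention
# one of the cluster daemons. The daemon list is an assumption; edit it to
# match the names that appear in your own /var/log/messages.
cluster_log_filter() {
    # --line-buffered makes grep print each match immediately, which is
    # what you want behind `tail -f`.
    grep --line-buffered -E 'clurgmgrd|fenced|openais|ccsd'
}

# On a cluster node:
#   tail -f /var/log/messages | cluster_log_filter
```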
1.6. Testing fence device configuration (fence_node/fence_drac/…)
Use the fence_node command to test the fence configuration; it reads the fence device settings from cluster.conf.
Common commands:
/sbin/fence_node db1.fjnet114.com
/sbin/fence_node db2.fjnet114.com
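For each type of fence device, Red Hat provides a matching tool such as fence_drac or fence_ilo, which can be run directly with the fence device parameters on the command line for testing. The -o option specifies the action to perform, such as reboot, off, on or status; see man fence_drac for details.
For example: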
[root@db2 ~]# fence_drac -a 192.168.114.106 -l admin -p wlhmbst@2008 -o status
status: on
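1.7. Manual cluster switchover (clusvcadm)
The clusvcadm command allows you to enable, disable, relocate, and restart high-availability services in a cluster. For more information about this tool, refer to the clusvcadm(8) man page.
There are many ways to test an RHCS switchover, such as unplugging a network cable or simulating a crash. During routine maintenance, however, a switchover should be done with the operation that disturbs the system least, and the clusvcadm command serves that purpose.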
[root@db2 /]# clusvcadm -r wbdb_service -m db2.fjnet114.com
Trying to relocate service:wbdb_service to db2.fjnet114.com...Success
service:wbdb_service is now running on db2.fjnet114.com
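For maintenance scripts it can help to wrap the relocate call so that the exact command can be reviewed before it is run. The following is a sketch only: relocate_service and the DRY_RUN variable are hypothetical names introduced here, while the -r and -m options are the ones used in the example above.

```shell
#!/bin/sh
# relocate_service SERVICE NODE: relocate SERVICE to NODE with clusvcadm.
# When DRY_RUN is set to a non-empty value the command is only printed.
relocate_service() {
    svc=$1
    node=$2
    if [ -z "$svc" ] || [ -z "$node" ]; then
        echo "usage: relocate_service SERVICE NODE" >&2
        return 2
    fi
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
    ${DRY_RUN:+echo} clusvcadm -r "$svc" -m "$node"
}

# Review first, then run for real:
#   DRY_RUN=1 relocate_service wbdb_service db2.fjnet114.com
#   relocate_service wbdb_service db2.fjnet114.com
```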

 
2. IP port usage

Port Number           Protocol   Component
5404, 5405            UDP        cman (Cluster Manager)
11111                 TCP        ricci (part of Conga remote agent)
14567                 TCP        gnbd (Global Network Block Device)
16851                 TCP        modclusterd (part of Conga remote agent)
21064                 TCP        dlm (Distributed Lock Manager)
50006, 50008, 50009   TCP        ccsd (Cluster Configuration System daemon)
50007                 UDP        ccsd (Cluster Configuration System daemon)
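If a firewall runs on the cluster nodes, these ports have to be reachable between them. The sketch below prints one iptables command per row of the port list above so that the rules can be reviewed before anything is applied. The rule shape (inserting into the INPUT chain with no source restriction) is an assumption, and print_cluster_fw_rules is a hypothetical helper name; in practice you would likely restrict the source to the cluster subnet.

```shell
#!/bin/sh
# print_cluster_fw_rules: print (not execute) one iptables command for each
# protocol/port row of the port list. The rule shape is an assumption.
print_cluster_fw_rules() {
    while read -r proto ports; do
        echo "iptables -I INPUT -p $proto -m multiport --dports $ports -j ACCEPT"
    done <<'EOF'
udp 5404,5405
tcp 11111
tcp 14567
tcp 16851
tcp 21064
tcp 50006,50008,50009
udp 50007
EOF
}

# Review the generated commands:
#   print_cluster_fw_rules
```

Printing instead of executing keeps the sketch safe to run anywhere; the output can be checked and then pasted into a root shell.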
3. Common fault analysis
If a node in your cluster is repeatedly getting fenced, it means that one of the nodes in your cluster is not seeing enough "heartbeat" network messages from the node that is getting fenced. Most of the time, this is a result of flaky or faulty hardware, such as bad cables or bad ports on the network hub or switch. Test your communications paths thoroughly without the cluster software running to make sure your hardware is working correctly.
If a node in your cluster is repeatedly getting fenced right at startup, it may be due to system activities that occur when a node joins a cluster. If your network is busy, your cluster may decide it is not getting enough heartbeat packets. To address this, you may have to increase the post_join_delay setting in cluster.conf, for example from 3 to 60.
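To check or try out a new post_join_delay value without touching the live file, something like the following can be used. It is a sketch under the assumption that the setting appears in cluster.conf as an attribute of the form post_join_delay="3"; both function names are hypothetical, and the second one only writes the modified configuration to stdout, leaving the input file unchanged.

```shell
#!/bin/sh
# get_post_join_delay FILE: print the first post_join_delay value in FILE.
# Assumes the attribute is written as post_join_delay="N".
get_post_join_delay() {
    sed -n 's/.*post_join_delay="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# set_post_join_delay FILE VALUE: print FILE to stdout with post_join_delay
# changed to VALUE. FILE itself is not modified.
set_post_join_delay() {
    case $2 in
        ''|*[!0-9]*) echo "VALUE must be a whole number" >&2; return 2 ;;
    esac
    sed 's/post_join_delay="[0-9][0-9]*"/post_join_delay="'"$2"'"/' "$1"
}

# Example:
#   get_post_join_delay /etc/cluster/cluster.conf
#   set_post_join_delay /etc/cluster/cluster.conf 60 > /tmp/cluster.conf.new
```

Writing to a separate file first makes it possible to compare the old and new configurations before the change is put in place.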