Modifying the HP MC/ServiceGuard Cluster Configuration

1. Error description

LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000043802000), from raw device 0x1f050100 (with priority: 0, and current flags: 0x40) to raw device 0x1f070100 (with priority: 1, and current flags: 0x0).

LVM: VG 64 0x020000: PVLink 31 0x050100 Failed! The PV is still accessible.

LVM: Performed a switch for Lun ID = 0 (pv = 0x000000004380e000), from raw device 0x1f050200 (with priority: 0, and current flags: 0x40) to raw device 0x1f070200 (with priority: 1, and current flags: 0x0).

LVM: VG 64 0x030000: PVLink 31 0x050200 Failed! The PV is still accessible.

LVM: VG 64 0x040000: Lost quorum.

This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:

<31 0x070300>

LVM: VG 64 0x040000: PVLink 31 0x070300 Failed! The PV is not accessible.

LVM: VG 64 0x020000: Lost quorum.

This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:

<31 0x070100>

LVM: VG 64 0x020000: PVLink 31 0x070100 Failed! The PV is not accessible.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 8192,

blkno: 2566056, sectno: 5132112, offset: 2627641344, bcount: 8192.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 2048,

blkno: 2098216, sectno: 4196432, offset: 2148573184, bcount: 2048.

SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 16384,

blkno: 1639328, sectno: 3278656, offset: 1678671872, bcount: 16384.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 4096,

blkno: 1598692, sectno: 3197384, offset: 1637060608, bcount: 4096.

SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 2048,

blkno: 1175672, sectno: 2351344, offset: 1203888128, bcount: 2048.

SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 4096,

blkno: 854688, sectno: 1709376, offset: 875200512, bcount: 4096.

SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 3072,

blkno: 830056, sectno: 1660112, offset: 849977344, bcount: 3072.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 8192,

blkno: 499248, sectno: 998496, offset: 511229952, bcount: 8192.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 8192,

blkno: 499256, sectno: 998512, offset: 511238144, bcount: 8192.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 4096,

blkno: 293776, sectno: 587552, offset: 300826624, bcount: 4096.

SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 4096,

blkno: 293792, sectno: 587584, offset: 300843008, bcount: 4096.

LVM: VG 64 0x030000: Lost quorum.

This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:

<31 0x070200>

LVM: VG 64 0x030000: PVLink 31 0x070200 Failed! The PV is not accessible.

LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000048238000), from raw device 0x1f070300 (with priority: 1, and current flags: 0x80) to raw device 0x1f050300 (with priority: 0, and current flags: 0x80).

LVM: VG 64 0x040000: Reestablished quorum.

LVM: VG 64 0x040000: PVLink 31 0x050300 Recovered.

LVM: VG 64 0x040000: PVLink 31 0x070300 Recovered.

LVM: VG 64 0x030000: Reestablished quorum.

LVM: VG 64 0x030000: PVLink 31 0x050200 Recovered.

LVM: VG 64 0x030000: PVLink 31 0x070200 Recovered.

LVM: VG 64 0x020000: Reestablished quorum.

LVM: VG 64 0x020000: PVLink 31 0x050100 Recovered.

LVM: VG 64 0x020000: PVLink 31 0x070100 Recovered.

scp4[/var/opt/resmon/log]#tail -20000 /var/adm/syslog/syslog.log | grep -i war

Jun 2 08:17:03 scp4 cmcld: Warning: cmcld process was unable to run for the last 2 seconds,

Jun 2 17:49:03 scp4 cmcld: Warning: cmcld process was unable to run for the last 2 seconds,

Jun 2 21:41:02 scp4 cmcld: Warning: cmcld process was unable to run for the last 2 seconds,
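The `grep -i war` filter above matches any line containing "war", so it can pick up unrelated messages. A minimal sketch of a narrower check that counts the cmcld warnings per day; `count_cmcld_warnings` is a hypothetical helper name, and the pattern assumes the message text is exactly as shown in the log excerpt above.

```shell
# Count "cmcld process was unable to run" warnings per day in a syslog file.
# Usage: count_cmcld_warnings /var/adm/syslog/syslog.log
count_cmcld_warnings() {
    grep 'cmcld: Warning: cmcld process was unable to run' "$1" |
        # $1 and $2 are the month and day fields of a syslog line
        awk '{ count[$1 " " $2]++ } END { for (d in count) print d, count[d] }' |
        sort
}
```

A rising daily count suggests the timeout is being hit more often and the change described below is becoming more urgent.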

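2. Cause analysis

The two cluster nodes share a heartbeat timeout, the NODE_TIMEOUT parameter in the cluster configuration file cmcluster.asc. It is currently set to 2000000 microseconds, i.e. 2 seconds. Whenever the cmcld processes on the two nodes cannot communicate normally for 2 seconds, warnings like those shown above are written to syslog.log.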
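To confirm the current value before changing anything, the parameter can be read straight from the configuration file. A minimal sketch, assuming the file contains a single whitespace-separated `NODE_TIMEOUT <microseconds>` line; `show_node_timeout` is a hypothetical helper name.

```shell
# Print NODE_TIMEOUT from a cluster configuration file in microseconds and seconds.
# Usage: show_node_timeout /etc/cmcluster/cmcluster.asc
show_node_timeout() {
    # Assumed line format: NODE_TIMEOUT <value in microseconds>
    awk '$1 == "NODE_TIMEOUT" { printf "%d us = %.1f s\n", $2, $2 / 1000000 }' "$1"
}
```

With the current setting this prints `2000000 us = 2.0 s`.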

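3. Risks

If NODE_TIMEOUT is too small, the cluster is more likely to reform, which means one machine may reboot. If the cmcld processes on the two nodes cannot communicate normally within a bounded window (2 × NODE_TIMEOUT + HEARTBEAT_INTERVAL), the cluster reforms: one machine panics and the other takes over all resources. No reformation has actually happened so far, because the cmcld processes on the two nodes were still able to communicate within that window. Increasing NODE_TIMEOUT effectively reduces the chance of a reformation, so we recommend changing it to 8 seconds (8000000 microseconds).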
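The effect of the proposed change on that window can be worked out with simple arithmetic. A rough sketch, reading the expression in the text as 2 × NODE_TIMEOUT + HEARTBEAT_INTERVAL and assuming HEARTBEAT_INTERVAL is 1000000 microseconds (this value is not given in the text; check it in cmcluster.asc). `reform_window_us` is a hypothetical helper name.

```shell
# Compute the window, in microseconds, from NODE_TIMEOUT and HEARTBEAT_INTERVAL.
# $1 = NODE_TIMEOUT, $2 = HEARTBEAT_INTERVAL (both in microseconds)
reform_window_us() {
    echo $(( 2 * $1 + $2 ))
}

reform_window_us 2000000 1000000   # current setting:  5000000 (5 s)
reform_window_us 8000000 1000000   # proposed setting: 17000000 (17 s)
```

Under these assumptions the window grows from 5 to 17 seconds, which also means a genuine node failure takes longer to detect.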

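4. Implementation steps

Option 1: modify the configuration file and restart the cluster

1. Edit the configuration file cmcluster.asc and change the value of NODE_TIMEOUT to 8000000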

#cd /etc/cmcluster

#vi cmcluster.asc

2. Check the configuration file

#cmcheckconf -C cmcluster.asc

3. Halt the cluster

#cmhaltcl -f -v

4. Apply the configuration file

#cmapplyconf -C cmcluster.asc

5. Start the cluster

#cmruncl -v
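The edit in step 1 can also be made without an interactive editor, which makes it easier to repeat and to roll back. A minimal sketch, assuming NODE_TIMEOUT appears as a whitespace-separated `NODE_TIMEOUT <value>` line; `set_node_timeout` is a hypothetical helper, not a Serviceguard command. It uses awk with a backup copy rather than `sed -i`, which not every sed supports.

```shell
# Set NODE_TIMEOUT in a cluster configuration file, keeping a backup copy.
# Usage: set_node_timeout cmcluster.asc 8000000
set_node_timeout() {
    file=$1
    value=$2
    # Keep the original as <file>.bak so the change can be reverted
    cp "$file" "$file.bak" || return 1
    # Rewrite only the NODE_TIMEOUT line; pass every other line through unchanged
    awk -v v="$value" '
        $1 == "NODE_TIMEOUT" { print "NODE_TIMEOUT\t\t" v; next }
        { print }
    ' "$file.bak" > "$file"
}
```

After running `set_node_timeout cmcluster.asc 8000000` in /etc/cmcluster, continue with the cmcheckconf check and the remaining steps listed above.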

If conditions permit, the following cluster software patch can also be installed:

Download PHSS_32656 (15975K bytes)

Installation steps:

1. Back up the system before installing

2. Log in as root

3. Copy the patch to /tmp

4. cd /tmp

sh PHSS_34391

5. Run swinstall to install the patch:

swinstall -x autoreboot=true -x patch_match_target=true -s /tmp/PHSS_34391.depot

If the patch has dependent patches, download it in tar format, FTP the tar file to the server, and run:

# cd /tmp/patch

# tar xvf pathname.tar

# ./create_depot_hpux.11.31

# swinstall -s /tmp/patch/depot

Patching a production system carries risk; we recommend applying the cluster configuration change first.


Source: ITPUB blog, http://blog.itpub.net/9479798/viewspace-1050069/. Please credit the source when reposting; otherwise legal liability will be pursued.
