1. Error description
LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000043802000), from raw device 0x1f050100 (with priority: 0, and current flags: 0x40) to raw device 0x1f070100 (with priority: 1, and current flags: 0x0).
LVM: VG 64 0x020000: PVLink 31 0x050100 Failed! The PV is still accessible.
LVM: Performed a switch for Lun ID = 0 (pv = 0x000000004380e000), from raw device 0x1f050200 (with priority: 0, and current flags: 0x40) to raw device 0x1f070200 (with priority: 1, and current flags: 0x0).
LVM: VG 64 0x030000: PVLink 31 0x050200 Failed! The PV is still accessible.
LVM: VG 64 0x040000: Lost quorum.
This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:
<31 0x070300>
LVM: VG 64 0x040000: PVLink 31 0x070300 Failed! The PV is not accessible.
LVM: VG 64 0x020000: Lost quorum.
This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:
<31 0x070100>
LVM: VG 64 0x020000: PVLink 31 0x070100 Failed! The PV is not accessible.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 8192,
blkno: 2566056, sectno: 5132112, offset: 2627641344, bcount: 8192.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 2048,
blkno: 2098216, sectno: 4196432, offset: 2148573184, bcount: 2048.
SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 16384,
blkno: 1639328, sectno: 3278656, offset: 1678671872, bcount: 16384.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 4096,
blkno: 1598692, sectno: 3197384, offset: 1637060608, bcount: 4096.
SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 2048,
blkno: 1175672, sectno: 2351344, offset: 1203888128, bcount: 2048.
SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 4096,
blkno: 854688, sectno: 1709376, offset: 875200512, bcount: 4096.
SCSI: Write error -- dev: b 31 0x070100, errno: 126, resid: 3072,
blkno: 830056, sectno: 1660112, offset: 849977344, bcount: 3072.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 8192,
blkno: 499248, sectno: 998496, offset: 511229952, bcount: 8192.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 8192,
blkno: 499256, sectno: 998512, offset: 511238144, bcount: 8192.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 4096,
blkno: 293776, sectno: 587552, offset: 300826624, bcount: 4096.
SCSI: Async write error -- dev: b 31 0x070100, errno: 126, resid: 4096,
blkno: 293792, sectno: 587584, offset: 300843008, bcount: 4096.
LVM: VG 64 0x030000: Lost quorum.
This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:
<31 0x070200>
LVM: VG 64 0x030000: PVLink 31 0x070200 Failed! The PV is not accessible.
LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000048238000), from raw device 0x1f070300 (with priority: 1, and current flags: 0x80) to raw device 0x1f050300 (with priority: 0, and current flags: 0x80).
LVM: VG 64 0x040000: Reestablished quorum.
LVM: VG 64 0x040000: PVLink 31 0x050300 Recovered.
LVM: VG 64 0x040000: PVLink 31 0x070300 Recovered.
LVM: VG 64 0x030000: Reestablished quorum.
LVM: VG 64 0x030000: PVLink 31 0x050200 Recovered.
LVM: VG 64 0x030000: PVLink 31 0x070200 Recovered.
LVM: VG 64 0x020000: Reestablished quorum.
LVM: VG 64 0x020000: PVLink 31 0x050100 Recovered.
LVM: VG 64 0x020000: PVLink 31 0x070100 Recovered.
scp4[/var/opt/resmon/log]#tail -20000 /var/adm/syslog/syslog.log | grep -i war
Jun 2 08:17:03 scp4 cmcld: Warning: cmcld process was unable to run for the last 2 seconds,
Jun 2 17:49:03 scp4 cmcld: Warning: cmcld process was unable to run for the last 2 seconds,
Jun 2 21:41:02 scp4 cmcld: Warning: cmcld process was unable to run for the last 2 seconds,
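To see how often the warning recurs, the grep above can be extended to count hits per day. This is only a sketch: the function name `count_cmcld_warnings` is made up for illustration, and it assumes the standard syslog layout in which the first two fields are the month and the day.

```shell
# Hypothetical helper: count cmcld "unable to run" warnings per day.
# Reads syslog-format lines on stdin and prints "<month> <day> <count>".
count_cmcld_warnings() {
    grep 'cmcld: Warning: cmcld process was unable to run' |
        awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' |
        sort
}

# Example, using the log path from the session above:
# tail -20000 /var/adm/syslog/syslog.log | count_cmcld_warnings
```

A count that grows from day to day would suggest the nodes are getting closer to an actual timeout.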
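2. Cause analysis
The two cluster nodes share a heartbeat timeout, the NODE_TIMEOUT parameter in the cluster configuration file cmcluster.asc. It is currently set to 2000000 microseconds, i.e. 2 seconds. If the cmcld processes on the two nodes cannot communicate normally within 2 seconds, warnings like those above are written to syslog.log.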
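The current value can be read straight from the configuration file and converted to seconds. A minimal sketch, assuming the parameter sits on a line of the form `NODE_TIMEOUT <microseconds>`; the function name `node_timeout_seconds` is hypothetical.

```shell
# Hypothetical helper: print NODE_TIMEOUT from a cluster ASCII file, in seconds.
# Assumes a line "NODE_TIMEOUT <microseconds>"; comment lines starting
# with "#" are skipped because their first field is not NODE_TIMEOUT.
node_timeout_seconds() {
    awk '$1 == "NODE_TIMEOUT" { printf "%g\n", $2 / 1000000 }' "$1"
}

# Example, using the directory and file name from this note:
# node_timeout_seconds /etc/cmcluster/cmcluster.asc
```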
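3. Risk
The risk of a NODE_TIMEOUT that is too small is a higher chance of cluster re-formation, which means one machine may reboot. If the cmcld processes on the two nodes cannot communicate normally within a limited time (2 timeouts of NODE_TIMEOUT + HEARTBEAT_INTERVAL), the cluster re-forms: one machine panics and the other takes over all resources. So far no re-formation has actually occurred, because the cmcld processes were still able to communicate within that time. Raising NODE_TIMEOUT effectively reduces the chance of re-formation, so we recommend changing it to 8 seconds (8000000 microseconds).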
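The effect of the change on that time window can be worked out with simple arithmetic. The sketch below rests on two assumptions: it reads the formula as 2 × NODE_TIMEOUT + HEARTBEAT_INTERVAL, and it uses 1000000 microseconds for HEARTBEAT_INTERVAL as a placeholder, since this note does not give the value configured on the cluster. The function name `reform_window_us` is hypothetical.

```shell
# Hypothetical helper: time window before re-formation, in microseconds.
# ASSUMPTION: window = 2 * NODE_TIMEOUT + HEARTBEAT_INTERVAL.
reform_window_us() {
    echo $(( 2 * $1 + $2 ))
}

HEARTBEAT_INTERVAL=1000000   # placeholder, not taken from this cluster

reform_window_us 2000000 "$HEARTBEAT_INTERVAL"   # prints 5000000  (5 s)
reform_window_us 8000000 "$HEARTBEAT_INTERVAL"   # prints 17000000 (17 s)
```

Under these assumptions the window grows from 5 seconds to 17 seconds, which is also how much longer a real node failure would take to be detected.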
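4. Implementation steps
Option 1: modify the configuration file and restart the cluster
1. Edit the configuration file cmcluster.asc and change the value of NODE_TIMEOUT to 8000000
#cd /etc/cmcluster
#vi cmcluster.asc
2. Check the configuration file
#cmcheckconf -C cmcluster.asc
3. Halt the cluster
#cmhaltcl -f -v
4. Apply the configuration file
#cmapplyconf -C cmcluster.asc
5. Start the cluster
#cmruncl -v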
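The manual edit in step 1 could also be scripted so that the original file is left untouched. This is a sketch, not a tested procedure: the function name `set_node_timeout` and the output file name are made up, and it assumes NODE_TIMEOUT appears at the start of an uncommented line followed by whitespace and a number.

```shell
# Hypothetical helper: write a copy of a cluster ASCII file with a new NODE_TIMEOUT.
# Usage: set_node_timeout <input file> <microseconds> <output file>
# Only an uncommented "NODE_TIMEOUT <number>" line is rewritten; the input
# file is not modified, so it doubles as the backup.
set_node_timeout() {
    sed "s/^\([[:space:]]*NODE_TIMEOUT[[:space:]][[:space:]]*\)[0-9][0-9]*/\1$2/" "$1" > "$3"
}

# Example (cmcluster.asc.new is an illustrative name):
# set_node_timeout cmcluster.asc 8000000 cmcluster.asc.new
# cmcheckconf -C cmcluster.asc.new
```

Running cmcheckconf on the new file before halting the cluster keeps the downtime limited to steps 3 to 5.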
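If conditions allow, the following cluster software patch can also be installed
Download PHSS_32656 (15975K bytes)
Installation steps:
1. Back up the system before installing
2. Log in as root
3. Copy the patch to /tmp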
4. cd /tmp
sh PHSS_34391
5. Run swinstall to install the patch:
swinstall -x autoreboot=true -x patch_match_target=true -s /tmp/PHSS_34391.depot
If the patch has dependencies, download it in tar format, ftp the tar package to the server, and run:
# cd /tmp/patch
# tar xvf pathname.tar
# ./create_depot_hpux.11.31
# swinstall -s /tmp/patch/depot
Patching a production system carries risk, so we recommend applying the cluster configuration change first.
Source: ITPUB blog, http://blog.itpub.net/9479798/viewspace-1050069/. Please credit the source when reprinting.