environment:
OS: HP-UX B.11.31 U ia64
symptoms:
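During a UPS discharge test in the server room on New Year's Day, a vendor's oversight cut power to the room, and one of my standby DBs for MES data (an HP RX2660 server) lost power and rebooted.
Since then, /var/adm/syslog/syslog.log has reported an error at around 10:30 every morning (power supply failed).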
syslog
Jan 3 10:09:19 sfcstb1 telnetd[27751]: Time out occurred in the initial option negotiation
Jan 3 10:33:21 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286145 -a
Jan 4 10:33:23 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286146 -a
Jan 5 10:33:27 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286147 -a
Jan 5 14:10:28 sfcstb1 su: + tb johnz-oracle
Jan 5 16:05:50 sfcstb1 su: - ta xiaofan-oracle
Jan 5 16:05:59 sfcstb1 su: + ta xiaofan-oracle
Jan 5 16:07:09 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 5 16:27:35 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 6 08:26:41 sfcstb1 su: + ta xiaofan-oracle
Jan 6 10:33:30 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286148 -a
Jan 6 15:30:54 sfcstb1 su: - ta xiaofan-root
Jan 6 15:31:02 sfcstb1 su: + ta xiaofan-root
Jan 6 15:47:01 sfcstb1 su: + tc xiaofan-root
Jan 6 15:47:44 sfcstb1 su: + tc xiaofan-root
Jan 7 10:33:33 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286149 -a
Jan 8 10:33:36 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286150 -a
Jan 9 08:32:25 sfcstb1 su: + ta xiaofan-oracle
Jan 9 08:33:41 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 9 10:33:39 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286151 -a
Jan 10 08:53:22 sfcstb1 su: + ta xiaofan-oracle
Jan 10 08:53:58 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 10 08:58:46 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 10 08:59:00 sfcstb1 above message repeats 2 times
Jan 10 10:33:42 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286152 -a
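To make the daily pattern easier to see, the EMS notifications can be pulled out of the surrounding su and telnetd noise. This is a minimal sketch, assuming the syslog line layout shown above; the function name ems_event_times is hypothetical and not part of EMS.

```shell
#!/bin/sh
# Print "Mon DD HH:MM:SS" for every EMS core_hw notification in a syslog file.
# ems_event_times is an illustrative helper name, not an HP-UX command.
ems_event_times() {
    grep 'EMS Event Notification' "$1" \
        | grep 'core_hw' \
        | awk '{ print $1, $2, $3 }'
}
```

Calling `ems_event_times /var/adm/syslog/syslog.log | awk '{ print substr($3, 1, 5) }' | sort | uniq -c` then counts the notifications per HH:MM, so a reminder that fires at the same minute every day stands out.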
sfcstb1:/tmp# /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286152 -a
CURRENT MONITOR DATA:
Event Time..........: Tue Jan 10 10:33:41 2012
Severity............: CRITICAL
Monitor.............: ia64_corehw
Event #.............: 103001
System..............: sfcstb1
Summary:
Power Supply : Failure is detected.
Description of Error:
The system has detected that one of the power supplies has failed.
Probable Cause / Recommended Action:
The power supply has failed. Contact your HP support representative to
check the power supply.
For information on the sensor that generated this event, refer to
FRU ID in Event Details section.
Additional Event Data:
System IP Address...: 172.16.51.151
Event Id............: 103001820120110103336
Monitor Version.....: C.04.00.05
Event Class.........: System
Client Configuration File............:
/var/stm/config/tools/monitor/default_ia64_corehw.clcfg
Client Configuration File Version....: A.01.00
Qualification criteria met.
Number of events: 1
Associated OS error log entry id(s)
None
Additional System Data:
System Model Number.............: ia64 hp server rx2660
EMS Version.....................: A.04.20
STM Version.....................: NA
System Serial Number............: SGH4843041
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/ia64_corehw.htm#E103001
v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v
Event Details :
Event Date ...................: Mon Jan 2 17:02:49 2012
Sensor Number .................: 0x41
Sensor Type ...................: Power Supply
Sensor Class ..................: Sensor specific
Sensor Reading/Offset .........: 0x1 (Sensor Reading)
Event Type ...................: Assertion
Entity ID .....................: 0xa
Generic Message ...............:
Power Supply Failure detected
Entity FRU Id Info ............: (Sensor ID 0())
Error Details:
Additional information on this event can be obtained from evweb
logviewer (Refer SFM User Guide) with the following log id: 271804
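Each EMS notification in syslog embeds the exact resdata command for that event. The following is a small sketch that extracts the command from the most recent notification so it can be reviewed and run by hand. It assumes the "Execute the following command to obtain event details:" wording shown in the log above; last_resdata_cmd is a hypothetical helper name.

```shell
#!/bin/sh
# Print the resdata command embedded in the last EMS notification of a syslog file.
# last_resdata_cmd is an illustrative helper name, not an HP-UX command.
last_resdata_cmd() {
    sed -n 's|.*obtain event details: \(/opt/resmon/bin/resdata .* -a\).*|\1|p' "$1" \
        | tail -n 1
}
```

For example, `last_resdata_cmd /var/adm/syslog/syslog.log` only prints the command; it does not execute it.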
-------------------------------------------------------------------------------------------------------------------------------------------
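However, checking on site in the server room, or connecting to the MP management port through the COM port, shows that both power supplies are normal and nothing is wrong.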
-------------------------------------------------------------------------------------------------
Power supplies State
-----------------------------------------------------------
Power Supply 1 Normal
Power Supply 2 Normal
Fans State Fans State
-------------------------------------------------------------------------------
Fan 1 (Mem) Normal Fan 7 (CPU) Normal
Fan 2 (Mem) Normal Fan 8 (CPU) Normal
Fan 3 (Mem) Normal Fan 9 (I/O) Normal
Fan 4 (Mem) Normal Fan 10 (I/O) Normal
Fan 5 (CPU) Normal Fan 11 (I/O) Normal
Fan 6 (CPU) Normal Fan 12 (I/O) Normal
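This was puzzling, so I called the 800 number for HP technical support. Their first suggestion was to check whether the power cabling was faulty
and whether the UPS supply was abnormal. The SA confirmed that the UPS supply was fine, so I replaced the power outlet and the power cord, but the periodic false alarm persisted.
The only change was that it now fired at 14:33, the time at which I had swapped the power connection.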
------------------------------------------------------------ Errors logged when the power outlet was replaced --------------------------------------------------------------------------------
Jan 10 14:30:37 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286155 -a
Jan 10 14:33:43 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286156 -a
-------------------------------------------------------------------------------------------------------------------------------------------
Jan 11 08:23:52 sfcstb1 su: + ta xiaofan-oracle
Jan 11 14:33:46 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286157 -a
Jan 12 08:45:56 sfcstb1 su: + ta xiaofan-oracle
Jan 12 08:47:00 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 12 08:54:31 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 12 08:55:07 sfcstb1 syslog: rm_log_init: fopen of file /etc/opt/resmon/log/client.log failed: Permission denied
Jan 12 08:59:00 sfcstb1 above message repeats 3 times
Jan 12 13:56:58 sfcstb1 su: + ta xiaofan-oracle
Jan 12 14:33:49 sfcstb1 EMS [5879]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 385286155 -r /system/events/ia64_corehw/core_hw -n 385286158 -a
--------------------------------------------------------------------------------------------------------------------------------------------------
The MP still shows both power supplies as Normal:
Power supplies State
-----------------------------------------------------------
Power Supply 1 Normal
Power Supply 2 Normal
Fans State Fans State
-------------------------------------------------------------------------------
Fan 1 (Mem) Normal Fan 7 (CPU) Normal
Fan 2 (Mem) Normal Fan 8 (CPU) Normal
Fan 3 (Mem) Normal Fan 9 (I/O) Normal
Fan 4 (Mem) Normal Fan 10 (I/O) Normal
Fan 5 (CPU) Normal Fan 11 (I/O) Normal
Fan 6 (CPU) Normal Fan 12 (I/O) Normal
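I contacted HP technical support again. This time their explanation was that the HP-UX SFM (System Fault Management) cache had recorded the power supply failure but was never refreshed,
so the stale event is reported to syslog.log at the same time every day. This matches the resdata output above, where the Event Date in the details is still Jan 2 17:02:49 although the notification itself is dated Jan 10. The solution they gave was to refresh the SFM cache manually or to upgrade SFM.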
solution:
1. Manually refresh the cache
Disable the SFM provider module:
# cimprovider -d -m SFMProviderModule
Remove the files /var/opt/sfm/data/reminderEvent.dat and /var/opt/sfm/data/MemoryErrorCache.dat.
Enable the SFM provider module:
# cimprovider -e -m SFMProviderModule
2. Upgrade SFM to a newer version
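The manual refresh in step 1 can be wrapped in a small script so the three actions always run in the same order. This is only a sketch under a few assumptions: it must be run as root, the DRY_RUN and SFM_DATA_DIR variables and the function names are my own, and it renames the two cache files with a .bak suffix instead of deleting them, which is an extra precaution and not part of HP's advice.

```shell
#!/bin/sh
# Sketch of the manual SFM cache refresh. Run as root.
# DRY_RUN and SFM_DATA_DIR are illustrative variables, not SFM settings.
SFM_DATA_DIR="${SFM_DATA_DIR:-/var/opt/sfm/data}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

refresh_sfm_cache() {
    # 1. Disable the SFM provider module before touching its cache files.
    run cimprovider -d -m SFMProviderModule || return 1

    # 2. Move the stale cache files aside (assumption: keep a .bak copy
    #    rather than removing them outright).
    for f in reminderEvent.dat MemoryErrorCache.dat; do
        if [ -f "$SFM_DATA_DIR/$f" ]; then
            run mv "$SFM_DATA_DIR/$f" "$SFM_DATA_DIR/$f.bak"
        fi
    done

    # 3. Re-enable the SFM provider module.
    run cimprovider -e -m SFMProviderModule
}
```

Running `DRY_RUN=1 refresh_sfm_cache` first prints the commands without executing them; calling `refresh_sfm_cache` on its own performs the refresh. Afterwards it is worth watching syslog.log at the usual time to confirm that the reminder no longer appears.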