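Key Process and Root Cause Analysis
BMC SEL log
The log shows that the BMC did raise an alarm when slot 21 first failed, but the alarm was cleared afterwards.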
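RAID controller log analysis:
On 7/27, the drive in slot 21 failed: after a command timeout it was reset and then removed, and the drive was marked FAILED, which also generated the corresponding BMC alarm.
At around 16:30 on 7/28, the RAID controller underwent an abnormal reset.
In the drive discovery log that follows the reset (below), there is no insert record for slot 21, which indicates that the initialization discovery process could no longer find the slot 21 drive.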
1716: 16-07-28,16:34:27 Info:Inserted: PD 01(e0x00/s38)
1717: 16-07-28,16:34:27 Info:Inserted: PD 01(e0x00/s38) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=5000c500964774ad,0000000000000000
1718: 16-07-28,16:34:27 Info:Inserted: PD 02(e0x00/s39)
1719: 16-07-28,16:34:27 Info:Inserted: PD 02(e0x00/s39) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=5000c500964c028d,0000000000000000
1720: 16-07-28,16:34:27 Info:Inserted: PD 03(e0x00/s8)
1721: 16-07-28,16:34:27 Info:Inserted: PD 03(e0x00/s8) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027388,0000000000000000
1722: 16-07-28,16:34:27 Info:Inserted: PD 04(e0x00/s17)
1723: 16-07-28,16:34:27 Info:Inserted: PD 04(e0x00/s17) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027391,0000000000000000
1724: 16-07-28,16:34:27 Info:Inserted: PD 05(e0x00/s10)
1725: 16-07-28,16:34:27 Info:Inserted: PD 05(e0x00/s10) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738a,0000000000000000
1726: 16-07-28,16:34:27 Info:Inserted: PD 06(e0x00/s14)
1727: 16-07-28,16:34:27 Info:Inserted: PD 06(e0x00/s14) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738e,0000000000000000
1728: 16-07-28,16:34:27 Info:Inserted: PD 07(e0x00/s13)
1729: 16-07-28,16:34:27 Info:Inserted: PD 07(e0x00/s13) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738d,0000000000000000
1730: 16-07-28,16:34:27 Info:Inserted: PD 08(e0x00/s4)
1731: 16-07-28,16:34:27 Info:Inserted: PD 08(e0x00/s4) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027384,0000000000000000
1732: 16-07-28,16:34:27 Info:Inserted: PD 0a(e0x00/s9)
1733: 16-07-28,16:34:27 Info:Inserted: PD 0a(e0x00/s9) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027389,0000000000000000
1734: 16-07-28,16:34:27 Info:Inserted: PD 0b(e0x00/s0)
1735: 16-07-28,16:34:27 Info:Inserted: PD 0b(e0x00/s0) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027380,0000000000000000
1736: 16-07-28,16:34:27 Info:Inserted: PD 0c(e0x00/s6)
1737: 16-07-28,16:34:27 Info:Inserted: PD 0c(e0x00/s6) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027386,0000000000000000
1738: 16-07-28,16:34:27 Info:Inserted: PD 0d(e0x00/s2)
1739: 16-07-28,16:34:27 Info:Inserted: PD 0d(e0x00/s2) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027382,0000000000000000
1740: 16-07-28,16:34:27 Info:Inserted: PD 0e(e0x00/s19)
1741: 16-07-28,16:34:27 Info:Inserted: PD 0e(e0x00/s19) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027393,0000000000000000
1742: 16-07-28,16:34:27 Info:Inserted: PD 0f(e0x00/s11)
1743: 16-07-28,16:34:27 Info:Inserted: PD 0f(e0x00/s11) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738b,0000000000000000
1744: 16-07-28,16:34:27 Info:Inserted: PD 10(e0x00/s7)
1745: 16-07-28,16:34:27 Info:Inserted: PD 10(e0x00/s7) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027387,0000000000000000
1746: 16-07-28,16:34:27 Info:Inserted: PD 11(e0x00/s12)
1747: 16-07-28,16:34:27 Info:Inserted: PD 11(e0x00/s12) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738c,0000000000000000
1748: 16-07-28,16:34:27 Info:Inserted: PD 12(e0x00/s1)
1749: 16-07-28,16:34:27 Info:Inserted: PD 12(e0x00/s1) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027381,0000000000000000
1750: 16-07-28,16:34:27 Info:Inserted: PD 13(e0x00/s3)
1751: 16-07-28,16:34:27 Info:Inserted: PD 13(e0x00/s3) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027383,0000000000000000
1752: 16-07-28,16:34:27 Info:Inserted: PD 14(e0x00/s5)
1753: 16-07-28,16:34:27 Info:Inserted: PD 14(e0x00/s5) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027385,0000000000000000
1754: 16-07-28,16:34:27 Info:Inserted: PD 15(e0x00/s16)
1755: 16-07-28,16:34:27 Info:Inserted: PD 15(e0x00/s16) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027390,0000000000000000
1756: 16-07-28,16:34:27 Info:Inserted: PD 16(e0x00/s15)
1757: 16-07-28,16:34:27 Info:Inserted: PD 16(e0x00/s15) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738f,0000000000000000
1758: 16-07-28,16:34:27 Info:Inserted: PD 17(e0x00/s18)
1759: 16-07-28,16:34:27 Info:Inserted: PD 17(e0x00/s18) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027392,0000000000000000
1760: 16-07-28,16:34:27 Info:Inserted: PD 18(e0x00/s20)
1761: 16-07-28,16:34:27 Info:Inserted: PD 18(e0x00/s20) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027394,0000000000000000
1762: 16-07-28,16:34:27 Info:Inserted: PD 19(e0x00/s23)
1763: 16-07-28,16:34:27 Info:Inserted: PD 19(e0x00/s23) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027397,0000000000000000
1764: 16-07-28,16:34:27 Info:Inserted: PD 1a(e0x00/s22)
1765: 16-07-28,16:34:27 Info:Inserted: PD 1a(e0x00/s22) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027396,000000000000000
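The missing-slot check on the excerpt above can also be done mechanically. The following is a minimal sketch, assuming the log lines keep the `Inserted: PD xx(e0x00/sNN)` format shown here; the function names and the list of expected slots are hypothetical and would need to be adapted to the actual chassis layout.

```python
import re

# Matches the slot number in entries such as "Info:Inserted: PD 01(e0x00/s38)".
# The pattern is an assumption based only on the log excerpt above.
INSERT_RE = re.compile(r"Inserted: PD [0-9a-f]+\(e0x[0-9a-f]+/s(\d+)\)")


def inserted_slots(log_lines):
    """Return the set of slot numbers that have an Inserted record."""
    slots = set()
    for line in log_lines:
        match = INSERT_RE.search(line)
        if match:
            slots.add(int(match.group(1)))
    return slots


def missing_slots(log_lines, expected_slots):
    """Return the expected slots with no Inserted record, in ascending order."""
    return sorted(set(expected_slots) - inserted_slots(log_lines))


if __name__ == "__main__":
    # Three entries copied from the excerpt above.
    sample = [
        "1760: 16-07-28,16:34:27 Info:Inserted: PD 18(e0x00/s20)",
        "1762: 16-07-28,16:34:27 Info:Inserted: PD 19(e0x00/s23)",
        "1764: 16-07-28,16:34:27 Info:Inserted: PD 1a(e0x00/s22)",
    ]
    # Hypothetical expectation: slots 20 to 23 should each hold a drive.
    print(missing_slots(sample, range(20, 24)))  # prints [21]
```

Run against the full post-reset log with the real list of populated slots, a check like this would flag slot 21 even though the controller itself raises no alarm.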
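Because slot 21 was configured as a single-drive RAID 0, when the drive cannot be found during initialization the RAID controller has no way of knowing whether a RAID group ever existed there. In this situation the controller cannot tell whether the drive was deliberately pulled (a normal maintenance operation) or has failed and is no longer recognized, so it raises no further alarm.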