DS4700 Controller Reboot: Root Cause Analysis
Version history:
1.0 | Initial draft | 2017/5/10 |
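Note: the content of this document is taken from official IBM manuals and may be used as a recommendation.
Chapter 1 Environment
1.1 Controller firmware
The controller firmware and the drive firmware are both at the latest officially recommended versions.
1.2 Interface information
According to colleagues on site, the host channel ports of controller A and controller B are each directly attached to the hosts.
1.3 Host information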
host_alias | host_type | host_group | Controller | Logical_Drive_Name |
windown1_a | Windows 2000/Server 2003/Server 2008 Non-Clustered | windows_group | B | 4,5 |
windown2_a | Windows 2000/Server 2003/Server 2008 Non-Clustered | windows_group | B | 4,5 |
app1_hba | Solaris (with or without MPXIO) | app_group | A | 3 |
app2_hba | Solaris (with or without MPXIO) | app_group | A | 1,2,6 |
dbhba1 | Solaris (with or without MPXIO) | db_group | A | 1,2,6 |
db2_hba | Solaris (with or without MPXIO) | db_group | A | 1,2,6 |
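Chapter 2 Fault symptoms
On 2017/5/10 controller A rebooted, interrupting the app and db services attached to it. The affected services are:
10.1.121.129 M01-HQ-SV013-DB1 Database (primary)
10.1.121.130 M01-HQ-SV013-DB2 Database (standby, DG database)
10.1.121.132 M01-HQ-SV014-APP1 Application server (primary)
10.1.121.133 M01-HQ-SV014-APP2 Application server (standby)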
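The impact can be cross-checked against the host table in section 1.3. The following is a minimal sketch, assuming the Controller column of that table names the controller each host depends on. The rows are copied from the table; the function names are hypothetical and not part of any IBM tool.

```python
# Rows copied from the host table in section 1.3:
# (host_alias, host_group, controller).
HOST_MAPPINGS = [
    ("windown1_a", "windows_group", "B"),
    ("windown2_a", "windows_group", "B"),
    ("app1_hba", "app_group", "A"),
    ("app2_hba", "app_group", "A"),
    ("dbhba1", "db_group", "A"),
    ("db2_hba", "db_group", "A"),
]


def hosts_on_controller(mappings, controller):
    """Return the host aliases whose Controller column equals `controller`."""
    return [alias for alias, _group, owner in mappings if owner == controller]


def groups_on_controller(mappings, controller):
    """Return the sorted host groups that depend on `controller`."""
    return sorted({group for _alias, group, owner in mappings if owner == controller})


if __name__ == "__main__":
    print(hosts_on_controller(HOST_MAPPINGS, "A"))
    print(groups_on_controller(HOST_MAPPINGS, "A"))
```

Under that assumption, every host in app_group and db_group sits behind controller A, which is consistent with the app and db services being the ones interrupted.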
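Chapter 3 Fault analysis
Each host is attached over a single link and has no multipathing configured, so the link was lost when the controller rebooted.
The storage event log contains the following reboot entries for controller A: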
Date/Time: 15-2-3 8:38:19
Sequence number: 4698
Event type: 400F
Event category: Internal
Priority: Informational
Description: Controller reset by its alternate
Event specific codes: 0/0/0
Component type: Controller
Component location: Enclosure 85, Slot 1
Date/Time: 17-5-10 7:47:44
Sequence number: 6300
Event type: 400F
Event category: Internal
Priority: Informational
Description: Controller reset by its alternate
Event specific codes: 0/0/0
Component type: Controller
Component location: Enclosure 85, Slot 1
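Records in this format can be extracted programmatically when the log is long. The following is a minimal sketch, assuming each record starts with a "Date/Time" line, every field is a "Key: value" line, and timestamps are in yy-m-d h:m:s order. The sample text is trimmed from the two records above; the function names are hypothetical.

```python
from datetime import datetime

# Two records trimmed from the controller A event log entries shown above.
LOG_TEXT = """\
Date/Time: 15-2-3 8:38:19
Sequence number: 4698
Event type: 400F
Description: Controller reset by its alternate
Date/Time: 17-5-10 7:47:44
Sequence number: 6300
Event type: 400F
Description: Controller reset by its alternate
"""


def parse_events(text):
    """Group 'Key: value' lines into records; a 'Date/Time' line opens a new record."""
    events = []
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that are not 'Key: value'
        if key == "Date/Time":
            events.append({})
        if events:
            events[-1][key] = value.strip()
    return events


def event_times(events, event_type="400F"):
    """Return the timestamps of all records with the given event type."""
    return [
        datetime.strptime(e["Date/Time"], "%y-%m-%d %H:%M:%S")
        for e in events
        if e.get("Event type") == event_type
    ]


if __name__ == "__main__":
    for ts in event_times(parse_events(LOG_TEXT)):
        print(ts.isoformat(sep=" "))
```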
From 2015-2-3 to 2017-5-10 the interval is 827 days.
Next, the dates of the two most recent "Start-of-day routine begun" events on controller A:
Date/Time: 15-2-10 17:25:12
Date/Time: 17-5-10 7:47:04
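From 2015-2-10 to 2017-5-10 the interval is 820 days.
From this we conclude that the reboot was caused by the storage system's 820/825-day timer behavior.
The storage system checks each controller's uptime on an 820/825-day cycle.
Controller A last ran this check on 2015/2/10, exactly 820 days before 2017/5/10, which matches the 820-day interval documented for controller A in the appendix.
Controller A's previous reset was on 2015/2/3, exactly 827 days before 2017/5/10, so controller A rebooted.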
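The two intervals can be verified with plain date arithmetic. The sketch below uses only the dates quoted from the event log; the constant and function names are chosen for illustration.

```python
from datetime import date

LAST_RESET_BY_ALTERNATE = date(2015, 2, 3)  # event 400F, sequence number 4698
LAST_START_OF_DAY = date(2015, 2, 10)       # previous "Start-of-day routine begun"
REBOOT_DATE = date(2017, 5, 10)             # event 400F, sequence number 6300


def days_between(start, end):
    """Number of calendar days from `start` to `end`."""
    return (end - start).days


if __name__ == "__main__":
    print(days_between(LAST_RESET_BY_ALTERNATE, REBOOT_DATE))  # 827
    print(days_between(LAST_START_OF_DAY, REBOOT_DATE))        # 820
```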
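In addition, the historical log shows that controller B has rebooted 4 times in 6 years, and two of its host-side FC ports are not connected to any host. If SFP modules are installed in those ports, we recommend fitting protective plugs or removing the SFP modules.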
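Chapter 4 Recommendations
1. Redesign the storage links to provide highly available, redundant connections.
2. Given the 820/825-day design of the DS4K series, perform a preventive reboot before the timer expires.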
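To plan recommendation 2, the next timer-driven reboot can be projected from each controller's last boot date. The following is a minimal sketch, assuming the 820-day (controller A) and 825-day (controller B) intervals from RETAIN tip H193288 apply from the last "Start-of-day" event. The 30-day safety margin and the function names are hypothetical. Controller B's last boot date is not given in this document and must be read from its own event log.

```python
from datetime import date, timedelta

# Reboot intervals in days, as stated in RETAIN tip H193288.
TIMER_DAYS = {"A": 820, "B": 825}


def timer_reboot_date(last_boot, controller):
    """Date on which the timer-driven reboot is expected for `controller`."""
    return last_boot + timedelta(days=TIMER_DAYS[controller])


def maintenance_deadline(last_boot, controller, margin_days=30):
    """Latest date for a preventive reboot; margin_days is an assumed safety margin."""
    return timer_reboot_date(last_boot, controller) - timedelta(days=margin_days)


if __name__ == "__main__":
    last_boot_a = date(2017, 5, 10)  # controller A's latest Start-of-day event
    print(timer_reboot_date(last_boot_a, "A"))     # 2019-08-08
    print(maintenance_deadline(last_boot_a, "A"))  # 2019-07-09
```

Any reboot before the deadline resets the timer, so the same calculation should be repeated from the new boot date after each maintenance window.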
Chapter 5 Appendix
About the DS4K 820/825-day behavior
H193288: DS3000/DS4000/DS5000 controller will reboot every 820 or 825 days
5.1 Technote (troubleshooting)
5.2 Problem (Abstract)
RETAIN tip: H193288
5.3 Symptom
The IBM System Storage DS3000, DS4000, and DS5000 families of storage subsystem controllers will reboot every 820 days for controller A or 825 days for controller B, if the controller firmware is not upgraded or already rebooted within that time period.
Affected configurations
The system may be any of the following IBM servers:
· DS4100 (FAStT100) Dual-Controller Storage Server, type 1724, any model
· DS4100 (FAStT100) Single-Controller Storage Server, type 1724, any model
· DS4200 Storage Server, type 1814, any model
· DS4300 (FAStT600) Dual Controller and Turbo Storage Server, type 1722, any model
· DS4300 (FAStT600) Single Controller Storage Server, type 1722, any model
· DS4400 (FAStT700) Storage Server, type 1742, any model
· DS4500 (FAStT900) Storage Server, type 1742, any model
· DS4700 Storage Server, type 1814, any model
· DS4700 Storage Server, type 1814 (DC power supplies), any model
· DS4800 Storage Server, type 1815, any model
· DS5020 Disk Controller (1814-20A), any model
· DS5100 Storage Controller, type 1818, any model
· DS5300 Storage Controller, type 1818, any model
· FAStT 200 Storage Server, type 3542, any model
· FAStT500 RAID Controller, type 3552, any model
· FAStT500, type 3552, any model
· IBM System Storage DS3200, type 1726, any model
· IBM System Storage DS3300, type 1726, any model
· IBM System Storage DS3400, type 1726, any model
· IBM System Storage DS3512, type 1746, any model
· IBM System Storage DS3524, type 1746, any model
· IBM System Storage DS3950 Express, type 1814, any model
The system is configured with one or more of the following IBM Options:
· BladeCenter Boot Disk System (1726-22B), any model
This tip is not software specific.
Solution
For the DS3500, DCS3700, and DCS3860, this issue is fixed in the 8.2x release. For all other products, this is a permanent restriction and there will be no solution.
Workaround
Whenever a controller is rebooted, the firmware will reset the timer mechanism, giving the controllers another 828.5 days on the timer. The next reboots will occur at 820 days for controller A or 825 days for controller B.
The way to avoid these unexpected reboots is with a controller firmware upgrade, since the process of upgrading controller firmware will reboot the controllers, thereby resetting the timer mechanism. This also allows for the reboots to be scheduled at a convenient time for the customer's environment.
Upgrading firmware to the levels below is also recommended to reduce the possibility of the controller reboots happening at the same time.
DS3000 - 07.35.41.00 or higher
DS4000 - 07.15.07.00 or higher
DS5000 - 07.30.21.00 or higher
IBM's best recommended practice is to maintain the environment with regular firmware upgrades, at least once per year, to leverage the enhancements implemented in firmware and provide the best possible quality, performance, and availability of the system.
If these recommended best practices are followed, then the reboot behavior will not be observed.
Regularly scheduled maintenance of controller firmware will reset the timer, since this process reboots the controller. A reboot, for any other reason, will also cause the timer to be reset.
Additional information
The current design of the DS3000, DS4000, and DS5000 controller operating system contains a separate timer for each controller. Each timer rolls over after 828.5 days. In order to keep the timer from rolling over, the controller is designed to reboot after 825.5 days to reset the timer. These timers are independent of each other; however, there is a possibility that the controllers could reboot at the same time. Firmware levels 07.35.41.00, 07.15.07.00, and 07.30.21.00 were changed to stagger the controller reboots - controller A will reboot at 820 days and controller B will reboot at 825 days. This eliminates the simultaneous controller reboot condition, and allows the two redundant controllers to protect each other using the normal failover/failback operations.
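The 828.5-day figure can be illustrated with a short calculation. The technote does not state how the timer is implemented. The sketch below ASSUMES a 32-bit counter incremented 60 times per second, which is one configuration that happens to roll over after about 828.5 days. Both parameters are assumptions made for illustration, not facts from the source. The headroom values use only the 820-day, 825-day, and 828.5-day figures quoted above.

```python
SECONDS_PER_DAY = 86_400


def rollover_days(counter_bits, ticks_per_second):
    """Days until an unsigned counter of `counter_bits` bits wraps around."""
    return 2 ** counter_bits / ticks_per_second / SECONDS_PER_DAY


def headroom_days(reboot_after_days, rollover_after_days=828.5):
    """Days left between the forced reboot and the timer rollover."""
    return rollover_after_days - reboot_after_days


if __name__ == "__main__":
    # ASSUMPTION: 32-bit counter at 60 ticks per second (not stated in the technote).
    print(round(rollover_days(32, 60), 1))  # 828.5
    print(headroom_days(820))  # controller A: 8.5
    print(headroom_days(825))  # controller B: 3.5
```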
A properly maintained DS3000, DS4000, and DS5000 system includes periodic firmware upgrades. These firmware upgrades should never allow the controllers to get to the point where the timer rolls over.
IBM highly recommends periodically upgrading controller firmware. Firmware upgrades should be part of a yearly Change Management plan.
Segment | Product | Component | Platform | Version | Edition |
Disk Storage Systems | DS3950 | ||||
Disk Storage Systems | DS4200 | ||||
Disk Storage Systems | DS4700 | ||||
Disk Storage Systems | DS4800 | ||||
Disk Storage Systems | DS5020 | ||||
Disk Storage Systems | DS5100 | ||||
Disk Storage Systems | DS3200 | ||||
Disk Storage Systems | DS3300 | ||||
Disk Storage Systems | DS3400 | ||||
Disk Storage Systems | BladeCenter Boot Disk System | ||||
Disk Storage Systems | DS3500 (DS3512- DS3524) | ||||
Disk Storage Systems | DS4100 | ||||
Disk Storage Systems | DS4300 | ||||
Disk Storage Systems | DS4400 | ||||
Disk Storage Systems | DS4500 | ||||
Disk Storage Systems | FAStT500 Storage Server | ||||
Disk Storage Systems | DCS3700 | ||||
Disk Storage Systems | System Storage DCS3860 | ||||