SA3008IR + Micron SSD: abnormal rebuild behavior

Latest recommended article published 2024-09-10 09:41:53

kensp1

Tags: server operations

Article link: https://blog.csdn.net/kensp1/article/details/113712410

I. Problem description

Hardware configuration:

Server: SA5212M5

HBA card: 3008IR

Drives: 2x Micron 5100, FW 037

RAID level: RAID 1

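Symptoms:

The issue concerns 15 servers (SA5212M5 with a 3008IR HBA, FW 14.00.02, and two rear-mounted Micron SSDs, FW 037, configured as RAID 1; the servers were racked in March 2018). One of the two drives was reported as failed. Field operations replaced the drive on site, but the rebuild failed and the RAID volume state became "PermDegrd", as shown in Figure 1. When the new drive was manually set to synchronize, synchronization ran for a while and then stopped on its own. Machines with SAS HDDs behind the same SAS3008IR card rebuild successfully.

Faulty machine:

Figure 1: Controller information viewed on site shows that the rebuild failed and the RAID volume state is "PermDegrd".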

II. Root cause analysis

Micron SSD log analysis: both drives have more than 64 bad blocks

5100 SSD FW: D0MU037 SN: 17391AF46AB2

5100 SSD FW: D0MU037 SN: 17391AF4A185

lsiget logs collected from the faulty machine: properties_hba1.txt

Volume 0 is DevHandle 025d, Bus 1 Target 0, Type RAID1 (Mirroring)

Volume Name:

Volume WWID: 03c144f19cc16b53

Volume State: degraded, enabled, bad block table full, background init complete

Volume Settings: write caching disabled, auto configure hot swap enabled, data scrub allowed

Volume draws from Hot Spare Pools: 0
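The flag to look for in this excerpt is "bad block table full" on the "Volume State" line. As a minimal sketch, assuming the lsiget properties file keeps the "Volume State: flag, flag, ..." layout shown above, the check could be scripted as follows. The function names are hypothetical and not part of lsiget.

```python
import re

# Flag copied verbatim from the "Volume State" line quoted in the article.
REBUILD_BLOCKING_FLAG = "bad block table full"


def parse_volume_states(text):
    """Return one list of state flags per 'Volume State:' line in the text."""
    states = []
    for line in text.splitlines():
        match = re.match(r"\s*Volume State:\s*(.*)", line)
        if match:
            # Flags are comma-separated on a single line.
            flags = [f.strip() for f in match.group(1).split(",") if f.strip()]
            states.append(flags)
    return states


def has_full_bad_block_table(text):
    """True if any volume in the text reports a full bad block table."""
    return any(REBUILD_BLOCKING_FLAG in flags
               for flags in parse_volume_states(text))


# Two lines taken from the properties_hba1.txt excerpt above.
SAMPLE = """\
Volume 0 is DevHandle 025d, Bus 1 Target 0, Type RAID1 (Mirroring)
Volume State: degraded, enabled, bad block table full, background init complete
"""

if __name__ == "__main__":
    print(parse_volume_states(SAMPLE))
    print(has_full_bad_block_table(SAMPLE))
```

To check a collected log, the contents of properties_hba1.txt can be read into a string and passed to `has_full_bad_block_table`.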

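Definition of PermDegrd on the HBA:

Broadcom confirmed that "PermDegrd" stands for "Permanently Degraded". It is defined as follows: once more than 64 consecutive bad blocks have been recorded in the drive's Bad Block Table, the firmware marks the drive's RAID state as "PermDegrd", as shown in Figures 2 and 3. The LSIget logs collected from the faulty machine show Volume State: degraded, enabled, bad block table full, background init complete

Figure 2:

Figure 3:

Once permanently degraded, the RAID volume can therefore no longer return to the optimal state, and it cannot be rebuilt with a new drive.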

Q1. Does the "PermDegrd" status apply only to RAID1 on IR cards?

[A] It applies to RAID1, RAID10, and RAID1E in IR.

Q2. Why did the VD status change to "PermDegrd" only after the rebuild? Shouldn't the bad block entries have exceeded 64 before the rebuild?

[A] IR tries to rebuild the volume. During the rebuild, IR needs to copy all data from the primary to the secondary. A block is marked as bad if the copy fails, and if the bad block table becomes full, the volume is PermDegrd. Note also that these bad block entries are per VD.

Q3. The customer has never seen "PermDegrd" with HDDs. Does "PermDegrd" apply only to SSDs?

[A] No, it applies to both SSDs and HDDs. The VD bad block entry design in IR is the same for HDDs and SSDs.

Q4. As we know, MR FW performs a CC (consistency check) operation every 7 days by default. Does 3008IR also perform CC automatically?

[A] MegaRAID and IR are totally different designs; they are apples and oranges. 3008IR does not perform CC automatically, but you can start CC manually with a utility such as SAS3IRCU.
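The answer to Q2 explains why the state only changes during a rebuild: unreadable blocks on the surviving drive are only discovered when the whole volume is read for the copy. The toy model below illustrates that reasoning. It is a sketch, not the firmware's algorithm: it treats the bad block table as a simple counter with the limit of 64 quoted above, ignores the "consecutive" qualifier, and uses hypothetical names (`simulate_rebuild`, the "Optimal" label for a successful rebuild).

```python
# Limit quoted in the article; modelling the table as a plain counter is an assumption.
BAD_BLOCK_LIMIT = 64


def simulate_rebuild(total_blocks, unreadable_blocks, limit=BAD_BLOCK_LIMIT):
    """Copy every block from primary to secondary, logging unreadable ones.

    Returns (final_state, bad_block_entries, blocks_processed).
    """
    unreadable = set(unreadable_blocks)
    bad_entries = 0
    for block in range(total_blocks):
        if block in unreadable:
            # The copy of this block failed, so it is recorded as bad.
            bad_entries += 1
            if bad_entries > limit:
                # Table overflowed: the rebuild stops and cannot be retried.
                return "PermDegrd", bad_entries, block + 1
    return "Optimal", bad_entries, total_blocks


if __name__ == "__main__":
    # 64 latent bad blocks: the rebuild still completes in this model.
    print(simulate_rebuild(1000, range(100, 164)))
    # 65 latent bad blocks: the rebuild aborts partway through.
    print(simulate_rebuild(1000, range(100, 165)))
```

In this model the second call stops at block 165 of 1000, which mirrors the observed behavior of synchronization running for a while and then stopping.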

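4. Analysis conclusion:

One drive in the server failed and was replaced, and the consecutive bad blocks on the other drive also exceeded the rebuild limit of the Broadcom SAS 3008IR card, which put the RAID volume into the permanently degraded state. In this state the drive can still serve reads and writes, but the RAID can no longer be rebuilt, so the new drive cannot synchronize with it.

Firmware 037 for the 5100 is an early release with a known NAND leakage issue, which is the root cause of the large number of bad blocks on the 5100. This NAND issue is a long-standing problem on the 5100 and is resolved by upgrading to the latest 5100 firmware, 074. In early 2020 the customer upgraded part of its 5100 drives to firmware 074; those drives have now been running for nearly a year and have been stable.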

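III. On-site troubleshooting and resolution:

Troubleshooting: check whether the RAID volume is in the Degraded state, and check the SSD SMART log to see whether the threshold defined by Broadcom has been exceeded.

1. Check the RAID state

Command: ./SAS3ircu 0 display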

IR Volume information

------------------------------------------------------------------------

IR volume 1

Volume ID : 605

Status of volume : Degraded (DGD)

Volume wwid : 0e45839bc2597025

RAID level : RAID1

Size (in MB) : 456809

Boot : Primary

Physical hard disks :

PHY[0] Enclosure#/Slot# : 1:0

PHY[1] Enclosure#/Slot# : 1:1
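When many servers have to be checked, the "IR Volume information" section can be parsed rather than read by eye. The following is a minimal sketch that assumes the "key : value" layout shown in the output above. `parse_ir_volumes` and `degraded_volume_ids` are hypothetical helper names, not part of the SAS3IRCU utility.

```python
import re


def parse_ir_volumes(display_text):
    """Parse each 'IR volume N' block into a dict of its 'key : value' lines."""
    volumes = []
    current = None
    for raw in display_text.splitlines():
        line = raw.strip()
        if line and set(line) == {"-"}:
            # A dashed separator line ends the volume being collected.
            current = None
            continue
        if re.match(r"IR volume \d+$", line):
            current = {}
            volumes.append(current)
            continue
        if current is None or ":" not in line:
            continue
        # Split on the first colon only, so a value such as "1:0" stays intact.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return volumes


def degraded_volume_ids(display_text):
    """Return the Volume ID of every volume whose status starts with 'Degraded'."""
    return [v.get("Volume ID") for v in parse_ir_volumes(display_text)
            if v.get("Status of volume", "").startswith("Degraded")]


# Output copied from the article's example.
SAMPLE = """\
IR Volume information
------------------------------------------------------------------------
IR volume 1
Volume ID : 605
Status of volume : Degraded (DGD)
Volume wwid : 0e45839bc2597025
RAID level : RAID1
Size (in MB) : 456809
Boot : Primary
Physical hard disks :
PHY[0] Enclosure#/Slot# : 1:0
PHY[1] Enclosure#/Slot# : 1:1
"""

if __name__ == "__main__":
    print(degraded_volume_ids(SAMPLE))
```

The text passed in would normally be the captured output of the display command above, for example saved to a file on each host and collected centrally.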

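2. Check the number of bad blocks on the drives:

Command: smartctl -a /dev/sd* (run it once per drive, since smartctl takes a single device argument)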

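Repair recommendations:

If the checks above show that the RAID volume is in the Degraded state and the SSD bad block count exceeds the threshold of 64, continued use is not recommended. Back up the data, replace all drives in the RAID group, create a new RAID group, and restore the backup.
If one drive in the machine has failed but the bad block count on the source drive does not exceed 64, replace the failed drive; the system can then rebuild normally.
SSDs that have been in service for a long time may carry a large number of latent bad blocks. To avoid the risk of many new bad blocks appearing, updating the Micron 5100 series to the latest firmware, 074, is recommended.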
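The three recommendations above amount to a small decision table, sketched below for use in a fleet check script. The function name, its parameters and the action strings are hypothetical; the bad block count has to be read from the SMART log by the operator, because the article does not name the attribute to use.

```python
BAD_BLOCK_THRESHOLD = 64   # threshold quoted in the article
FIXED_FIRMWARE = "074"     # firmware release the article recommends

REPLACE_ALL = "back up data, replace all drives, create a new RAID group, restore data"
REPLACE_FAILED = "replace the failed drive and let the volume rebuild"
UPDATE_FW = "update Micron 5100 firmware to 074"


def recommend_actions(volume_degraded, surviving_drive_bad_blocks, firmware):
    """Map the on-site findings to the article's repair recommendations.

    volume_degraded: True if the display command reports a Degraded volume.
    surviving_drive_bad_blocks: bad block count of the remaining source drive.
    firmware: firmware string of the drive, e.g. "D0MU037".
    """
    actions = []
    if volume_degraded:
        if surviving_drive_bad_blocks > BAD_BLOCK_THRESHOLD:
            # A rebuild is expected to fail, so the whole group is replaced.
            actions.append(REPLACE_ALL)
        else:
            actions.append(REPLACE_FAILED)
    if not firmware.endswith(FIXED_FIRMWARE):
        actions.append(UPDATE_FW)
    return actions


if __name__ == "__main__":
    print(recommend_actions(True, 80, "D0MU037"))
    print(recommend_actions(True, 10, "D0MU074"))
```

Matching the firmware by suffix is a simplification that works for the strings quoted in this article ("D0MU037", "074"); a production script would compare versions more carefully.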

Similar industry case for reference:

https://support.huawei.com/enterprise/zh/doc/EDOC1100080944/e3823cff