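Goal
An OSD repeatedly shuts itself down shortly after starting.
The cause is a disk with too many bad sectors.
These notes record the complete procedure for safely taking the OSD out of service.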
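Steps
1 Check the disks for bad sectors
2 Identify which sd? device is affected
3 Stop the OSD process
4 Unmount the device
5 Delete the failed disk device
6 Delete the OSD's auth key
7 Take the OSD offline (remove it from the CRUSH map)
8 Delete the OSD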
Procedure
1 Check the disks for bad sectors
# megacli -PDlist -a0 | grep -E "Slot|Error|Firmware state"
Slot Number: 10 <- note this slot number (10)
Media Error Count: 155 <- too many bad sectors
Other Error Count: 1
Firmware state: Online, Spun Up
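With many drives, the grep output above gets long. The following is a minimal sketch of a filter that prints only the slots whose Media Error Count exceeds a limit. The function name bad_slots and the default limit of 100 are assumptions made for illustration; they are not part of megacli.

```shell
#!/bin/sh
# Print "<slot> <count>" for every drive whose Media Error Count exceeds a limit.
# Input: text in the format printed by `megacli -PDlist -a0`, read from stdin.
# bad_slots and the default limit of 100 are hypothetical, not part of megacli.
bad_slots() {
    limit="${1:-100}"
    awk -v limit="$limit" '
        /^Slot Number:/       { slot = $3 }                      # remember the current slot
        /^Media Error Count:/ { if ($4 + 0 > limit + 0) print slot, $4 }
    '
}

# Example use on a live host (requires megacli and root):
#   megacli -PDlist -a0 | bad_slots 100
```

Reading from stdin keeps the filter testable against saved output, without touching the RAID controller.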
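2 Identify which sd? device is affected
# megacli -cfgdsply -aALL | grep -v Information | grep -E "Virtual|Slot|RAID Level"
Virtual Drive: 9 (Target Id: 9)
Slot Number: 10 <- the slot with the high error count
On this host each Virtual Drive corresponds to one disk device: sda, sdb, sdc, …
Virtual Drives are numbered from 0, so Virtual Drive 9 is the tenth device, /dev/sdj.
Run mount | grep sdj to find out which OSD that disk belongs to.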
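The index-to-device mapping and the mount lookup can be sketched as two small helpers. Both function names are hypothetical. The mapping assumes one Virtual Drive per disk, enumerated in order, with no more than 26 drives. The lookup assumes the OSD is mounted at the default path /var/lib/ceph/osd/ceph-<id>. Verify both assumptions on the host before relying on the result.

```shell
#!/bin/sh
# Map a Virtual Drive index to a device name: 0 -> /dev/sda, 9 -> /dev/sdj.
# Assumption: one virtual drive per disk, enumerated in order, at most 26 drives.
vd_to_dev() {
    idx="$1"
    letters="abcdefghijklmnopqrstuvwxyz"
    # cut counts characters from 1, so add 1 to the 0-based index
    printf '/dev/sd%s\n' "$(printf '%s' "$letters" | cut -c "$((idx + 1))")"
}

# Print the OSD id for a device, given `mount` output on stdin.
# Assumption: the mount point has the default form /var/lib/ceph/osd/ceph-<id>.
osd_id_for_dev() {
    dev="$1"
    awk -v dev="$dev" 'index($1, dev) == 1 && match($3, /ceph-[0-9]+$/) {
        print substr($3, RSTART + 5)    # skip the 5 characters of "ceph-"
    }'
}

# Example use on a live host:
#   mount | osd_id_for_dev "$(vd_to_dev 9)"
```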
3 Stop the OSD process
# systemctl stop ceph-osd@18
4 Unmount the device
# umount /dev/sdj1
5 Delete the failed disk device
Note: use the Virtual Drive number returned by the command in step 2, Virtual Drive: 9 (Target Id: 9).
# megacli -CfgLdDel -L9 -a0 <- L9 = Virtual Drive 9
Adapter 0: Deleted Virtual Drive-9(target id-9)
Exit Code: 0x00
6 Delete the OSD's auth key
# ceph auth del osd.18
updated
7 Take the OSD offline (remove it from the CRUSH map)
# ceph osd crush remove osd.18
removed item id 18 name 'osd.18' from crush map
8 Delete the OSD
# ceph osd rm osd.18
removed osd.18
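Steps 3 to 8 can be collected into one function so that the OSD id, partition and Virtual Drive number are typed only once. This is a rough sketch, not a tested tool. The names remove_osd, run_step and DRY_RUN are hypothetical, and the commands are the ones listed in the steps above, in the same order. By default it only prints the commands; setting DRY_RUN=0 executes them and stops at the first failure.

```shell
#!/bin/sh
# Print (default) or execute one command, depending on DRY_RUN.
# run_step, remove_osd and DRY_RUN are hypothetical names used for illustration.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"        # dry run: show the command only
    else
        "$@"             # real run: execute it and return its exit status
    fi
}

# Arguments: OSD id, data partition, Virtual Drive index, adapter index (default 0).
remove_osd() {
    osd="$1"; part="$2"; vd="$3"; adapter="${4:-0}"
    run_step systemctl stop "ceph-osd@${osd}" &&
    run_step umount "$part" &&
    run_step megacli -CfgLdDel "-L${vd}" "-a${adapter}" &&
    run_step ceph auth del "osd.${osd}" &&
    run_step ceph osd crush remove "osd.${osd}" &&
    run_step ceph osd rm "osd.${osd}"
}

# Example: review the commands first, then rerun with DRY_RUN=0 as root.
#   remove_osd 18 /dev/sdj1 9 0
```

Defaulting to a dry run is deliberate: deleting the wrong Virtual Drive is not reversible, so the commands should be reviewed before anything is executed.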
Additional information
Marking a disk from Unconfigured(bad) to Unconfigured(good)
Sample output
Slot Number: 4
Media Error Count: 0
Other Error Count: 6
Predictive Failure Count: 23
Last Predictive Failure Event Seq Number: 64378
Firmware state: Unconfigured(bad) <- currently in the bad state
To change it to good
# megacli -pdmakegood -physdrv[0:4] -a0
State after the change
Slot Number: 4
Media Error Count: 0
Other Error Count: 6
Predictive Failure Count: 23
Last Predictive Failure Event Seq Number: 64378
Firmware state: Unconfigured(good), Spun Up
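When several disks are in the bad state, the matching pdmakegood commands can be generated from the PDlist output. The sketch below only prints the commands and does not run them. makegood_cmds is a hypothetical helper, and the enclosure id and adapter index are supplied by the caller as assumptions (both 0 in the example above).

```shell
#!/bin/sh
# Print a pdmakegood command for each slot in state Unconfigured(bad).
# Input: text in the format printed by `megacli -PDlist -a0`, read from stdin.
# makegood_cmds is hypothetical; enclosure id ($1) and adapter ($2) default to 0.
makegood_cmds() {
    encl="${1:-0}"; adapter="${2:-0}"
    awk -v e="$encl" -v a="$adapter" '
        /^Slot Number:/ { slot = $3 }                      # remember the current slot
        /^Firmware state: Unconfigured\(bad\)/ {
            printf "megacli -pdmakegood -physdrv[%s:%s] -a%s\n", e, slot, a
        }
    '
}

# Example use on a live host (requires megacli and root):
#   megacli -PDlist -a0 | makegood_cmds 0 0
```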
Changing a disk's state from Spun Down to Spun Up
# megacli -cfgldadd -r0[0:0] wb ra direct cachedbadbbu -a0
Once the virtual drive has been created, the state automatically changes to Spun Up.
Slot Number: 0
Firmware state: Online, Spun Up