Zabbix raised an alert that the journal disk on one Ceph node has used more than 96% of its rated write endurance. According to Intel, once a drive reaches its rated write endurance it can no longer be written to reliably. The reported value: PercentageUsed : 97
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x021B5
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x08
CrcErrorCount : 0
AverageNandEraseCycles : 2917
MediaErrors : 0x00
PowerCycles : 0x0C
ProgramFailCount : 0
MaxNandEraseCycles : 2922
HighestLifetimeTemperature : 57
PercentageUsed : 97
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 2913
LowestLifetimeTemperature : 23
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 50
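As a cross-check, the same wear indicator is exposed through the standard NVMe SMART log. A minimal sketch, assuming nvme-cli is installed and the journal device is /dev/nvme0n1 as in the lsblk output below:

nvme smart-log /dev/nvme0n1 | grep -E 'percentage_used|media_errors'

The percentage_used field should roughly match the PercentageUsed value reported by isdct.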
Twelve OSDs use this disk for their journals; the OSD-to-partition mapping can be verified as sketched after the lsblk output below.
[root@ceph-11 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 5.5T 0 disk
└─sda1 8:1 0 5.5T 0 part /var/lib/ceph/osd/ceph-87
sdb 8:16 0 5.5T 0 disk
└─sdb1 8:17 0 5.5T 0 part /var/lib/ceph/osd/ceph-88
sdc 8:32 0 5.5T 0 disk
└─sdc1 8:33 0 5.5T 0 part /var/lib/ceph/osd/ceph-89
sdd 8:48 0 5.5T 0 disk
└─sdd1 8:49 0 5.5T 0 part /var/lib/ceph/osd/ceph-90
sde 8:64 0 5.5T 0 disk
└─sde1 8:65 0 5.5T 0 part /var/lib/ceph/osd/ceph-91
sdf 8:80 0 5.5T 0 disk
└─sdf1 8:81 0 5.5T 0 part /var/lib/ceph/osd/ceph-92
sdg 8:96 0 5.5T 0 disk
└─sdg1 8:97 0 5.5T 0 part /var/lib/ceph/osd/ceph-93
sdh 8:112 0 5.5T 0 disk
└─sdh1 8:113 0 5.5T 0 part /var/lib/ceph/osd/ceph-94
sdi 8:128 0 5.5T 0 disk
└─sdi1 8:129 0 5.5T 0 part /var/lib/ceph/osd/ceph-95
sdj 8:144 0 5.5T 0 disk
└─sdj1 8:145 0 5.5T 0 part /var/lib/ceph/osd/ceph-96
sdk 8:160 0 5.5T 0 disk
└─sdk1 8:161 0 5.5T 0 part /var/lib/ceph/osd/ceph-97
sdl 8:176 0 5.5T 0 disk
└─sdl1 8:177 0 5.5T 0 part /var/lib/ceph/osd/ceph-98
sdm 8:192 0 419.2G 0 disk
└─sdm1 8:193 0 419.2G 0 part /
nvme0n1 259:0 0 372.6G 0 disk
├─nvme0n1p1 259:1 0 30G 0 part
├─nvme0n1p2 259:2 0 30G 0 part
├─nvme0n1p3 259:3 0 30G 0 part
├─nvme0n1p4 259:4 0 30G 0 part
├─nvme0n1p5 259:5 0 30G 0 part
├─nvme0n1p6 259:6 0 30G 0 part
├─nvme0n1p7 259:7 0 30G 0 part
├─nvme0n1p8 259:8 0 30G 0 part
├─nvme0n1p9 259:9 0 30G 0 part
├─nvme0n1p10 259:10 0 30G 0 part
├─nvme0n1p11 259:11 0 30G 0 part
└─nvme0n1p12 259:12 0 30G 0 part
[root@ceph-11 ~]#
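Assuming these are filestore OSDs (the default for 10.x), each OSD's journal is a symlink inside its data directory, so the OSD-to-partition mapping can be confirmed with a quick loop. A minimal sketch, using the default /var/lib/ceph/osd layout shown above:

for osd in /var/lib/ceph/osd/ceph-*; do
    echo "$osd -> $(readlink -f "$osd"/journal)"
done

Every line should resolve to one of the nvme0n1p1..p12 partitions; filestore normally creates the journal symlink via /dev/disk/by-partuuid, and readlink -f follows it to the actual device node.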
1. Lower OSD priority
In most failure scenarios the node has to be powered off, so to keep the operation invisible to users we first lower the priority of the node we are about to work on. Check the Ceph version first: this cluster runs 10.x (Jewel). Primary-affinity support is enabled, so a client I/O request is served by the primary PG first and then written to the other replicas. Find the OSDs that belong to host ceph-11 and set their primary-affinity to 0, which means the PGs on them will not act as primary unless the other replicas are down.
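A minimal sketch of the commands for this step, assuming the twelve OSDs are osd.87 through osd.98 as listed in the tree output below and that primary-affinity support is already enabled:

ceph osd tree | grep -A 12 'host ceph-11'    # list the OSDs on this host
for i in $(seq 87 98); do
    ceph osd primary-affinity osd.$i 0
done

Setting primary-affinity to 0 moves no data; it only steers new client I/O towards primaries on other hosts, and it can be reverted afterwards by setting the value back to 1.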
ID  WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
-12 65.47299 host ceph-11
87 5.45599 osd.87 up 1.00000 0.89999
88 5.45599 osd.88 up 0.79999 0.29999
89 5.45599 osd.89 up 1.00000 0.89999
90 5.45599 osd.90 up 1.00000 0.89999
91 5.45599 osd.91 up 1.00000 0.89999
92 5.45599 osd.92 up 1.00000 0.79999
93 5.45599 osd.93 up 1.00000 0.89999
94 5.45599 osd.94 up 1.00000 0.89999
95