Ceph Cluster Troubleshooting: Repairing an Inconsistent PG

# ceph -v
ceph version 14.2.22

Symptoms

Running ceph -s shows the cluster in a HEALTH_ERR state, reporting scrub errors and a possible data inconsistency:

# ceph -s
  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent
 

The output shows:

  • The cluster has 1 scrub error
  • Possible data damage: 1 PG is in an inconsistent state
  • Another PG was undergoing a deep scrub at the time
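If only the affected pool is known, the rados tool can enumerate its inconsistent PGs directly. A minimal sketch; <pool-name> is a placeholder, not a value from this incident:

# rados list-inconsistent-pg <pool-name>    # prints a JSON array of inconsistent PG ids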

Analysis

Use ceph health detail to get more detail on the error:

# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 2.8d8 is active+clean+inconsistent, acting [159,79,609,286,11,431,355,398,490,210]

From this output:

  • One PG is inconsistent: 2.8d8
  • The PG's acting set contains 10 OSDs: [159,79,609,286,11,431,355,398,490,210]
  • Although the PG state is active+clean+inconsistent, meaning it can still serve reads and writes, it has a data consistency problem

Note that the inconsistent flag means Ceph found mismatching data during a scrub or deep-scrub. This is typically caused by disk failures, I/O errors, or other hardware problems.
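Before repairing, it can be useful to see exactly which objects and shards the scrub flagged. A hedged example using the rados tool; the JSON output contains one entry per inconsistent object, with per-shard error details such as read_error:

# rados list-inconsistent-obj 2.8d8 --format=json-pretty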

Locating the Fault

Find the location of the primary OSD (159):

# ceph osd find 159
{
    "osd": 159,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.x.x.x:6950",
                "nonce": xxxxx
            },
            {
                "type": "v1",
                "addr": "10.x.x.x:6951",
                "nonce": xxxxx
            }
        ]
    },
    "osd_fsid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "host": "osd-host03",
    "crush_location": {
        "host": "osd-host03",
        "root": "default"
    }
}

OSD.159 lives on host osd-host03 and is the primary OSD of PG 2.8d8. In Ceph's PG architecture, the primary OSD coordinates all operations on the PG, including reads, writes, and scrubs.
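As a quick cross-check, the PG-to-OSD mapping can also be printed directly; a minimal example (the osdmap epoch in the output is cluster-specific):

# ceph pg map 2.8d8    # shows the up and acting OSD sets for the PG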

Repair Procedure

After confirming the problem, issue the PG repair command:

# ceph pg repair 2.8d8
instructing pg 2.8d8s0 on osd.159 to repair

The ceph pg repair command instructs the primary OSD to repair the inconsistency in the PG; during the repair, Ceph recovers the correct data from the healthy copies. The s0 suffix in the output and the 10-member acting set indicate an erasure-coded pool, where each OSD in the acting set stores one shard of every object.
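While the repair runs, the PG state can be polled; a hedged sketch using ceph pg query (the grep just pulls the top-level state field, which typically includes scrubbing+deep+inconsistent+repair while the repair is in progress):

# ceph pg 2.8d8 query | grep -m1 '"state"'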

Check the OSD log to confirm repair progress:

# tail -n 50 /var/log/ceph/ceph-osd.159.log

2025-05-10 21:33:43.258 xxxxxxxxx  0 log_channel(cluster) log [DBG] : 2.8d8 repair starts
2025-05-10 21:35:03.066 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8 shard 398(7) soid 2:xxxxxxxx:::xxxxxxxxxx.xxxxxxxx:head : candidate had a read error
2025-05-10 21:35:43.607 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8s0 repair 0 missing, 1 inconsistent objects
2025-05-10 21:35:43.608 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8 repair 1 errors, 1 fixed

The log clearly shows:

  • The repair started at 21:33:43
  • At 21:35:03 the culprit was found: an object on OSD.398 (shard 398(7), i.e. the OSD at index 7 of the acting set) returned a read error
  • At 21:35:43 the repair completed, reporting 1 inconsistent object repaired

Disk Health Check

The log points to a read error on shard 398(7), so the disk backing OSD.398 needs a closer look:
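How /dev/sdah was identified as OSD.398's backing device: one hedged way is to query the OSD's metadata (the grep is just a convenience filter; the field names follow typical ceph osd metadata output):

# ceph osd metadata 398 | grep -E '"hostname"|"devices"'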

# smartctl -a /dev/sdah
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WUH721816ALE6L4
Serial Number:    XXXXXXXXX
LU WWN Device Id: 5 000cca XXXXXXXXX
Add. Product Id:  XXXXXX
Firmware Version: PCGAW270
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (unknown minor revision code: 0x009c)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May 10 22:19:33 2025 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       336 (Average 339)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       34069
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       65536
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1455
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1455
194 Temperature_Celsius     0x0002   063   063   000    Old_age   Always       -       32 (Min/Max 17/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       68388904427
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       4664543043

SMART Error Log Version: 1
ATA Error Count: 2
        
Error 2 occurred at disk power-on lifetime: 34068 hours (1419 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 80 3f 91 40 00   2d+01:31:42.070  READ FPDMA QUEUED
  61 00 08 80 34 70 40 00   2d+01:31:39.514  WRITE FPDMA QUEUED
  60 00 00 80 36 97 40 00   2d+01:31:39.463  READ FPDMA QUEUED
  60 00 00 00 8a f9 40 00   2d+01:31:39.455  READ FPDMA QUEUED
  60 00 00 80 c7 41 40 00   2d+01:31:39.450  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 34068 hours (1419 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

SMART results analysis:

Key indicators:

  • Overall health assessment: PASSED – the drive's self-assessment still passes
  • Current_Pending_Sector: 16 – 16 sectors currently have read problems
  • UNC errors: 2 – two uncorrectable read errors in the SMART error log
  • Power_On_Hours: 34069 – roughly 3.9 years of runtime
  • Command_Timeout: 65536 – command timeouts have occurred

Although the overall assessment is PASSED, the Current_Pending_Sector count and the UNC read errors are serious warning signs: the disk has started to fail physically and a replacement should be planned.
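It can also be worth spot-checking the remaining disks in the same host for pending sectors. A minimal sketch, assuming smartctl can address the drives directly (no RAID-controller passthrough options needed):

# for d in $(lsblk -dno NAME,TYPE | awk '$2=="disk"{print "/dev/"$1}'); do echo "== $d =="; smartctl -A "$d" | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'; done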

OSD Storage Layout Check

Check the storage configuration of OSD.398 with ceph-volume lvm list:

===== osd.398 ======

  [block]       /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

      block device              /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      block uuid                XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      cephx lockbox secret      
      cluster fsid              xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      cluster name              ceph
      crush device class        None
      db device                 /dev/vg_nvme1n1/lv_sdah
      db uuid                   XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      encrypted                 0
      osd fsid                  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      osd id                    398
      osdspec affinity          
      type                      block
      vdo                       0
      devices                   /dev/sdah

  [db]          /dev/vg_nvme1n1/lv_sdah

      block device              /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      block uuid                XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      cephx lockbox secret      
      cluster fsid              xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      cluster name              ceph
      crush device class        None
      db device                 /dev/vg_nvme1n1/lv_sdah
      db uuid                   XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      encrypted                 0
      osd fsid                  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      osd id                    398
      osdspec affinity          
      type                      db
      vdo                       0
      devices                   /dev/nvme1n1

Storage configuration analysis:

  1. OSD.398 uses /dev/sdah as its data device
  2. A logical volume on the NVMe device (/dev/nvme1n1) serves as the BlueStore DB device
  3. The OSD uses the BlueStore storage engine
  4. The OSD's data sits on LVM logical volumes
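Nautilus can also report the drive behind a daemon directly; a hedged cross-check (the device identifiers in the output typically combine model and serial number):

# ceph device ls-by-daemon osd.398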

Repair Analysis

The repair process, reconstructed from the log:

  1. 2025-05-10 21:33:43 - repair started (PG 2.8d8)
  2. 2025-05-10 21:35:03 - a read error was found on an object on shard 398(7)
  3. 2025-05-10 21:35:43 - repair statistics: 0 missing objects, 1 inconsistent object
  4. 2025-05-10 21:35:43 - repair finished: 1 error found, 1 error fixed

The repair succeeded: Ceph automatically recovered the correct data from the healthy copies. This is Ceph's self-healing at work; when an inconsistency is detected, a repair operation can restore the data from the healthy replicas or shards.

Summary

Root Cause

  • Disk fault: PG 2.8d8 contained one inconsistent object; the read error occurred on OSD.398 (shard 7 of the acting set)
  • Disk health: the SMART check shows /dev/sdah has 16 pending sectors and UNC read errors
  • Disk age: the drive has run for roughly 3.9 years and has started producing read errors
  • Failure mode: a typical data inconsistency caused by unreadable disk sectors

Fix

  • Ceph self-healing: the inconsistent object was successfully repaired with ceph pg repair 2.8d8
  • Data recovery: Ceph automatically restored the correct data from the other healthy copies
  • Repair flow: coordinated by the primary OSD.159, which identified and repaired the bad object on OSD.398

Follow-up Recommendations

  1. Disk monitoring

    • Closely monitor the health of the disk behind OSD.398 (/dev/sdah)
    • Run a smartctl check weekly
    • Configure the monitoring system to alert on changes in Current_Pending_Sector (see the sketch after this list)
  2. Disk replacement plan

    • Replace the disk as soon as practical; existing sector errors are likely to cause further data problems
    • Stage a replacement drive and schedule a suitable maintenance window for the swap
  3. Verification steps

    • Run a full SMART self-test: smartctl -t long /dev/sdah
    • Re-run a deep scrub to verify the PG: ceph pg deep-scrub 2.8d8
    • Verify cluster status: ceph -s
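A minimal alert sketch for the Current_Pending_Sector check mentioned above; the device path, threshold, and mail recipient are placeholders for the local monitoring setup, and the mail command is assumed to be available:

#!/bin/bash
# Alert if the disk behind osd.398 reports any pending sectors.
DEV=/dev/sdah                                                        # placeholder device path
PENDING=$(smartctl -A "$DEV" | awk '/Current_Pending_Sector/ {print $NF}')
if [ "${PENDING:-0}" -gt 0 ]; then
    echo "WARNING: $DEV reports $PENDING pending sectors" \
        | mail -s "SMART alert on $(hostname): $DEV" ops@example.com  # placeholder recipient
fi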

Lessons Learned

  • Proactive detection matters: regular deep scrubs catch data inconsistencies early
  • Repair workflow: for an inconsistent PG, ceph pg repair is an efficient fix
  • Post-repair verification: check cluster status after the repair to confirm the problem is resolved
  • Hardware monitoring: Current_Pending_Sector and the SMART error log are key indicators for predicting disk failure
  • The value of redundancy: it is Ceph's redundant data placement (replicas or erasure-coded shards) that preserves data integrity when a single OSD goes bad

Best Practices

  1. Scrub policy tuning

    • Set appropriate scrub and deep-scrub policies (see the config sketch after this list)
    • Run regular scrubs during light-load working hours and deep scrubs at weekends
    • On large clusters, stagger scrub windows across OSDs
  2. Data protection

    • For important data, consider a higher replica count or erasure coding
    • Regularly validate the backup strategy and disaster-recovery procedures
  3. Hardware management

    • Keep hardware healthy to avoid disk read/write errors
    • Build SMART monitoring for disks so latent problems are caught early
    • Apply stricter monitoring to drives that have run for more than three years
    • Adopt preventive replacement rather than waiting for a complete disk failure
  4. Monitoring enhancements

    • Configure automated tooling to track SMART attributes
    • Set threshold alerts on key metrics
    • Build a disk health scoring system for an overall view of drive condition
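A minimal sketch of scrub scheduling on Nautilus via the config database; the hour and interval values below are illustrative, not recommendations from this incident:

# ceph config set osd osd_scrub_begin_hour 22          # only start new scrubs from 22:00
# ceph config set osd osd_scrub_end_hour 6             # ...until 06:00
# ceph config set osd osd_scrub_load_threshold 0.5     # skip scrubs when host load is high
# ceph config set osd osd_deep_scrub_interval 1209600  # deep-scrub each PG at least every 14 days (seconds)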
