Ceph Cluster Troubleshooting: Repairing an Inconsistent PG

# ceph -v
ceph version 14.2.22

Symptoms

Running ceph -s shows the cluster in a HEALTH_ERR state, reporting scrub errors and a possible data inconsistency:

# ceph -s
  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent
 

The output shows:

  • The cluster has 1 scrub error
  • Possible data damage: 1 PG is in an inconsistent state
  • Another PG was undergoing a deep scrub at the time
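If only the affected pool is known, the rados tool can enumerate its inconsistent PGs directly. A minimal sketch; <pool-name> is a placeholder, not a value from this incident:

# rados list-inconsistent-pg <pool-name>    # prints a JSON array of inconsistent PG ids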

Analysis

Use ceph health detail to get more detail on the error:

# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 2.8d8 is active+clean+inconsistent, acting [159,79,609,286,11,431,355,398,490,210]

From this output:

  • One PG is inconsistent: 2.8d8
  • The PG's acting set contains 10 OSDs: [159,79,609,286,11,431,355,398,490,210]
  • Although the PG state is active+clean+inconsistent, meaning it can still serve reads and writes, it has a data consistency problem

Note that the inconsistent flag means Ceph found mismatching data during a scrub or deep-scrub. This is typically caused by disk failures, I/O errors, or other hardware problems.
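Before repairing, it can be useful to see exactly which objects and shards the scrub flagged. A hedged example using the rados tool; the JSON output contains one entry per inconsistent object, with per-shard error details such as read_error:

# rados list-inconsistent-obj 2.8d8 --format=json-pretty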

Locating the Fault

Find the location of the primary OSD (159):

# ceph osd find 159
{
    "osd": 159,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.x.x.x:6950",
                "nonce": xxxxx
            },
            {
                "type": "v1",
                "addr": "10.x.x.x:6951",
                "nonce": xxxxx
            }
        ]
    },
    "osd_fsid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "host": "osd-host03",
    "crush_location": {
        "host": "osd-host03",
        "root": "default"
    }
}

OSD.159 lives on host osd-host03 and is the primary OSD of PG 2.8d8. In Ceph's PG architecture, the primary OSD coordinates all operations on the PG, including reads, writes, and scrubs.
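As a quick cross-check, the PG-to-OSD mapping can also be printed directly; a minimal example (the osdmap epoch in the output is cluster-specific):

# ceph pg map 2.8d8    # shows the up and acting OSD sets for the PG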

Repair Procedure

After confirming the problem, issue the PG repair command:

# ceph pg repair 2.8d8
instructing pg 2.8d8s0 on osd.159 to repair

The ceph pg repair command instructs the primary OSD to repair the inconsistency in the PG; during the repair, Ceph recovers the correct data from the healthy copies. The s0 suffix in the output and the 10-member acting set indicate an erasure-coded pool, where each OSD in the acting set stores one shard of every object.
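While the repair runs, the PG state can be polled; a hedged sketch using ceph pg query (the grep just pulls the top-level state field, which typically includes scrubbing+deep+inconsistent+repair while the repair is in progress):

# ceph pg 2.8d8 query | grep -m1 '"state"'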

Check the OSD log to confirm repair progress:

# tail -n 50 /var/log/ceph/ceph-osd.159.log

2025-05-10 21:33:43.258 xxxxxxxxx  0 log_channel(cluster) log [DBG] : 2.8d8 repair starts
2025-05-10 21:35:03.066 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8 shard 398(7) soid 2:xxxxxxxx:::xxxxxxxxxx.xxxxxxxx:head : candidate had a read error
2025-05-10 21:35:43.607 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8s0 repair 0 missing, 1 inconsistent objects
2025-05-10 21:35:43.608 xxxxxxxxx -1 log_channel(cluster) log [ERR] : 2.8d8 repair 1 errors, 1 fixed

The log clearly shows:

  • The repair started at 21:33:43
  • At 21:35:03 the culprit was found: an object on OSD.398 (shard 398(7), i.e. the OSD at index 7 of the acting set) returned a read error
  • At 21:35:43 the repair completed, reporting 1 inconsistent object repaired

Disk Health Check

The log points to a read error on shard 398(7), so the disk backing OSD.398 needs a closer look:
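How /dev/sdah was identified as OSD.398's backing device: one hedged way is to query the OSD's metadata (the grep is just a convenience filter; the field names follow typical ceph osd metadata output):

# ceph osd metadata 398 | grep -E '"hostname"|"devices"'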

# smartctl -a /dev/sdah
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.119.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WUH721816ALE6L4
Serial Number:    XXXXXXXXX
LU WWN Device Id: 5 000cca XXXXXXXXX
Add. Product Id:  XXXXXX
Firmware Version: PCGAW270
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (unknown minor revision code: 0x009c)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May 10 22:19:33 2025 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       336 (Average 339)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       34069
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       65536
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1455
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1455
194 Temperature_Celsius     0x0002   063   063   000    Old_age   Always       -       32 (Min/Max 17/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       68388904427
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       4664543043

SMART Error Log Version: 1
ATA Error Count: 2
        
Error 2 occurred at disk power-on lifetime: 34068 hours (1419 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 80 3f 91 40 00   2d+01:31:42.070  READ FPDMA QUEUED
  61 00 08 80 34 70 40 00   2d+01:31:39.514  WRITE FPDMA QUEUED
  60 00 00 80 36 97 40 00   2d+01:31:39.463  READ FPDMA QUEUED
  60 00 00 00 8a f9 40 00   2d+01:31:39.455  READ FPDMA QUEUED
  60 00 00 80 c7 41 40 00   2d+01:31:39.450  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 34068 hours (1419 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

SMART results analysis:

Key indicators:

  • Overall health assessment: PASSED – the drive's self-assessment still passes
  • Current_Pending_Sector: 16 – 16 sectors currently have read problems
  • UNC errors: 2 – two uncorrectable read errors in the SMART error log
  • Power_On_Hours: 34069 – roughly 3.9 years of runtime
  • Command_Timeout: 65536 – command timeouts have occurred

Although the overall assessment is PASSED, the Current_Pending_Sector count and the UNC read errors are serious warning signs: the disk has started to fail physically and a replacement should be planned.
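It can also be worth spot-checking the remaining disks in the same host for pending sectors. A minimal sketch, assuming smartctl can address the drives directly (no RAID-controller passthrough options needed):

# for d in $(lsblk -dno NAME,TYPE | awk '$2=="disk"{print "/dev/"$1}'); do echo "== $d =="; smartctl -A "$d" | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'; done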

OSD Storage Layout Check

Check the storage configuration of OSD.398 with ceph-volume lvm list:

===== osd.398 ======

  [block]       /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

      block device              /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      block uuid                XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      cephx lockbox secret      
      cluster fsid              xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      cluster name              ceph
      crush device class        None
      db device                 /dev/vg_nvme1n1/lv_sdah
      db uuid                   XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      encrypted                 0
      osd fsid                  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      osd id                    398
      osdspec affinity          
      type                      block
      vdo                       0
      devices                   /dev/sdah

  [db]          /dev/vg_nvme1n1/lv_sdah

      block device              /dev/ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/osd-block-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      block uuid                XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      cephx lockbox secret      
      cluster fsid              xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      cluster name              ceph
      crush device class        None
      db device                 /dev/vg_nvme1n1/lv_sdah
      db uuid                   XXXXXX-XXXX-XXXX-XXXX-XXXX-XXXX-XXXXXX
      encrypted                 0
      osd fsid                  xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      osd id                    398
      osdspec affinity          
      type                      db
      vdo                       0
      devices                   /dev/nvme1n1

Storage configuration analysis:

  1. OSD.398 uses /dev/sdah as its data device
  2. A logical volume on the NVMe device (/dev/nvme1n1) serves as the BlueStore DB device
  3. The OSD uses the BlueStore storage engine
  4. The OSD's data sits on LVM logical volumes
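Nautilus can also report the drive behind a daemon directly; a hedged cross-check (the device identifiers in the output typically combine model and serial number):

# ceph device ls-by-daemon osd.398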

Repair Analysis

The repair process, reconstructed from the log:

  1. 2025-05-10 21:33:43 - repair started (PG 2.8d8)
  2. 2025-05-10 21:35:03 - a read error was found on an object on shard 398(7)
  3. 2025-05-10 21:35:43 - repair statistics: 0 missing objects, 1 inconsistent object
  4. 2025-05-10 21:35:43 - repair finished: 1 error found, 1 error fixed

The repair succeeded: Ceph automatically recovered the correct data from the healthy copies. This is Ceph's self-healing at work; when an inconsistency is detected, a repair operation can restore the data from the healthy replicas or shards.

Summary

Root Cause

  • Disk fault: PG 2.8d8 contained one inconsistent object; the read error occurred on OSD.398 (shard 7 of the acting set)
  • Disk health: the SMART check shows /dev/sdah has 16 pending sectors and UNC read errors
  • Disk age: the drive has run for roughly 3.9 years and has started producing read errors
  • Failure mode: a typical data inconsistency caused by unreadable disk sectors

Fix

  • Ceph self-healing: the inconsistent object was successfully repaired with ceph pg repair 2.8d8
  • Data recovery: Ceph automatically restored the correct data from the other healthy copies
  • Repair flow: coordinated by the primary OSD.159, which identified and repaired the bad object on OSD.398

Follow-up Recommendations

  1. Disk monitoring

    • Closely monitor the health of the disk behind OSD.398 (/dev/sdah)
    • Run a smartctl check weekly
    • Configure the monitoring system to alert on changes in Current_Pending_Sector (see the sketch after this list)
  2. Disk replacement plan

    • Replace the disk as soon as practical; existing sector errors are likely to cause further data problems
    • Stage a replacement drive and schedule a suitable maintenance window for the swap
  3. Verification steps

    • Run a full SMART self-test: smartctl -t long /dev/sdah
    • Re-run a deep scrub to verify the PG: ceph pg deep-scrub 2.8d8
    • Verify cluster status: ceph -s
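A minimal alert sketch for the Current_Pending_Sector check mentioned above; the device path, threshold, and mail recipient are placeholders for the local monitoring setup, and the mail command is assumed to be available:

#!/bin/bash
# Alert if the disk behind osd.398 reports any pending sectors.
DEV=/dev/sdah                                                        # placeholder device path
PENDING=$(smartctl -A "$DEV" | awk '/Current_Pending_Sector/ {print $NF}')
if [ "${PENDING:-0}" -gt 0 ]; then
    echo "WARNING: $DEV reports $PENDING pending sectors" \
        | mail -s "SMART alert on $(hostname): $DEV" ops@example.com  # placeholder recipient
fi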

Lessons Learned

  • Proactive detection matters: regular deep scrubs catch data inconsistencies early
  • Repair workflow: for an inconsistent PG, ceph pg repair is an efficient fix
  • Post-repair verification: check cluster status after the repair to confirm the problem is resolved
  • Hardware monitoring: Current_Pending_Sector and the SMART error log are key indicators for predicting disk failure
  • The value of redundancy: it is Ceph's redundant data placement (replicas or erasure-coded shards) that preserves data integrity when a single OSD goes bad

Best Practices

  1. Scrub policy tuning

    • Set appropriate scrub and deep-scrub policies (see the config sketch after this list)
    • Run regular scrubs during light-load working hours and deep scrubs at weekends
    • On large clusters, stagger scrub windows across OSDs
  2. Data protection

    • For important data, consider a higher replica count or erasure coding
    • Regularly validate the backup strategy and disaster-recovery procedures
  3. Hardware management

    • Keep hardware healthy to avoid disk read/write errors
    • Build SMART monitoring for disks so latent problems are caught early
    • Apply stricter monitoring to drives that have run for more than three years
    • Adopt preventive replacement rather than waiting for a complete disk failure
  4. Monitoring enhancements

    • Configure automated tooling to track SMART attributes
    • Set threshold alerts on key metrics
    • Build a disk health scoring system for an overall view of drive condition
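A minimal sketch of scrub scheduling on Nautilus via the config database; the hour and interval values below are illustrative, not recommendations from this incident:

# ceph config set osd osd_scrub_begin_hour 22          # only start new scrubs from 22:00
# ceph config set osd osd_scrub_end_hour 6             # ...until 06:00
# ceph config set osd osd_scrub_load_threshold 0.5     # skip scrubs when host load is high
# ceph config set osd osd_deep_scrub_interval 1209600  # deep-scrub each PG at least every 14 days (seconds)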
