RGW Bucket Shard优化

最新推荐文章于 2022-07-01 14:33:49 发布

Lucien168

最新推荐文章于 2022-07-01 14:33:49 发布

阅读量1.3k

点赞数

本文链接：https://blog.csdn.net/weixin_44389885/article/details/86621718

版权

1.bucket index背景简介

bucket index是整个RGW里面一个非常关键的数据结构，用于存储bucket的索引数据，默认情况下单个bucket的index全部存储在一个shard文件（shard数量为0，主要以OMAP-keys方式存储在leveldb中），随着单个bucket内的Object数量增加，整个shard文件的体积也在不断增长，当shard文件体积过大就会引发各种问题。

2. 问题及故障

2.1 故障现象描述

Flapping OSD's when RGW buckets have millions of objects
● Possible causes
○ The first issue here is when RGW buckets have millions of objects their
bucket index shard RADOS objects become very large with high
number OMAP keys stored in leveldb. Then operations like deep-scrub,
bucket index listing etc takes a lot of time to complete and this triggers
OSD's to flap. If sharding is not used this issue become worse because
then only one RADOS index objects will be holding all the OMAP keys.

RGW的index数据以omap形式存储在OSD所在节点的leveldb中，当单个bucket存储的Object数量高达百万数量级的时候，
deep-scrub和bucket list一类的操作将极大的消耗磁盘资源，导致对应OSD出现异常，
如果不对bucket的index进行shard切片操作(shard切片实现了将单个bucket index的LevelDB实例水平切分到多个OSD上)，数据量大了以后很容易出事。

○ The second issue is when you have good amount of DELETEs it causes
loads of stale data in OMAP and this triggers leveldb compaction all the
time which is single threaded and non optimal with this kind of workload
and causes osd_op_threads to suicide because it is always compacting
hence OSD’s starts flapping.

RGW在处理大量DELETE请求的时候，会导致底层LevelDB频繁进行数据库compaction(数据压缩，对磁盘性能损耗很大)操作，而且刚好整个compaction在LevelDB中又是单线程处理，很容易到达osdopthreads超时上限而导致OSD自杀。

常见的问题有:

对index pool进行scrub或deep-scrub的时候，如果shard对应的Object过大，会极大消耗底层存储设备性能，造成io请求超时。
底层deep-scrub的时候耗时过长，会出现request blocked，导致大量http请求超时而出现50x错误，从而影响到整个RGW服务的可用性。
当坏盘或者osd故障需要恢复数据的时候，恢复一个大体积的shard文件将耗尽存储节点性能，甚至可能因为OSD响应超时而导致整个集群出现雪崩。

2.2 根因跟踪

当bucket index所在的OSD omap过大的时候，一旦出现异常导致OSD进程崩溃，这个时候就需要进行现场"救火"，用最快的速度恢复OSD服务。
先确定对应OSD的OMAP大小，这个过大会导致OSD启动的时候消耗大量时间和资源去加载levelDB数据，导致OSD无法启动（超时自杀）。
特别是这一类OSD启动需要占用非常大的内存消耗，一定要注意预留好内存。（物理内存40G左右，不行用swap顶上）

image.png

3. 临时解决方案

3.1 关闭集群scrub, deep-scrub提升集群稳定性

$ ceph osd set noscrub
$ ceph osd set nodeep-scrub

3.2 调高timeout参数，减少OSD自杀的概率

osd_op_thread_timeout = 90 #default is 15
osd_op_thread_suicide_timeout = 2000 #default is 150
If filestore op threads are hitting timeout
filestore_op_thread_timeout = 180 #default is 60
filestore_op_thread_suicide_timeout = 2000 #default is 180
Same can be done for recovery thread also.
osd_recovery_thread_timeout = 120 #default is 30
osd_recovery_thread_suicide_timeout = 2000

3.2 手工压缩OMAP

在可以停OSD的情况下，可以对OSD进行compact操作，推荐在ceph 0.94.6以上版本，低于这个版本有bug。 https://github.com/ceph/ceph/pull/7645/files

○ The third temporary step could be taken if OSD's have very large OMAP
directories you can verify it with command: du -sh /var/lib/ceph/osd/ceph-$id/current/omap, then do manual leveldb compaction for OSD's.
■ ceph tell osd.$id compact or
■ ceph daemon osd.$id compact or
■ Add leveldb_compact_on_mount = true in [osd.$id] or [osd] section
and restart the OSD.
■ This makes sure that it compacts the leveldb and then bring the
OSD back up/in which really helps.

#开启noout操作
$ ceph osd set noout
  
#停OSD服务
$ systemctl stop ceph-osd@<osd-id>
  
#在ceph.conf中对应的[osd.id]加上下面配置
leveldb_compact_on_mount = true
  
#启动osd服务
$ systemctl start ceph-osd@<osd-id>
  
  
#使用ceph -s命令观察结果，最好同时使用tailf命令去观察对应的OSD日志.等所有pg处于active+clean之后再继续下面的操作
$ ceph -s
#确认compact完成以后的omap大小:
$ du -sh /var/lib/ceph/osd/ceph-$id/current/omap
  
#删除osd中临时添加的leveldb_compact_on_mount配置
  
#取消noout操作(视情况而定，建议线上还是保留noout):
$ ceph osd unset noout