Ceph basic operations

Version

[root@controller1 ~]# ceph -v 
ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

Status
Run on the admin node:
ceph -s

This shows the cluster status, for example:

    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62553: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v72844263: 5064 pgs, 24 pools, 93130 GB data, 13301 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 57046 kB/s rd, 35442 kB/s wr, 1703 op/s rd, 1486 op/s wr  

If we need to keep watching, there are two ways.
One is:
ceph -w

This is the official way. Its output is the same as ceph -s, except that the client io line at the bottom keeps updating.
Sometimes we also want to see how the other fields change, so I wrote a script that refreshes the whole ceph -s output (minus a few noisy lines) together with the per-pool IO stats:

watch -n 1 "ceph -s|
awk -v ll=$COLUMNS '/^ *mds[0-9]/{
  \$0=substr(\$0, 1, ll);
 }
 /^ +[0-9]+ pg/{next}
 /monmap/{ next }
 /^ +recovery [0-9]+/{next}
 { print}';
ceph osd pool stats | awk '/^pool/{
  p=\$2
 }
 /^ +(recovery|client)/{
  if(p){print \"\n\"p; p=\"\"};
  print
}'"

Sample output

Every 1.0s: ceph -s| awk -v ll=105 '/^ *mds[0-9]/{$0=substr($0, 1, ll);} /^ ...  Mon Jan 21 18:09:44 2019

    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62561: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v73183831: 5064 pgs, 24 pools, 93179 GB data, 13310 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 263 MB/s rd, 58568 kB/s wr, 755 op/s rd, 1165 op/s wr

cinder-sas
  client io 248 MB/s rd, 33529 kB/s wr, 363 op/s rd, 597 op/s wr

vms
  client io 1895 B/s rd, 2343 kB/s wr, 121 op/s rd, 172 op/s wr

cinder-ssd
  client io 15620 kB/s rd, 22695 kB/s wr, 270 op/s rd, 395 op/s wr

Usage

# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    407T      146T         260T         64.04
POOLS:
    NAME                            ID     USED       %USED     MAX AVAIL     OBJECTS
    cinder-sas                      13     76271G     89.25         9186G     10019308
    images                          14       649G      6.60         9186G       339334
    vms                             15      7026G     43.34         9186G      1807073
    cinder-ssd                      16      4857G     74.73         1642G       645823
    rbd                             17          0         0        16909G            1

osd
Quickly shows the OSD topology; it can be used to check OSD status and related information:

# ceph osd tree
ID     WEIGHT    TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10008         0 root sas6t3
-10007         0 root sas6t2
-10006 130.94598 root sas6t1
   -12  65.47299     host ceph-11
    87   5.45599         osd.87        up  1.00000          0.89999
    88   5.45599         osd.88        up  0.79999          0.29999
    89   5.45599         osd.89        up  1.00000          0.89999
    90   5.45599         osd.90        up  1.00000          0.89999
    91   5.45599         osd.91        up  1.00000          0.89999
    92   5.45599         osd.92        up  1.00000          0.79999
    93   5.45599         osd.93        up  1.00000          0.89999
    94   5.45599         osd.94        up  1.00000          0.89999
    95   5.45599         osd.95        up  1.00000          0.89999
    96   5.45599         osd.96        up  1.00000          0.89999
    97   5.45599         osd.97        up  1.00000          0.89999
    98   5.45599         osd.98        up  0.89999          0.89999
   -13  65.47299     host ceph-12
    99   5.45599         osd.99        up  1.00000          0.79999
   100   5.45599         osd.100       up  1.00000          0.79999
   101   5.45599         osd.101       up  1.00000          0.79999
   102   5.45599         osd.102       up  1.00000          0.79999
   103   5.45599         osd.103       up  1.00000          0.79999
   104   5.45599         osd.104       up  0.79999          0.79999
   105   5.45599         osd.105       up  1.00000          0.79999
   106   5.45599         osd.106       up  1.00000          0.79999
   107   5.45599         osd.107       up  1.00000          0.79999
   108   5.45599         osd.108       up  1.00000          0.79999
   109   5.45599         osd.109       up  1.00000          0.79999
   110   5.45599         osd.110       up  1.00000          0.79999

I wrote a script that highlights the %USE column of ceph osd df: blue once usage goes above 84% (c1) and red above 90% (c2):
ceph osd df | awk -v c1=84 -v c2=90 '{z=NF-2; if($z<=100&&$z>c1){c=34;if($z>c2)c=31;$z="\033["c";1m"$z"\033[0m"}; print}'
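The same one-liner, expanded with comments (a sketch that assumes the jewel-era ceph osd df layout, where %USE is the third column from the right):

ceph osd df | awk -v c1=84 -v c2=90 '
{
  z = NF - 2                            # index of the %USE column
  if ($z <= 100 && $z > c1) {           # only touch plausible percentages above c1
    c = 34                              # blue when usage > c1 (84%)
    if ($z > c2) c = 31                 # red when usage > c2 (90%)
    $z = "\033[" c ";1m" $z "\033[0m"   # wrap the field in an ANSI bold color escape
  }
  print
}'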

reweight
Manual weighting
When OSD load is unbalanced, the weights need manual intervention. The default value is 1, and in practice we generally only lower it:
osd reweight <int[0-]> <float[0.0-1.0]> reweight osd to 0.0 < <weight> < 1.0
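For example, to push some data off an overloaded OSD, lower its reweight (osd.88 and the value 0.8 below are purely illustrative; set it back to 1 to restore the default):

ceph osd reweight 88 0.8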

primary affinity
This controls how likely the PGs on an OSD are to become primary. 0 means the OSD will never be chosen as primary unless the other replicas are down; 1 means it will always be preferred unless the others are also set to 1. For values in between, the actual number of primary PGs is worked out from the OSD topology, since different pools may sit on different OSDs.

osd primary-affinity <osdname (id|osd.id)> <float[0.0-1.0]> :  adjust osd primary-affinity from 0.0 <= <weight> <= 1.0
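For example, to make an OSD less likely to be elected primary (osd.87 and 0.5 are illustrative values only):

ceph osd primary-affinity osd.87 0.5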

pool
All of these commands start with ceph osd pool.
List the pools:
ceph osd pool ls

Append detail to see the pool details:

# ceph osd pool ls detail
pool 13 'cinder-sas' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 63138 flags hashpspool stripe_width 0
        removed_snaps [1~5,7~2,a~2,e~10,23~4,2c~24,51~2,54~2,57~2,5a~a]
pool 14 'images' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 512 pgp_num 512 last_change 63012 flags hashpspool stripe_width 0

Adjust a pool attribute:

# ceph osd pool set <pool name> <attribute> <value>
osd pool set <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority <val> {--yes-i-really-mean-it} :  set pool parameter <var> to <val>
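For example, to change a single attribute and then read it back (the pool name and value below are only an illustration; think twice before touching size or pg_num on a production pool):

ceph osd pool set rbd size 3
ceph osd pool get rbd size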

pg
All of these commands start with ceph pg.

Check status:

# ceph pg stat
v79188443: 5064 pgs: 1 active+clean+scrubbing, 2 active+clean+scrubbing+deep, 5061 active+clean; 88809 GB data, 260 TB used, 146 TB / 407 TB avail; 384 MB/s rd, 134 MB/s wr, 2380 op/s

ceph pg ls can be followed by a PG state, or by other parameters:

# ceph pg ls | grep scrub
pg_stat objects mip     degr    misp    unf     bytes   log     disklog state   state_stamp     v       reported     up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
13.1e   4832    0       0       0       0       39550330880     3034    3034    active+clean+scrubbing+deep 2019-04-08 15:24:46.496295       63232'167226529 63232:72970092  [95,80,44]      95      [95,80,44]      95  63130'167208564  2019-04-07 05:16:01.452400      63130'167117875 2019-04-05 18:53:54.796948
13.13b  4955    0       0       0       0       40587477010     3065    3065    active+clean+scrubbing+deep 2019-04-08 15:19:43.641336       63232'93849435  63232:89107385  [87,39,78]      87      [87,39,78]      87  63130'93838372   2019-04-07 08:07:43.825933      62998'93796094  2019-04-01 22:23:14.399257
13.1ac  4842    0       0       0       0       39605106850     3081    3081    active+clean+scrubbing+deep 2019-04-08 15:26:40.119698       63232'29801889  63232:23652708  [110,31,76]     110     [110,31,76]     110 63130'29797321   2019-04-07 10:50:26.243588      62988'29759937  2019-04-01 08:19:34.927978
13.31f  4915    0       0       0       0       40128633874     3013    3013    active+clean+scrubbing  2019-04-08 15:27:19.489919   63232'45174880  63232:38010846  [99,25,42]      99      [99,25,42]      99      63130'45170307       2019-04-07 06:29:44.946734      63130'45160962  2019-04-05 21:30:38.849569
13.538  4841    0       0       0       0       39564094976     3003    3003    active+clean+scrubbing  2019-04-08 15:27:15.731348   63232'69555013  63232:58836987  [109,85,24]     109     [109,85,24]     109     63130'69542700       2019-04-07 08:09:00.311084      63130'69542700  2019-04-07 08:09:00.311084
13.71f  4851    0       0       0       0       39552301568     3014    3014    active+clean+scrubbing  2019-04-08 15:27:16.896665   63232'57281834  63232:49191849  [100,75,66]     100     [100,75,66]     100     63130'57247440       2019-04-07 05:43:44.886559      63008'57112775  2019-04-03 05:15:51.434950
13.774  4867    0       0       0       0       39723743842     3092    3092    active+clean+scrubbing  2019-04-08 15:27:19.501188   63232'32139217  63232:28360980  [101,63,21]     101     [101,63,21]     101     63130'32110484       2019-04-07 06:24:22.174377      63130'32110484  2019-04-07 06:24:22.174377
13.7fe  4833    0       0       0       0       39485484032     3015    3015    active+clean+scrubbing+deep 2019-04-08 15:27:15.699899       63232'38297730  63232:32962414  [108,82,56]     108     [108,82,56]     108 63130'38286258   2019-04-07 07:59:53.586416      63008'38267073  2019-04-03 14:44:02.779800

You can also use the ls-by-* variants:

pg ls {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-pool <poolstr> {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
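For example, to list only the scrubbing PGs of one pool, or every PG whose primary is a given OSD (the pool name and OSD id are taken from the outputs above, purely as an illustration):

ceph pg ls-by-pool cinder-sas scrubbing
ceph pg ls-by-primary osd.110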

Repair

# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair

Routine troubleshooting
pg inconsistent
A PG in the inconsistent state means you have hit this problem; the "1 scrub errors" that follows shows it is scrub-related:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+inconsistent, acting [110,55,21]
1 scrub errors
noout flag(s) set

Run the following:

# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair

Check
At this point you can see that 13.e1 has entered a deep scrub:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 pgs repair; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+scrubbing+deep+inconsistent+repair, acting [110,55,21]
1 scrub errors
noout flag(s) set

After a while the error disappears and pg 13.e1 returns to the active+clean state:

# ceph health detail
HEALTH_WARN noout flag(s) set
noout flag(s) set

Cause
Ceph periodically scrubs its PGs. An inconsistent state does not necessarily mean the data itself is inconsistent; it only means the data and its checksum disagree. Once you run repair, Ceph performs a deep scrub to determine whether the data really is inconsistent. If the deep scrub passes, the data is fine and only the checksum needs to be corrected.
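If several PGs are inconsistent at once, a small sketch like the one below can pull their ids out of ceph health detail and repair them one by one; it assumes the "pg <id> is active+clean+inconsistent" line format shown above, so review the list before letting it run:

ceph health detail | awk '/ is active\+clean\+inconsistent/ {print $2}' | xargs -r -n1 ceph pg repair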

request blocked for XXs
Locate the OSDs whose requests are blocked:
ceph health detail | grep blocked

Then lower that OSD's primary affinity; this shifts the primary role of some of its PGs elsewhere, so the pressure drops. The current value can be seen with ceph osd tree.
ceph osd primary-affinity OSD_ID <a value lower than before>
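For example (osd.95 is only an illustration), check the current value in the PRIMARY-AFFINITY column of ceph osd tree and then lower it one step:

ceph osd tree | grep -w 'osd\.95'
ceph osd primary-affinity osd.95 0.7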

This is mostly caused by cluster imbalance: some OSDs are overloaded and cannot process requests in time. If it happens frequently, investigate the cause:
1. If client IO demand has grown, try to optimize the clients and cut unnecessary reads and writes.
2. If a particular OSD consistently cannot keep up, temporarily lower its primary affinity and keep an eye on it, since this can be an early sign of disk failure.
3. If every OSD on one journal SSD shows this problem, check whether that journal SSD has a write bottleneck or is failing.
