Ceph Common Problems

1. CephFS Troubleshooting

1.1 Unable to Create a Filesystem

Creating a new CephFS fails with Error EINVAL: pool 'rbd-ssd' already contains some objects. Use an empty pool instead. Workaround:

ceph fs new cephfs rbd-ssd rbd-hdd --force

1.2 mds.0 is damaged

This problem appeared after a power failure. The MDS process reports: Error recovering journal 0x200: (5) Input/output error. Diagnosis:

# Check cluster health
ceph health detail
# HEALTH_ERR mds rank 0 is damaged; mds cluster is degraded
# mds.0 is damaged
 
# Filesystem details; the only MDS, Boron, cannot start
ceph fs status
# cephfs - 0 clients
# ======
# +------+--------+-----+----------+-----+------+
# | Rank | State  | MDS | Activity | dns | inos |
# +------+--------+-----+----------+-----+------+
# |  0   | failed |     |          |     |      |
# +------+--------+-----+----------+-----+------+
# +---------+----------+-------+-------+
# |   Pool  |   type   |  used | avail |
# +---------+----------+-------+-------+
# | rbd-ssd | metadata |  138k |  106G |
# | rbd-hdd |   data   | 4903M | 2192G |
# +---------+----------+-------+-------+
 
# +-------------+
# | Standby MDS |
# +-------------+
# |    Boron    |
# +-------------+
 
# Show the cause of the damage
ceph tell mds.0 damage
# terminate called after throwing an instance of 'std::out_of_range'
#   what():  map::at
# Aborted
 
# Try to mark rank 0 as repaired (no effect)
ceph mds repaired 0
 
# Try to export the CephFS journal (fails)
cephfs-journal-tool journal export backup.bin
# 2019-10-17 16:21:34.179043 7f0670f41fc0 -1 Header 200.00000000 is unreadable
# 2019-10-17 16:21:34.179062 7f0670f41fc0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`Error ((5) Input/output error)
 
# Try journal-based recovery (fails)
# Attempts to write all recoverable inodes/dentries from the journal back to the backing store (if newer than the on-disk versions)
cephfs-journal-tool event recover_dentries summary
# Events by type:
# Errors: 0
# 2019-10-17 16:22:00.836521 7f2312a86fc0 -1 Header 200.00000000 is unreadable
 
# Try to truncate/reset the journal (fails)
cephfs-journal-tool journal reset 
# got error -5from Journaler, failing
# 2019-10-17 16:22:14.263610 7fe6717b1700  0 client.6494353.journaler.resetter(ro) error getting journal off disk
# Error ((5) Input/output error)
 
 
# Last resort: delete and recreate the filesystem; data is lost
ceph fs rm cephfs  --yes-i-really-mean-it
 
 
 
## Hit the same problem again
 
# Deep scrub reveals that object 200.00000000 is inconsistent
ceph osd deep-scrub all
40.14 shard 14: soid 40:292cf221:::200.00000000:head data_digest
  0x6ebfd975 != data_digest 0x9e943993 from auth oi 40:292cf221:::200.00000000:head
  (22366'34 mds.0.902:1 dirty|data_digest|omap_digest s 90 uv 34 dd 9e943993 od ffffffff alloc_hint [0 0 0])                                                                              
40.14 deep-scrub 0 missing, 1 inconsistent objects
40.14 deep-scrub 1 errors
 
# Show details of the inconsistent RADOS object
rados list-inconsistent-obj  40.14  --format=json-pretty
{
    "epoch": 23060,
    "inconsistents": [
        {
            "object": {
                "name": "200.00000000",
            },
            "errors": [],
            "union_shard_errors": [
                # cause of the error: the data digest does not match the object info
                "data_digest_mismatch_info"
            ],
            "selected_object_info": {
                "oid": {
                    "oid": "200.00000000",
                },
            },
            "shards": [
                {
                    "osd": 7,
                    "primary": true,
                    "errors": [],
                    "size": 90,
                    "omap_digest": "0xffffffff"
                },
                {
                    "osd": 14,
                    "primary": false,
 
# errors: the shards are inconsistent and no single shard can be identified as the bad one. Possible reasons:
#    data_digest_mismatch: this replica's digest differs from the primary's
#    size_mismatch: this replica's data length differs from the primary's
#    read_error: there may be a disk error
                    "errors": [
                        # here the cause is that the digests of the two replicas differ
                        "data_digest_mismatch_info"
                    ],
                    "size": 90,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x6ebfd975"
                }
            ]
        }
    ]
}
# Switched to treating this as an inconsistent-PG problem: stop OSD.14, flush its journal, start OSD.14, run a PG repair.
# No effect... After a PG repair Ceph normally overwrites the inconsistent replica with the authoritative copy, but that
# does not always work; for example here, the primary's data digest info is missing.
 
# Delete the damaged object
rados -p rbd-ssd  rm 200.00000000

2. OSD Troubleshooting

2.1 Crashes Immediately After Starting

This can usually be treated as a Ceph bug, possibly triggered by the state of the on-disk data. Sometimes setting the weight of the crashing OSD to zero allows it to recover:

# try to work around osd.17 crashing right after start
ceph osd reweight 17 0

3. PG Troubleshooting

3.1 All PGs Stuck in unknown

If all PGs of a newly created pool are stuck in this state, the likely cause is a bad CRUSH map. You can set osd_crush_update_on_start to true so that OSDs update the CRUSH map automatically when they start; see the sketch below.
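
A minimal sketch, assuming the option is set in ceph.conf on the OSD hosts (the [osd] placement is the usual convention; the OSDs must be restarted for it to take effect):

[osd]
osd crush update on start = true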

3.2 Stuck in peering

ceph -s shows the following state, which does not recover over time:

  cluster:       
    health: HEALTH_WARN                                          
            Reduced data availability: 2 pgs inactive, 2 pgs peering
            19 slow requests are blocked > 32 sec
  data:
    pgs:     0.391% pgs not active
             510 active+clean
             2   peering

In this case, the Pods using the affected PGs were stuck in Unknown state.

Check the PGs stuck in the inactive state:

ceph pg dump_stuck inactive
 
PG_STAT STATE   UP     UP_PRIMARY ACTING ACTING_PRIMARY  
17.68   peering [3,12]          3 [3,12]              3
16.32   peering [4,12]          4 [4,12]              4 

Query the diagnostic info of one of the PGs; an excerpt:

// ceph pg 17.68 query
{                                               
    "info": {                                            
        "stats": {
            "state": "peering",
            "stat_sum": {
                "num_objects_dirty": 5
            },
            "up": [
                3,
                12
            ],
            "acting": [
                3,
                12
            ],
            // which OSD the PG is blocked by
            "blocked_by": [
                12
            ],
            "up_primary": 3,
            "acting_primary": 3
        }
    },
    "recovery_state": [
        // if all is well, the first element should be "name": "Started/Primary/Active"
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2018-06-11 18:32:39.594296",
            // but it is stuck requesting info from OSD 12
            "requested_info_from": [
                {
                    "osd": "12"
                }
            ]
        },
        {
            "name": "Started/Primary/Peering",
        },
        {
            "name": "Started",
        }
    ]
}

This does not give an explicit reason why osd.12 is blocking peering.

Check the logs: osd.12 is on 10.0.0.104 and osd.3 is on 10.0.0.100; the latter is the primary OSD.

Starting at 18:26, the osd.3 log shows failed heartbeat checks with all other OSDs. At that time 10.0.0.100 was under heavy load and effectively hung.

Around 18:26 the osd.12 log fills with entries like:

osd.12 466 heartbeat_check: no reply from 10.0.0.100:6803 osd.4 since back 2018-06-11 18:26:44.973982 ...

Heartbeat checks were still failing at 18:44; after restarting osd.12 everything returned to normal.
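
A hedged example of that restart, assuming a systemd-managed OSD on the host of osd.12:

systemctl restart ceph-osd@12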

3.3 incomplete

List the stuck PGs:

ceph pg dump_stuck
 
# PG_STAT STATE      UP     UP_PRIMARY ACTING ACTING_PRIMARY
# 17.79   incomplete [9,17]          9 [9,17]              9
# 32.1c   incomplete [16,9]         16 [16,9]             16
# 17.30   incomplete [16,9]         16 [16,9]             16
# 31.35   incomplete [9,17]          9 [9,17]              9

Query the diagnostic info of PG 17.30:

// ceph pg  17.30 query
{
  "state": "incomplete",
  "info": {
    "pgid": "17.30",
    "stats": {
      // blocked by osd.11, which no longer exists
      "blocked_by": [
        11
      ],
      "up_primary": 16,
      "acting_primary": 16
    }
  },
  // recovery history
  "recovery_state": [
    {
      "name": "Started/Primary/Peering/Incomplete",
      "enter_time": "2018-06-17 04:48:45.185352",
      // final state: there is no complete copy of this PG
      "comment": "not enough complete instances of this PG"
    },
    {
      "name": "Started/Primary/Peering",
      "enter_time": "2018-06-17 04:48:45.131904",
      "probing_osds": [
        "9",
        "16",
        "17"
      ],
      // wants to probe an OSD that no longer exists
      "down_osds_we_would_probe": [
        11
      ],
      "peering_blocked_by_detail": [
        {
          "detail": "peering_blocked_by_history_les_bound"
        }
      ]
    }
  ]
}

You can see that PG 17.30 expects to find authoritative data on osd.11, which is permanently gone. In this situation you can try to forcibly mark the PG as complete.

First, stop the PG's primary OSD: service ceph-osd@16 stop

Then run the following tool:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16  --pgid 17.30 --op mark-complete
# Marking complete 
# Marking complete succeeded

Finally, restart the PG's primary OSD: service ceph-osd@16 start

3.4 stale Caused by a Single Replica

Without replication, a single OSD going down makes the data unavailable:

ceph health detail 
# note that the acting set has only one member
# pg 2.21 is stuck stale for 688.372740, current state stale+active+clean, last acting [7]
# while other PGs have more than one
# pg 3.4f is active+recovering+degraded, acting [9,1]

If the OSD really has failed hardware, the data is lost. You also cannot run query operations against such a PG.
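
A hedged preventive check, with the pool name rbd used as a placeholder: keep at least two replicas per pool so that a single OSD failure does not make PGs stale.

ceph osd pool get rbd size
ceph osd pool set rbd size 2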

3.5 inconsistent

Locate the primary OSD of the problematic PG, stop it, flush its journal, start it again, and then repair the PG:

 ceph health detail
# HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
# OSD_SCRUB_ERRORS 2 scrub errors
# PG_DAMAGED Possible data damage: 2 pgs inconsistent
#     pg 15.33 is active+clean+inconsistent, acting [8,9]
#     pg 15.61 is active+clean+inconsistent, acting [8,16]
 
# find the host of the OSD
ceph osd find 8
 
# log in to the host of osd.8
systemctl stop ceph-osd@8.service
ceph-osd -i 8 --flush-journal
systemctl start ceph-osd@8.service
ceph pg repair 15.61

4. Object Troubleshooting

4.1 unfound

This problem occurs when the OSD holding the authoritative copy of an object goes down or is removed. For example, with two paired OSDs (serving the same PG):

  1. osd.1 goes down
  2. osd.2 handles some writes on its own
  3. osd.1 comes back up
  4. osd.1 and osd.2 re-peer; because of the writes osd.2 handled alone, the missing objects are queued for recovery on osd.1
  5. before recovery completes, osd.2 goes down or is removed

In this sequence of events, osd.1 knows an authoritative copy exists but cannot find it. Requests against the affected objects block until the OSD holding the authoritative copy comes back online.

Run the following command to locate the affected PG:

ceph health detail | grep unfound
# OBJECT_UNFOUND 1/90055 objects unfound (0.001%)
#     pg 33.3e has 1 unfound objects
#    pg 33.3e is active+recovery_wait+degraded, acting [17,6], 1 unfound

Then locate the affected object:

 // ceph pg 33.3e list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                // the missing object
                "oid": "obj_delete_at_hint.0000000066",
                "key": "",
                "snapid": -2,
                "hash": 2846662078,
                "max": 0,
                "pool": 33,
                "namespace": ""
            },
            "need": "1723'1412",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}

If there are too many missing objects to list at once, more is shown as true.

The following command shows the PG's diagnostic info:

// ceph pg 33.3e query
{
  "state": "active+recovery_wait+degraded",
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2018-06-16 15:03:32.873855",
      // OSDs that might hold the unfound object
      "might_have_unfound": [
        {
          "osd": "6",
          "status": "already probed"
        },
        {
          "osd": "11",
          "status": "osd is down"
        }
      ],
    } 
  ]
}

osd.11 in the output above had failed hardware and was removed earlier, which means the unfound object cannot be recovered. You can mark it:

# roll back to the previous version, or forget the object entirely if it was newly created; not supported for EC pools
ceph pg 33.3e mark_unfound_lost revert
# make Ceph forget that the unfound objects ever existed
ceph pg 33.3e mark_unfound_lost delete 

5. ceph-deploy

5.1 TypeError: ‘Logger’ object is not callable

Replace line 376 of /usr/lib/python2.7/dist-packages/ceph_deploy/osd.py with:

LOG.info(line.decode('utf-8')) 

5.2 Could not locate executable ‘ceph-volume’ make sure it is installed and available

Install ceph-deploy 1.5.39; version 2.0.0 only supports Luminous:

apt remove ceph-deploy
apt install ceph-deploy=1.5.39 -y

5.3 ceph -s Hangs After Deploying MONs

In my environment this happened because the MON nodes picked up the IP of an LVS virtual interface as their public addr. Fix the configuration by specifying the MON IP addresses explicitly:

[mon.master01-10-5-38-24]
public addr = 10.5.38.24 
cluster addr = 10.5.38.24
 
[mon.master02-10-5-38-39]
public addr = 10.5.38.39
cluster addr = 10.5.38.39
 
[mon.master03-10-5-39-41]
public addr = 10.5.39.41
cluster addr = 10.5.39.41

6. ceph-helm

Deploying in my environment ran into a series of permission-related problems. If you hit the same problems and do not care about security, you can change the configuration:

# kubectl -n ceph edit configmap ceph-etc
apiVersion: v1
data:
  ceph.conf: |
    [global]
    fsid = 08adecc5-72b1-4c57-b5b7-a543cd8295e7
    mon_host = ceph-mon.ceph.svc.k8s.gmem.cc
    # add the following three lines
    auth client required = none
    auth cluster required = none
    auth service required = none
    [osd]
    # in large clusters a separate cluster network can significantly improve performance
    cluster_network = 10.0.0.0/16
    ms_bind_port_max = 7100
    public_network = 10.0.0.0/16
kind: ConfigMap

If you need to keep the cluster secure, see the cases below.

6.1 ceph-mgr Reports Operation not permitted

  • Symptom
    The Pod never starts; the container log shows:
    timeout 10 ceph --cluster ceph auth get-or-create mgr.xenial-100 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-xenial-100/keyring
    0 librados: client.admin authentication error (1) Operation not permitted

  • Analysis
    Connect to a reachable ceph-mon and run the command:

    kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon -- ceph
    

    It reports the same error, which means the client.admin keyring in use is wrong. Log in to the ceph-mon and list the auth entries:

    # kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon bash
    # ceph --cluster=ceph  --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth list   
     
    installed auth entries:
     
    client.admin
            key: AQAXPdtaAAAAABAA6wd1kCog/XtV9bSaiDHNhw==
            auid: 0
            caps: [mds] allow
            caps: [mgr] allow *
            caps: [mon] allow *
            caps: [osd] allow *
     
    client.bootstrap-mds
            key: AQAgPdtaAAAAABAAFPgqn4/zM5mh8NhccPWKcw==
            caps: [mon] allow profile bootstrap-mds
    client.bootstrap-osd
            key: AQAUPdtaAAAAABAASbfGQ/B/PY4Imoa4Gxsa2Q==
            caps: [mon] allow profile bootstrap-osd
    client.bootstrap-rgw
            key: AQAJPdtaAAAAABAAswtFjgQWahHsuy08Egygrw==
            caps: [mon] allow profile bootstrap-rgw
    

    The client.admin keyring currently in use, however, contains:

[client.admin]
  key = AQAda9taAAAAABAAgWIsgbEiEsFRJQq28hFgTQ==
  auid = 0
  caps mds = "allow"
  caps mon = "allow *"
  caps osd = "allow *"
  caps mgr = "allow *"

The contents do not match. The client.admin keyring obtained via auth list turns out to be the valid one:

ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.admin > client.admin.keyyring
ceph --name client.admin --keyring client.admin.keyyring # OK

Checking /etc/ceph/ceph.client.admin.keyring in each Pod shows that they are all mounted from the Secret ceph-client-admin-keyring. So how is that Secret generated? Run:

kubectl -n ceph get job --output=yaml --export | grep ceph-client-admin-keyring -B 50

This shows that the Job ceph-storage-keys-generator is responsible for generating the Secret. Its Pod log records the keyring generation and Secret creation. Looking further at the Pod spec, the script doing the work, /opt/ceph/ceph-storage-key.sh, is mounted from the ceph-storage-key.sh entry in the ConfigMap ceph-bin.

The simplest fix is to modify the Secret so that it holds the keyring that is actually valid in the cluster:

# export the Secret definition
kubectl -n ceph get  secret ceph-client-admin-keyring --output=yaml --export > ceph-client-admin-keyring
# base64-encode the valid keyring
cat client.admin.keyyring | base64
# replace the encoded value in the Secret with the base64 output above, then re-apply the Secret
kubectl -n ceph apply -f ceph-client-admin-keyring

The Secret pvc-ceph-client-key also stores the admin user's key; its contents must be replaced with the valid one as well:

kubectl -n ceph edit secret  pvc-ceph-client-key

6.2 PVC Cannot Be Provisioned

The cause is similar to the previous problem: permissions again.

Check the events of the PVC that cannot be bound:

# kubectl -n ceph describe pvc
 Normal   Provisioning        53s   ceph.com/rbd ceph-rbd-provisioner-5544dcbcf5-n846s 708edb2c-4619-11e8-abf2-e672650d97a2  External provisioner is provisioning volume for claim
"ceph/ceph-pvc"
  Warning  ProvisioningFailed  53s   ceph.com/rbd ceph-rbd-provisioner-5544dcbcf5-n846s 708edb2c-4619-11e8-abf2-e672650d97a2  Failed to provision volume with StorageClass "general"
: failed to create rbd image: exit status 1, command output: 2018-04-22 13:44:35.269967 7fb3e3e3ad80 -1 did not load config file, using default settings.
2018-04-22 13:44:35.297828 7fb3e3e3ad80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2)
 No such file or directory
7fb3e3e3ad80  0 librados: client.admin authentication error (1) Operation not permitted

The rbd-provisioner reads the StorageClass definition to find the credentials it needs:

# kubectl -n ceph get storageclass --output=yaml
apiVersion: v1                                                                                                                                                                  
items:                                                                                                                                                                          
- apiVersion: storage.k8s.io/v1                                                                                                                                                 
  kind: StorageClass                                                                                                                                                            
  metadata:                                                                                                                             
    name: general
  parameters:
    adminId: admin
    adminSecretName: pvc-ceph-conf-combined-storageclass
    adminSecretNamespace: ceph
    imageFeatures: layering
    imageFormat: "2"
    monitors: ceph-mon.ceph.svc.k8s.gmem.cc:6789
    pool: rbd
    userId: admin
    userSecretName: pvc-ceph-client-key
  provisioner: ceph.com/rbd
  reclaimPolicy: Delete

Two Secrets are involved: pvc-ceph-conf-combined-storageclass and pvc-ceph-client-key. You need to write the correct key material into both; see the sketch below.
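
A hedged sketch of updating both Secrets with the valid admin key. It assumes they store the bare key under the data field key, as the standard rbd provisioner expects; verify the field names in your Secrets before patching:

# extract the valid admin key (the bare key, not the full keyring) and base64-encode it
ADMIN_KEY_B64=$(ceph auth get-key client.admin | tr -d '\n' | base64 -w 0)
# patch both Secrets with the encoded key
kubectl -n ceph patch secret pvc-ceph-conf-combined-storageclass -p "{\"data\":{\"key\":\"$ADMIN_KEY_B64\"}}"
kubectl -n ceph patch secret pvc-ceph-client-key -p "{\"data\":{\"key\":\"$ADMIN_KEY_B64\"}}"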

6.3 PVC Cannot Be Attached

  • Symptom:
    The PVC can be provisioned and the RBD image can be mapped with Ceph commands, but the Pod cannot start. Describing it shows:

    auth: unable to find a keyring on /etc/ceph/keyring: (2) No such file or directory
    monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
    librados: client.admin authentication error (95) Operation not supported
    
  • Solution:
    Copy ceph.client.admin.keyring to /etc/ceph/keyring, as in the example below.
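
A minimal example on the affected node, assuming the admin keyring is already at its standard path:

cp /etc/ceph/ceph.client.admin.keyring /etc/ceph/keyring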

6.4 ceph-osd Reports Operation not permitted

The cause is the same as the previous problem. Check the log of the container that cannot start:

kubectl -n ceph logs ceph-osd-dev-vdb-bjnbm -c osd-prepare-pod
# ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring health                                                   
# 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted                                               
# [errno 1] error connecting to the cluster

Further inspection shows that /var/lib/ceph/bootstrap-osd/ceph.keyring is mounted from the ceph.keyring entry of the Secret ceph-bootstrap-osd-keyring:

# kubectl -n ceph get secret ceph-bootstrap-osd-keyring --output=yaml --export
apiVersion: v1
data:
  ceph.keyring: W2NsaWVudC5ib290c3RyYXAtb3NkXQogIGtleSA9IEFRQVlhOXRhQUFBQUFCQUFSQ2l1bVY1NFpOU2JGVWwwSDZnYlJ3PT0KICBjYXBzIG1vbiA9ICJhbGxvdyBwcm9maWxlIGJvb3RzdHJhcC1vc2QiCgo=
kind: Secret
metadata:
  creationTimestamp: null
  name: ceph-bootstrap-osd-keyring
  selfLink: /api/v1/namespaces/ceph/secrets/ceph-bootstrap-osd-keyring
type: Opaque
 
# after base64 decoding:
[client.bootstrap-osd]
  key = AQAYa9taAAAAABAARCiumV54ZNSbFUl0H6gbRw==
  caps mon = "allow profile bootstrap-osd"

Get the keyring that is actually valid:

kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon -- ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.bootstrap-osd
# note: the first line of the output, exported keyring for client.bootstrap-osd, is not part of the keyring
[client.bootstrap-osd]
        key = AQAUPdtaAAAAABAASbfGQ/B/PY4Imoa4Gxsa2Q==
        caps mon = "allow profile bootstrap-osd"

Edit the Secret with kubectl -n ceph edit secret ceph-bootstrap-osd-keyring and replace its contents with the keyring above.

6.5 ceph-osd Reports No cluster conf with fsid

  • Error message:

    # kubectl -n ceph logs  ceph-osd-dev-vdc-cpkxh -c osd-activate-pod
    ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid 08adecc5-72b1-4c57-b5b7-a543cd8295e7
    # every OSD reports the same error
    

    The corresponding config file:

     kubectl -n ceph get configmap ceph-etc --output=yaml
    apiVersion: v1
    data:
      ceph.conf: |
        [global]
        fsid = a4426e8a-c46d-4407-95f1-911a23a0dd6e
        mon_host = ceph-mon.ceph.svc.k8s.gmem.cc
        [osd]
        cluster_network = 10.0.0.0/16
        ms_bind_port_max = 7100
        public_network = 10.0.0.0/16
    kind: ConfigMap
    metadata:
      name: ceph-etc
      namespace: ceph
    

    As you can see, the fsid does not match. Fixing the fsid in the ConfigMap resolves the problem.

6.6 Container Cannot Attach the PV

  • Error messages:
    describe pod reports: timeout expired waiting for volumes to attach/mount for pod
    kubelet reports: executable file not found in $PATH, rbd output

  • Analysis:
    A dynamically provisioned persistent volume goes through two phases:

    1. Volume provisioning, originally the control plane's job: controller-manager needs the rbd binary to create the images Kubernetes uses in the Ceph cluster. This duty is now handled by the rbd-provisioner from the external-storage project.
    2. Volume attach/detach, performed by the kubelet on the node running the Pod that uses the volume. These nodes need the rbd binary installed and a valid configuration file.
  • Solution:

    # install the client tools
    apt install -y ceph-common
    # copy the following files from a ceph-mon node:
    # /etc/ceph/ceph.client.admin.keyring
    # /etc/ceph/ceph.conf
    

    If after applying the fix you still get rbd: map failed exit status 110, rbd output: rbd: sysfs write failed In some cases useful info is found in syslog, check the kernel log:

    dmesg | tail
     
    # [ 3004.833252] libceph: mon0 10.0.0.100:6789 feature set mismatch, my 106b84a842a42 
    #     < server's 40106b84a842a42, missing 400000000000000
    # [ 3004.840980] libceph: mon0 10.0.0.100:6789 missing required protocol features
    

    Compared against the feature table earlier in this article, the kernel must be 4.5+ to provide CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING.
    The simplest fix is to upgrade the kernel:

    # Desktop
    apt install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04 -y
    # Server
    apt install --install-recommends linux-generic-hwe-16.04 -y
     
    sudo apt-get remove linux-headers-4.4.* -y && \
    sudo apt-get remove linux-image-4.4.* -y && \
    sudo apt-get autoremove -y && \
    sudo update-grub
    

    Alternatively, lower the tunables profile to the Hammer level:

    ceph osd crush tunables hammer
    # adjusted tunables profile to hammer
    

6.7 OSD Fails to Start with File name too long

Error message: ERROR: osd init failed: (36) File name too long

Cause: the underlying filesystem is ext4, which limits the size of stored xattrs; use XFS if you can.

Workaround: modify the config file as follows:

 osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

6.8 Cannot Open /proc/0/cmdline

Error message: Fail to open '/proc/0/cmdline' error No such file or directory

Cause: on CentOS 7, deploying ceph-mon and a directory-backed ceph-osd on the same node (via Helm) triggered this error; separating them made it disappear. In addition, the MON nodes carried virtual IPs on the same subnet as Ceph's cluster/public network, which caused some OSDs to listen on the wrong address.

I hit this problem again later; that time the cause was a virtual interface lo:ngress on the same subnet as eth0, which made the OSD use the wrong network.

The fix is to hard-code the OSD listen addresses:

[osd.2]                                                                                                                                                                          
public addr = 10.0.4.1                                                                                                                                                           
cluster addr = 10.0.4.1  

6.9 Cannot Map RBD Images

Error message: Input/output error; dmesg | tail shows more detail.

Possible causes:

  1. On CentOS 7 the error indicates that the client lacks the feature CEPH_FEATURE_CRUSH_V4 (1000000000000). Workaround: change the CRUSH bucket algorithm to straw (see the sketch below). Note that OSDs added later still default to straw2; the image tag in use was tag-build-master-luminous-ubuntu-16.04.
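
A hedged sketch of switching the bucket algorithm by editing the CRUSH map (the standard getcrushmap/crushtool workflow; the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin              # dump the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt         # decompile it to text
sed -i 's/alg straw2/alg straw/' crushmap.txt     # change every bucket from straw2 to straw
crushtool -c crushmap.txt -o crushmap.new         # recompile
ceph osd setcrushmap -i crushmap.new              # inject the modified map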

6.10 write error: File name too long

CephFS volumes from external-storage can be provisioned normally, but reading or writing data fails with this error. The cause is an over-long file path; the limit depends on the underlying filesystem, and we had lowered osd_max_object_name_len to stay compatible with machines running ext filesystems.

Workaround: name the directory with namespace + PVC name instead of a UUID. Modify cephfs-provisioner.go around line 118:

// name the share after the PVC's namespace and name instead of a random UUID
share := fmt.Sprintf("%s-%s", options.PVC.Namespace, options.PVC.Name)
// derive the user id the same way
user := fmt.Sprintf("%s-%s", options.PVC.Namespace, options.PVC.Name)

Then recompile.

7. Kubernetes-Related Issues

7.1 rbd image * is still being used

describe pod shows:

rbd image rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 is still being used

This means another client is still using the image. If you try to delete the image, it fails:

rbd rm rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
 
librbd::image::RemoveRequest: 0x560e39df9af0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout. 

To find out who the watcher is, run:

rbd status rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
Watchers:
        watcher=10.5.39.12:0/1652752791 client.94563 cookie=18446462598732840961

So 10.5.39.12 is holding the image.

Another way to find the watcher is via the image's header object. Get the image's info:

rbd info rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
 
rbd image 'kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6':
        size 8192 MB in 2048 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.134474b0dc51
        format: 2
        features: layering
        flags: 
        create_timestamp: Wed Jul 11 17:49:51 2018

The field block_name_prefix is rbd_data.134474b0dc51; replacing data with header gives the header object name. Then run:

rados listwatchers -p rbd-unsafe rbd_header.134474b0dc51
 
watcher=10.5.39.12:0/1652752791 client.94563 cookie=18446462598732840961

Now that we know 10.5.39.12 is holding the image, disconnect it. On that machine, list the currently mapped RBD images:

rbd showmapped
 
id pool       image                                                       snap device  
0  rbd-unsafe kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 -    /dev/rbd0 
1  rbd-unsafe kubernetes-dynamic-pvc-0729f9a6-84f0-11e8-9b75-5a3f858854b1 -    /dev/rbd1 

rbd0 is mapped on this machine but not mounted anywhere. Unmap it:

rbd unmap /dev/rbd0

Check the image status again; the watcher is gone:

rbd status rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
 
Watchers: none

7.2 rbd: map failed signal: aborted (core dumped)

kubectl describe reports Unable to mount volumes for pod… timeout expired waiting for volumes to attach or mount for pod…

Checking further, the target RBD image has no watcher, and the kubelet on the Pod's node reports rbd: map failed signal: aborted (core dumped). An rbd unmap had previously been run on that machine.

Manually running rbd map made the problem go away; see the example below.
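
A hedged example; the pool and image name below are placeholders rather than the ones from this case:

rbd map rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6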

8. OSD Fails to Start After a Power Outage

journal do_read_entry: bad header magic

Error message: journal do_read_entry(156389376): bad header magic…FAILED assert(interval.last > last)

This is a known bug in version 12.2: after a power outage an OSD may fail to start, and data loss is possible.

9. Miscellaneous

9.1 Couldn’t init storage provider (RADOS)

The RGW instance fails to start; journalctl shows the message above.

For more detail, check the RGW log:

2020-10-22 16:51:55.771035 7fb1b0f20e80  0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 2546439
2020-10-22 16:51:55.792872 7fb1b0f20e80  0 librados: client.rgw.ceph02 authentication error (22) Invalid argument
2020-10-22 16:51:55.793450 7fb1b0f20e80 -1 Couldn't init storage provider (RADOS)

This points to an authentication problem.

Get the command line from systemctl status ceph-radosgw@rgw.$RGW_HOST and run it manually:

radosgw -f --cluster ceph  --name client.rgw.ceph02 --setuser ceph --setgroup ceph -d --debug_ms 1

It reports the same error. Adding the --keyring option solves it:

radosgw -f --cluster ceph  --name client.rgw.ceph02        \
  --setuser ceph --setgroup ceph -d --debug_ms 1           \
  --keyring=/var/lib/ceph/radosgw/ceph-rgw.ceph02/keyring

So the systemd service was failing because it could not find the keyring.

9.2 Prometheus Module Cannot Be Enabled on Machines with IPv6 Disabled

Error message: Unhandled exception from module 'prometheus' while running on mgr.master01-10-5-38-24: error('No socket could be created',)

Solution: ceph config-key set mgr/prometheus/server_addr 0.0.0.0
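
A hedged follow-up so the new address takes effect (disable/enable restarts the module; restarting the active mgr works as well):

ceph mgr module disable prometheus
ceph mgr module enable prometheus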

9.3 Repeated mon… clock skew Warnings

The cause is that the clock-skew warning threshold is too low. Add the following to the global section and restart the MONs:

mon clock drift allowed = 2
mon clock drift warn backoff = 30

Or apply the settings immediately with:

ceph tell mon.* injectargs '--mon_clock_drift_allowed=2'
ceph tell mon.* injectargs '--mon_clock_drift_warn_backoff=30'

Or check the NTP configuration to ensure the clocks are synchronized accurately enough.

9.4 Deep Scrubbing Causes High I/O

Deep scrubbing is I/O intensive. If it cannot finish for a long time, you can disable it:

ceph osd set noscrub
ceph osd set nodeep-scrub

Re-enable it once the problem is resolved:

ceph osd unset noscrub
ceph osd unset nodeep-scrub 

When CFQ is the I/O scheduler, you can adjust the priority of the OSD disk threads:

# set the scheduler
echo cfq > /sys/block/sda/queue/scheduler
 
# check the current disk-thread I/O priority class of an OSD
ceph daemon osd.4 config get osd_disk_thread_ioprio_class
 
# change the I/O priority
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
# IOPRIO_CLASS_RT is the highest, IOPRIO_CLASS_IDLE the lowest
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'

If the measures above are not enough, consider tuning the following parameters:

osd_deep_scrub_stride = 131072                                                                                                                                                       
# range of the number of chunks scrubbed at a time
osd_scrub_chunk_min = 1                                                                                                                                                              
osd_scrub_chunk_max = 5                                                                                                                                                              
osd scrub during recovery = false                                                                                                                                                    
osd deep scrub interval = 2592000                                                                                                                                                    
osd scrub max interval = 2592000                                                                                                                                                     
# number of concurrent scrubs per OSD
osd max scrubs = 1   
# allowed scrub hours
osd scrub begin hour = 2
osd scrub end hour = 6
# do not start a scrub when the system load exceeds this value
osd scrub load threshold = 4                                                                                                                                                         
# sleep 0.1 s between scrub chunks
osd scrub sleep = 0.1                                                                                                                                                                  
# disk thread I/O priority
osd disk thread ioprio priority = 7
osd disk thread ioprio class = idle

9.5 Forced unmap

If the watcher has been blacklisted, trying to unmap the image fails with: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

You can force the unmap with: rbd unmap -o force ...

9.6 Cannot Reach active+clean After Increasing pg_num and pgp_num

Some PGs get stuck; a likely cause is the limit on the number of PGs per OSD. Raise the global option mon_max_pg_per_osd and restart the MONs; see the sketch below.
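
A hedged example; the value 400 is arbitrary, so pick one appropriate for your OSD count:

[global]
mon_max_pg_per_osd = 400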

Also note: after changing the PG count, always wait for the cluster to return to active+clean before making the next adjustment.

9.7 Cannot Delete an RBD Image

The Kubernetes PV corresponding to the second image below has already been deleted:

rbd ls
# kubernetes-dynamic-pvc-35350b13-46b8-11e8-bde0-a2c14c93573f
# kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4

but the corresponding RBD image was not removed. Delete it manually:

rbd remove kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4

It fails with:

2018-04-23 13:37:25.559444 7f919affd700 -1 librbd::image::RemoveRequest: 0x5598e77831d0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete…failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.

Check the image status:

# rbd info kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
rbd image 'kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4':
        size 8192 MB in 2048 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.1003e238e1f29
        format: 2
        features: layering
        flags: 
        create_timestamp: Mon Apr 23 11:42:59 2018
 
#rbd status kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
Watchers:
        watcher=10.0.0.101:0/4275384344 client.65597 cookie=18446462598732840963

On the machine 10.0.0.101:

# df | grep e6e3339859d4
/dev/rbd2        8125880  251560   7438508   4% /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4

After restarting the kubelet, the RBD image could be deleted.

9.8 Error EEXIST: entity osd.9 exists but key does not match

# delete the key
ceph auth del osd.9
# gather keys from the target host again
ceph-deploy --username ceph-ops gatherkeys Carbon

9.9 New Pool Cannot Reach active+clean


    pgs:     12.413% pgs unknown                                                                                                                                                   
             20.920% pgs not active                                                                                                                                                
             768 active+clean                                                                                                                                                      
             241 creating+activating                                                                                                                                               
             143 unknown 

This was probably caused by too large a total PG count; after lowering the number of PGs the pool quickly became active+clean.

9.10 Orphaned Pod Cannot Be Cleaned Up

Error message: Orphaned pod "a9621c0e-41ee-11e8-9407-deadbeef00a0" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them

Temporary workaround:

rm -rf /var/lib/kubelet/pods/a9621c0e-41ee-11e8-9407-deadbeef00a0/volumes/rook.io~rook/

9.11 OSD Fails to Start: ERROR: osd init failed: (1) Operation not permitted

A likely cause is that the keyring used by the OSD does not match what the MONs hold. For OSD 14, replace the contents of /var/lib/ceph/osd/ceph-14/keyring on the host with the first two lines of the output of ceph auth get osd.14; see the sketch below.
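
A hedged sketch, assuming the standard keyring path for osd.14 and a systemd-managed OSD (keep only the [osd.14] header and key line from the output):

ceph auth get osd.14
# [osd.14]
#         key = AQ...          <- paste these two lines into the keyring file
vi /var/lib/ceph/osd/ceph-14/keyring
systemctl restart ceph-osd@14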

9.12 Mount failed with ‘(11) Resource temporarily unavailable’

This error occurs when ceph-objectstore-tool is run without stopping the OSD first; see the sketch below for the correct order.
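
A hedged reminder of the correct order, mirroring the procedure in section 3.3 (the OSD id and PG id are placeholders):

systemctl stop ceph-osd@16
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --pgid 17.30 --op mark-complete
systemctl start ceph-osd@16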

9.13 neither public_addr nor public_network keys are defined for monitors

This error appears when adding MON nodes with ceph-deploy. Add public_network to the global section of the config file, for example:
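
A hedged example; the subnet below comes from the networks used earlier in this article and should be replaced with your actual public network:

[global]
public_network = 10.0.0.0/16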

9.14 PV Stuck in Terminating After Deletion

Possible causes:

  1. The corresponding PVC has not been deleted and still references the PV. Delete the PVC first.

9.15 chown: cannot access '/var/log/ceph': No such file or directory

The OSD fails to start with the error above. You can configure:

ceph:
  storage:
    osd_log: /var/log 

9.16 HEALTH_WARN application not enabled on

Enable an application on the pool to clear the warning:

#                                pool application
ceph osd pool application enable rbd block-devices

10. Diagnostics

Debug Logging

Note: verbose logging can exceed 1 GB per hour; if the system disk fills up, the node will stop working.

10.1 Temporarily Enabling Debug Logs

 # set the option remotely via ceph tell
ceph tell osd.0 config set debug_osd 0/5
 
# or, on the target host, set it directly on the OSD daemon
ceph daemon osd.0 config set debug_osd 0/5

10.2 Configuring Log Levels

Log levels can be customized per subsystem:

 # debug {subsystem} = {log-level}/{memory-level}
 
[global]
        debug ms = 1/5
[mon]
        debug mon = 20
        debug paxos = 1/5
        debug auth = 2
[osd]
        debug osd = 1/5
        debug filestore = 1/5
        debug journal = 1
        debug monc = 5/20
[mds]
        debug mds = 1
        debug mds balancer = 1
        debug mds log = 1
        debug mds migrator = 1

Subsystem defaults (log level / memory log level):

Subsystem         Log Level   Memory Level
default               0            5
lockdep               0            1
context               0            1
crush                 1            1
mds                   1            5
mds balancer          1            5
mds locker            1            5
mds log               1            5
mds log expire        1            5
mds migrator          1            5
buffer                0            1
timer                 0            1
filer                 0            1
striper               0            1
objecter              0            1
rados                 0            5
rbd                   0            5
rbd mirror            0            5
rbd replay            0            5
journaler             0            5
objectcacher          0            5
client                0            5
osd                   1            5
optracker             0            5
objclass              0            5
filestore             1            3
journal               1            3
ms                    0            5
mon                   1            5
monc                  0            10
paxos                 1            5
tp                    0            5
auth                  1            5
crypto                1            5
finisher              1            1
reserver              1            1
heartbeatmap          1            5
perfcounter           1            5
rgw                   1            5
rgw sync              1            5
civetweb              1            10
javaclient            1            5
asok                  1            5
throttle              1            1
refs                  0            0
compressor            1            5
bluestore             1            5
bluefs                1            5
bdev                  1            3
kstore                1            5
rocksdb               4            5
leveldb               4            5
memdb                 4            5
fuse                  1            5
mgr                   1            5
mgrc                  1            5
dpdk                  1            5
eventtrace            1            5

10.3 Faster Log Rotation

If disk space is limited, configure /etc/logrotate.d/ceph to rotate logs more aggressively:

rotate 7
weekly
size 500M
compress
sharedscripts

Then add a cron job to run logrotate periodically:

30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
