Ceph Common Problems

1. CephFS Troubleshooting

1.1 Unable to Create a Filesystem

Creating a new CephFS fails with Error EINVAL: pool 'rbd-ssd' already contains some objects. Use an empty pool instead. Workaround:

ceph fs new cephfs rbd-ssd rbd-hdd --force

1.2 mds.0 is damaged

This problem appeared after a power failure. The MDS process reports: Error recovering journal 0x200: (5) Input/output error. Diagnosis:

# Check cluster health
ceph health detail
# HEALTH_ERR mds rank 0 is damaged; mds cluster is degraded
# mds.0 is damaged
 
# Filesystem details; the only MDS, Boron, cannot start
ceph fs status
# cephfs - 0 clients
# ======
# +------+--------+-----+----------+-----+------+
# | Rank | State  | MDS | Activity | dns | inos |
# +------+--------+-----+----------+-----+------+
# |  0   | failed |     |          |     |      |
# +------+--------+-----+----------+-----+------+
# +---------+----------+-------+-------+
# |   Pool  |   type   |  used | avail |
# +---------+----------+-------+-------+
# | rbd-ssd | metadata |  138k |  106G |
# | rbd-hdd |   data   | 4903M | 2192G |
# +---------+----------+-------+-------+
 
# +-------------+
# | Standby MDS |
# +-------------+
# |    Boron    |
# +-------------+
 
# Show the cause of the damage
ceph tell mds.0 damage
# terminate called after throwing an instance of 'std::out_of_range'
#   what():  map::at
# Aborted
 
# Try to mark rank 0 as repaired (no effect)
ceph mds repaired 0
 
# Try to export the CephFS journal (fails)
cephfs-journal-tool journal export backup.bin
# 2019-10-17 16:21:34.179043 7f0670f41fc0 -1 Header 200.00000000 is unreadable
# 2019-10-17 16:21:34.179062 7f0670f41fc0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`Error ((5) Input/output error)
 
# Try journal-based recovery (fails)
# Attempts to write all recoverable inodes/dentries from the journal back to the backing store (if newer than the on-disk versions)
cephfs-journal-tool event recover_dentries summary
# Events by type:
# Errors: 0
# 2019-10-17 16:22:00.836521 7f2312a86fc0 -1 Header 200.00000000 is unreadable
 
# Try to truncate/reset the journal (fails)
cephfs-journal-tool journal reset 
# got error -5from Journaler, failing
# 2019-10-17 16:22:14.263610 7fe6717b1700  0 client.6494353.journaler.resetter(ro) error getting journal off disk
# Error ((5) Input/output error)
 
 
# Last resort: delete and recreate the filesystem; data is lost
ceph fs rm cephfs  --yes-i-really-mean-it
 
 
 
## Hit the same problem again
 
# Deep scrub reveals that object 200.00000000 is inconsistent
ceph osd deep-scrub all
40.14 shard 14: soid 40:292cf221:::200.00000000:head data_digest
  0x6ebfd975 != data_digest 0x9e943993 from auth oi 40:292cf221:::200.00000000:head
  (22366'34 mds.0.902:1 dirty|data_digest|omap_digest s 90 uv 34 dd 9e943993 od ffffffff alloc_hint [0 0 0])                                                                              
40.14 deep-scrub 0 missing, 1 inconsistent objects
40.14 deep-scrub 1 errors
 
# Show details of the inconsistent RADOS object
rados list-inconsistent-obj  40.14  --format=json-pretty
{
    "epoch": 23060,
    "inconsistents": [
        {
            "object": {
                "name": "200.00000000",
            },
            "errors": [],
            "union_shard_errors": [
                # cause of the error: the data digest does not match the object info
                "data_digest_mismatch_info"
            ],
            "selected_object_info": {
                "oid": {
                    "oid": "200.00000000",
                },
            },
            "shards": [
                {
                    "osd": 7,
                    "primary": true,
                    "errors": [],
                    "size": 90,
                    "omap_digest": "0xffffffff"
                },
                {
                    "osd": 14,
                    "primary": false,
 
# errors: the shards are inconsistent and no single shard can be identified as the bad one. Possible reasons:
#    data_digest_mismatch: this replica's digest differs from the primary's
#    size_mismatch: this replica's data length differs from the primary's
#    read_error: there may be a disk error
                    "errors": [
                        # here the cause is that the digests of the two replicas differ
                        "data_digest_mismatch_info"
                    ],
                    "size": 90,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x6ebfd975"
                }
            ]
        }
    ]
}
# Switched to treating this as an inconsistent-PG problem: stop OSD.14, flush its journal, start OSD.14, run a PG repair.
# No effect... After a PG repair Ceph normally overwrites the inconsistent replica with the authoritative copy, but that
# does not always work; for example here, the primary's data digest info is missing.
 
# Delete the damaged object
rados -p rbd-ssd  rm 200.00000000

2. OSD Troubleshooting

2.1 Crashes Immediately After Starting

This can usually be treated as a Ceph bug, possibly triggered by the state of the on-disk data. Sometimes setting the weight of the crashing OSD to zero allows it to recover:

# try to work around osd.17 crashing right after start
ceph osd reweight 17 0

3. PG Troubleshooting

3.1 All PGs Stuck in unknown

If all PGs of a newly created pool are stuck in this state, the likely cause is a bad CRUSH map. You can set osd_crush_update_on_start to true so that OSDs update the CRUSH map automatically when they start; see the sketch below.
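
A minimal sketch, assuming the option is set in ceph.conf on the OSD hosts (the [osd] placement is the usual convention; the OSDs must be restarted for it to take effect):

[osd]
osd crush update on start = true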

3.2 Stuck in peering

ceph -s shows the following state, which does not recover over time:

  cluster:       
    health: HEALTH_WARN                                          
            Reduced data availability: 2 pgs inactive, 2 pgs peering
            19 slow requests are blocked > 32 sec
  data:
    pgs:     0.391% pgs not active
             510 active+clean
             2   peering

In this case, the Pods using the affected PGs were stuck in Unknown state.

Check the PGs stuck in the inactive state:

ceph pg dump_stuck inactive
 
PG_STAT STATE   UP     UP_PRIMARY ACTING ACTING_PRIMARY  
17.68   peering [3,12]          3 [3,12]              3
16.32   peering [4,12]          4 [4,12]              4 

Query the diagnostic info of one of the PGs; an excerpt:

// ceph pg 17.68 query
{                                               
    "info": {                                            
        "stats": {
            "state": "peering",
            "stat_sum": {
                "num_objects_dirty": 5
            },
            "up": [
                3,
                12
            ],
            "acting": [
                3,
                12
            ],
            // which OSD the PG is blocked by
            "blocked_by": [
                12
            ],
            "up_primary": 3,
            "acting_primary": 3
        }
    },
    "recovery_state": [
        // if all is well, the first element should be "name": "Started/Primary/Active"
        {
            "name": "Started/Primary/Peering/GetInfo",
            "enter_time": "2018-06-11 18:32:39.594296",
            // but it is stuck requesting info from OSD 12
            "requested_info_from": [
                {
                    "osd": "12"
                }
            ]
        },
        {
            "name": "Started/Primary/Peering",
        },
        {
            "name": "Started",
        }
    ]
}

This does not give an explicit reason why osd.12 is blocking peering.

Check the logs: osd.12 is on 10.0.0.104 and osd.3 is on 10.0.0.100; the latter is the primary OSD.

Starting at 18:26, the osd.3 log shows failed heartbeat checks with all other OSDs. At that time 10.0.0.100 was under heavy load and effectively hung.

Around 18:26 the osd.12 log fills with entries like:

osd.12 466 heartbeat_check: no reply from 10.0.0.100:6803 osd.4 since back 2018-06-11 18:26:44.973982 ...

Heartbeat checks were still failing at 18:44; after restarting osd.12 everything returned to normal.
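
A hedged example of that restart, assuming a systemd-managed OSD on the host of osd.12:

systemctl restart ceph-osd@12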

3.3 incomplete

List the stuck PGs:

ceph pg dump_stuck
 
# PG_STAT STATE      UP     UP_PRIMARY ACTING ACTING_PRIMARY
# 17.79   incomplete [9,17]          9 [9,17]              9
# 32.1c   incomplete [16,9]         16 [16,9]             16
# 17.30   incomplete [16,9]         16 [16,9]             16
# 31.35   incomplete [9,17]          9 [9,17]              9

Query the diagnostic info of PG 17.30:

// ceph pg  17.30 query
{
  "state": "incomplete",
  "info": {
    "pgid": "17.30",
    "stats": {
      // blocked by osd.11, which no longer exists
      "blocked_by": [
        11
      ],
      "up_primary": 16,
      "acting_primary": 16
    }
  },
  // recovery history
  "recovery_state": [
    {
      "name": "Started/Primary/Peering/Incomplete",
      "enter_time": "2018-06-17 04:48:45.185352",
      // final state: there is no complete copy of this PG
      "comment": "not enough complete instances of this PG"
    },
    {
      "name": "Started/Primary/Peering",
      "enter_time": "2018-06-17 04:48:45.131904",
      "probing_osds": [
        "9",
        "16",
        "17"
      ],
      // wants to probe an OSD that no longer exists
      "down_osds_we_would_probe": [
        11
      ],
      "peering_blocked_by_detail": [
        {
          "detail": "peering_blocked_by_history_les_bound"
        }
      ]
    }
  ]
}

You can see that PG 17.30 expects to find authoritative data on osd.11, which is permanently gone. In this situation you can try to forcibly mark the PG as complete.

First, stop the PG's primary OSD: service ceph-osd@16 stop

Then run the following tool:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16  --pgid 17.30 --op mark-complete
# Marking complete 
# Marking complete succeeded

Finally, restart the PG's primary OSD: service ceph-osd@16 start

3.4 stale Caused by a Single Replica

Without replication, a single OSD going down makes the data unavailable:

ceph health detail 
# note that the acting set has only one member
# pg 2.21 is stuck stale for 688.372740, current state stale+active+clean, last acting [7]
# while other PGs have more than one
# pg 3.4f is active+recovering+degraded, acting [9,1]

If the OSD really has failed hardware, the data is lost. You also cannot run query operations against such a PG.
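
A hedged preventive check, with the pool name rbd used as a placeholder: keep at least two replicas per pool so that a single OSD failure does not make PGs stale.

ceph osd pool get rbd size
ceph osd pool set rbd size 2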

3.5 inconsistent

Locate the primary OSD of the problematic PG, stop it, flush its journal, start it again, and then repair the PG:

 ceph health detail
# HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
# OSD_SCRUB_ERRORS 2 scrub errors
# PG_DAMAGED Possible data damage: 2 pgs inconsistent
#     pg 15.33 is active+clean+inconsistent, acting [8,9]
#     pg 15.61 is active+clean+inconsistent, acting [8,16]
 
# find the host of the OSD
ceph osd find 8
 
# log in to the host of osd.8
systemctl stop ceph-osd@8.service
ceph-osd -i 8 --flush-journal
systemctl start ceph-osd@8.service
ceph pg repair 15.61

4. Object Troubleshooting

4.1 unfound

This problem occurs when the OSD holding the authoritative copy of an object goes down or is removed. For example, with two paired OSDs (serving the same PG):

  1. osd.1 goes down
  2. osd.2 handles some writes on its own
  3. osd.1 comes back up
  4. osd.1 and osd.2 re-peer; because of the writes osd.2 handled alone, the missing objects are queued for recovery on osd.1
  5. before recovery completes, osd.2 goes down or is removed

In this sequence of events, osd.1 knows an authoritative copy exists but cannot find it. Requests against the affected objects block until the OSD holding the authoritative copy comes back online.

Run the following command to locate the affected PG:

ceph health detail | grep unfound
# OBJECT_UNFOUND 1/90055 objects unfound (0.001%)
#     pg 33.3e has 1 unfound objects
#    pg 33.3e is active+recovery_wait+degraded, acting [17,6], 1 unfound

Then locate the affected object:

 // ceph pg 33.3e list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                // the missing object
                "oid": "obj_delete_at_hint.0000000066",
                "key": "",
                "snapid": -2,
                "hash": 2846662078,
                "max": 0,
                "pool": 33,
                "namespace": ""
            },
            "need": "1723'1412",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}

If there are too many missing objects to list at once, more is shown as true.

The following command shows the PG's diagnostic info:

// ceph pg 33.3e query
{
  "state": "active+recovery_wait+degraded",
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2018-06-16 15:03:32.873855",
      // OSDs that might hold the unfound object
      "might_have_unfound": [
        {
          "osd": "6",
          "status": "already probed"
        },
        {
          "osd": "11",
          "status": "osd is down"
        }
      ],
    } 
  ]
}

osd.11 in the output above had failed hardware and was removed earlier, which means the unfound object cannot be recovered. You can mark it:

# roll back to the previous version, or forget the object entirely if it was newly created; not supported for EC pools
ceph pg 33.3e mark_unfound_lost revert
# make Ceph forget that the unfound objects ever existed
ceph pg 33.3e mark_unfound_lost delete 

5. ceph-deploy

5.1 TypeError: ‘Logger’ object is not callable

Replace line 376 of /usr/lib/python2.7/dist-packages/ceph_deploy/osd.py with:

LOG.info(line.decode('utf-8')) 

5.2 Could not locate executable ‘ceph-volume’ make sure it is installed and available

Install ceph-deploy 1.5.39; version 2.0.0 only supports Luminous:

apt remove ceph-deploy
apt install ceph-deploy=1.5.39 -y

5.3 ceph -s Hangs After Deploying MONs

In my environment this happened because the MON nodes picked up the IP of an LVS virtual interface as their public addr. Fix the configuration by specifying the MON IP addresses explicitly:

[mon.master01-10-5-38-24]
public addr = 10.5.38.24 
cluster addr = 10.5.38.24
 
[mon.master02-10-5-38-39]
public addr = 10.5.38.39
cluster addr = 10.5.38.39
 
[mon.master03-10-5-39-41]
public addr = 10.5.39.41
cluster addr = 10.5.39.41

6. ceph-helm

Deploying in my environment ran into a series of permission-related problems. If you hit the same problems and do not care about security, you can change the configuration:

# kubectl -n ceph edit configmap ceph-etc
apiVersion: v1
data:
  ceph.conf: |
    [global]
    fsid = 08adecc5-72b1-4c57-b5b7-a543cd8295e7
    mon_host = ceph-mon.ceph.svc.k8s.gmem.cc
    # add the following three lines
    auth client required = none
    auth cluster required = none
    auth service required = none
    [osd]
    # in large clusters a separate cluster network can significantly improve performance
    cluster_network = 10.0.0.0/16
    ms_bind_port_max = 7100
    public_network = 10.0.0.0/16
kind: ConfigMap

If you need to keep the cluster secure, see the cases below.

6.1 ceph-mgr Reports Operation not permitted

  • Symptom
    The Pod never starts; the container log shows:
    timeout 10 ceph --cluster ceph auth get-or-create mgr.xenial-100 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-xenial-100/keyring
    0 librados: client.admin authentication error (1) Operation not permitted

  • Analysis
    Connect to a reachable ceph-mon and run the command:

    kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon -- ceph
    

    It reports the same error, which means the client.admin keyring in use is wrong. Log in to the ceph-mon and list the auth entries:

    # kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon bash
    # ceph --cluster=ceph  --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth list   
     
    installed auth entries:
     
    client.admin
            key: AQAXPdtaAAAAABAA6wd1kCog/XtV9bSaiDHNhw==
            auid: 0
            caps: [mds] allow
            caps: [mgr] allow *
            caps: [mon] allow *
            caps: [osd] allow *
     
    client.bootstrap-mds
            key: AQAgPdtaAAAAABAAFPgqn4/zM5mh8NhccPWKcw==
            caps: [mon] allow profile bootstrap-mds
    client.bootstrap-osd
            key: AQAUPdtaAAAAABAASbfGQ/B/PY4Imoa4Gxsa2Q==
            caps: [mon] allow profile bootstrap-osd
    client.bootstrap-rgw
            key: AQAJPdtaAAAAABAAswtFjgQWahHsuy08Egygrw==
            caps: [mon] allow profile bootstrap-rgw
    

    The client.admin keyring currently in use, however, contains:

[client.admin]
  key = AQAda9taAAAAABAAgWIsgbEiEsFRJQq28hFgTQ==
  auid = 0
  caps mds = "allow"
  caps mon = "allow *"
  caps osd = "allow *"
  caps mgr = "allow *"

The contents do not match. The client.admin keyring obtained via auth list turns out to be the valid one:

ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.admin > client.admin.keyyring
ceph --name client.admin --keyring client.admin.keyyring # OK

Checking /etc/ceph/ceph.client.admin.keyring in each Pod shows that they are all mounted from the Secret ceph-client-admin-keyring. So how is that Secret generated? Run:

kubectl -n ceph get job --output=yaml --export | grep ceph-client-admin-keyring -B 50

This shows that the Job ceph-storage-keys-generator is responsible for generating the Secret. Its Pod log records the keyring generation and Secret creation. Looking further at the Pod spec, the script doing the work, /opt/ceph/ceph-storage-key.sh, is mounted from the ceph-storage-key.sh entry in the ConfigMap ceph-bin.

The simplest fix is to modify the Secret so that it holds the keyring that is actually valid in the cluster:

# export the Secret definition
kubectl -n ceph get  secret ceph-client-admin-keyring --output=yaml --export > ceph-client-admin-keyring
# base64-encode the valid keyring
cat client.admin.keyyring | base64
# replace the encoded value in the Secret with the base64 output above, then re-apply the Secret
kubectl -n ceph apply -f ceph-client-admin-keyring

The Secret pvc-ceph-client-key also stores the admin user's key; its contents must be replaced with the valid one as well:

kubectl -n ceph edit secret  pvc-ceph-client-key

6.2 PVC Cannot Be Provisioned

The cause is similar to the previous problem: permissions again.

Check the events of the PVC that cannot be bound:

# kubectl -n ceph describe pvc
 Normal   Provisioning        53s   ceph.com/rbd ceph-rbd-provisioner-5544dcbcf5-n846s 708edb2c-4619-11e8-abf2-e672650d97a2  External provisioner is provisioning volume for claim
"ceph/ceph-pvc"
  Warning  ProvisioningFailed  53s   ceph.com/rbd ceph-rbd-provisioner-5544dcbcf5-n846s 708edb2c-4619-11e8-abf2-e672650d97a2  Failed to provision volume with StorageClass "general"
: failed to create rbd image: exit status 1, command output: 2018-04-22 13:44:35.269967 7fb3e3e3ad80 -1 did not load config file, using default settings.
2018-04-22 13:44:35.297828 7fb3e3e3ad80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2)
 No such file or directory
7fb3e3e3ad80  0 librados: client.admin authentication error (1) Operation not permitted

The rbd-provisioner reads the StorageClass definition to find the credentials it needs:

# kubectl -n ceph get storageclass --output=yaml
apiVersion: v1                                                                                                                                                                  
items:                                                                                                                                                                          
- apiVersion: storage.k8s.io/v1                                                                                                                                                 
  kind: StorageClass                                                                                                                                                            
  metadata:                                                                                                                             
    name: general
  parameters:
    adminId: admin
    adminSecretName: pvc-ceph-conf-combined-storageclass
    adminSecretNamespace: ceph
    imageFeatures: layering
    imageFormat: "2"
    monitors: ceph-mon.ceph.svc.k8s.gmem.cc:6789
    pool: rbd
    userId: admin
    userSecretName: pvc-ceph-client-key
  provisioner: ceph.com/rbd
  reclaimPolicy: Delete

Two Secrets are involved: pvc-ceph-conf-combined-storageclass and pvc-ceph-client-key. You need to write the correct key material into both; see the sketch below.
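
A hedged sketch of updating both Secrets with the valid admin key. It assumes they store the bare key under the data field key, as the standard rbd provisioner expects; verify the field names in your Secrets before patching:

# extract the valid admin key (the bare key, not the full keyring) and base64-encode it
ADMIN_KEY_B64=$(ceph auth get-key client.admin | tr -d '\n' | base64 -w 0)
# patch both Secrets with the encoded key
kubectl -n ceph patch secret pvc-ceph-conf-combined-storageclass -p "{\"data\":{\"key\":\"$ADMIN_KEY_B64\"}}"
kubectl -n ceph patch secret pvc-ceph-client-key -p "{\"data\":{\"key\":\"$ADMIN_KEY_B64\"}}"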

6.3 PVC Cannot Be Attached

  • Symptom:
    The PVC can be provisioned and the RBD image can be mapped with Ceph commands, but the Pod cannot start. Describing it shows:

    auth: unable to find a keyring on /etc/ceph/keyring: (2) No such file or directory
    monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
    librados: client.admin authentication error (95) Operation not supported
    
  • Solution:
    Copy ceph.client.admin.keyring to /etc/ceph/keyring, as in the example below.
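
A minimal example on the affected node, assuming the admin keyring is already at its standard path:

cp /etc/ceph/ceph.client.admin.keyring /etc/ceph/keyring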

6.4 ceph-osd Reports Operation not permitted

The cause is the same as the previous problem. Check the log of the container that cannot start:

kubectl -n ceph logs ceph-osd-dev-vdb-bjnbm -c osd-prepare-pod
# ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring health                                                   
# 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted                                               
# [errno 1] error connecting to the cluster

Further inspection shows that /var/lib/ceph/bootstrap-osd/ceph.keyring is mounted from the ceph.keyring entry of the Secret ceph-bootstrap-osd-keyring:

# kubectl -n ceph get secret ceph-bootstrap-osd-keyring --output=yaml --export
apiVersion: v1
data:
  ceph.keyring: W2NsaWVudC5ib290c3RyYXAtb3NkXQogIGtleSA9IEFRQVlhOXRhQUFBQUFCQUFSQ2l1bVY1NFpOU2JGVWwwSDZnYlJ3PT0KICBjYXBzIG1vbiA9ICJhbGxvdyBwcm9maWxlIGJvb3RzdHJhcC1vc2QiCgo=
kind: Secret
metadata:
  creationTimestamp: null
  name: ceph-bootstrap-osd-keyring
  selfLink: /api/v1/namespaces/ceph/secrets/ceph-bootstrap-osd-keyring
type: Opaque
 
# after base64 decoding:
[client.bootstrap-osd]
  key = AQAYa9taAAAAABAARCiumV54ZNSbFUl0H6gbRw==
  caps mon = "allow profile bootstrap-osd"

Get the keyring that is actually valid:

kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon -- ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.bootstrap-osd
# note: the first line of the output, exported keyring for client.bootstrap-osd, is not part of the keyring
[client.bootstrap-osd]
        key = AQAUPdtaAAAAABAASbfGQ/B/PY4Imoa4Gxsa2Q==
        caps mon = "allow profile bootstrap-osd"

Edit the Secret with kubectl -n ceph edit secret ceph-bootstrap-osd-keyring and replace its contents with the keyring above.

6.5 ceph-osd Reports No cluster conf with fsid

  • Error message:

    # kubectl -n ceph logs  ceph-osd-dev-vdc-cpkxh -c osd-activate-pod
    ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid 08adecc5-72b1-4c57-b5b7-a543cd8295e7
    # every OSD reports the same error
    

    The corresponding config file:

     kubectl -n ceph get configmap ceph-etc --output=yaml
    apiVersion: v1
    data:
      ceph.conf: |
        [global]
        fsid = a4426e8a-c46d-4407-95f1-911a23a0dd6e
        mon_host = ceph-mon.ceph.svc.k8s.gmem.cc
        [osd]
        cluster_network = 10.0.0.0/16
        ms_bind_port_max = 7100
        public_network = 10.0.0.0/16
    kind: ConfigMap
    metadata:
      name: ceph-etc
      namespace: ceph
    

    As you can see, the fsid does not match. Fixing the fsid in the ConfigMap resolves the problem.

6.6 Container Cannot Attach the PV

  • Error messages:
    describe pod reports: timeout expired waiting for volumes to attach/mount for pod
    kubelet reports: executable file not found in $PATH, rbd output

  • Analysis:
    A dynamically provisioned persistent volume goes through two phases:

    1. Volume provisioning, originally the control plane's job: controller-manager needs the rbd binary to create the images Kubernetes uses in the Ceph cluster. This duty is now handled by the rbd-provisioner from the external-storage project.
    2. Volume attach/detach, performed by the kubelet on the node running the Pod that uses the volume. These nodes need the rbd binary installed and a valid configuration file.
  • Solution:

    # install the client tools
    apt install -y ceph-common
    # copy the following files from a ceph-mon node:
    # /etc/ceph/ceph.client.admin.keyring
    # /etc/ceph/ceph.conf
    

    If after applying the fix you still get rbd: map failed exit status 110, rbd output: rbd: sysfs write failed In some cases useful info is found in syslog, check the kernel log:

    dmesg | tail
     
    # [ 3004.833252] libceph: mon0 10.0.0.100:6789 feature set mismatch, my 106b84a842a42 
    #     < server's 40106b84a842a42, missing 400000000000000
    # [ 3004.840980] libceph: mon0 10.0.0.100:6789 missing required protocol features
    

    Compared against the feature table earlier in this article, the kernel must be 4.5+ to provide CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING.
    The simplest fix is to upgrade the kernel:

    # Desktop
    apt install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04 -y
    # Server
    apt install --install-recommends linux-generic-hwe-16.04 -y
     
    sudo apt-get remove linux-headers-4.4.* -y && \
    sudo apt-get remove linux-image-4.4.* -y && \
    sudo apt-get autoremove -y && \
    sudo update-grub
    

    Alternatively, lower the tunables profile to the Hammer level:

    ceph osd crush tunables hammer
    # adjusted tunables profile to hammer
    

6.7 OSD Fails to Start with File name too long

Error message: ERROR: osd init failed: (36) File name too long

Cause: the underlying filesystem is ext4, which limits the size of stored xattrs; use XFS if you can.

Workaround: modify the config file as follows:

 osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

6.8 Cannot Open /proc/0/cmdline

Error message: Fail to open '/proc/0/cmdline' error No such file or directory

Cause: on CentOS 7, deploying ceph-mon and a directory-backed ceph-osd on the same node (via Helm) triggered this error; separating them made it disappear. In addition, the MON nodes carried virtual IPs on the same subnet as Ceph's cluster/public network, which caused some OSDs to listen on the wrong address.

I hit this problem again later; that time the cause was a virtual interface lo:ngress on the same subnet as eth0, which made the OSD use the wrong network.

The fix is to hard-code the OSD listen addresses:

[osd.2]                                                                                                                                                                          
public addr = 10.0.4.1                                                                                                                                                           
cluster addr = 10.0.4.1  

6.9 Cannot Map RBD Images

Error message: Input/output error; dmesg | tail shows more detail.

Possible causes:

  1. On CentOS 7 the error indicates that the client lacks the feature CEPH_FEATURE_CRUSH_V4 (1000000000000). Workaround: change the CRUSH bucket algorithm to straw (see the sketch below). Note that OSDs added later still default to straw2; the image tag in use was tag-build-master-luminous-ubuntu-16.04.
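
A hedged sketch of switching the bucket algorithm by editing the CRUSH map (the standard getcrushmap/crushtool workflow; the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin              # dump the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt         # decompile it to text
sed -i 's/alg straw2/alg straw/' crushmap.txt     # change every bucket from straw2 to straw
crushtool -c crushmap.txt -o crushmap.new         # recompile
ceph osd setcrushmap -i crushmap.new              # inject the modified map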

6.10 write error: File name too long

CephFS volumes from external-storage can be provisioned normally, but reading or writing data fails with this error. The cause is an over-long file path; the limit depends on the underlying filesystem, and we had lowered osd_max_object_name_len to stay compatible with machines running ext filesystems.

Workaround: name the directory with namespace + PVC name instead of a UUID. Modify cephfs-provisioner.go around line 118:

// name the share after the PVC's namespace and name instead of a random UUID
share := fmt.Sprintf("%s-%s", options.PVC.Namespace, options.PVC.Name)
// derive the user id the same way
user := fmt.Sprintf("%s-%s", options.PVC.Namespace, options.PVC.Name)

Then recompile.

7. Kubernetes-Related Issues

7.1 rbd image * is still being used

describe pod shows:

rbd image rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 is still being used

This means another client is still using the image. If you try to delete the image, it fails:

rbd rm rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
 
librbd::image::RemoveRequest: 0x560e39df9af0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout. 

To find out who the watcher is, run:

rbd status rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
Watchers:
        watcher=10.5.39.12:0/1652752791 client.94563 cookie=18446462598732840961

So 10.5.39.12 is holding the image.

Another way to find the watcher is via the image's header object. Get the image's info:

rbd info rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
 
rbd image 'kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6':
        size 8192 MB in 2048 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.134474b0dc51
        format: 2
        features: layering
        flags: 
        create_timestamp: Wed Jul 11 17:49:51 2018

The field block_name_prefix is rbd_data.134474b0dc51; replacing data with header gives the header object name. Then run:

rados listwatchers -p rbd-unsafe rbd_header.134474b0dc51
 
watcher=10.5.39.12:0/1652752791 client.94563 cookie=18446462598732840961

Now that we know 10.5.39.12 is holding the image, disconnect it. On that machine, list the currently mapped RBD images:

rbd showmapped
 
id pool       image                                                       snap device  
0  rbd-unsafe kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 -    /dev/rbd0 
1  rbd-unsafe kubernetes-dynamic-pvc-0729f9a6-84f0-11e8-9b75-5a3f858854b1 -    /dev/rbd1 

rbd0 is mapped on this machine but not mounted anywhere. Unmap it:

rbd unmap /dev/rbd0

Check the image status again; the watcher is gone:

rbd status rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 
 
Watchers: none

7.2 rbd: map failed signal: aborted (core dumped)

kubectl describe reports Unable to mount volumes for pod… timeout expired waiting for volumes to attach or mount for pod…

Checking further, the target RBD image has no watcher, and the kubelet on the Pod's node reports rbd: map failed signal: aborted (core dumped). An rbd unmap had previously been run on that machine.

Manually running rbd map made the problem go away; see the example below.
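
A hedged example; the pool and image name below are placeholders rather than the ones from this case:

rbd map rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6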

8. OSD Fails to Start After a Power Outage

journal do_read_entry: bad header magic

Error message: journal do_read_entry(156389376): bad header magic…FAILED assert(interval.last > last)

This is a known bug in version 12.2: after a power outage an OSD may fail to start, and data loss is possible.

9. Miscellaneous

9.1 Couldn’t init storage provider (RADOS)

The RGW instance fails to start; journalctl shows the message above.

For more detail, check the RGW log:

2020-10-22 16:51:55.771035 7fb1b0f20e80  0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 2546439
2020-10-22 16:51:55.792872 7fb1b0f20e80  0 librados: client.rgw.ceph02 authentication error (22) Invalid argument
2020-10-22 16:51:55.793450 7fb1b0f20e80 -1 Couldn't init storage provider (RADOS)

This points to an authentication problem.

Get the command line from systemctl status ceph-radosgw@rgw.$RGW_HOST and run it manually:

radosgw -f --cluster ceph  --name client.rgw.ceph02 --setuser ceph --setgroup ceph -d --debug_ms 1

It reports the same error. Adding the --keyring option solves it:

radosgw -f --cluster ceph  --name client.rgw.ceph02        \
  --setuser ceph --setgroup ceph -d --debug_ms 1           \
  --keyring=/var/lib/ceph/radosgw/ceph-rgw.ceph02/keyring

So the systemd service was failing because it could not find the keyring.

9.2 Prometheus Module Cannot Be Enabled on Machines with IPv6 Disabled

Error message: Unhandled exception from module 'prometheus' while running on mgr.master01-10-5-38-24: error('No socket could be created',)

Solution: ceph config-key set mgr/prometheus/server_addr 0.0.0.0
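
A hedged follow-up so the new address takes effect (disable/enable restarts the module; restarting the active mgr works as well):

ceph mgr module disable prometheus
ceph mgr module enable prometheus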

9.3 Repeated mon… clock skew Warnings

The cause is that the clock-skew warning threshold is too low. Add the following to the global section and restart the MONs:

mon clock drift allowed = 2
mon clock drift warn backoff = 30

Or apply the settings immediately with:

ceph tell mon.* injectargs '--mon_clock_drift_allowed=2'
ceph tell mon.* injectargs '--mon_clock_drift_warn_backoff=30'

Or check the NTP configuration to ensure the clocks are synchronized accurately enough.

9.4 Deep Scrubbing Causes High I/O

Deep scrubbing is I/O intensive. If it cannot finish for a long time, you can disable it:

ceph osd set noscrub
ceph osd set nodeep-scrub

Re-enable it once the problem is resolved:

ceph osd unset noscrub
ceph osd unset nodeep-scrub 

When CFQ is the I/O scheduler, you can adjust the priority of the OSD disk threads:

# set the scheduler
echo cfq > /sys/block/sda/queue/scheduler
 
# check the current disk-thread I/O priority class of an OSD
ceph daemon osd.4 config get osd_disk_thread_ioprio_class
 
# change the I/O priority
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
# IOPRIO_CLASS_RT is the highest, IOPRIO_CLASS_IDLE the lowest
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'

If the measures above are not enough, consider tuning the following parameters:

osd_deep_scrub_stride = 131072                                                                                                                                                       
# range of the number of chunks scrubbed at a time
osd_scrub_chunk_min = 1                                                                                                                                                              
osd_scrub_chunk_max = 5                                                                                                                                                              
osd scrub during recovery = false                                                                                                                                                    
osd deep scrub interval = 2592000                                                                                                                                                    
osd scrub max interval = 2592000                                                                                                                                                     
# number of concurrent scrubs per OSD
osd max scrubs = 1   
# allowed scrub hours
osd scrub begin hour = 2
osd scrub end hour = 6
# do not start a scrub when the system load exceeds this value
osd scrub load threshold = 4                                                                                                                                                         
# sleep 0.1 s between scrub chunks
osd scrub sleep = 0.1                                                                                                                                                                  
# disk thread I/O priority
osd disk thread ioprio priority = 7
osd disk thread ioprio class = idle

9.5 Forced unmap

If the watcher has been blacklisted, trying to unmap the image fails with: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

You can force the unmap with: rbd unmap -o force ...

9.6 Cannot Reach active+clean After Increasing pg_num and pgp_num

Some PGs get stuck; a likely cause is the limit on the number of PGs per OSD. Raise the global option mon_max_pg_per_osd and restart the MONs; see the sketch below.
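
A hedged example; the value 400 is arbitrary, so pick one appropriate for your OSD count:

[global]
mon_max_pg_per_osd = 400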

Also note: after changing the PG count, always wait for the cluster to return to active+clean before making the next adjustment.

9.7 Cannot Delete an RBD Image

The Kubernetes PV corresponding to the second image below has already been deleted:

rbd ls
# kubernetes-dynamic-pvc-35350b13-46b8-11e8-bde0-a2c14c93573f
# kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4

but the corresponding RBD image was not removed. Delete it manually:

rbd remove kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4

It fails with:

2018-04-23 13:37:25.559444 7f919affd700 -1 librbd::image::RemoveRequest: 0x5598e77831d0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete…failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.

Check the image status:

# rbd info kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
rbd image 'kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4':
        size 8192 MB in 2048 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.1003e238e1f29
        format: 2
        features: layering
        flags: 
        create_timestamp: Mon Apr 23 11:42:59 2018
 
#rbd status kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
Watchers:
        watcher=10.0.0.101:0/4275384344 client.65597 cookie=18446462598732840963

On the machine 10.0.0.101:

# df | grep e6e3339859d4
/dev/rbd2        8125880  251560   7438508   4% /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4

After restarting the kubelet, the RBD image could be deleted.

9.8 Error EEXIST: entity osd.9 exists but key does not match

# delete the key
ceph auth del osd.9
# gather keys from the target host again
ceph-deploy --username ceph-ops gatherkeys Carbon

9.9 New Pool Cannot Reach active+clean


    pgs:     12.413% pgs unknown                                                                                                                                                   
             20.920% pgs not active                                                                                                                                                
             768 active+clean                                                                                                                                                      
             241 creating+activating                                                                                                                                               
             143 unknown 

This was probably caused by too large a total PG count; after lowering the number of PGs the pool quickly became active+clean.

9.10 Orphaned Pod Cannot Be Cleaned Up

Error message: Orphaned pod "a9621c0e-41ee-11e8-9407-deadbeef00a0" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them

Temporary workaround:

rm -rf /var/lib/kubelet/pods/a9621c0e-41ee-11e8-9407-deadbeef00a0/volumes/rook.io~rook/

9.11 OSD Fails to Start: ERROR: osd init failed: (1) Operation not permitted

A likely cause is that the keyring used by the OSD does not match what the MONs hold. For OSD 14, replace the contents of /var/lib/ceph/osd/ceph-14/keyring on the host with the first two lines of the output of ceph auth get osd.14; see the sketch below.
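
A hedged sketch, assuming the standard keyring path for osd.14 and a systemd-managed OSD (keep only the [osd.14] header and key line from the output):

ceph auth get osd.14
# [osd.14]
#         key = AQ...          <- paste these two lines into the keyring file
vi /var/lib/ceph/osd/ceph-14/keyring
systemctl restart ceph-osd@14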

9.12 Mount failed with ‘(11) Resource temporarily unavailable’

This error occurs when ceph-objectstore-tool is run without stopping the OSD first; see the sketch below for the correct order.
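
A hedged reminder of the correct order, mirroring the procedure in section 3.3 (the OSD id and PG id are placeholders):

systemctl stop ceph-osd@16
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --pgid 17.30 --op mark-complete
systemctl start ceph-osd@16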

9.13 neither public_addr nor public_network keys are defined for monitors

This error appears when adding MON nodes with ceph-deploy. Add public_network to the global section of the config file, for example:
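
A hedged example; the subnet below comes from the networks used earlier in this article and should be replaced with your actual public network:

[global]
public_network = 10.0.0.0/16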

9.14 PV Stuck in Terminating After Deletion

Possible causes:

  1. The corresponding PVC has not been deleted and still references the PV. Delete the PVC first.

9.15 chown: cannot access '/var/log/ceph': No such file or directory

The OSD fails to start with the error above. You can configure:

ceph:
  storage:
    osd_log: /var/log 

9.16 HEALTH_WARN application not enabled on

Enable an application on the pool to clear the warning:

#                                pool application
ceph osd pool application enable rbd block-devices

10. Diagnostics

Debug Logging

Note: verbose logging can exceed 1 GB per hour; if the system disk fills up, the node will stop working.

10.1 Temporarily Enabling Debug Logs

 # set the option remotely via ceph tell
ceph tell osd.0 config set debug_osd 0/5
 
# or, on the target host, set it directly on the OSD daemon
ceph daemon osd.0 config set debug_osd 0/5

10.2 Configuring Log Levels

Log levels can be customized per subsystem:

 # debug {subsystem} = {log-level}/{memory-level}
 
[global]
        debug ms = 1/5
[mon]
        debug mon = 20
        debug paxos = 1/5
        debug auth = 2
[osd]
        debug osd = 1/5
        debug filestore = 1/5
        debug journal = 1
        debug monc = 5/20
[mds]
        debug mds = 1
        debug mds balancer = 1
        debug mds log = 1
        debug mds migrator = 1

Subsystem defaults (log level / memory log level):

Subsystem         Log Level   Memory Level
default               0            5
lockdep               0            1
context               0            1
crush                 1            1
mds                   1            5
mds balancer          1            5
mds locker            1            5
mds log               1            5
mds log expire        1            5
mds migrator          1            5
buffer                0            1
timer                 0            1
filer                 0            1
striper               0            1
objecter              0            1
rados                 0            5
rbd                   0            5
rbd mirror            0            5
rbd replay            0            5
journaler             0            5
objectcacher          0            5
client                0            5
osd                   1            5
optracker             0            5
objclass              0            5
filestore             1            3
journal               1            3
ms                    0            5
mon                   1            5
monc                  0            10
paxos                 1            5
tp                    0            5
auth                  1            5
crypto                1            5
finisher              1            1
reserver              1            1
heartbeatmap          1            5
perfcounter           1            5
rgw                   1            5
rgw sync              1            5
civetweb              1            10
javaclient            1            5
asok                  1            5
throttle              1            1
refs                  0            0
compressor            1            5
bluestore             1            5
bluefs                1            5
bdev                  1            3
kstore                1            5
rocksdb               4            5
leveldb               4            5
memdb                 4            5
fuse                  1            5
mgr                   1            5
mgrc                  1            5
dpdk                  1            5
eventtrace            1            5

10.3 Faster Log Rotation

If disk space is limited, configure /etc/logrotate.d/ceph to rotate logs more aggressively:

rotate 7
weekly
size 500M
compress
sharedscripts

Then add a cron job to run logrotate periodically:

30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
