This post analyzes the watchers for the pool CRD, object store CRD, object store user CRD, file system CRD, NFS Ganesha CRD, and so on, together with the mon/OSD health checks and the Ceph status check.
replicated: the number of data replicas

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
  deviceClass: hdd

After the Ceph cluster has been created, pool.NewPoolController is started. Its main job is to watch the pool resources that StorageClasses reference and to perform the osd pool creation:
// Start pool CRD watcher
poolController := pool.NewPoolController(c.context, cluster.Namespace)
poolController.StartWatch(cluster.stopCh)
1. Pool CRD
This post only analyzes the pool CRD watcher. It watches CephBlockPool resources and reacts through the callbacks onAdd, onUpdate, and onDelete:
// Watch watches for instances of Pool custom resources and acts on them
func (c *PoolController) StartWatch(stopCh chan struct{}) error {
	resourceHandlerFuncs := cache.ResourceEventHandlerFuncs{
		AddFunc:    c.onAdd,
		UpdateFunc: c.onUpdate,
		DeleteFunc: c.onDelete,
	}

	logger.Infof("start watching pool resources in namespace %s", c.namespace)
	watcher := opkit.NewWatcher(PoolResource, c.namespace, resourceHandlerFuncs, c.context.RookClientset.CephV1().RESTClient())
	go watcher.Watch(&cephv1.CephBlockPool{}, stopCh)

	// watch for events on all legacy types too
	c.watchLegacyPools(c.namespace, stopCh, resourceHandlerFuncs)

	return nil
}
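The three callbacks registered above are invoked by the watcher's event loop. The dispatch pattern can be sketched in a few self-contained lines (this illustrates the pattern only, not the actual opkit/client-go implementation; all types and names below are made up):

```go
package main

import "fmt"

// handlerFuncs mirrors the shape of cache.ResourceEventHandlerFuncs:
// one callback per event type.
type handlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}

// event is a simplified stand-in for a watch event.
type event struct {
	kind     string // "ADDED", "MODIFIED" or "DELETED"
	old, obj interface{}
}

// dispatch routes one watch event to the matching callback, the way
// the watcher's event loop does for the PoolController.
func dispatch(h handlerFuncs, e event) {
	switch e.kind {
	case "ADDED":
		h.AddFunc(e.obj)
	case "MODIFIED":
		h.UpdateFunc(e.old, e.obj)
	case "DELETED":
		h.DeleteFunc(e.obj)
	}
}

func main() {
	h := handlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Printf("onAdd: %v\n", obj) },
		UpdateFunc: func(o, n interface{}) { fmt.Printf("onUpdate: %v -> %v\n", o, n) },
		DeleteFunc: func(obj interface{}) { fmt.Printf("onDelete: %v\n", obj) },
	}
	dispatch(h, event{kind: "ADDED", obj: "replicapool"})
	dispatch(h, event{kind: "DELETED", obj: "replicapool"})
}
```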
# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_ruleset",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "replicapool",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
2. Pool CRD onAdd
onAdd calls createPool; the commands it ultimately runs are:
The createReplicationCrushRule function runs: ceph osd crush rule create-simple replicapool default host --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/685363112
func (c *PoolController) onAdd(obj interface{}) {
	pool, migrationNeeded, err := getPoolObject(obj)
	if err != nil {
		logger.Errorf("failed to get pool object: %+v", err)
		return
	}

	if migrationNeeded {
		if err = c.migratePoolObject(pool, obj); err != nil {
			logger.Errorf("failed to migrate pool %s in namespace %s: %+v", pool.Name, pool.Namespace, err)
		}
		return
	}

	err = createPool(c.context, pool)
	if err != nil {
		logger.Errorf("failed to create pool %s. %+v", pool.ObjectMeta.Name, err)
	}
}
3. CreateECPoolForApp
ceph osd pool create replicapool 0 replicated replicapool --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
(Note: this command comes from the replicated code path used by the replicapool example; the snippet below shows the erasure-coded variant, which assembles its arguments the same way.)
func CreateECPoolForApp(context *clusterd.Context, clusterName string, newPool CephStoragePoolDetails, appName string, enableECOverwrite bool, erasureCodedConfig model.ErasureCodedPoolConfig) error {
	args := []string{"osd", "pool", "create", newPool.Name, strconv.Itoa(newPool.Number), "erasure", newPool.ErasureCodeProfile}
	buf, err := ExecuteCephCommand(context, clusterName, args)
	if err != nil {
		return fmt.Errorf("failed to create EC pool %s. %+v", newPool.Name, err)
	}
	// (remainder of the function omitted)
3.1 SetPoolProperty
ceph osd pool set replicapool size 2 --connect-timeout=15 --cluster=rook-ceph — sets the pool's replica-count property:
func SetPoolProperty(context *clusterd.Context, clusterName, name, propName string, propVal string) error {
	args := []string{"osd", "pool", "set", name, propName, propVal}
	_, err := ExecuteCephCommand(context, clusterName, args)
	if err != nil {
		return fmt.Errorf("failed to set pool property %s on pool %s, %+v", propName, name, err)
	}
	return nil
}
3.2 givePoolAppTag
ceph osd pool application enable replicapool rbd --yes-i-really-mean-it --connect-timeout=15 --cluster=rook-ceph
func givePoolAppTag(context *clusterd.Context, clusterName string, poolName string, appName string) error {
	args := []string{"osd", "pool", "application", "enable", poolName, appName, confirmFlag}
	_, err := ExecuteCephCommand(context, clusterName, args)
	if err != nil {
		return fmt.Errorf("failed to enable application %s on pool %s. %+v", appName, poolName, err)
	}
	return nil
}
4. OSD health check: runs the osd dump command
[root@master ~]# kubectl get pods -n rook-ceph
NAME READY STATUS RESTARTS AGE
rook-ceph-agent-kgf6v 1/1 Running 0 4d
rook-ceph-agent-lh9fl 1/1 Running 0 4d
rook-ceph-mgr-a-6dc98f4799-cw6bf 1/1 Running 0 4d
rook-ceph-mon-a-656468d54f-lcsn8 1/1 Running 0 4d
rook-ceph-operator-8bc78b546-r2slk 1/1 Running 0 4d
rook-ceph-osd-0-587d967794-2kdt5 1/1 Running 0 4d
rook-ceph-osd-1-77c556d78f-gw5md 1/1 Running 8 4d
rook-ceph-osd-prepare-master-node-zrxs4 0/2 Completed 0 4d
rook-ceph-osd-prepare-node1-rs4fb 0/2 Completed 0 4d
rook-discover-sgmtr 1/1 Running 0 4d
rook-discover-tpzfp 1/1 Running 0 4d
For example, run kubectl apply -f storageclass.yaml with the following content:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: ceph.rook.io/block
parameters:
  blockPool: replicapool
  # Specify the namespace of the rook cluster from which to create volumes.
  # If not specified, it will use `rook` as the default namespace of the cluster.
  # This is also the namespace where the cluster will be
  clusterNamespace: rook-ceph
  # Specify the filesystem type of the volume. If not specified, it will use `ext4`.
  fstype: xfs
  # (Optional) Specify an existing Ceph user that will be used for mounting storage with this StorageClass.
  #mountUser: user1
  # (Optional) Specify an existing Kubernetes secret name containing just one key holding the Ceph user secret.
  # The secret must exist in each namespace(s) where the storage will be consumed.
  #mountSecret: ceph-user1-secret
The main commands executed are:
ceph osd crush rule create-simple replicapool default host --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/490128287
ceph osd pool create replicapool 0 replicated replicapool --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/762947698
ceph osd pool set replicapool size 3 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/384935721
ceph osd pool application enable replicapool rbd --yes-i-really-mean-it --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/873283188
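The four commands above share the same tail of connection flags. How such argument lists can be assembled is sketched below in the spirit of rook's client package (the helper name and config paths are illustrative, and the --out-file flag is omitted):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// appendClusterFlags appends the per-cluster connection options that
// every ceph invocation carries (paths are illustrative; the real
// commands also add an --out-file for the JSON result).
func appendClusterFlags(args []string, clusterName string) []string {
	return append(args,
		"--connect-timeout=15",
		"--cluster="+clusterName,
		"--conf=/var/lib/rook/"+clusterName+"/"+clusterName+".config",
		"--keyring=/var/lib/rook/"+clusterName+"/client.admin.keyring",
		"--format", "json",
	)
}

func main() {
	cluster, pool := "rook-ceph", "replicapool"
	size := 3
	// the four steps performed for a replicated CephBlockPool
	cmds := [][]string{
		{"osd", "crush", "rule", "create-simple", pool, "default", "host"},
		{"osd", "pool", "create", pool, "0", "replicated", pool},
		{"osd", "pool", "set", pool, "size", strconv.Itoa(size)},
		{"osd", "pool", "application", "enable", pool, "rbd", "--yes-i-really-mean-it"},
	}
	for _, c := range cmds {
		fmt.Println("ceph " + strings.Join(appendClusterFlags(c, cluster), " "))
	}
}
```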
Summary:
The controller watches CephBlockPool resources and acts on them:
it creates the crush rule, creates the pool, and sets the pool properties.

What is a crushmap?
The crushmap is, in effect, the data-distribution map of a Ceph cluster. Through this map the CRUSH algorithm knows how data should be distributed; it can locate where data is stored so that reads and writes go directly to the right OSDs; and it expresses failure-domain settings and redundancy policies. The flexibility of the crushmap is a good showcase of Ceph's software-defined storage approach.
For comparison, it helps to recall the related RAID concepts:
raid0: also called Stripe. It has the best performance of all RAID levels: contiguous data is split across multiple disks, making full use of each disk's throughput and breaking the throughput limit of a single disk.
raid1: also called Mirror. Its purpose is data availability and recoverability: everything written to the disk is automatically duplicated in full onto a second disk.
raid10: a combination of raid0 and raid1 that gives both high reliability and efficient disk usage, similar to Ceph's multi-replica strategy.
raid5: balances performance, cost, and data safety. Instead of keeping a full backup, it stores the data together with parity information across the member disks; when one disk fails, the lost data is rebuilt from the remaining data and the parity, similar to Ceph's erasure-coding strategy.
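The parity idea behind raid5 (and, in spirit, erasure coding) fits in a few lines: XOR all data blocks to get the parity block, then XOR the survivors with the parity to rebuild a lost block. A minimal sketch:

```go
package main

import "fmt"

// xorBlocks returns the byte-wise XOR of the given equal-length blocks.
// XOR-ing all data blocks yields the parity; XOR-ing the parity with
// the surviving blocks yields the missing one.
func xorBlocks(blocks ...[]byte) []byte {
	out := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i, v := range b {
			out[i] ^= v
		}
	}
	return out
}

func main() {
	d0 := []byte("DATA0")
	d1 := []byte("DATA1")
	parity := xorBlocks(d0, d1)

	// simulate losing d1 and rebuilding it from d0 plus the parity
	recovered := xorBlocks(d0, parity)
	fmt.Printf("recovered: %s\n", recovered) // prints: recovered: DATA1
}
```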
Basic principle of CRUSH
Commonly used distributed data-placement algorithms are consistent hashing and CRUSH; CRUSH is also described as a scalable pseudo-random data-distribution algorithm.
Consistent hashing:
Take a circle (its range can be 0 to 2^31 - 1) and divide it into n regions using n virtual nodes, each virtual node owning one range. As shown in the figure: T0 owns [A, B], T1 owns [B, C], T2 owns [C, D], and T3 owns [D, A].
Because the partitions are fixed, it is easy to tell which data must migrate and which does not.
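The fixed four-node ring can be modeled directly (a toy sketch; the four boundary positions are made up, only the T0..T3 names come from the text, and the key space here is [0, 2^32) for convenience):

```go
package main

import (
	"fmt"
	"sort"
)

// ring is a toy consistent-hash ring with fixed virtual-node positions.
type ring struct {
	points []uint32 // sorted virtual-node positions on the circle
	names  []string // owner of each position
}

// locate returns the virtual node owning key k: the first position at
// or after k, wrapping around the circle if k is past the last one.
func (r ring) locate(k uint32) string {
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= k })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.names[i]
}

func main() {
	r := ring{
		points: []uint32{1 << 30, 2 << 30, 3 << 30, 1<<32 - 1},
		names:  []string{"T0", "T1", "T2", "T3"},
	}
	for _, k := range []uint32{100, uint32(3)<<30 + 5} {
		fmt.Printf("key %d -> %s\n", k, r.locate(k))
	}
}
```

Because the positions are fixed, adding or removing one virtual node only shifts ownership of the adjacent range, which is exactly why it is easy to tell which data has to migrate.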
When every node has the same capacity and the number of virtual nodes equals the number of physical nodes, each node ends up owning an equally sized segment, which yields the best possible data distribution.
CRUSH:
CRUSH is very similar in principle to consistent hashing. The PG plays the role of the virtual node above: it partitions the key space, each PG manages a data interval of the same size, and data therefore spreads evenly across the PGs.
How CRUSH places data: object x is run through a hash function, the result is taken modulo the number of PGs, and that value is the PG id; CRUSH then determines which OSDs that PG maps to, and the data is finally stored on those OSDs. Two mappings are involved: object → PG and PG → OSD. Because PGs are abstract storage nodes whose count normally stays constant (it does not change as nodes are added or removed), the object → PG mapping is stable.
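The two-stage mapping can be sketched as follows. Stage 1 (object → PG) is a plain hash modulo the PG count; stage 2 (PG → OSD) is shown here as a toy deterministic "straw draw" standing in for CRUSH (real CRUSH also walks the crushmap hierarchy so replicas land in different failure domains):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// objToPG is stage 1: hash the object name and take it modulo the PG
// count. The PG count normally stays fixed, so this mapping is stable.
func objToPG(obj string, pgNum uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(obj))
	return h.Sum32() % pgNum
}

// pgToOSDs is stage 2 as a toy stand-in for CRUSH: every OSD draws a
// deterministic pseudo-random "straw" seeded by (pg, osd) and the
// `size` largest straws win. The result depends only on (pg, osd),
// so placement is stable while the OSD set is unchanged.
// size must be <= len(osds).
func pgToOSDs(pg uint32, osds []int, size int) []int {
	type straw struct {
		osd int
		val uint32
	}
	straws := make([]straw, len(osds))
	for i, o := range osds {
		h := fnv.New32a()
		fmt.Fprintf(h, "%d/%d", pg, o)
		straws[i] = straw{o, h.Sum32()}
	}
	sort.Slice(straws, func(a, b int) bool { return straws[a].val > straws[b].val })
	out := make([]int, size)
	for i := range out {
		out[i] = straws[i].osd
	}
	return out
}

func main() {
	pg := objToPG("myobject", 128)
	fmt.Printf("object -> pg %d -> osds %v\n", pg, pgToOSDs(pg, []int{0, 1, 2, 3}, 3))
}
```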