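Preface
While reading some code I came across this parameter, so I looked it up in the official documentation, which describes it as follows: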
chooseleaf_stable: Whether a recursive chooseleaf attempt will use a better value for an inner loop that greatly reduces the number of mapping changes when an OSD is marked out. The legacy value is 0, while the new value of 1 uses the new approach.
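In other words, it controls whether CRUSH uses a newer method that reduces the number of mapping changes after an OSD is marked out.
The practical effect is that marking an OSD out changes as few PG mappings as possible, which keeps data migration to a minimum.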
Experiment
Change the value of chooseleaf_stable, mark one OSD out, and compare how the PG mappings change.
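The post does not show how the tunable was switched. One common route is to decompile the CRUSH map with crushtool, edit the tunable line, recompile, and inject the map back. The sketch below covers only the text-editing step. It assumes the decompiled map uses lines of the form "tunable chooseleaf_stable 1"; the function name set_tunable and the sample text are hypothetical, and the commands in the comments should be checked against your Ceph release.

```python
import re

# Surrounding workflow (run by hand, not by this sketch):
#   ceph osd getcrushmap -o crush.bin
#   crushtool -d crush.bin -o crush.txt
#   ... edit crush.txt, for example with set_tunable() below ...
#   crushtool -c crush.txt -o crush.new
#   ceph osd setcrushmap -i crush.new


def set_tunable(crush_text, name, value):
    """Return crush_text with 'tunable <name>' set to value.

    If the map has no such line, the tunable is added at the top.
    That placement is an assumption of this sketch.
    """
    line_re = re.compile(
        r"^tunable[ \t]+" + re.escape(name) + r"[ \t]+\d+[ \t]*$",
        re.MULTILINE,
    )
    new_line = "tunable {} {}".format(name, value)
    if line_re.search(crush_text):
        return line_re.sub(new_line, crush_text)
    return new_line + "\n" + crush_text


# Hypothetical fragment of a decompiled CRUSH map, for illustration only.
sample = (
    "# begin crush map\n"
    "tunable chooseleaf_vary_r 1\n"
    "tunable chooseleaf_stable 0\n"
    "\n"
    "# devices\n"
)
print(set_tunable(sample, "chooseleaf_stable", 1))
```

Changing a CRUSH tunable on a cluster that holds data can itself trigger data movement, so this is best tried on an empty test cluster like the one used here.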
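Environment
3 replicas, 3 nodes, 3 OSDs per node
Test script
The script records the OSD up set of every PG in a single pool, marks an OSD out, and records the up sets again. For each PG it then counts how many OSDs changed; the more OSDs change, the more data has to migrate. The cluster held no data, so a 10-second wait for peering was enough.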
import os
import sys
import commands
import time
def getPgUpSet():
    # The pool id is 8; column 1 of "ceph pg dump" is the PG id.
    cmd = "ceph pg dump |grep '^8\.' | awk '{print $1}'"
    (status,output) = commands.getstatusoutput(cmd)
    output = output.split('\n')
    # Skip the first line, "dumped all in format plain".
    pglist = output[1:]
    # Column 15 is the up set in the Ceph release used here; it may differ in other releases.
    cmd = "ceph pg dump |grep '^8\.' | awk '{print $15}'"
    (status,output) = commands.getstatusoutput(cmd)
    output = output.split('\n')
    osdsets = output[1:]
    # Map each PG id to its up set, kept as a string such as "[0,8,3]".
    pgosdset = {}
    for i in range(0,len(pglist)):
        pgosdset[pglist[i]] = osdsets[i]
    return pgosdset
# Compare the up sets recorded before and after the OSD is marked out.
def compBASet(bpgosdset,apgosdset):
    change0 = 0
    change1 = 0
    change2 = 0
    change3 = 0
    for key in bpgosdset.keys():
        # Up sets are strings such as "[0,8,3]". Split them into OSD ids
        # so that multi-digit ids are compared as whole ids, not characters.
        before = bpgosdset[key].strip('[]').split(',')
        after = apgosdset[key].strip('[]').split(',')
        change = 0
        for osd in before:
            if osd not in after:
                change += 1
        if change == 0:
            change0 += 1
        elif change == 1:
            change1 += 1
        elif change == 2:
            change2 += 1
            print "change 2 osd in pg ",str(key)
            print "before out up set ",bpgosdset[key]
            print "after out up set ",apgosdset[key]
        elif change == 3:
            change3 += 1
            print "change 3 osd in pg ",str(key)
            print "before out up set ",bpgosdset[key]
            print "after out up set ",apgosdset[key]
    print "change0 = ",change0
    print "change1 = ",change1
    print "change2 = ",change2
    print "change3 = ",change3
def test():
    bpgosdset = getPgUpSet()
    os.system("systemctl stop ceph-osd@0")
    os.system("ceph osd out 0")
    # Wait for peering to finish.
    time.sleep(10)
    apgosdset = getPgUpSet()
    compBASet(bpgosdset,apgosdset)

if __name__ == "__main__":
    test()
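The comparison step can also be tried without a cluster. The following is a minimal Python 3 sketch of the same counting logic, not the script used for the runs in this post; the helper names parse_osd_set, changed_osds and change_histogram are made up for illustration. It parses the up sets into integers, so OSD ids with more than one digit are handled too.

```python
import re
from collections import Counter


def parse_osd_set(text):
    """Turn a string such as "[0,8,3]" into the list [0, 8, 3]."""
    return [int(tok) for tok in re.findall(r"\d+", text)]


def changed_osds(before, after):
    """Count the OSDs of the old up set that are missing from the new one."""
    return len(set(before) - set(after))


def change_histogram(before_map, after_map):
    """Map 'number of changed OSDs' to 'number of PGs with that many changes'."""
    hist = Counter()
    for pg, before in before_map.items():
        hist[changed_osds(before, after_map[pg])] += 1
    return hist


# Three PGs copied from the chooseleaf_stable = 0 run in this post.
before = {
    "8.14": parse_osd_set("[0,8,3]"),
    "8.66": parse_osd_set("[0,4,8]"),
    "8.64": parse_osd_set("[4,0,8]"),
}
after = {
    "8.14": parse_osd_set("[8,4,2]"),
    "8.66": parse_osd_set("[5,6,2]"),
    "8.64": parse_osd_set("[4,7,2]"),
}
print(dict(change_histogram(before, after)))
```

Using set difference ignores the position of an OSD inside the up set, which matches the original script: an OSD that only moves to another slot is not counted as a change.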
Procedure
- Set chooseleaf_stable to 0, then run the script.
[root@host196 yg]# ceph osd crush dump | grep stable
"chooseleaf_stable": 0,
[root@host196 home]# python pgmapcompare.py
marked out osd.0.
change 2 osd in pg 8.14
before out up set [0,8,3]
after out up set [8,4,2]
change 2 osd in pg 8.10
before out up set [0,4,7]
after out up set [4,6,1]
change 2 osd in pg 8.1f
before out up set [0,7,4]
after out up set [6,1,4]
change 2 osd in pg 8.64
before out up set [4,0,8]
after out up set [4,7,2]
change 3 osd in pg 8.66
before out up set [0,4,8]
after out up set [5,6,2]
change 2 osd in pg 8.4b
before out up set [0,5,8]
after out up set [4,1,8]
change 2 osd in pg 8.71
before out up set [0,4,6]
after out up set [3,2,6]
change 2 osd in pg 8.7e
before out up set [3,0,6]
after out up set [3,7,1]
change 2 osd in pg 8.59
before out up set [0,3,7]
after out up set [4,1,7]
change 2 osd in pg 8.78
before out up set [0,5,8]
after out up set [3,2,8]
change 2 osd in pg 8.9
before out up set [0,8,4]
after out up set [8,3,2]
change 2 osd in pg 8.23
before out up set [0,4,7]
after out up set [5,1,7]
change 2 osd in pg 8.31
before out up set [5,0,7]
after out up set [5,6,2]
change 3 osd in pg 8.39
before out up set [0,6,5]
after out up set [7,3,1]
change 2 osd in pg 8.3d
before out up set [0,8,5]
after out up set [6,5,1]
change 2 osd in pg 8.5a
before out up set [0,4,6]
after out up set [5,6,1]
change0 = 86
change1 = 26
change2 = 14
change3 = 2
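After osd.0 was marked out, the PGs were remapped: 86 PGs did not move, 26 PGs changed 1 OSD, 14 PGs changed 2 OSDs, and 2 PGs changed all 3 OSDs.
The pool has 128 PGs.
- Add the OSD back to the cluster, set chooseleaf_stable to 1, and run the script again.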
[root@host196 yg]# ceph osd crush dump | grep stable
"chooseleaf_stable": 1,
[root@host196 home]# python pgmapcompare.py
marked out osd.0.
change0 = 88
change1 = 40
change2 = 0
change3 = 0
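Conclusion
With chooseleaf_stable set to 0, 26×1 + 14×2 + 2×3 = 60 PG replicas (shards) migrated. With it set to 1, only 40×1 = 40 migrated, a reduction of one third, and no PG changed more than one OSD.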
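The arithmetic can be checked with a few lines of Python. This is only a restatement of the counts printed by the two runs above; the function name moved_shards is chosen here for illustration.

```python
def moved_shards(hist):
    """Sum (changed OSDs per PG) * (number of such PGs) over the histogram."""
    return sum(changed * pgs for changed, pgs in hist.items())


# Counts reported by the script: {changed OSDs: number of PGs}.
stable0 = {0: 86, 1: 26, 2: 14, 3: 2}
stable1 = {0: 88, 1: 40, 2: 0, 3: 0}

print(sum(stable0.values()), sum(stable1.values()))  # 128 128, the pool's PG count
print(moved_shards(stable0), moved_shards(stable1))  # 60 40
```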