Ceph Crush解析

最新推荐文章于 2023-04-03 19:20:32 发布

Ni259251

最新推荐文章于 2023-04-03 19:20:32 发布

阅读量2.1k

点赞数

分类专栏： ceph

ceph 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

文章出处：

https://www.2cto.com/net/201606/515330.html

文章比较深入的写了CRUSH算法的原理和过程.通过调试深入的介绍了CRUSH计算的过程.

写在前面

读本文前,你需要对ceph的基本操作,pool和CRUSH map非常熟悉.并且较深入的读过源码.

分析的方法

首先,我们写了个c程序调用librados向pool中写入一个对象.然后使用 GDB(CGDB is recommended)来追踪CRUSH的运行过程.同时我们将会关注CRUSH相关的几个重要变量.以便于我们完全知道CRUSH源码是如何运行.

1 如何追踪

1.1 编译CEPH

首先通过源码安装CEPH,然后使用./configure.如下添加编译参数/configure CFLAGS='-g3 –O0' CXXFLAGS='-g3 –O0'
-g3 意味着会产生大量的调试信息.-O0 非常重要, 它意味着关闭编译器的优化，如果没有，使用GDB追踪程序时，大多数变量被优化,无法显示。配置后，make and sudo make install。附： -O0只适合用于实验情况，在生产环境中编译器优化是必须进行的。

1.2 得到函数stack

众所周知,CRSUSH的核心函数是 crush_do_rule(位置 crush/mapper.c line 779).

 
          <code  
          class 
          = 
          "language-c++ hljs java" 
          > 
         
          /** 
         
          * crush_do_rule - calculate a mapping with the given input and rule 
         
          * @map: the crush_map 
         
          * @ruleno: the rule id 
         
          * @x: hash input 
         
          * @result: pointer to result vector 
         
          * @result_max: maximum result size 
         
          * @weight: weight vector (for map leaves) 
         
          * @weight_max: size of weight vector 
         
          * @scratch: scratch vector for private use; must be >= 3 * result_max 
         
          */ 
         
          int 
          crush_do_rule( 
          const 
          struct crush_map *map, 
         
          int 
          ruleno,  
          int 
          x,  
          int 
          *result,  
          int 
          result_max, 
         
          const 
          __u32 *weight,  
          int 
          weight_max, 
         
          int 
          *scratch) 
         
          </code>

通过这个函数将crush计算过程分为两部分:
1. input -> PGID
2. PGID -> OSD set.
第一部分,使用GDB来得到函数过程

通过 -g 参数来编译例子程序 gdb来参看 rados_write 然后,在添加断点 b crush_do_rule前,进入GDB的接口. 在函数 crush_do_rule 处停留得到函数stack,然后使用GDB log将调试信息输出到文件中 .
下面让我们来深入的研究这个过程.

函数stack和下面的相似

 
          <code  
          class 
          = 
          "hljs vala" 
          ># 
          12 
          main 
         
 
          # 
          11 
          rados_write 
         
 
          # 
          10 
          librados::IoCtxImpl::write 
         
 
          # 
          9 
          librados::IoCtxImpl::operate 
         
 
          # 
          8 
          Objecter::op_submit 
         
 
          # 
          7 
          Objecter::_op_submit_with_budget 
         
 
          # 
          6 
          Objecter::_op_submit 
         
 
          # 
          5 
          Objecter::_calc_target 
         
 
          # 
          4 
          OSDMap::pg_to_up_acting_osds 
         
 
          # 
          3 
          OSDMap::_pg_to_up_acting_osds 
         
 
          # 
          2 
          OSDMap::_pg_to_osds 
         
 
          # 
          1 
          CrushWrapper::do_rule 
         
 
          # 
          0 
          crush_do_rule</code> 
         

追踪过程

CRUSH 计算过程总结如下:

1	`<code` `class` `=` `"hljs applescript"` `> INPUT(object name & pool name) —> PGID —> OSD set.</code>`

本文中主要关注计算过程

2.1 input —> PGID

你可以按顺序阅读源码.最后,通过列出转换关键过程
从input 到 pgid

1 2	`<code` `class` `=` `"language-c++ hljs objectivec"` `>Ceph_hash.h (include):` `extern unsigned ceph_str_hash(` `int` `type,` `const` `char` `*s, unsigned len);</code>`

1	`<code` `class` `=` `"hljs avrasm"` `>OSDMap.h (osd):` `int` `ret = object_locator_to_pg(oid, loc, pg);</code>`

首先从 rados_ioctx_create, 和lookup_pool 通过pool名字得到 poolid. 把poolid封装进librados::IoCtxImpl 类型变量ctx;

然后在rados_write 对象名被封装进oid; 然后在librados::IoCtxImpl::operate, oid 和oloc(comprising poolid) 被包装成Objecter::Op * 类型变量objecter_op;

通过各种类型的封装我们到到 _calc_target 这层. 我们得到不断的oid 和poolid. 然后读取目标 pool 的信息.

这里写图片描述

(in my cluster, pool “neo” id is 29, name of object to write is “neo-obj”)<喎�"/kf/ware/vc/" target="_blank" class="keylink">vcD4NCjxwPtTaPGNvZGU+b2JqZWN0X2xvY2F0b3JfdG9fcGc8L2NvZGU+LCC12tK7tM68xsvjtNM8Y29kZT5jZXBoX3N0cl9oYXNoPC9jb2RlPiC5/s+jttTP88P719azyc6q0ru49nVpbnQzMl90IMDg0M2x5MG/LNKyvs3Kx8v5zr21xCA8Y29kZT5wczwvY29kZT4gPHN0cm9uZz4ocGxhY2VtZW50IHNlZWQpPC9zdHJvbmc+PC9wPg0KPHByZSBjbGFzcz0="brush:java;">unsigned int ceph_str_hash(int type, const char *s, unsigned int len) { switch (type) { case CEPH_STR_HASH_LINUX: return ceph_str_hash_linux(s, len); case CEPH_STR_HASH_RJENKINS: return ceph_str_hash_rjenkins(s, len); default: return -1; } }

这里写图片描述

然后得到 PGID. 以前我认为pgid 是单一变量,然而不是. PGID 是个包含 poolid 和 ps 的结构体变量.

这里写图片描述

 
          <code  
          class 
          = 
          "language-c++ hljs r" 
          > 
          //pgid  不仅仅是一个数字,还有好多信息 
         
          // placement group id 
         
          struct pg_t  
         
          { 
         
          uint64_t m_pool; 
         
          uint32_t m_seed; 
         
          int32_t m_preferred; 
         
          pg_t() : m_pool( 
          0 
          ), m_seed( 
          0 
          ), m_preferred(- 
          1 
          ) {} 
         
          pg_t(ps_t seed, uint64_t pool,  
          int 
          pref=- 
          1 
          ) : 
         
          ... 
         
          还有很多信息略</code>

crush_do_rule 的输入参数x是什么?让我们继续风雨兼程. 然后在 _pg_to_osds 有一行 ps_t pps = pool.raw_pg_to_pps(pg); //placement ps. pps 就是 x.
PPS 如何计算? 在函数crush_hash32_2(CRUSH_HASH_RJENKINS1,ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask),pg.pool());

 
          <code  
          class 
          = 
          "hljs go" 
          >__u32 crush_hash32_2( 
          int 
          type, __u32 a, __u32 b) 
         
          { 
         
          switch 
          (type) { 
         
          case 
          CRUSH_HASH_RJENKINS1: 
         
          return 
          crush_hash32_rjenkins1_2(a, b); 
         
          default 
          : 
         
          return 
          0 
          ; 
         
          } 
         
          }</code>

这里写图片描述

ps mod pgp_num_mask 的结果(例如 a) 和poolid(例如 b) 进行哈希. 这就是pps,也就是x

这里写图片描述

我们得到第二个过程输入参数的 x.第一阶段过程图如下图

第一阶段过程图

这里写图片描述
P.S. you can find something in PG’s name and object name.

2.2 PGID —> OSD set

首先需要明白几个概念

weight VS reweight
这里写图片描述
这里,“ceph osd crush reweight” 设置了OSD的权重

weight
这个重量为任意值（通常是磁盘的TB大小,1TB设置为1），并且控制系统尝试分配到OSD的数据量。

reweight
reweight将覆盖了weight量。这个值在0到1的范围，并强制CRUSH重新分配数据。它不改变buckets 的权重,并且是CRUSH不正常的情况下的纠正措施。（例如，如果你的OSD中的一个是在90％以上，其余为50％，可以减少权重，进行补偿。）

primary-affinity

主亲和力默认为1（即，一个OSD可以作为主OSD）。primary-affinity变化范围从0到1.其中0意味着在OSD可能用作主OSD,设置为1则可以被用作主OSD.当其<1时，CRUSH将不太可能选择该OSD作为主守护进程。

pg 和 pgp的区别
- PG = Placement Group
- PGP = Placement Group for Placement purpose
- pg_num = number of placement groups mapped to an OSD

当增加每个pool的pg_num数量时,每个PG分裂成半,但他们都保持到它们的父OSD的映射。

直到这个时候，ceph不会启动平衡策略。现在，当你增加同一池中的pgp_num值，PGs启动从父OSD迁移到其他OSD,ceph开始启动平衡策略。这是PGP如何起着重要的作用的过程。

pgp-num是在CRUSH算法中使用的参数,不是 pg-num.例如pg-num = 1024 , pgp-num = 1.所有的1024个PGs都映射到同一个OSD.当您增加PG-NUM是分裂的PG，如果增加PGP-NUM将移动PGs，即改变OSD的map。PG和PGP是很重要的概念。

1 2	`<code` `class` `=` `"hljs cpp"` `>` `void` `do_rule(` `int` `rule,` `int` `x, vector<` `int` `>& out,` `int` `maxout,` `const` `vector<__u32>& weight)` `const` `{}</` `int` `></code>`

在了解了这些概念后开始第二部分
PGID -> OSD set. 现在我们在 do_rule: void do_rule(int rule, int x, vector& out, int maxout, const vector<__u32>& weight)

do_rule源代码

 
          <code  
          class 
          = 
          "hljs cpp" 
          >   
          void 
          do_rule( 
          int 
          rule,  
          int 
          x, vector< 
          int 
          >& out,  
          int 
          maxout, 
         
 
                      
          const 
          vector<__u32>& weight)  
          const 
          { 
         
 
               
          Mutex::Locker l(mapper_lock); 
         
 
               
          int 
          rawout[maxout]; 
         
 
               
          int 
          scratch[maxout *  
          3 
          ]; 
         
 
               
          int 
          numrep = crush_do_rule(crush, rule, x, rawout, maxout, &weight[ 
          0 
          ], weight.size(), scratch); 
         
 
               
          if 
          (numrep <  
          0 
          ) 
         
 
                 
          numrep =  
          0 
          ; 
         
 
               
          out.resize(numrep); 
         
 
               
          for 
          ( 
          int 
          i= 
          0 
          ; i<numrep; code= 
          "" 
          ></numrep;></ 
          int 
          ></code> 
         

让我们看下输入参数 x 就是我们已经得到的 pps rule 就是内存中的crushrule’s number(不是 ruleid, 在我的crushrule set中, this rule’s id是 3), weight 是已经讲过的 reweight 变化范围从1 到 65536. 我们定义了 rawout[maxout] 来存储 OSD set, scratch[maxout * 3] 为计算使用. 然后我们进入了crush_do_rule.

`PGID -> OSDset OUTLINE`

下面要仔细研究3个函数, firstn 意味着副本存储, CRUSH 需要去选择n 个osds存储副本. indep是纠删码存储过程.我们只关注副本存储方法

crush_do_rule: 反复do crushrules crush_choose_firstn: 递归选择特定类型的桶或设备 crush_bucket_choose: 直接选择bucket的子节点

crush_do_rule

首先这是我的crushrule 和集群的层次结构参数

 
          <code  
          class 
          = 
          "hljs cpp" 
          ><code  
          class 
          = 
          "hljs perl" 
          > 
          @map 
          : the crush_map 
         
 
          @ruleno 
          : the rule id 
         
 
          @x 
          : hash input 
         
 
          @result 
          : pointer to result vector 
         
 
          @result_max 
          : maximum result size 
         
 
          @weight 
          : weight vector ( 
          for 
          map leaves) 
         
 
          @weight_max 
          : size of weight vector 
         
 
          @scratch 
          : scratch vector  
          for 
          private 
          use; must be >=  
          3 
          * result_max 
         
 
          int 
          crush_do_rule( 
          const 
          struct crush_map *map, 
         
 
                     
          int 
          ruleno,  
          int 
          x,  
          int 
          *result,  
          int 
          result_max, 
         
 
                     
          const 
          __u32 *weight,  
          int 
          weight_max, 
         
 
                     
          int 
          *scratch)</code></code> 
         

值得说的变量emit 通常用在规则的结束,同时可以被用在在形相同规则下选择不同的树。

scratch[3 * result_max]

1

2

3

 
          <code  
          class 
          = 
          "hljs cpp" 
          ><code  
          class 
          = 
          "hljs perl" 
          > 
          int 
          *a = scratch; 
         
 
          int 
          *b = scratch + result_max; 
         
 
          int 
          *c = scratch + result_max* 
          2 
          ;</code></code> 
         

a, b, c 分别指向 scratch向量的0, 1/3, 2/3的位置. w = a; o = b; - w被用作一个先入先出队列来在CRUSH map中进行横向优先搜索(BFS traversal). - o存储crush_choose_firstn选择的结果. - c存储最终的OSD选择结果. crush_choose_firstn计算后如果结果不是OSD类型, o 交给w.以便于 w成为下次crush_choose_firstn的输入参数. 如上所述, crush_do_rule 反复进行 crushrules 迭代. 你可以在内存中发现规则:

过程步骤

step 1 put root rgw1 in w(enqueue);

step 2 would run crush_choose_firstn to choose 1 rack-type bucket from root rgw1

下面分析crush_choose_firstn过程

`crush_choose_firstn 函数`

这个函数递归的选择特定bucket或者设备,并且可以处理冲突,失败的情况. 如果当前是choose过程,通过调用crush_bucket_choose来直接选择. 如果当前是chooseleaf选择叶子节点的过程,该函数将递归直到得到叶子节点.

`crush_bucket_choose 函数`

crush_bucket_choose是CRUSH最重要的函数.应为默认的bucket类型是straw,常见的情况下我们会使用straw类型bucket,然后就会进入bucket_straw_choose

case进行跳转

1 2	`<code` `class` `=` `"hljs cpp"` `><code` `class` `=` `"hljs cs"` `>` `case` `CRUSH_BUCKET_STRAW:` `return` `bucket_straw_choose((struct crush_bucket_straw *)in,</code></code>`

完整代码

 
          <code  
          class 
          = 
          "hljs cpp" 
          ><code  
          class 
          = 
          "language-c++ hljs cs" 
          > 
          static 
          int 
          crush_bucket_choose(struct crush_bucket *in,  
          int 
          x,  
          int 
          r) 
         
          { 
         
          dprintk( 
          " crush_bucket_choose %d x=%d r=%d\n" 
          , in->id, x, r); 
         
          BUG_ON(in->size ==  
          0 
          ); 
         
          switch 
          (in->alg) { 
         
          case 
          CRUSH_BUCKET_UNIFORM: 
         
          return 
          bucket_uniform_choose((struct crush_bucket_uniform *)in, 
         
          x, r); 
         
          case 
          CRUSH_BUCKET_LIST: 
         
          return 
          bucket_list_choose((struct crush_bucket_list *)in, 
         
          x, r); 
         
          case 
          CRUSH_BUCKET_TREE: 
         
          return 
          bucket_tree_choose((struct crush_bucket_tree *)in, 
         
          x, r); 
         
          case 
          CRUSH_BUCKET_STRAW: 
         
          return 
          bucket_straw_choose((struct crush_bucket_straw *)in, 
         
          x, r); 
         
          case 
          CRUSH_BUCKET_STRAW2: 
         
          return 
          bucket_straw2_choose((struct crush_bucket_straw2 *)in, 
         
          x, r); 
         
          default 
          : 
         
          dprintk( 
          "unknown bucket %d alg %d\n" 
          , in->id, in->alg); 
         
          return 
          in->items[ 
          0 
          ]; 
         
          } 
         
          }</code></code>

 
          <code  
          class 
          = 
          "hljs cpp" 
          ><code  
          class 
          = 
          "language-c++ hljs cs" 
          > 
          /* straw */ 
         
          static 
          int 
          bucket_straw_choose(struct crush_bucket_straw *bucket, 
         
          int 
          x,  
          int 
          r) 
         
          { 
         
          __u32 i; 
         
          int 
          high =  
          0 
          ; 
         
          __u64 high_draw =  
          0 
          ; 
         
          __u64 draw; 
         
          for 
          (i =  
          0 
          ; i < bucket->h.size; i++) { 
         
          draw = crush_hash32_3(bucket->h.hash, x, bucket->h.items[i], r); 
         
          draw &=  
          0xffff 
          ; 
         
          draw *= bucket->straws[i]; 
         
          if 
          (i ==  
          0 
          || draw > high_draw) { 
         
          high = i; 
         
          high_draw = draw; 
         
          } 
         
          } 
         
          return 
          bucket->h.items[high]; 
         
          } 
         
          </code></code>

bucket结构体

1

2

3

4

5

6

 
          <code  
          class 
          = 
          "hljs cpp" 
          ><code  
          class 
          = 
          "language-c++ hljs cs" 
          >struct crush_bucket_straw { 
         
          struct crush_bucket h; 
         
          __u32 *item_weights;    
          /* 16-bit fixed point */ 
         
          __u32 *straws;          
          /* 16-bit fixed point */ 
         
          }; 
         
          </code></code>

可以看到 bucket root rgw1’s id是 -1, type = 10 意味着根节点 alg = 4 意味着 straw 类型. 这里 weight 是 OSD权重 we set scales up by 65536(i.e. 37 * 65536 = 2424832). 然后看下循环 T: 对每个输入的son bucket , 例如升上图rack1, crush_hash32_3 hashes x,bucket id(rack1’s id), r(current selection’s order number), 这3个变量是a uint32_t 类型变量, 结果 & 0xffff, 然后乘straw(rack1’s straw value, straw calculation seen below), 最后得到这个值, 在一次循环中, for one son bucket(rack1 here). 我们在循环中计算每个 son bucket然后选择最大的 . 然后一个son bucket 被选择 . Nice job! 下面是个计算的例子

`调用层次,图表描述`

`结论`

我们已经研究了CRUSH计算中的重要部分,其余的部分就是迭代和递归,直到选择了所有的OSD.

关于 straw 值

详细的代码在 src/crush/builder.c crush_calc_straw. 总之,straw 值总是和OSD权重正相关.straw2正在开发.

Ni259251

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Ceph Crush解析

文章出处：https://www.2cto.com/net/201606/515330.html文章比较深入的写了CRUSH算法的原理和过程.通过调试深入的介绍了CRUSH计算的过程.写在前面读本文前,你需要对ceph的基本操作,pool和CRUSH map非常熟悉.并且较深入的读过源码.分析的方法首先,我们写了个c程序调用librados向pool中写入一个对象.然后使用 GDB(CGDB is...
复制链接

扫一扫