Ceph QoS Design -- dmClock

My personal blog, focused on distributed backend development (any changes will be updated there first): https://www.gtblog.cn/

1 The dmClock Algorithm

dmClock is based on the mClock algorithm; see my other blog post (QoS algorithm mClock) for mClock itself.

On each server, dmClock runs a modified version of mClock; the only change that needs explaining in the distributed model is how tags are assigned. During tag assignment each server needs to know two things: (1) the total amount of service the VM has received from all servers in the system, and (2) how much of that service was completed at the reservation level. The VM's host supplies this information when it sends a request to server Sj, as two integer values, ρi and δi. δi is the total number of IO requests issued by VM Vi and completed on all servers between the previous request from Vi to Sj and the current one (if the previous request from Vi to Sj is Ri-1 and the current one is Ri, then δi counts all of Vi's requests completed anywhere in the cluster between Ri-1 and Ri, since Vi may have sent many requests to other servers in the meantime). ρi is the number of those requests that were served in the constraint-based (reservation) phase between Ri-1 and Ri. The host can maintain this information cheaply and attaches ρi and δi to every request Vi sends to a server. (Note that for a single server ρi and δi are always 1, in which case dmClock reduces to mClock.) In the first stage of dmClock, tag assignment, the request tags are then computed as follows:
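Using the notation of the dmClock paper, with t the current time and r_i, w_i, l_i the reservation, weight and limit of Vi, the tags of Vi's r-th request are:

    R_i^r = max{ R_i^{r-1} + ρ_i / r_i ,  t }
    L_i^r = max{ L_i^{r-1} + δ_i / l_i ,  t }
    P_i^r = max{ P_i^{r-1} + δ_i / w_i ,  t }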

As a result, a request receives a tag further in the future, reflecting how much service VM Vi has already obtained from other servers: the larger δi is, the lower the scheduling priority of the request. Note that this tagging requires no synchronization between the storage servers. The rest of dmClock is identical to mClock. In the worst case ρ and δ may be off by at most one request with respect to the other servers, but dmClock never needs any complex inter-server synchronization.

Note: rho and delta both count completed requests only; requests that have been sent out but not yet answered are not included.

For example:
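An illustrative scenario: suppose VM V1 issues requests to servers S1 and S2. Between its previous request Ri-1 to S1 and the current request Ri, five of V1's requests completed somewhere in the cluster, two of them in the reservation phase. The host therefore sends δ1 = 5 and ρ1 = 2 along with Ri, and S1 advances V1's reservation tag by ρ1/r1 = 2/r1 and its proportion and limit tags by δ1/w1 and δ1/l1, rather than by 1/r1, 1/w1, 1/l1 as single-server mClock would.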

2 The dmClock Implementation in Ceph

Source code structure

The storage-specific issues mentioned in mClock (burst IO, IO size, IO type) are not specifically handled in Ceph.

The dmClock implementation is split into two modules: client and server.

The two most important parameters in the dmClock implementation are ρ and δ.

2.1 Definitions of ρ and δ

  namespace crimson {
  namespace dmclock {


    // The two QoS scheduling phases: reservation-based (constraint) control and
    // proportion-based (priority) control. The phase is returned to the client with
    // the response; the client uses it to tell which phase served the response,
    // which it needs in order to compute rho and delta.
    enum class PhaseType : uint8_t { reservation, priority };
    // parameters carried with each request: ρ and δ
    struct ReqParams {

          uint32_t delta; // δ: total number of replies received since the previous request to this server
          uint32_t rho; // ρ: number of those replies that were served in the reservation phase

          ReqParams(uint32_t _delta, uint32_t _rho) : delta(_delta), rho(_rho)
          {
                assert(rho <= delta); // ρ <= δ
          }

          ReqParams() : ReqParams(0, 0)
          {
                // empty
          }

          ReqParams(const ReqParams& other) : delta(other.delta), rho(other.rho)
          {
                // empty
          }

    }; // class ReqParams
  }
}

2.2 dmclock_Client

The dmClock client side is mainly responsible for tracking the ρ and δ state and attaching ρ and δ to every request the client sends to a server. It is implemented by three classes: OrigTracker, BorrowingTracker, and ServiceTracker.
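A minimal sketch of how the client side drives these classes (the server-id type, the durations and the helper functions below are placeholders of mine, not Ceph code; Ceph's actual usage in the Objecter is shown in section 3.3.1):

    #include <chrono>
    #include <cstdint>
    #include "dmclock_client.h"   // dmclock client-side header; path may differ
    namespace dmc = crimson::dmclock;

    // one tracker per client process; servers are identified here by plain ints
    dmc::ServiceTracker<int> qos_tracker(std::chrono::seconds(60),    // clean_every
                                         std::chrono::seconds(600));  // clean_age

    void send_io(int server_id /*, the request itself ... */) {
      // before sending: get delta/rho for this server and attach them to the request
      dmc::ReqParams rp = qos_tracker.get_req_params(server_id);
      // ... embed rp in the request and send it to server_id ...
    }

    void handle_reply(int server_id, dmc::PhaseType phase, uint32_t cost) {
      // on reply: feed back which phase served the request (and its cost)
      qos_tracker.track_resp(server_id, phase, cost);
    }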

2.2.1 OrigTracker

OrigTracker was recently changed; changelist: https://github.com/ceph/dmclock/commit/effea495839c0371b9288d0ffa79b2e0db459a8a

The author's description of the cost change:

Allow the dmclock library to handle the "cost" of an operation/request. This is in
contrast to just measuring ops. If the cost is set at the value 1, then it's
equivalent to the old code.

The cost is calculated on the server side and affects the tag values
for the request. Furthermore, the cost is then sent back to the
client, so the ServiceTracker can use the cost to properly calculate
delta and rho values.

We now allow the delta and rho values sent with a request to be zero,
since the request cost is included on the server-side tag calculation,
and the request cost must be positive guaranteeing the advancement of
tags.

The OrigTracker has now been updated so as not to add one when
computing delta and rho.

From the author's description, cost is the amount of work a request requires on the server side; it captures how much load a client's request adds to the server. The larger the cost, the more rho and delta grow, so that client's requests end up with lower priority than requests from clients whose operations are cheaper.

The cost parameter used by the *Tracker classes comes from the server side: when Ceph receives a request and inserts it into the op queue, it creates an OpQueueItem and the cost is computed there; that cost is then sent back to the client with the reply.

OrigTracker and the BorrowingTracker described below are the objects that record the ρ and δ state; they are used by ServiceTracker.
Each client keeps a server_map: since a client sends requests to several backend servers, it needs a mapping from each server to its tracker.

    class OrigTracker {
      Counter   delta_prev_req;  // snapshot of the client-wide δ counter taken when the last request was sent to server Sj
      Counter   rho_prev_req; // snapshot of the client-wide ρ counter (reservation phase) taken when the last request was sent to Sj
      uint32_t  my_delta; // δ contributed by this server's replies since the last request to it
      uint32_t  my_rho; // ρ contributed by this server's reservation-phase replies since the last request to it

    public:

      OrigTracker(Counter global_delta, Counter global_rho) : delta_prev_req(global_delta), rho_prev_req(global_rho), my_delta(0), my_rho(0)
      {
        /* empty */ 
      }

      static inline OrigTracker create(Counter the_delta, Counter the_rho) {
            return OrigTracker(the_delta, the_rho);
      }

        // Since ρ and δ describe what the system completed between the previous request and the current one, the saved delta and rho must be refreshed right before each new request is sent.
      inline ReqParams prepare_req(Counter& the_delta, Counter& the_rho) {

            // δ accumulated between this tracker's previous request and the current one; the_delta is the global total kept by the ServiceTracker, the_rho the global total completed in the reservation phase.
            Counter delta_out = the_delta - delta_prev_req - my_delta; 

            // requests completed in the reservation phase between this tracker's previous request and the current one
            Counter rho_out = the_rho - rho_prev_req - my_rho; 

            // record the current global delta snapshot
            delta_prev_req = the_delta;

            // record the current global rho snapshot
            rho_prev_req = the_rho;

            // reset the per-server delta and rho
            my_delta = 0;
            my_rho = 0;

            return ReqParams(uint32_t(delta_out), uint32_t(rho_out));
      }

        // update the ρ and δ state (latest version, which includes the cost factor)
      inline void resp_update(PhaseType phase,Counter& the_delta,Counter& the_rho,Cost cost) {
            the_delta += cost;
            my_delta += cost;
            if (phase == PhaseType::reservation) { // the phase is returned by the server; only reservation-phase replies bump rho
                the_rho += cost; // include the cost factor
                my_rho += cost;
            }
      }

        /********************** previous version of resp_update **********************/
      // inline void resp_update(PhaseType phase, Counter& the_delta, Counter& the_rho) {
      //       ++the_delta;
      //       ++my_delta;
      //       if (phase == PhaseType::reservation) {
      //           ++the_rho;
      //           ++my_rho;
      //       }
      // }
      /******************************************************************************/


      inline Counter get_last_delta() const {
            return delta_prev_req;
      }
    }; // struct OrigTracker

For example:
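A short trace of OrigTracker, calling it directly with small integers (in the real client these calls go through ServiceTracker; the function name and the literal cost of 1 are just for illustration):

    #include "dmclock_client.h"   // dmclock client-side header; path may differ
    namespace dmc = crimson::dmclock;

    void orig_tracker_trace() {
      dmc::Counter delta_counter = 1, rho_counter = 1;  // ServiceTracker starts both at 1
      dmc::OrigTracker s1(delta_counter, rho_counter);  // tracker for server S1

      // a reply from S1, served in the reservation phase, cost 1
      s1.resp_update(dmc::PhaseType::reservation, delta_counter, rho_counter, 1);
      // now delta_counter == 2, rho_counter == 2 (and s1's my_delta == 1, my_rho == 1)

      dmc::OrigTracker s2(delta_counter, rho_counter);  // tracker for server S2
      // a reply from S2, served in the priority phase, cost 1
      s2.resp_update(dmc::PhaseType::priority, delta_counter, rho_counter, 1);
      // now delta_counter == 3, rho_counter == 2

      // next request to S1:
      //   delta = 3 - 1 - 1 = 1  -> one request completed on another server
      //   rho   = 2 - 1 - 1 = 0  -> none of it in the reservation phase
      dmc::ReqParams p = s1.prepare_req(delta_counter, rho_counter);
      (void)p;
    }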

2.2.2 BorrowingTracker

When none of the outstanding requests have come back yet, BorrowingTracker can still produce ρ and δ by borrowing from future responses. BorrowingTracker is relatively simple and does not consider the server-side load: if replies keep failing to return, IO can keep piling up on one server in a vicious circle. Its prepare_req is simpler than OrigTracker's, so I won't walk through it; the code below (and the short trace after it) is self-explanatory.

    class BorrowingTracker {
          Counter delta_prev_req;
          Counter rho_prev_req;
          Counter delta_borrow;
          Counter rho_borrow;

    public:

      BorrowingTracker(Counter global_delta, Counter global_rho) : delta_prev_req(global_delta), rho_prev_req(global_rho), delta_borrow(0), rho_borrow(0)
      {
             /* empty */
      }

      static inline BorrowingTracker create(Counter the_delta, Counter the_rho) {
            return BorrowingTracker(the_delta, the_rho);
      }


      inline Counter calc_with_borrow(const Counter& global, const Counter& previous, Counter& borrow) {

            // number of replies completed since the last snapshot
            Counter result = global - previous;

            // no replies completed since the last request: borrow one from the future and report 1
            if (0 == result) {
              // if no replies have come in, borrow one from the future
              ++borrow;
              return 1;
            }
            // pay back everything borrowed earlier and report what is left
            else if (result > borrow) {
              // if we can give back all of what we borrowed, do so
              result -= borrow;
              borrow = 0;
              return result;
            } 
            // borrowed too much: pay back part of it and report 1 (result <= borrow)
            else {
              // can only return part of what was borrowed in order to
              // return positive
              borrow = borrow - result + 1;
              return 1;
            }
      }

        // build the ReqParams
      inline ReqParams prepare_req(Counter& the_delta, Counter& the_rho) {
            Counter delta_out = calc_with_borrow(the_delta, delta_prev_req, delta_borrow);
            Counter rho_out = calc_with_borrow(the_rho, rho_prev_req, rho_borrow);
            delta_prev_req = the_delta;
            rho_prev_req = the_rho;
            return ReqParams(uint32_t(delta_out), uint32_t(rho_out));
      }

      inline void resp_update(PhaseType phase,Counter& the_delta,Counter& the_rho,Counter cost)           {
            the_delta += cost;
            if (phase == PhaseType::reservation) {
                the_rho += cost;
            }
     }
      /********************** previous version of resp_update **********************/
      // inline void resp_update(PhaseType phase, Counter& the_delta, Counter& the_rho) {
      //     ++the_delta;
      //     if (phase == PhaseType::reservation) {
      //         ++the_rho;
      //     }
      // }
      /*****************************************************************************/

      inline Counter get_last_delta() const {
            return delta_prev_req;
      }
    }; // struct BorrowingTracker
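A short trace of the borrowing logic (the numbers are illustrative): suppose delta_prev_req = 1 and no reply has arrived when the next request has to go out (the_delta is still 1). calc_with_borrow sees result = 0, borrows one from the future (delta_borrow = 1) and reports delta = 1. If three replies then arrive before the following request (the_delta = 4), the next call sees result = 3 > borrow, pays the borrowed reply back (delta_borrow = 0) and reports delta = 2.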

2.2.3 ServiceTracker

ServiceTracker records the client-wide (global) delta and rho counters and manages the tracker associated with each server.

         // S is server identifier type
        // T is the server info class that adheres to ServerTrackerIfc interface
        template<typename S, typename T = OrigTracker> // the latest code defaults to OrigTracker
        class ServiceTracker {

              using TimePoint = decltype(std::chrono::steady_clock::now());
              using Duration = std::chrono::milliseconds;
              using MarkPoint = std::pair<TimePoint,Counter>;

              Counter                 delta_counter; // # reqs completed
              Counter                 rho_counter;   // # reqs completed via reservation
              std::map<S,T>           server_map;    // maps each server in the system to its tracker
              mutable std::mutex      data_mtx;      // protects Counters and map

              using DataGuard = std::lock_guard<decltype(data_mtx)>;

              // clean config

              std::deque<MarkPoint>     clean_mark_points;
              Duration                  clean_age;     // age at which server tracker cleaned

              // NB: All threads declared at end, so they're destructed first!

              std::unique_ptr<RunEvery> cleaning_job;


            public:

              // we have to start the counters at 1, as 0 is used in the
              // cleaning process
              template<typename Rep, typename Per>
              ServiceTracker(std::chrono::duration<Rep,Per> _clean_every,std::chrono::duration<Rep,Per> _clean_age) :delta_counter(1),rho_counter(1),clean_age(std::chrono::duration_cast<Duration>(_clean_age))
              {
                    cleaning_job =std::unique_ptr<RunEvery>(new RunEvery(_clean_every,std::bind(&ServiceTracker::do_clean, this)));
              }

              ...



                // record the reply in this server's tracker, creating the tracker on first contact
              void track_resp(const S& server_id, const PhaseType& phase, const Cost cost) {
                    DataGuard g(data_mtx);

                    auto it = server_map.find(server_id);
                    if (server_map.end() == it) {
                            auto i = server_map.emplace(server_id,
                            T::create(delta_counter, rho_counter));  // emplace constructs the tracker in place in the map
                            it = i.first;
                    }
                    it->second.resp_update(phase, delta_counter, rho_counter, cost);
               }

              ReqParams get_req_params(const S& server) {
                    DataGuard g(data_mtx);
                    auto it = server_map.find(server);
                    if (server_map.end() == it) {
                      server_map.emplace(server,
                                 T::create(delta_counter, rho_counter));
                      return ReqParams(1, 1);
                    } else {
                      return it->second.prepare_req(delta_counter, rho_counter);
                    }
              }

            private:

             // periodically clean up servers that have been inactive for a long time
              void do_clean() {
                    TimePoint now = std::chrono::steady_clock::now();
                    DataGuard g(data_mtx);
                    clean_mark_points.emplace_back(MarkPoint(now, delta_counter));

                    Counter earliest = 0;
                    auto point = clean_mark_points.front();
                    while (point.first <= now - clean_age) {
                      earliest = point.second;
                      clean_mark_points.pop_front();
                      point = clean_mark_points.front();
                    }

                    if (earliest > 0) {
                      for (auto i = server_map.begin();
                           i != server_map.end();
                           /* empty */) {
                        auto i2 = i++;
                        if (i2->second.get_last_delta() <= earliest) {
                          server_map.erase(i2);
                        }
                      }
                    }
              } // do_clean
            }; // class ServiceTracker
      }
    }

2.3 dmclock_Server

On the server side, QoS control happens in two phases: the first is constraint-based (reservation), the second proportion-based (weight).
The server module consists of ClientInfo, RequestTag, PriorityQueueBase, ClientReq, ClientRec, PullPriorityQueue and PushPriorityQueue.

ClientInfo: records the reservation, proportion (weight) and limit settings of a client.

struct ClientInfo {
      double reservation;  // minimum: the configured reservation
      double weight;       // proportional
      double limit;        // maximum

      // multiplicative inverses of above, which we use in calculations
      // and don't want to recalculate repeatedly
      double reservation_inv;
      double weight_inv;
      double limit_inv;
      ...
  }

RequestTag: computes the tags; during the tag-assignment stage it builds the RequestTag for a client's request and assigns its reservation/proportion/limit tags.

struct RequestTag {
  double   reservation;  // reservation tag of the current request Ri
  double   proportion;
  double   limit;
  Cost     cost;
  bool     ready; // true when within limit
  Time     arrival;
    ...   
}


RequestTag(const RequestTag& prev_tag,
         const ClientInfo& client,
         const uint32_t delta,
         const uint32_t rho,
         const Time time,
         const double cost = 0.0,
         const double anticipation_timeout = 0.0) :
         ready(false),
         arrival(time)
  {
        Time max_time = time;
        if (time - anticipation_timeout < prev_tag.arrival)
            max_time -= anticipation_timeout;  // time adjustment; the amount (anticipation_timeout) is configurable

        // tag assignment
        reservation = tag_calc(max_time, prev_tag.reservation,client.reservation_inv,rho,true, cost);
        proportion = tag_calc(max_time,prev_tag.proportion,client.weight_inv,delta,true, cost);
        limit = tag_calc(max_time,prev_tag.limit,client.limit_inv,delta,false, cost);

        assert(reservation < max_tag || proportion < max_tag);
  }

static double tag_calc(const Time time,
         double prev,
         double increment,
         uint32_t dist_req_val,
         bool extreme_is_high,
         double cost) 
 {
        if (0.0 == increment) {
          return extreme_is_high ? max_tag : min_tag;
        } else {
          if (0 != dist_req_val) {
            increment *= (dist_req_val +  cost);
          }
          return std::max(time, prev + increment);
        }
  }

When computing a RequestTag, the time is adjusted first:

  if (time - anticipation_timeout < prev_tag.arrival)

      max_time -= anticipation_timeout;

The reason this anticipation_timeout was added:

See the author's explanation: https://github.com/ceph/ceph/pull/18827

and the author's test example: https://github.com/ceph/dmclock/pull/34

The PR author describes why anticipation_timeout is needed as follows:

Dmclock scheduler supports IOPS reservation.

However, an aggressive worker can take a light worker's reserved shares.
Assume that worker A is a light worker whose IOPS reservation is 100 IOPS and worker B is an aggressive worker whose IOPS reservation is 1 IOPS.
Also assume that worker A generates just 10 IOs in one second, but its IOs do not come in exactly every 10 ms (they arrive with a very small variation).

In this case, worker A couldn't get serviced 10 IOPS, because worker B takes worker A's share.
If worker A's IO is a little late (even 1 ms), worker B's IOs will be processed rather than worker A's, and worker A's IO will be delayed.
This is because dmclock resets the time tag of worker A's IO.

In a normal case, dmclock sets the time tag for an IO based on the previous IO's tag.

However, if an IO arrives more than (1/reserved IOPS) ms later than the previous IO belonging to the same worker, the time tag of the newly arrived IO is reset to the current time.

Setting the anticipation timeout can prevent this situation: the reset will be deferred by the anticipation timeout and the time tag will be set based on the previous IO's tag.

The author's point is this: if, for whatever reason, a client's request reaches the server more than 1/reservation after its previous request, the tag computed on arrival gets reset to the current time. As long as the inter-arrival gap keeps exceeding 1/reservation this reset keeps happening, the client can never accumulate a tag earlier than the current time, and its reserved share gets taken by more aggressive clients. With the anticipation timeout, the reset to the current time is deferred, so the tag is still derived from the previous request's tag.
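A worked example of the adjustment (the numbers are illustrative): take a reservation of 100 IOPS, so reservation_inv = 10 ms, with rho = 1 and cost = 0. The previous request arrived at t = 1000 ms and got reservation tag 1000 ms. The next request arrives at t = 1012 ms, 2 ms later than the 10 ms reservation spacing. Without anticipation, tag_calc returns max(1012, 1000 + 10) = 1012 ms: the tag is reset to the arrival time and the 2 ms of reserved service are lost for good. With anticipation_timeout = 50 ms, the condition time - anticipation_timeout < prev_tag.arrival holds (962 < 1000), so max_time becomes 962 ms and the tag is max(962, 1010) = 1010 ms, still derived from the previous tag.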

A recommended paper that discusses this problem: https://www.usenix.org/system/files/conference/atc13/atc13-shen.pdf


ClientReq: records a request's client_id and its RequestTag. ClientReq and ClientRec are inner classes of PriorityQueueBase.

class ClientReq {
    friend PriorityQueueBase;

    RequestTag tag;
    C          client_id;
    RequestRef request;
    ...
}

ClientRec: records the client id, the previous RequestTag, prop_delta, whether the client is idle, its ClientInfo, the current rho and delta values, and a per-client request queue, requests; all requests of that client are kept in this queue, and ClientRec provides the basic operations on it (add, pop, find, remove, and so on).

class ClientRec {
        friend PriorityQueueBase<C,R,U1,B>;

        C                     client;
        RequestTag            prev_tag;
        std::deque<ClientReq> requests;  // client request queue

        // amount added from the proportion tag as a result of
        // an idle client becoming unidle
        double                prop_delta = 0.0; // adjusts the proportion tag of a client that becomes active again after being idle

        c::IndIntruHeapData   reserv_heap_data {};
        c::IndIntruHeapData   lim_heap_data {};
        c::IndIntruHeapData   ready_heap_data {};
        #if USE_PROP_HEAP
        c::IndIntruHeapData   prop_heap_data {};
        #endif

  public:

        const ClientInfo*     info;
        bool                  idle;
        Counter               last_tick;
        uint32_t              cur_rho;
        uint32_t              cur_delta;

    ...
}

PriorityQueueBase: the base class of PullPriorityQueue and PushPriorityQueue. It defines three min-heaps, resv_heap, limit_heap and ready_heap (the heap fan-out is configurable, e.g. binary or 4-ary), and a client_map that maps client ids to their client queues.

2.3.1 PriorityQueueBase

Each Ceph OSD has a priority queue. The most important operations PriorityQueueBase provides are do_add_request and do_next_request, which perform the tag adjustment and the scheduling.

  • The reservation/proportion/limit min-heaps

            // Note: the heap elements are ClientRec, not requests; ClientCompare picks which tag of the request to compare (here RequestTag::reservation)
           c::IndIntruHeap<ClientRecRef,
                  ClientRec,
                  &ClientRec::reserv_heap_data,
                  ClientCompare<&RequestTag::reservation, // ClientCompare is the comparison functor for heap elements
                        ReadyOption::ignore,
                        false>, // whether ClientRec::prop_delta is added in when comparing
                  B> resv_heap;
          c::IndIntruHeap<ClientRecRef,
                  ClientRec,
                  &ClientRec::lim_heap_data,
                  ClientCompare<&RequestTag::limit,
                        ReadyOption::lowers,
                        false>,
                  B> limit_heap;
          c::IndIntruHeap<ClientRecRef,
                  ClientRec,
                  &ClientRec::ready_heap_data,
                  ClientCompare<&RequestTag::proportion,
                        ReadyOption::raises,
                        true>,    // proportion comparison needs the prop_delta-based adjustment
                  B> ready_heap; // the ready heap drives proportion-based scheduling
    
    
        // the comparator: ClientCompare is a function object
        struct ClientCompare {
            bool operator()(const ClientRec& n1, const ClientRec& n2) const {
              if (n1.has_request()) {
                if (n2.has_request()) {
                  const auto& t1 = n1.next_request().tag;
                  const auto& t2 = n2.next_request().tag;
                  if (ReadyOption::ignore == ready_opt || t1.ready == t2.ready) { // "ready" indicates the tag is within its limit
                        // if we don't care about ready or the ready values are the same
                        if (use_prop_delta) { // for proportion, compare with prop_delta added in
                          return (t1.*tag_field + n1.prop_delta) <
                            (t2.*tag_field + n2.prop_delta);
                        } else {
                          return t1.*tag_field < t2.*tag_field;
                        }
                  } else if (ReadyOption::raises == ready_opt) {
                        // use_ready == true && the ready fields are different
                        return t1.ready;
                  } else {
                        return t2.ready;
                  }
              } else {
                  // n1 has request but n2 does not
                  return true;
              }
            } else if (n2.has_request()) {
                // n2 has request but n1 does not
                return false;
            } else {
                // both have none; keep stable w false
                return false;
            }
        }
    };
    
  • do_add_request

      // data_mtx must be held by caller
      void do_add_request(RequestRef&& request,
              const C& client_id,
              const ReqParams& req_params,
              const Time time,
              const double cost = 0.0) {
            ++tick;
    
    
        ClientRec* temp_client;
    
    
        // check whether this client is already in client_map; if not, create its ClientRec and push it into the three heaps
        auto client_it = client_map.find(client_id);
        if (client_map.end() != client_it) {
          temp_client = &(*client_it->second); // address of obj of shared_ptr
        } else {
          const ClientInfo* info = client_info_f(client_id);
          ClientRecRef client_rec =
            std::make_shared<ClientRec>(client_id, info, tick);
          resv_heap.push(client_rec); // note: push only inserts the ClientRec into the heap's underlying container; the heap adjustment is done separately (push and adjust are split)
    #if USE_PROP_HEAP
          prop_heap.push(client_rec);
    #endif
          limit_heap.push(client_rec);
          ready_heap.push(client_rec);
          client_map[client_id] = client_rec;
          temp_client = &(*client_rec); // address of obj of shared_ptr
        }
    
        // for convenience, we'll create a reference to the shared pointer
        ClientRec& client = *temp_client;
    
        // tag-adjustment step: if the client has been idle, its proportion tag needs adjusting
        if (client.idle) {
          // We need to do an adjustment so that idle clients compete
          // fairly on proportional tags since those tags may have
          // drifted from real-time. Either use the lowest existing
          // proportion tag -- O(1) -- or the client with the lowest
          // previous proportion tag -- O(n) where n = # clients.
          //
          // So we don't have to maintain a propotional queue that
          // keeps the minimum on proportional tag alone (we're
          // instead using a ready queue), we'll have to check each
          // client.
          //
          // The alternative would be to maintain a proportional queue
          // (define USE_PROP_TAG) and do an O(1) operation here.
    
          // Was unable to confirm whether equality testing on
          // std::numeric_limits<double>::max() is guaranteed, so
          // we'll use a compile-time calculated trigger that is one
          // third the max, which should be much larger than any
          // expected organic value.
          constexpr double lowest_prop_tag_trigger =
            std::numeric_limits<double>::max() / 3.0;
    
          double lowest_prop_tag = std::numeric_limits<double>::max();
          for (auto const &c : client_map) {
            // don't use ourselves (or anything else that might be
            // listed as idle) since we're now in the map
            if (!c.second->idle) {
              double p;
              // use either lowest proportion tag or previous proportion tag
              if (c.second->has_request()) {
                p = c.second->next_request().tag.proportion +
                    c.second->prop_delta;
              } else {
                p = c.second->get_req_tag().proportion + c.second->prop_delta;
              }
    
              if (p < lowest_prop_tag) {
                lowest_prop_tag = p;
              }
            }
          }
    
          // if this conditional does not fire, it
          if (lowest_prop_tag < lowest_prop_tag_trigger) {
            client.prop_delta = lowest_prop_tag - time; // adjust tag
          }
          client.idle = false;
        } // if this client was idle
    
    #ifndef DO_NOT_DELAY_TAG_CALC
        RequestTag tag(0, 0, 0, time);
    
        if (!client.has_request()) {
          const ClientInfo* client_info = get_cli_info(client);
          assert(client_info);
          tag = RequestTag(client.get_req_tag(),
                   *client_info,
                   req_params,
                   time,
                   cost,
                   anticipation_timeout);
    
          // copy tag to previous tag for client
          client.update_req_tag(tag, tick);
        }
    #else
        const ClientInfo* client_info = get_cli_info(client);
        assert(client_info);
        RequestTag tag(client.get_req_tag(), // compute the tag for this request
                   *client_info,
                   req_params,
                   time,
                   cost,
                   anticipation_timeout);
    
        // copy tag to previous tag for client
        client.update_req_tag(tag, tick);
    #endif
    
        client.add_request(tag, client.client, std::move(request)); // append the tagged request to the ClientRec's deque
        if (1 == client.requests.size()) {
          // NB: can the following 4 calls to adjust be changed
          // promote? Can adding a request ever demote a client in the
          // heaps?
          resv_heap.adjust(client);
          limit_heap.adjust(client);
          ready_heap.adjust(client);
    #if USE_PROP_HEAP
          prop_heap.adjust(client);
    #endif
        }
    
        client.cur_rho = req_params.rho;
        client.cur_delta = req_params.delta;
    
        resv_heap.adjust(client);  // heap adjustment; the comparator (ClientCompare) looks at the tag of the client's next queued request
        limit_heap.adjust(client);
        ready_heap.adjust(client);
    #if USE_PROP_HEAP
        prop_heap.adjust(client);
    #endif
    } // add_request
    
  • do_next_request

      // data_mtx should be held when called
      NextReq do_next_request(Time now) {
            // if reservation queue is empty, all are empty (i.e., no
            // active clients)
            if(resv_heap.empty()) {
              return NextReq::none();
            }
    
        // try constraint (reservation) based scheduling
    
        auto& reserv = resv_heap.top();
        if (reserv.has_request() &&
            reserv.next_request().tag.reservation <= now) { // note: reservation scheduling is chosen by comparing the tag with the current time
          return NextReq(HeapId::reservation);
        }
    
        // no existing reservations before now, so try weight-based scheduling
    
        // all items that are within limit are eligible based on priority
        auto limits = &limit_heap.top();
        while (limits->has_request() &&
               !limits->next_request().tag.ready &&
               limits->next_request().tag.limit <= now) {
          limits->next_request().tag.ready = true;
          ready_heap.promote(*limits);
          limit_heap.demote(*limits);
    
          limits = &limit_heap.top();
        }
    
        auto& readys = ready_heap.top();
        if (readys.has_request() &&
            readys.next_request().tag.ready &&
            readys.next_request().tag.proportion < max_tag) {
          return NextReq(HeapId::ready);
        }
    
        // if nothing is schedulable by reservation or
        // proportion/weight, and if we allow limit break, try to
        // schedule something with the lowest proportion tag or
        // alternatively lowest reservation tag.
        if (allow_limit_break) {
          if (readys.has_request() &&
              readys.next_request().tag.proportion < max_tag) {
            return NextReq(HeapId::ready);
          } else if (reserv.has_request() &&
                 reserv.next_request().tag.reservation < max_tag) {
            return NextReq(HeapId::reservation);
          }
        }
    
        // nothing scheduled; make sure we re-run when next
        // reservation item or next limited item comes up
    
        Time next_call = TimeMax;
        if (resv_heap.top().has_request()) {
          next_call =
            min_not_0_time(next_call,
                   resv_heap.top().next_request().tag.reservation);
        }
        if (limit_heap.top().has_request()) {
          const auto& next = limit_heap.top().next_request();
          assert(!next.tag.ready || max_tag == next.tag.proportion);
          next_call = min_not_0_time(next_call, next.tag.limit);
        }
        if (next_call < TimeMax) {
          return NextReq(next_call);
        } else {
          return NextReq::none();
        }
     } // do_next_request
    

2.3.2 PullPriorityQueue

PullPriorityQueue exposes the request-level operations to its users; the two most important are add_request and pull_request.

  • add_request

      // internally calls the PriorityQueueBase interface
      void add_request(typename super::RequestRef&& request,
               const C&                     client_id,
               const ReqParams&             req_params,
               const Time                   time,
               double                       addl_cost = 0.0) 
      {
            typename super::DataGuard g(this->data_mtx);
        #ifdef PROFILE
            add_request_timer.start();
        #endif
            super::do_add_request(std::move(request),
                          client_id,
                          req_params,
                          time,
                          addl_cost);
            // no call to schedule_request for pull version
        #ifdef PROFILE
            add_request_timer.stop();
        #endif
        }
    
  • pull_request

      // pull_request calls PriorityQueueBase::do_next_request to do the scheduling
      inline PullReq pull_request() {
            return pull_request(get_time());
      }
    
    
      PullReq pull_request(Time now) {
            PullReq result;
            typename super::DataGuard g(this->data_mtx);
        #ifdef PROFILE
            pull_request_timer.start();
        #endif
    
            typename super::NextReq next = super::do_next_request(now);
            result.type = next.type;
            switch(next.type) {
            case super::NextReqType::none:
              return result;
            case super::NextReqType::future:
              result.data = next.when_ready;
              return result;
            case super::NextReqType::returning:
              // to avoid nesting, break out and let code below handle this case
              break;
            default:
              assert(false);
            }
    
            // we'll only get here if we're returning an entry
    
            auto process_f =
              [&] (PullReq& pull_result, PhaseType phase) ->
              std::function<void(const C&,
                         typename super::RequestRef&)> {
              return [&pull_result, phase](const C& client,
                               typename super::RequestRef& request) {
                pull_result.data =
                typename PullReq::Retn{client, std::move(request), phase};
              };
            };
    
            // handle the request according to the phase it was scheduled in
            switch(next.heap_id) {
                case super::HeapId::reservation:
                  super::pop_process_request(this->resv_heap,
                                 process_f(result, PhaseType::reservation));
                  ++this->reserv_sched_count;
                  break;
                case super::HeapId::ready:
                  super::pop_process_request(this->ready_heap,
                                 process_f(result, PhaseType::priority));
                  { // need to use retn temporarily
                    auto& retn = boost::get<typename PullReq::Retn>(result.data);
                    super::reduce_reservation_tags(retn.client);
                  }
                  ++this->prop_sched_count;
                  break;
                default:
                  assert(false);
            }
    
        #ifdef PROFILE
            pull_request_timer.stop();
        #endif
            return result;
      } // pull_request
    

2.4 Client-Server Interaction
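In outline: before sending a request to an OSD, the client asks its ServiceTracker for the ReqParams (delta, rho) of that server and attaches them to the request; the server computes the request's cost, assigns its reservation/proportion/limit tags from the previous tags, delta, rho and cost, and schedules it through the reservation and ready heaps; the reply carries back the phase (reservation or priority) that served the request, plus the cost; the client then calls track_resp with that phase and cost, so that the delta and rho attached to its next request stay accurate.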

3 How dmClock Is Used in Ceph

3.1 Ceph QoS Units [1]

  • QoS Units
  1. an rbd image

  2. a group of objects (pool)

  3. a directory on a filesystem

  4. a client or subset of clients

  5. an application (and how would you define an application?)

  6. a dataset

  7. universal set

At this stage Ceph implements units 1 and 2.

  • Requirements for implementing distributed mclock in Ceph

A. Each request must have its own unique identifier for each client unit

B. Ceph Cluster must store permanently QoS control corresponding to a unique identifier

C. OSD must be able to find the QoS control from the unique identifier of the arriving request

  • Space where Ceph cluster can store specific information
  1. Cluster Map (Monitor, OSD, PG, CRUSH, MDS)

  2. Header Type Object

3.2 The Current dmClock Implementation Framework in Ceph

3.3 The dmClock Source Implementation in Ceph

Ceph does placement with the CRUSH algorithm (CephFS is not yet mature and is left out here) and has no central master node. In a system with a central node, QoS can be enforced there; in Ceph, QoS is pushed down to the OSDs.

Two Ceph pull requests are worth studying; they implement mClock and dmClock respectively:

  1. dmclock: Delivery of the dmclock delta, rho and phase parameter + Enabling the client service tracker

  2. osd/PG: Add two new mClock implementations of the PG sharded operator queue

3.3.1 The Current dmClock Client Implementation in Ceph

The flow of sending an IO request in Ceph is as follows:

The main responsibility of the dmClock client is to carry delta, rho and phase (the scheduling phase) in the PGQueueable, OpRequest, MOSDOp and MOSDOpReply message classes.

The Objecter in osdc holds a ServiceTracker instance (the ServiceTracker is the dmClock client object).
Objecter::_send_op:

// _send_op wraps the request into a MOSDOp; before it is sent, the request's delta and rho are attached
void Objecter::_send_op(Op *op)
{
  ...

  assert(op->tid > 0);
  MOSDOp *m = _prepare_osd_op(op); // _prepare_osd_op attaches delta and rho

  ...
}

MOSDOp *Objecter::_prepare_osd_op(Op *op)
{
...
  if (op->priority)
    m->set_priority(op->priority);
  else
    m->set_priority(cct->_conf->osd_client_op_priority);

  if (op->reqid != osd_reqid_t()) {
    m->set_reqid(op->reqid);
  }

  if (mclock_service_tracker) { // if the service tracker is enabled
    dmc::ReqParams rp = qos_trk->get_req_params(op->target.osd);
    m->set_qos_params(rp); // set the MOSDOp's qos_params
  }
...
  return m;
}

When the Objecter receives a reply, handle_osd_op_reply updates the ServiceTracker (track_resp updates delta and rho):

bool Objecter::ms_dispatch(Message *m)
{
  ldout(cct, 10) << __func__ << " " << cct << " " << *m << dendl;
  switch (m->get_type()) {
    // these we exlusively handle
  case CEPH_MSG_OSD_OPREPLY:  // an OSD reply: handled by handle_osd_op_reply
    handle_osd_op_reply(static_cast<MOSDOpReply*>(m));
    return true;

  case CEPH_MSG_OSD_BACKOFF:
    handle_osd_backoff(static_cast<MOSDBackoff*>(m));
    return true;

  case CEPH_MSG_WATCH_NOTIFY:
    handle_watch_notify(static_cast<MWatchNotify*>(m));
    m->put();
    return true;

...
}

/* This function DOES put the passed message before returning */
void Objecter::handle_osd_op_reply(MOSDOpReply *m)
{

  ...
  if (mclock_service_tracker) { // if the service tracker is enabled
    qos_trk->track_resp(op->target.osd, m->get_qos_resp()); // feed the reply's QoS response (phase) back into the tracker
  }
  ldout(cct, 15) << "handle_osd_op_reply completed tid " << tid << dendl;
  _finish_op(op, 0); 

  ...
 }

3.3.2 The dmClock Server Implementation in Ceph

The dmClock server lives in the OSD. On the OSD side all messages are dispatched through the op queue; for performance reasons the op queue is currently sharded, and each shard is shared by several threads, which effectively means multiple queues. This makes QoS control on the OSD side difficult, so for now the OSD shard count must be set to 1, giving the OSD a single op queue to work from [2].

On the server side, the ShardedThreadPool processes the op queue. The op queue is an instance of the OSD's mClockClientQueue class, which wraps dmClock's priority queue; through enqueue and dequeue it ends up calling the priority queue's do_add_request and do_next_request, where the IOPS scheduling happens.

void ShardedThreadPool::shardedthreadpool_worker(uint32_t thread_index)
{
   while (!stop_threads) {
    if (pause_threads) {
     ...

    cct->get_heartbeat_map()->reset_timeout(
      hb,
      wq->timeout_interval, wq->suicide_interval);
      wq->_process(thread_index, hb);  // calls OSD::ShardedOpWQ::_process

  }

  ldout(cct,10) << "sharded worker finish" << dendl;

  cct->get_heartbeat_map()->remove_worker(hb);

}


void OSD::ShardedOpWQ::_process(uint32_t thread_index, heartbeat_handle_d *hb)
{
  ...
  OpQueueItem item = sdata->pqueue->dequeue(); // take one IO request from the op queue for processing
  if (osd->is_stopping()) {
    sdata->sdata_op_ordering_lock.Unlock();
    return;    // OSD shutdown, discard.
  }
  ...
}

// enqueue_distributed / dequeue_distributed as wrapped by mClockClientQueue

    // Enqueue op in the back of the regular queue
  inline void mClockClientQueue::enqueue(Client cl,
                     unsigned priority,
                     unsigned cost,
                     Request&& item) {
    auto qos_params = item.get_qos_params();
    queue.enqueue_distributed(get_inner_client(cl, item), priority, cost,
                  std::move(item), qos_params);
  }


// Return an op to be dispatched
  inline Request mClockClientQueue::dequeue() {
    std::pair<Request, dmc::PhaseType> retn = queue.dequeue_distributed();

    if (boost::optional<OpRequestRef> _op = retn.first.maybe_get_op()) {
      (*_op)->qos_resp = retn.second;
    }
    return std::move(retn.first);
  }




Retn dequeue_distributed() {
  assert(!empty());
  dmc::PhaseType resp_params = dmc::PhaseType();

  ...
  auto pr = queue.pull_request();  // ends up calling do_next_request
  assert(pr.is_retn());
  auto& retn = pr.get_retn();
  resp_params = retn.phase;
  return std::make_pair(std::move(*(retn.request)), resp_params);
}

void enqueue_distributed(K cl, unsigned priority, unsigned cost, T&& item,
             const dmc::ReqParams& req_params) {
  // priority is ignored
  queue.add_request(std::move(item), cl, req_params, cost); // ends up calling do_add_request
}

4 Ceph dmClock Simulation

servicetracker = OrigTracker

4.1 anticipation_timeout simulation

Toggle anticipation timeout on/off

config:

result:

4.2 idle on/off

config:

result:

4.3 weight

config:

result:

4.4 Cost

config:

result:

5 QoS Design for Distributed Storage

QoS design is a key part of guaranteeing user experience in distributed storage. Cloud vendors all try to provide QoS guarantees: AWS, Alibaba and Azure offer QoS guarantees (IOPS + bandwidth) on their cloud storage services, but only AWS commits to its QoS guarantee at a 99.9% rate.

The targets of QoS in distributed storage include IOPS, bandwidth and so on; commonly used QoS algorithms include mClock, its distributed version dmClock, the token bucket and the leaky bucket.

mClock does not fix what it is applied to; it can be used for both IOPS and bandwidth guarantees. Since IOPS deals with individual IO requests while bandwidth deals with data blocks of different sizes, some adjustment is needed when it is used for bandwidth guarantees; one way to do that is sketched below.
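A minimal sketch of one possible adjustment (an illustration of the idea, not Ceph's implementation; the 4 KiB unit is an assumption): derive the per-request cost from the IO size, so that the dmClock tags advance in proportion to bytes transferred rather than request count.

    #include <cstdint>

    // Illustrative only: map an IO size to a dmClock cost so that a 1 MiB write
    // advances the tags much further than a 4 KiB write does.
    inline uint32_t bandwidth_cost(uint64_t io_size_bytes) {
      constexpr uint64_t cost_unit = 4096;                          // assumed 4 KiB per cost unit
      return static_cast<uint32_t>(1 + io_size_bytes / cost_unit);  // always at least 1
    }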



References:

  1. Implementing distributed mclock in Ceph
  2. Ceph QoS: current community progress (Ceph QoS 目前社区进展)
  3. Ceph read/write flow (ceph 读写流程)
  4. Ceph设计原理与实现 (Ceph: Design Principles and Implementation)