Morpheus: Towards Automated SLOs for Enterprise Clusters

OSDI 2016

Summary

Fill this in last, after finishing the notes: a short overview of the paper, so future readers of these notes can start here. Note: write the summary through your own thinking and in your own words; do not simply copy-paste the original text.

Research Objective(s)

Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and job performance predictability, respectively coveted by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance by means of dynamic reprovisioning of jobs. We validate these ideas against production traces from a 50k-node cluster, and show that Morpheus can lower the number of deadline violations by 5x to 13x, while retaining cluster utilization and lowering the cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.

Background / Problem Statement

Commercial enterprises, ranging from Fortune 500 companies to venture-capital-funded startups, increasingly rely on multi-tenanted clusters to run their business analytics jobs. These jobs consist of multiple tasks that run on different cluster nodes, where the unit of per-task resource allocation is a container (i.e., a bundle of resources such as CPU, RAM, and disk I/O) on an individual machine. From an analysis of large-scale production workloads, we observe significant variance in job runtimes, which sometimes results in missed deadlines and negative business impact. Users perceive this as an unpredictable execution experience, and it accounts for 25% of user escalations in Microsoft big-data clusters.

Unpredictability comes from several sources, which we roughly group as follows:

  • Sharing-induced – performance variability caused by inconsistent allocation of resources across job runs.
  • Inherent – performance variability due to changes in the job input (size, skew, availability), source code tweaks, and failures. This is endemic even in dedicated and lightly used clusters.

Unpredictability is most noticeable to users who submit periodic jobs (scheduled runs of the same job on newly arriving data). Their recurrent nature prompts users to form an expectation of a job's runtime performance and to react to any deviation from it, particularly if the job is business-critical.

Unfortunately, widely deployed resource managers provide only limited mechanisms for users to cope with the unpredictability of such jobs.

Overview of Morpheus

Morpheus is a system that continuously observes and learns as periodic jobs run over time. The findings are used to economically reserve resources for a job ahead of its execution, and to dynamically adapt to changing conditions at runtime.

Key components

We next give a brief overview of the different components in Morpheus, highlighting the different timescales at which they operate. We start our description with the automatic inference module, which consists of two sub-components.

Extractor of target SLOs. This sub-component operates on the Provenance Graph (PG), a graph capturing all inter-job data dependencies as well as all ingress/egress operations. The SLO extractor leverages the precise timing information stored in the PG to statistically derive a deadline for the periodic job, namely the time at which downstream consumers read the job's output (a sketch of this inference follows below).
Job resource model. This sub-component takes as input detailed telemetry information on job runs. The information includes a time-series of resource utilization (the "skyline"): the amount of resources (represented by containers) used by the job at a certain time granularity, typically one minute. Based on the time-series of multiple runs, the sub-component constructs a tight envelope R* around the typical behavior of the job.
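
To make the deadline extraction concrete, here is a minimal, hypothetical sketch of how a target SLO could be derived from PG timing data. The quantile-based rule, the function name, and the input representation are my assumptions for illustration; the paper only states that the deadline is derived statistically from when downstream consumers read the job's output.

```python
from datetime import timedelta
import numpy as np

def infer_target_slo(read_delays_min: list[float], quantile: float = 0.05) -> timedelta:
    """Derive a target deadline for a periodic job from historical PG timings.

    read_delays_min: for each past instance, the delay (in minutes, relative
    to the job's scheduled start) at which downstream consumers first read
    the job's output. Picking a low quantile yields a deadline that almost
    all historically observed readers would have tolerated.
    """
    return timedelta(minutes=float(np.quantile(read_delays_min, quantile)))

# Example: downstream readers historically arrive 70-90 minutes after the start.
deadline = infer_target_slo([82.0, 75.5, 90.0, 71.0, 88.5])
print(deadline)  # 1:11:54 for this history (interpolated 5th percentile)
```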

Method(s)

The second part of our inference module produces a resource allocation R* that has high fidelity to the actual requirements of a given periodic job j. In a nutshell, Morpheus collects the resource usage patterns of a periodic job over N instances that have run in the past, and solves an LP that "best fits" all patterns.

Formally, the skyline for the i-th instance can be defined by the sequence $s_{i,k}$: the average number of containers used at each time-step k. Using a collection of such sequences as input, the optimization problem outputs the vector $s = (s_1, \dots, s_K)$: the number of containers reserved at each time-step.

Our optimization objective is a cost function that is a linear combination of two terms: one term that penalizes "over-allocation", $A_o(s)$, and another that penalizes "under-allocation", $A_u(s)$, both illustrated in Fig. 5. Formally, we wish to minimize $a A_o(s) + (1-a) A_u(s)$. Next, we describe these terms.

Over-allocation penalty. The over-allocation penalty is defined as the average over-allocation of containers. Formally, the expression $(s_k - s_{i,k})^+ = \max\{s_k - s_{i,k},\, 0\}$ is the instantaneous over-allocation for instance i at time-step k. Accordingly, the over-allocation penalty is given by $A_o(s) = \frac{1}{N \cdot K}\sum_{i=1}^{N}\sum_{k=1}^{K}(s_k - s_{i,k})^+$.

The idea behind choosing these particular forms of penalties is to model, as closely as possible, the usage of allocated resources by the job that requests them. In particular, the over-allocation penalty models the amount of resources that go unused because the job instance does not need them: wasted resources allocated in one time-step cannot be recovered at a later time-step. A shortage of resources at a time-step, however, can be satisfied at a later point in time, assuming the job is elastic.
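
The following is a minimal sketch of this best-fit problem using cvxpy, not the paper's implementation. The over-allocation term follows the average positive-part definition above; the cumulative form of the under-allocation term is an assumption that encodes the elasticity argument (a shortage at one time-step can be served later), and the function and parameter names are mine.

```python
import numpy as np
import cvxpy as cp

def fit_reservation(skylines: np.ndarray, a: float = 0.5) -> np.ndarray:
    """Fit one reservation vector s (containers per time-step) to N skylines.

    skylines: N x K array, where skylines[i, k] is the average number of
    containers used by the i-th past instance during time-step k.
    """
    N, K = skylines.shape
    s = cp.Variable(K, nonneg=True)  # containers reserved at each time-step

    over_terms, under_terms = [], []
    for i in range(N):
        d = skylines[i]
        # Instantaneous over-allocation: reserved-but-unused containers are lost.
        over_terms.append(cp.sum(cp.pos(s - d)))
        # Assumed cumulative under-allocation: demand not yet covered by the
        # cumulative reservation (an elastic job can catch up later).
        under_terms.append(cp.sum(cp.pos(np.cumsum(d) - cp.cumsum(s))))

    A_over = sum(over_terms) / (N * K)
    A_under = sum(under_terms) / (N * K)

    cp.Problem(cp.Minimize(a * A_over + (1 - a) * A_under)).solve()
    return s.value

# Example: three past runs of a job over six time-steps.
history = np.array([
    [2, 4, 6, 6, 3, 1],
    [2, 5, 7, 5, 2, 1],
    [3, 4, 6, 6, 3, 2],
], dtype=float)
print(np.round(fit_reservation(history, a=0.5), 2))
```

Tuning a trades wasted containers against the risk of slowing the job down; the piecewise-linear positive-part terms keep the whole objective expressible as an LP.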

Packing multiple periodic jobs

Periodic reservations

Regardless of the packing algorithm we use, we face a practical challenge: how to reserve resources for the multiple, possibly unbounded, instances of a periodic job. It is inefficient to calculate and store a separate reservation for each instance. To address this challenge, we enforce the constraint that all instances associated with the same periodic job have the same reservation across runs. For example, a daily job that requires 10 containers for one hour and can run anywhere between 10am and 4pm may be forced to execute between noon and 1pm every day. While this design choice might reduce the flexibility of the reservation algorithm, it provides stronger predictability to users and reduces allocation complexity.
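
As a concrete illustration of this constraint, a single reservation record can be shared by every instance of a periodic job. The representation below is a hypothetical sketch (not the Morpheus/YARN reservation API), instantiated for the daily job from the example above.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class PeriodicReservation:
    """One reservation shared by all instances of a periodic job."""
    job_id: str
    period: timedelta        # e.g. 24h for a daily job
    start_offset: timedelta  # offset of the run within each period
    duration: timedelta      # how long the containers are held
    containers: int          # containers reserved for the whole window

# Hypothetical daily job: 10 containers for one hour, pinned to noon every day.
daily_job = PeriodicReservation(
    job_id="periodic-job-42",  # illustrative name
    period=timedelta(days=1),
    start_offset=timedelta(hours=12),
    duration=timedelta(hours=1),
    containers=10,
)
```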

Evaluation

How do the authors evaluate their approach? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or ideas worth borrowing?

Conclusion

What conclusions do the authors draw? Which are strong conclusions, and which are weak ones (i.e., claims mentioned only in the discussion without experimental evidence, or for which the experimental data do not provide sufficient evidence)?

Notes

(optional) Notes that do not fit in the sections above but are worth recording.

References

(optional) List closely related references so that they can be tracked down later.
