Morpheus: Towards Automated SLOs for Enterprise Clusters

{Title} (paper title)
{2016}, {Sangeeth Abdu Jyothi}, {OSDI}

Summary
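Fill this in last, after the notes are written: an overview of the paper, to be read first when revisiting these notes. Note: the summary must come from your own thinking and be written in your own words. Do not copy the original text verbatim (no Ctrl+C).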

Research Objective(s)
Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and jobs' performance predictability, respectively coveted by operators and users.
covet (coveted)
US [ˈkʌvɪtid]
UK [ˈkʌvɪtid]
v. to crave; to long for; to desire (something belonging to someone else)
Web usage: long-dreamed-of; much sought after; enviable

We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g. due to failures) by means of dynamic reprovisioning of jobs.

Background / Problem Statement
Unpredictability comes from several sources, which can roughly be grouped as:

  • Sharing-induced - performance variability caused by inconsistent allocations of resources across job runs
  • Inherent - due to changes in the job input (size, skew, availability), source code tweaks, and failures; this is endemic even in dedicated and lightly used clusters.

Method(s)

  • (a) Data dependencies in the Provenance Graph (PG).
    The PG gathers logs (application logs, filesystem logs, …).
    (b) Resource utilization of each run in a Telemetry-History infrastructure database (TH).

  • (a) From the PG it derives a deadline d, the SLO.
    SLO: the deadline for the periodic job is derived as the time at which downstream consumers read the job's output.
    (b) From the TH, it derives a model of the job resource demand over time, R*.
    R* is a time series of the resources used by the job, sampled every minute.

    We refer to R* as the job resource model.
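A minimal sketch of how a per-minute resource time series could be computed for a single run. The input format (one `(start, end)` pair per container, in seconds since job start) and the function name `skyline` are assumptions made for illustration; the notes do not say how the TH actually stores utilization.

```python
import math
from typing import List, Tuple


def skyline(intervals: List[Tuple[float, float]], step: float = 60.0) -> List[float]:
    """Average number of containers in use during each time-step.

    `intervals` holds one hypothetical (start, end) pair per container, in
    seconds since the job started; `step` is the sampling period (one minute).
    """
    if not intervals:
        return []
    horizon = max(end for _, end in intervals)
    n_steps = math.ceil(horizon / step)
    usage = [0.0] * n_steps
    for start, end in intervals:
        for k in range(int(start // step), n_steps):
            lo = max(start, k * step)
            hi = min(end, (k + 1) * step)
            if hi <= lo:
                break  # this container no longer overlaps later steps
            # Add the fraction of step k during which the container ran.
            usage[k] += (hi - lo) / step
    return usage


# Two containers: one runs for two minutes, one from 0:30 to 1:30.
print(skyline([(0, 120), (30, 90)]))  # [1.5, 1.5]
```

Averaging within each step, rather than taking the peak, matches the "average number of containers used" wording in these notes; a peak-based series would reserve more conservatively.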

  • Morpheus enforces SLOs via recurring reservations:
    (a) It adds a recurring reservation for JobX to the cluster agenda; this sets aside resources over time based on the job resource model R*.

  • Formally, the skyline of the i-th instance is defined by the sequence $\{s_{i,k}\}$, the average number of containers used in each time-step $k$. Using a collection of such sequences as input, the optimization problem outputs the vector $s = (s_1, \ldots, s_K)$, the number of containers reserved at each time-step.
    Our optimization objective is a cost function that is a linear combination of two terms: one that penalizes "over-allocation" and another that penalizes "under-allocation":
    minimize $a \cdot A_0(s) + (1 - a) \cdot A_u(s)$

  • Over-allocation penalty is defined as the average over-allocation of containers.

  • The problem is solved with linear programming.

    (b) New instances of JobX run within the recurring reservation (dedicated resources).
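The trade-off between the two penalties can be illustrated with a small sketch. It rests on a simplifying assumption that is not taken from the paper: the under-allocation penalty is treated as the average per-step shortfall, mirroring the over-allocation penalty. Under that assumption the cost separates by time-step and is piecewise linear, so its minimum lies at one of the observed values, and the sketch searches those values directly instead of calling an LP solver. The function names are hypothetical.

```python
from typing import List, Sequence


def over_allocation(s: Sequence[float], skylines: Sequence[Sequence[float]]) -> float:
    """Average number of reserved but unused containers."""
    total = sum(max(s[k] - sky[k], 0) for sky in skylines for k in range(len(s)))
    return total / (len(skylines) * len(s))


def under_allocation(s: Sequence[float], skylines: Sequence[Sequence[float]]) -> float:
    """Average shortfall of containers (simplified; an assumption of this sketch)."""
    total = sum(max(sky[k] - s[k], 0) for sky in skylines for k in range(len(s)))
    return total / (len(skylines) * len(s))


def cost(s: Sequence[float], skylines: Sequence[Sequence[float]], a: float) -> float:
    """Linear combination a * A_0(s) + (1 - a) * A_u(s)."""
    return a * over_allocation(s, skylines) + (1 - a) * under_allocation(s, skylines)


def fit_reservation(skylines: Sequence[Sequence[float]], a: float) -> List[float]:
    """Pick, for each time-step, the observed value with the lowest cost."""
    n_steps = len(skylines[0])
    reservation = []
    for k in range(n_steps):
        column = [[sky[k]] for sky in skylines]  # one-step skylines for step k
        candidates = sorted({sky[k] for sky in skylines})
        reservation.append(min(candidates, key=lambda c: cost([c], column, a)))
    return reservation


history = [[10, 20, 5], [12, 18, 5], [30, 22, 5]]
print(fit_reservation(history, a=0.5))  # [12, 20, 5]
print(fit_reservation(history, a=0.1))  # [30, 22, 5]
```

With `a = 0.5` the reservation follows the per-step median of past runs; lowering `a` weights under-allocation more heavily and pushes the reservation toward the largest observed demand.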

  1. The Dynamic Reprovisioning component monitors the job's progress online and increases or decreases the reservation to mitigate inherent execution variability.
    Reprovisioning is triggered when a job's resource demand (used containers plus pending asks) exceeds the resources allocated in the predicted skyline.

  2. Morpheus constantly feeds the PG and TH information of new runs back into Step 2 for continuous learning and refinement of the SLO and the job resource model.
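The reprovisioning trigger described in step 1 can be sketched as a simple comparison. The function names and the naive adjustment (raising the current step to the observed demand) are assumptions for illustration only; the notes do not describe how Morpheus sizes the new reservation, and the decrease case is omitted.

```python
from typing import List


def needs_reprovisioning(used: int, pending: int, reserved: int) -> bool:
    """Trigger: demand (used containers plus pending asks) exceeds the reservation."""
    return used + pending > reserved


def reprovision(reservation: List[int], k: int, used: int, pending: int) -> List[int]:
    """Return a copy of the reservation with step k raised to the current demand.

    Naive policy assumed for this sketch; it only handles the increase case.
    """
    updated = list(reservation)
    if needs_reprovisioning(used, pending, reservation[k]):
        updated[k] = used + pending
    return updated


print(reprovision([10, 10, 10], k=1, used=8, pending=5))  # [10, 13, 10]
print(reprovision([10, 10, 10], k=1, used=5, pending=2))  # [10, 10, 10]
```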

Evaluation
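How do the authors evaluate their method? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or anything worth borrowing?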

Conclusion
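What conclusions do the authors draw? Which are strong conclusions and which are weak ones (i.e., the authors provide no experimental evidence and only mention them in the discussion, or the experimental data does not give sufficient evidence)?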

Notes
(optional) Notes that do not fit the sections above but are worth recording.

References
(optional) List highly relevant literature so that it can be tracked later.
