Delay Scheduling: Study Notes

Goal

In a multi-tenant cluster where many users submit jobs concurrently, achieve as much data locality for jobs as possible while disturbing fair scheduling as little as possible.

Why It Matters

The main aspect of MapReduce that complicates scheduling is the need to place tasks near their input data. Locality increases throughput because network bandwidth in a large cluster is far lower than the aggregate bandwidth of the cluster's disks. Running a task on the node that holds its data (node locality) is most efficient; when that is not possible, running it on the same rack (rack locality) is still faster than running it off-rack.

Hadoop's Default Scheduling Policy

Hadoop 1 uses FIFO scheduling by default. FIFO serves jobs one at a time in order of submission: the job at the head of the queue needs some number of map and reduce tasks, and whenever a server node has a free slot it is given to that job until the job finishes. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans the jobs in order of priority and submission time to find one with a task of the required type. For maps, Hadoop applies a locality optimization similar to Google's MapReduce: after choosing a job, the scheduler greedily picks the map task whose data is closest to the node (on the same node if possible, otherwise on the same rack, and finally on a remote rack).
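
The assignment rule above can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's JobTracker code: the Job class, the rack map, and the locality ranking are assumptions made for the example.

```python
# Simplified sketch of the default FIFO policy with greedy map locality.
# Job, locality() and rack_of are illustrative assumptions, not Hadoop code.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    priority: int          # lower value = higher priority
    submit_time: float
    pending_tasks: list = field(default_factory=list)  # each task: set of nodes holding its input block

def locality(task_nodes, node, rack_of):
    if node in task_nodes:
        return 0                                        # node-local
    if rack_of[node] in {rack_of[n] for n in task_nodes}:
        return 1                                        # rack-local
    return 2                                            # off-rack

def fifo_assign(jobs, free_node, rack_of):
    # Scan jobs by priority, then submission time (the FIFO order).
    for job in sorted(jobs, key=lambda j: (j.priority, j.submit_time)):
        if job.pending_tasks:
            # Greedily pick the pending task closest to the free node.
            task = min(job.pending_tasks, key=lambda t: locality(t, free_node, rack_of))
            job.pending_tasks.remove(task)
            return job.name, task
    return None

# Example: two racks, the oldest job has one node-local task on n1.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
j1 = Job("j1", priority=0, submit_time=1.0, pending_tasks=[{"n3"}, {"n1"}])
j2 = Job("j2", priority=0, submit_time=2.0, pending_tasks=[{"n2"}])
print(fifo_assign([j1, j2], "n1", rack_of))   # ('j1', {'n1'}): j1 is first and runs node-local
```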

Design Considerations for Fair Scheduling

Fair scheduling has to answer two questions:

  1. How should resources be reassigned to new jobs?
    Kill vs. wait: fair scheduling chooses to wait.
  2. How should data locality be achieved?
    Two situations in fair scheduling hurt data locality: (1) head-of-line scheduling and (2) sticky slots. Delay scheduling removes both constraints on locality.

Naive Fair Sharing Algorithm

A simple way to share a cluster fairly between jobs is to always assign free slots to the job that has the fewest running tasks. As long as slots become free quickly enough, the resulting allocation will satisfy max-min fairness.
(Figure: Algorithm 1, Naive Fair Sharing)
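
This is the paper's Algorithm 1. A minimal Python sketch of the rule, with illustrative names (Job, pending_tasks) rather than the real Hadoop classes:

```python
# Sketch of naive fair sharing: give the free slot to the job with the fewest
# running tasks, preferring one of its node-local tasks if it has any.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    running: int = 0
    pending_tasks: list = field(default_factory=list)  # each task: set of nodes holding its input block

def naive_fair_assign(jobs, node):
    for job in sorted(jobs, key=lambda j: j.running):        # fewest running tasks first
        if not job.pending_tasks:
            continue
        local = [t for t in job.pending_tasks if node in t]
        task = local[0] if local else job.pending_tasks[0]   # fall back to any task
        job.pending_tasks.remove(task)
        job.running += 1
        return job.name, task
    return None

jobs = [Job("a", running=3, pending_tasks=[{"n2"}]),
        Job("b", running=1, pending_tasks=[{"n5"}, {"n1"}])]
print(naive_fair_assign(jobs, "n1"))   # ('b', {'n1'}): b has fewer running tasks and is node-local here
```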

Locality Problems with Naive Fair Sharing

Naive fair sharing leads to two locality problems: head-of-line scheduling and sticky slots.

Head-of-line Scheduling

The first locality problem occurs in small jobs (jobs that have small input files and hence have a small number of data blocks to read). The problem is that whenever a job reaches the head of the sorted list in Algorithm 1 (i.e. has the fewest running tasks), one of its tasks is launched on the next slot that becomes free, no matter which node this slot is on. If the head-of-line job is small, it is unlikely to have data on the node that is given to it. For example, a job with data on 10% of nodes will only achieve 10% locality.

Sticky Slots

A second locality problem, sticky slots, happens even with large jobs if fair sharing is used. The problem is that there is a tendency for a job to be assigned the same slot repeatedly. For example, suppose that there are 10 jobs in a 100-node cluster with one slot per node, and that each job has 10 running tasks. Suppose job j finishes a task on node n. Node n now requests a new task. At this point, j has 9 running tasks while all the other jobs have 10. Therefore, Algorithm 1 assigns the slot on node n to job j again. Consequently, in steady state, jobs never leave their original slots. This leads to poor data locality because input files are striped across the cluster, so each job needs to run some tasks on each machine.
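
A toy rerun of the example above shows why the slot sticks: the job that just finished a task momentarily has the fewest running tasks, so Algorithm 1 hands the freed slot straight back to it.

```python
# 10 jobs with 10 running tasks each; job3 just finished a task on node n.
running = {f"job{i}": 10 for i in range(10)}
running["job3"] -= 1                           # job3 now has 9 running tasks

head_of_line = min(running, key=running.get)   # Algorithm 1: fewest running tasks wins
print(head_of_line)                            # 'job3': the freed slot goes back to the same job
```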

Delay Scheduling

The problems we presented happen because following a strict queuing order forces a job with no local data to be scheduled. We address them through a simple technique called delay scheduling. When a node requests a task, if the head-of-line job cannot launch a local task, we skip it and look at subsequent jobs. However, if a job has been skipped long enough, we start allowing it to launch non-local tasks, to avoid starvation. The key insight behind delay scheduling is that although the first slot we consider giving to a job is unlikely to have data for it, tasks finish so quickly that some slot with data for it will free up in the next few seconds.
(Figure: Algorithm 2, Fair Sharing with Simple Delay Scheduling)
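
A Python sketch of this simple form of delay scheduling (the paper's Algorithm 2), assuming a fixed skip threshold D; the names D, skipcount and the Job fields are illustrative, not real Hadoop configuration keys.

```python
# Sketch of delay scheduling with a single skip threshold D (illustrative name).
from dataclasses import dataclass, field

D = 3   # how many times a job may be skipped before it is allowed a non-local launch

@dataclass
class Job:
    name: str
    running: int = 0
    skipcount: int = 0
    pending_tasks: list = field(default_factory=list)  # each task: set of nodes holding its input block

def delay_assign(jobs, node):
    for job in sorted(jobs, key=lambda j: j.running):   # fair order: fewest running tasks first
        if not job.pending_tasks:
            continue
        local = [t for t in job.pending_tasks if node in t]
        if local:
            job.skipcount = 0                  # a local launch resets the skip count
            task = local[0]
        elif job.skipcount >= D:
            task = job.pending_tasks[0]        # waited long enough: allow a non-local launch
        else:
            job.skipcount += 1                 # skip this job and let a later job use the slot
            continue
        job.pending_tasks.remove(task)
        job.running += 1
        return job.name, task
    return None

# "small" has no data on n1, so it is skipped while "big" keeps running locally;
# after D skips, "small" is finally allowed to launch a non-local task on n1.
jobs = [Job("small", pending_tasks=[{"n9"}]),
        Job("big", running=5, pending_tasks=[{"n1"}, {"n1"}, {"n1", "n2"}])]
for _ in range(4):
    print(delay_assign(jobs, "n1"))
```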

Rack Locality

Analysis of Delay Scheduling

Long Tasks and Hotspots

To lower the chance that a node fills with long tasks, we can spread long tasks throughout the cluster by changing the locality test in Algorithm 2 to prevent jobs with long tasks from launching tasks on nodes that are running a higher-than-average number of long tasks. Although we do not know which jobs have long tasks in advance, we can treat new jobs as long-task jobs, and mark them as short-task jobs if their tasks finish quickly.
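
One way to read this is as an extra eligibility check in the locality test. The sketch below assumes the scheduler tracks how many long tasks each node currently runs; the average-based threshold and the function name are illustrative, not taken from the paper or Hadoop.

```python
# Illustrative long-task eligibility check: a node already running more long
# tasks than the cluster average does not get another long task.
def eligible_for_long_task(node_long_tasks, all_long_task_counts):
    avg = sum(all_long_task_counts) / len(all_long_task_counts)
    return node_long_tasks <= avg

counts = [4, 1, 0, 3]   # long tasks currently running on each of four nodes
print([eligible_for_long_task(c, counts) for c in counts])   # [False, True, True, False]
```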

Hotspot Replication

Because distributed file systems like HDFS place blocks on random nodes, hotspots are only likely to occur if multiple jobs need to read the same data file, and that file is small enough that copies of its blocks are only present on a small fraction of nodes. In this case, no scheduling algorithm can achieve high locality without excessive queueing delays. Instead, it would be better to dynamically increase the replication level of small hot files.
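
A sketch of that idea, assuming a naive "roughly one replica per concurrent reader" heuristic and shelling out to the standard `hadoop fs -setrep` command; the heuristic, the file path, and the helper name are illustrative, not part of HDFS or the paper.

```python
# Illustrative hotspot replication: raise a small hot file's replication factor
# so its blocks land on more nodes and local slots become easier to find.
import subprocess

def boost_replication(path, concurrent_readers, current_replication,
                      max_replication=10, dry_run=False):
    # Assumed heuristic: aim for roughly one replica per concurrent reader,
    # never lowering replication and capping it at max_replication.
    target = min(max(concurrent_readers, current_replication), max_replication)
    if target > current_replication:
        cmd = ["hadoop", "fs", "-setrep", "-w", str(target), path]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return target

# Example: a small lookup file (hypothetical path) read by 8 jobs at once, currently 3 replicas.
boost_replication("/data/lookup/table.csv", 8, 3, dry_run=True)
```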

Hadoop Fair Scheduler Design

Problems HFS sets out to solve:
1) Some users run far more jobs than others; the fairness we want is per user, not per job.
2) Users want control over their own jobs. For example, a user who submits a batch of jobs may want them executed in FIFO order so that the results come out sequentially.
3) Production jobs must finish within their expected time, even when the cluster is saturated with long-running tasks from other users.
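
These requirements point to a two-level (hierarchical) policy: share fairly across per-user pools, and let each pool order its own jobs with its chosen policy. A hedged Python sketch under those assumptions; the Pool class and fields are illustrative, not the real HFS code or configuration.

```python
# Two-level sketch: fair sharing between pools, pool-chosen policy within a pool.
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    policy: str = "fair"                       # "fifo" or "fair" among the pool's own jobs
    running: int = 0                           # running tasks across the pool's jobs
    jobs: list = field(default_factory=list)   # (submit_time, running_tasks, job_name)

def pick_job(pools):
    # Level 1: fair sharing between pools (fewest running tasks first).
    pool = min((p for p in pools if p.jobs), key=lambda p: p.running, default=None)
    if pool is None:
        return None
    # Level 2: the pool's own policy chooses among its jobs.
    key = (lambda j: j[0]) if pool.policy == "fifo" else (lambda j: j[1])
    return pool.name, min(pool.jobs, key=key)[2]

pools = [Pool("alice", policy="fifo", running=4,
              jobs=[(1.0, 7, "report"), (2.0, 1, "etl")]),
         Pool("bob", policy="fair", running=2,
              jobs=[(3.0, 2, "adhoc-1"), (4.0, 0, "adhoc-2")])]
print(pick_job(pools))
# ('bob', 'adhoc-2'): bob's pool has fewer running tasks, and within it
# the "fair" policy picks the job with the fewest running tasks.
```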
