Slide 1
Scheduling for Grid Computing
Slide 2
References
Fangpeng Dong and Selim G. Akl: Scheduling Algorithms for Grid Computing: State of the Art and Open Problems
Yanmin Zhu: A Survey on Grid Scheduling Systems
Peter Gradwell: Overview of Grid Scheduling Systems
Alain Andrieux et al.: Open Issues in Grid Scheduling
Jia Yu and Rajkumar Buyya: A Taxonomy of Workflow Systems for Grid Computing
Slide 3
Grid
[Figure labels: Internet, E-mail]
Ian Foster, 1998: The Grid: Blueprint for a New Computing Infrastructure.
Slide 4
The Definition of Grid
A type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed, autonomous, and heterogeneous resources dynamically at runtime, depending on their availability, capability, performance, cost, and users' quality-of-service requirements.
Slide 5
Characteristics of Grid Computing
Exploiting underutilized resources
Distributed supercomputing capability
Virtual organization for collaboration
Resource balancing
Reliability
Slide 6
Classes of Grid Computing
By function: Computing Grid, Data Grid, Service Grid
By size: IntraGrid, ExtraGrid, InterGrid
Slide 7
Traditional Parallel Scheduling Systems
Systems: SMP, Cluster, CC-NUMA (SGI)
Scheduling systems: OpenPBS, LSF, SGE, LoadLeveler, Condor, etc.
Slide 8
Slide 9
Cluster Scheduling
Slide 10
The Assumptions Underlying Traditional Systems
All resources reside within a single administrative domain.
To provide a single system image, the scheduler controls all of the resources.
The resource pool is invariant.
Contention caused by incoming applications can be managed by the scheduler according to some policies, so that its impact on the performance that the site can provide to each application can be well predicted.
Computations and their data reside in the same site, or data staging is a highly predictable process, usually from a predetermined source to a predetermined destination, which can be viewed as a constant overhead.
Slide 11
Characteristics of Cluster Scheduling
Homogeneity of resources and applications
Dedicated resources
Centralized scheduling architecture
High-speed interconnection network
Monotonic performance goal
Slide 12
Slide 13
The Terms of Grid Scheduling
A task is an atomic unit to be scheduled by the scheduler and assigned to a resource.
The properties of a task are parameters such as CPU/memory requirement, deadline, priority, etc.
A job (or metatask, or application) is a set of atomic tasks that will be carried out on a set of resources. A job can have a recursive structure, meaning that jobs are composed of sub-jobs and/or tasks, and sub-jobs can themselves be decomposed further into atomic tasks.
A resource is something that is required to carry out an operation, for example a processor for data processing, a data storage device, or a network link for data transport.
A site (or node) is an autonomous entity composed of one or multiple resources.
Task scheduling is the mapping of tasks to a selected group of resources, which may be distributed across multiple administrative domains.
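The terms above can be sketched as a small data model. This is only an illustration: the class names, field names, and units (`cpu_cores`, `memory_mb`, `domain`, and so on) are assumptions made for the example and do not come from the slides or from any Grid toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Task:
    """Atomic unit of scheduling, carrying the properties a scheduler inspects."""
    name: str
    cpu_cores: int = 1                 # assumed unit: number of cores
    memory_mb: int = 0                 # assumed unit: megabytes
    deadline: Optional[float] = None   # None means "no deadline"
    priority: int = 0


@dataclass
class Job:
    """A set of tasks; parts may be sub-jobs, which gives the recursive structure."""
    name: str
    parts: List[Union["Job", Task]] = field(default_factory=list)

    def atomic_tasks(self) -> List[Task]:
        """Flatten the job tree into its atomic tasks, depth first."""
        tasks: List[Task] = []
        for part in self.parts:
            if isinstance(part, Job):
                tasks.extend(part.atomic_tasks())
            else:
                tasks.append(part)
        return tasks


@dataclass
class Resource:
    """Something required to carry out an operation."""
    name: str
    kind: str  # e.g. "processor", "storage", "network"


@dataclass
class Site:
    """Autonomous entity composed of one or more resources."""
    name: str
    domain: str  # hypothetical label for the administrative domain
    resources: List[Resource] = field(default_factory=list)


# A job with one direct task and one sub-job holding two more tasks.
job = Job("analysis", [Task("fetch"), Job("compute", [Task("fft"), Task("plot")])])
print([t.name for t in job.atomic_tasks()])
```

A scheduling decision in this model would then be a mapping from each `Task` returned by `atomic_tasks()` to a `Resource` at some `Site`.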
Slide 14
Three Stages of the Scheduling Process
Resource discovering and filtering
Resource selecting and scheduling according to certain objectives
Job submission
Slide 15
Stages of SuperScheduling
Resource Discovery
  Authorization filtering
  Application requirement definition
  Minimal requirement filtering
System Selection
  Gathering information (query)
  Select the system(s) to run on
Run Job
  (optional) Make an advance reservation
  Submit job to resources
  Preparation tasks
  Monitor progress (maybe go back to System Selection)
  Find out the job is done
  Completion tasks
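The three phases can be walked through with a toy example. Everything here is an assumption for illustration: the site records, the field names, the "least loaded site wins" selection rule, and the string steps returned by `run_job` are stand-ins, not the behavior of any real superscheduler.

```python
from typing import Dict, List, Optional

# Hypothetical site records; a real scheduler would query an information
# service instead of using a hard-coded list.
SITES: List[Dict] = [
    {"name": "siteA", "authorized": True, "cores": 16, "memory_gb": 64, "load": 0.7},
    {"name": "siteB", "authorized": False, "cores": 64, "memory_gb": 256, "load": 0.1},
    {"name": "siteC", "authorized": True, "cores": 32, "memory_gb": 128, "load": 0.2},
    {"name": "siteD", "authorized": True, "cores": 4, "memory_gb": 8, "load": 0.0},
]


def resource_discovery(sites: List[Dict], min_cores: int, min_memory_gb: int) -> List[Dict]:
    """Phase 1: authorization filtering, then minimal requirement filtering."""
    authorized = [s for s in sites if s["authorized"]]
    return [s for s in authorized
            if s["cores"] >= min_cores and s["memory_gb"] >= min_memory_gb]


def system_selection(candidates: List[Dict]) -> Optional[Dict]:
    """Phase 2: pick one system; here simply the least loaded candidate."""
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["load"])


def run_job(job_name: str, site: Dict, reserve: bool = False) -> List[str]:
    """Phase 3: list the run steps in order (placeholders, nothing is executed)."""
    steps = []
    if reserve:
        steps.append(f"reserve {site['name']}")  # optional advance reservation
    steps += [f"submit {job_name} to {site['name']}",
              "preparation tasks", "monitor progress", "completion tasks"]
    return steps


candidates = resource_discovery(SITES, min_cores=8, min_memory_gb=32)
chosen = system_selection(candidates)
if chosen is not None:
    print(run_job("job1", chosen))
```

In this sketch siteB is dropped by authorization filtering and siteD by the minimal requirement filter, so selection only ever compares sites that could actually run the job.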
Slide 16
Grid Scheduling Framework
Application Model: extracts the characteristics of the applications to be scheduled.
Resource Model: describes the characteristics of the underlying resources in Grid systems.
Performance Model: responsible for predicting the behavior of a specific job on a specific computation resource.
Scheduling Policy: responsible for deciding how applications should be executed and how resources should be utilized.
Slide 17
Application Classification
Batch vs. interactive
Real-time vs. non-real-time
Priority
Slide 18
Resource Classification
Time-shared vs. non-time-shared
Dedicated vs. non-dedicated
Preemptive vs. non-preemptive
Slide 19
Performance Estimation
Simulation
Analytical modeling
Historical data
On-line learning
Hybrid
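As a rough illustration of two of these approaches, the sketch below estimates a run time from historical data (a plain average of past runs) and then refines it on-line as new observations arrive (an exponential moving average). Both formulas and the smoothing factor `alpha` are assumptions chosen for the example; the slides do not prescribe a particular estimator.

```python
from statistics import mean
from typing import List


def estimate_from_history(past_runtimes: List[float]) -> float:
    """Historical data: predict the next run time as the mean of past runs."""
    return mean(past_runtimes)


def online_update(estimate: float, observed: float, alpha: float = 0.5) -> float:
    """On-line learning: blend the old estimate with a newly observed run time.

    alpha is an assumed smoothing factor; higher values react faster to change.
    """
    return (1 - alpha) * estimate + alpha * observed


estimate = estimate_from_history([100.0, 120.0, 110.0])
estimate = online_update(estimate, observed=130.0)
print(estimate)
```

Using the historical mean as the starting point and the moving average for corrections is one simple way to combine approaches, in the spirit of the hybrid category.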
Slide 20
Scheduling Policy
Application-centric
  Execution Time: the time spent executing the job.
  Wait Time: the time spent waiting in the ready queue.
  Speedup: the ratio of the time spent executing the job on the original platform to the time spent executing the job on the Grid.
  Turnaround Time: also called response time; defined as the sum of waiting time and executing time.
  Job Slowdown: the ratio of the response time of a job to its actual run time.
System-centric
  Throughput: the number of jobs completed in one unit of time, such as per hour or per day.
  Utilization: the percentage of time a resource is busy.
  Flow Time: the flow time of a set of jobs is the sum of the completion times of all jobs.
  Average application performance.
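The definitions above translate directly into arithmetic on three timestamps per job. The sketch below assumes each job is recorded as (submit, start, finish) on a common clock starting at 0, and that utilization is measured for a single resource running one job at a time; the record layout and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    submit: float  # time the job entered the ready queue
    start: float   # time execution began
    finish: float  # time execution completed

    @property
    def wait_time(self) -> float:
        return self.start - self.submit

    @property
    def execution_time(self) -> float:
        return self.finish - self.start

    @property
    def turnaround_time(self) -> float:
        # Response time: waiting time plus executing time.
        return self.wait_time + self.execution_time

    @property
    def slowdown(self) -> float:
        # Response time divided by actual run time.
        return self.turnaround_time / self.execution_time


def speedup(time_on_original: float, time_on_grid: float) -> float:
    return time_on_original / time_on_grid


def throughput(jobs: List[JobRecord], window: float) -> float:
    """Jobs completed per unit of time over an observation window."""
    return len(jobs) / window


def utilization(jobs: List[JobRecord], window: float) -> float:
    """Percentage of the window a single resource was busy (jobs do not overlap)."""
    return 100.0 * sum(j.execution_time for j in jobs) / window


def flow_time(jobs: List[JobRecord]) -> float:
    """Sum of the completion times of all jobs."""
    return sum(j.finish for j in jobs)


jobs = [JobRecord(submit=0, start=2, finish=6), JobRecord(submit=1, start=6, finish=8)]
print([j.slowdown for j in jobs], utilization(jobs, window=10), flow_time(jobs))
```

The second job shows why slowdown is a useful complement to turnaround time: it runs for only 2 time units but waits 5, so its slowdown is much larger than that of the first job.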
Slide 21
Scheduling Strategy
Performance-driven
Market-driven
Trust-driven
  Security policy
  Accumulated reputation
  Self-defense capability
  Attack history
  Site vulnerability
Slide 22
A Logical Grid Scheduling Architecture
Broken lines: resource or application information flows
Solid lines: task or task scheduling command flows
Slide 23
Grid Scheduler
The Grid Scheduler (GS) receives applications from Grid users, selects feasible resources for these applications according to information acquired from the Grid Information Service module, and finally generates application-to-resource mappings based on certain objective functions and predicted resource performance.
A GS usually cannot control Grid resources directly, but works like a broker or agent.
Also called Metascheduler or SuperScheduler.
It is not an indispensable component of the Grid infrastructure, and it is not included in the Globus Toolkit.
In reality, multiple such schedulers might be deployed and organized into different structures (centralized, hierarchical, or decentralized) according to different concerns, such as performance or scalability.
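To make "application-to-resource mappings based on certain objective functions and predicted resource performance" concrete, here is a minimal sketch using one simple greedy rule: assign each task, in order, to the resource with the earliest predicted completion time. The rule, the task sizes, and the resource speeds are assumptions picked for illustration; the slides do not name a specific algorithm here.

```python
from typing import Dict, Tuple


def map_tasks(task_sizes: Dict[str, float],
              speeds: Dict[str, float]) -> Tuple[Dict[str, str], float]:
    """Greedy mapping by minimum predicted completion time.

    task_sizes: work per task in abstract units (hypothetical values).
    speeds: predicted work units per second for each resource; assumes at
            least one resource is given.
    Returns the task-to-resource mapping and the resulting makespan.
    """
    ready_at = {resource: 0.0 for resource in speeds}  # when each resource is free
    mapping: Dict[str, str] = {}
    for task, size in task_sizes.items():
        # Objective function: predicted completion time of this task on each resource.
        best = min(speeds, key=lambda r: ready_at[r] + size / speeds[r])
        ready_at[best] += size / speeds[best]
        mapping[task] = best
    return mapping, max(ready_at.values())


mapping, makespan = map_tasks({"t1": 10, "t2": 8, "t3": 10},
                              {"fast": 2.0, "slow": 1.0})
print(mapping, makespan)
```

Because a Grid Scheduler acts as a broker, the output here is only a proposed mapping; actually running the tasks would still go through each site's local resource manager.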
Slide 24
Grid Information Service (GIS)
Provides resource information to Grid schedulers.
GIS is responsible for collecting and predicting resource state information, such as CPU capacity, memory size, network bandwidth, software availability, and the load of a site in a particular period.
GIS can answer queries for resource information or push information to subscribers.
Globus: Monitoring and Discovery System (MDS)
Application Profiling (AP) is used to extract properties of applications.
Analogical Benchmarking (AB) provides a measure of how well a resource can perform a given type of job.
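The two access modes mentioned above, answering queries (pull) and pushing to subscribers, can be sketched with a tiny in-memory class. The class and method names (`InformationService`, `report`, `query`, `subscribe`) are hypothetical and are not the MDS API; the sketch only shows the interaction pattern.

```python
from typing import Callable, Dict, List

ResourceInfo = Dict[str, float]
Subscriber = Callable[[str, ResourceInfo], None]


class InformationService:
    """Toy information service holding the latest state reported by each site."""

    def __init__(self) -> None:
        self._state: Dict[str, ResourceInfo] = {}
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        """Push mode: register a callback invoked on every new report."""
        self._subscribers.append(callback)

    def report(self, site: str, info: ResourceInfo) -> None:
        """Called by a local resource manager to publish its current state."""
        self._state[site] = dict(info)
        for callback in self._subscribers:
            callback(site, dict(info))

    def query(self, site: str) -> ResourceInfo:
        """Pull mode: return the last known state, or an empty dict if unknown."""
        return dict(self._state.get(site, {}))


gis = InformationService()
gis.subscribe(lambda site, info: print("pushed:", site, info))
gis.report("siteA", {"cpu_ghz": 2.4, "memory_gb": 64, "load": 0.3})
print("queried:", gis.query("siteA"))
```

Copies are handed out in both modes so that a scheduler cannot accidentally modify the service's stored state.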
Slide 25
Launching and Monitoring (LM)
Also called the Binder.
Implements a finally determined schedule by submitting applications to selected resources, staging input data and executables if necessary, and monitoring the execution of the applications.
Globus: Grid Resource Allocation and Management (GRAM)
Slide 26
Local Resource Manager (LRM)
Is mainly responsible for two jobs:
  Local scheduling inside a resource domain, where not only jobs from exterior Grid users but also jobs from the domain's local users are executed
  Reporting resource information to GIS
Local schedulers: OpenPBS, Condor, LSF, SGE, etc.
Monitoring and reporting: NWS (Network Weather Service), Hawkeye, Ganglia
Slide 27
Evaluation Criteria for Grid Scheduling Systems
Application Performance Promotion
System Performance Promotion
Scheduling Efficiency
Reliability
Scalability
Applicability to Application and Grid Environment
Slide 28
Scheduler Organization
Centralized
Decentralized
Hierarchical
Slide 29
Centralized Scheduling
Slide 30
Decentralized Scheduling
Slide 31
Hierarchical Scheduling
Slide 32
Existing Grid Scheduling Systems
Information collection systems
  MDS (Monitoring and Discovery System)
  NWS (Network Weather Service)
Condor
Condor-G
AppLeS
Nimrod-G
GrADS
Etc.
Slide 33
Characteristics of Scheduling for Grid Computing
Heterogeneity and Autonomy
  The scheduler does not have full control of the resources.
  It is hard to estimate the exact cost of executing a task on different sites.
  The scheduler is required to be adaptive to different local policies.
Performance Dynamism
  Grid resources are not dedicated to a Grid application.
  Performance fluctuates, compared with traditional systems.
  Some methods: QoS negotiation, resource reservation, rescheduling.
Resource Selection and Computation-Data Separation
  In traditional systems, the executable code of an application and its input/output data are usually in the same site, or the input sources and output destinations are determined before the application is submitted, so the cost of data staging can be neglected.
Application Diversity
Slide 34
Grid Scheduling Algorithms The Complexity of