discuss scope
- workload partition
- energy efficiency
- computing approaches
1. at runtime
2. algorithm
3. programming
4. compiler
5. application level - discrete and fused CPU-GPU systems
- benchmark
Motivation
CPU: out-of-order, multi-instruction-issue cores that run at high frequency and use large caches to minimize the latency of a single thread. Suited for latency-critical applications
GPU: in-order cores that share their control unit; GPU cores use lower frequencies and smaller caches. Suited for throughput-critical applications
a heterogeneous system: can provide high performance for a much wider variety of applications and usage scenarios than either a CPU or a GPU alone
In systems with GPUs, CPUs have conventionally been used as hosts for the GPU, managing I/O and scheduling; however, as continuing innovations improve CPU performance even further, using their computational capabilities has also become more attractive
Challenges
The vastly different architecture, programming model, and performance (for a given
program) of CPUs and GPUs present unique challenges in heterogeneous computing.
- PU specific
- Application/Problem specific
- Objective specific
Workload Partition
dynamic or static scheduling
whether the mapping of subtasks to PUs is fixed ahead of time (static) or decided at runtime (dynamic)
basis of workload partition
why a particular scheduling of tasks to PUs is chosen, e.g., based on the characteristics/capabilities of the PUs themselves and/or the subtasks themselves.
scheduling based on relative performance of PUs
using a performance model, it evaluates the respective contribution of each PU and estimates the total execution time of the FFT problem for arbitrary work distributions and problem sizes. It decomposes the computation and uses profiling to estimate the optimal workload division between PUs (profiling and estimation).
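The profile-and-estimate idea can be sketched as follows. This is a minimal sketch assuming a linear performance model; `cpu_time_per_unit` and `gpu_time_per_unit` stand for hypothetical per-work-unit costs measured by profiling:

```python
def optimal_split(total_work, cpu_time_per_unit, gpu_time_per_unit):
    """Choose the static split that makes both PUs finish simultaneously.

    Under a linear model, cpu_share * cpu_time_per_unit ==
    gpu_share * gpu_time_per_unit at the optimum, so the CPU's share is
    proportional to the GPU's per-unit cost.
    """
    cpu_share = gpu_time_per_unit / (cpu_time_per_unit + gpu_time_per_unit)
    cpu_work = round(total_work * cpu_share)
    return cpu_work, total_work - cpu_work

# Example: profiling found the GPU is 3x faster per work unit.
cpu_work, gpu_work = optimal_split(1000, cpu_time_per_unit=3.0, gpu_time_per_unit=1.0)
print(cpu_work, gpu_work)  # 250 750
```

Real systems fit richer models (e.g., non-linear in problem size), but the balancing principle is the same.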
accelerating query processing: partitions work based on the length of the query (specific to one kind of process)
divide the workload based on several factors, such as device contention, historical performance data, number of cores, processor speed, problem size, and device status
Their technique intercepts function calls to kernels and schedules them on a PU based on their argument size, historical profile, and the location of the data. The technique accounts for both computation time and data transfer time
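A minimal sketch of this kind of cost-based PU choice, assuming hypothetical rates and a simple compute-plus-transfer cost model (all names and numbers are illustrative, not the actual system's API):

```python
def pick_pu(arg_bytes, cpu_rate, gpu_rate, pcie_bw, data_on_gpu=False):
    """Pick CPU or GPU by comparing estimated total cost in seconds:
    compute time plus, for the GPU, any host-to-device transfer time.

    Rates and bandwidth are in bytes/second; data_on_gpu reflects the
    'location of data' factor: resident data needs no transfer.
    """
    cpu_cost = arg_bytes / cpu_rate
    transfer = 0.0 if data_on_gpu else arg_bytes / pcie_bw
    gpu_cost = arg_bytes / gpu_rate + transfer
    return "GPU" if gpu_cost < cpu_cost else "CPU"

# A 10x faster GPU can still lose once the PCIe transfer is counted:
print(pick_pu(1e9, cpu_rate=1e9, gpu_rate=1e10, pcie_bw=1e9))                    # CPU
print(pick_pu(1e9, cpu_rate=1e9, gpu_rate=1e10, pcie_bw=1e9, data_on_gpu=True))  # GPU
```

The example shows why accounting for data location matters: ignoring transfer time would always pick the GPU here.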
accelerating QR factorization: sequence of subtasks -> CPU or GPU functions -> static or dynamic schedule
divided based on their relative performance. The performance estimates of the PUs are updated during each iteration of the algorithm.
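The per-iteration update can be sketched like this. This is one possible heuristic, not the paper's exact rule; `alpha` is an assumed damping factor to avoid oscillation:

```python
def update_split(cpu_share, cpu_time, gpu_time, alpha=0.5):
    """Adjust the CPU's work share toward equal finish times.

    cpu_time and gpu_time are the measured execution times of the
    previous iteration for the current split. The per-unit speeds they
    imply give a balanced target share; alpha damps the correction.
    """
    cpu_speed = cpu_share / cpu_time          # work units per second on CPU
    gpu_speed = (1.0 - cpu_share) / gpu_time  # work units per second on GPU
    target = cpu_speed / (cpu_speed + gpu_speed)
    return cpu_share + alpha * (target - cpu_share)

# A 50/50 split where the CPU took 3s and the GPU 1s shifts work to the GPU:
print(update_split(0.5, cpu_time=3.0, gpu_time=1.0))  # 0.375
```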
hard real-time stream scheduling in heterogeneous systems: partitions the incoming streams
into two subsets, one for the CPU and one for the GPU. The algorithm seeks an assignment that satisfies both the deadline constraint of each stream alone and the aggregate throughput requirements of all the streams
static partitioning technique for OpenCL programs on HCSs: conducts static analysis on OpenCL programs to extract code features -> determines the best work-division ratio using a machine learning approach -> divides the workload into suitably sized chunks for each PU
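The two constraints in the stream-scheduling approach can be expressed as a feasibility check over a candidate assignment. A minimal sketch with hypothetical stream tuples and per-PU capacities:

```python
def feasible(streams, cpu_capacity, gpu_capacity):
    """Check a candidate CPU/GPU assignment of streams.

    streams: list of (deadline, cpu_time, gpu_time, assigned_pu) tuples.
    Each stream must meet its own deadline on its assigned PU, and the
    total load on each PU must stay within that PU's capacity
    (the aggregate throughput requirement).
    """
    load = {"CPU": 0.0, "GPU": 0.0}
    for deadline, cpu_time, gpu_time, pu in streams:
        t = cpu_time if pu == "CPU" else gpu_time
        if t > deadline:  # per-stream deadline constraint violated
            return False
        load[pu] += t
    return load["CPU"] <= cpu_capacity and load["GPU"] <= gpu_capacity

# Two streams: one fits the GPU, one the CPU; both deadlines and loads hold.
print(feasible([(1.0, 0.5, 0.2, "GPU"), (2.0, 1.0, 0.3, "CPU")], 1.5, 1.0))  # True
```

A partitioning algorithm would search over assignments and keep only those passing such a check.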
these techniques are usually optimized for particular classes of problems.