discuss scope
- workload partition
- energy efficiency
- computing approaches
1. at runtime
2. algorithm
3. programming
4. compiler
5. application level - discrete and fused CPU-GPU systems
- benchmark
Motivation
CPU: out-of-order, multi-instruction-issue cores that run at high frequency and use large caches to minimize the latency of a single thread. Suited for latency-critical applications
GPU: in-order cores that share their control unit; GPU cores use lower frequencies and smaller caches. Suited for throughput-critical applications
a heterogeneous system: can provide high performance for a much wider variety of applications and usage scenarios than either a CPU or a GPU alone
In systems with GPUs, CPUs have conventionally been used as hosts for the GPU, managing I/O and scheduling; however, as continuing innovations improve CPU performance even further, using their computational capabilities has also become more attractive
Challenges
The vastly different architecture, programming model, and performance (for a given
program) of CPUs and GPUs present unique challenges in heterogeneous computing.
- PU specific
- Application/Problem specific
- Objective specific
Workload Partition
dynamic or static scheduling
whether the mapping of subtasks to PUs is fixed ahead of time (static) or decided at runtime (dynamic)
basis of workload partition
why a particular scheduling of tasks to PUs is chosen, e.g., based on the characteristics/capabilities of the PUs themselves and/or the subtasks themselves.
scheduling based on relative performance of PUs
using a performance model, it evaluates the respective contribution of each PU and estimates the total execution time of the FFT problem for arbitrary work distributions and problem sizes. It decomposes the computation and uses profiling to estimate the optimal workload division between PUs (profiling and estimation).
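The profile-and-estimate idea can be sketched as follows. This is a minimal sketch assuming a linear performance model; `cpu_time_per_unit` and `gpu_time_per_unit` stand for hypothetical per-work-unit costs measured by profiling:

```python
def optimal_split(total_work, cpu_time_per_unit, gpu_time_per_unit):
    """Choose the static split that makes both PUs finish simultaneously.

    Under a linear model, cpu_share * cpu_time_per_unit ==
    gpu_share * gpu_time_per_unit at the optimum, so the CPU's share is
    proportional to the GPU's per-unit cost.
    """
    cpu_share = gpu_time_per_unit / (cpu_time_per_unit + gpu_time_per_unit)
    cpu_work = round(total_work * cpu_share)
    return cpu_work, total_work - cpu_work

# Example: profiling found the GPU is 3x faster per work unit.
cpu_work, gpu_work = optimal_split(1000, cpu_time_per_unit=3.0, gpu_time_per_unit=1.0)
print(cpu_work, gpu_work)  # 250 750
```

Real systems fit richer models (e.g., non-linear in problem size), but the balancing principle is the same.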
accelerating query processing: partitions work based on the length of the query (specific to one kind of process)
divide the workload based on several factors, such as device contention, historical performance data, number of cores, processor speed, problem size, and device status
Their technique intercepts function calls to kernels and schedules them on a PU based on their argument size, historical profile, and the location of the data. The technique accounts for both computation time and data transfer time
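A minimal sketch of this kind of cost-based PU choice, assuming hypothetical rates and a simple compute-plus-transfer cost model (all names and numbers are illustrative, not the actual system's API):

```python
def pick_pu(arg_bytes, cpu_rate, gpu_rate, pcie_bw, data_on_gpu=False):
    """Pick CPU or GPU by comparing estimated total cost in seconds:
    compute time plus, for the GPU, any host-to-device transfer time.

    Rates and bandwidth are in bytes/second; data_on_gpu reflects the
    'location of data' factor: resident data needs no transfer.
    """
    cpu_cost = arg_bytes / cpu_rate
    transfer = 0.0 if data_on_gpu else arg_bytes / pcie_bw
    gpu_cost = arg_bytes / gpu_rate + transfer
    return "GPU" if gpu_cost < cpu_cost else "CPU"

# A 10x faster GPU can still lose once the PCIe transfer is counted:
print(pick_pu(1e9, cpu_rate=1e9, gpu_rate=1e10, pcie_bw=1e9))                    # CPU
print(pick_pu(1e9, cpu_rate=1e9, gpu_rate=1e10, pcie_bw=1e9, data_on_gpu=True))  # GPU
```

The example shows why accounting for data location matters: ignoring transfer time would always pick the GPU here.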
accelerating QR factorization: sequence of subtasks -> CPU or GPU functions -> static or dynamic schedule
divided based on their relative performance. The performance estimates of the PUs are updated during each iteration of the algorithm.
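The per-iteration update can be sketched like this. This is one possible heuristic, not the paper's exact rule; `alpha` is an assumed damping factor to avoid oscillation:

```python
def update_split(cpu_share, cpu_time, gpu_time, alpha=0.5):
    """Adjust the CPU's work share toward equal finish times.

    cpu_time and gpu_time are the measured execution times of the
    previous iteration for the current split. The per-unit speeds they
    imply give a balanced target share; alpha damps the correction.
    """
    cpu_speed = cpu_share / cpu_time          # work units per second on CPU
    gpu_speed = (1.0 - cpu_share) / gpu_time  # work units per second on GPU
    target = cpu_speed / (cpu_speed + gpu_speed)
    return cpu_share + alpha * (target - cpu_share)

# A 50/50 split where the CPU took 3s and the GPU 1s shifts work to the GPU:
print(update_split(0.5, cpu_time=3.0, gpu_time=1.0))  # 0.375
```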
hard real-time stream scheduling in heterogeneous systems: partitions the incoming streams
into two subsets, one for the CPU and one for the GPU. The algorithm seeks an assignment that satisfies both the deadline constraint of each stream alone and the aggregate throughput requirements of all the streams
static partitioning technique for OpenCL programs on HCSs: conducts static analysis on OpenCL programs to extract code features -> determines the best work-division ratio using a machine learning approach -> divides the workload into suitably sized chunks for each PU
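The two constraints in the stream-scheduling approach can be expressed as a feasibility check over a candidate assignment. A minimal sketch with hypothetical stream tuples and per-PU capacities:

```python
def feasible(streams, cpu_capacity, gpu_capacity):
    """Check a candidate CPU/GPU assignment of streams.

    streams: list of (deadline, cpu_time, gpu_time, assigned_pu) tuples.
    Each stream must meet its own deadline on its assigned PU, and the
    total load on each PU must stay within that PU's capacity
    (the aggregate throughput requirement).
    """
    load = {"CPU": 0.0, "GPU": 0.0}
    for deadline, cpu_time, gpu_time, pu in streams:
        t = cpu_time if pu == "CPU" else gpu_time
        if t > deadline:  # per-stream deadline constraint violated
            return False
        load[pu] += t
    return load["CPU"] <= cpu_capacity and load["GPU"] <= gpu_capacity

# Two streams: one fits the GPU, one the CPU; both deadlines and loads hold.
print(feasible([(1.0, 0.5, 0.2, "GPU"), (2.0, 1.0, 0.3, "CPU")], 1.5, 1.0))  # True
```

A partitioning algorithm would search over assignments and keep only those passing such a check.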
these techniques are usually optimized for particular classes of problems.