Undersubscribed Threading on Clustered Cache Architectures

This article is based on: Heirman, W., Carlson, T. E., Van Craeynest, K., Hur, I., Jaleel, A., & Eeckhout, L., "Undersubscribed Threading on Clustered Cache Architectures," HPCA 2014.



Abstract

Recent many-core processors such as Intel’s Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low latency and ease of implementation in many-core processors.

We propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST), which dynamically matches an application’s working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area trade-off towards design points with more cores and less cache.


Introduction

Increasing core counts on many-core chips require processor architects to design higher-performing memory subsystems to keep these cores fed with data. An important design trade-off is the allocation of die area and power budget across cores and caches [9, 19]. Adding cores increases the theoretical maximum performance of the chip, but sufficient cache capacity must be available to exploit locality; if not, real-world performance will suffer. Yet, the notion of enough cache capacity depends on the application’s working set characteristics, which vary widely across applications, input sets, and even different kernels or phases within an application. Designing a processor architecture that maximizes performance for most benchmarks (by maximizing core count) without causing unacceptable degradation when the working set does not fit in cache is a difficult balancing act.
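The intuition behind undersubscription can be sketched with a toy performance model (an illustrative assumption for this post, not the paper's actual CRUST algorithm or any measured data): while the aggregate working set fits in cache, throughput scales with thread count; once it spills, every thread pays a miss penalty, so running fewer threads can win.

```python
# Toy model (illustrative only, not from the paper): throughput scales
# linearly with thread count until the combined working set exceeds the
# shared cache, after which all threads pay a uniform miss penalty.

def throughput(threads, per_thread_ws_mb, cache_mb, miss_penalty=4.0):
    """Relative throughput for a given thread count under the toy model."""
    total_ws = threads * per_thread_ws_mb
    slowdown = miss_penalty if total_ws > cache_mb else 1.0
    return threads / slowdown

def best_thread_count(max_threads, per_thread_ws_mb, cache_mb):
    """Exhaustively pick the thread count maximizing modeled throughput."""
    return max(range(1, max_threads + 1),
               key=lambda t: throughput(t, per_thread_ws_mb, cache_mb))

# With a 32 MB cache and 4 MB per-thread working sets, the model prefers
# 8 of 16 threads (undersubscription); with 1 MB working sets, all 16.
print(best_thread_count(16, 4, 32))  # → 8
print(best_thread_count(16, 1, 32))  # → 16
```

CRUST's runtime search is far more sophisticated (it accounts for shared data and off-chip bandwidth per cluster), but this captures why "use every core" is not always the right default.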
