Undersubscribed Threading on Clustered Cache Architectures

This article is based on: Heirman, W., Carlson, T. E., Van Craeynest, K., Hur, I., Jaleel, A., & Eeckhout, L., "Undersubscribed Threading on Clustered Cache Architectures," HPCA 2014.



Abstract

Recent many-core processors such as Intel’s Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low latency and ease of implementation in many-core processors.

We propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST), which dynamically matches an application’s working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area trade-off towards design points with more cores and less cache.


Introduction

Increasing core counts on many-core chips require processor architects to design higher-performing memory subsystems to keep these cores fed with data. An important design trade-off is the allocation of die area and power budget across cores and caches [9, 19]. Adding cores increases the theoretical maximum performance of the chip, but sufficient cache capacity must be available to exploit locality; if not, real-world performance will suffer. Yet, the notion of enough cache capacity depends on the application’s working set characteristics, which vary widely across applications, input sets, and even different kernels or phases within an application. Designing a processor architecture that maximizes performance for most benchmarks (by maximizing core count) without causing unacceptable degradation when the working set does not fit in cache is a difficult balancing act.
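The intuition behind undersubscription can be sketched with a toy performance model (an illustrative assumption for this post, not the paper's actual CRUST algorithm or any measured data): while the aggregate working set fits in cache, throughput scales with thread count; once it spills, every thread pays a miss penalty, so running fewer threads can win.

```python
# Toy model (illustrative only, not from the paper): throughput scales
# linearly with thread count until the combined working set exceeds the
# shared cache, after which all threads pay a uniform miss penalty.

def throughput(threads, per_thread_ws_mb, cache_mb, miss_penalty=4.0):
    """Relative throughput for a given thread count under the toy model."""
    total_ws = threads * per_thread_ws_mb
    slowdown = miss_penalty if total_ws > cache_mb else 1.0
    return threads / slowdown

def best_thread_count(max_threads, per_thread_ws_mb, cache_mb):
    """Exhaustively pick the thread count maximizing modeled throughput."""
    return max(range(1, max_threads + 1),
               key=lambda t: throughput(t, per_thread_ws_mb, cache_mb))

# With a 32 MB cache and 4 MB per-thread working sets, the model prefers
# 8 of 16 threads (undersubscription); with 1 MB working sets, all 16.
print(best_thread_count(16, 4, 32))  # → 8
print(best_thread_count(16, 1, 32))  # → 16
```

CRUST's runtime search is far more sophisticated (it accounts for shared data and off-chip bandwidth per cluster), but this captures why "use every core" is not always the right default.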
