The Next Era for Chiplet Innovation (2023)

Authors: Gabriel H. Loh, Raja Swaminathan
Organization: Advanced Micro Devices, Inc. (AMD)
Email: gabriel.loh@amd.com

Abstract

  1. Moore’s Law is slowing down while associated costs are simultaneously increasing. These pressures give rise to new approaches that utilize advanced packaging and integration, such as chiplets, interposers, and 3D stacking.
  2. The authors describe the key technology drivers and constraints that motivate chiplet-based architectures.
  3. Several product case studies are explored to highlight how different chiplet strategies were developed to address different design objectives.
  4. Multiple generations of chiplet-based CPU architectures are detailed.
  5. The paper anticipates the transition to a new generation of chiplet architectures that utilize increasing combinations of 2D, 2.5D, and 3D integration and packaging as SoC solutions.
  6. Across this transition, a variety of challenges arise; the paper explores many of these topics.

Keywords—chiplets, integration, stacking

I. INTRODUCTION

  1. Specific technologies for specific problems
    1. For some products, multi-chip module (MCM) and chiplet designs are used to deal with rising silicon costs and to integrate more logic per package, or die stacking is utilized to address bandwidth needs or to increase integration density.
  2. A combination of technologies to address a variety of challenges
    1. Technology trends such as the slowing of Moore’s Law, the increasing costs of silicon, demands on memory bandwidth and solution density, optimizing for total cost of ownership (TCO), and other factors may force future silicon designs to not just adopt these technologies, but to aggressively deploy several of them at once within the same design.
    2. This leads to the potential for exciting new architectures.
  3. The paper starts with an overview of several advanced packaging and stacking technologies.
    1. It explains what they are, their pros and cons, which applications are most suitable, and provides some examples of commercial use cases for each.
    2. It then covers expected future trends and their implications for the design of semiconductor systems regarding packaging and stacking.
    3. Finally, it discusses the challenging research problems that future systems will face, which should provide excellent research topics.

II. CURRENT APPROACHES

This part limits the focus to some of the technologies that have been utilized in high-volume commercial offerings for server and consumer products.

2.1 2D MCM

  1. Die yields decrease in a non-linear fashion as a function of the die size.

  2. Textbook yield equation (the negative binomial model):

     Y = (1 + (A × D0) / α)^(−α)

     where A is the die area, D0 is the defect density, and α is the defect clustering parameter.

  3. Figure 1a shows hypothetical yield curves using this equation.

  4. Very large die sizes can become exceedingly expensive as the raw yield rates drop.

  5. The commercial usage of multi-chip module (MCM) technology has been around for decades. An MCM takes the functionality of one large die or system-on-a-chip (SoC) and partitions the design into multiple smaller chips.

  6. Due to the non-linear relationship between die size and yield, reintegrating multiple smaller chips can be more cost effective than constructing a single monolithic SoC.

  7. Figure 1b shows a first-generation AMD EPYC CPU MCM: a 32-core CPU is partitioned into four smaller 8-core die. This organization results in a ~40% reduction in cost compared to a hypothetical monolithic implementation.

  8. One tradeoff for MCMs is that communication between logical components on different die must cross a die-to-die link through the package substrate. Compared to on-chip metal resources, the bandwidth, latency, and power of sending data between the chips of an MCM are all worse. Partitioning the SoC’s logical blocks in hardware and/or managing them in software can keep traffic over the reduced-bandwidth inter-die links under control.
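The cost reasoning above can be sketched numerically. The following is a minimal illustration assuming the negative binomial yield model; the defect density, clustering parameter, per-mm2 cost, and die areas are hypothetical values chosen only to show the shape of the tradeoff, not AMD data:

```python
# Sketch: why partitioning one large die into smaller chiplets can cut cost.
# Yield model: Y = (1 + A*D0/alpha)^(-alpha). D0 and alpha are illustrative.

def die_yield(area_mm2, d0=0.002, alpha=2.0):
    """Fraction of good dies at a given die area (negative binomial model)."""
    return (1 + area_mm2 * d0 / alpha) ** (-alpha)

def cost_per_good_die(area_mm2, cost_per_mm2=1.0):
    """Raw silicon cost divided by yield: what a *good* die effectively costs."""
    return area_mm2 * cost_per_mm2 / die_yield(area_mm2)

monolithic = cost_per_good_die(800)      # one large 800 mm^2 SoC
chiplets = 4 * cost_per_good_die(200)    # same total logic split into four die
savings = 1 - chiplets / monolithic
print(monolithic, chiplets, savings)     # the split comes out cheaper here
```

Because yield falls non-linearly with area, the four small die together cost less than the single large die under these assumptions, even before accounting for the ability to harvest partially defective chiplets.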

2.2 2D Chiplets

  1. Newer technologies are becoming more expensive even for chips with the same die area. Figure 2a shows the cost of a 250mm2 die, normalized to the 45nm node, across a range of technologies. While classic MCMs are cost-effective, AMD’s chiplet approach goes further and implements different die in different process technologies to better match the requirements and/or constraints of each chiplet.

  2. Figure 2b shows a second-generation AMD EPYC CPU with mixed chiplets. The 8 smaller chiplets each implement eight cores in a 7nm technology node. The larger chip in the middle is the “IO Die” that houses memory controllers, IO interfaces, and other system components. Many of these blocks, especially IO interfaces, do not scale much or at all with improvements in technology nodes.

  3. The size of some of these blocks is dictated by the area required for external IO connections. The IOD (IO die) in this design is therefore implemented in an older, more cost-effective 12nm node.

  4. Similar to MCMs, communication between chiplets may be limited by substrate-level routing, so the architectural partitioning of an SoC into chiplets is an important part of the design process.
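A small sketch of the mixed-node cost argument above. The normalized cost-per-mm2 values and die areas are illustrative assumptions standing in for the Figure 2a trend, not actual foundry pricing or AMD die sizes:

```python
# Sketch: keep scaling-friendly cores in the leading node, IO in an older node.
# All numbers below are illustrative assumptions.

cost_per_mm2 = {"12nm": 1.0, "7nm": 2.0}    # normalized cost per mm^2

core_cost = 8 * 75 * cost_per_mm2["7nm"]    # eight 75 mm^2 core chiplets in 7nm
io_in_12nm = 400 * cost_per_mm2["12nm"]     # IO die kept in the older node
io_in_7nm = 350 * cost_per_mm2["7nm"]       # IO logic barely shrinks in 7nm

print(core_cost + io_in_12nm)   # mixed-node design
print(core_cost + io_in_7nm)    # hypothetical all-7nm design, more expensive
```

The point of the sketch: because IO blocks scale poorly, porting them to the leading node buys little area but pays the full leading-node price, so the older-node IOD wins on cost.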

2.3 2.5D Silicon Interposer

  1. Die-to-die communication links across the package substrate in MCM and chiplet designs are limited to around a few tens of GB/s. The primary constraint is the width/density of the metal routing that can be supported in organic substrate implementations.

  2. A silicon interposer is a chip whose purpose is to provide interconnections between multiple other chips. Figure 3a shows a cross-sectional view of an interposer with two chips stacked on top.

  3. This is often referred to as 2.5D stacking because the chips are still organized in 2D relative to each other. If Chip 1 is a memory device and Chip 2 is a compute die, the interposer can provide thousands of parallel routes in a relatively small area, thereby supporting the 100s of GB/s of bandwidth required by high-performance memory interfaces.

  4. The interposer uses through-silicon vias (TSVs) to provide IO, power, and ground connections from the individual chips to the outside of the package.

  5. Figure 3b illustrates an AMD Instinct MI100 accelerator, which combines a GPU-based compute die with four in-package DRAM modules, all stacked on and interconnected by an interposer, supporting a peak theoretical bandwidth of 1.2 TB/s.
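A back-of-envelope check of how thousands of parallel interposer routes reach memory-class bandwidth. The wire count and signaling rate are illustrative HBM-style assumptions, not MI100 specifications:

```python
# Sketch: aggregate bandwidth = data wires * channels * per-wire signal rate.

wires_per_channel = 1024   # wide parallel interface per DRAM stack (assumed)
channels = 4               # four in-package DRAM modules
gbit_per_wire = 2.4        # Gb/s per signal wire (assumed)

total_gbps = wires_per_channel * channels * gbit_per_wire  # gigabits per second
total_gBps = total_gbps / 8                                # gigabytes per second
print(total_gBps)  # ~1228.8 GB/s, on the order of the stated 1.2 TB/s
```

Interfaces this wide are only practical because the interposer’s silicon-level metal density supports thousands of routes in a small footprint; organic substrates cannot.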

2.4 2.5D Silicon Bridges

  1. One set of challenges stems from the fact that the interposer must be large enough to accommodate all of the chips that are 2.5D stacked on top of it.
  2. Silicon bridge technologies have been developed as an alternative packaging solution to provide silicon-levels of wire density while using much smaller pieces of silicon.
  3. Figure 4a shows a cross-sectional view of AMD’s elevated fanout bridge (EFB) technology.
  4. EFB is similar to a silicon interposer in that it provides an electrical interface to the chips above, but it is much smaller, only needing to be large enough to cover the die-to-die connection interfaces, which makes it more cost-effective.
  5. Figure 4b shows the AMD Instinct MI200 accelerator, which consists of two GPU compute die and eight in-package memory modules. Each memory module is connected to a GPU compute die via EFB, as illustrated in Figure 4c.

2.5 3D Stacking: Microbumps

  1. 3D stacking can further increase integration density and die-to-die bandwidth by directly placing one or more active chips on top of each other.

  2. Microbumps are small solder connections. Figure 5a shows a cross-sectional micrograph of two chips vertically connected with microbumps.

  3. The microbump stacking process can be repeated to construct stacks with multiple die. Figure 5b shows a 3D memory stack with 8 layers of DRAM chips, all interconnected with TSVs and microbumps. This greatly increases the amount of memory that can be integrated into a given processor package area.

  4. There are also some challenges, including higher thermal resistance and the additional stack height, and interconnect density is limited by the size of the microbumps.

2.6 3D Stacking: Hybrid Bonding

  1. Recent 3D stacking technology uses a two-phase hybrid bonding process. Rather than using microbumps to connect the metal pads of two chips, the chips are fused directly together. In the first phase, the oxide surfaces of the two chips form covalent bonds; in the second phase, copper bonding fuses the metal pads of the two chips directly together.

  2. By eliminating microbumps completely, hybrid bonding can support higher interconnect densities (finer pitches). Figure 6a shows a view of a cache die hybrid-bonded on top of a CPU die.

  3. Figure 6b shows a graphical rendering of AMD’s V-Cache technology, which stacks a cache die on top of a CPU chiplet, providing the ability to triple the capacity of the CPU’s L3 cache at full bandwidth.

  4. In this implementation, additional passive filler silicon (shown as the floating gray pieces in the figure) is stacked on top of the CPU compute logic to help conduct heat from the processor pipeline to the package’s cooling solution (not shown).

  5. The direct die-to-die interface without microbumps or underfill provides a thermally superior pathway compared to microbump-based 3D stacking.
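The density advantage of hybrid bonding over microbumps can be illustrated with a simple pitch calculation; the pitch values below are assumed round numbers for illustration, not specific process figures:

```python
# Sketch: vertical connection density scales as 1/pitch^2 for a square pad grid.

def connections_per_mm2(pitch_um):
    per_mm = 1000.0 / pitch_um        # pads along one millimeter
    return per_mm * per_mm            # pads in one square millimeter

microbump = connections_per_mm2(36.0)  # tens-of-microns pitch (assumed)
hybrid = connections_per_mm2(9.0)      # hybrid bonding: single-digit um (assumed)
print(hybrid / microbump)              # quartering the pitch gives 16x density
```

Because density grows with the square of the pitch reduction, even modest pitch improvements from hybrid bonding translate into large gains in die-to-die bandwidth per unit area.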

III. FUTURE SILICON CITYSCAPES

As system requirements continue to increase, technology scaling slows down, and package sizes stop growing, the desire to integrate more functionality into a single package will inevitably drive the simultaneous utilization of multiple 2D, 2.5D, and 3D approaches within the same design.

3.1 Combination Approaches

  1. We are already seeing the beginnings of the move toward SoCs utilizing multiple integration technologies. Figure 7a shows a blown-up view of the AMD Instinct MI200 accelerator, which combines 2.5D silicon bridges with 3D microbump-based DRAM stacks within the same design.

  2. Earlier GPUs combined 2.5D passive interposers with 3D DRAM as well.

  3. Figure 7b shows an example system that simultaneously combines multiple 2D, 2.5D, and 3D technologies in a single solution. The authors refer to such designs as “Silicon Cityscapes”.

3.2 Design Space - Why Silicon Cityscapes?

  1. The demand for more computing power does not appear to be slowing, but the industry faces headwinds in keeping up with that demand.
  2. One headwind is that Moore’s Law has slowed down: we must wait longer for new nodes, and the new nodes are increasingly expensive.
  3. Partitioning an SoC into smaller chiplets and using 2D and 2.5D packaging technologies to reintegrate them together has provided a path to continue scaling.
  4. But placing chiplets in 2D/2.5D is running up against the available real estate within the package; many of the figures above show the silicon components consuming the vast majority of the available package area.
  5. It is difficult and expensive to build larger packages. So the natural option is to go up.
  6. While “scaling out” across packages is possible, some computations have communication requirements that favor designs keeping as much of the computational resources as possible co-located within the same package (package-internal links can provide higher bandwidth than external package-to-package links).

IV. SILICON CITYSCAPE RESEARCH TOPICS

Silicon Cityscapes raise many challenging research problems.

4.1 Chiplet Decomposition and Technology Selection

  1. Given an SoC’s design requirements, the first challenge is determining how to partition the design and how to reintegrate the pieces.
  2. Different architectural organizations need to be explored along with cost analyses. A smaller number of larger chiplets reduces the overheads of die-to-die communication, whereas a larger number of smaller chiplets can reduce the silicon cost per chiplet.
  3. Communication paths that can tolerate lower bandwidth and higher latencies can utilize 2D/2.5D packaging; interfaces that require the highest bandwidths and lowest latencies can use 3D stacking with hybrid bonding.
  4. Chiplet reuse is an important factor in keeping overall design costs under control; previous work has shown its effectiveness.
  5. Determining the best chiplet partitioning for a single architecture is already very challenging, but the design space grows further when one wants to simultaneously optimize the partitioning across multiple architectures while minimizing the number of unique chiplets that must be taped out.
  6. Research problem: developing techniques, tools, and methodologies to guide architects and designers in partitioning designs and choosing integration technologies.
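The partitioning tradeoff in items 2 and 3 can be made concrete with a toy search: silicon cost falls with smaller chiplets (better yield) while die-to-die interface overhead grows with chiplet count. The yield model, areas, and link-overhead constant are illustrative assumptions, not a real cost model:

```python
# Sketch: brute-force search for a cost-minimizing chiplet count.

def die_yield(area_mm2, d0=0.002, alpha=2.0):
    # Negative binomial yield model; d0 and alpha are illustrative.
    return (1 + area_mm2 * d0 / alpha) ** (-alpha)

def total_cost(total_area_mm2, n_chiplets, link_overhead=40.0):
    area = total_area_mm2 / n_chiplets
    silicon = n_chiplets * area / die_yield(area)   # cost of n good dies
    links = (n_chiplets - 1) * link_overhead        # die-to-die interface cost
    return silicon + links

best = min(range(1, 17), key=lambda n: total_cost(800, n))
print(best, total_cost(800, best))  # an interior optimum: neither 1 nor 16
```

Even this toy model has an interior optimum, which is the paper’s point: real partitioning decisions, with heterogeneous nodes and reuse constraints layered on top, need dedicated tools and methodologies.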

4.2 Interconnect Infrastructure

  1. Research topic: exploring how to effectively interconnect everything in a scalable, performant, and energy-efficient manner that also provides easy interoperability and modularity among the chiplets. This area includes a range of network-on-chip (NoC) topics, generalized to 3D and to negotiating a heterogeneous set of interfaces.
  2. Software can potentially allocate and place data in more convenient locations within the package, schedule processing tasks to be co-located near data sources, pre-schedule any necessary data movement to reduce congestion, and partition workloads across the different computational resources (chiplets).
  3. Hardware-software co-design can relax the requirements on some portions of the interconnect architecture, perhaps allowing some chiplets to be integrated with simpler or more cost-effective technologies.
  4. Another research topic is designing protocols to support easy interoperability and modularity of many diverse chiplets (e.g., UCIe): providing standard interfaces and approaches for security, power management, memory management, virtualization, boot-up and chiplet discovery, debug, profiling and telemetry, error reporting, and more.
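A tiny model of why the interconnect looks heterogeneous from software’s point of view: each integration style contributes a different per-hop latency. The per-hop numbers below are invented for illustration only:

```python
# Sketch: path latency over a heterogeneous mix of chiplet interconnects.

hop_ns = {"on_die": 1, "3d": 2, "interposer": 5, "substrate": 10}  # assumed ns

def path_latency(hops):
    # Total latency is the sum of per-hop latencies along the path.
    return sum(hop_ns[h] for h in hops)

near = path_latency(["on_die", "3d"])                   # 3D-stacked cache
far = path_latency(["on_die", "substrate", "on_die"])   # far chiplet's resources
print(near, far)  # co-locating data near compute avoids the slow hops
```

This is the kind of topology- and technology-aware information that scheduling and data-placement software would need the hardware to expose.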

4.3 Power Delivery and Thermals

  1. Hardware-software co-design can schedule how chiplets operate to avoid concentrating too much activity in one region, which would overburden power delivery.
  2. Heat dissipation is challenging: 3D stacking causes heat to be trapped in a smaller area. Research on software co-design may be an effective way to reduce the severity of the thermal challenge through intelligent scheduling and work placement across the different computational resources within the package.
  3. In a 3D stack of chiplets, power delivery and heat dissipation are in dynamic tension, making it challenging to determine the best stack organization and/or work placement across the layers of the stack. The most compute-intensive, highest-power work tends to be placed on the bottom chiplet of the stack to maximize the quality and reliability of power delivery. However, the bottom of the stack may be the worst place for heat dissipation, so an optimization for power delivery or for thermals tends to be suboptimal for the other.
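The power-vs-thermal tension described above can be captured in a toy placement score. The chiplet power numbers and linear distance penalties are illustrative assumptions; lower scores are better:

```python
# Sketch: ordering chiplets in a 3D stack under two competing objectives.
from itertools import permutations

chiplet_power = {"compute": 100, "cache": 20, "io": 10}  # watts (assumed)

def score(order, w_power, w_thermal):
    # order[0] is the bottom layer: closest to the package power bumps but,
    # in this toy model, farthest from the top-side cooling solution.
    n = len(order)
    total = 0.0
    for depth, name in enumerate(order):
        p = chiplet_power[name]
        total += w_power * p * depth              # penalty: far from power
        total += w_thermal * p * (n - 1 - depth)  # penalty: far from heatsink
    return total

best_for_power = min(permutations(chiplet_power), key=lambda o: score(o, 1, 0))
best_for_thermal = min(permutations(chiplet_power), key=lambda o: score(o, 0, 1))
print(best_for_power)    # hottest die at the bottom
print(best_for_thermal)  # hottest die at the top: the objectives disagree
```

With equal weights, every ordering scores the same total in this linear model, which is the tension in miniature: whatever helps one objective hurts the other by a comparable amount.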

4.4 Reliability

  1. While the computational prospects of this approach may be attractive in terms of performance and energy efficiency, it may introduce new challenges in reliability, availability, and serviceability (RAS). A fault within the package can cause one or more components inside it to lose functionality. If the fault is transient, the availability of the in-package system may be reduced. If the fault is permanent and cannot be repaired, disabled, or worked around, the entire package may have to be discarded, potentially including a large amount of otherwise functional silicon.
  2. Advanced packaging and stacking can likewise reduce serviceability: components that traditionally sat outside the package (and were therefore easier to maintain or replace) may now be tightly integrated inside it and no longer accessible.

4.5 Software

  1. A highly chipletized and heterogeneously-integrated system presents additional opportunities for software research.
  2. An important open problem is in determining an effective hardware-software interface to expose, describe, program, and manage all of the components and their relationships.
    1. The low-level software layers (operating system, hypervisor) need to enumerate all of the different types of chiplets and compute resources, as well as the memory components.
    2. To manage performance, job scheduling, and data placement, the software also needs some way to learn about and understand the characteristics of the interconnect interfaces between the components and the overall interconnect topology.
    3. A Silicon Cityscape may need to expose runtime monitoring capabilities to the software layers, such as reporting performance and utilization statistics, memory behavior, power consumption, thermal conditions, suspicious security activity, and so on.
  3. At the user and programmer level, additional research is needed to develop tools, compilers, programming models, frameworks, and more to effectively map higher-level problems and algorithms onto the low-level hardware capabilities provided by a particular Silicon Cityscape package.
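As a sketch of what such an enumeration interface might look like, the following toy descriptor lists chiplets and interconnect links in a package. Every name and field here is invented for illustration; no real firmware or OS interface is implied:

```python
# Sketch: a hypothetical package-topology descriptor for low-level software.
from dataclasses import dataclass, field

@dataclass
class Chiplet:
    name: str
    kind: str        # e.g. "cpu", "cache", "io", "dram"
    node_nm: int     # process node of this chiplet

@dataclass
class Link:
    a: str
    b: str
    tech: str        # "substrate", "bridge", "interposer", or "3d"
    gbps: float      # peak bandwidth of this die-to-die link

@dataclass
class Package:
    chiplets: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def bandwidth_between(self, x, y):
        # Sum bandwidth over all links connecting chiplets x and y.
        return sum(l.gbps for l in self.links if {l.a, l.b} == {x, y})

pkg = Package(
    chiplets=[Chiplet("ccd0", "cpu", 7), Chiplet("iod", "io", 12)],
    links=[Link("ccd0", "iod", "substrate", 64.0)],
)
print(pkg.bandwidth_between("ccd0", "iod"))
```

A scheduler or data-placement layer could query such a descriptor to decide where to run work and where to place data, which is exactly the information need described in items 1 and 2 above.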

V. CONCLUSIONS

The opportunities are immense for transformational research across disciplines and vertically throughout the hardware-software-application stack.
