The Case for Heterogeneous Multi-Core SoCs

When considering multi-core architectures, embedded designers face a choice: homogeneous or heterogeneous? That is, we are all looking for a way to exploit the parallelism made possible by today's chip densities, and in doing so we need to evaluate how best to take advantage of this technological capability.

Homogeneous MPSoCs (multi-processing SoCs) are generally composed of a large number (tens, hundreds, perhaps thousands) of replicated general-purpose processors or DSPs, and they promise a way to provide general-purpose multi-processing capability on a single chip. These architectures target computing loads that either vary over time (the typical use of a general-purpose processor) or are otherwise not well contained (as is the case for SoCs aimed at a very wide target market).

In the embedded world, however, much is known about the algorithmic and computational requirements before the MPSoC is designed. For example, in an MPSoC targeted at mobile handsets, functions such as wireless communication, audio processing, video decoding, image coding, GPS triangulation, and data encryption and decryption are all likely to reside on a single chip. The algorithms vary significantly from one function to the next, but each function has its own distinct computational sensitivities. This uniqueness, combined with the opportunity to implement multiple independent (yet cooperating) processors on one chip, is what gives rise to heterogeneous MPSoCs.

To clarify: general-purpose processors are optimized across a very wide set of benchmark algorithms, with the objective of providing good performance across the entire set. This broad-based approach, however, delivers less than optimal performance (at greater cost and power dissipation) for each individual algorithm.

Alternatively, imagine starting with a homogeneous MPSoC and then fine-tuning each individual processor to the explicit needs of the algorithm running on it (video decode, wireless baseband, and so on). Fine-tuning can be accomplished through three fundamental steps:
  1. First, trim from the processor architecture any features and functions that do not directly boost performance for the target algorithm. Doing so eliminates gates, reducing cost and power dissipation.
  2. Second, specialize each processor to the specific needs of its target algorithm by introducing specialized instructions, memory architectures, datapaths, and so on. Matching the architecture to the algorithm's computational needs can collapse whole sequences of basic instructions into a single execution of a specialized instruction, significantly improving each processor's "performance per clock cycle" on its target algorithm (and therefore further reducing power dissipation), usually at the cost of only a modest increase in gate count.
  3. Third, exploit the instruction-level and data-level parallelism available in each algorithm by adding the ability to dispatch multiple instructions (commonly known as VLIW) and to operate on multiple data elements (commonly known as SIMD) every clock cycle. These techniques can further improve performance per clock cycle in the performance-critical regions of the algorithm, to the point where entire loop bodies, or even entire loops, execute in a single clock cycle. Adding parallelism generally increases gate count, but the improved computational efficiency allows the lower clock frequency needed to meet real-time constraints, which directly reduces power dissipation.
As a result of optimizing each individual processor through these steps, your homogeneous MPSoC has been transformed into a highly efficient, application-specific heterogeneous MPSoC. What makes this practical is that most embedded systems contain a number of computationally intense algorithms that map easily onto individual processors, each highly tuned to one particular type of algorithm. Processor cores whose architectures have been tuned in this way are generally referred to as ASIPs (application-specific processors).
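To make steps 2 and 3 a little more concrete, the following plain-C sketch models how a specialized 8-way multiply-accumulate instruction could collapse the inner loop of a FIR-style filter. The asip_mac8 operation is purely hypothetical and stands in for a custom instruction a designer might define; it does not correspond to any vendor's real ISA, and the point is only the reduction in basic instructions needed per result.

```c
/*
 * Illustrative sketch only (portable C). "asip_mac8" models a HYPOTHETICAL
 * single-cycle custom instruction: an 8-way SIMD multiply-accumulate.
 */
#include <stdint.h>
#include <stdio.h>

#define TAPS 64

/* Baseline: on a general-purpose core, each tap costs loads, a multiply,
 * an add, and loop overhead -- several basic instructions per sample. */
static int64_t fir_scalar(const int16_t *x, const int16_t *h)
{
    int64_t acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* C model of the hypothetical fused operation: eight multiply-accumulates
 * that a tuned SIMD datapath could retire in a single clock cycle. */
static int64_t asip_mac8(int64_t acc, const int16_t *x, const int16_t *h)
{
    for (int k = 0; k < 8; k++)
        acc += (int32_t)x[k] * h[k];
    return acc;
}

/* With the custom instruction, the 64-tap loop shrinks from roughly 64
 * multiply/add pairs (plus loads and branches) to 8 fused operations. */
static int64_t fir_asip_model(const int16_t *x, const int16_t *h)
{
    int64_t acc = 0;
    for (int i = 0; i < TAPS; i += 8)
        acc = asip_mac8(acc, &x[i], &h[i]);
    return acc;
}

int main(void)
{
    int16_t x[TAPS], h[TAPS];
    for (int i = 0; i < TAPS; i++) { x[i] = (int16_t)(i + 1); h[i] = 2; }
    /* Both versions compute the same dot product; only the number of
     * instructions per result differs on the target hardware. */
    printf("scalar=%lld asip_model=%lld\n",
           (long long)fir_scalar(x, h), (long long)fir_asip_model(x, h));
    return 0;
}
```

On a general-purpose core the scalar version needs several instructions per tap; on an ASIP with the fused operation, the same work could in principle retire in about TAPS/8 cycles, which is why such tuning both raises throughput and permits a lower clock frequency.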

To see the trend toward heterogeneous multi-core architectures for yourself, look at the block diagrams of the latest SoCs targeted at the video, audio, image-processing, and wireless markets. In many of them you will find a general-purpose merchant processor (running the application and networking stack), a DSP (running the non-performance-critical signal-processing algorithms), and one or more ASIPs (often called "accelerators" or "coprocessors") dedicated to specific functions such as image or video coding and manipulation. Voilà: a heterogeneous MPSoC!

Of course, designing and programming several different processors on a single chip may seem a daunting task. After all, designing one processor is hard, right? Wouldn't designing multiple independent processors be really, really hard? That reasoning only holds if two important factors are ignored.

First, general-purpose processor design is difficult! This is due both to the broad set of benchmarks on which such a design must perform well and to the difficulty of demonstrating differentiation against offerings already shipping in the marketplace. Truly innovative architectures that perform significantly better than existing products have been hard to come by for some time now, an indication that general-purpose processor architectures and technology have largely matured. In fact, this is one of the reasons most general-purpose processor companies are themselves moving to multi-core solutions.

Thankfully, the three steps listed above do not require inventing any truly new processor architecture concepts. The process is considerably simpler, because you only need to optimize a processor architecture for a single type of algorithm, using architectural concepts that have been proven over the years. This greatly simplifies the design process itself and generally yields an uncomplicated yet efficient ASIP architecture.
Second, two types of commercial offerings are available today to simplify and automate the ASIP design process.

The first option is configurable processor IP. With these offerings, MPSoC designers customize (or configure) a template processor by adding specialized instructions, functional units, and other features. As instructions are added, the software tool-chain (SDK) adjusts to account for the new functionality.
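As a hedged illustration of what this can look like from the software side, the sketch below shows one common pattern: the SDK exposes a designer-added instruction to C code as an intrinsic, and the application keeps a portable fallback. The intrinsic name _my_sad8 and the MY_ASIP_HAS_SAD8 guard are invented for this example; real vendor tool-chains use their own headers and naming conventions.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Portable reference: sum of absolute differences over 8 pixels, the kind
 * of operation a video or imaging ASIP would fold into one instruction. */
static inline uint32_t sad8_reference(const uint8_t *a, const uint8_t *b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

static inline uint32_t sad8(const uint8_t *a, const uint8_t *b)
{
#ifdef MY_ASIP_HAS_SAD8            /* would be defined by the hypothetical SDK */
    return _my_sad8(a, b);         /* single designer-added instruction */
#else
    return sad8_reference(a, b);   /* plain C on any other target */
#endif
}

int main(void)
{
    uint8_t cur[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    uint8_t ref[8] = { 12, 18, 33, 40, 47, 66, 70, 79 };
    printf("SAD = %u\n", sad8(cur, ref));   /* 2+2+3+0+3+6+0+1 = 17 */
    return 0;
}
```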

Another option that is gaining momentum is the tool-based approach offered by a few EDA vendors. These tool-sets automate ASIP design by providing a specialized "architecture description" language for describing and modeling ASIPs. With such a toolset, an SoC architect can quickly model the characteristics of a specialized architecture and automatically generate both the RTL of the ASIP and the SDK used for software development. Because the architecture description language is broadly capable, the ASIP can be designed for maximum efficiency, free from the constraints of a predefined template.

With commercial options like these available, it becomes possible to treat specialized processor design as part of any SoC, even with the smallest design teams. In fact, given the quest for differentiation, the need for fast time-to-market, and aggressive performance, cost, and power requirements, many would argue that it is not merely a possibility but a necessity.


About the author:

Steve Cox, VP business development, Target Compiler Technologies


Translated by: 与非网 张敏

http://www.eefocus.com/SystemDesign/blog/09-05/170640_08e5c.html