[A-06] ARMv8/ARMv9: Cache Coherency Mechanisms (Final Part of the Cache Series)

奔跑的架构师

Article link: https://blog.csdn.net/timyu007/article/details/140737133

ver 0.2

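Preface

Picking up where the previous article left off: the earlier articles covered the basics of cache coherency and laid the groundwork for discussing ARM's coherency mechanisms. As those articles explained, ARM offers two ways to maintain cache coherency: software management, in which software operates on the cache directly, and hardware management, which relies on coherency units inside the CPU microarchitecture and inside the interconnect. To improve processor performance and free low-level programmers from manual work, hardware maintains cache coherency in most scenarios, so this article focuses on the hardware approach. It also briefly summarizes the programming topics related to cache coherency.

Main Text

1 Cache Coherency Mechanisms

We first discuss cache coherency between the cores of a multi-core architecture, and then coherency between the CPU and the other masters on the bus.

1.1 Coherency in Multi-Core Architectures

Before going into detail, consider the following two scenarios as a thought experiment:
(1) In a Linux system, a process has several threads that share a variable "i". While thread 1 of the process modifies or reads i on one PE-Core, another thread also wants to read or modify i. Which control modules does the ARM system use, and which policy or protocol does it follow to handle this?
(2) Likewise, in the Linux kernel, a global variable may be read and modified from both interrupt context and process context. Which modules does the ARM system use here, and which policy or protocol does it follow?
In both scenarios, the code that runs and the data it operates on are confined to the PE-Cores of the CPU, so we start with coherency inside the multi-core architecture.
Just as a person is born with social attributes and is protected and constrained by the larger framework of society, any discussion of ARM coherency is tied to the ARM system architecture, so it is worth spending a few words on that background. From the article on the multi-level cache architecture we already know that mainstream ARM processors come in two forms, Big.Little and DSU.Big.Little; we discuss each in turn.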

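1.1.1 Big.Little Architecture

Slightly older ARM processors mostly use the Big.Little form, as shown in Figure 1-1.

Figure 1-1 Typical ARM Big.Little system architecture

We quote the manual's description of Big.Little directly: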

We can distinguish between systems that contain:
• A single processor containing a single core.
• A multi-core processor, such as the Cortex-A53, with several cores capable of independent instruction execution, and can be externally viewed as a single unit or cluster, either by the system designer or by an operating system that can abstract the underlying resources from the application layer.
• Multiple clusters, in which each cluster contains multiple cores.
• ARM uses HMP to mean a system composed of clusters of application processors that are 100% identical in their instruction set architecture but very different in their microarchitecture. All the processors are fully cache coherent and a part of the same coherency domain.

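To summarize briefly:
(1) The Big.Little architecture groups a number of PE-Cores into one processing unit that attaches to the bus as a single master. Typically, the high-performance PE-Cores form the Big cluster and the low-power PE-Cores form the Little cluster.
(2) Both the Big and the Little cluster attach directly to a coherent interconnect such as the CCI-400, and they share the same view of memory: every PE-Core, whether high-performance or low-power, sees the same external memory. This shared view is very important. It means that memory configured with the Shareable attribute can be cached by all PE-Cores, which in turn requires the interconnect and the CPU microarchitecture to include additional circuitry for maintaining coherency.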

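1.1.2 Cache Coherency in Big.Little

Let us briefly review the multi-level cache architecture of Big.Little, shown in Figure 1-2.

Figure 1-2 Typical Big.Little multi-level cache architecture

Cache coherency inside a cluster
Taking one concrete cluster as an example, let us see how the caches inside "Cluster 0" are kept coherent, as shown in Figure 1-3.

Figure 1-3 Cortex-A53 system block diagram

In the figure, the A53 contains four PE-Cores, each with its own L1 data cache and all sharing the L2 cache. Inside the processor, ARM uses a dedicated functional unit, the SCU, to maintain cache coherency among the cores of the cluster. The manual describes it as follows: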

L2 memory system
The Cortex-A53 L2 memory system contains the L2 cache pipeline and all logic required to maintain memory coherence between the cores of the cluster. It has the following features:
• An SCU that connects the cores to the external memory system through the master memory interface. The SCU maintains data cache coherency between the cores and arbitrates L2 requests from the cores.
• When the Cortex-A53 processor is implemented with a single core, it still includes the Snoop Control Unit (SCU).

Snoop Control Unit
The Cortex-A53 processor supports between one and four individual cores with L1 Data cache coherency maintained by the SCU.
• The SCU is clocked synchronously and at the same frequency as the cores.
• The SCU maintains coherency between the individual data caches in the processor using ACE modified equivalents of MOESI state, as described in Data Cache Unit on page XX.
• The SCU contains buffers that can handle direct cache-to-cache transfers between cores without having to read or write any data to the external memory system. Cache line migration enables dirty cache lines to be moved between cores, and there is no requirement to write back transferred cache line data to the external memory system.
• Each core has tag and dirty RAMs that contain the state of the cache line. Rather than access these for each snoop request the SCU contains a set of duplicate tags that permit each coherent data request to be checked against the contents of the other caches in the cluster. The duplicate tags filter coherent requests from the system so that the cores and system can function efficiently even with a high volume of snoops from the system.
• When an external snoop hits in the duplicate tags a request is made to the appropriate core.

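Because the SCU is so important for cache coherency, we quoted the manual's description in full. In short:
(1) The A53 contains an SCU that synchronizes the cache state of the individual cores according to the MOESI protocol and thereby keeps the caches inside the cluster coherent.
(2) The basic unit of coherency maintenance is the cache line, and the caches involved are the L1 data caches.
(3) The SCU can detect and maintain coherency because of its core circuit design, shown in Figure 1-4: it keeps a duplicate copy of the cache-line tags and monitors the associated state bits to synchronize state between the caches in the cluster.

Figure 1-4 Core logic for coherency inside a cluster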

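Cache coherency between clusters
Returning to Figure 1-2, how is coherency maintained between Core-0 in Cluster 0 and Core-2 in Cluster 1? The SCU inside a cluster clearly cannot do this, so the task naturally falls to the coherent interconnect. Let us look at a concrete example, shown in Figure 1-5, to check whether this is the case.

Figure 1-5 Example based on the CCI-500

First, the manual's description of this example: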

In this example, slave interfaces S5 and S6 support the ACE protocol for connecting masters such as the Cortex-A53 or Cortex-A72 processors. The CCI-500 manages full coherency and data sharing between L1 and L2 caches of all connected processor clusters.

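From the manual we can see that the interconnect also has an SCU-like control unit, the snoop filter, which synchronizes cache state between clusters and thereby maintains cache coherency. The manual describes the snoop filter as follows: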

Snoop filter
The CCI-500 contains an inclusive snoop filter that records the addresses of data that is stored in the ACE master caches.
The snoop filter can respond to snoop transactions in the case of a miss, and snoop appropriate masters only in the case of a hit. Snoop filter entries are maintained by observing transactions from ACE masters to determine when entries have to be allocated and deallocated.
The snoop filter can respond to multiple coherency requests without it being necessary to broadcast to all ACE interfaces. For example, if the address is not in any cache, the snoop filter responds with a miss and directs the request to memory. If the address is in a processor cache, the request is considered a hit and is directed to the ACE port containing that address in its cache.
Arm recommends that you configure the snoop filter directory to be 0.75-1 times the total size of exclusive caches of processors that are attached to the CCI-500. The snoop filter is 8-way set associative and, to minimize conflicts, stores twice as many tags as the configured size. An example of a conflict is when the CCI-500 is unable to insert a new entry in an available position in the snoop filter. If a conflict occurs, an existing entry is evicted, and the snoop filter issues a CleanInvalid snoop to processors that might be holding the evicted lines. This type of eviction is known as a back-invalidation, and is expected to occur rarely if you configure the snoop filter size as Arm recommends.
The snoop filter is updated by monitoring transactions from the attached masters, that allocate and deallocate data into their caches. In the ACE protocol, the deallocation of clean data is indicated using the Evict transaction.
ACE uses a MOESI state machine for cross-cluster coherency.

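A brief summary (interconnect architecture is itself a large and complex topic that we do not expand on here; for programmers it is enough for now to understand how the interconnect is used to achieve cache coherency):
(1) The memory system of each cluster connects to the coherent interconnect through an ACE interface.
(2) The snoop filter on the interconnect monitors and manages the state of the ACE interfaces and thereby exchanges cache state between clusters.
(3) The state machine of the ACE interface also follows MOESI to keep the state of each cache line synchronized.
In this scenario coherency generally follows MOESI, although the details depend on the processor design.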
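
The filtering behaviour quoted above (answer a miss directly, snoop only the masters that hold the line) can be sketched as a toy directory. This is a minimal illustration and not how the CCI-500 is implemented: the class name `SnoopFilter`, its methods, the master names, and the 64-byte line size are all assumptions made for the example.

```python
from collections import defaultdict

LINE_SIZE = 64  # bytes per cache line; an assumed, typical value


class SnoopFilter:
    """Toy directory: maps a cache-line address to the masters caching it."""

    def __init__(self):
        self.directory = defaultdict(set)

    @staticmethod
    def line_of(addr):
        """Round an address down to the start of its cache line."""
        return addr // LINE_SIZE * LINE_SIZE

    def allocate(self, master, addr):
        """Record that `master` brought the line into its cache."""
        self.directory[self.line_of(addr)].add(master)

    def evict(self, master, addr):
        """Record that `master` dropped the line (compare the ACE Evict)."""
        line = self.line_of(addr)
        self.directory[line].discard(master)
        if not self.directory[line]:
            del self.directory[line]

    def snoop_targets(self, requester, addr):
        """Masters that must be snooped for this request.

        An empty set is a filter miss: the request can go straight to memory.
        """
        holders = self.directory.get(self.line_of(addr), set())
        return holders - {requester}


sf = SnoopFilter()
sf.allocate("cluster0", 0x1000)
sf.allocate("cluster1", 0x1010)              # same 64-byte line as 0x1000
print(sf.snoop_targets("cluster0", 0x1020))  # only cluster1 is snooped
print(sf.snoop_targets("cluster0", 0x2000))  # miss: nobody is snooped
```

The point of the directory is that a request for an uncached address never disturbs any cluster, which is what lets the interconnect avoid broadcasting every coherent request.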

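1.1.3 DSU.Big.Little Architecture

Here is a classic interconnect architecture for DSU-based Big.Little, shown in Figure 1-6.

Figure 1-6 Typical DSU-compatible interconnect architecture

Compared with Big.Little this looks much the same, but there is more to it. Zooming in on DSU-CLUSTER-0 gives a more concrete cluster, shown in Figure 1-7.

Figure 1-7 Typical DSU architecture, example 1

Figure 1-7 clearly shows the internal structure of a DSU-based cluster. A few words of explanation:
(1) At the interconnect level, the bus architecture has been upgraded with the new CHI interface specification, which allows more flexible attachment of CHI-compatible masters, including the core compute unit, the DSU master.
(2) Internally, the DSU has moved beyond the limits of the traditional cluster architecture and can accommodate more and more PE-Cores (the DSU-120 supports up to 14 cores in a single cluster), a clear difference from traditional Big.Little.

Configurable DSU cluster
The DSU can be configured flexibly with PE-Cores. We quote the manual directly: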

A cluster can be configured with up to three different types of cores in the same cluster. Each core type targeting different power efficiency and performance levels. This arrangement allows for an intermediate core that has an intermediate performance and efficiency level. The cluster also supports complexes.
A cluster can be configured in many arrangements. Examples of cluster arrangements are:
• One or more cores of the same type.
• Various arrangements of two types of cores. For example, one or more cores targeting either a high-performance level or a higher power efficiency level.
• Various arrangements of three types of cores. For example, one or more high-performance cores, power-efficient cores, and intermediate cores.
• One or more complexes and no individual cores.
• One or more complexes and individual cores.

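Note that the DSU architecture keeps evolving, and so do the number of PE-Cores it supports and the ways they can be configured. Here is another configuration to help readers understand the internal structure of a DSU-based cluster, shown in Figure 1-8 (the goal is mainly to give the background for the DSU-based cache architecture; the DSU itself is not discussed further here).

Figure 1-8 Typical DSU architecture, example 2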

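1.1.4 Cache Coherency in DSU.Big.Little

As with Big.Little, we first review the multi-level cache architecture of the DSU, shown in Figure 1-9.

Figure 1-9 Typical DSU-based multi-level cache architecture

Cache coherency inside a DSU.Cluster
Again we take a concrete DSU cluster, shown in Figure 1-10.

Figure 1-10 DSU.Cluster example

After working through Big.Little, the DSU should feel familiar. Referring to Figures 1-9 and 1-10, the Level-1 and Level-2 caches are private to each PE-Core, while the L3 cache is shared by all PE-Cores. Here too ARM uses an SCU to maintain coherency among the PE-Cores inside the DSU.Cluster.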

DSU-120
A DSU-120 DynamIQ™ cluster consists of between one and 14 cores, with up to three different types of cores in the same cluster. Cores can be configured for various performance points during macrocell implementation and run at different frequencies and voltages.
The DSU-120 DynamIQ™ cluster also supports complexes where typically two cores are linked together and share logic. Examples of shared logic include a floating-point unit and an L2 cache.
All cores in the DSU-120 DynamIQ™ cluster, including those in complexes, are coherently connected to an L3 memory system that includes an L3 cache and a Snoop Control Unit (SCU). The SCU maintains coherency between caches in the cores and the L3 cache, and includes a snoop filter to optimize coherency maintenance operations. The shared L3 cache simplifies process migration between the cores.

Coherency and snoop control
The DSU-120 has the following coherency and snoop control features:
• Snoop Control Unit (SCU) maintains coherency and consistency in the memory system internal to the cluster, and (optionally) external to the cluster.
• SCU includes a set of snoop filters, automatically sized, one for each cache slice.

Snoop Control Unit
The Snoop Control Unit (SCU) maintains coherency between all the data caches in the cluster.
The SCU contains buffers that can handle direct cache-to-cache transfers between cores without having to read or write data to the L3 cache. Cache line migration enables dirty lines to be moved between cores.
The SCU contains a set of snoop filters that track the addresses for locations cached in the core caches. Including the snoop filters means that the SCU does not need to request a look up in the core caches when it receives a coherent memory request. These snoop filters are accessed by the coherent requests from the other cores or from the system. If there is a simultaneous hit in the L3 tags and the SCU snoop filters, then the L3 cache normally provides the data in preference to a core. The size of the snoop filter is automatically determined from the configured number of cores and the cache sizes in those cores.

Cortex®-A710
To maintain data coherency between multiple cores, the Cortex®-A710 core uses the Modified Exclusive Shared Invalid (MESI) protocol.

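A brief summary:
(1) The DSU also contains an SCU, now extended with snoop filters, to implement cache coherency inside the DSU.Cluster.
(2) The basic unit of coherency maintenance is again the cache line, the caches involved are the L1 data caches, and the protocol followed is MESI.

Cache coherency between DSU.Clusters
How DSU.Clusters synchronize with each other is described in the manual as follows: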

Main memory master
The main memory master provides an interface between the DynamIQ™ Shared Unit-120 and the external interconnect. For a connection to an external coherent interconnect, the memory interface must be configured to use the AMBA 5 CHI (Issue E) protocol. For connection to a non-coherent external interconnect, the memory interface can either be configured to use the CHI protocol or AXI5 (Issue H) protocol. In either configuration, the interfaces are 256-bit wide, with support up to four bus master ports.

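From the manual's description, together with Figures 1-7 and 1-8, we can see that a DSU.Cluster attaches to the coherent interconnect through CHI and thereby shares the same view of memory. The manual's description of the CHI interface is shown in Figure 1-11:

Figure 1-11 CHI MESI specification

Given the earlier discussion of coherency between Big.Little clusters, coherency between DSU.Clusters should not be hard to follow. The key point is that the specifications followed differ slightly. Again, a brief summary:
(1) The memory system of a DSU.Cluster connects to the coherent interconnect through a CHI interface.
(2) CHI follows MESI and supports state synchronization between clusters (one could say that CHI absorbs the function of the snoop filters at the interconnect level), keeping the cache lines inside each cluster coherent. (The ARM manuals do not currently state explicitly that the DSU's Level-3 cache supports MESI, but in theory it should; interested readers can dig further.)
In this scenario coherency generally follows MESI, although the details depend on the processor design.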

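1.2 The MESI Coherency Protocol

The text above repeatedly mentions two protocols, MOESI and MESI. Taking MESI as the example:

MESI (Modified, Exclusive, Shared, Invalid) is an invalidate-based cache coherency protocol, also known as the Illinois protocol because it was developed at the University of Illinois at Urbana-Champaign. It is one of the most common protocols supporting write-back caches and is mainly used to manage the coherency of cached data in multiprocessor systems.
Definition: MESI is a protocol for maintaining the coherency of cached data in a multiprocessor system. It defines four states for a cache line (Modified, Exclusive, Shared, Invalid) together with the rules for transitions between them, so that all processors access shared data consistently.
Background: in modern computer systems the CPU computes far faster than memory can be accessed. To ease this mismatch, a cache is placed between the CPU and memory as a buffer. In a multiprocessor system, however, several CPUs may access the same memory address at the same time, which can leave their cached data inconsistent. MESI was designed to solve this problem.

In short, MESI is a protocol built on snooping and responding: every node observes the state changes of any one node and then switches its own state according to fixed rules. The four states are described below, and the state transitions are shown in Figure 1-12: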

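MESI has four states: M (Modified), E (Exclusive), S (Shared), and I (Invalid).
• M (Modified): the data in the current CPU's cache is valid, has been modified, and differs from main memory; it exists only in the current CPU's cache. When the CPU hits this cache line it only modifies the data in its own cache line; no bus transaction is generated and no other CPU needs to be notified.
• E (Exclusive): the data in the current CPU's cache is valid and matches main memory; it exists only in the current CPU's cache. If the data in the cache line is modified in this state, the state changes from Exclusive (E) to Modified (M). Again, no transaction is generated and no other CPU needs to be notified.
• S (Shared): the data in the current CPU's cache is valid and matches main memory; it exists in the caches of several CPUs. If the current CPU modifies the data in its cache line, that cache line changes from Shared (S) to Modified (M) and a bus transaction tells the other CPUs that their copies are invalid, so their cache lines change from Shared (S) to Invalid (I).
• I (Invalid): the data in the current CPU's cache is invalid.

Figure 1-12 MESI state transition diagram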
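
The four states and the transitions described above can be traced with a small simulation of a single cache line seen by several CPUs. This is a simplified sketch under stated assumptions: the class `MesiLine` and its `read`/`write` methods are made up for illustration, evictions are not modelled, and every miss or invalidation is counted as exactly one bus transaction.

```python
M, E, S, I = "M", "E", "S", "I"


class MesiLine:
    """State of one cache line in each of `n` CPU caches (toy model)."""

    def __init__(self, n):
        self.state = [I] * n
        self.bus_transactions = 0

    def read(self, cpu):
        if self.state[cpu] != I:
            return  # read hit: no state change, no bus traffic
        self.bus_transactions += 1  # read miss goes out on the bus
        others = [c for c in range(len(self.state))
                  if c != cpu and self.state[c] != I]
        for c in others:
            self.state[c] = S  # M (after write-back) and E both drop to S
        self.state[cpu] = S if others else E

    def write(self, cpu):
        if self.state[cpu] in (M, E):
            self.state[cpu] = M  # silent upgrade: no bus traffic
            return
        self.bus_transactions += 1  # S or I: other copies must be invalidated
        for c in range(len(self.state)):
            if c != cpu:
                self.state[c] = I
        self.state[cpu] = M


line = MesiLine(2)
line.read(0)    # CPU0: I -> E (nobody else holds the line)
line.write(0)   # CPU0: E -> M, silently
line.read(1)    # CPU0: M -> S, CPU1: I -> S
line.write(1)   # CPU1: S -> M, CPU0: S -> I
print(line.state, line.bus_transactions)  # ['I', 'M'] 3
```

Note that the write in E or M costs no bus transaction, while the write in S does; this is the property the state descriptions above emphasize.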

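There is a great deal of material on MESI; interested readers can search the Internet or consult the Reference section of this article. Only a brief summary is given here:
(1) Implementing this kind of distributed state management requires additional hardware: both the interconnect linking the nodes (CPU cluster, GPU, ADSP...) with its snoop filter, CHI, ACE and so on, and the nodes themselves (SCU, snoop filter, cache format...) need extra state bits to record the state, as shown in Figure 1-13.

Figure 1-13 Cortex-A710 tag format

(2) State changes are linked: when the state of one node changes for any of the reasons listed with Figure 1-12, the other nodes snoop the change and update their own state accordingly.
(3) The O state of MOESI: this state is added by the MOESI protocol. It indicates that the data in this cache line is the most recent copy in the system, while other processors are allowed to cache a shared copy of it (in the S state). This allows parallel access to the data while preserving coherency.
(4) MESI is not the only coherency protocol. Whether a chip supports MESI or some other protocol depends on the specific hardware implementation.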
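
The practical difference the O state makes shows up when another cache reads a line that one cache holds in M. The sketch below encodes only that one transition, in its textbook form; the function name `remote_read_of_modified` and the returned tuple layout are assumptions for illustration, and real implementations (including ARM's ACE variants) may differ in detail.

```python
def remote_read_of_modified(protocol):
    """Outcome when another cache reads a line the owner holds in state M.

    Returns (owner_state, requester_state, memory_write_back).
    """
    if protocol == "MESI":
        # The dirty data is written back to memory; both copies end up clean.
        return ("S", "S", True)
    if protocol == "MOESI":
        # The owner keeps the dirty line in O and supplies the data itself,
        # so memory does not have to be updated at this point.
        return ("O", "S", False)
    raise ValueError(f"unknown protocol: {protocol}")


for name in ("MESI", "MOESI"):
    print(name, remote_read_of_modified(name))
```

Deferring the write-back is the reason MOESI can reduce memory traffic for data that is written by one core and then read by others.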

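1.3 Coherency Between the CPU Master and Other Masters

For a classic ARM-based SoC architecture, shown in Figures 1-14 and 1-15, the cases can be classified as follows:
(1) Non-shared memory: owned exclusively by one PE-Core, so no cache coherency problem arises.
(2) Memory shared inside the CPU: this is cache coherency between PE-Cores, discussed above and handled essentially by additional hardware.
(3) Memory shared between the cluster and other masters through the memory attribute (Outer Shareable), for example a GPU, VPU, or DPU sharing memory with the PE-Cores. Whether or not these coprocessors outside the PE-Cores cache memory themselves, the PE-Cores do, so cache coherency is again an issue.
(4) Memory shared between the PE-Cores and peripherals through a DMA controller or similar. Although these peripherals are not managed by the cache-coherent interconnect, the PE-Cores cache the memory, so cache coherency is again an issue.

Figure 1-14 Typical ARM-based SoC architecture

Figure 1-15 Typical ARM-based memory shareability configuration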

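This part is more complicated. We discuss it using concrete scenarios:
(1) Configure the shared memory as non-cacheable. All worries disappear: every node accesses main memory directly, so there is no coherency problem, but performance suffers.
(2) The PE master and the coprocessor master (for example a GPU) both support the full cache-coherency features and specification. Hardware then maintains coherency and performance is best, but processor design complexity and cost are high.
(3) The coprocessor (for example a GPU or DMA engine) supports only one-way snooping of the PE-Core caches, as the manual describes: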

The AMBA 4 ACE-Lite interface is a subset of the full interface, designed for one-way IO coherent system masters such as DMA engines, network interfaces, and GPUs.

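In this case, maintaining cache coherency requires the driver of the corresponding controller to take part. Taking DMA as the example:
a) When DMA copies data from a device to main memory, the driver invalidates the corresponding CPU cache lines. When the CPU accesses the data after the copy completes, it takes a cache miss and must read the latest data from main memory, so there is no coherency problem.
b) When DMA copies data from main memory to a device, the driver first cleans (flushes) the cache lines modified by the CPU, writing the latest data back to main memory so that main memory holds the current data. DMA then copies the data from main memory to the device, and again there is no cache coherency problem.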
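
Cases a) and b) can be reproduced with a toy model of one write-back CPU cache in front of main memory and a DMA engine that bypasses the cache. It is only a sketch of the reasoning, not driver code: the class `ToySystem`, its method names, and the addresses are invented for the example, and a real driver would use the cache maintenance facilities of its operating system instead.

```python
class ToySystem:
    """One CPU cache in front of main memory, plus a non-coherent DMA engine."""

    def __init__(self):
        self.memory = {}  # address -> value
        self.cache = {}   # address -> (value, dirty)

    # CPU side: all accesses go through the write-back cache.
    def cpu_read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = (self.memory.get(addr, 0), False)
        return self.cache[addr][0]

    def cpu_write(self, addr, value):
        self.cache[addr] = (value, True)

    # Cache maintenance: the part the driver is responsible for.
    def clean(self, addr):
        """Write a dirty line back to memory and keep it cached."""
        if addr in self.cache and self.cache[addr][1]:
            value = self.cache[addr][0]
            self.memory[addr] = value
            self.cache[addr] = (value, False)

    def invalidate(self, addr):
        """Drop the cached copy without writing it back."""
        self.cache.pop(addr, None)

    # DMA side: reads and writes main memory only.
    def dma_from_device(self, addr, value):
        self.memory[addr] = value

    def dma_to_device(self, addr):
        return self.memory.get(addr, 0)


soc = ToySystem()

# Case a) device -> main memory
soc.cpu_read(0x100)                # CPU caches the old value 0
soc.dma_from_device(0x100, 42)
stale = soc.cpu_read(0x100)        # cache hit: still the old value
soc.invalidate(0x100)
fresh = soc.cpu_read(0x100)        # cache miss: refetched from memory

# Case b) main memory -> device
soc.cpu_write(0x200, 7)            # new data is dirty in the cache only
before = soc.dma_to_device(0x200)  # device sees the old memory content
soc.clean(0x200)
after = soc.dma_to_device(0x200)   # device sees the CPU's data

print(stale, fresh, before, after)  # 0 42 0 7
```

The `stale` and `before` values show what goes wrong when the maintenance step is skipped; `fresh` and `after` show the result once the driver invalidates or cleans at the right moment.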

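2. Cache Mechanisms and Programming

2.1 The Cache and Upper-Layer Software

• In an SMP architecture the cache is transparent to software, so software design rarely has to consider cache coherency. The cache is not free for software, however: coherency traffic costs time and affects performance.
• Besides the device-driver involvement described above for coherency between multiple masters, there are other topics, which are introduced briefly here.

2.2 Points to Watch When Programming with Caches

2.2.1 Cache Alignment

In Linux, for example, commonly used kernel data structures are usually aligned to the L1 cache line. Data structures such as mm_struct and fs_cache create their slab cache descriptors with the "SLAB_HWCACHE_ALIGN" flag; see the proc_caches_init() function. Some rules for getting the most out of the cache:
• The smallest unit exchanged between the cache and memory is the cache line. If a structure is not aligned to a cache line, it may end up spanning several cache lines.
• Pad structures where necessary. Otherwise two structures C1 and C2 that are not cache-line aligned when cached in the L1 cache may share one cache line, with the tail of C1 and the head of C2 in the same line.
• Avoid false sharing. This occurs when two variables in one structure (say a and b) do end up in the same cache line, but PE-Core0 keeps operating on a in one thread while PE-Core1 keeps operating on b in another thread. This wastes a great deal of coherent-interconnect bandwidth and causes cache thrashing.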
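
The alignment rules above come down to simple address arithmetic, sketched here. The helper names `lines_touched`, `align_up`, and `share_a_line` are hypothetical, and the 64-byte line size is an assumption that should be checked against the target core.

```python
CACHE_LINE = 64  # bytes; an assumed value, check the target core's documentation


def lines_touched(offset, size, line=CACHE_LINE):
    """Number of cache lines covered by `size` bytes starting at `offset`."""
    first = offset // line
    last = (offset + size - 1) // line
    return last - first + 1


def align_up(offset, line=CACHE_LINE):
    """Round `offset` up to the next cache-line boundary."""
    return (offset + line - 1) // line * line


def share_a_line(off_a, off_b, line=CACHE_LINE):
    """True if two fields at these offsets fall in the same cache line."""
    return off_a // line == off_b // line


# A 32-byte structure placed at offset 48 straddles two lines;
# moved to the next boundary it fits in one.
print(lines_touched(48, 32))            # 2
print(lines_touched(align_up(48), 32))  # 1

# Fields a (offset 0) and b (offset 8) share a line: a false-sharing candidate.
# Padding b out to the next line boundary separates them.
print(share_a_line(0, 8))               # True
print(share_a_line(0, align_up(8)))     # False
```

Padding trades memory for fewer coherency transactions, so it is usually worth applying only to fields that different cores update frequently.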

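2.2.2 Memory Barriers

With cache coherency in place, a resource modified by one PE-Core is quickly propagated to the other PE-Cores. Two situations therefore need attention:
(1) On modern processors, instructions may be reordered, by the compiler at build time and by the core at run time, to get the best performance. If the reordered instructions operate on shared memory, subtle bugs can appear. In this case a memory barrier is needed to protect the resource. (Memory barriers are not discussed here; they will be covered in a later article.)
(2) When writing code, shared resources must be protected by locking, by atomic operations, by declaring variables as sensitive (for example volatile), and by similar means. This is the area programmers know best. (The principles of locking will be covered in a dedicated Linux-based article later; it is a very interesting topic.)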
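
The locking pattern from point (2) can be sketched as follows. This only illustrates the pattern of guarding a read-modify-write on shared data; the `Counter` class and `run` helper are invented for the example, and the Python runtime hides the memory-ordering details that a lock implementation on an ARM target would have to handle with barriers or exclusive instructions.

```python
import threading


class Counter:
    """A shared counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # only one thread at a time runs the read-modify-write
            self.value += 1


def run(threads=4, n=10_000):
    """Increment one shared counter from several threads and return the total."""
    counter = Counter()

    def worker():
        for _ in range(n):
            counter.increment()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


print(run())  # 40000: no update is lost because every increment holds the lock
```

Without the lock, the load, add, and store of `self.value += 1` from different threads could interleave and lose updates, which is the software-level counterpart of the coherency problem discussed in this article.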

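Conclusion

This article discussed cache coherency in various scenarios from the perspective of the interconnect architecture, memory attributes, the CPU architecture, and the CPU microarchitecture. Because hardware support for cache coherency in today's ARM systems is already very complete, and for reasons of space, only the basic principles were covered. Cache coherency is still not 100% transparent to programmers, and some scenarios require software involvement; this was also outlined. Interested readers can study further with the help of the other articles in this series.

This largely concludes the series on ARM caches. Later there may be occasional articles with corrections or deeper analysis of this series. My ability and energy are limited and I write in my spare time, so frankly many points have not been explored in depth; comments and corrections are very welcome. As the saying goes, among any three people walking together there is one who can be my teacher. The next series will probably cover ARM memory topics. Please stay tuned, and many thanks.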

Reference

[01] <DDI0487K_a_a-profile_architecture_reference_manual.pdf>
[02] <DEN0024A_v8_architecture_PG.pdf>
[03] <80-LX-MEM-yk0008_CPU-Cache-RAM-Disk关系.pdf>
[04] <80-ARM-ARCH-HK0001_一文搞懂CPU工作原理.pdf>
[05] <80-ARM-MM-Cache-wx0003_Arm64-Cache.pdf>
[06] <80-ARM-MM-HK0002_一文搞懂cpu-cache工作原理.pdf>
[07] <80-MM-yd0001_Caches-From-a-Mostly-OS-Software-Perspective.pdf>
[08] <80-MM-yd0002_Improving-Kernel-Performance-by-Unmapping-the-Page-Cache.pdf>
[09] <arm_cortex_a710_core_trm_101800_0201_07_en.pdf>
[10] <DDI0608B_a_armv9a_supplement_RETIRED.pdf>
[11] <arm_cortex_a520_core_trm_102517_0003_06_en.pdf>
[12] <arm_cortex_a720_core_trm_102530_0002_05_en.pdf>
[13] <79-LX-LK-z0002_奔跑吧Linux内核-V-2-卷1_基础架构.pdf>
[14] <80-ARM-MM-Cache-wx0001_Cache多核之间的一致性MESI.pdf>
[15] <80-ARM-MM-Cache-wx0002_深度学习armv8_armv9_cache的原理.pdf>
[16] <80-ARM-MM-Cache-ym0001_带着几个疑问-从Cache的应用场景学起.pdf>
[17] <80-ARM-MM-Cache-ym0002_Cache是如何工作的-概念以及工作过程.pdf>
[18] <80-ARM-MM-Cache-ym0003_多核多Cluster多系统之间的缓存一致性.pdf>
[19] <DDI0500J_cortex_a53_trm.pdf>
[20] <DDI0488H_cortex_a57_mpcore_trm.pdf>
[21] <cortex_a72_mpcore_trm_100095_0003_06_en.pdf>
[22] <corelink_cci550_cache_coherent_interconnect_technical_reference_manual_100282_0100_01_en.pdf>
[23] <80-ARM-DyIQ-wx0001_ARM架构系列(2)-DynamIQ技术.pdf>
[24] <ARM_DynamIQ_The_future_of_multi-core_computing.pdf>
[25] <cortex_a72_mpcore_trm_100095_0003_06_en.pdf>
[26] <arm_cortex_a710_core_trm_101800_0201_07_en.pdf>
[27] <DEN0013D_cortex_a_series_PG.pdf>
[28] <DDI0329L_l220_cc_r1p7_trm.pdf>
[29] <arm_dsu_120_trm_102547_0201_07_en.pdf>
[30] <80-Cache-MESI-yd0001_Cache_coherency_controller_for_MESI_protocol_based.pdf>
[31] <80-Cache-MESI-yd0002_cache-coherence.pdf>
[32] <80-Cache-MESI-yd0003_Cache-coherence-in-shared-memory-architectures.pdf>
[33] <80-Cache-MESI-yd0004_Designing-Predictable-Cache-Coherence-Protocols-for-Multi-Core-Real-Time-Systems.pdf>

Glossary

SRAM - Static Random-Access Memory
DRAM - Dynamic Random Access Memory
SSD - Solid state disk
HDD - Hard Disk Drive
SOC - System on a chip
AMBA - Advanced Microcontroller Bus Architecture
TLB - Translation Lookaside Buffer
VIVT - Virtual Index Virtual Tag
PIPT - Physical Index Physical Tag
VIPT - Virtual Index Physical Tag
AHB - Advanced High-performance Bus
ASB - Advanced System Bus
APB - Advanced Peripheral Bus
AXI - Advanced eXtensible Interface
DSU - DynamIQ Shared Unit
ACE - AXI Coherency Extensions
CHI - Coherent Hub Interface
CCI - Cache Coherent Interconnect
ADB - AMBA Domain Bridge
CMN - Coherent Mesh Network
MESI - Modified, Exclusive, Shared, Invalid
