ver 0.2
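1. Coherency Fundamentals
The description above still views the SoC from the hardware (bus) side. From a program's point of view, however, these Masters are tied together by the memory model: AMBA, CHI, ACE, the interconnect bus and so on are all transparent. When writing boot code, BSP code, or application code, programmers deal only with memory addresses and address spaces. This makes the Memory Master a key node. Other Masters, such as an NPU or ADSP, can be trimmed out of some embedded devices (a phone watch, for example), but memory has to be kept no matter what.
Against this background, and from the earlier articles, we know that caches were introduced to speed up the CPU and the various coprocessors by keeping a copy of part of the memory contents in the cache. Like most things, caching is a trade-off: you enjoy its benefits and you also have to live with its headaches. When we discussed cache policies earlier, for example on writes, we saw that even on a cache hit there is the Write-Back policy: the PE-Core is done as soon as the data reaches the cache, and the Cache Controller updates the corresponding data in memory at a suitable later time. So what happens if another PE-Core or another Master accesses that data in memory during this window? This leads to today's topic: cache coherency.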
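The write-back window described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: one memory word, one private cache line per core, and no coherency hardware at all. The names `cache_line_t`, `core_store`, `core_load`, `write_back`, and `observe_from_other_core` are hypothetical and do not correspond to any real hardware interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one shared memory word, one private write-back line per core. */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t data;
} cache_line_t;

static uint32_t main_memory;   /* the single shared memory location */

/* Write-back store: only the cache line is updated and marked dirty. */
static void core_store(cache_line_t *line, uint32_t value)
{
    line->valid = true;
    line->dirty = true;
    line->data  = value;
}

/* Load: hit in the private line if valid, otherwise fill from memory. */
static uint32_t core_load(cache_line_t *line)
{
    if (!line->valid) {
        line->data  = main_memory;
        line->valid = true;
        line->dirty = false;
    }
    return line->data;
}

/* What the cache controller does "at a suitable later time". */
static void write_back(cache_line_t *line)
{
    if (line->valid && line->dirty) {
        main_memory = line->data;
        line->dirty = false;
    }
}

/* Returns the value core 1 observes after core 0 has stored 2. */
static uint32_t observe_from_other_core(bool write_back_first)
{
    cache_line_t core0 = { false, false, 0 };
    cache_line_t core1 = { false, false, 0 };

    main_memory = 1;
    core_store(&core0, 2);
    assert(core0.dirty);            /* memory still holds the old value 1 */
    if (write_back_first)
        write_back(&core0);
    return core_load(&core1);       /* core 1 misses and fills from memory */
}
```

In this model `observe_from_other_core(false)` yields the old value 1, while `observe_from_other_core(true)` yields 2. The gap between those two cases is exactly the window that a coherency mechanism has to close.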
1.1 The Concept of Cache Coherency
Coherency means ensuring that all processors or bus masters within a system have the same view of shared memory. It means that changes to data held in the cache of one core are visible to the other cores, making it impossible for cores to see stale or old copies of data. This can be handled by simply not caching, that is disabling caches for shared memory locations, but this typically has a high performance cost.
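(1) Ensure that data shared between threads always stays consistent and causes no logic errors, so that data updated by a producer thread reaches the consumer thread in time. (Data coherency between PE-Cores)
(2) Ensure that the layer data maintained by the driver running on the CPU matches the data in the DPU, so that the display does not show a corrupted image. (Data coherency between Masters on the bus)
1.2 Scenarios Where Coherency Must Be Maintained
(1) Cache coherency between the PE-Cores inside a Cluster: how the PE-Cores within one Cluster keep their caches coherent under a big.LITTLE or DSU architecture. (Figure 1-1: the view among the 4 PE-Cores of the high-performance Cluster in Master1)
(2) Cache coherency between Clusters: how the caches of PE-Cores in different Clusters stay coherent under a big.LITTLE or DSU architecture, for example the cache of an A53 core in Cluster-0 and the cache of an A57 core in Cluster-1. (Figure 1-1: the view between the 4 PE-Cores of the high-performance Cluster and the 4 PE-Cores of the low-power Cluster in Master1)
(3) Coherency between the CPU and other Masters, for example data coherency between the GPU and the CPU, or between the data in a USB device and the data in its driver. (Figure 1-1: the view between Master1 and Master2, or between Master1 and Master7)
1.3 Ways to Maintain Coherency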
There are three mechanisms to maintain coherency:
Disable caching
This is the simplest mechanism but might cost significant core performance. To get the highest performance processors are pipelined to run fast, and to run from caches that offer a very low latency. Caching of data that is accessed multiple times increases performance significantly and reduces DRAM accesses and power. Marking data as “non-cached” could impact performance and power.
Software managed coherency
Software managed coherency is the traditional solution to the data sharing problem. Here the software, usually device drivers, must clean or flush dirty data from caches, and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.
Where there are high rates of sharing between requesters the cost of software cache maintenance can be significant, and can limit performance.
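To get a feel for why this cost grows with the amount of shared data, the sketch below counts how many per-line maintenance operations a buffer needs. It assumes maintenance is done by address, one cache line at a time, and that the line size is decoded from the `DminLine` field (bits [19:16]) of a `CTR_EL0` value, which holds log2 of the line size in 4-byte words. The helper names `dcache_line_bytes` and `maintenance_ops_for_range` are hypothetical, and the code only does arithmetic; it does not issue any cache instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the smallest data cache line size, in bytes, from a CTR_EL0 value.
 * DminLine (bits [19:16]) is log2 of the line size in 4-byte words. */
static size_t dcache_line_bytes(uint64_t ctr_el0)
{
    unsigned dminline = (unsigned)((ctr_el0 >> 16) & 0xF);
    return (size_t)4 << dminline;
}

/* Number of per-line operations needed to cover [addr, addr + size),
 * given that each by-address operation acts on one whole cache line. */
static size_t maintenance_ops_for_range(uintptr_t addr, size_t size, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);   /* power of two */
    if (size == 0)
        return 0;

    uintptr_t first = addr & ~(uintptr_t)(line - 1);
    uintptr_t last  = (addr + size - 1) & ~(uintptr_t)(line - 1);
    return (size_t)((last - first) / line) + 1;
}
```

With a 64-byte line, a line-aligned 1 MiB buffer needs 16384 operations per clean or invalidate pass, and a 64-byte buffer that straddles a line boundary already needs two. Each pass is repeated every time ownership of the buffer changes hands, which is where the cycles, bandwidth, and power go.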
Hardware managed coherency
Hardware maintains coherency between level 1 data caches within a cluster. A core automatically participates in the coherency scheme when it is powered up, has its D-cache and MMU enabled, and an address is marked as coherent.
However, this cache coherency logic does NOT maintain coherency between data and instruction caches.
In the ARMv8-A architecture and associated implementations, there are likely to be hardware managed coherent schemes. These ensure that any data marked as shareable in a hardware coherent system has the same value seen by all cores and bus masters in that shareability domain. This adds some hardware complexity to the interconnect and to clusters, but greatly simplifies the software and enables applications that would otherwise not be possible using only software coherency.
▪ Disable caching
▪ Software-managed coherency
▪ Hardware-managed coherency
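As a contrast to the software approach, the following toy model sketches what hardware-managed coherency does on the program's behalf. It is a generic write-invalidate snooping scheme for a single shared location, chosen for brevity; it is an assumption for illustration and not the actual protocol of any Arm core or interconnect. All names (`system_t`, `snoop_others`, `coherent_load`, `coherent_store`, `coherent_demo`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 2

typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t data;
} line_t;

typedef struct {
    uint32_t memory;             /* one shared memory location */
    line_t   cache[NUM_CORES];   /* one private line per core  */
} system_t;

/* Snoop the other cores: dirty copies are written back to memory,
 * and all other copies are invalidated if the requester wants to write. */
static void snoop_others(system_t *s, int requester, bool invalidate)
{
    for (int i = 0; i < NUM_CORES; i++) {
        if (i == requester || !s->cache[i].valid)
            continue;
        if (s->cache[i].dirty) {
            s->memory = s->cache[i].data;
            s->cache[i].dirty = false;
        }
        if (invalidate)
            s->cache[i].valid = false;
    }
}

static uint32_t coherent_load(system_t *s, int core)
{
    assert(core >= 0 && core < NUM_CORES);
    line_t *l = &s->cache[core];
    if (!l->valid) {
        snoop_others(s, core, false);   /* pick up any dirty copy first */
        l->data  = s->memory;
        l->valid = true;
        l->dirty = false;
    }
    return l->data;
}

static void coherent_store(system_t *s, int core, uint32_t value)
{
    assert(core >= 0 && core < NUM_CORES);
    snoop_others(s, core, true);        /* no stale copies may survive */
    s->cache[core] = (line_t){ true, true, value };
}

/* Core 0 stores 2 with write-back; core 1 then loads the same location. */
static uint32_t coherent_demo(void)
{
    system_t s = { .memory = 1 };
    coherent_store(&s, 0, 2);
    return coherent_load(&s, 1);
}
```

Here `coherent_demo()` returns 2 even though core 0 never executed an explicit clean: the snoop performed on behalf of core 1 retrieves the dirty data. Software sees one consistent value and pays for it in extra hardware traffic.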
1.4 Cache Maintenance Mechanisms (Software-Managed)
It is sometimes necessary for software to clean or invalidate a cache.
• Invalidation of a cache or cache line means to clear it of data, by clearing the valid bit of one or more cache lines. The cache must always be invalidated after reset as its contents are undefined. This can also be viewed as a way of making changes in the memory domain outside the cache visible to the user of the cache.
• Cleaning a cache or cache line means writing the contents of cache lines that are marked as dirty, out to the next level of cache, or to main memory, and clearing the dirty bits in the cache line. This makes the contents of the cache line coherent with the next level of the cache or memory system. This is only applicable for data caches in which a write-back policy is used. This is also a way of making changes in the cache visible to the user of the outer memory domain, but is only available for data cache.
• Zero. This zeroes a block of memory within the cache, without the need to first of all read its contents from the outer domain. This is only available for data cache.
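The three operations above can be modelled on a single write-back data cache line to show what each one does to the valid and dirty bits. This is a minimal sketch, not real maintenance code: the "device" is simply a direct write to the backing word, and treating a zeroed line as valid and dirty is a modelling assumption. The names `dline_t`, `cpu_write`, `cpu_read`, `line_invalidate`, `line_clean`, `line_zero`, and `dma_round_trip` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one write-back data cache line in front of one memory word. */
typedef struct {
    bool      valid;
    bool      dirty;
    uint32_t  data;
    uint32_t *backing;   /* the word in the outer memory domain */
} dline_t;

static void cpu_write(dline_t *l, uint32_t v)
{
    l->data  = v;
    l->valid = true;
    l->dirty = true;
}

static uint32_t cpu_read(dline_t *l)
{
    assert(l->backing != 0);
    if (!l->valid) {                 /* miss: fill from the outer domain */
        l->data  = *l->backing;
        l->valid = true;
        l->dirty = false;
    }
    return l->data;
}

/* Invalidate: clear the valid bit; any cached contents are discarded. */
static void line_invalidate(dline_t *l)
{
    l->valid = false;
    l->dirty = false;
}

/* Clean: write dirty contents out and clear the dirty bit. */
static void line_clean(dline_t *l)
{
    if (l->valid && l->dirty) {
        *l->backing = l->data;
        l->dirty = false;
    }
}

/* Zero: zero the line in the cache without reading the outer domain. */
static void line_zero(dline_t *l)
{
    l->data  = 0;
    l->valid = true;
    l->dirty = true;
}

/* Driver-style round trip; returns true if every observer saw fresh data. */
static bool dma_round_trip(void)
{
    uint32_t memory = 0;
    dline_t  line   = { false, false, 0, &memory };

    cpu_write(&line, 7);
    line_clean(&line);                       /* make CPU data visible outside */
    bool device_saw_cpu_data = (memory == 7);

    memory = 9;                              /* device writes behind the cache */
    line_invalidate(&line);                  /* drop the stale copy */
    bool cpu_saw_device_data = (cpu_read(&line) == 9);

    line_zero(&line);                        /* no read of memory needed */
    bool zero_only_in_cache = (cpu_read(&line) == 0) && (memory == 9);
    line_clean(&line);

    return device_saw_cpu_data && cpu_saw_device_data
        && zero_only_in_cache && (memory == 0);
}
```

The ordering in `dma_round_trip` is the point: clean before the other master reads, invalidate before the CPU reads what the other master wrote. Skipping either step leaves one side looking at stale data.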
1.5 Data Under Coherency Management (Memory Attributes)
Data memory accesses can take longer and consume more power with cache coherency hardware than they otherwise would do. This overhead can be minimized by maintaining coherency between a smaller number of masters while ensuring that they are physically close in the processor. For this reason, the architecture splits the system into domains, and makes it possible to limit the overhead to those locations where the coherency is required.
Software must define which address regions are to be used by which group of masters, that is which other masters are sharing this address, by creating appropriate translation table entries. For Normal cacheable regions, this means setting the shareable attribute to one of Non-shareable, Inner Shareable, or Outer Shareable. For non-cacheable regions, the shareable attribute is ignored.
Locations marked as Normal also have cacheability and shareability attributes. The cacheability attributes control whether a location can be cached. If a location can be cached, the shareability attributes control which other agents need to see a coherent copy of the memory. This allows for some complex configuration, which is beyond the scope of this guide. However, Arm expects operating systems to mark the majority of DRAM memory as Normal Write-back cacheable, Inner shareable.
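As a rough illustration of how these attributes end up in a translation table entry, the sketch below builds the lower attribute bits of an AArch64 stage 1 block or page descriptor. It assumes the standard descriptor layout in which AttrIndx occupies bits [4:2], SH occupies bits [9:8], and AF is bit 10 (with FEAT_LPA2 enabled, the SH bits are used differently, so that case is out of scope). The helper names `mair_set` and `lower_attrs` and the choice of attribute index 1 for Write-Back memory are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SH[1:0] encodings in a stage 1 block/page descriptor (0x1 is reserved). */
enum shareability {
    SH_NON   = 0x0,   /* Non-shareable   */
    SH_OUTER = 0x2,   /* Outer Shareable */
    SH_INNER = 0x3    /* Inner Shareable */
};

/* Attribute byte encodings for MAIR_ELx. */
#define MAIR_DEVICE_nGnRnE  0x00u   /* Device-nGnRnE */
#define MAIR_NORMAL_NC      0x44u   /* Normal, Inner and Outer Non-cacheable */
#define MAIR_NORMAL_WB      0xFFu   /* Normal, Inner and Outer Write-Back,
                                       Read-Allocate, Write-Allocate */

/* Place an attribute byte into slot `index` (0..7) of a MAIR value. */
static uint64_t mair_set(uint64_t mair, unsigned index, uint8_t attr)
{
    assert(index < 8);
    mair &= ~((uint64_t)0xFF << (index * 8));
    return mair | ((uint64_t)attr << (index * 8));
}

/* Lower attributes of a descriptor: which MAIR slot, which domain. */
static uint64_t lower_attrs(unsigned attr_index, enum shareability sh)
{
    assert(attr_index < 8);
    return ((uint64_t)attr_index << 2)   /* AttrIndx[2:0], bits [4:2] */
         | ((uint64_t)sh << 8)           /* SH[1:0], bits [9:8]       */
         | ((uint64_t)1 << 10);          /* AF (Access Flag), bit 10  */
}
```

For example, `mair_set(0, 1, MAIR_NORMAL_WB)` puts the Write-Back encoding in slot 1, and `lower_attrs(1, SH_INNER)` then describes a page as Normal Write-Back cacheable, Inner Shareable, which is the combination the text above says operating systems are expected to use for most of DRAM.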
SRAM - Static Random-Access Memory
DRAM - Dynamic Random Access Memory
SSD - Solid state disk
HDD - Hard Disk Drive
SOC - System on a chip
AMBA - Advanced Microcontroller Bus Architecture
TLB - Translation Lookaside Buffer
VIVT - Virtual Index Virtual Tag
PIPT - Physical Index Physical Tag
VIPT - Virtual Index Physical Tag
AHB - Advanced High-performance Bus
ASB - Advanced System Bus
APB - Advanced Peripheral Bus
AXI - Advanced eXtensible Interface
DSU - DynamIQ Shared Unit
ACE - AXI Coherency Extensions
CHI - Coherent Hub Interface
CCI - Cache Coherent Interconnect
ADB - AMBA Domain Bridge
CMN - Coherent Mesh Network