ARM Cortex -A Series Programmer’s Guide for ARMv8-A Chapter 13 Memory Ordering 第13章内存排列_arm cortex-a series programmer’s guide for armv8- -CSDN博客

本文详细介绍了ARMv8架构中的内存类型，包括Normal和Device内存，强调了内存重排序和内存屏障的作用。Normal内存允许处理器进行优化，可能存在读写重排序，而Device内存主要用于有副作用的访问。Data Memory Barrier (DMB) 和 Data Synchronization Barrier (DSB) 指令用于保证数据访问的顺序，防止不必要的优化。文章还讨论了不同类型的内存屏障参数及其应用场景。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

文档下载地址

Documentation – Arm Developerhttps://developer.arm.com/documentation/den0024/a

缩写我放前面:

TLB Translation Lookaside Buffer. 旁路转换缓冲，或称为页表缓冲 . TLB（translation lookaside buffer)介绍_limanjihe的博客-CSDN博客

内存管理（四）内存分配掩码（gfp_mask） - 内存域修饰符 & 内存分配标志

[Linux 系统编程] kmalloc 的一点用法-可怜的猪头-ChinaUnix博客

宋宝华：关于DMA ZONE和dma alloc coherent若干误解的完全澄清 - 尚码园

宋宝华：关于DMA ZONE和dma alloc coherent若干误解的彻底澄清_Linux阅码场的博客-CSDN博客

Linux 设备驱动 Edition 3-Linux设备驱动第三版（中文版）- -

第 8 章分配内存-Linux设备驱动第三版（中文版）- -

Linux Support for ARM LPAE 分析_duanlove的博客-CSDN博客_arm lpae

If your code interacts(相互影响) directly either with the hardware or with code executing on other cores, or if it directly loads or writes instructions to be executed, or modifies page tables, you need to be aware of (知道) memory ordering issues.

If you are an application developer, hardware interaction(互动,影响) is probably through a device driver, the interaction with other cores is through Pthreads or another multithreading API, and the
interaction with a paged memory system is through the operating system. In all of these cases,
the memory ordering issues are taken care of for you by the relevant code. However, if you are
writing the operating system kernel or device drivers, or implementing a hypervisor(管理程序), JIT compiler （(Just-In-Time）运行时编译执行的技术), or multithreading library, you must have a good understanding of the memory ordering rules of the ARM Architecture. You must ensure that where your code requires explicit(明确的) ordering of memory accesses, you are able to achieve this through the correct use of barriers.

The ARMv8 architecture employs a weakly-ordered model of memory. (ARMv8体系使用弱顺序的内存模型) In general terms(一般地说), this means that the order of memory accesses is not required to be the same as the program order for load and store operations. (这就意味着内存访问的顺序和程序读写操作的顺序是不需要一致的) The processor is able to re-order memory read operations with respect to each other. (处理器能够重新排序内存读取操作相对于其他操作) Writes may also be re-ordered (for example, write combining) (写操作也能够被重新排序). As a result, hardware optimizations, such as the use of cache and write buffer, (这就导致必须使用cache和write buffer的硬件优化) function in a way that improves the performance of the processor, which means that the required bandwidth between the processor and external memory can be reduced and the long latencies associated with such external memory accesses are hidden. (这个功能可以从一方面提高处理器的效能, 这就意味着所需的处理器和外部存储的带宽可以减少,与这种外部内存访问相关的长延迟被隐藏了 )

Reads and writes to Normal memory can be re-ordered by hardware, being subject only to data
dependencies and explicit memory barrier instructions.(硬件可以重排对普通内存的读写, 只收到数据依赖的影响和明确的内存围栏指令我的注解 : linux 是mb() ) Certain situations require stronger ordering rules. (特定的情况需要更强的排序规则) You can provide information to the core about this through the memory type attribute of the translation table entry that describes that memory.(你可以提供给core关于规则的信息, 通过内存类型的属性,在翻译表的入口)

(re-order受数据依赖性和显式barrier指令影响)

13.1 Memory types

The ARMv8 architecture defines two mutually-exclusive(互斥的) memory types. All regions of memory are configured as one or the other of these two types, which are Normal and Device. (所有的内存都被配置为三种类型,一般型 Normal 和设备性 Device) A third memory type, Strongly Ordered, is part of the ARMv7 architecture. (第三种是强排序的, 是 ARMv7体系结构的一部分)The differences between this type and Device memory are few and it is therefore now omitted(省略了) in ARMv8. (See Device memory on page 13-4.)

In addition to the memory type(除了内存类型之外), attributes also provide control over cacheability, shareability, access, and execution permissions. (属性也提供了控制可cache性, 可共享性, 访问和执行权限) Shareable and cache properties pertain only to Normal
memory.(Shareable 和 cache属性与Normal内存相关) Device regions are always deemed to be (认为是)non-cacheable and outer-shareable(外部共享). For cacheable locations,(对可缓存的地址) you can use attributes to indicate(标明,指出) cache allocation policy(策略) to the processor.

The memory type is not directly encoded in the translation table entry. (内存类型不是在翻译表格入口直接编码) Instead, each block entry specifies a 3-bit index into a table of memory types. (取而代之的是, 每个块入口指定一个3-bit索引的内存类型表格)This table is stored in the Memory Attribute Indirection Register MAIR_ELn.(这个表格存储在内存属性间接寄存器 MAIR_ELn 中) This table has eight entries and each of those entries has eight bits, as shown in Figure 13-1. (这个表格有8个入口, 每个入口有8个bit)

Although the translation table block entry itself does not directly contain(包含) the memory type
encoding, the TLB entry inside the processor usually stores this information for a specific entry.()
Therefore, changes to MAIR_ELn might not be observed until after both an ISB instruction
barrier and a TLB invalidate operation.(因此,MAIR_ELn 的改变可能不会被观察到, 直到在ISB 指令围栏和一个 TLB 无效操作之后)

Figure 13-1 Type encoding

13.1.1 Normal memory

You can use Normal memory for all code and for most data regions in memory. Examples of
Normal memory include areas of RAM, Flash, or ROM in physical memory. This kind of
memory provides the highest processor performance as it is weakly ordered and has fewer
restrictions(限制,约束) placed on the processor. The processor can re-order, repeat, and merge accesses to Normal memory.

Furthermore(此外), address locations that are marked as Normal can be accessed speculatively(投机的) by the processor, so that data or instructions can be read from memory without being explicitly referenced(严格引用) in the program, or in advance of(超过) the actual execution of an explicit reference. Such speculative accesses can occur as a result of branch prediction(分支预测), speculative cache linefills(随机cache占用率), out-of-order data loads, or other hardware optimizations.

For best performance, always mark application code and data as Normal and in circumstances(在...情况下) where an enforced(实施的,强制执行的) memory ordering is required, you can achieve it through the use of explicit barrier operations. Normal memory implements a weakly-ordered memory mode. There is no requirement for Normal accesses to complete in order with respect to either other Normal accesses or to Device accesses.
However, the processor must always handle hazards(危害) caused by address dependencies(依赖).

For example, consider the following simple code sequence:
STR X0, [X2] // STR 字数据存储指令 STR{条件} 源寄存器，<存储器地址> 将X0存入X2
LDR X1, [X2] // LDR 字数据加载指令 LDR{条件} 目的寄存器，<存储器地址> 从X2读出到X1

The processor always ensures that the value placed in X1 is the value that was written to the
address stored in X2 . (处理器会永远保证位于X1的值和写到存储在X2的值是相同的)

This of course applies to more complex dependencies.(当然这种情况也适用于更复杂的情况)
Consider the following code:
ADD X4, X3, #3
ADD X5, X3, #2

STR X0, [X3]
STRB W1, [X4]
LDRH W2, [X5]

In this case, the accesses take place to addresses that overlap each other. The processor must
ensure that the memory is updated as if the STR and STRB occurred in order, so that the LDRH returns the most up-to-date value. It would still be valid for the processor to merge the STR and STRB into a single access that contained the latest, correct data to be written.

13.1.2 Device memory

(没有使能MMU时，就按照device访问)

You can use Device memory for all memory regions where an access might have a side-effect.(你可以对所有的内存区域使用Device memory, 但是可能会有一个访问的影响.)
For example, a read to a FIFO location or timer is not repeatable, as it returns different values
for each read. (例如,对一个FIFO地址或者定时器的读是不可重复的, 因为每次读都会返回不同的值) A write to a control register might trigger an interrupt. (对控制寄存器的写可能触发一个中断) It is typically only used for peripherals in the system.(在系统外设中这是一个典型的硬硬) The Device memory type imposes more restrictions on the core.(设备内存类型对核心施加了更多的限制) Speculative data accesses cannot be performed to regions of memory marked as Device. (不能对标记为“设备”的内存区域执行投机数据访问) There is a single, uncommon exception to this. (这里有一个不寻常的例外) If NEON operations are used to read bytes from Device memory, the processor might read bytes not explicitly referenced if they are within an aligned 16-byte block that contains one or more bytes that are explicitly referenced. (如果NEON操作用于从Device内存中读取字节，处理器可能会读取未显式引用的字节，如果它们在一个对齐的16字节块中，包含一个或多个显式引用的字节)

Trying to execute code from a region marked as Device, is generally UNPREDICTABLE . (尝试从标记为设备的区域执行代码，通常是不可预测的) The implementation might either handle the instruction fetch as if it were to a memory location with the Normal non-cacheable attribute, or it might take a permission fault. (实现可能要么处理指令的取回好像内存有不可缓存的属性一样, 要么可能出现权限错误)

There are four different types of device memory, to which different rules apply.

Device-nGnRnE most restrictive (equivalent to(等于) Strongly Ordered memory in the ARMv7architecture).
Device-nGnRE
Device-nGRE
Device-GRE least restrictive

The letter suffixes refer to the following three properties(字母的后缀指的是下面的3个属性):

Gathering or non Gathering (G or nG) (聚合和非聚合)

This property determines whether multiple accesses can be merged into a single
bus transaction for this memory region.(这个性质决定了是否多个访问能否合并为一个单独的总线业务,对这个内存区域) If the address is marked as non Gathering
(nG),(如果地址被标记为nG) then the number and size of accesses on the memory bus performed to that location must exactly match the number and size of explicit accesses in the code. (那么在内存总线上对其进行访问的次数和大小Location必须与代码中显式访问的数量和大小完全匹配) If the address is marked as Gathering (G), then the processor can, for example, merge two byte writes into a single half-word write.(那么处理器能够, 例如, 将2byte的write合并为一个 half-word的write)

For a region marked as Gathering, multiple memory accesses to the same
memory location can also be merged.(一个区域标识为G, 对同一个内存地址的访问可以合并为一个) For example, if the program reads the same location twice, (例如,如果程序读同一个地址2次) the core only needs to perform the read once and can return the same result for both instructions.(core只需要执行read操作1次, 2个命令就可以返回相同的结果) For reads from regions marked as non Gathering,(对标识为nG的区域来读) the data value must come from the end device.(数据必须来自最终的设备) It cannot be snooped from a write-buffer or other location.(不能从write-buffer 或者别的地址进行窥探)

Re-ordering (R or nR)

This determines whether accesses to the same device can be re-ordered with
respect to each other. (这个属性决定了对同一设备的范围是否可以re-order) If the address is marked as non Re-ordering (nR), then accesses within the same block always appear on the bus in program order.(那么对同一个block的bus访问总会呈现出程序的顺序) The size of this block is IMPLEMENTATION DEFINED .(block的大小是程序指定的) Where the size of this block is large, it could span(持续,覆盖) several table entries. In this case, the ordering rule is observed with respect to any other accesses also marked as nR.

Early Write Acknowledgement(承认,回执) (E or nE)

This determines whether an intermediate write buffer between the processor and
the slave device being accessed is allowed to send an acknowledgement of a write
completion.(这决定了是否一个在处理器和slave设备中间的 write buffer, 当被访问的时候是否允许发送一个当写完成之后的应答) If the address is marked as non Early Write Acknowledgement (nE), then the write response must come from the peripheral. (如果地址被标识为nE, 那么write 应答必须来自外部设备) If the address is marked as Early Write Acknowledgement (E), then it is permissible(允许的) for a buffer in the interconnect(互相联系) logic to signal write acceptance(接受), in advance of(超过) the write actually being received by the end device. This is essentially a message to the external memory system. (这实际上是一个到外部内存系统的消息)

13.2 Barriers

The ARM architecture includes barrier instructions to force access ordering and access
completion at a specific point. (ARM架构包含屏障指令来强制顺序访问和在特定的点访问结束)In some architectures, similar instructions are known as a fence.(在其他的体系中, 相似的指令已知的有 fence)

If you are writing code where ordering is important, see Appendix J7 Barrier Litmus Tests in the
ARM Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile and Appendix
G Barrier Litmus Tests in the ARM Architecture Reference Manual ARMv7-A/R Edition, which
includes many worked examples.(如果你要写一些顺序很重要的程序, .... 有很多可行的例子)

The ARM Architecture Reference Manual defines certain key words, in particular, the terms
observe and must be observed (期限必须被观察到). In typical systems, this defines how the bus interface of a master, for example, a core or GPU and the interconnect, must handle bus transactions. (core或者GPU相互连接, 就必须处理业务了) Only masters are able to observe transfers.(只有主机能够观察到传输) All bus transactions are initiated by a master. (所有总线事务都是由主总线发起的)The order that a master performs transactions in is not necessarily the same order that such transactions complete at the slave device, because transactions might be re-ordered by the interconnect unless some ordering is explicitly enforced.(主设备执行事务的顺序不一定与从设备完成事务的顺序相同，因为事务可能会被互连设备重新排序，除非某些顺序被明确强制执行)

A simple way to describe observability is to say that “I have observed your write when I can
read what you wrote and I have observed your read when I can no longer change the value you
read” where both I and you refer to cores or other masters in the system.

There are three types of barrier instruction provided by the architecture:

Instruction Synchronization Barrier (ISB)

This is used to guarantee(确保) that any subsequent(随后的) instructions are fetched, again, so
that privilege(特权) and access are checked with the current MMU configuration. It is
used to ensure any previously(以前的) executed context-changing operations, such as
writes to system control registers, have completed by the time the ISB completes.(它用于确保任何以前执行的上下文更改操作，如写入系统控制寄存器，在ISB完成时已经完成)
In hardware terms,(在硬件方面) this might mean that the instruction pipeline is flushed, for
example. (例如, 可能意味着指令流水线被flush了)Typical uses of this would be in memory management, cache control, and context switching code, or where code is being moved about in memory.(典型的应用是在内存管理, cache控制,上下文切换代码, 或者在内存中移动的代码)

把instruction pipeline刷掉重新读指令保证isb前面的都生效了

Data Memory Barrier (DMB)

This prevents re-ordering of data accesses instructions across the barrier
instruction.(这阻止了跨屏障数据访问指令的重新排序指令) All data accesses, that is, loads or stores, but not instruction fetches,performed by this processor before the DMB, are visible to all other masters within the specified shareability domain before any of the data accesses after the
DMB.(这个处理器在DMB之前执行的所有数据访问(即加载或存储，但不包括指令读取)，在DMB之后的任何数据访问之前，对指定共享域内的所有其他主机都是可见的)

dmb后面的非load strore指令，可能被执行

For example:

LDR x0, [x1] // Must be seen by the memory system before the STR below.
DMB ISHLD
ADD x2, #1 // May be executed before or after the memory system sees
LDR.
STR x3, [x4] // Must be seen by the memory system after the LDR above.

It also ensures that any explicit preceding data or unified cache maintenance
operations have completed before any subsequent data accesses are executed.

DC CSW, x5 // Data clean by Set/way
LDR x0, [x1] // Effect of data cache clean might not be seen by this
// instruction
DMB ISH
LDR x2, [x3] // Effect of data cache clean will be seen by this instruction

Data Synchronization Barrier (DSB)

This enforces the same ordering as the Data Memory Barrier, but has the
additional effect of blocking execution of any further instructions, not just loads
or stores, or both, until synchronization is complete. This can be used to prevent
execution of a SEV instruction, for instance, that would signal to other cores that
an event occurred. It waits until all cache, TLB and branch predictor maintenance
operations issued by this processor have completed for the specified shareability
domain.

dsb比dmb限制更强

For example:

DC ISW, x5 // operation must have completed before DSB can complete
STR x0, [x1] // Access must have completed before DSB can complete
DSB ISH
ADD x2, x2, #3 // Cannot be executed until DSB completes

后面指令没法执行

As you can see from the above examples, the DMB and DSB instructions take a parameter which specifies the types of access to which the barrier operates, before or after, and a shareability domain to which it applies.

The available options are listed in the table.

Table 13-1 Barrier parameters

<option> Ordered Accesses (before – after) Shareability Domain

OSHLD Load – Load, Load – Store Outer shareable
OSHST Store – Store
OSH Any – Any

NSHLD Load – Load, Load – Store Non-shareable
NSHST Store – Store
NSH Any – Any

ISHLD Load –Load, Load – Store Inner shareable
ISHST Store – Store
ISH Any – Any

LD Load –Load, Load – Store Full system
ST Store – Store
SY Any – Any

The ordered access field specifies which classes of accesses the barrier operates on. There are
three options.

暂时弄到这里

Load - Load/Store

This means that the barrier requires all loads to complete before the barrier but
does not require stores to complete. Both loads and stores that appear after the
barrier in program order must wait for the barrier to complete.

Store - Store

This means that the barrier only affects store accesses and that loads can still be
freely re-ordered around the barrier.

Any - Any

This means that both loads and stores must complete before the barrier. Both
loads and stores that appear after the barrier in program order must wait for the
barrier to complete.

Barriers are used to prevent unsafe optimizations from occurring and to enforce a specific
memory ordering. Use of unnecessary barrier instructions can therefore reduce software
performance. Consider carefully whether a barrier is necessary in a specific situation, and if so,
which is the correct barrier to use.

A more subtle effect of the ordering rules is that the instruction interface, data interface, and
MMU table walker of a core are considered as separate observers. This means that you might
need, for example, to use DSB instructions to ensure that an access one interface is guaranteed to be observable on a different interface.

If you execute a data cache clean and invalidate instruction, for example DCCVAU , X0 , you must insert a DSB instruction after this to be sure that subsequent page table walks, modifications to translation table entries, instruction fetches, or updates to instructions in memory, can all see the new values.

For example, consider an update of the translation tables:

STR X0, [X1] // update a translation table entry
DSB ISHST // ensure write has completed
TLBI VAE1IS, X2 // invalidate the TLB entry for the entry that changes
DSB ISH // ensure TLB invalidation is complete
ISB // synchronize context on this processor

A DSB is required to ensure that the maintenance operations complete and an ISB is required to
ensure that the effects of those operations are seen by the instructions that follow.

The processor might speculatively access an address marked as Normal at any time. So when
considering whether barriers are required, don’t just consider explicit accesses generated by
load or store instructions.