Memory barrier

Execution Ordering

Out-Of-Order Execution

To improve performance, the order in which program instructions execute may be changed by the compiler or by the hardware, for example:

  • The compiler optimizes away repeated work
    • Example) A routine that increments by 1, repeated 100 times -> a single increment by 100
  • A single processor (HW thread) executes instructions partly in parallel rather than strictly sequentially

In addition, instructions may be prefetched and executed early, so instructions may run at the same time, out of program order:

  • For example, suppose there are five processing types (Integer, Multiply, Floating-Point/NEON, Branch, Load/Store), each driven by different instructions. Because they belong to different execution units, they can be processed at the same time, but their completion times differ: integer operations have the shortest latency, so they finish first.
  • Even instructions of the same processing type can be processed in parallel if the processor has multiple issue units.

Memory Ordering

Out-Of-Order Memory

Also for performance, the order in which the CPU or a device accesses memory may be changed by the compiler or by the hardware, for example:

  • The compiler optimizes away redundant iterations
    • Example) Storing 1 to address A 100 times in a loop -> a single store of 1 to address A
  • The architecture optimizes by using write buffers to reorder memory accesses
    • Example) Accesses to addresses on the same cache line are collected and processed together where possible

Example 1) 

  • If address A and address B are on the same cache line, they are grouped together where possible (figure below)
    • The code stores in the order A, X, B, but the hardware reorders this to A, B, X
    • Since there are no dependent operations among addresses A, X, and B, the architecture sees no dependency, so the order may be changed.

Example 2) 

When reading from a device through its address register and data register, the intended sequence is:

  1. Write the address to the device's address register (a store operation)
  2. Read the data from the device's data register (a load operation)

However, the CPU may not know that this store and load are dependent, so their order may be changed.

Compiler Barrier

A compiler barrier prevents the compiler from omitting, simplifying, or reordering code as part of its optimizations.

  • C Functions
    • barrier()
      • barrier() prevents memory accesses before and after it from being reordered by the compiler's optimizations.
  • Compiler Directives
    • __volatile__
      • Use __volatile__ to limit the compiler's optimizations
      • When used on a variable
        • Limits optimization of accesses to that variable.
          • It prevents the compiler from caching the variable in a register for speed.
      • When used on a statement (e.g. inline asm)
        • Limits all optimization of that statement.
  • READ_ONCE() / WRITE_ONCE()
    • Use volatile internally. On the DEC Alpha architecture they also include an architecture barrier.
    • Compiler optimizations that are fine for single-threaded code can be problematic in concurrent code; these macros prevent them.

When you need a compiler barrier

Dependency Barriers

LOAD-LOAD

  • The compiler does not honor a control (if) dependency between two LOAD operations: the second load may be hoisted or speculated by optimization, so the order between the two LOADs is not guaranteed.

STORE

  • STORE operations are not executed speculatively by the compiler, so a STORE keeps its order relative to surrounding LOADs and STOREs. Still, this does not act as a conditional dependency barrier.

Architecture Memory Barrier

Memory barriers come in several levels: inner-shareable, outer-shareable, full-system, and so on. Each level has a different scope and a different performance cost:

Mandatory barrier

Mandatory barriers prevent the architecture from omitting, merging, or reordering accesses to shared memory. There are four basic APIs:

  • mb()
    • General memory barrier type
    • Guarantees the order between the preceding Load/Store operations and the subsequent Load/Store operations.
    • Implementation depends on the architecture
      • dsb, sync, mfence, …
      • On ARM, the dsb instruction is used, which performs additional synchronization beyond a write-buffer flush.
      • On ARM64, the dsb(sy) instruction is used, likewise performing additional synchronization beyond a write-buffer flush.
  • rmb()
    • Read memory barrier type
    • Guarantees the order between the preceding load operations and the subsequent load operations.
  • wmb()
    • Write memory barrier type
    • Guarantees the order between the preceding store operations and the subsequent store operations.
  • read_barrier_depends()
    • Data dependency barrier type
    • When the value read by one load operation is used as the input of the next load operation, the order between the two load operations is guaranteed.
    • It is somewhat faster than rmb().
    • ARM, ARM64, and most architectures need no code here because the hardware honors simple data dependencies; only the DEC Alpha architecture actually implements it.

In addition, there are two implicit one-way barrier types: (More on that later.)

  • ACQUIRE operations
    • A one-way barrier: memory operations after the ACQUIRE may not be reordered before it.
    • Used in LOCK operations and in smp_load_acquire() and smp_cond_acquire().
    • smp_load_acquire()
      • Guarantees that subsequent Read/Write operations stay after this load.
  • RELEASE operations
    • A one-way barrier: memory operations before the RELEASE may not be reordered after it.
    • Used in UNLOCK operations and in smp_store_release().
    • smp_store_release()
      • Guarantees that preceding Read/Write operations stay before this store.

The following figure shows how the ACQUIRE and RELEASE operations are implemented in a spinlock.

  • Since the store of the A value cannot be reordered past the RELEASE, the A value is safely stored before B.

 

Device Support Barrier

For devices, kernel v3.19-rc1 added barriers that apply to the outer-shareable domain; they are lighter than rmb()/wmb() and heavier than the SMP barriers.

  • dma_rmb()
    • The same as rmb(), but for shared memory accessed not only by the CPU but also by devices in the outer-shareable domain.
  • dma_wmb()
    • The same as wmb(), but for shared memory accessed not only by the CPU but also by devices.

SMP Support Barrier

In the inner-shareable domain of an SMP system, the following functions ensure memory ordering between cores (and devices) that share cache coherency:

  • smp_mb()
    • Guarantees the order between the preceding Read/Write operations and the subsequent Read/Write operations.
  • smp_rmb()
    • Guarantees the order between the preceding Read operations and the subsequent Read operations.
  • smp_wmb()
    • Guarantees the order between the preceding Write operations and the subsequent Write operations.

ldar/stlr instructions

The following figure shows the difference between a two-way barrier and a one-way barrier in the ARMv8 architecture.

  • When implementing lock() and unlock(), a pair of one-way barriers performs better than two two-way barriers.
  • ARMv7 does not support these instructions, so the same semantics are implemented with the existing smp_mb().
  • In ARMv8, the two one-way barrier instructions are used as a pair.
    • smp_load_acquire() maps to the one-way barrier LDAR
    • smp_store_release() maps to the one-way barrier STLR

 

Barriers of ARMv7 & ARMv8

  • Three instructions: DMB, DSB, ISB

DMB(Data Memory Barrier)

  • As a data memory barrier, DMB separates the memory accesses before it from those after it. The following two arguments can be combined:
    • shareability domain
      • ish: inner-shareable
      • osh: outer-shareable
      • sy: full system
    • access types
      • ld: load
      • st: store
  • Waits (flushes) until all pending loads/stores complete.
    • Example) For efficiency the hardware might turn the sequence R,W,R,W,W,W into R,W,R,W,W; DMB prevents this.

DSB (Data Synchronization Barrier, formerly called DWB)

  • It is much slower than DMB because it stalls subsequent instructions until all of the following (including everything DMB covers) are complete. The arguments are the same as for DMB.
    • Instruction cache and data cache operations
    • Branch predictor cache flush
    • Completion of deferred load/store operations <- what DMB covers
    • Completion of TLB maintenance operations

ISB(Instruction Synchronization Barrier)

  • The moment an ISB executes, all instructions that have already entered the pipeline behind it are discarded (pipeline flush). With out-of-order execution, an instruction fetched later may otherwise execute first and cause problems. In a routine where the order of two instructions must be strictly enforced, an ISB is placed between them to guarantee their execution order.
  • ARMv7 supports ISB, but other architectures may not. In that case, an effect similar to emptying the pipeline can be approximated with instructions such as nop or mov r0, r0, so that nothing goes wrong at the moment the execution context changes. Introducing a dependency in the order of memory references likewise forces certain routines to execute in order.
  • Usage cases
    • Run-time code changes
      • If a cached instruction is re-executed after the code has been modified, problems occur, so an ISB is needed here as well. (JIT compilers behave this way when they rewrite instructions.)
    • MMU on/off
      • ISB is also used at the point where the MMU is switched.
      • Out-of-order execution: on a CPU with a parallel, out-of-order pipeline (supported from ARMv6, widespread from ARMv7), the ISB prevents the next instruction from executing first and referencing an address incorrectly around the MMU state change.

Code Analysis

Armv8

include/asm-generic/barrier.h

#define smp_mb()    __smp_mb()
#define smp_rmb()   __smp_rmb()
#define smp_wmb()   __smp_wmb()

arch/arm64/include/asm/barrier.h

#define __smp_mb()  dmb(ish)
#define __smp_rmb() dmb(ishld)
#define __smp_wmb() dmb(ishst)

#define mb()        dsb(sy)
#define rmb()       dsb(ld)
#define wmb()       dsb(st)

#define dma_mb()    dmb(osh)
#define dma_rmb()   dmb(oshld)
#define dma_wmb()   dmb(oshst)

#define sev()       asm volatile("sev" : : : "memory")
#define wfe()       asm volatile("wfe" : : : "memory")
#define wfi()       asm volatile("wfi" : : : "memory")
#define isb()       asm volatile("isb" : : : "memory")

#define dmb(opt)    asm volatile("dmb " #opt : : : "memory")
#define dsb(opt)    asm volatile("dsb " #opt : : : "memory")

The instructions for each ARM architecture were compared as shown in the following figure.

smp_load_acquire()

include/asm-generic/barrier.h

#ifndef smp_load_acquire
#define smp_load_acquire(p) __smp_load_acquire(p)
#endif

arch/arm64/include/asm/barrier.h

#define __smp_load_acquire(p)                       \
({                                  \
    union { __unqual_scalar_typeof(*p) __val; char __c[1]; } __u;   \
    typeof(p) __p = (p);                        \
    compiletime_assert_atomic_type(*p);             \
    kasan_check_read(__p, sizeof(*p));              \
    switch (sizeof(*p)) {                       \
    case 1:                             \
        asm volatile ("ldarb %w0, %1"               \
            : "=r" (*(__u8 *)__u.__c)           \
            : "Q" (*__p) : "memory");           \
        break;                          \
    case 2:                             \
        asm volatile ("ldarh %w0, %1"               \
            : "=r" (*(__u16 *)__u.__c)          \
            : "Q" (*__p) : "memory");           \
        break;                          \
    case 4:                             \
        asm volatile ("ldar %w0, %1"                \
            : "=r" (*(__u32 *)__u.__c)          \
            : "Q" (*__p) : "memory");           \
        break;                          \
    case 8:                             \
        asm volatile ("ldar %0, %1"             \
            : "=r" (*(__u64 *)__u.__c)          \
            : "Q" (*__p) : "memory");           \
        break;                          \
    }                               \
    (typeof(*p))__u.__val;                      \
})

smp_store_release()

include/asm-generic/barrier.h

#ifndef smp_store_release
#define smp_store_release(p, v) __smp_store_release(p, v)
#endif

arch/arm64/include/asm/barrier.h

#define __smp_store_release(p, v)                   \
do {                                    \
    typeof(p) __p = (p);                        \
    union { __unqual_scalar_typeof(*p) __val; char __c[1]; } __u =  \
        { .__val = (__force __unqual_scalar_typeof(*p)) (v) };  \
    compiletime_assert_atomic_type(*p);             \
    kasan_check_write(__p, sizeof(*p));             \
    switch (sizeof(*p)) {                       \
    case 1:                             \
        asm volatile ("stlrb %w1, %0"               \
                : "=Q" (*__p)               \
                : "r" (*(__u8 *)__u.__c)        \
                : "memory");                \
        break;                          \
    case 2:                             \
        asm volatile ("stlrh %w1, %0"               \
                : "=Q" (*__p)               \
                : "r" (*(__u16 *)__u.__c)       \
                : "memory");                \
        break;                          \
    case 4:                             \
        asm volatile ("stlr %w1, %0"                \
                : "=Q" (*__p)               \
                : "r" (*(__u32 *)__u.__c)       \
                : "memory");                \
        break;                          \
    case 8:                             \
        asm volatile ("stlr %1, %0"             \
                : "=Q" (*__p)               \
                : "r" (*(__u64 *)__u.__c)       \
                : "memory");                \
        break;                          \
    }                               \
} while (0)
