Memory Barriers/Fences

In this article I’ll discuss the most fundamental technique in concurrent programming known as memory barriers, or fences, that make the memory state within a processor visible to other processors.


CPUs have employed many techniques to try and accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance. In my “Write Combining” article I touched on just one of these techniques. The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resource, on trying to re-order these pipelines to minimise stalls related to cache misses.


When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved. For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it. The compiler and CPU are free to re-order the instructions to best utilise the CPU provided it is updated by the time the next iteration is about to commence. Also over the execution of a loop this variable may be stored in a register and never pushed out to cache or main memory, thus it is never visible to another CPU.

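The register-allocation effect described above is easy to provoke in Java. Below is a minimal, hypothetical demo (class and method names are mine, not from the source): with a plain `boolean` the JIT may hoist the read of `running` out of the loop and keep it in a register, so the spinning thread might never observe the writer's update; declaring the flag `volatile` forces each read to go through the memory system.

```java
public class RegisterHoistingDemo {
    // With 'volatile' the loop must re-read the flag from the memory system;
    // with a plain field the JIT is free to cache it in a register forever.
    static volatile boolean running = true;

    // Returns true if the spinning thread observed the update and stopped.
    static boolean stopsWithinTwoSeconds() throws InterruptedException {
        Thread spinner = new Thread(() -> {
            while (running) {
                // busy spin on the flag
            }
        });
        spinner.start();
        Thread.sleep(100);
        running = false;          // volatile write: becomes visible to the spinner
        spinner.join(2000);
        return !spinner.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stopsWithinTwoSeconds() ? "terminated" : "still spinning");
    }
}
```

Try deleting the `volatile` keyword: on many JVMs the loop then spins indefinitely, because the register copy is never refreshed.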

CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation. Each execution unit can do some combination of these tasks. These execution units operate in parallel allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order if it was observed from another CPU.


Finally, when a cache-miss occurs, a modern CPU can make an assumption on the results of a memory load and continue executing based on this assumption until the load returns the actual data.


Provided “program order” is preserved, the CPU and compiler are free to do whatever they see fit to improve performance.


Figure 1. A simplified view of a modern multi-core CPU.

Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system.


In a multi-threaded environment techniques need to be employed for making program results visible in a timely manner. I will not cover cache coherence in this article. Just assume that once memory has been pushed to the cache then a protocol of messages will occur to ensure all caches are coherent for any shared data. The techniques for making memory visible from a processor core are known as memory barriers or fences.


Memory barriers provide two properties. Firstly, they preserve externally visible program order by ensuring all instructions either side of the barrier appear in the correct program order if observed from another CPU and, secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system.

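In Java, these two properties are exactly what a volatile write/read pair gives you. The sketch below (hypothetical class and field names) publishes a plain `payload` field under a volatile `ready` flag: the store barrier after the volatile write keeps the payload store ordered before it and pushes it toward the cache sub-system, while the load barrier before the volatile read keeps the payload load ordered after the flag load.

```java
public class SafePublication {
    private int payload;              // plain field, no barriers of its own
    private volatile boolean ready;   // volatile: barriers on write and read

    void produce(int value) {
        payload = value;  // plain store
        ready = true;     // volatile store: store barrier keeps it ordered
                          // after the payload store, as seen by other CPUs
    }

    Integer consume() {
        if (ready) {          // volatile load: load barrier keeps the
            return payload;   // payload load ordered after the flag load
        }
        return null;          // update not yet visible
    }

    public static void main(String[] args) {
        SafePublication p = new SafePublication();
        p.produce(42);
        System.out.println(p.consume()); // prints 42
    }
}
```

Without the barriers on `ready`, a consumer on another CPU could observe the flag as `true` yet still read a stale `payload`.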

Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is the relatively strong memory model on Intel CPUs, which is simpler than, say, the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level.


Store Barrier

A store barrier, “sfence” instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action. All previous updates to memory that happened before the barrier are now visible.



private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

// from inside the run() method
T event = null;
long nextSequence = sequence + 1L;
while (running)
{
    try
    {
        final long availableSequence = barrier.waitFor(nextSequence);
        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }
        sequence = nextSequence - 1L; // store barrier inserted here !!!
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);
        sequence = nextSequence; // store barrier inserted here !!!
        nextSequence++;
    }
}

Load Barrier

A load barrier, “lfence” instruction on x86, forces all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for that CPU. This makes program state exposed from other CPUs visible to this CPU before making further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor.


Full Barrier

A full barrier, “mfence” instruction on x86, is a composite of both load and store barriers happening on a CPU.


Java Memory Model

In the Java Memory Model a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it. Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes when a reference to the object is available.

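Since Java 9 these barriers can also be requested explicitly through the static fence methods on `java.lang.invoke.VarHandle`. The sketch below (hypothetical field and method names) hand-builds the volatile-style publication described above with acquire/release fences; the x86 analogies in the comments are conceptual, not a guaranteed compilation result.

```java
import java.lang.invoke.VarHandle;

public class ExplicitFences {
    static int payload;
    static boolean ready;

    static void publish(int v) {
        payload = v;
        VarHandle.releaseFence(); // stores above cannot be reordered below later stores
        ready = true;
    }

    static Integer tryRead() {
        boolean r = ready;
        VarHandle.acquireFence(); // loads below cannot be reordered above the flag load
        return r ? payload : null;
    }

    public static void main(String[] args) {
        publish(7);
        VarHandle.fullFence();    // composite barrier, analogous to mfence on x86
        System.out.println(tryRead()); // prints 7
    }
}
```

In normal code a `volatile` field is the right tool; the explicit fences are mainly useful when studying or building low-level concurrency primitives.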

Atomic Instructions and Software Locks

Atomic instructions, such as the “lock …” instructions on x86, are effectively a full barrier as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order.

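A quick sketch of that total-order guarantee using `java.util.concurrent.atomic.AtomicLong`, whose `incrementAndGet` is implemented with a `lock`-prefixed instruction on x86: even with several threads racing, no increment is ever lost. (The demo class is mine, not from the source.)

```java
import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounterDemo {
    // Runs 'threads' threads, each doing 'perThread' atomic increments.
    static long countTo(int threads, int perThread) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet(); // a locked RMW instruction on x86
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Always 400000; with a plain 'long' and '++', updates could be lost.
        System.out.println(countTo(4, 100_000));
    }
}
```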

Performance Impact of Memory Barriers

Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency, so they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimise the units of work without restriction. There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them.

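In Java terms this is the pattern the Disruptor code earlier follows: do a batch of plain stores, then pay for one store barrier with a single volatile write at the work-unit boundary. A minimal sketch with hypothetical names:

```java
public class BatchedPublisher {
    private final long[] values = new long[1024];
    private volatile long published = -1; // highest index visible to consumers

    // Plain stores for the batch body, one volatile store at the boundary.
    void publishBatch(long[] batch, int from) {
        for (int i = 0; i < batch.length; i++) {
            values[from + i] = batch[i];     // no barrier per element
        }
        published = from + batch.length - 1; // single store barrier for the batch
    }

    long lastPublished() {
        return published;                    // volatile load on the consumer side
    }

    public static void main(String[] args) {
        BatchedPublisher p = new BatchedPublisher();
        p.publishBatch(new long[] {1, 2, 3}, 0);
        System.out.println(p.lastPublished()); // prints 2
    }
}
```

Making `published` volatile while leaving the array stores plain amortises the barrier cost over the whole batch instead of paying it per element.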
