The asm pause instruction

foreverfresh

Category: C/C++ knowledge. Tags: concurrency, barrier, multi-threading

http://blog.csdn.net/hintonic/article/details/7674024

http://siyobik.info.gf/main/reference/instruction/PAUSE

PAUSE

Spin Loop Hint

Opcodes

Hex	Mnemonic	Encoding	Long Mode	Legacy Mode	Description
F3 90	PAUSE	A	Valid	Valid	Gives hint to processor that improves performance of spin-wait loops.

Instruction Operand Encoding

Op/En	Operand 0	Operand 1	Operand 2	Operand 3
A	NA	NA	NA	NA

Description

Improves the performance of spin-wait loops. When executing a "spin-wait loop," a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

An additional function of the PAUSE instruction is to reduce the power consumed by a Pentium 4 processor while executing a spin loop. The Pentium 4 processor can execute a spin-wait loop extremely quickly, causing the processor to consume a lot of power while it waits for the resource it is spinning on to become available. Inserting a pause instruction in a spin-wait loop greatly reduces the processor's power consumption.

This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction operates like a NOP instruction. The Pentium 4 and Intel Xeon processors implement the PAUSE instruction as a pre-defined delay. The delay is finite and can be zero for some processors. This instruction does not change the architectural state of the processor (that is, it performs essentially a delaying no-op operation).

This instruction's operation is the same in non-64-bit modes and 64-bit mode.

Pseudo Code

Execute_Next_Instruction(DELAY);

Exceptions

Protected Mode Exceptions

Exception	Description
#UD	If the LOCK prefix is used.

Real-Address Mode Exceptions

Exception	Description
#UD	If the LOCK prefix is used.

Virtual-8086 Mode Exceptions

Exception	Description
#UD	If the LOCK prefix is used.

Compatibility Mode Exceptions

Exception	Description
#UD	If the LOCK prefix is used.

64-Bit Mode Exceptions

Exception	Description
#UD	If the LOCK prefix is used.

Numeric Exceptions

None.

PAUSE notifies the CPU that this is a spinlock wait loop, so memory and cache accesses may be optimized. PAUSE may also actually stall the CPU for some time, whereas NOP runs as fast as possible.

Here is detailed explanation: http://siyobik.info/index.php?module=x86&id=232

EDIT: URL above is broken, try http://siyobik.info.gf/main/reference/instruction/PAUSE

A processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops. An additional function of the PAUSE instruction is to reduce the power consumed by Intel processors.
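As a rough illustration of that recommendation, here is a minimal C sketch of a spin-wait loop with PAUSE in its body. The names spin_hint and spin_wait and the max_spins bound are assumptions made for this example and do not come from any of the quoted sources; the inline assembly assumes GCC or Clang on x86 and compiles to nothing elsewhere.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: emit PAUSE on x86 (GCC/Clang), nothing elsewhere. */
static inline void spin_hint(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

/* Poll *flag until it becomes true.
 * Returns the number of PAUSE hints issued, or -1 if the flag was
 * still false after max_spins hints (the bound is an assumption that
 * keeps the example from spinning forever). */
long spin_wait(atomic_bool *flag, long max_spins)
{
    long spins = 0;

    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        if (spins == max_spins)
            return -1;
        spin_hint();            /* tell the CPU this is a spin-wait loop */
        ++spins;
    }
    return spins;
}
```

A caller would typically have another thread store true to the flag with release ordering; the acquire load in the loop then makes that thread's earlier writes visible once the wait returns.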

An analysis of cpu_relax on x86

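Much of the work the kernel does is not protected by a lock; it simply polls a shared variable to stay synchronized. Going one step further, even when a lock is used, taking it ultimately comes down to polling a shared variable. While polling, the CPU has to keep looping and waiting.

If a novice like me were writing it, the loop body would probably do nothing at all, XD. The x86 kernel, however, usually calls cpu_relax() there. So what exactly is this function?

In fact, the function is very simple.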

         #define cpu_relax() rep_nop()

         /* REP; NOP assembles to F3 90, the same bytes as PAUSE */
         static __always_inline void rep_nop(void)
         {
                 asm volatile("rep; nop" : : : "memory");
         }

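The spinlock code contains the same rep;nop statement. I could not help wondering: why rep;nop rather than nop;nop, or nop;nop;nop…;nop? They all do nothing anyway, so why pick this one? Kernel code at this level is tuned very carefully, so a choice like this must have a reason.

The machine code for rep;nop is f3 90, which is exactly the encoding of the pause instruction, so rep;nop is effectively an alias for pause. Is that a coincidence? And what does pause actually do?

Here is a passage from the Intel manual: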

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.

An additional function of the PAUSE instruction is to reduce the power consumed by a Pentium 4 processor while executing a spin loop.
…

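Put simply, pause tells the CPU that the surrounding instruction sequence is a spin-wait, so the CPU can avoid the memory order violation on loop exit and the pipeline flush that comes with it. The instruction also has a bonus feature: it reduces power consumption. The most basic requirement for kernel code is to be fast, faster, and faster still, and this instruction throws in an extra benefit, so why not use it?

Then why write rep;nop instead of pause directly? In theory the two are equivalent, but I do not know for certain why the kernel does it this way. What is certain is that pause was only introduced with the Pentium 4. A plausible reason is that assemblers older than that do not recognize the pause mnemonic, whereas rep;nop assembles everywhere to the same bytes; or perhaps people are just nostalgic.

So, if you ever find yourself writing a busy-wait loop in an application, you might as well use pause too. Then again, a programmer who writes something as silly as a busy-wait loop in application code probably would not think of using pause to save power and gain speed either. What a headache.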

http://blog.liuw.name/1024

http://boost.2283326.n4.nabble.com/x86-CPU-spinning-should-use-pause-td3851387.html

I tried it out and you're right:

void pause() {
__asm__ __volatile__("pause;");
}

void rep_nop() {
__asm__ __volatile__( "rep; nop" : : : "memory" ); // copied from BOOST_SMT_PAUSE yield_k.hpp
}

080483c4 <_Z5pausev>:
80483c4: 55 push %ebp
80483c5: 89 e5 mov %esp,%ebp
80483c7: f3 90 pause
80483c9: 5d pop %ebp
80483ca: c3 ret

080483cb <_Z7rep_nopv>:
80483cb: 55 push %ebp
80483cc: 89 e5 mov %esp,%ebp
80483ce: f3 90 pause
80483d0: 5d pop %ebp
80483d1: c3 ret

Still the other points in the Intel paper are valid. If you fail a CAS you should probably loop and read it. Otherwise if multiple threads are waiting they'll rip the cache line back and forth.
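The "loop and read it" advice is usually called a test-and-test-and-set lock. The following C11 sketch is one way to write it; the type and function names (ttas_spinlock, ttas_acquire, and so on) are invented for this example and are not taken from Boost or from the Intel paper.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for this example only. */
struct ttas_spinlock {
    atomic_bool locked;
};

/* PAUSE on x86 with GCC/Clang, nothing elsewhere. */
static inline void spin_hint(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

/* One CAS attempt; returns true if the lock was taken. */
bool ttas_try_acquire(struct ttas_spinlock *l)
{
    bool expected = false;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, true,
        memory_order_acquire, memory_order_relaxed);
}

void ttas_acquire(struct ttas_spinlock *l)
{
    while (!ttas_try_acquire(l)) {
        /* The CAS failed: wait with plain loads, which leave the cache
         * line in a shared state, and retry the CAS only once the lock
         * looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            spin_hint();
    }
}

void ttas_release(struct ttas_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The inner loop only reads, so several waiting threads can spin on their own cached copy of the line instead of each demanding exclusive ownership on every iteration.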

Also the impl calls nanosleep(), which kind of scares me because I would expect the spin mutex to spin. If this happens often, we should be using OS mutexes.

The time an OS sleep takes is >> 1us; on my Linux it is ~3ms. 1us is probably far longer than the 33 spins you do before calling sleep.
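For readers who have not seen this kind of escalation, here is a hedged C sketch of a backoff routine that spins first, then yields, then sleeps. It is only loosely modeled on the behaviour described in this thread: the function names and the thresholds 4, 16 and 32 are assumptions for illustration, not a copy of Boost's code. It assumes a POSIX system for sched_yield() and nanosleep().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() under strict -std= modes */
#endif
#include <sched.h>
#include <time.h>

enum backoff_stage { BACKOFF_BUSY, BACKOFF_PAUSE, BACKOFF_YIELD, BACKOFF_SLEEP };

/* PAUSE on x86 with GCC/Clang, nothing elsewhere. */
static inline void spin_hint(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

/* Map the number of failed attempts k to a waiting strategy.
 * The thresholds are illustrative assumptions. */
enum backoff_stage backoff_stage_for(unsigned k)
{
    if (k < 4)
        return BACKOFF_BUSY;    /* retry immediately */
    if (k < 16)
        return BACKOFF_PAUSE;   /* spin with PAUSE */
    if (k < 32)
        return BACKOFF_YIELD;   /* give up the time slice */
    return BACKOFF_SLEEP;       /* ask the OS to sleep */
}

/* Wait once according to the stage chosen for attempt k. */
void backoff(unsigned k)
{
    switch (backoff_stage_for(k)) {
    case BACKOFF_BUSY:
        break;
    case BACKOFF_PAUSE:
        spin_hint();
        break;
    case BACKOFF_YIELD:
        sched_yield();
        break;
    case BACKOFF_SLEEP: {
        /* Requests 1 us; the actual sleep is usually much longer. */
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };
        nanosleep(&ts, NULL);
        break;
    }
    }
}
```

A lock's acquire loop would call backoff(k) with k counting failed attempts. The concern raised here applies to the last stage: once the code reaches nanosleep(), latency is set by the OS scheduler rather than by the spin.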

Ok, I realize spinlock is in ::detail. As long as asio doesn't use it, I'm fine. I've stopped using it. I'm doing low latency work and stuff like that scares me a bit. As long as I don't do any atomic_exchange() of shared_ptr, I should be fine.

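CPU memory reordering, seen through JVM concurrency

Over the past two days I read Dennis Byrne's blog post Memory Barriers and JVM Concurrency (there is also a Chinese translation, 内存屏障与JVM并发).

The post says:

A trip to main memory typically costs hundreds of clock cycles. Processors use caching to cut the cost of memory latency by orders of magnitude. These caches reorder pending memory operations for the sake of performance. In other words, a program's reads and writes are not necessarily performed in the order in which it issued them to the processor.

This passage is how the author explains why memory barriers matter. Reducing memory latency with a cache is easy to understand. But the following claim, that memory operations are reordered for performance, left me puzzled, since I never studied computer architecture properly.

Why would a CPU reorder memory access instructions? In what situations is reordering triggered? The author does not say.

To answer these questions I looked up some material online, and I would like to share it here.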

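Background to reordering

Modern CPUs run at ever higher clock speeds and interact with the cache more and more often. When the CPU computes much faster than it can access the cache, it has to wait on the cache, and too many cache waits become a performance bottleneck.
To deal with this, most architectures (x86 included) split the cache into pieces: a cache is divided into several independent slots (logical storage units, also called memory banks or cache banks), and the CPU is free to access whichever banks are idle. This banked design markedly improves the CPU's parallelism and sidesteps the cache-access bottleneck.

How memory banks are divided
Memory banks are generally assigned by cache address. For example, the even address 0x12345000 goes to bank 0 and the odd address 0x12345100 goes to bank 1 (even and odd here count 0x100-byte blocks).

Kinds of reordering
Compile-time reordering. When compiling source code, the compiler analyzes the context and reorders instructions so that they are better suited to parallel execution on the CPU.

Run-time reordering. While executing, the CPU dynamically assesses how busy the units it depends on are and reorders instructions accordingly.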

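How instruction reordering works, by example

To make this easier to follow, let's first look at a diagram of the CPU's internal structure.

The diagram shows a machine with two CPUs whose cache is split by address into two cache banks, cache bank0 and cache bank1.

The ideal order of memory accesses:
1. CPU0 writes the number 1 to cache address 0x12345000. Because 0x12345000 is an even address, the value is written to bank0.
2. CPU1 reads address 0x12345000 in bank0 and gets the number 1.
3. CPU0 writes the number 2 to cache address 0x12345100. Because 0x12345100 is an odd address, the value is written to bank1.
4. CPU1 reads address 0x12345100 in bank1 and gets the number 2.

The order of memory accesses after reordering:
1. CPU0 is about to write the number 1 to address 0x12345000 in bank0.
2. CPU0 checks whether bank0 is available and finds that it is busy.
3. To avoid waiting on the cache and to keep throughput high, CPU0 reorders its memory accesses: it first issues the later request, the write of the number 2 to address 0x12345100 in bank1.
4. CPU0 checks whether bank1 is available and finds that it is idle.
5. CPU0 writes the number 2 to address 0x12345100 in bank1.
6. CPU1 reads 0x12345000, does not see the number 1, and gets a wrong result.
7. CPU0 checks bank0 again, finds that it is now available, and writes the number 1 to 0x12345000.
8. CPU1 reads 0x12345100 and sees the number 2, which is correct.

The steps above show that the accesses were reordered at step 3, which caused step 6 to read stale data.

By reordering instructions the CPU responds faster, but it also creates many challenges for programmers who write concurrent code.
Memory barriers are one of the main tools for preventing the CPU from reordering instructions.
Did this example help you understand instruction reordering?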
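The role a barrier plays in an example like this can be sketched in C11 with a release store and an acquire load. This is an illustrative assumption about how one might express it, not code from the referenced article: the names payload, published, publish and try_consume are invented here, and the two functions are meant to be called from different threads.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;             /* ordinary data, written first */
static atomic_bool published;   /* flag, written second */

/* Writer side: the release store keeps the write to payload from
 * being reordered after the write to the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&published, true, memory_order_release);
}

/* Reader side: the acquire load keeps the read of payload from being
 * reordered before the read of the flag. Returns false if nothing has
 * been published yet. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&published, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

With this pairing, a reader that sees the flag set is guaranteed to see the value written before it, which is the guarantee that the reordered sequence in the example lacks.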

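Reordering on different architectures

As the figure shows, x86 reorders in only two cases: Stores after loads and Incoherent instruction cache pipeline.

Stores after loads means that a store followed in program order by a load from a different address may take effect after that load, because the store can wait in the CPU's store buffer while the load goes ahead. A load from the same address as the pending store still sees the stored value. This is the mildest and most widely accepted kind of reordering, and it rarely causes much trouble.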
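The classic way to observe this case is the store-buffering test, sketched below in C11. The variable and function names are assumptions for the example; sb_cpu0 and sb_cpu1 are meant to run concurrently on two threads. Without the fences, both functions may return 0 on x86; with a sequentially consistent fence between the store and the load, that outcome is ruled out.

```c
#include <stdatomic.h>

static atomic_int sb_x;   /* both start at 0 */
static atomic_int sb_y;

/* Thread 0: store to x, then load from y. */
int sb_cpu0(void)
{
    atomic_store_explicit(&sb_x, 1, memory_order_relaxed);
    /* Full fence: the store above must be visible before the load
     * below is performed. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sb_y, memory_order_relaxed);
}

/* Thread 1: the mirror image, store to y, then load from x. */
int sb_cpu1(void)
{
    atomic_store_explicit(&sb_y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sb_x, memory_order_relaxed);
}
```

On x86, GCC and Clang typically implement this fence with mfence or a locked instruction, which drains the store buffer before the following load.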

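Incoherent instruction cache pipeline is a JIT-related category: when self-modifying code is executed, the instruction cache may still hold the old instructions unless the JIT flushes it. I don't know what this category has to do with instruction reordering, and since it is outside the scope of this article I won't go into it.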

References

http://kenwublog.com/docs/memory.barrier.ppt
http://kenwublog.com/docs/memory.model.instruction.reordering.and.store.atomicity.pdf
http://kenwublog.com/docs/memory.ordering.in.modern.microprocessor.pdf
http://en.wikipedia.org/wiki/Memory_ordering
http://en.wikipedia.org/wiki/Memory_Bank

When reposting, please cite the original link: http://kenwublog.com/illustrate-memory-reordering-in-cpu
