http://blog.csdn.net/hintonic/article/details/7674024
http://siyobik.info.gf/main/reference/instruction/PAUSE
PAUSE
Spin Loop Hint
Opcodes
Hex | Mnemonic | Encoding | Long Mode | Legacy Mode | Description |
---|---|---|---|---|---|
F3 90 | PAUSE | A | Valid | Valid | Gives hint to processor that improves performance of spin-wait loops. |
Instruction Operand Encoding
Op/En | Operand 0 | Operand 1 | Operand 2 | Operand 3 |
---|---|---|---|---|
A | NA | NA | NA | NA |
Description
Improves the performance of spin-wait loops. When executing a "spin-wait loop," a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
An additional function of the PAUSE instruction is to reduce the power consumed by a Pentium 4 processor while executing a spin loop. The Pentium 4 processor can execute a spin-wait loop extremely quickly, causing the processor to consume a lot of power while it waits for the resource it is spinning on to become available. Inserting a pause instruction in a spin-wait loop greatly reduces the processor's power consumption.
This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction operates like a NOP instruction. The Pentium 4 and Intel Xeon processors implement the PAUSE instruction as a pre-defined delay. The delay is finite and can be zero for some processors. This instruction does not change the architectural state of the processor (that is, it performs essentially a delaying no-op operation).
This instruction's operation is the same in non-64-bit modes and 64-bit mode.
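As a rough illustration of where the hint goes, here is a minimal sketch of a bounded spin-wait loop in C11. The names `spin_hint` and `spin_until_set` are hypothetical and not taken from the manual; the sketch assumes GCC or Clang inline assembly and falls back to a plain compiler barrier on non-x86 targets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: emit PAUSE on x86, otherwise only a compiler barrier. */
static inline void spin_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Spin until *flag becomes nonzero or max_spins iterations have passed.
 * Returns true if the flag was observed set, false if we gave up.
 */
static bool spin_until_set(atomic_int *flag, unsigned long max_spins)
{
    assert(flag != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return true;
        spin_hint();    /* hint to the CPU that this is a spin-wait loop */
    }
    /* One last look so that max_spins == 0 still reports a set flag. */
    return atomic_load_explicit(flag, memory_order_acquire) != 0;
}
```

In real code another thread would store to the flag with release ordering; the bound on the number of spins is only there so the sketch cannot hang.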
Pseudo Code
Execute_Next_Instruction(DELAY);
Exceptions
Protected Mode Exceptions
Exception | Description |
---|---|
#UD | If the LOCK prefix is used. |
Real-Address Mode Exceptions
Exception | Description |
---|---|
#UD | If the LOCK prefix is used. |
Virtual-8086 Mode Exceptions
Exception | Description |
---|---|
#UD | If the LOCK prefix is used. |
Compatibility Mode Exceptions
Exception | Description |
---|---|
#UD | If the LOCK prefix is used. |
64-Bit Mode Exceptions
Exception | Description |
---|---|
#UD | If the LOCK prefix is used. |
Numeric Exceptions
None.
PAUSE tells the CPU that the code is a spin-wait loop, so memory and cache accesses can be optimized. PAUSE may also actually stall the CPU for a short time, whereas NOP runs as fast as possible.
Here is a detailed explanation: http://siyobik.info/index.php?module=x86&id=232
EDIT: URL above is broken, try http://siyobik.info.gf/main/reference/instruction/PAUSE
A processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops. An additional function of the PAUSE instruction is to reduce the power consumed by Intel processors.
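An analysis of cpu_relax on x86
In many cases the tasks the kernel runs do not take a lock at all; they simply poll a shared variable to stay synchronized. Going one step further, even taking a lock is, at bottom, a matter of polling a shared variable. While polling, the CPU has to keep looping and waiting.
If a newbie like me wrote that loop, the body would probably do nothing at all, XD. The x86 kernel, however, usually calls cpu_relax() there. So what is this function?
In fact, it is very simple.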
#define cpu_relax() rep_nop()

static __always_inline void rep_nop(void)
{
	/* The "memory" clobber is a compiler barrier: the polled variable is re-read on every iteration. */
	asm volatile("rep; nop" : : : "memory");
}
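Spinlocks contain this rep;nop statement too. With too much time on my hands, I kept wondering: why rep;nop rather than nop;nop, or nop;nop;nop…;nop? They all do nothing anyway, so why pick this one in particular? As everyone knows, at the kernel level practically every line of code is as good as it can be, so there must be a reason for this choice.
The machine code for rep;nop is f3 90, which is exactly the machine code of the pause instruction, so it is effectively an "alias" for pause. Is that a coincidence? And what does the pause instruction do?
Here is a passage dug out of the Intel manual: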
Improves the performance of spin-wait loops. When executing a “spin-wait loop,” a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
An additional function of the PAUSE instruction is to reduce the power consumed by a Pentium 4 processor while executing a spin loop.
…
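Put simply, pause tells the CPU that the instruction sequence that follows is a spin-wait. The CPU then does not have to speculate on memory ordering inside the loop, so it avoids the memory order violation and the pipeline flush that would otherwise hit when the loop exits (what gets thrown away is speculatively executed work in the pipeline, not the cache). The instruction also comes with a bonus: lower power consumption. The most basic demand on kernel code is fast, fast, and faster still, and this instruction throws in a bonus on top, so why not use it?
So why write rep;nop instead of writing pause directly? In theory the two are equivalent; as for why it is done this way in practice, sorry, I don't know. What is certain is that pause was only introduced with the Pentium 4, so perhaps people are simply nostalgic and keep using rep;nop.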
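The equivalence can be checked by letting the assembler encode both spellings and comparing the resulting bytes. The sketch below is only an illustration: the symbol names `enc_pause` and `enc_rep_nop` and the helper functions are hypothetical, and the top-level `asm` trick assumes GCC or Clang on an x86 ELF target (on other targets it falls back to the hard-coded encoding F3 90).

```c
#include <assert.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__ELF__)
/* Let the assembler encode both spellings as data bytes in .rodata. */
__asm__(".pushsection .rodata\n"
        ".globl enc_pause\n"
        "enc_pause:   pause\n"
        ".globl enc_rep_nop\n"
        "enc_rep_nop: rep; nop\n"
        ".popsection\n");
extern const unsigned char enc_pause[];
extern const unsigned char enc_rep_nop[];
#else
/* Fallback for other targets: the documented encoding, hard-coded. */
static const unsigned char enc_pause[]   = {0xF3, 0x90};
static const unsigned char enc_rep_nop[] = {0xF3, 0x90};
#endif

/* Returns 1 if the two bytes at code are F3 90, the PAUSE encoding. */
static int is_pause_encoding(const unsigned char *code)
{
    assert(code != NULL);
    return code[0] == 0xF3 && code[1] == 0x90;
}

/* Returns 1 if both spellings assembled to the same two bytes. */
static int same_encoding(void)
{
    return memcmp(enc_pause, enc_rep_nop, 2) == 0;
}
```

The bytes are placed in a read-only data section and only read, never executed, so the check does not depend on what the instruction does at run time.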
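So, if you ever write an application and, out of boredom, put a busy-wait loop in it, you might as well use pause too. Then again, I suspect a programmer who writes something as silly as a busy-wait loop in an application would not think of using pause to save energy and gain speed either. What a headache.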
http://blog.liuw.name/1024
http://boost.2283326.n4.nabble.com/x86-CPU-spinning-should-use-pause-td3851387.html
I tried it out and you're right:
void pause() {
__asm__ __volatile__("pause;");
}
void rep_nop() {
__asm__ __volatile__( "rep; nop" : : : "memory" ); // copied from BOOST_SMT_PAUSE yield_k.hpp
}
080483c4 <_Z5pausev>:
80483c4: 55 push %ebp
80483c5: 89 e5 mov %esp,%ebp
80483c7: f3 90 pause
80483c9: 5d pop %ebp
80483ca: c3 ret
080483cb <_Z7rep_nopv>:
80483cb: 55 push %ebp
80483cc: 89 e5 mov %esp,%ebp
80483ce: f3 90 pause
80483d0: 5d pop %ebp
80483d1: c3 ret
Still, the other points in the Intel paper are valid. If you fail a CAS, you should probably loop on a plain read until the lock looks free before trying the CAS again. Otherwise, if multiple threads are waiting, they'll rip the cache line back and forth.
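The "loop and read" advice is usually called test-and-test-and-set. A minimal sketch in C11 follows; it is not Boost's implementation, and the names `ttas_lock`, `ttas_try_lock`, `ttas_lock_acquire`, `ttas_unlock` and `lock_pause` are made up for the example. It assumes GCC or Clang inline assembly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type: false means free, true means held. */
typedef struct { atomic_bool locked; } ttas_lock;

/* PAUSE on x86, otherwise only a compiler barrier. */
static inline void lock_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* One CAS attempt; returns true if this caller took the lock. */
static bool ttas_try_lock(ttas_lock *l)
{
    assert(l != NULL);
    bool expected = false;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, true,
        memory_order_acquire, memory_order_relaxed);
}

static void ttas_lock_acquire(ttas_lock *l)
{
    while (!ttas_try_lock(l)) {
        /* CAS failed: wait with plain loads, which leave the cache line
         * in shared state, and only retry the CAS once it looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            lock_pause();
    }
}

static void ttas_unlock(ttas_lock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Waiters that only read do not request exclusive ownership of the line, so the line stops bouncing between cores until the holder releases the lock.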
Also, the impl calls nanosleep(), which kind of scares me because I would expect the spin mutex to spin. If this happens often, we should be using OS mutexes.
An OS sleep takes >> 1us; on my Linux it is ~3ms. Even 1us is probably far longer than the 33 spins you do before calling sleep.
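To make the trade-off concrete, here is a sketch of a staged backoff of the kind being discussed: pause for the first few rounds, then yield, then sleep. It is not the code from yield_k.hpp; the thresholds `PAUSE_LIMIT` and `YIELD_LIMIT` and all function names are illustrative assumptions, and it relies on POSIX `sched_yield()` and `nanosleep()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <time.h>

enum backoff_step { BACKOFF_PAUSE, BACKOFF_YIELD, BACKOFF_SLEEP };

/* Illustrative thresholds, not taken from any library. */
enum { PAUSE_LIMIT = 16, YIELD_LIMIT = 32 };

/* Decide what the k-th failed attempt should do. */
static enum backoff_step backoff_step_for(unsigned k)
{
    if (k < PAUSE_LIMIT)
        return BACKOFF_PAUSE;
    if (k < YIELD_LIMIT)
        return BACKOFF_YIELD;
    return BACKOFF_SLEEP;
}

/* Perform the backoff for the k-th failed attempt. */
static void backoff(unsigned k)
{
    switch (backoff_step_for(k)) {
    case BACKOFF_PAUSE:
#if defined(__x86_64__) || defined(__i386__)
        __asm__ __volatile__("pause" ::: "memory");
#else
        __asm__ __volatile__("" ::: "memory");
#endif
        break;
    case BACKOFF_YIELD: {
        int rc = sched_yield();   /* give up the rest of the time slice */
        assert(rc == 0);
        (void)rc;
        break;
    }
    case BACKOFF_SLEEP: {
        /* Requests 1us; the actual sleep is usually much longer. */
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };
        nanosleep(&ts, NULL);
        break;
    }
    }
}
```

A caller would invoke `backoff(k)` with an increasing `k` after each failed lock attempt. For latency-sensitive code the sleep stage is the part to be wary of, since the OS decides how long the thread really stays descheduled.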
Ok, I realize spinlock is in ::detail. As long as asio doesn't use it, I'm fine. I've stopped using it. I'm doing low-latency work and stuff like that scares me a bit. As long as I don't do any atomic_exchange() of shared_ptr, I should be fine.
CPU memory instruction reordering (Memory Reordering) as seen from JVM concurrency