A new barrier with good performance

/* This forces the processor not to prefetch, dispatch, or execute
 * beyond this point until this instruction is retired.  A compiler memory
 * barrier is supposed to be included in the builtin pause, but code runs
 * significantly slower with many threads and tasks if it is not also
 * placed here explicitly. */
static inline void thread_pause(void)
{
  /* The asm prevents compiler from moving loads and stores beyond this point. */
  asm volatile ("":::"memory");
  __builtin_ia32_pause();
  asm volatile ("":::"memory");
}

--------------------------------------------------------------------------------------------

https://en.wikipedia.org/wiki/Memory_ordering

https://stackoverflow.com/questions/14950614/working-of-asm-volatile-memory

asm volatile("" ::: "memory");

creates a compiler-level memory barrier, forcing the optimizer not to reorder memory accesses across it.

For example, if you need to access some addresses in a specific order (perhaps because that memory area is actually backed by a different device rather than ordinary memory), you need to be able to tell this to the compiler; otherwise it may simply reorder your steps for the sake of efficiency.

Assume in this scenario you must increment a value at one address, read something, and then increment another value at an adjacent address.

int c(int *d, int *e) {
        int r;
        d[0] += 1;
        r = e[0];
        d[1] += 1;
        return r;
}

The problem is that the compiler (gcc in this case) can rearrange your memory accesses to get better performance if you ask it to optimize (-O), probably leading to a sequence of instructions like the one below:

00000000 <c>:
   0:   4603        mov r3, r0
   2:   c805        ldmia   r0, {r0, r2}
   4:   3001        adds    r0, #1
   6:   3201        adds    r2, #1
   8:   6018        str r0, [r3, #0]
   a:   6808        ldr r0, [r1, #0]
   c:   605a        str r2, [r3, #4]
   e:   4770        bx  lr

Above, the values for d[0] and d[1] are loaded at the same time. Let's assume this is something you want to avoid; then you need to tell the compiler not to reorder memory accesses, and the way to do that is asm volatile("" ::: "memory").

int c(int *d, int *e) {
        int r;
        d[0] += 1;
        r = e[0];
        asm volatile("" ::: "memory");
        d[1] += 1;
        return r;
}

So you'll get your instruction sequence as you want it to be:

00000000 <c>:
   0:   6802        ldr r2, [r0, #0]
   2:   4603        mov r3, r0
   4:   3201        adds    r2, #1
   6:   6002        str r2, [r0, #0]
   8:   6808        ldr r0, [r1, #0]
   a:   685a        ldr r2, [r3, #4]
   c:   3201        adds    r2, #1
   e:   605a        str r2, [r3, #4]
  10:   4770        bx  lr
  12:   bf00        nop

It should be noted that this is only a compile-time memory barrier that stops the compiler from reordering memory accesses; it adds no extra hardware-level instruction to flush stores or wait for loads and stores to complete. The CPU can still reorder memory accesses at run time if it has the architectural capability to do so and the memory addresses are of the Normal type rather than Strongly-ordered or Device (ref).


This sequence is a compiler memory-access scheduling barrier, as noted in the article referenced by Udo. This one is GCC-specific; other compilers have other ways of describing it, some with more explicit (and less esoteric) statements.

__asm__ is a gcc extension permitting assembly language statements to be nested within your C code, used here for its property of being able to specify side effects that prevent the compiler from performing certain types of optimisation (which in this case might end up generating incorrect code).

__volatile__ is required to ensure that the asm statement itself is not reordered with respect to any other volatile accesses (a guarantee in the C language).

memory is an instruction to GCC saying (roughly) that the inline asm sequence has side effects on global memory, and hence that effects not just on local variables need to be taken into account.

-------------------------------------------------------------------------------------------------------------------------------

void __builtin_ia32_pause (void)

Generates the pause machine instruction with a compiler memory barrier.

https://stackoverflow.com/questions/4725676/how-does-x86-pause-instruction-work-in-spinlock-and-can-it-be-used-in-other-sc

Intel only recommends using the PAUSE instruction when the spin loop is very short.

As I understand from your question, the waits in your case are very long. In this case, spin loops are not recommended.

You wrote that you have a "thread which keeps scanning some places (e.g. a queue) to retrieve new nodes".

In such a case, Intel recommends using the synchronization API functions of your operating system. For example, you can create an event when a new node appears in a queue, and just wait for this event using WaitForSingleObject(Handle, INFINITE). The queue will trigger this event whenever a new node appears.

According to the Intel Optimization Manual, the PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last from tens to a few hundred cycles (i.e. 20-500 CPU cycles), so performance-wise it is more beneficial to wait while occupying the CPU than to yield to the OS.

500 CPU cycles on a 4500 MHz Core i7-7700K processor is about 0.0000001 seconds, i.e. about 1/10,000,000th of a second: the CPU can run this 500-cycle loop roughly 10 million times per second.

As you see, this PAUSE instruction is for really short periods of time.

On the other hand, each call to an API function like Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.

If more threads are runnable than there are processor cores (multiplied by the hyper-threading factor, if present), and a thread gets switched out in the middle of a critical section, waiting for that critical section from another thread may really take a long time, at least 10000+ cycles, so the PAUSE instruction will be futile.

Please see this article for more information:

When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject on Windows OS.

As a conclusion: in your scenario, the PAUSE instruction won't be the best choice, since your waiting time is long while PAUSE is intended for very short loops. PAUSE takes about 131 cycles on Skylake or later processors; for example, that is about 31.19 ns on an Intel Core i7-7700K CPU @ 4.20 GHz (Kaby Lake).

On earlier processors, like Haswell, it takes about 9 cycles, which is 2.81 ns on an Intel Core i5-4430 @ 3 GHz. So, for long loops, it's better to relinquish control to other threads using the OS synchronization API functions than to occupy the CPU with a PAUSE loop.


-----------------------------------------------------------------------------------------------------------------------
https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops

Resolving the Problems

The approach we recommend in such an algorithm is akin to a more gradual back-off. First, we allow the thread to spin on the lock for a brief period of time, but instead of fully spinning, we use the pause instruction in the loop. Introduced with the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instruction set, the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. In addition, the pause instruction is a no-op on x86 architectures that do not support Intel SSE2, meaning it will still execute without doing anything or raising a fault. While this means older x86 architectures that don't support Intel SSE2 won't see the benefits of the pause, it also means you can keep one straightforward code path that works across the board.

Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.


-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors

Sample Code 5: Replacing the long duration spin-wait loop with a thread-blocking API

Where np in the parameter list is the number of threads being waited for, and threadDoneEvents is a pointer to an array of event objects. The threads being waited upon signal the event objects when their work is completed. A value of TRUE in the third argument ensures that the operating system blocks the master thread from accessing processor resources until all the threads being waited for complete their work. dwMilliseconds specifies the time-out interval, in milliseconds; if dwMilliseconds is INFINITE, the function never times out.

The operating system thread-blocking API ensures that the waiting thread relinquishes the processor during the entire waiting period. So this technique is sufficient to prevent wasting processor resources on systems with Hyper-Threading Technology. The thread-blocking API may introduce synchronization overhead on a conventional multiprocessor system, but for a long duration spin-wait loop the overhead is insignificant.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

https://stackoverflow.com/questions/12894078/pause-instruction-in-x86


The documentation also mentions that "wait(some delay)" is the pseudo-implementation of the instruction.

A processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops. An additional function of the PAUSE instruction is to reduce the power consumed by Intel processors.

[source: Intel manual]

------------------------------------------------------------------------------------------------------------------------------------

_mm_pause in a busy-wait loop is the way to go.

Unfortunately the delay it provides can change with each processor family:

http://siyobik.info/main/reference/instruction/PAUSE

Example usage for GCC on Linux:

#include <xmmintrin.h>

int main (void) {
    _mm_pause();
    return 0;
}

Compile with SSE enabled:

gcc -o moo moo.c  -march=native

Also you can always just use inline assembler:

__asm volatile ("pause" ::: "memory");

From some Intel engineers, you might find this useful to determine the cost of pausing:

NOP instruction can be between 0.4-0.5 clocks and PAUSE instruction can consume 38-40 clocks.

http://software.intel.com/en-us/forums/showthread.php?t=48371


