[Advanced] PREFETCHh - Prefetch Data Into Caches


Opcode     Mnemonic          Description
0F 18 /1   PREFETCHT0 m8     Move data from m8 closer to the processor using T0 hint.
0F 18 /2   PREFETCHT1 m8     Move data from m8 closer to the processor using T1 hint.
0F 18 /3   PREFETCHT2 m8     Move data from m8 closer to the processor using T2 hint.
0F 18 /0   PREFETCHNTA m8    Move data from m8 closer to the processor using NTA hint.
Description

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint:

  • T0 (temporal data) - prefetch data into all levels of the cache hierarchy.
      - Pentium III processor - 1st- or 2nd-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.
  • T1 (temporal data with respect to first level cache) - prefetch data into level 2 cache and higher.
      - Pentium III processor - 2nd-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.
  • T2 (temporal data with respect to second level cache) - prefetch data into level 2 cache and higher.
      - Pentium III processor - 2nd-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.
  • NTA (non-temporal data with respect to all cache levels) - prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution.
      - Pentium III processor - 1st-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.

The source operand is a byte memory location. (The locality hints are encoded into the machine level instruction using bits 3 through 5 of the ModR/M byte. Use of any ModR/M value other than the specified ones will lead to unpredictable behavior.) If the line selected is already present in the cache hierarchy at a level closer to the processor, no data movement occurs. Prefetches from uncacheable or WC memory are ignored.

The PREFETCHh instruction is merely a hint and does not affect program behavior. If executed, this instruction moves data closer to the processor in anticipation of future use.

The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, however, be a minimum of 32 bytes.
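Because the granularity is one cache line, strided prefetches should step at least one line at a time. One way to discover the actual line size at runtime is sketched below, assuming Linux/glibc (_SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Ask glibc for the L1 data cache line size; fall back to 64,
       a common x86 value, if the kernel/libc doesn't report one. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;
    printf("prefetch granularity is at least one %ld-byte line\n", line);
    return 0;
}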

It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). A PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also unordered with respect to CLFLUSH instructions, other PREFETCHh instructions, or any other general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR.

Operation
Fetch(m8);

Don't write it using inline assembly, which would make the compiler's job harder. GCC has a built-in extension for prefetch (see the GCC builtins documentation for more details) that you should use instead:

__builtin_prefetch(const void *addr /*, int rw, int locality */)

This will generate code using the prefetch instructions of your target, but with more scope for the compiler to be smart about it.
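The builtin takes two optional arguments: rw (0 = prefetch for read, 1 = for write) and locality (0 = non-temporal up to 3 = keep in all cache levels, the default). Here is a minimal sketch (prefetch_examples is just an illustrative name; the instruction noted in each comment is how GCC's x86 backend currently maps the locality argument, and other targets emit their own prefetch instructions or nothing at all):

void prefetch_examples(const char *p) {
    __builtin_prefetch(p);        /* defaults rw=0, locality=3: prefetcht0 */
    __builtin_prefetch(p, 0, 2);  /* read, moderate locality:   prefetcht1 */
    __builtin_prefetch(p, 0, 1);  /* read, low locality:        prefetcht2 */
    __builtin_prefetch(p, 0, 0);  /* read, non-temporal:        prefetchnta */
    __builtin_prefetch(p, 1);     /* write hint: PREFETCHW on CPUs that have it */
}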

As a simple example of the difference between inline asm and GCC's builtin, consider the following two files, test1.c:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    __builtin_prefetch(&d[i]);
    d[i] = d[i] * d[i];
  }
}

And test2.c:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    asm("prefetcht0 (%0)"
        : /* no outputs */
        : "r"(&d[i])  /* "r", not "g": the address must be in a register;
                         a memory or immediate operand would not assemble */
    );
    d[i] = d[i] * d[i];
  }
}

(Note that if you benchmark this, I'm 99% sure that a third version with no prefetch would be faster than both of the above, because the access pattern is predictable enough for the hardware prefetcher to handle, so the only thing the explicit prefetch really achieves is adding more bytes of instructions and a few more cycles.)
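For reference, that third version is just the loop with the prefetch removed; call it test3.c. It is not part of the diff below:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    d[i] = d[i] * d[i];
  }
}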

If we compile both with -O3 on x86_64 and diff the generated output, we see:

        .file   "test1.c"                                       |          .file   "test2.c"
        .text                                                              .text
        .p2align 4,,15                                                     .p2align 4,,15
        .globl  foo                                                        .globl  foo
        .type   foo, @function                                             .type   foo, @function
foo:                                                               foo:
.LFB0:                                                             .LFB0:
        .cfi_startproc                                                     .cfi_startproc
        testl   %esi, %esi      # len                                      testl   %esi, %esi      # len
        je      .L1     #,                                                 je      .L1     #,
        leal    -1(%rsi), %eax  #, D.1749                       |          leal    -1(%rsi), %eax  #, D.1745
        leaq    8(%rdi,%rax,8), %rax    #, D.1749               |          leaq    8(%rdi,%rax,8), %rax    #, D.1745
        .p2align 4,,10                                                     .p2align 4,,10
        .p2align 3                                                         .p2align 3
.L4:                                                               .L4:
        movsd   (%rdi), %xmm0   # MEM[base: _8, offset: 0B], D. |  #APP
        prefetcht0      (%rdi)  # ivtmp.6                       |  # 3 "test2.c" 1
                                                                >          prefetcht0 (%rdi)       # ivtmp.6
                                                                >  # 0 "" 2
                                                                >  #NO_APP
                                                                >          movsd   (%rdi), %xmm0   # MEM[base: _8, offset: 0B], D.
        addq    $8, %rdi        #, ivtmp.6                                 addq    $8, %rdi        #, ivtmp.6
        mulsd   %xmm0, %xmm0    # D.1748, D.1748                |          mulsd   %xmm0, %xmm0    # D.1747, D.1747
        movsd   %xmm0, -8(%rdi) # D.1748, MEM[base: _8, offset: |          movsd   %xmm0, -8(%rdi) # D.1747, MEM[base: _8, offset:
        cmpq    %rax, %rdi      # D.1749, ivtmp.6               |          cmpq    %rax, %rdi      # D.1745, ivtmp.6
        jne     .L4     #,                                                 jne     .L4     #,
.L1:                                                               .L1:
        rep ret                                                            rep ret
        .cfi_endproc                                                       .cfi_endproc
.LFE0:                                                             .LFE0:
        .size   foo, .-foo                                                 .size   foo, .-foo
        .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"               .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
        .section        .note.GNU-stack,"",@progbits                       .section        .note.GNU-stack,"",@progbits

Even in this simple case the compiler in question (GCC 4.8.4) has taken advantage of the fact that it's allowed to reorder things, and has chosen, presumably on the basis of an internal model of the target processors, to move the prefetch after the initial load has happened. If I had to guess, doing the load and prefetch in that order is slightly faster in some scenarios: presumably the combined penalty of a miss plus a hit is lower with this ordering, or it interacts better with branch prediction. It doesn't really matter why the compiler chose to do this, though.

The point is that it's exceedingly complex to fully understand the impact of even trivial changes to generated code on modern processors in real applications. By using builtin functions instead of inline assembly you benefit from the compiler's knowledge today and from any improvements that show up in the future. Even if you spend two weeks studying and benchmarking this simple case, the odds are fairly good that you'll not beat future compilers, and you may even end up with a code base that can't benefit from future improvements.

And those problems arise before we even begin to discuss the portability of your code. On an architecture without support, builtin functions normally fall into one of two categories: graceful degradation or emulation. By contrast, applications with lots of x86 inline assembly were harder to port to x86_64 when that came along.
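As a sketch of the graceful-degradation pattern (PREFETCH_READ is a hypothetical wrapper name, not a standard macro):

/* On compilers that provide the builtin the hint is emitted; elsewhere
   it degrades to a no-op, so the code still builds and runs correctly,
   just without the hint. */
#if defined(__GNUC__)
#  define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#  define PREFETCH_READ(p) ((void)0)
#endif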
