[Advanced] PREFETCHh - Prefetch Data Into Caches


Opcode     Mnemonic          Description
0F 18 /1   PREFETCHT0 m8     Move data from m8 closer to the processor using T0 hint.
0F 18 /2   PREFETCHT1 m8     Move data from m8 closer to the processor using T1 hint.
0F 18 /3   PREFETCHT2 m8     Move data from m8 closer to the processor using T2 hint.
0F 18 /0   PREFETCHNTA m8    Move data from m8 closer to the processor using NTA hint.
Description

Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint:

  • T0 (temporal data) - prefetch data into all levels of the cache hierarchy.
      - Pentium III processor - 1st- or 2nd-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.
  • T1 (temporal data with respect to first level cache) - prefetch data into level 2 cache and higher.
      - Pentium III processor - 2nd-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.
  • T2 (temporal data with respect to second level cache) - prefetch data into level 2 cache and higher.
      - Pentium III processor - 2nd-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.
  • NTA (non-temporal data with respect to all cache levels) - prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution.
      - Pentium III processor - 1st-level cache.
      - Pentium 4 and Intel Xeon processors - 2nd-level cache.

The source operand is a byte memory location. (The locality hints are encoded into the machine level instruction using bits 3 through 5 of the ModR/M byte. Use of any ModR/M value other than the specified ones will lead to unpredictable behavior.) If the line selected is already present in the cache hierarchy at a level closer to the processor, no data movement occurs. Prefetches from uncacheable or WC memory are ignored.

The PREFETCHh instruction is merely a hint and does not affect program behavior. If executed, this instruction moves data closer to the processor in anticipation of future use.

The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, however, be a minimum of 32 bytes.
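Because the granularity is one cache line, strided prefetches should step at least one line at a time. One way to discover the actual line size at runtime is sketched below, assuming Linux/glibc (_SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Ask glibc for the L1 data cache line size; fall back to 64,
       a common x86 value, if the kernel/libc doesn't report one. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;
    printf("prefetch granularity is at least one %ld-byte line\n", line);
    return 0;
}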

It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). A PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also unordered with respect to CLFLUSH instructions, other PREFETCHh instructions, or any other general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR.

Operation
Fetch(m8);

Don't write it using inline assembly, which would make the compiler's job harder. GCC has a built-in extension for prefetch (see the GCC builtins documentation for more details) that you should use instead:

__builtin_prefetch(const void *addr /*, int rw, int locality */)

This will generate code using the prefetch instructions of your target, but with more scope for the compiler to be smart about it.
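The builtin takes two optional arguments: rw (0 = prefetch for read, 1 = for write) and locality (0 = non-temporal up to 3 = keep in all cache levels, the default). Here is a minimal sketch (prefetch_examples is just an illustrative name; the instruction noted in each comment is how GCC's x86 backend currently maps the locality argument, and other targets emit their own prefetch instructions or nothing at all):

void prefetch_examples(const char *p) {
    __builtin_prefetch(p);        /* defaults rw=0, locality=3: prefetcht0 */
    __builtin_prefetch(p, 0, 2);  /* read, moderate locality:   prefetcht1 */
    __builtin_prefetch(p, 0, 1);  /* read, low locality:        prefetcht2 */
    __builtin_prefetch(p, 0, 0);  /* read, non-temporal:        prefetchnta */
    __builtin_prefetch(p, 1);     /* write hint: PREFETCHW on CPUs that have it */
}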

As a simple example of the difference between inline asm and GCC's builtin, consider the following two files, test1.c:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    __builtin_prefetch(&d[i]);
    d[i] = d[i] * d[i];
  }
}

And test2.c:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    asm("prefetcht0 (%0)"
        : /* no outputs */
        : "r"(&d[i])  /* "r", not "g": the address must be in a register;
                         a memory or immediate operand would not assemble */
    );
    d[i] = d[i] * d[i];
  }
}

(Note that if you benchmark this, I'm 99% sure that a third version with no prefetch would be faster than both of the above, because the access pattern is predictable enough for the hardware prefetcher to handle, so the only thing the explicit prefetch really achieves is adding more bytes of instructions and a few more cycles.)
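For reference, that third version is just the loop with the prefetch removed; call it test3.c. It is not part of the diff below:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    d[i] = d[i] * d[i];
  }
}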

If we compile both with -O3 on x86_64 and diff the generated output, we see:

        .file   "test1.c"                                       |          .file   "test2.c"
        .text                                                              .text
        .p2align 4,,15                                                     .p2align 4,,15
        .globl  foo                                                        .globl  foo
        .type   foo, @function                                             .type   foo, @function
foo:                                                               foo:
.LFB0:                                                             .LFB0:
        .cfi_startproc                                                     .cfi_startproc
        testl   %esi, %esi      # len                                      testl   %esi, %esi      # len
        je      .L1     #,                                                 je      .L1     #,
        leal    -1(%rsi), %eax  #, D.1749                       |          leal    -1(%rsi), %eax  #, D.1745
        leaq    8(%rdi,%rax,8), %rax    #, D.1749               |          leaq    8(%rdi,%rax,8), %rax    #, D.1745
        .p2align 4,,10                                                     .p2align 4,,10
        .p2align 3                                                         .p2align 3
.L4:                                                               .L4:
        movsd   (%rdi), %xmm0   # MEM[base: _8, offset: 0B], D. |  #APP
        prefetcht0      (%rdi)  # ivtmp.6                       |  # 3 "test2.c" 1
                                                                >          prefetcht0 (%rdi)       # ivtmp.6
                                                                >  # 0 "" 2
                                                                >  #NO_APP
                                                                >          movsd   (%rdi), %xmm0   # MEM[base: _8, offset: 0B], D.
        addq    $8, %rdi        #, ivtmp.6                                 addq    $8, %rdi        #, ivtmp.6
        mulsd   %xmm0, %xmm0    # D.1748, D.1748                |          mulsd   %xmm0, %xmm0    # D.1747, D.1747
        movsd   %xmm0, -8(%rdi) # D.1748, MEM[base: _8, offset: |          movsd   %xmm0, -8(%rdi) # D.1747, MEM[base: _8, offset:
        cmpq    %rax, %rdi      # D.1749, ivtmp.6               |          cmpq    %rax, %rdi      # D.1745, ivtmp.6
        jne     .L4     #,                                                 jne     .L4     #,
.L1:                                                               .L1:
        rep ret                                                            rep ret
        .cfi_endproc                                                       .cfi_endproc
.LFE0:                                                             .LFE0:
        .size   foo, .-foo                                                 .size   foo, .-foo
        .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"               .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
        .section        .note.GNU-stack,"",@progbits                       .section        .note.GNU-stack,"",@progbits

Even in this simple case the compiler in question (GCC 4.8.4) has taken advantage of the fact that it's allowed to reorder things, and has chosen, presumably on the basis of an internal model of the target processors, to move the prefetch after the initial load has happened. If I had to guess, doing the load and prefetch in that order is slightly faster in some scenarios: presumably the combined penalty of a miss plus a hit is lower with this ordering, or it interacts better with branch prediction. It doesn't really matter why the compiler chose to do this, though.

The point is that it's exceedingly complex to fully understand the impact of even trivial changes to generated code on modern processors in real applications. By using builtin functions instead of inline assembly you benefit from the compiler's knowledge today and from any improvements that show up in the future. Even if you spend two weeks studying and benchmarking this simple case, the odds are fairly good that you'll not beat future compilers, and you may even end up with a code base that can't benefit from future improvements.

And those problems arise before we even begin to discuss the portability of your code. On an architecture without support, builtin functions normally fall into one of two categories: graceful degradation or emulation. By contrast, applications with lots of x86 inline assembly were harder to port to x86_64 when that came along.
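As a sketch of the graceful-degradation pattern (PREFETCH_READ is a hypothetical wrapper name, not a standard macro):

/* On compilers that provide the builtin the hint is emitted; elsewhere
   it degrades to a no-op, so the code still builds and runs correctly,
   just without the hint. */
#if defined(__GNUC__)
#  define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#  define PREFETCH_READ(p) ((void)0)
#endif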
