PREFETCHh
Prefetch Data Into Caches
Opcode | Mnemonic | Description |
---|---|---|
0F 18 /1 | PREFETCHT0 m8 | Move data from m8 closer to the processor using T0 hint. |
0F 18 /2 | PREFETCHT1 m8 | Move data from m8 closer to the processor using T1 hint. |
0F 18 /3 | PREFETCHT2 m8 | Move data from m8 closer to the processor using T2 hint. |
0F 18 /0 | PREFETCHNTA m8 | Move data from m8 closer to the processor using NTA hint. |
Description |
---|
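Fetches the line of data from memory that contains the byte specified with the source operand to a location in the cache hierarchy specified by a locality hint: T0 (temporal data; prefetch into all cache levels), T1 (temporal data with respect to the first-level cache; prefetch into the second-level cache and higher), T2 (temporal data with respect to the second-level cache; prefetch into the cache levels beyond it, as the implementation defines), or NTA (non-temporal data with respect to all cache levels; prefetch into a non-temporal cache structure, minimizing cache pollution).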
The PREFETCHh instruction is merely a hint and does not affect program behavior. If executed, this instruction moves data closer to the processor in anticipation of future use. The implementation of prefetch locality hints is implementation-dependent, and can be overloaded or ignored by a processor implementation. The amount of data prefetched is also processor implementation-dependent. It will, however, be a minimum of 32 bytes. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type that permits speculative reads (that is, the WB, WC, and WT memory types). A PREFETCHh instruction is considered a hint to this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, a PREFETCHh instruction is not ordered with respect to the fence instructions (MFENCE, SFENCE, and LFENCE) or locked memory references. A PREFETCHh instruction is also unordered with respect to CLFLUSH instructions, other PREFETCHh instructions, or any other general instruction. It is ordered with respect to serializing instructions such as CPUID, WRMSR, OUT, and MOV CR. |
Operation |
---|
Fetch(m8); |
Don't write this with inline assembly, which makes the compiler's job harder. GCC provides a builtin for prefetching (see the GCC builtins documentation for details) that you should use instead:
    void __builtin_prefetch(const void *addr, ...)
This will generate code using the prefetch instructions of your target, but with more scope for the compiler to be smart about it.
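As a rough sketch of how the builtin's optional arguments can be used: the GCC documentation describes two optional compile-time constants after the address, `rw` (0 for read, 1 for write, default 0) and `locality` (0 to 3, default 3, where 0 means no temporal locality). On x86, GCC typically turns higher locality values into the T0/T1/T2 hints and 0 into NTA, but treat that mapping as an assumption and check the generated assembly. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` below are hypothetical, and the distance is not tuned for any particular processor.

```c
#include <stddef.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch. */
enum { PREFETCH_DISTANCE = 16 };

/* Sums an array while hinting that data further ahead will be read soon. */
double sum_with_prefetch(const double *d, size_t len) {
    double total = 0.0;
    for (size_t i = 0; i < len; ++i) {
        /* Only form addresses that lie inside the array. */
        if (i + PREFETCH_DISTANCE < len) {
            /* rw = 0: we intend to read; locality = 3: high temporal locality. */
            __builtin_prefetch(&d[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += d[i];
    }
    return total;
}
```

Because the prefetch is only a hint, the result is the same with or without it; only the timing can differ.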
As a simple example of the difference between inline assembly and GCC's builtin, consider the following two files. First, test1.c:
    void foo(double *d, unsigned len) {
        for (unsigned i = 0; i < len; ++i) {
            __builtin_prefetch(&d[i]);
            d[i] = d[i] * d[i];
        }
    }
And test2.c:
    void foo(double *d, unsigned len) {
        for (unsigned i = 0; i < len; ++i) {
            asm("prefetcht0 (%0)"
                : /* no outputs */
                : "r"(&d[i]) /* "r": (%0) needs the address in a register */
                : /* no clobbers */
            );
            d[i] = d[i] * d[i];
        }
    }
(Note: if you benchmark this, I'm 99% sure that a third version with no prefetch at all would be faster than both of the above. The access pattern is predictable, so the only thing the explicit prefetch really achieves is adding more bytes of instructions and a few more cycles.)
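If you want to check that claim yourself, a minimal sketch of the comparison could look like the following. The names `square_plain`, `square_prefetch`, and `seconds_for` are hypothetical (they are not part of test1.c or test2.c), and `clock()` is a coarse timer, so a real benchmark would need large arrays, repeated runs, and some care about warm-up.

```c
#include <time.h>

/* Baseline: the same loop with no prefetch hint at all. */
void square_plain(double *d, unsigned len) {
    for (unsigned i = 0; i < len; ++i)
        d[i] = d[i] * d[i];
}

/* The same loop with the builtin, as in test1.c. */
void square_prefetch(double *d, unsigned len) {
    for (unsigned i = 0; i < len; ++i) {
        __builtin_prefetch(&d[i]);
        d[i] = d[i] * d[i];
    }
}

/* Hypothetical helper: CPU seconds spent in a single call of fn. */
double seconds_for(void (*fn)(double *, unsigned), double *d, unsigned len) {
    clock_t start = clock();
    fn(d, len);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Both functions compute identical results, so any difference you measure comes purely from the extra prefetch instructions.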
If we compile both with -O3 on x86_64 and diff the generated output we see:
.file "test1.c" | .file "test2.c"
.text .text
.p2align 4,,15 .p2align 4,,15
.globl foo .globl foo
.type foo, @function .type foo, @function
foo: foo:
.LFB0: .LFB0:
.cfi_startproc .cfi_startproc
testl %esi, %esi # len testl %esi, %esi # len
je .L1 #, je .L1 #,
leal -1(%rsi), %eax #, D.1749 | leal -1(%rsi), %eax #, D.1745
leaq 8(%rdi,%rax,8), %rax #, D.1749 | leaq 8(%rdi,%rax,8), %rax #, D.1745
.p2align 4,,10 .p2align 4,,10
.p2align 3 .p2align 3
.L4: .L4:
movsd (%rdi), %xmm0 # MEM[base: _8, offset: 0B], D. | #APP
prefetcht0 (%rdi) # ivtmp.6 | # 3 "test2.c" 1
> prefetcht0 (%rdi) # ivtmp.6
> # 0 "" 2
> #NO_APP
> movsd (%rdi), %xmm0 # MEM[base: _8, offset: 0B], D.
addq $8, %rdi #, ivtmp.6 addq $8, %rdi #, ivtmp.6
mulsd %xmm0, %xmm0 # D.1748, D.1748 | mulsd %xmm0, %xmm0 # D.1747, D.1747
movsd %xmm0, -8(%rdi) # D.1748, MEM[base: _8, offset: | movsd %xmm0, -8(%rdi) # D.1747, MEM[base: _8, offset:
cmpq %rax, %rdi # D.1749, ivtmp.6 | cmpq %rax, %rdi # D.1745, ivtmp.6
jne .L4 #, jne .L4 #,
.L1: .L1:
rep ret rep ret
.cfi_endproc .cfi_endproc
.LFE0: .LFE0:
.size foo, .-foo .size foo, .-foo
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4" .ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",@progbits .section .note.GNU-stack,"",@progbits
Even in this simple case, the compiler in question (GCC 4.8.4) has taken advantage of the fact that it's allowed to reorder the builtin and has chosen, presumably on the basis of an internal model of the target processors, to move the prefetch after the initial load. If I had to guess, doing the load and the prefetch in that order is slightly faster in some scenarios: perhaps the combined penalty across the miss and hit cases is lower, or perhaps this ordering works better with branch prediction.

It doesn't really matter why the compiler chose to do this, though. The point is that it's exceedingly complex to fully understand the impact of even trivial changes to generated code on modern processors in real applications. By using builtin functions instead of inline assembly, you benefit from the compiler's knowledge today and from any improvements that show up in the future. Even if you spend two weeks studying and benchmarking this simple case, the odds are fairly good that you won't beat future compilers, and you may even end up with a code base that can't benefit from future improvements.
Those problems arise before we even begin to discuss the portability of your code. On an architecture without support, builtin functions normally fall into one of two categories: graceful degradation or emulation. Applications with lots of x86 inline assembly were harder to port to x86_64 when that came along.
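To illustrate the graceful-degradation case: the GCC documentation states that on targets without data prefetch support, `__builtin_prefetch` generates no code and no warning. A thin wrapper can extend the same idea to compilers that lack the builtin entirely. This is only a sketch; the macro name `PREFETCH_READ` and the function `square_all` are hypothetical, and the compiler check is an assumption about which compilers you care about.

```c
#include <stddef.h>

/* Hypothetical wrapper: use the builtin where it exists, otherwise do nothing. */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#  define PREFETCH_READ(addr) ((void)(addr))  /* evaluate the address, emit no hint */
#endif

/* Squares each element in place; the hint never changes the result. */
void square_all(double *d, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        PREFETCH_READ(&d[i]);
        d[i] = d[i] * d[i];
    }
}
```

The same source then builds unchanged on x86, x86_64, or a target with no prefetch instruction at all, which is exactly what an `asm("prefetcht0 ...")` statement cannot do.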