Going faster than memcpy
While profiling Shadesmar a couple of weeks ago, I noticed that for large binary unserialized messages (>512kB) most of the execution time is spent doing copying the message (using memcpy
) between process memory to shared memory and back.
I had a few hours to kill last weekend, and I tried to implement a faster way to do memory copies.
Autopsy of memcpy
Here’s the dumb of perf when running pub-sub for messages of sizes between 512kB and 2MB.
Children Self Shared Object Symbol
+ 99.86% 0.00% libc-2.27.so [.] __libc_start_main
+ 99.86% 0.00% [unknown] [k] 0x4426258d4c544155
+ 99.84% 0.02% raw_benchmark [.] main
+ 98.13% 97.12% libc-2.27.so [.] __memmove_avx_unaligned_erms
+ 51.99% 0.00% raw_benchmark [.] shm::PublisherBin<16u>::publish
+ 51.98% 0.01% raw_benchmark [.] shm::Topic<16u>::write
+ 47.64% 0.01% raw_benchmark [.] shm::Topic<16u>::read
__memmove_avx_unaligned_erms
is an implementation of memcpy
for unaligned memory blocks that uses AVX to copy over 32 bytes at a time. Digging into the glibc
source code, I found this:
#if IS_IN (libc)
# define VEC_SIZE 32
# define VEC(i) ymm##i
# define VMOVNT vmovntdq
# define VMOVU vmovdqu
# define VMOVA vmovdqa
# define SECTION(p) p##.avx
# define MEMMOVE_SYMBOL(p,s) p##_avx_##s
# include "memmove-vec-unaligned-erms.S"
#endif
Breaking down this function:
memmove
: glibc
implements memcpy
as a memmove
instead, here’s the relevant source code:
# define SYMBOL_NAME memcpy
# include "ifunc-memmove.h"
libc_ifunc_redirected (__redirect_memcpy, __new_memcpy,
IFUNC_SELECTOR ());
Here’s the difference between the two: With memcpy
, the destination cannot overlap the source at all. With memmove
it can. Initially, I wasn’t sure why it was implemented as memmove
. The reason for this will become clearer as the post proceeds.
erms
: Enhanced Rep Movs is a hardware optimization for a loop that does a simple copy. In simple pseudo-code, this is what the loop implementation looks like for copying a single byte at a time (REP MOVSB
).
void rep_movsb(void *dest, const void *src, size_t len) {
const uint8_t* s = (uint8_t*)src;
uint8_t* d = (uint8_t*)dest;
while (len--)
*d++ = *s++;
return dest;
}
Since the loop copies data pointer by pointer, it can handle the case of overlapping data.
vec
: For the above loop rather than copying around single bytes, it uses x86 vectorized instructions to copy multiple bytes in a single loop iteration (technically single instruction). vmov*
are assembly instructions for AVX which is the latest instruction set that the CPU on my laptop supports. With VEC_SIZE = 32
, it copies 32 bytes at a time.
unaligned
: This is a generic version of memmove
that can copy between any pointer locations irrespective of their alignment. Unaligned pointers increase complexity for the copy loop when using vectorized instructions. The unaligned preceeding and trailing memory locations must be copied separately before hitting the optimized loop.
memmove-vec-unaligned-erms.S
holds the actual implementation in assembly. A few things that the implementation does:
-
It uses
REP MOVS
only if the data is greater than 4kB. For smaller values it uses the SSE2 optimization. - For handling
unaligned
pointers, it uses the following blocks:- 16 to 31:
vmovdqu
- 15 to 8:
movq
- 7 to 4:
movl
- 3 to 2:
movzwl
andmovw
- 16 to 31:
VMOVNT
defined above is for doing non-temporal(NT) moves. NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded. Usesprefetcht0
to load data into cache (all levels: t0). In the current iteration, we prefetch the data for 2 iterations later. The data is copied (via cache) into registers. The data (via NT) is copied from registers into destination.
L(loop_large_forward):
; Copy 4 * VEC a time forward with non-temporal stores.
PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
; PREFETCH 256b from rsi+256 to rsi+511
VMOVU (%rsi), %VEC(0)
VMOVU VEC_SIZE(%rsi), %VEC(1)
VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
; mov 128b from rsi to rsi+127 -> 4 ymm registers (cache)
; 2 loops later, we hit the prefetched values
addq $PREFETCHED_LOAD_SIZE, %rsi ; advance to rsi+128 in next loop
subq $PREFETCHED_LOAD_SIZE, %rdx
VMOVNT %VEC(0), (%rdi)
VMOVNT %VEC(1), VEC_SIZE(%rdi)
VMOVNT %VEC(2), (VEC_SIZE * 2)(%rdi)
VMOVNT %VEC(3), (VEC_SIZE * 3)(%rdi)
; mov 128b from 4 ymm register -> rdi to rdi+127 (no cache)
addq $PREFETCHED_LOAD_SIZE, %rdi ; advance to rdi+128 in next loop
cmpq $PREFETCHED_LOAD_SIZE, %rdx
ja L(loop_large_forward)
Method 1: Basic REP MOVSB
Before getting into more exotic implementations, I wanted to first implement a super simple version of ERSB to see how well it would perform. I used inline assembly to write out the loop.
void _rep_movsb(void *d, const void *s, size_t n) {
asm volatile("rep movsb"
: "=D"(d), "=S"(s), "=c"(n)
: "0"(d), "1"(s), "2"(n)
: "memory");
}
This does the same as the pseudo-code attached above, but I wrote it in assembly to prevent any compiler optimization, and rely only on the hardware ERMS optimization.
Alternate 2: Aligned AVX
One of the complexities in glibc
’s implementation is getting it to work for unaligned pointers. Since I control the memory allocation, I figured I could recreate the implementation focused solely on aligned pointer and sizes. I’m using AVX intrinsics for 32-byte vectors (AVX):
void _avx_cpy(void *d, const void *s, size_t n) {
// d, s -> 32 byte aligned
// n -> multiple of 32
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 0; nVec--, sVec++, dVec++) {
const __m256i temp = _mm256_load_si256(sVec);
_mm256_store_si256(dVec, temp);
}
}
The logic is identical to the previous REP MOVSB
loop instead operating on 32 bytes at a time.
Method 3: Stream aligned AVX
_mm256_load_si256
and _mm256_store_si256
go through the cache, which incurs additional overhead. AVX instruction set has _stream_
load and store instructions that skip the cache. The performance of this copy is dependant on:
- Quantity of data to copy
- Cache size
Non-temporal moves may bog down the performance for smaller copies (that can fit into L2 cache) compared to regular moves.
void _avx_async_cpy(void *d, const void *s, size_t n) {
// d, s -> 32 byte aligned
// n -> multiple of 32
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 0; nVec--, sVec++, dVec++) {
const __m256i temp = _mm256_stream_load_si256(sVec);
_mm256_stream_si256(dVec, temp);
}
_mm_sfence();
}
Exact code as before but using non-temporal moves instead. There’s an extra _mm_sfence
which guarantees that all stores in the preceding loop are visible globally.
Method 4: Stream aligned AVX with prefetch
In the previous method, we skipped the cache entirely. We can squeeze a bit more performance by prefetching the source data into the cache for the next iteration in the current iteration. Since all prefetches work on cache-lines (64-bytes), each loop iteration copies 64-bytes from source to data.
void _avx_async_pf_cpy(void *d, const void *s, size_t n) {
// d, s -> 64 byte aligned
// n -> multiple of 64
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 2; nVec -= 2, sVec += 2, dVec += 2) {
// prefetch the next iteration's data
// by default _mm_prefetch moves the entire cache-lint (64b)
_mm_prefetch(sVec + 2, _MM_HINT_T0);
_mm256_stream_si256(dVec, _mm256_load_si256(sVec));
_mm256_stream_si256(dVec + 1, _mm256_load_si256(sVec + 1));
}
_mm256_stream_si256(dVec, _mm256_load_si256(sVec));
_mm256_stream_si256(dVec + 1, _mm256_load_si256(sVec + 1));
_mm_sfence();
}
The load from source pointer to register should not skip the cache since that data is explicitly prefetched into the cache, non-stream _mm256_load_si256
must be used instead.
This also unrolls the loop for 2 copies at a time instead of a single copy. This is to guarantee that each loop iteration’s prefetch coincides the copy. Prefetch the next 64-bytes and copy the current 64-bytes.
Alternate avenues
Unrolling
In the previous section, most of the changes were in the actual underlying load, store instructions used. Another avenue of exploration is to unroll the loop for a certain number of iterations. This reduces the number of branch statements by the factor of unrolling.
In the glibc
implementation the unrolling factor is 4 which is what I’ll use as well. A very simple way to implement this is to increase the alignment required by 4x and treat each loop as 4 instructions that copy 4x data.
A more complicated version would be trying to implement an unrolled loop without increasing alignment size. We’ll need to copy using a regular fully rolled loop till we hit a pointer location that is aligned to the size expected by our unrolled loop.
Unrolling the aligned AVX copy:
void _avx_cpy_unroll(void *d, const void *s, size_t n) {
// d, s -> 128 byte aligned
// n -> multiple of 128
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 0; nVec -= 4, sVec += 4, dVec += 4) {
_mm256_store_si256(dVec, _mm256_load_si256(sVec));
_mm256_store_si256(dVec + 1, _mm256_load_si256(sVec + 1));
_mm256_store_si256(dVec + 2, _mm256_load_si256(sVec + 2));
_mm256_store_si256(dVec + 3, _mm256_load_si256(sVec + 3));
}
}
Multithreading
The operation of copying data is super easy to parallelize across multiple threads. The total data to be transferred can be segmented into (almost) equal chunks, and then copied over using one of the above methods. This will make the copy super-fast especially if the CPU has a large core count.
Shadesmar API
To make it easy to integrate custom memory copying logic into the library, I introduced the concept of Copier
in this commit. For a new copying algorithm, an abstract class Copier
must be implemented.
Here’s the definition of Copier
:
class Copier {
public:
virtual void *alloc(size_t) = 0;
virtual void dealloc(void *) = 0;
virtual void shm_to_user(void *, void *, size_t) = 0;
virtual void user_to_shm(void *, void *, size_t) = 0;
};
The original reason for introducing this construct was to allow cross-device usage, where a custom copier would be implemented to tranfer between CPU and GPU. E.g.: using cudaMemcpy
for Nvidia GPUs.
For a single device use case the implementation of shm_to_user
and user_to_shm
are identical. The implementation of a copier that uses std::memcpy
:
class DefaultCopier : public Copier {
public:
void *alloc(size_t size) override { return malloc(size); }
void dealloc(void *ptr) override { free(ptr); }
void shm_to_user(void *dst, void *src, size_t size) override {
std::memcpy(dst, src, size);
}
void user_to_shm(void *dst, void *src, size_t size) override {
std::memcpy(dst, src, size);
}
};
I also created an adapter MTCopier
that adds multithreading support to other copiers:
template <class BaseCopierT>
class MTCopier : public Copier {
public:
explicit MTCopier(uint32_t threads = std::thread::hardware_concurrency())
: base_copier(base_copier), nthreads(threads) {}
void *alloc(size_t size) override { return base_copier.alloc(size); }
void dealloc(void *ptr) override { base_copier.dealloc(ptr); }
void _copy(void *d, void *s, size_t n, bool shm_to_user) {
std::vector<std::thread> threads;
threads.reserve(nthreads);
ldiv_t per_worker = div((int64_t)n, nthreads);
size_t next_start = 0;
for (uint32_t thread_idx = 0; thread_idx < nthreads; ++thread_idx) {
const size_t curr_start = next_start;
next_start += per_worker.quot;
if (thread_idx < per_worker.rem) {
++next_start;
}
uint8_t *d_thread = reinterpret_cast<uint8_t *>(d) + curr_start;
uint8_t *s_thread = reinterpret_cast<uint8_t *>(s) + curr_start;
if (shm_to_user) {
threads.emplace_back(&Copier::shm_to_user, &base_copier, d_thread,
s_thread, next_start - curr_start);
} else {
threads.emplace_back(&Copier::user_to_shm, &base_copier, d_thread,
s_thread, next_start - curr_start);
}
}
for (auto &thread : threads) {
thread.join();
}
threads.clear();
}
void shm_to_user(void *dst, void *src, size_t size) override {
_copy(dst, src, size, true);
}
void user_to_shm(void *dst, void *src, size_t size) override {
_copy(dst, src, size, false);
}
private:
BaseCopierT base_copier;
uint32_t nthreads;
};
Currently this only works for memcpy
and _rep_movsb
since the implementation expects the memory copy to work for unaligned memory.
Benchmark
I used Google’s Benchmark for timing the performance of copying data ranging from size of 32kB to 64MB. All the benchmarks were run on my PC with the following specifications:
- AMD Ryzen 7 3700X
- 2x8GB DDR4 RAM @ 3600Mhz
32B64B512B4.0kB32.0kB256.0kB2.0MB16.0MB64.0MB0GB/s20GB/s40GB/s60GB/s80GB/s100GB/s120GB/sDefaultCopierRepMovsbCopierAvxCopierAvxAsyncCopierAvxAsyncPFCopierAvxUnrollCopierAvxAsyncUnrollCopierAvxAsyncPFUnrollCopierData sizesSpeed
Conclusion
Stick to std::memcpy
. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.
If performance truly matters, then you might want to consider using a more specific non-genetic implementation with alignment requirements. The streaming prefetching copy works the best for larger copies (>1MB), but the performance for small sizes is abyssal, but memcpy
matches its performance. For small to medium sizes Unrolled AVX absolutely dominates, but as for larger messages, it is slower than the streaming alternatives. The regular RepMovsb
is by far the worst overall performer as excepted.
Unrolling definitely improves performance in most cases by about 5-10%. The only case where the unrolled version is slower than rolled version is for AvxCopier
with data size of 32B, which the unrolled version is 25% slower. The rolled version will do a single AVX-256 load/store and a conditional check. The unrolled version will do 4 AVX-256 load/stores and a conditional check.
Code
Code for all the methods is included in the library conforming to the above mentioned API. To actively warn about the danger of using these custom copiers I have named this file dragons.h, with an apt message: Here be dragons.