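Does contiguous memory always deliver high performance? Yes!
Sequential memory access beats column-order access to a matrix, and column-order access beats random access.
But is prefetching alone responsible for that? Not necessarily!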
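#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define MAX 0xfffff
/* "Sequential" walk: advance by a fixed stride of 11 elements. */
unsigned int next_seq(unsigned int seed)
{
    return seed * 1 + 11;
}
/* Linear congruential generator: produces scattered indices. */
unsigned int next_rnd(unsigned int seed)
{
    return seed * 1664525 + 1013904223;
}
int arr[MAX];
int main(int argc, char **argv)
{
    int i, size;
    volatile int j; /* volatile so the compiler keeps the loads */
    unsigned int seed = 0;
    if (argc < 3) {
        fprintf(stderr, "usage: %s <size> <0|1>\n", argv[0]);
        return 1;
    }
    size = atoi(argv[1]);
    if (size <= 0 || size > MAX) { /* keep every index inside arr */
        fprintf(stderr, "size must be between 1 and %d\n", MAX);
        return 1;
    }
    if (atoi(argv[2]) == 0) { // sequential (fixed-stride) memory access
        for (i = 0; i < size; ++i) {
            seed = next_seq(seed) % size;
            j = arr[seed];
            arr[seed] = 1;
        }
    } else { // random memory access
        seed = (unsigned int)time(NULL);
        for (i = 0; i < size; ++i) {
            seed = next_rnd(seed) % size;
            j = arr[seed];
            arr[seed] = 1;
        }
    }
    return 0;
}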
Which one wins?
First, look at the cache misses:
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-load-misses ./a.out 1000000 1
Performance counter stats for './a.out 1000000 1':
1,094,294 L1-dcache-load-misses
0.020677177 seconds time elapsed
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-load-misses ./a.out 1000000 0
Performance counter stats for './a.out 1000000 0':
753,484 L1-dcache-load-misses
0.018539000 seconds time elapsed
[shabi root@shabi /home/zyte]
This is easy to understand: random access has a somewhat lower cache hit rate. But what about the effect of prefetching?
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-prefetch-misses ./a.out 1000000 0
Performance counter stats for './a.out 1000000 0':
644,380 L1-dcache-prefetch-misses
0.018453423 seconds time elapsed
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-prefetch-misses ./a.out 1000000 1
Performance counter stats for './a.out 1000000 1':
8,214 L1-dcache-prefetch-misses
0.020350861 seconds time elapsed
[shabi root@shabi /home/zyte]
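Random access shows far fewer prefetch misses (8,214 versus 644,380), so its prefetch numbers look surprisingly good! Yet something must be dragging it down, because overall performance is what counts; prefetch misses, load misses, and store misses are each only individual metrics. They depend on:
- The size of the object struct.
- The cache line size.
- The distance spanned by each random access.
- How page faults are handled.
- The TLB hit rate.
So do not judge by a single metric; look at the whole picture.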
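To make the interaction between stride and cache line size concrete, here is a minimal sketch that counts how often a fixed-stride walk moves onto a new cache line. The helper name `line_changes` is hypothetical, and the model assumes a line-aligned start, no wraparound, and a working set much larger than L1; it is a back-of-the-envelope estimate, not a model of any real cache.

```c
/*
 * Count how many times a fixed-stride walk lands on a different cache
 * line than the previous access. Assumes the walk starts at a
 * line-aligned address and never wraps around (a simplification).
 */
static long line_changes(long accesses, long stride_bytes, long line_bytes)
{
    long changes = 0;
    long prev_line = -1;
    for (long i = 0; i < accesses; i++) {
        long line = (i * stride_bytes) / line_bytes;
        if (line != prev_line) {
            changes++;
            prev_line = line;
        }
    }
    return changes;
}
```

Assuming a 4-byte `int` and 64-byte cache lines, the stride of 11 elements in `next_seq` is 44 bytes, and `line_changes(1000000, 44, 64)` evaluates to 687500. That is in the same range as the 753,484 load misses measured for the sequential run, which suggests that most of those misses simply come from stepping onto new lines.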
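Overall, if you test sequential access, column-order matrix access, and random access separately, performance drops at each step. Fully random access eventually defeats the CPU's prefetch strategy altogether, because the prefetcher cannot predict a stride.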
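The idea of a predictable stride can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a description of how any particular CPU's prefetcher works: it treats a step as predictable when its stride equals the previous step's stride. The names `row_order`, `col_order`, and `predictable_steps` are hypothetical.

```c
#include <stddef.h>

/* Index of the i-th access when an n x n row-major matrix is walked row by row. */
static size_t row_order(size_t i, size_t n)
{
    (void)n;
    return i;
}

/* Index of the i-th access when the same matrix is walked column by column. */
static size_t col_order(size_t i, size_t n)
{
    return (i % n) * n + i / n;
}

/* Count the steps whose stride equals the stride of the step before. */
static size_t predictable_steps(const size_t *idx, size_t count)
{
    size_t hits = 0;
    for (size_t k = 2; k < count; k++) {
        ptrdiff_t prev = (ptrdiff_t)idx[k - 1] - (ptrdiff_t)idx[k - 2];
        ptrdiff_t cur = (ptrdiff_t)idx[k] - (ptrdiff_t)idx[k - 1];
        if (cur == prev)
            hits++;
    }
    return hits;
}
```

For a 4 x 4 matrix, the row-order sequence keeps the same stride on all 14 comparable steps, while the column-order sequence keeps it on 8 of them because the stride breaks at the end of every column. A sequence produced by an LCG such as `next_rnd` would be expected to score close to zero under this model.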
Consider the following code:
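#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
struct stub {
    int a;
    char m[0]; /* zero-length padding (GNU extension) */
};
/* Identity step: returns the seed unchanged. */
unsigned int next_seq(unsigned int seed)
{
    return seed;
}
/* Linear congruential generator: produces scattered indices. */
unsigned int next_rnd(unsigned int seed)
{
    return seed * 1664525 + 1013904223;
}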
int main(int argc, char **argv)
{
    size_t i, n, size, seed;
    struct stub *s, *p;
    int type;
    if (argc < 3 || atoi(argv[1]) <= 0) {
        fprintf(stderr, "usage: %s <n> <0|1|2>\n", argv[0]);
        return 1;
    }
    n = (size_t)atoi(argv[1]); // matrix dimension
    type = atoi(argv[2]);
    size = n * n; // total number of elements
    // Lock the memory so that page faults do not skew the measurement.
    p = (struct stub *)mmap(0, sizeof(struct stub)*size,
            PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_LOCKED, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    if (type == 0) { // sequential access
        for (i = 0; i < size; i++) {
            s = &p[i];
            s->a = 123;
        }
    } else if (type == 1) { // column-order access to an n x n row-major matrix
        for (i = 0; i < size; i++) {
            seed = (i % n) * n + i / n;
            s = &p[seed];
            s->a = 123;
        }
    } else if (type == 2) { // random access
        seed = 10;
        for (i = 0; i < size; i++) {
            seed = next_rnd((unsigned int)seed) % size;
            s = &p[seed];
            s->a = 123;
        }
    }
    return 0;
}
Here are the results of one comparison run:
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-prefetch-misses,L1-dcache-load-misses,L1-dcache-store-misses,dTLB-load-misses,dTLB-loads ./a.out 10000 0
Performance counter stats for './a.out 10000 0':
581,025 L1-dcache-prefetch-misses (79.67%)
12,963,828 L1-dcache-load-misses (40.05%)
12,505,851 L1-dcache-store-misses (60.04%)
8,443 dTLB-load-misses # 0.00% of all dTLB cache hits (80.02%)
704,426,609 dTLB-loads (59.82%)
0.575746888 seconds time elapsed
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-prefetch-misses,L1-dcache-load-misses,L1-dcache-store-misses,dTLB-load-misses,dTLB-loads ./a.out 10000 1
Performance counter stats for './a.out 10000 1':
350,254 L1-dcache-prefetch-misses (79.87%)
13,293,181 L1-dcache-load-misses (40.09%)
12,716,868 L1-dcache-store-misses (60.06%)
12,670 dTLB-load-misses # 0.00% of all dTLB cache hits (80.03%)
1,007,100,446 dTLB-loads (59.88%)
1.342234496 seconds time elapsed
[shabi root@shabi /home/zyte]
# perf stat -e L1-dcache-prefetch-misses,L1-dcache-load-misses,L1-dcache-store-misses,dTLB-load-misses,dTLB-loads ./a.out 10000 2
Performance counter stats for './a.out 10000 2':
565,455 L1-dcache-prefetch-misses (79.85%)
6,797,220 L1-dcache-load-misses (40.07%)
6,661,030 L1-dcache-store-misses (60.07%)
20,683 dTLB-load-misses # 0.00% of all dTLB cache hits (80.04%)
1,107,220,421 dTLB-loads (59.89%)
1.603148689 seconds time elapsed
What I am still missing is a huge-page experiment.
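One way such a run might be set up is to request huge pages for the buffer and fall back to normal pages when none are available. The following is only a sketch under stated assumptions: `map_buffer` is a hypothetical helper, the 2 MiB huge page size is assumed (common on x86-64, but not universal), and the system must have huge pages reserved (for example through `vm.nr_hugepages`) for the first `mmap` call to succeed.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define ASSUMED_HUGE_PAGE_SIZE (2UL * 1024 * 1024) /* assumption: 2 MiB */

/*
 * Map `bytes` of anonymous memory, preferring huge pages.
 * Sets *used_huge to 1 if the huge-page mapping succeeded, else 0.
 * Returns NULL if both attempts fail.
 */
static void *map_buffer(size_t bytes, int *used_huge)
{
    void *p = MAP_FAILED;
    /* Round the length up to a multiple of the assumed huge page size. */
    size_t len = (bytes + ASSUMED_HUGE_PAGE_SIZE - 1)
                 / ASSUMED_HUGE_PAGE_SIZE * ASSUMED_HUGE_PAGE_SIZE;

    *used_huge = 0;
#ifdef MAP_HUGETLB
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        *used_huge = 1;
#endif
    if (p == MAP_FAILED) {
        /* Fall back to normal pages so the benchmark can still run. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}
```

Running the same three access patterns on a buffer obtained this way, and comparing `dTLB-load-misses` between the runs where `used_huge` is 1 and 0, would show how much of the gap is due to TLB pressure.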
Leather shoes in Wenzhou, Zhejiang get wet; rainwater gets in, but they won't swell.