How to check on Linux whether a program's cache is full: if my program's slowness is a CPU cache issue (on Linux), how can I be sure?...

Latest recommended article published on 2023-08-06 23:07:40

Auto汽车工程师

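I am trying to understand some very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end significantly affects the performance of the rest of the program.

My program looks a bit like this: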

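    #include <stdio.h>

    int large_buffer[10000];

    void do_lots_of_stuff(void);  /* the real work, defined elsewhere */

    void compute(FILE *input) {
        for (int i = 0; i < 100; i++) {
            do_lots_of_stuff();
            printf(".");      /* progress marker, one per iteration */
            fflush(stdout);
        }
    }

    int main() {
        FILE *input = fopen("input.txt", "r");
        compute(input);
        fclose(input);  // commenting this line out makes the run about 2x slower
        return 0;
    }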

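In theory, the fclose(input) line at the end of main should not matter, because the operating system closes the file automatically when the program exits anyway. However, I noticed that my program takes 2.5 seconds to run when I include the fclose statement and 5 seconds when I comment it out. A factor of 2 difference! And this is not due to latency at the start or end of the program: the speed at which the "." characters are printed is visibly faster in the version with the fclose statement.

I suspect that this might have to do with some memory alignment or cache miss issue. If I replace fclose with another function such as ftell, it also takes 5 seconds to run, and if I reduce the size of large_buffer to <= 8000 elements, it always runs in 2.5 seconds, regardless of whether the fclose statement is there.

But I would really like to be 100% sure of the culprit behind this strange behavior. Is it possible to run my program under some kind of profiler or other tool that would give me that information? So far I have tried running both versions under valgrind --tool=cachegrind, but it reported the same amount of cache misses (0%) for both versions of my program.

Edit 1: After running both versions of my program under perf stat -d -d -d, I got the following results: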

    Performance counter stats for './no-fclose examples/bench.o':

       5625.535086      task-clock (msec)         #    1.000 CPUs utilized
                38      context-switches          #    0.007 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                54      page-faults               #    0.010 K/sec
    17,851,853,580      cycles                    #    3.173 GHz                      (53.23%)
     6,421,955,412      stalled-cycles-frontend   #   35.97% frontend cycles idle     (53.23%)
     4,919,383,925      stalled-cycles-backend    #   27.56% backend cycles idle      (53.23%)
    13,294,878,129      instructions              #    0.74  insn per cycle
                                                  #    0.48  stalled cycles per insn  (59.91%)
     3,178,485,061      branches                  #  565.010 M/sec                    (59.91%)
       440,171,927      branch-misses             #   13.85% of all branches          (59.92%)
     4,778,577,556      L1-dcache-loads           #  849.444 M/sec                    (60.19%)
           125,313      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.22%)
            12,110      LLC-loads                 #    0.002 M/sec                    (60.25%)
                        LLC-load-misses
                        L1-icache-loads
        20,196,491      L1-icache-load-misses                                         (60.22%)
     4,793,012,927      dTLB-loads                #  852.010 M/sec                    (60.18%)
               683      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.13%)
             3,443      iTLB-loads                #    0.612 K/sec                    (53.38%)
                90      iTLB-load-misses          #    2.61% of all iTLB cache hits   (53.31%)
                        L1-dcache-prefetches
            51,382      L1-dcache-prefetch-misses #    0.009 M/sec                    (53.24%)

       5.627225926 seconds time elapsed

    Performance counter stats for './yes-fclose examples/bench.o':

       2652.609254      task-clock (msec)         #    1.000 CPUs utilized
                15      context-switches          #    0.006 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.021 K/sec
     8,277,447,108      cycles                    #    3.120 GHz                      (53.39%)
     2,453,171,903      stalled-cycles-frontend   #   29.64% frontend cycles idle     (53.46%)
     1,235,728,409      stalled-cycles-backend    #   14.93% backend cycles idle      (53.53%)
    13,296,127,857      instructions              #    1.61  insn per cycle
                                                  #    0.18  stalled cycles per insn  (60.20%)
     3,177,698,785      branches                  # 1197.952 M/sec                    (60.20%)
        71,034,122      branch-misses             #    2.24% of all branches          (60.20%)
     4,790,733,157      L1-dcache-loads           # 1806.046 M/sec                    (60.20%)
            74,908      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.20%)
            15,289      LLC-loads                 #    0.006 M/sec                    (60.19%)
                        LLC-load-misses
                        L1-icache-loads
           140,750      L1-icache-load-misses                                         (60.08%)
     4,792,716,217      dTLB-loads                # 1806.793 M/sec                    (59.93%)
             1,010      dTLB-load-misses          #    0.00% of all dTLB cache hits   (59.78%)
               113      iTLB-loads                #    0.043 K/sec                    (53.12%)
               167      iTLB-load-misses          #  147.79% of all iTLB cache hits   (53.44%)
                        L1-dcache-prefetches
            29,744      L1-dcache-prefetch-misses #    0.011 M/sec                    (53.36%)

       2.653584624 seconds time elapsed

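It looks like there are very few data cache misses in either case, just as kcachegrind reported, but the slower version of the program has worse branch prediction, more instruction cache misses, and more iTLB loads. Which of these differences is most likely to be responsible for the 2x difference in run time between the test cases?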

Edit 2: Fun fact: apparently I can still reproduce the strange behavior if I replace the fclose call with a single NOP instruction.

Edit 3: My processor is an Intel i5-2310 (Sandy Bridge).

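Edit 4: It turns out that if I resize the array by editing the assembly file, it does not get faster. Apparently the reason it got faster when I changed the size in the C code was that gcc decided to rearrange the order of things in the binary.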

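Edit 5: More evidence that what matters is the precise address of the JMP instructions: if I add a single NOP (instead of a printf) at the start of my code, it gets faster. Similarly, if I remove a useless instruction from the start of my code, it also gets faster. And when I compiled my code with a different version of gcc, it also got faster, even though the generated assembly code was the same. The only differences were the debug info at the start and the order of the sections in the binary.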