The Principle of Cache Locality

mseaspring

Article link: https://blog.csdn.net/mseaspring/article/details/106346244

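This article shows how to use the principle of locality to optimize program performance, and uses the perf tool to quantify and compare cache hit rates.

1 Memory hierarchy

1.1 Levels of the hierarchy

A computer's storage is organized as a hierarchy of several levels, because faster storage costs more per byte and slower storage costs less. With multiple levels, a system can cost about as much as the cheapest storage at the bottom of the hierarchy while serving reads and writes to programs at close to the speed of the fastest storage at the top. Each level in the hierarchy acts as a cache for the level below it.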

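1.2 Access latency

Within the hierarchy, the closer a level is to the CPU, the faster it is: an L1 cache access takes about 1-5 clock cycles, an L2 access about 12 cycles, and an L3 access about 30 cycles. In general, every CPU core has its own L1 and L2 caches, while L3 is shared by several cores. On a 2 GHz CPU one cycle takes 0.5 ns, so these latencies come to roughly 0.5-2.5 ns, 6 ns, and 15 ns respectively.

1.3 Inspecting the caches on Linux

L1, L2, and L3 sizes differ from CPU to CPU. On Linux they can be queried as follows: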

[root@localhost ~]# tree /sys/devices/system/cpu/cpu0/cache
/sys/devices/system/cpu/cpu0/cache
├── index0
│   ├── coherency_line_size
│   ├── id
│   ├── level
│   ├── number_of_sets
│   ├── physical_line_partition
│   ├── shared_cpu_list
│   ├── shared_cpu_map
│   ├── size
│   ├── type
│   └── ways_of_associativity

[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/type
Data
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity
8
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list
0
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/physical_line_partition
1
[root@localhost ~]# cat /sys/devices/system/cpu/cpu0/cache/index0/number_of_sets
64

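Notes:
size: the cache size, 32K here.
coherency_line_size: the cache line size, which is also the alignment unit; 64 bytes here.
number_of_sets: the number of sets in this cache (index0).
ways_of_associativity: the number of lines in each set.
shared_cpu_list: the CPUs that share this cache.
type: Data means a data cache; the other values are Instruction (instruction cache) and Unified (holds both instructions and data).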
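The sysfs values above are consistent with each other: 64 sets x 8 ways x 64 bytes per line gives the reported 32K. The sketch below checks that arithmetic and shows how an address would pick a set in a simple set-associative cache. The function names are made up for illustration, and the modulo indexing is the textbook scheme, assumed here rather than taken from this CPU's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total capacity of a set-associative cache, in bytes.
 * Parameter names mirror the sysfs files shown above. */
size_t cache_size_bytes(size_t number_of_sets,
                        size_t ways_of_associativity,
                        size_t coherency_line_size)
{
    return number_of_sets * ways_of_associativity * coherency_line_size;
}

/* Textbook set selection (an assumption, not vendor documentation):
 * drop the byte offset within the line, then wrap around the sets. */
size_t cache_set_index(uintptr_t addr, size_t line_size, size_t number_of_sets)
{
    assert(line_size > 0 && number_of_sets > 0);
    return (size_t)(addr / line_size) % number_of_sets;
}
```

With the numbers from this machine, cache_size_bytes(64, 8, 64) is 32768. Under the assumed indexing, addresses that are 4096 bytes apart (64 sets x 64 bytes) land in the same set and compete for its 8 lines.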

A simpler way to query the same information:

[root@localhost ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
Stepping:              3
CPU MHz:               3106.417
CPU max MHz:           3500.0000
CPU min MHz:           800.0000
BogoMIPS:              6186.09
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3

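2 Cache and memory

2.1 Advantages and limitations of caching

As shown above, the L1 and L2 caches store data in units of cache lines, and a cache line is usually 64 bytes. Each fill therefore brings in a whole 64-byte line, which lets the cache exploit spatial locality. It also creates the false sharing problem: when two independent data items are smaller than 64 bytes and end up in the same cache line, an update to one of them by one core invalidates the whole line in the other cores' caches, including the item that did not change.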
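A common way to avoid false sharing is to give each independently updated item its own cache line. The sketch below is a minimal illustration under two assumptions: the line size is 64 bytes, and the struct starts on a line boundary. The struct and function names are hypothetical, and C11 is required for alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumed; matches coherency_line_size above */

/* Two counters meant for two different threads. They sit 8 bytes apart
 * on typical 64-bit Linux, so they normally share one cache line. */
struct counters_shared {
    long a;
    long b;
};

/* Same counters, each forced onto its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE_SIZE) long a;
    alignas(CACHE_LINE_SIZE) long b;
};

/* Returns 1 if two byte offsets fall in the same cache line,
 * assuming the enclosing object starts on a line boundary. */
int same_cache_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE_SIZE == off2 / CACHE_LINE_SIZE;
}
```

The padded layout costs memory (128 bytes instead of 16 here), so it is usually reserved for data that different cores write frequently.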

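Another issue arises when several CPU cores share the same data: the copies must be kept coherent. When one core modifies the data, the copies held by the other cores must be invalidated, which causes cache misses on those cores. The industry relies on the MESI protocol to keep caches coherent.

2.2 Mapping between memory and cache

A running program accesses virtual addresses, which the MMU translates into physical memory addresses. If the PTE (page table entry) is not in the cache, it must first be brought into the cache from memory, which costs tens to hundreds of clock cycles. If the PTE is in the L1 cache, the cost drops to 1-2 cycles. For this reason, modern MMUs include a small cache of PTEs called the TLB (translation lookaside buffer). A data request from the CPU proceeds as follows:
1) The CPU generates a virtual address.
2) The MMU fetches the PTE from the TLB.
3) The MMU uses the PTE to translate the virtual address into a physical address.
4) The cache or main memory returns the requested data to the CPU.
If the TLB misses, the MMU must fetch the PTE from the L1 cache (or from a lower level if it is not there) and store it in the TLB, possibly overwriting an existing entry.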
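The translation steps can be sketched as a toy model. Everything here is an assumption made for illustration: 4 KiB pages, a 4-entry direct-mapped TLB, and a 16-entry flat page table stored in a plain array. Real MMUs do this in hardware with multi-level page tables.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                 /* 4 KiB pages (assumed) */
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define TLB_ENTRIES 4u                  /* tiny direct-mapped TLB, illustration only */
#define NUM_PAGES   16u                 /* size of the toy page table */

struct tlb_entry {
    int      valid;
    uint32_t vpn;   /* virtual page number */
    uint32_t pfn;   /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES];  /* toy page table: vpn -> pfn */
static unsigned tlb_hits, tlb_misses;

/* Translate a virtual address to a physical address. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    assert(vpn < NUM_PAGES);
    if (e->valid && e->vpn == vpn) {
        tlb_hits++;                     /* PTE found in the TLB */
    } else {
        tlb_misses++;                   /* fetch the PTE and overwrite the slot */
        e->valid = 1;
        e->vpn   = vpn;
        e->pfn   = page_table[vpn];
    }
    return (e->pfn << PAGE_SHIFT) | offset;
}
```

In this model, two accesses to the same page cost one TLB miss followed by one hit, which is why code that stays within a few pages at a time also benefits the TLB.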

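3 Using the cache to optimize program performance

The CPU accesses the cache much faster than main memory, roughly 100 times faster, so code should make good use of the cache to improve performance. How? By making the program follow the principle of locality as far as possible. The principle has two parts. The first is temporal locality: a variable accessed recently is likely to be accessed again within a short time. The second is spatial locality: once a program accesses an address, nearby addresses are likely to be accessed soon.

For example, repeatedly using the same local variable inside a loop gives temporal locality; such variables are often kept in registers.
Iterating over an array in order gives spatial locality. In general, the smaller the loop body, the smaller the stride, and the more iterations the loop runs, the better the locality.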
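The effect of the stride can be tried with a small helper like the one below. It is a sketch with a hypothetical name: it adds up every element exactly once, but visits them stride elements apart. A stride of 1 walks memory sequentially; larger strides touch a different cache line on almost every access once the stride exceeds the line size.

```c
#include <assert.h>
#include <stddef.h>

/* Sum all n elements of data, visiting them `stride` elements apart.
 * The result does not depend on stride; only the access order does. */
long sum_with_stride(const int *data, size_t n, size_t stride)
{
    long total = 0;   /* reused on every iteration: temporal locality */

    assert(stride > 0);
    for (size_t start = 0; start < stride; start++) {
        for (size_t i = start; i < n; i += stride) {
            total += data[i];   /* stride 1: neighbours share a cache line */
        }
    }
    return total;
}
```

Timing this function on an array much larger than the L3 cache, with stride 1 and then with a stride of 16 or more ints (64 bytes), is one way to observe the difference on a given machine.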

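4 Measuring a program's cache utilization

Take the following small program as an example and compare the performance of code that follows the principle of locality with code that does not.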

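#include <stdio.h>
#include <time.h>

#define SIZE  2048

// Elapsed processor time between t1 and t2, in milliseconds.
long timediff(clock_t t1, clock_t t2)
{
    // CLOCKS_PER_SEC is the number of clock ticks per second.
    return (long)((double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
}

int main(int argc, char **argv)
{
    // Allocated on the stack (4 MiB); a much larger SIZE would overflow the stack.
    char arrs[SIZE][SIZE];
    // clock() returns the processor time used by the program so far.
    clock_t start = clock();
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            // arrs[i][j] = 0;   // row order: consecutive addresses
            arrs[j][i] = 0;      // column order: each store is SIZE bytes from the last
        }
    }
    clock_t end = clock();
    // Elapsed time in ms. arrs is never read, so an optimizing compiler
    // may remove the stores entirely.
    printf("\nCost time:%ld  \n", timediff(start, end));
    return 0;
}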

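With the commented-out statement, arrs[i][j] = 0;, the run takes 10 ms. With the active statement, arrs[j][i] = 0;, it takes 50 ms, a fivefold difference. How do we know the gain comes from cache locality? We can compare the two runs with the perf tool. The fast case: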

[root@localhost testcode]# perf stat -e cache-references,cache-misses,instructions,cycles,L1-dcache-load-misses,L1-dcache-loads ./a.out

Cost time:10

 Performance counter stats for './a.out':

            33,108      cache-references
            10,248      cache-misses              #   30.953 % of all cache refs
        55,486,651      instructions              #    1.46  insn per cycle
        37,963,618      cycles
           112,745      L1-dcache-load-misses     #    0.62% of all L1-dcache hits
        18,164,912      L1-dcache-loads

       0.011468219 seconds time elapsed

       0.010431000 seconds user
       0.001043000 seconds sys

The slow case:

[root@localhost testcode]# perf stat -e cache-references,cache-misses,instructions,cycles,L1-dcache-load-misses,L1-dcache-loads ./a.out

Cost time:50

 Performance counter stats for './a.out':

         4,228,333      cache-references
            31,962      cache-misses              #    0.756 % of all cache refs
        55,664,205      instructions              #    0.31  insn per cycle
       177,039,427      cycles
         4,306,660      L1-dcache-load-misses     #   23.64% of all L1-dcache hits
        18,215,744      L1-dcache-loads

       0.051881676 seconds time elapsed

       0.050861000 seconds user
       0.000997000 seconds sys

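Notes: ① cache-references is the total number of cache accesses counted by this event, and ② cache-misses is the number of those accesses that missed; ② / ① is the cache miss rate. On many CPUs perf maps these two events to the last-level cache, so they mostly count accesses that already missed in the levels above it.

③ L1-dcache-load-misses is the number of loads that missed the L1 data cache, and ④ L1-dcache-loads is the total number of L1 data cache loads; ③ / ④ is the L1 miss rate. The first run has a far better L1 hit rate than the second (a miss rate of 0.62% versus 23.64%), and the second run issues about 4.2 million cache-references against about 33 thousand in the first. The ② / ① ratio is actually higher in the first run (30.953% versus 0.756%), but it is computed over far fewer references. The first run also executes many more instructions per clock cycle: 1.46 versus 0.31.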
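The percentages that perf prints next to each counter can be recomputed from the raw counts, which is handy when comparing runs in a script. The helpers below are a small sketch; the function names are made up for illustration, and the inputs are the counter values from the two runs above.

```c
#include <assert.h>

/* part / whole as a percentage, e.g. misses / references. */
double ratio_percent(unsigned long long part, unsigned long long whole)
{
    assert(whole != 0);
    return 100.0 * (double)part / (double)whole;
}

/* Instructions retired per clock cycle (the "insn per cycle" column). */
double insn_per_cycle(unsigned long long instructions, unsigned long long cycles)
{
    assert(cycles != 0);
    return (double)instructions / (double)cycles;
}
```

For the fast run, ratio_percent(112745, 18164912) is about 0.62 and insn_per_cycle(55486651, 37963618) is about 1.46. For the slow run, ratio_percent(4306660, 18215744) is about 23.64 and insn_per_cycle(55664205, 177039427) is about 0.31.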