Introduction to the STREAM Memory Bandwidth Benchmark and Its Internals

路边闲人2

Category: linux. Tags: STREAM, benchmark, risc-v, gcc

Article link: https://blog.csdn.net/v6543210/article/details/121608705

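lmbench also ships a STREAM test (https://github.com/keith-packard/lmbench3), but that version is rather old.

This article uses the newer version 5.10 (2013/01/17). The source code is at:

GitHub - jeffhammond/STREAM: STREAM benchmark

On Intel platforms, Intel's own memory test tool can be used instead:

Intel® Memory Latency Checker v3.9a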

I. Build commands

gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=2000000 stream.c -o stream.2M
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=20000000 stream.c -o stream.20M

#risc-v platform 
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream.80M -mexplicit-relocs

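1. -fopenmp
Enables OpenMP so that the benchmark runs on multiple processors. When enabled, the program defaults to one thread per CPU hardware thread. The thread count can also be set at run time; in the example below, 12 is a user-chosen number of processors to use.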
export OMP_NUM_THREADS=12
2. -DSTREAM_ARRAY_SIZE
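The comments in stream.c explain how to choose this value. As an example, the u540 has a 2 MB L2 cache. A double occupies 8 bytes, so each array occupies STREAM_ARRAY_SIZE * 8 bytes.

Each array must be larger than 4 times the cache size, that is:

STREAM_ARRAY_SIZE * 8B > 4 * 2MB

So STREAM_ARRAY_SIZE must be at least 1M. The STREAM author does not distinguish between 1M as 10^6 and 1M as 2^20, because 4 times the cache size is already far larger than the cache and ensures that accesses go to main memory rather than to the cache. This is a minimum, and a somewhat larger value is fine. A larger array lengthens the test and also helps each test last at least 20 clock ticks.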

The comment in the source code that explains STREAM_ARRAY_SIZE is reproduced below:
* 1) STREAM requires different amounts of memory to run on different
* systems, depending on both the system cache size(s) and the
* granularity of the system timer.
* You should adjust the value of 'STREAM_ARRAY_SIZE' (below)
* to meet *both* of the following criteria:
* (a) Each array must be at least 4 times the size of the
* available cache memory. I don't worry about the difference
* between 10^6 and 2^20, so in practice the minimum array size
* is about 3.8 times the cache size.
* Example 1: One Xeon E3 with 8 MB L3 cache
* STREAM_ARRAY_SIZE should be >= 4 million, giving
* an array size of 30.5 MB and a total memory requirement
* of 91.5 MB.
* Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
* STREAM_ARRAY_SIZE should be >= 20 million, giving
* an array size of 153 MB and a total memory requirement
* of 458 MB.
* (b) The size should be large enough so that the 'timing calibration'
* output by the program is at least 20 clock-ticks.
* Example: most versions of Windows have a 10 millisecond timer
* granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds.
* If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
* This means the each array must be at least 1 GB, or 128M elements.

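3. -DNTIMES
NTIMES is the number of times each kernel is executed; the default is 10. After all runs finish, the best result is reported. The first iteration is excluded from the final statistics.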

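4. -mexplicit-relocs
When STREAM_ARRAY_SIZE is too large, the gcc build fails, and this option is needed to change how the code is laid out at link time.
See:

RISC-V Options (Using the GNU Compiler Collection (GCC))

On x64, the build may also fail when STREAM_ARRAY_SIZE is large (possibly above roughly 100M elements). In that case, add the gcc option -mcmodel=large.

Reference: x86 Options (Using the GNU Compiler Collection (GCC))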

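II. Test results on the u540

The table below shows the results on the u540. When STREAM_ARRAY_SIZE is very small, tests 1 and 2 report a much higher memory bandwidth because the arrays fit in the cache. Once STREAM_ARRAY_SIZE reaches 2M or more, the measured result settles at about 1300 MB/s.

Table 4.1 STREAM results on the u540

Test no.	STREAM_ARRAY_SIZE (elements)	Measured memory bandwidth (MB/s)	Notes
1	20K	4034.6	Served from cache
2	100K	2307.7	Served from cache
3	2M	1272.8	Results from here on are close to each other
4	20M	1314.1
5	80M	1258.8
6	130M	1295.7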

III. Sample test output

root@freedom-u540:~/work/STREAM-master# ./stream.20M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 20000000 (elements), Offset = 0 (elements)
Memory per array = 152.6 MiB (= 0.1 GiB).
Total memory required = 457.8 MiB (= 0.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 170454 microseconds.
   (= 170454 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1455.5     0.222349     0.219856     0.229256
Scale:           1205.2     0.266443     0.265518     0.267343
Add:             1309.9     0.368384     0.366434     0.371048
Triad:           1314.1     0.367992     0.365281     0.371307

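IV. How STREAM works internally

Avoiding cache effects

When the CPU needs to read a piece of data, it first looks in the cache. On a hit, the data is read immediately and handed to the CPU. On a miss, the data is fetched from main memory at a much lower speed, and the block containing it is loaded into the cache so that later reads of that block can be served from the cache without going back to memory.

STREAM uses STREAM_ARRAY_SIZE to define arrays that are far larger than the cache, and reads or writes them sequentially. Because the data volume far exceeds the cache capacity, the cache has little influence on the measured result.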

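The four operations

STREAM performs four operations on memory: Copy, Scale, Add and Triad.

Copy is the simplest: it reads the value of one memory location and writes it to another. Scale reads a value from memory, multiplies it, and writes the result to another location. Add reads two values from memory, adds them, and writes the result to another location. Triad means a group of three; in this benchmark it combines the Copy, Scale and Add operations into one test. Concretely, it reads two values a and b from the arrays, performs a fused multiply and add on them (a + factor * b), and writes the result to another memory location.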

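Bytes read and written per loop iteration, assuming 8-byte double elements:

Operation	Reads	Writes	Bytes per iteration
Copy	1	1	16
Scale	1	1	16
Add	2	1	24
Triad	2	1	24

The reported bandwidth of an operation is its bytes per iteration multiplied by STREAM_ARRAY_SIZE, divided by the best (minimum) time.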

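The general pattern in the results is Add > Triad > Copy > Scale. One Add iteration accesses memory three times (two reads and one write), and Triad also needs three accesses, while Copy and Scale need only two. The more memory accesses per iteration, the higher the reported bandwidth. On the other hand, the more floating-point work per iteration, the longer each iteration takes, which lengthens the whole loop and lowers the bandwidth. Add is simple and makes many accesses, so it tends to score highest; Scale does more arithmetic with fewer accesses, so it tends to score lowest. This ordering is not universal: in the u540 output shown earlier, Copy is the fastest (1455.5 MB/s), followed by Triad, Add and Scale.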