Xilinx Zynq UltraScale+ Performance Testing: Memory STREAM

justdemo

Published 2021-05-08 21:35:53

Article link: https://blog.csdn.net/justdemo/article/details/116544534

John McCalpin “Memory Bandwidth and Machine Balance in High
Performance Computers”, IEEE TCCA Newsletter, December 1995
http://www.cs.virginia.edu/stream/

On a Xilinx 4EV, the stream test from the bundled LMBENCH gives the following results:

# stream
STREAM copy latency: 3.84 nanoseconds
STREAM copy bandwidth: 4168.29 MB/sec
STREAM scale latency: 7.07 nanoseconds
STREAM scale bandwidth: 2261.80 MB/sec
STREAM add latency: 10.24 nanoseconds
STREAM add bandwidth: 2343.75 MB/sec
STREAM triad latency: 12.64 nanoseconds
STREAM triad bandwidth: 1899.34 MB/sec

The four kernels are defined as follows:

STREAM: measure memory bandwidth with the operations:
– Copy: a(i) = b(i)
– Scale: a(i) = s * b(i)
– Add: a(i) = b(i) + c(i)
– Triad: a(i) = b(i) + s * c(i)
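
The four kernels can be sketched in a few lines. This is an illustration in Python only (the real benchmarks are C code), and the helper names are hypothetical. It assumes 8-byte double elements, so Copy and Scale touch 16 bytes per element and Add and Triad touch 24. Under that assumption, dividing bytes per element by the per-element latency reported by LMBENCH reproduces its bandwidth figures to within about 0.1%.

```python
ELEM_BYTES = 8  # assumption: double-precision elements

def stream_copy(b):
    return list(b)                                # a(i) = b(i)

def stream_scale(b, s):
    return [s * x for x in b]                     # a(i) = s * b(i)

def stream_add(b, c):
    return [x + y for x, y in zip(b, c)]          # a(i) = b(i) + c(i)

def stream_triad(b, c, s):
    return [x + s * y for x, y in zip(b, c)]      # a(i) = b(i) + s * c(i)

# Arrays touched per element: one store plus one or two loads.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}

def bandwidth_mb_s(kernel, latency_ns):
    """MB/s (10^6 bytes/s) implied by a per-element latency in nanoseconds."""
    bytes_per_elem = ARRAYS_TOUCHED[kernel] * ELEM_BYTES
    return bytes_per_elem / (latency_ns * 1e-9) / 1e6

# Latencies copied from the LMBENCH output above.
lmbench_latency_ns = {"copy": 3.84, "scale": 7.07, "add": 10.24, "triad": 12.64}
for kernel, latency in lmbench_latency_ns.items():
    print(f"{kernel:5s} {bandwidth_mb_s(kernel, latency):8.1f} MB/s")
```

The small differences from the reported bandwidths are consistent with the latencies being printed to only two decimal places.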

Other LMBENCH results:

# mhz
1199 MHz, 0.8340 nanosec clock
# tlb
tlb: 10 pages
# par_ops
integer bit parallelism: 2.65
integer add parallelism: 1.82
integer div parallelism: 1.00
integer mod parallelism: 2.27
int64 bit parallelism: 1.24
int64 add parallelism: 1.82
int64 div parallelism: 1.00
int64 mod parallelism: 1.93
float add parallelism: 7.86
float mul parallelism: 7.90
float div parallelism: 1.30
double add parallelism: 7.86
double mul parallelism: 7.90
double div parallelism: 1.16
# lat_unix
AF_UNIX sock stream latency: 16.0409 microseconds

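I also downloaded stream.c and built it myself.
Compile: aarch64-linux-gnu-gcc -O -fopenmp -DNTIME=20 -DSTREAM_ARRAY_SIZE=120000000 stream.c -o stream
(Note: stream.c reads the macro NTIMES, not NTIME, so -DNTIME=20 has no effect and the default of 10 iterations is used.)
Copy libgomp.so.1 onto the target system. The system has 4 GByte of memory in total.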
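
As a quick sanity check on the chosen array size, the memory footprint can be estimated before running. This is a minimal sketch in Python that assumes the default 8-byte element type and reads "4 GByte" as 4 GiB. STREAM allocates three arrays (a, b, c).

```python
STREAM_ARRAY_SIZE = 120_000_000   # from -DSTREAM_ARRAY_SIZE
ELEM_BYTES = 8                    # assumption: default element type is double
NUM_ARRAYS = 3                    # STREAM works on three arrays: a, b, c
MIB = 2 ** 20

per_array_mib = STREAM_ARRAY_SIZE * ELEM_BYTES / MIB
total_mib = NUM_ARRAYS * per_array_mib
system_mib = 4 * 1024             # assumption: 4 GByte means 4 GiB

print(f"per array: {per_array_mib:.1f} MiB")
print(f"total:     {total_mib:.1f} MiB")
print(f"fraction of system memory: {total_mib / system_mib:.0%}")
```

The two sizes should agree with the "Memory per array" and "Total memory required" lines that STREAM prints at startup. The arrays take roughly two thirds of the system memory, which leaves room for the OS while being far larger than any cache.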
The output is as follows:

# OMP_NUM_THREADS=4 ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 120000000 (elements), Offset = 0 (elements)
Memory per array = 915.5 MiB (= 0.9 GiB).
Total memory required = 2746.6 MiB (= 2.7 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 279443 microseconds.
   (= 279443 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            8622.3     0.224413     0.222678     0.226238
Scale:           7985.3     0.243725     0.240443     0.249241
Add:             6930.9     0.419123     0.415530     0.424046
Triad:           6526.6     0.442527     0.441268     0.443724
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
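
The "Best Rate" column follows from the "Min time" column and the number of bytes each kernel moves. The sketch below, in Python with hypothetical names, assumes the accounting described in the STREAM documentation: Copy and Scale count two arrays per pass, Add and Triad count three, and MB means 10^6 bytes.

```python
STREAM_ARRAY_SIZE = 120_000_000
ELEM_BYTES = 8  # "This system uses 8 bytes per array element."
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" values in seconds, copied from the run above.
min_time_s = {"Copy": 0.222678, "Scale": 0.240443,
              "Add": 0.415530, "Triad": 0.441268}

def best_rate_mb_s(kernel):
    """Bytes moved in one pass divided by the fastest pass, in 10^6 bytes/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * ELEM_BYTES * STREAM_ARRAY_SIZE
    return 1e-6 * bytes_moved / min_time_s[kernel]

for kernel in min_time_s:
    print(f"{kernel:6s} {best_rate_mb_s(kernel):8.1f} MB/s")
```

The computed rates agree with the reported column to within rounding of the printed times.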

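Compiling with -O2 instead of -O changes the results somewhat. In the run below, Copy and Add are faster, while Scale and Triad are slower: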

# OMP_NUM_THREADS=4 ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 120000000 (elements), Offset = 0 (elements)
Memory per array = 915.5 MiB (= 0.9 GiB).
Total memory required = 2746.6 MiB (= 2.7 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 278242 microseconds.
   (= 278242 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9413.4     0.205417     0.203965     0.207202
Scale:           7896.5     0.248300     0.243145     0.253207
Add:             7482.3     0.385689     0.384909     0.387379
Triad:           6004.8     0.481600     0.479614     0.483516
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
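
To make the comparison between the two builds concrete, the relative change per kernel can be computed from the two "Best Rate" columns. This is a small Python sketch with hypothetical names. Both sets of numbers come from single runs, so small differences may be within run-to-run noise.

```python
# Best Rate in MB/s, copied from the two runs above.
rate_o1 = {"Copy": 8622.3, "Scale": 7985.3, "Add": 6930.9, "Triad": 6526.6}
rate_o2 = {"Copy": 9413.4, "Scale": 7896.5, "Add": 7482.3, "Triad": 6004.8}

def percent_change(kernel):
    """Relative change of the -O2 rate against the -O rate, in percent."""
    return 100.0 * (rate_o2[kernel] - rate_o1[kernel]) / rate_o1[kernel]

for kernel in rate_o1:
    print(f"{kernel:6s} {percent_change(kernel):+6.1f} %")
```

Copy and Add gain roughly 8 to 9 percent, Scale is about 1 percent lower, and Triad is about 8 percent lower.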

With 4 threads the CPU is at 100%. A single thread uses 25% of the CPU, and the bandwidth drops by about half.

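A 4K60 video stream carries 3840 * 2160 * 60 = 497,664,000 pixels per second (about 475 M in units of 2^20).
With YUV422 at 16 bits = 2 bytes per pixel, copying the stream once needs about 950 MByte/s of bandwidth.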
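
The arithmetic can be written out as a short Python sketch. The comparison at the end is an assumption on my part rather than something measured here: it treats a memory-to-memory copy as one read plus one write, which is how STREAM's Copy figure counts bytes, and compares that against the 4-thread -O2 Copy rate reported above.

```python
WIDTH, HEIGHT, FPS = 3840, 2160, 60
BYTES_PER_PIXEL = 2                 # YUV422 at 16 bits per pixel

pixels_per_s = WIDTH * HEIGHT * FPS
bytes_per_s = pixels_per_s * BYTES_PER_PIXEL
mib_per_s = bytes_per_s / 2 ** 20   # binary units, as used in the text
mb_per_s = bytes_per_s / 1e6        # decimal units, as used by STREAM

# Assumption: one copy = read + write, matching STREAM's Copy accounting.
copy_traffic_mb_s = 2 * mb_per_s
measured_copy_mb_s = 9413.4         # 4-thread, -O2 Copy rate from above
share = copy_traffic_mb_s / measured_copy_mb_s

print(f"{pixels_per_s} pixels/s, {mib_per_s:.1f} MiB/s, {mb_per_s:.1f} MB/s")
print(f"one copy uses about {share:.0%} of the measured Copy bandwidth")
```

Note that the 475 M and 950 MByte/s figures use binary units (2^20), while STREAM reports decimal MB/s, so the two should not be compared directly without converting.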
