Performance Measurement on ARM
After working mostly with ARM processors in the 200 to 400 MHz range in lots of Embedded Linux projects over the last few years, we have seen an interesting development in the market recently:
- ARM CPUs, long known for their low power consumption, are becoming faster and faster (examples: OMAP3/Beagleboard, MX51/MX53).
- x86, long known for its high computing performance, is becoming more and more SoC-like, power-friendly and slower.
If you read the marketing material from the chip manufacturers, it sounds as if ARM were the next x86 (in terms of performance) and x86 the next ARM (in terms of power consumption). But where do we stand today? How fast are modern ARM derivatives?
The Pengutronix kernel team wanted to know, so we measured in order to get some real numbers. Here are the results, and they raise some interesting questions. Don't take the "observations" below too scientifically; I try to sum up the results in short claims.
As ARM is explicitly a low-power architecture, it would have been interesting to measure some "performance vs. power consumption" data. However, as we ran our experiments on board-level products, this couldn't be done: some manufacturers put more peripheral chips on their modules than others, so we would only have measured the effects of the board BOMs.
Test Hardware
In order to find out more about the real speed of today's hardware, we collected some typical industrial devices in our lab. This is the list of boards we benchmarked:
Test Hardware | CPU | Freq. | Core | RAM | Kernel |
---|---|---|---|---|---|
phyCORE-PXA270 | PXA270 (Marvell) | 520 MHz | XScale (ARMv5) | SDRAM | 2.6.34 |
phyCORE-i.MX27 | MX27 (Freescale) | 400 MHz | ARM926 (ARMv5) | DDR | 2.6.34 |
phyCORE-i.MX35 | MX35 (Freescale) | 532 MHz | ARM1136 (ARMv6) | DDR2 | 2.6.34 |
O3530-PB-1452 | OMAP3530 (Texas Instruments) | 500 MHz | Cortex-A8 (ARMv7) | DDR | 2.6.34 |
Beagleboard C3 | OMAP3530 (Texas Instruments) | 500 MHz | Cortex-A8 (ARMv7) | DDR | 2.6.34 |
phyCORE-Atom | Z510 (Intel) | 1100 MHz | Atom | DDR2 | 2.6.34 |
How fast are these boards? Yours truly assumed that the order in the table above more or less lists the systems in ascending performance order: the PXA270 is a platform from the past, the MX27 represents the current generation of bus-matrix-optimized ARM9s, the ARM11 should be the next step up, the Cortex-A8 appears to be the next killer platform, and the Atom would probably be an order of magnitude above that.
So let's look at what we've measured.
Benchmarks
Explanatory note: In the following charts, the "error" bars (sometimes barely visible) do not show a statistical error but the range between the minimum and maximum values of ten benchmark cycles, while the bar height shows the arithmetic mean.
Floating Point Multiplication (lat_ops)
This benchmark (lat_ops, see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_ops&section=8) measures the time for a floating point multiplication. It is meant as an indication of computational power and is heavily influenced by whether a SoC has a hardware floating point unit or not. Here are the results:
(Chart: floating point multiplication latency, lat_ops)
The PXA270 and i.MX27 both have no hardware floating point unit, so the difference between the plots seems to directly reflect the different CPU clock speed.
An interesting observation is that the MX35 (ARM1136, 532 MHz) is faster than the OMAPs (Cortex-A8, 500 MHz). The clock frequency differs by only 6%, whereas the MX35 is about 25% faster.
Observation 1: Even when scaled to the same frequency, the ARM11 is faster than the Cortex-A8!
Observation 2: The Atom needs 4.5 ns where the MX35 needs 15 ns; it runs at about twice the MX35's clock frequency but needs only one third of the time.
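For illustration only, here is a minimal sketch of the kind of measurement lat_ops performs for "float mul": a timed chain of dependent float multiplications, with the mean time per operation printed at the end. It is not the lmbench source; the iteration count, the constant and the volatile trick are arbitrary choices for this example, and the load/store caused by volatile slightly inflates the per-operation time.

```c
/* Sketch only: time a chain of dependent float multiplications. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    volatile float r = 1.000001f;   /* volatile keeps gcc from optimizing the loop away */
    const long iterations = 10000000;
    struct timeval start, end;
    long i;
    double ns;

    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++)
        r = r * 1.000001f;
    gettimeofday(&end, NULL);

    ns = (end.tv_sec - start.tv_sec) * 1e9 +
         (end.tv_usec - start.tv_usec) * 1e3;
    printf("float mul: %.2f ns per operation (r = %f)\n",
           ns / iterations, r);
    return 0;
}
```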
Memory Bandwidth (bw_mem)
We measure the memory transfer speed with the bw_mem benchmark (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=bw_mem&section=8).
(Chart: memory bandwidth, bw_mem)
Observation 3: There is a factor of 2 between the PXA270 and the MX27/MX35.
Observation 4: The OMAP is twice as fast as the i.MX ARM9/ARM11 parts.
Observation 5: The Atom is still 2.4 times faster than the OMAP, at 2.2 times the clock rate.
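As a rough illustration of the idea behind a "rd"-style bandwidth measurement, the following sketch walks over a 32 MiB buffer (the same 33554432 bytes used in the bw_mem command line further down) and reports MB/s. It is not the lmbench implementation; the 64-byte stride simply assumes one access per cache line.

```c
/* Sketch only: read one byte per assumed cache line of a 32 MiB buffer. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUFSIZE (32 * 1024 * 1024)

int main(void)
{
    unsigned char *buf = malloc(BUFSIZE);
    struct timeval start, end;
    unsigned long sum = 0;
    size_t i;
    double seconds;

    if (!buf)
        return 1;
    memset(buf, 1, BUFSIZE);        /* touch all pages so they are really mapped */

    gettimeofday(&start, NULL);
    for (i = 0; i < BUFSIZE; i += 64)
        sum += buf[i];              /* pull in one cache line per access */
    gettimeofday(&end, NULL);

    seconds = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1e6;
    printf("read: %.1f MB/s (checksum %lu)\n",
           BUFSIZE / seconds / 1e6, sum);
    free(buf);
    return 0;
}
```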
Context Switching Time (lat_ctx)
An important indicator of system speed is the time needed to switch the CPU context. The lat_ctx benchmark (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_ctx&section=8) measures the context switching time; the number and size of the processes to be tested can be configured. The processes are started, each reads a token from a pipe, performs a certain amount of work and passes the token on to the next process.
(Chart: context switching time, lat_ctx)
Observation 6: This shows impressively how slow the PXA is: a factor of 40 compared to the Atom, and still a factor of 3 compared to the ARM926.
Observation 7: The MX35/ARM1136 is almost as fast as the Cortex-A8. I would have expected the newer Cortex to be much faster, somewhere between the ARM11 and the Atom. Instead, the Cortex is still three times slower than the Atom, albeit at half the clock rate.
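The following simplified sketch shows the underlying idea: two processes bounce a one-byte token through a pair of pipes, so every hop forces a context switch. The real lat_ctx uses a configurable ring of processes with per-process working sets and compensates for the pipe overhead; none of that is modelled here, and the round count is arbitrary.

```c
/* Sketch only: two processes ping-pong a token through pipes. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>

int main(void)
{
    int ping[2], pong[2];
    const int rounds = 10000;
    char token = 'x';
    struct timeval start, end;
    pid_t pid;
    int i;
    double usec;

    if (pipe(ping) < 0 || pipe(pong) < 0)
        return 1;

    pid = fork();
    if (pid < 0)
        return 1;
    if (pid == 0) {                       /* child: echo the token back */
        for (i = 0; i < rounds; i++) {
            if (read(ping[0], &token, 1) != 1)
                _exit(1);
            if (write(pong[1], &token, 1) != 1)
                _exit(1);
        }
        _exit(0);
    }

    gettimeofday(&start, NULL);
    for (i = 0; i < rounds; i++) {        /* parent: send token, wait for echo */
        write(ping[1], &token, 1);
        read(pong[0], &token, 1);
    }
    gettimeofday(&end, NULL);
    wait(NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
    /* one round trip is roughly two context switches */
    printf("%.2f microseconds per context switch\n", usec / (2.0 * rounds));
    return 0;
}
```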
Syscall Performance (lat_syscall)
In order to estimate the performance of calling operating system functionality, we measured the syscall latency with lat_syscall (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_syscall&section=8). The benchmark performs an open() and a close() on a 1 MB random data file located in a ramdisk (tmpfs), accessing the file with a relative path (absolute paths seem to give different results). The combined time for both operations, one after the other, is measured.
(Chart: syscall latency, lat_syscall)
Observation 8: The PXA isn't too bad when it comes to syscalls.
Observation 9: The Cortex-A8 and the ARM11 are almost identically fast.
Observation 10: Even between the OMAP/ARM11 and the Atom there is only a factor of 1.8.
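A minimal sketch of the open()/close() timing idea (not the lmbench implementation) could look like this; it assumes the test.pattern file created in the lat_syscall command line further down already exists on /tmp and accesses it via a relative path, and the iteration count is an arbitrary choice:

```c
/* Sketch only: time repeated open()/close() pairs on a tmpfs file. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

int main(void)
{
    const int rounds = 100000;
    struct timeval start, end;
    double usec;
    int i, fd;

    if (chdir("/tmp") < 0)      /* tmpfs; access the file via a relative path */
        return 1;

    gettimeofday(&start, NULL);
    for (i = 0; i < rounds; i++) {
        fd = open("test.pattern", O_RDONLY);
        if (fd < 0)
            return 1;
        close(fd);
    }
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
    printf("open+close: %.3f microseconds\n", usec / rounds);
    return 0;
}
```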
Process Forking (lat_proc)
The lat_proc benchmark (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_proc&section=8) forks processes and measures the time needed to do so.
(Chart: process forking time, lat_proc)
Observation 11: The ARM11 is even better than the Cortex-A8! I had expected the newer Cortex to perform better here.
Observation 12: The Atom is 3 times as fast, at 2 times the clock frequency.
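Conceptually, the "fork" part can be sketched as follows: fork() a child that exits immediately, let the parent reap it, and average over many iterations. This is not the lmbench code, and the iteration count is an arbitrary choice.

```c
/* Sketch only: average time to fork a child that exits immediately. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>

int main(void)
{
    const int rounds = 1000;
    struct timeval start, end;
    double usec;
    pid_t pid;
    int i;

    gettimeofday(&start, NULL);
    for (i = 0; i < rounds; i++) {
        pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0)
            _exit(0);               /* child: exit immediately */
        waitpid(pid, NULL, 0);      /* parent: reap the child */
    }
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
    printf("fork+exit: %.1f microseconds per fork\n", usec / rounds);
    return 0;
}
```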
Specifications
Kernel and Configs
An attempt was made to use kernel 2.6.34 uniformly on all targets. While optimizing the configurations, care was taken to always set the following config options (a matching .config fragment is shown after the list):
- Tree-based hierarchical RCU
- Preemptible Kernel (Low-Latency Desktop)
- Choose SLAB allocator (SLAB)
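Expressed as .config symbols, these selections should correspond to roughly the following fragment; the symbol names are our reading of the 2.6.34-era Kconfig and are given for orientation only:

```
CONFIG_TREE_RCU=y
CONFIG_PREEMPT=y
CONFIG_SLAB=y
```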
Thumb mode was never used. Turning off NEON on the OMAP did not produce significantly different results. Using v5TE instead of v4T repeatedly gives worse results (not only on the OMAP), which we still do not quite understand. In any case, the figures published here for the OMAP were obtained with a Cortex-A8 toolchain.
LMbench command lines
lat_ops
root@target:~ lat_ops 2>&1
filtered by
grep "^float mul:" | cut -f3 -d" "
bw_mem
root@target:~ list="rd wr rdwr cp fwr frd bzero bcopy"; \ for i in $list; \ do echo -en "$i\t"; done; \ echo; \ for i in $list; \ do res=$(bw_mem 33554432 $i 2>&1 | awk "{print \$2}"); \ echo -en "$res\t"; done; \ echo MB/Sec
filtered by
awk "/rd\twr\trdwr\tcp\tfwr\tfrd\tbzero\tbcopy/ { getline; print \$3 }"
lat_ctx
root@target:~ list="0 4 8 16 32 64" amount="2 4 8 16 24 32 64 96"; \ for size in $list; do lat_ctx -s $size $amount 2>&1; \ done
filtered by
grep -A4 "^\"size=8k" | grep "^16" | cut -f2 -d" "
lat_syscall
root@target:~ list="null read write stat fstat open"; \ cd /tmp; \ dd if=/dev/urandom of=test.pattern bs=1024 count=1024 2>/dev/null; \ for i in $list; do echo -en "$i\t"; done; echo; \ for i in $list; do \ res=$(lat_syscall $i test.pattern 2>&1 | awk "{print \$3}"); \ echo -en "$res\t"; done; echo microseconds
filtered by
awk "/null\tread\twrite\tstat\tfstat\topen/ { getline; print \$6 }"
lat_proc
root@target:~ list="procedure fork exec shell"; \ cp /usr/bin/hello /tmp; \ for i in $list; do echo -en "$i\t"; done ;echo; \ for i in $list; do res=$(lat_proc $i 2>&1 | awk "{FS=\":\"} ; {print \$2}" \ | awk "{print \$1}"); echo -en "$res\t"; done ; echo microseconds
filtered by
awk "/procedure\tfork\texec\tshell/ { getline; print \$2 }"
Thinking about Caches
The influence of Linux caches is not much of an issue, as ensuring cold caches by directly preceding every lmbench invocation with
sync; echo 3 > /proc/sys/vm/drop_caches;
leads to only slightly (0% to 3%, depending on type of benchmark) worse figures, which has been verified on several targets.
GCC Flags
Here is an overview of the compiler variants used. The last column shows the floating point code emitted for the multiplication in do_float_mul, as seen in the output of objdump -d lat_ops.o.
CPU | Compiler | Floating Point Code |
---|---|---|
PXA270 | arm-iwmmx-linux-gnueabi-gcc | __aeabi_fmul |
MX27 | arm-v5te-linux-gnueabi-gcc | __aeabi_fmul |
MX35 | arm-1136jfs-linux-gnueabi-gcc | vmul.f32 |
OMAP-EVM | arm-cortexa8-linux-gnueabi-gcc | vmul.f32 |
OMAP-Beagle | arm-cortexa8-linux-gnueabi-gcc | vmul.f32 |
Atom Z510 | i586-unknown-linux-gnu-gcc | fmul |
As PTXdist is used as the build system, the gcc flags in effect are the same across all targets.
With LMbench's bw_mem as an example, the complete compiler command line in its original order is
-DPACKAGE_NAME=\"lmbench\" -DPACKAGE_TARNAME=\"lmbench\" -DPACKAGE_VERSION=\"trunk\" -DPACKAGE_STRING=\"lmbench\ trunk\" -DPACKAGE_BUGREPORT=\"bugs@pengutronix.de\" -DPACKAGE_URL=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"lmbench\" -DVERSION=\"trunk\" -I. -I../include -I../include -isystem /home/.../sysroot-target/include -isystem /home/.../sysroot-target/usr/include -W -Wall -O2 -DHAVE_uint -DHAVE_uint64_t -DHAVE_int64_t -DHAVE_socklen_t -DHAVE_DRAND48 -DHAVE_RAND -DHAVE_RANDOM -MT bw_mem.o -MD -MP -MF .deps/bw_mem.Tpo -c -o bw_mem.o bw_mem.c
Conclusion
These measurements are probably not completely scientifically rigorous. The intention was to give us a rough idea of how the systems perform.
We expected the Cortex-A8 to be an order of magnitude faster than the ARM11. This doesn't seem to be the case: only the memory bandwidth is much higher, while most of the other benchmarks show almost the same values. It is currently completely unclear to us where the performance gain we expected from an ARMv7 core over an ARMv6 core went.
There seems to be a pattern that, at double the clock frequency, the Atom is often three times faster than the ARM11/Cortex-A8.
Feedback
Do you have any remarks, ideas about the observed effects, or other things you might want to tell us? We want to improve this article with the help of the community, so please send your feedback to the mail address in the box below.
Thanks to ... | ... for ... |
---|---|
Juergen Beisert | spelling fixes |
Jochen Frieling | all the measurements |
Andreas Gajda | spelling fixes |
Martin Guy | spelling fixes |
Marc Kleine-Budde | porting all kernels to 2.6.34 |
Uwe Kleine-Koenig | spelling fixes |
Magnus Lilja | ideas and suggestions |
Nicolas Pitre | comments about power vs. performance
Baruch Siach | fixing the ARM CPU types
Colin Tuckley | ideas and suggestions |