Performance Measurement on ARM
After working mostly with ARM processors in the 200 to 400 MHz range in lots of Embedded Linux projects over the last few years, we have seen an interesting development in the market recently:
- ARM CPUs, long known for their low power consumption, are becoming faster and faster (examples: OMAP3/Beagleboard, MX51/MX53).
- x86, long known for its high computing performance, is becoming more and more SoC-like, power-friendly and slower.
If you read the marketing material from the chip manufacturers, it sounds as if ARM were the next x86 (in terms of performance) and x86 the next ARM (in terms of power consumption). But where do we stand today? How fast are modern ARM derivatives?
The Pengutronix kernel team wanted to know, so we measured in order to get some real numbers. Here are the results, and they raise some interesting questions. Don't take the "observations" below too scientifically; I try to sum up the results in short claims.
As ARM is explicitly a low-power architecture, it would have been interesting to measure some "performance vs. power consumption" data. However, as we ran our experiments on board-level products, this couldn't be done: some manufacturers put more peripheral chips on their modules than others, so we would only have measured the effects of the board BOMs.
Test Hardware
In order to find out more about the real speed of today's hardware, we collected some typical industrial devices in our lab. This is the list of boards we benchmarked:
Test Hardware | CPU | Freq. | Core | RAM | Kernel |
---|---|---|---|---|---|
phyCORE-PXA270 | PXA270 (Marvell) | 520 MHz | XScale (ARMv5) | SDRAM | 2.6.34 |
phyCORE-i.MX27 | MX27 (Freescale) | 400 MHz | ARM926 (ARMv5) | DDR | 2.6.34 |
phyCORE-i.MX35 | MX35 (Freescale) | 532 MHz | ARM1136 (ARMv6) | DDR2 | 2.6.34 |
O3530-PB-1452 | OMAP3530 (Texas Instruments) | 500 MHz | Cortex-A8 (ARMv7) | DDR | 2.6.34 |
Beagleboard C3 | OMAP3530 (Texas Instruments) | 500 MHz | Cortex-A8 (ARMv7) | DDR | 2.6.34 |
phyCORE-Atom | Z510 (Intel) | 1100 MHz | Atom | DDR2 | 2.6.34 |
How fast are these boards? Yours truly assumed that the order in the table above more or less lists the systems in ascending performance order: the PXA270 is a platform from the past, the MX27 represents the current generation of bus-matrix-optimized ARM9s, the ARM11 should be the next step up, the Cortex-A8 appears to be the next killer platform, and the Atom would probably be an order of magnitude above that.
So let's look at what we've measured.
Benchmarks
Explanatory note: In the following charts, the "error" bars (sometimes barely visible) do not show a statistical error but the range between the minimum and maximum values of ten benchmark cycles, while the bar height shows the arithmetic mean.
Floating Point Multiplication (lat_ops)
This benchmark (lat_ops, see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_ops&section=8) measures the time for a floating point multiplication. It is meant as an indication of computational power and is heavily influenced by whether a SoC has a hardware floating point unit or not. Here are the results:
(Chart: floating point multiplication latency, lat_ops)
The PXA270 and i.MX27 both have no hardware floating point unit, so the difference between the plots seems to directly reflect the different CPU clock speed.
An interesting observation is that the MX35 (ARM1136, 532 MHz) is faster than the OMAPs (Cortex-A8, 500 MHz). The clock frequency differs by only 6%, whereas the MX35 is about 25% faster.
Observation 1: Even when scaled to the same frequency, the ARM11 is faster than the Cortex-A8!
Observation 2: The Atom needs 4.5 ns where the MX35 needs 15 ns; it runs at about twice the MX35's clock frequency but needs only one third of the time.
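For illustration only, here is a minimal sketch of the kind of measurement lat_ops performs for "float mul": a timed chain of dependent float multiplications, with the mean time per operation printed at the end. It is not the lmbench source; the iteration count, the constant and the volatile trick are arbitrary choices for this example, and the load/store caused by volatile slightly inflates the per-operation time.

```c
/* Sketch only: time a chain of dependent float multiplications. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    volatile float r = 1.000001f;   /* volatile keeps gcc from optimizing the loop away */
    const long iterations = 10000000;
    struct timeval start, end;
    long i;
    double ns;

    gettimeofday(&start, NULL);
    for (i = 0; i < iterations; i++)
        r = r * 1.000001f;
    gettimeofday(&end, NULL);

    ns = (end.tv_sec - start.tv_sec) * 1e9 +
         (end.tv_usec - start.tv_usec) * 1e3;
    printf("float mul: %.2f ns per operation (r = %f)\n",
           ns / iterations, r);
    return 0;
}
```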
Memory Bandwidth (bw_mem)
We measure the memory transfer speed with the bw_mem benchmark (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=bw_mem&section=8).
(Chart: memory bandwidth, bw_mem)
Observation 3: There is a factor of 2 between the PXA270 and the MX27/MX35.
Observation 4: The OMAP is twice as fast as the i.MX ARM9/ARM11 parts.
Observation 5: The Atom is still 2.4 times faster than the OMAP, at 2.2 times the clock rate.
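As a rough illustration of the idea behind a "rd"-style bandwidth measurement, the following sketch walks over a 32 MiB buffer (the same 33554432 bytes used in the bw_mem command line further down) and reports MB/s. It is not the lmbench implementation; the 64-byte stride simply assumes one access per cache line.

```c
/* Sketch only: read one byte per assumed cache line of a 32 MiB buffer. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUFSIZE (32 * 1024 * 1024)

int main(void)
{
    unsigned char *buf = malloc(BUFSIZE);
    struct timeval start, end;
    unsigned long sum = 0;
    size_t i;
    double seconds;

    if (!buf)
        return 1;
    memset(buf, 1, BUFSIZE);        /* touch all pages so they are really mapped */

    gettimeofday(&start, NULL);
    for (i = 0; i < BUFSIZE; i += 64)
        sum += buf[i];              /* pull in one cache line per access */
    gettimeofday(&end, NULL);

    seconds = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1e6;
    printf("read: %.1f MB/s (checksum %lu)\n",
           BUFSIZE / seconds / 1e6, sum);
    free(buf);
    return 0;
}
```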
Context Switching Time (lat_ctx)
An important indicator of system speed is the time needed to switch the CPU context. The lat_ctx benchmark (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_ctx&section=8) measures the context switching time; the number and size of the processes to be tested can be configured. The processes are started, each reads a token from a pipe, performs a certain amount of work and passes the token on to the next process.
(Chart: context switching time, lat_ctx)
Observation 6: This shows impressively how slow the PXA is: a factor of 40 compared to the Atom, and still a factor of 3 compared to the ARM926.
Observation 7: The MX35/ARM1136 is almost as fast as the Cortex-A8. I would have expected the newer Cortex to be much faster, somewhere between the ARM11 and the Atom. Instead, the Cortex is still three times slower than the Atom, albeit at half the clock rate.
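The following simplified sketch shows the underlying idea: two processes bounce a one-byte token through a pair of pipes, so every hop forces a context switch. The real lat_ctx uses a configurable ring of processes with per-process working sets and compensates for the pipe overhead; none of that is modelled here, and the round count is arbitrary.

```c
/* Sketch only: two processes ping-pong a token through pipes. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>

int main(void)
{
    int ping[2], pong[2];
    const int rounds = 10000;
    char token = 'x';
    struct timeval start, end;
    pid_t pid;
    int i;
    double usec;

    if (pipe(ping) < 0 || pipe(pong) < 0)
        return 1;

    pid = fork();
    if (pid < 0)
        return 1;
    if (pid == 0) {                       /* child: echo the token back */
        for (i = 0; i < rounds; i++) {
            if (read(ping[0], &token, 1) != 1)
                _exit(1);
            if (write(pong[1], &token, 1) != 1)
                _exit(1);
        }
        _exit(0);
    }

    gettimeofday(&start, NULL);
    for (i = 0; i < rounds; i++) {        /* parent: send token, wait for echo */
        write(ping[1], &token, 1);
        read(pong[0], &token, 1);
    }
    gettimeofday(&end, NULL);
    wait(NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
    /* one round trip is roughly two context switches */
    printf("%.2f microseconds per context switch\n", usec / (2.0 * rounds));
    return 0;
}
```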
Syscall Performance (lat_syscall)
In order to estimate the performance of calling operating system functionality, we measured the syscall latency with lat_syscall (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_syscall&section=8). The benchmark performs an open() and a close() on a 1 MB random data file located in a ramdisk (tmpfs), accessing the file with a relative path (absolute paths seem to give different results). The combined time for both operations, one after the other, is measured.
(Chart: syscall latency, lat_syscall)
Observation 8: The PXA isn't too bad when it comes to syscalls.
Observation 9: The Cortex-A8 and the ARM11 are almost identically fast.
Observation 10: Even between the OMAP/ARM11 and the Atom there is only a factor of 1.8.
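A minimal sketch of the open()/close() timing idea (not the lmbench implementation) could look like this; it assumes the test.pattern file created in the lat_syscall command line further down already exists on /tmp and accesses it via a relative path, and the iteration count is an arbitrary choice:

```c
/* Sketch only: time repeated open()/close() pairs on a tmpfs file. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

int main(void)
{
    const int rounds = 100000;
    struct timeval start, end;
    double usec;
    int i, fd;

    if (chdir("/tmp") < 0)      /* tmpfs; access the file via a relative path */
        return 1;

    gettimeofday(&start, NULL);
    for (i = 0; i < rounds; i++) {
        fd = open("test.pattern", O_RDONLY);
        if (fd < 0)
            return 1;
        close(fd);
    }
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
    printf("open+close: %.3f microseconds\n", usec / rounds);
    return 0;
}
```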
Process Forking (lat_proc)
The lat_proc benchmark (see http://lmbench.sourceforge.net/cgi-bin/man?keyword=lat_proc&section=8) forks processes and measures the time needed to do so.
(Chart: process forking time, lat_proc)
Observation 11: The ARM11 is even better than the Cortex-A8! I had expected the newer Cortex to perform better here.
Observation 12: The Atom is 3 times as fast, at 2 times the clock frequency.
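Conceptually, the "fork" part can be sketched as follows: fork() a child that exits immediately, let the parent reap it, and average over many iterations. This is not the lmbench code, and the iteration count is an arbitrary choice.

```c
/* Sketch only: average time to fork a child that exits immediately. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>

int main(void)
{
    const int rounds = 1000;
    struct timeval start, end;
    double usec;
    pid_t pid;
    int i;

    gettimeofday(&start, NULL);
    for (i = 0; i < rounds; i++) {
        pid = fork();
        if (pid < 0)
            return 1;
        if (pid == 0)
            _exit(0);               /* child: exit immediately */
        waitpid(pid, NULL, 0);      /* parent: reap the child */
    }
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_usec - start.tv_usec);
    printf("fork+exit: %.1f microseconds per fork\n", usec / rounds);
    return 0;
}
```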
Specifications
Kernel and Configs
An attempt was made to use kernel 2.6.34 uniformly on all targets. While optimizing the configurations, care was taken to always set the following config options (a matching .config fragment is shown after the list):
- Tree-based hierarchical RCU
- Preemptible Kernel (Low-Latency Desktop)
- Choose SLAB allocator (SLAB)
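Expressed as .config symbols, these selections should correspond to roughly the following fragment; the symbol names are our reading of the 2.6.34-era Kconfig and are given for orientation only:

```
CONFIG_TREE_RCU=y
CONFIG_PREEMPT=y
CONFIG_SLAB=y
```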
Thumb mode was never used. Turning off NEON on the OMAP did not produce significantly different results. Using v5TE instead of v4T repeatedly gives worse results (not only on the OMAP), which we still do not quite understand. In any case, the figures published here for the OMAP were obtained with a Cortex-A8 toolchain.
LMbench command lines
lat_ops
root@target:~ lat_ops 2>&1
filtered by
grep "^float mul:" | cut -f3 -d" "
bw_mem
root@target:~ list="rd wr rdwr cp fwr frd bzero bcopy"; \ for i in $list; \ do echo -en "$i\t"; done; \ echo; \ for i in $list; \ do res=$(bw_mem 33554432 $i 2>&1 | awk "{print \$2}"); \ echo -en "$res\t"; done; \ echo MB/Sec
filtered by
awk "/rd\twr\trdwr\tcp\tfwr\tfrd\tbzero\tbcopy/ { getline; print \$3 }"
lat_ctx
root@target:~ list="0 4 8 16 32 64" amount="2 4 8 16 24 32 64 96"; \ for size in $list; do lat_ctx -s $size $amount 2>&1; \ done
filtered by
grep -A4 "^\"size=8k" | grep "^16" | cut -f2 -d" "
lat_syscall
root@target:~ list="null read write stat fstat open"; \ cd /tmp; \ dd if=/dev/urandom of=test.pattern bs=1024 count=1024 2>/dev/null; \ for i in $list; do echo -en "$i\t"; done; echo; \ for i in $list; do \ res=$(lat_syscall $i test.pattern 2>&1 | awk "{print \$3}"); \ echo -en "$res\t"; done; echo microseconds
filtered by
awk "/null\tread\twrite\tstat\tfstat\topen/ { getline; print \$6 }"
lat_proc
root@target:~ list="procedure fork exec shell"; \ cp /usr/bin/hello /tmp; \ for i in $list; do echo -en "$i\t"; done ;echo; \ for i in $list; do res=$(lat_proc $i 2>&1 | awk "{FS=\":\"} ; {print \$2}" \ | awk "{print \$1}"); echo -en "$res\t"; done ; echo microseconds
filtered by
awk "/procedure\tfork\texec\tshell/ { getline; print \$2 }"
Thinking about Caches
The influence of Linux caches is not much of an issue, as ensuring cold caches by directly preceding every lmbench invocation with
sync; echo 3 > /proc/sys/vm/drop_caches;
leads to only slightly (0% to 3%, depending on type of benchmark) worse figures, which has been verified on several targets.
GCC Flags
Here is an overview of the compiler variants used. The last column shows the floating point code emitted for the multiplication in do_float_mul, as seen in the output of objdump -d lat_ops.o.
CPU | Compiler | Floating Point Code |
---|---|---|
PXA270 | arm-iwmmx-linux-gnueabi-gcc | __aeabi_fmul |
MX27 | arm-v5te-linux-gnueabi-gcc | __aeabi_fmul |
MX35 | arm-1136jfs-linux-gnueabi-gcc | vmul.f32 |
OMAP-EVM | arm-cortexa8-linux-gnueabi-gcc | vmul.f32 |
OMAP-Beagle | arm-cortexa8-linux-gnueabi-gcc | vmul.f32 |
Atom Z510 | i586-unknown-linux-gnu-gcc | fmul |
As PTXdist is used as the build system, the gcc flags in effect are the same across all targets.
With LMbench's bw_mem as an example, the complete compiler command line in its original order is
-DPACKAGE_NAME=\"lmbench\" -DPACKAGE_TARNAME=\"lmbench\" -DPACKAGE_VERSION=\"trunk\" -DPACKAGE_STRING=\"lmbench\ trunk\" -DPACKAGE_BUGREPORT=\"bugs@pengutronix.de\" -DPACKAGE_URL=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DPACKAGE=\"lmbench\" -DVERSION=\"trunk\" -I. -I../include -I../include -isystem /home/.../sysroot-target/include -isystem /home/.../sysroot-target/usr/include -W -Wall -O2 -DHAVE_uint -DHAVE_uint64_t -DHAVE_int64_t -DHAVE_socklen_t -DHAVE_DRAND48 -DHAVE_RAND -DHAVE_RANDOM -MT bw_mem.o -MD -MP -MF .deps/bw_mem.Tpo -c -o bw_mem.o bw_mem.c
Conclusion
These measurements are probably not completely scientifically rigorous. The intention was to give us a rough idea of how the systems perform.
We expected the Cortex-A8 to be an order of magnitude faster than the ARM11. This doesn't seem to be the case: only the memory bandwidth is much higher, while most of the other benchmarks show almost the same values. It is currently completely unclear to us where the performance gain we expected from an ARMv7 core over an ARMv6 core went.
There seems to be a pattern that, at double the clock frequency, the Atom is often three times faster than the ARM11/Cortex-A8.
Feedback
Do you have any remarks, ideas about the observed effects, or other things you might want to tell us? We want to improve this article with the help of the community, so please send your feedback to the mail address in the box below.
Thanks to ... | ... for ... |
---|---|
Juergen Beisert | spelling fixes |
Jochen Frieling | all the measurements |
Andreas Gajda | spelling fixes |
Martin Guy | spelling fixes |
Marc Kleine-Budde | porting all kernels to 2.6.34 |
Uwe Kleine-Koenig | spelling fixes |
Magnus Lilja | ideas and suggestions |
Nicolas Pitre | comments about power vs. performance
Baruch Siach | fixing the ARM CPU types
Colin Tuckley | ideas and suggestions |