This article was first published at http://oliveryang.net. Any reuse of the content must include a link to the original.
1. Latency measurement in user space
When user application developers work on performance-sensitive code, one common requirement is to do latency/time measurement in their code. Such code could be temporary code used for debugging, testing, or profiling, or permanent code that provides performance tracing data in production.
The Linux kernel provides the gettimeofday() and clock_gettime() system calls for high-resolution time measurement in user applications. gettimeofday() has microsecond resolution, while clock_gettime() has nanosecond resolution. However, the major concern with using these system calls is the additional performance cost of the calls themselves.
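As a quick reference, a minimal sketch of time measurement with clock_gettime() could look like the following; the CLOCK_MONOTONIC clock choice is just one common option.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;

    /* CLOCK_MONOTONIC is not affected by wall clock adjustments */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t start, end;

    start = now_ns();
    /* put code you want to measure here */
    end = now_ns();

    printf("latency: %lu ns\n", (unsigned long)(end - start));
    return 0;
}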
To minimize the performance cost of the gettimeofday() and clock_gettime() system calls, the Linux kernel uses the vsyscall (virtual system call) and vDSO (virtual dynamically linked shared object) mechanisms to avoid the cost of switching from user space to kernel space. On x86, gettimeofday() and clock_gettime() get better performance this way because the context switch from user to kernel space is avoided, but some other architectures still have to follow the regular system call code path. This is a hardware-dependent optimization.
2. Why use TSC?
Although the vsyscall implementations of gettimeofday() and clock_gettime() are faster than regular system calls, their performance cost is still too high to meet the latency measurement requirements of some performance-sensitive applications.
The TSC (time stamp counter) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC). On Linux this instruction can be executed directly from user space, which means a user application can get a fine-grained timestamp (nanosecond level) with one instruction, in a much faster way than with vsyscalls.
The following code is a typical implementation of an rdtsc() API in a user space application:
#include <stdint.h>

static uint64_t rdtsc(void)
{
    uint32_t hi, lo;

    /* RDTSC loads the low 32 bits of the TSC into EAX and the high 32 bits into EDX */
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
The result of rdtsc is a CPU cycle count, which can be converted to nanoseconds by a simple calculation:
ns = CPU cycles * (ns_per_sec / CPU freq)
For example, on a 2 GHz CPU, one cycle is 0.5 ns. The Linux kernel uses a more elaborate method to get a better result without doing a division on every conversion, as this comment from its x86 TSC code explains:
/*
* Accelerators for sched_clock()
* convert from cycles(64bits) => nanoseconds (64bits)
* basic equation:
* ns = cycles / (freq / ns_per_sec)
* ns = cycles * (ns_per_sec / freq)
* ns = cycles * (10^9 / (cpu_khz * 10^3))
* ns = cycles * (10^6 / cpu_khz)
*
* Then we use scaling math (suggested by george@mvista.com) to get:
* ns = cycles * (10^6 * SC / cpu_khz) / SC
* ns = cycles * cyc2ns_scale / SC
*
* And since SC is a constant power of two, we can convert the div
* into a shift.
*
* We can use khz divisor instead of mhz to keep a better precision, since
* cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
* (mathieu.desnoyers@polymtl.ca)
*
* -johnstul@us.ibm.com "math is hard, lets go shopping!"
*/
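Following that comment, a minimal user-space sketch of the scaling math could be written as below. The CYC2NS_SCALE_FACTOR constant and the cpu_khz parameter are assumptions of this sketch; cpu_khz must be obtained elsewhere, e.g. parsed from /proc/cpuinfo or calibrated against clock_gettime().

#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10    /* SC = 2^10, a constant power of two */

static uint64_t cyc2ns_scale;

static void set_cyc2ns_scale(uint64_t cpu_khz)
{
    /* cyc2ns_scale = 10^6 * SC / cpu_khz */
    cyc2ns_scale = (1000000ULL << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

static uint64_t cycle_2_ns(uint64_t cycles)
{
    /*
     * ns = cycles * cyc2ns_scale / SC, with the division turned into
     * a shift. The multiplication can overflow 64 bits for huge cycle
     * deltas; the kernel avoids that with 128-bit intermediate math
     * (mul_u64_u32_shr()).
     */
    return (cycles * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
}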
Finally, the latency measurement code could be:
start = rdtsc();
/* put code you want to measure here */
end = rdtsc();
cycle = end - start;
latency = cycle_2_ns(cycle);
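Putting the pieces together, a complete example could look like the sketch below, which reuses the rdtsc(), set_cyc2ns_scale(), and cycle_2_ns() helpers above. The 100 ms nanosleep() calibration is a rough simplification for illustration: nanosleep() may oversleep, so real code should calibrate against clock_gettime() deltas or take the TSC frequency from a reliable source.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* rdtsc(), set_cyc2ns_scale() and cycle_2_ns() as defined above */

static uint64_t calibrate_cpu_khz(void)
{
    struct timespec ts = { 0, 100 * 1000 * 1000 };    /* 100 ms */
    uint64_t start = rdtsc();

    nanosleep(&ts, NULL);
    /* cycles elapsed in ~0.1 s, divided by 100 to get kHz */
    return (rdtsc() - start) / 100;
}

int main(void)
{
    uint64_t start, end, cycle, latency;

    set_cyc2ns_scale(calibrate_cpu_khz());

    start = rdtsc();
    /* put code you want to measure here */
    end = rdtsc();
    cycle = end - start;
    latency = cycle_2_ns(cycle);

    printf("latency: %lu ns\n", (unsigned long)latency);
    return 0;
}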
In fact, the rdtsc() implementation above is problematic and is not encouraged by Linux kernel developers. The major reason is that the TSC mechanism is rather unreliable; even the Linux kernel has had a hard time handling it.
That is why the Linux kernel does not provide an rdtsc API to user applications. However, the kernel does not restrict the rdtsc instruction to a privileged level, although x86 supports such a setup (the CR4.TSD flag). That means there is nothing stopping a Linux application from reading the TSC directly with the implementation above, but such applications have to be prepared to handle the strange TSC behaviors caused by the known pitfalls below.
3. Known TSC pitfalls
3.1 Unstable TSC hardware
3.1.1 CPU TSC capabilities
Intel CPUs have three sorts of TSC behavior:
- Variant TSC
With the first generation of TSC, the TSC increment rate could be impacted by CPU frequency changes. This behavior is found on very old processors (the P4 era and earlier).
- Constant TSC
The TSC increments at a constant rate even when the CPU frequency changes, but the TSC may stop when the CPU enters a deep C-state. Constant TSC appeared on processors before Nehalem and is not as good as invariant TSC.
- Invariant TSC
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. Invariant TSC only appears on Nehalem-and-later Intel processors.
See the Intel 64 and IA-32 Architectures Software Developer's Manual for the detailed definitions of these TSC capabilities.
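A user application can check these capabilities itself. As a minimal sketch (assuming GCC or Clang on x86, where <cpuid.h> provides __get_cpuid()), invariant TSC is reported by CPUID leaf 0x80000007, EDX bit 8; on Linux, the constant_tsc and nonstop_tsc flags in /proc/cpuinfo carry the same information.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* CPUID leaf 0x80000007: Advanced Power Management Information */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0x80000007 not supported\n");
        return 1;
    }

    /* EDX bit 8 indicates invariant TSC */
    printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}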