This article was first published at http://oliveryang.net. Any reuse of the content must include a link to the original.
1. Latency measurement in user space
When user application developers work on performance-sensitive code, one common requirement is to do latency/time measurement in their code. Such code could be temporary code used for debugging, testing, or profiling, or permanent code that provides performance tracing data in production.
The Linux kernel provides the gettimeofday() and clock_gettime() system calls for high-resolution time measurement in user applications. gettimeofday() has microsecond resolution, while clock_gettime() has nanosecond resolution. However, the major concern with using these system calls is the additional performance cost of the calls themselves.
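As a quick reference, a minimal sketch of time measurement with clock_gettime() could look like the following; the CLOCK_MONOTONIC clock choice is just one common option.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;

    /* CLOCK_MONOTONIC is not affected by wall clock adjustments */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    uint64_t start, end;

    start = now_ns();
    /* put code you want to measure here */
    end = now_ns();

    printf("latency: %lu ns\n", (unsigned long)(end - start));
    return 0;
}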
To minimize the performance cost of the gettimeofday() and clock_gettime() system calls, the Linux kernel uses the vsyscall (virtual system call) and vDSO (virtual dynamically linked shared object) mechanisms to avoid the cost of switching from user space to kernel space. On x86, gettimeofday() and clock_gettime() get better performance this way because the context switch from user to kernel space is avoided, but some other architectures still have to follow the regular system call code path. This is a hardware-dependent optimization.
2. Why use TSC?
Although the vsyscall implementations of gettimeofday() and clock_gettime() are faster than regular system calls, their performance cost is still too high to meet the latency measurement requirements of some performance-sensitive applications.
The TSC (time stamp counter) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC). On Linux this instruction can be executed directly from user space, which means a user application can get a fine-grained timestamp (nanosecond level) with one instruction, in a much faster way than with vsyscalls.
The following code is a typical implementation of an rdtsc() API in a user space application:
#include <stdint.h>

static uint64_t rdtsc(void)
{
    uint32_t hi, lo;

    /* RDTSC loads the low 32 bits of the TSC into EAX and the high 32 bits into EDX */
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
The result of rdtsc is a CPU cycle count, which can be converted to nanoseconds by a simple calculation:
ns = CPU cycles * (ns_per_sec / CPU freq)
For example, on a 2 GHz CPU, one cycle is 0.5 ns. The Linux kernel uses a more elaborate method to get a better result without doing a division on every conversion, as this comment from its x86 TSC code explains:
/*
* Accelerators for sched_clock()
* convert from cycles(64bits) => nanoseconds (64bits)
* basic equation:
* ns = cycles / (freq / ns_per_sec)
* ns = cycles * (ns_per_sec / freq)
* ns = cycles * (10^9 / (cpu_khz * 10^3))
* ns = cycles * (10^6 / cpu_khz)
*
* Then we use scaling math (suggested by george@mvista.com) to get:
* ns = cycles * (10^6 * SC / cpu_khz) / SC
* ns = cycles * cyc2ns_scale / SC
*
* And since SC is a constant power of two, we can convert the div
* into a shift.
*
* We can use khz divisor instead of mhz to keep a better precision, since
* cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
* (mathieu.desnoyers@polymtl.ca)
*
* -johnstul@us.ibm.com "math is hard, lets go shopping!"
*/
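Following that comment, a minimal user-space sketch of the scaling math could be written as below. The CYC2NS_SCALE_FACTOR constant and the cpu_khz parameter are assumptions of this sketch; cpu_khz must be obtained elsewhere, e.g. parsed from /proc/cpuinfo or calibrated against clock_gettime().

#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10    /* SC = 2^10, a constant power of two */

static uint64_t cyc2ns_scale;

static void set_cyc2ns_scale(uint64_t cpu_khz)
{
    /* cyc2ns_scale = 10^6 * SC / cpu_khz */
    cyc2ns_scale = (1000000ULL << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

static uint64_t cycle_2_ns(uint64_t cycles)
{
    /*
     * ns = cycles * cyc2ns_scale / SC, with the division turned into
     * a shift. The multiplication can overflow 64 bits for huge cycle
     * deltas; the kernel avoids that with 128-bit intermediate math
     * (mul_u64_u32_shr()).
     */
    return (cycles * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
}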
Finally, the latency measurement code could be:
start = rdtsc();
/* put code you want to measure here */
end = rdtsc();
cycle = end - start;
latency = cycle_2_ns(cycle);
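Putting the pieces together, a complete example could look like the sketch below, which reuses the rdtsc(), set_cyc2ns_scale(), and cycle_2_ns() helpers above. The 100 ms nanosleep() calibration is a rough simplification for illustration: nanosleep() may oversleep, so real code should calibrate against clock_gettime() deltas or take the TSC frequency from a reliable source.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* rdtsc(), set_cyc2ns_scale() and cycle_2_ns() as defined above */

static uint64_t calibrate_cpu_khz(void)
{
    struct timespec ts = { 0, 100 * 1000 * 1000 };    /* 100 ms */
    uint64_t start = rdtsc();

    nanosleep(&ts, NULL);
    /* cycles elapsed in ~0.1 s, divided by 100 to get kHz */
    return (rdtsc() - start) / 100;
}

int main(void)
{
    uint64_t start, end, cycle, latency;

    set_cyc2ns_scale(calibrate_cpu_khz());

    start = rdtsc();
    /* put code you want to measure here */
    end = rdtsc();
    cycle = end - start;
    latency = cycle_2_ns(cycle);

    printf("latency: %lu ns\n", (unsigned long)latency);
    return 0;
}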
In fact, the rdtsc() implementation above is problematic and is not encouraged by Linux kernel developers. The major reason is that the TSC mechanism is rather unreliable; even the Linux kernel has had a hard time handling it.
That is why the Linux kernel does not provide an rdtsc API to user applications. However, the kernel does not restrict the rdtsc instruction to a privileged level, although x86 supports such a setup (the CR4.TSD flag). That means there is nothing stopping a Linux application from reading the TSC directly with the implementation above, but such applications have to be prepared to handle the strange TSC behaviors caused by the known pitfalls below.
3. Known TSC pitfalls
3.1 Unstable TSC hardware
3.1.1 CPU TSC capabilities
Intel CPUs have three sorts of TSC behavior:
- Variant TSC
With the first generation of TSC, the TSC increment rate could be impacted by CPU frequency changes. This behavior is found on very old processors (the P4 era and earlier).
- Constant TSC
The TSC increments at a constant rate even when the CPU frequency changes, but the TSC may stop when the CPU enters a deep C-state. Constant TSC appeared on processors before Nehalem and is not as good as invariant TSC.
- Invariant TSC
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. Invariant TSC only appears on Nehalem-and-later Intel processors.
See the Intel 64 and IA-32 Architectures Software Developer's Manual for the detailed definitions of these TSC capabilities.
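A user application can check these capabilities itself. As a minimal sketch (assuming GCC or Clang on x86, where <cpuid.h> provides __get_cpuid()), invariant TSC is reported by CPUID leaf 0x80000007, EDX bit 8; on Linux, the constant_tsc and nonstop_tsc flags in /proc/cpuinfo carry the same information.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* CPUID leaf 0x80000007: Advanced Power Management Information */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0x80000007 not supported\n");
        return 1;
    }

    /* EDX bit 8 indicates invariant TSC */
    printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}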