Linux Time Subsystem 4: The clocksource

Bluetangos

Published 2024-06-14 14:51:15

Article link: https://blog.csdn.net/Bluetangos/article/details/136888245

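In the previous articles we covered the kinds of time that Linux exposes and ended by mentioning the clock source, that is, the underlying counter from which every time value is derived. Linux abstracts this timekeeping hardware as a clocksource, which is the subject of this article. I had hoped to write something different here, but the article "Linux时间子系统之（十五）：clocksource" (Linux Time Subsystem, Part 15: clocksource) already explains it very well, so this post reproduces its content, adapts it to kernel 5.10, and adds some material.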

1. The clocksource data structure

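A clock source is an abstraction of a counter driven by a clock with a given input frequency. The input frequency determines how finely time can be divided: if the counter is clocked at 1 GHz, one cycle is 1 ns, so time is measured in 1 ns steps and the best achievable resolution is 1 ns.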

struct clocksource {
	u64			(*read)(struct clocksource *cs);	/* (1) */
	u64			mask;
	u32			mult;
	u32			shift;
	u64			max_idle_ns;
	u32			maxadj;
	u32			uncertainty_margin;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
	struct arch_clocksource_data archdata;
#endif
	u64			max_cycles;

	const char		*name;				/* (2) */
	struct list_head	list;
	int			rating;
	enum vdso_clock_mode	vdso_clock_mode;
	u16			vdso_fix;
	u16			vdso_shift;
	unsigned long		flags;

	int			(*enable)(struct clocksource *cs);
	void			(*disable)(struct clocksource *cs);
	void			(*suspend)(struct clocksource *cs);
	void			(*resume)(struct clocksource *cs);
	void			(*mark_unstable)(struct clocksource *cs);
	void			(*tick_stable)(struct clocksource *cs);

	/* private: */
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG				/* (3) */
	/* Watchdog related data, used by the framework */
	struct list_head	wd_list;
	u64			cs_last;
	u64			wd_last;
#endif
	struct module		*owner;
};

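The members of struct clocksource fall into three groups, described in turn:

(1) These members are used for timekeeping, and the kernel accesses them frequently. The counter abstracted by a clock source counts clock cycles; how many nanoseconds one cycle lasts depends on the input frequency.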

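read returns the current counter value, which is expressed in cycles (a u64; older kernels used the cycle_t typedef). Cycle counts mean nothing to users or to other drivers, though, so it is better to use a common unit such as nanoseconds. This is why struct clocksource has the mult and shift members. First, here is how A cycles convert to nanoseconds, where F is the counter's input frequency in Hz:

nanoseconds = (A / F) x NSEC_PER_SEC

For performance reasons, Linux does not divide directly; it uses a multiply and shift instead: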

static inline s64 clocksource_cyc2ns(u64 cycles, u32 mult, u32 shift)
{
    return ((u64) cycles * mult) >> shift;
}

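In other words, the cycle count obtained from the clock source's read function is multiplied by mult and then shifted right by shift bits to get nanoseconds, with mult and shift chosen so that mult / 2^shift approximates NSEC_PER_SEC / F. This is fast, but it loses some precision.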
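The multiply-and-shift conversion can be tried out in user space. The following is a minimal Python sketch, not kernel code: the 24 MHz frequency, the shift of 24 and the helper names are assumptions chosen for illustration, and the 64-bit mask stands in for C's u64 arithmetic. It also shows why the struct carries max_cycles: once cycles * mult no longer fits in 64 bits, the result wraps.

```python
U64_MASK = (1 << 64) - 1
NSEC_PER_SEC = 1_000_000_000


def cyc2ns(cycles, mult, shift):
    """Mimic clocksource_cyc2ns(): 64-bit multiply, then right shift."""
    return ((cycles * mult) & U64_MASK) >> shift


def exact_ns(cycles, freq_hz):
    """Reference value computed with an integer division."""
    return cycles * NSEC_PER_SEC // freq_hz


# Hypothetical 24 MHz counter (assumption, not from the article).
FREQ_HZ = 24_000_000
SHIFT = 24
# mult / 2**SHIFT should approximate NSEC_PER_SEC / FREQ_HZ (rounded to nearest).
MULT = ((NSEC_PER_SEC << SHIFT) + FREQ_HZ // 2) // FREQ_HZ

if __name__ == "__main__":
    for cycles in (1, 24, 24_000_000):
        print(cycles, cyc2ns(cycles, MULT, SHIFT), exact_ns(cycles, FREQ_HZ))

    # Largest cycle count whose product with MULT still fits in 64 bits.
    max_safe = U64_MASK // MULT
    print("max safe cycles:", max_safe)
    print("one cycle past the limit:", cyc2ns(max_safe + 1, MULT, SHIFT))
```

For small cycle counts the two functions agree; one cycle past the overflow limit the shifted result collapses to a tiny value, which is the situation the kernel avoids by bounding how long it goes between updates.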

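(2) The members in this group are accessed less often and mostly relate to system operations. Take the list member: the system keeps every registered clocksource on a linked list, and list is the node that links this entry into it. suspend and resume relate to power management, while enable and disable are the callbacks that start and stop the clock source. rating describes the precision of the clock source; a counter clocked at 20 MHz (50 ns per cycle) is clearly more precise than a counter whose cycle lasts 10 ms. The kernel's own comment on rating says all that is needed: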

    1-99: Unfit for real use
        Only available for bootup and testing purposes.
    100-199: Base level usability.
        Functional for real use, but not desired.
    200-299: Good.
        A correct and usable clocksource.
    300-399: Desired.
        A reasonably fast and accurate clocksource.
    400-499: Perfect
        The ideal clocksource. A must-use where available.

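(3) The third group of members belongs to the clocksource watchdog. Most readers know the system watchdog, which monitors the system and resets it if it is not fed in time. The clocksource watchdog, as the name suggests, monitors clock sources: if it finds that a monitored clocksource has a precision problem, it changes that clocksource's rating to let the system know. This topic will get a dedicated section later.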

2. Registering and unregistering a clocksource

The kernel provides a family of registration interfaces. The core registration function is __clocksource_register_scale:

int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
	unsigned long flags;

	clocksource_arch_init(cs);

	if (cs->vdso_clock_mode < 0 ||
	    cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) {
		pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",
			cs->name, cs->vdso_clock_mode);
		cs->vdso_clock_mode = VDSO_CLOCKMODE_NONE;
	}

	/* Initialize mult/shift and max_idle_ns */
	__clocksource_update_freq_scale(cs, scale, freq);

	/* Add clocksource to the clocksource list */
	mutex_lock(&clocksource_mutex);

	clocksource_watchdog_lock(&flags);
	clocksource_enqueue(cs);
	clocksource_enqueue_watchdog(cs);
	clocksource_watchdog_unlock(&flags);

	clocksource_select();
	clocksource_select_watchdog(false);
	__clocksource_suspend_select(cs);
	mutex_unlock(&clocksource_mutex);
	return 0;
}

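__clocksource_update_freq_scale computes mult, shift and max_idle_ns; the details of that calculation are skipped here.

clocksource_enqueue adds the clocksource to the clock source list, keeping the list sorted by rating from high to low.

clocksource_select picks the most suitable clock source, since a newly registered clock source may have a higher rating than all the existing ones.

The watchdog-related parts are skipped.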
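The rating-ordered insertion can be illustrated with a small Python sketch. This is a model of the behaviour described above, not the kernel's list code: the Clocksource class and the enqueue helper are hypothetical, and the names and ratings in the example are only illustrative values.

```python
from dataclasses import dataclass


@dataclass
class Clocksource:
    name: str
    rating: int


def enqueue(cs_list, cs):
    """Insert cs so the list stays sorted by rating, highest first.

    Entries with an equal rating keep their registration order.
    """
    pos = 0
    for i, existing in enumerate(cs_list):
        if existing.rating < cs.rating:
            break          # first entry with a lower rating: insert before it
        pos = i + 1        # otherwise remember the slot after this entry
    cs_list.insert(pos, cs)


if __name__ == "__main__":
    registered = []
    for cs in (Clocksource("jiffies", 1), Clocksource("hpet", 250),
               Clocksource("tsc", 300), Clocksource("acpi_pm", 200)):
        enqueue(registered, cs)
    print([cs.name for cs in registered])
```

Because the list is always kept in this order, a later selection pass only needs to walk it from the head.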

A clocksource is unregistered with clocksource_unregister. The main logic lives in clocksource_unbind:

static int clocksource_unbind(struct clocksource *cs)
{
	unsigned long flags;

	if (clocksource_is_watchdog(cs)) {
		/* Select and try to install a replacement watchdog. */
		clocksource_select_watchdog(true);
		if (clocksource_is_watchdog(cs))
			return -EBUSY;
	}

	if (cs == curr_clocksource) {
		/* Select and try to install a replacement clock source */
		clocksource_select_fallback();
		if (curr_clocksource == cs)
			return -EBUSY;
	}

	if (clocksource_is_suspend(cs)) {
		/*
		 * Select and try to install a replacement suspend clocksource.
		 * If no replacement suspend clocksource, we will just let the
		 * clocksource go and have no suspend clocksource.
		 */
		clocksource_suspend_select(true);
	}

	clocksource_watchdog_lock(&flags);
	clocksource_dequeue_watchdog(cs);
	list_del_init(&cs->list);
	clocksource_watchdog_unlock(&flags);

	return 0;
}

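A detailed analysis is skipped. In outline, as the code shows: if cs is currently the watchdog or the current clock source, a replacement is selected first, and -EBUSY is returned when none can be installed; if cs is the suspend clocksource, a replacement is selected when one exists; finally cs is removed from the watchdog list and from the clocksource list.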

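3. Switching between old and new clock sources

Besides the selection that runs when a clock source is registered or unregistered (see the previous section), the clocksource watchdog can also trigger a selection; that path is skipped for now. A reselection may also happen when a low-level clocksource chip driver changes its rating, and users can specify a clocksource themselves.

The core code for switching clocksources: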

static void __clocksource_select(bool skipcur)
{
	bool oneshot = tick_oneshot_mode_active();
	struct clocksource *best, *cs;

	/* Find the best suitable clocksource */
	best = clocksource_find_best(oneshot, skipcur);
	if (!best)
		return;

	if (!strlen(override_name))
		goto found;

	/* Check for the override clocksource. */
	list_for_each_entry(cs, &clocksource_list, list) {
		if (skipcur && cs == curr_clocksource)
			continue;
		if (strcmp(cs->name, override_name) != 0)
			continue;
		/*
		 * Check to make sure we don't switch to a non-highres
		 * capable clocksource if the tick code is in oneshot
		 * mode (highres or nohz)
		 */
		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
			/* Override clocksource cannot be used. */
			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
				pr_warn("Override clocksource %s is unstable and not HRT compatible - cannot switch while in HRT/NOHZ mode\n",
					cs->name);
				override_name[0] = 0;
			} else {
				/*
				 * The override cannot be currently verified.
				 * Deferring to let the watchdog check.
				 */
				pr_info("Override clocksource %s is not currently HRT compatible - deferring\n",
					cs->name);
			}
		} else
			/* Override clocksource can be used. */
			best = cs;
		break;
	}

found:
	if (curr_clocksource != best && !timekeeping_notify(best)) {
		pr_info("Switched to clocksource %s\n", best->name);
		curr_clocksource = best;
	}
}

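clocksource_find_best finds the most precise usable clock source. The kernel keeps all clock sources on the list clocksource_list, inserted in order of precision (rating) from high to low, so the first clocksource on the list that passes the checks (not skipped as the current one, and valid for high resolution when required) is taken as the best one in the system.

The oneshot parameter reflects the working mode of this CPU's tick device. There are two modes: the periodic tick, which is the familiar traditional tick, and one-shot mode. A tick device running in one-shot mode places special requirements on the clock source, so clocksource_find_best needs to know the mode of this CPU's tick device.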
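The selection rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the Clocksource class, its valid_for_hres field (standing in for the CLOCK_SOURCE_VALID_FOR_HRES flag) and the find_best helper are hypothetical, and the list is assumed to be already sorted by rating, highest first. Any boot-time preconditions the kernel applies are left out.

```python
from dataclasses import dataclass


@dataclass
class Clocksource:
    name: str
    rating: int
    valid_for_hres: bool  # models CLOCK_SOURCE_VALID_FOR_HRES


def find_best(cs_list, oneshot, current=None, skipcur=False):
    """Return the first suitable entry of a rating-sorted list, or None."""
    for cs in cs_list:
        if skipcur and cs is current:
            continue  # caller asked for a replacement of the current one
        if oneshot and not cs.valid_for_hres:
            continue  # one-shot tick mode needs a highres-capable source
        return cs
    return None


if __name__ == "__main__":
    coarse = Clocksource("coarse", 400, valid_for_hres=False)
    fine = Clocksource("fine", 300, valid_for_hres=True)
    sources = [coarse, fine]
    print(find_best(sources, oneshot=False).name)
    print(find_best(sources, oneshot=True).name)
```

In periodic mode the highest-rated entry wins; in one-shot mode the walk skips entries that cannot serve high-resolution timers, even if their rating is higher.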

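The code after clocksource_find_best handles a request from user space to set the current clock source. User space places the name of its preferred clock source in override_name; during clocksource select the clock source list is scanned for the entry with that name, and that entry becomes best. Note that this overrides the best clock source found by clocksource_find_best above, unless the tick code is in oneshot mode and the named clock source lacks CLOCK_SOURCE_VALID_FOR_HRES: in that case the override is dropped if the clock source is unstable, and deferred otherwise.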
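On a running system, the current and available clock sources can be inspected through sysfs, and writing a name to current_clocksource is the usual way a user requests an override. The sketch below only reads those files. The read_clocksources helper is a hypothetical name; the directory is the standard sysfs location, but the function takes the base path as a parameter so it can be pointed elsewhere.

```python
from pathlib import Path

# Standard sysfs location of the clocksource attributes on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current, available) as read from the given directory."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    # available_clocksource is a space-separated list of names.
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    if SYSFS_DIR.is_dir():
        current, available = read_clocksources()
        print("current:  ", current)
        print("available:", ", ".join(available))
    else:
        print("clocksource sysfs directory not found on this system")
```

Requesting a different clock source is done by writing one of the available names to current_clocksource as root; that step is left out here because it changes system state.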

Once the clock source has been chosen, timekeeping_notify is called to notify the timekeeping module:

int timekeeping_notify(struct clocksource *clock)
{
	struct timekeeper *tk = &tk_core.timekeeper;

	if (tk->tkr_mono.clock == clock)			/* (1) */
		return 0;
	stop_machine(change_clocksource, clock, NULL);		/* (2) */
	tick_clock_notify();
	return tk->tkr_mono.clock == clock ? 0 : -1;		/* (3) */
}

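(1) timekeeping will be analyzed in a later article; here is a brief explanation. The timekeeper is a global structure that maintains all of the kernel's time representations. The check at (1) means: if the clock source the system is currently using is already the one being switched to, the switch is effectively done and nothing more needs to happen.

(2) stop_machine makes every online CPU disable its local interrupts. One CPU (the first online CPU, normally CPU0) runs change_clocksource while the others busy-wait, so for that moment the system behaves like a UP system. change_clocksource() is the function that does the actual switching; it is shown further down.

(3) After the switch attempt, tick_clock_notify() (the call just before the line marked (3)) notifies every CPU to update its tick mode; if all conditions are met, the system then switches to high-resolution mode. The return statement at (3) reports success (0) only if the timekeeper is actually using the new clock source.

Now let's look at change_clocksource: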

static int change_clocksource(void *data)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	struct clocksource *new, *old;
	unsigned long flags;

	new = (struct clocksource *) data;

	raw_spin_lock_irqsave(&timekeeper_lock, flags);
	write_seqcount_begin(&tk_core.seq);

	timekeeping_forward_now(tk);
	/*
	 * If the cs is in module, get a module reference. Succeeds
	 * for built-in code (owner == NULL) as well.
	 */
	if (try_module_get(new->owner)) {
		if (!new->enable || new->enable(new) == 0) {
			old = tk->tkr_mono.clock;
			tk_setup_internals(tk, new);
			if (old->disable)
				old->disable(old);
			module_put(old->owner);
		} else {
			module_put(new->owner);
		}
	}
	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

	write_seqcount_end(&tk_core.seq);
	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

	return 0;
}

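First, timekeeping_forward_now performs one last update of wall time and raw time using the old clocksource. tk_setup_internals then reads the new clock source and records that value as the new cycle_last baseline (in 5.10, tk->tkr_mono.cycle_last and tk->tkr_raw.cycle_last), and updates the time variables that depend on the clock source. From then on, every delta is computed from the new clock source.

When the clock source is switched, time keeps accumulating on top of the system time recorded with the old clock. For example, if the time before the switch is 8:00, the new clock source records its own last cycle value at the 8:00 instant. At the next update, the delta between now and that last cycle value is computed, say delta = 10 minutes, so the system time after the switch is 8:10.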
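The 8:00 to 8:10 example can be reproduced with a toy model. This is a sketch under stated assumptions, not the kernel's timekeeper: the Counter and Timekeeper classes are hypothetical, the frequencies are arbitrary, and the conversion uses an exact integer division instead of mult/shift to keep the arithmetic easy to follow.

```python
NSEC_PER_SEC = 1_000_000_000


class Counter:
    """A fake free-running counter with a fixed frequency."""

    def __init__(self, freq_hz, start=0):
        self.freq_hz = freq_hz
        self.cycles = start

    def advance(self, seconds):
        self.cycles += seconds * self.freq_hz

    def read(self):
        return self.cycles


class Timekeeper:
    """Accumulates nanoseconds from deltas of the current counter."""

    def __init__(self, clock, now_ns):
        self.clock = clock
        self.now_ns = now_ns
        self.cycle_last = clock.read()

    def forward_now(self):
        now = self.clock.read()
        delta = now - self.cycle_last
        self.now_ns += delta * NSEC_PER_SEC // self.clock.freq_hz
        self.cycle_last = now

    def change_clocksource(self, new):
        self.forward_now()            # last update with the old counter
        self.clock = new
        self.cycle_last = new.read()  # baseline taken from the new counter


if __name__ == "__main__":
    old = Counter(1_000_000)
    new = Counter(24_000_000, start=123_456)  # unrelated starting value
    tk = Timekeeper(old, now_ns=8 * 3600 * NSEC_PER_SEC)  # 8:00

    tk.change_clocksource(new)
    new.advance(600)                  # ten minutes pass on the new counter
    tk.forward_now()
    hours, rest = divmod(tk.now_ns // NSEC_PER_SEC, 3600)
    print(f"{hours}:{rest // 60:02d}")
```

The new counter's absolute value never matters: only the difference from the baseline recorded at the switch is added to the accumulated time.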

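If the old clock source was inaccurate and lost time, the new clock source has no way of noticing that after the switch. Wall time will still be corrected by NTP, but raw time stays permanently off.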

4. Appendix: computing mult and shift

mult and shift are used to convert cycles to nanoseconds. Take the TSC clock source as an example, with a frequency of 2400.481 MHz:

TSC clocksource calibration: 2400.481MHz.

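This means that after 2400481000 clock cycles, 1 s of physical time has passed, which is 1000000000 ns. The passage of time can therefore be measured from the number of cycles between two readings, using this formula: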

time_elapse = cycle_interval / frequency

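Division is inconvenient inside the kernel, however, so it is generally replaced by a multiplication plus a shift. That is the reason for the two factors mult and shift. After the conversion, the formula becomes: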

time_elapse = cycle_interval * mult >> shift

clocksource_tsc {
	...
	rating = 300,
	mult = 894605559,
	shift = 31,
	...
}

Let's go one step further and check how precise the mult and shift calculation is:

Python 2.6.9 (unknown, Aug 17 2016, 09:50:05)
[GCC 4.3.4 [gcc-4_3-branch revision 152973]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> (2400481000*894605559)>>31
999999999

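The result is 1 ns short of one second, a relative error of about 10^-9. In other words, after 10^9 seconds (roughly 31 years) the computed time would be off by about 1 second.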
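The check above can be repeated in a current Python, together with the exact size of the error before truncation. This is a sketch with assumptions: calc_mult rounds to the nearest integer for a fixed shift of 31 (the value from the dump above), which is a simplification and not the kernel's full procedure for choosing mult and shift; both helper names are hypothetical.

```python
from fractions import Fraction

NSEC_PER_SEC = 1_000_000_000


def calc_mult(freq_hz, shift):
    """Round NSEC_PER_SEC * 2**shift / freq_hz to the nearest integer."""
    return ((NSEC_PER_SEC << shift) + freq_hz // 2) // freq_hz


def shortfall_ns_per_sec(freq_hz, mult, shift):
    """Exact error, before truncation, for one second's worth of cycles."""
    return NSEC_PER_SEC - Fraction(freq_hz * mult, 1 << shift)


TSC_HZ = 2_400_481_000  # 2400.481 MHz, from the calibration line above
SHIFT = 31

if __name__ == "__main__":
    mult = calc_mult(TSC_HZ, SHIFT)
    print("mult:", mult)
    print("ns for one second of cycles:", (TSC_HZ * mult) >> SHIFT)
    print("exact shortfall in ns:", float(shortfall_ns_per_sec(TSC_HZ, mult, SHIFT)))
```

With these inputs the rounded mult matches the value in the dump, and the exact shortfall is a fraction of a nanosecond; the right shift truncates, which is why the integer result shows a full 1 ns difference.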