KVM APIC Timer Emulation Explained

Contents

1 The Manual

2 KVM Emulation

2.1 APIC Timer Modes

2.2 Timer Implementation Modes

2.3 Interrupt Injection

1 The Manual

Let us first look at how the Intel SDM, Volume 3, describes the APIC timer.

See 10.5.4 APIC Timer:

The local APIC unit contains a 32-bit programmable timer that is available to software to time events or operations. This timer is set up by programming four registers:

  • the divide configuration register,
  • the initial-count register,
  • the current-count register,
  • the LVT timer register, which determines the vector number that is delivered to the processor with the timer interrupt that is generated when the timer count reaches zero.

Next come the APIC timer modes, namely one-shot mode and periodic mode:

  • In one-shot mode, the timer is started by programming its initial-count register. The initial count value is then copied into the current-count register and count-down begins. After the timer reaches zero, a timer interrupt is generated and the timer remains at its 0 value until reprogrammed.
  • In periodic mode, the current-count register is automatically reloaded from the initial-count register when the count reaches 0 and a timer interrupt is generated, and the count-down is repeated.

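From the above we learn the following:

  • The initial-count register is used to start the timer;
  • The LVT timer register is used to select the timer mode: one-shot, periodic, or the TSC-deadline mode covered below;
  • The current-count register does the counting. Its initial value comes from the initial-count register: in one-shot mode it is loaded only once, while in periodic mode it is reloaded every time it reaches zero.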
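The count-down behaviour described above can be sketched as a small toy model. This is not KVM code: the struct and the toy_* function names are hypothetical, and the divide configuration is assumed to have been applied already, so each call to toy_tick() stands for one divided tick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the count-down behaviour; all names are hypothetical. */
struct toy_apic_timer {
	uint32_t initial_count;  /* initial-count register */
	uint32_t current_count;  /* current-count register */
	bool periodic;           /* mode taken from the LVT timer register */
	unsigned int irqs;       /* number of timer interrupts raised so far */
};

/* Writing the initial-count register starts (or restarts) the timer. */
static void toy_write_initial_count(struct toy_apic_timer *t, uint32_t val)
{
	t->initial_count = val;
	t->current_count = val;
}

/* Advance the timer by one (already divided) tick. */
static void toy_tick(struct toy_apic_timer *t)
{
	if (t->current_count == 0)
		return;  /* stopped: one-shot expired, or never armed */

	if (--t->current_count == 0) {
		t->irqs++;  /* count reached zero: raise the interrupt */
		if (t->periodic)
			t->current_count = t->initial_count;  /* automatic reload */
	}
}
```

With an initial count of 3, ten ticks raise one interrupt in one-shot mode and three in periodic mode, which is the difference between loading the current count once and reloading it every time it reaches zero.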

See 10.5.4.1 TSC-Deadline Mode:

A write to the LVT Timer Register that changes the timer mode disarms the local APIC timer. The supported timer modes are given in Table 10-2. The three modes of the local APIC timer are mutually exclusive.

TSC-deadline mode allows software to use the local APIC timer to signal an interrupt at an absolute time. In TSC-deadline mode, writes to the initial-count register are ignored; and current-count register always reads 0. Instead, timer behavior is controlled using the IA32_TSC_DEADLINE MSR.

In TSC-deadline mode, writing 0 to the IA32_TSC_DEADLINE MSR disarms the local-APIC timer. Transitioning between TSC-deadline mode and other timer modes also disarms the timer.

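From the above we learn the following:

  • Switching between one-shot/periodic mode and TSC-deadline mode disarms the previously armed timer;
  • In TSC-deadline mode, writes to the initial-count register are ignored and the current-count register always reads 0; the timer is controlled through the IA32_TSC_DEADLINE MSR instead.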

2 KVM Emulation

2.1 APIC Timer Modes

There are three modes: periodic, one-shot and TSC-deadline. The relevant code is:

kvm_lapic_reg_write()
---
	case APIC_LVTT:
		val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
		kvm_lapic_set_reg(apic, APIC_LVTT, val);
		apic_update_lvtt(apic);
		break;

	case APIC_TMICT:
		if (apic_lvtt_tscdeadline(apic))
			break;

		cancel_apic_timer(apic);
		kvm_lapic_set_reg(apic, APIC_TMICT, val);
		start_apic_timer(apic);
		break;
---

apic_update_lvtt()
---
	u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) &
			apic->lapic_timer.timer_mode_mask;

	if (apic->lapic_timer.timer_mode != timer_mode) {
		if (apic_lvtt_tscdeadline(apic) != (timer_mode ==
				APIC_LVT_TIMER_TSCDEADLINE)) {
			cancel_apic_timer(apic);
			kvm_lapic_set_reg(apic, APIC_TMICT, 0);
			apic->lapic_timer.period = 0;
			apic->lapic_timer.tscdeadline = 0;
		}
		apic->lapic_timer.timer_mode = timer_mode;
		limit_periodic_timer_frequency(apic);
	}
---

As apic_update_lvtt() shows, switching between TSC-deadline mode and the other modes cancels the previously armed timer, which is consistent with the manual.
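The condition tested in apic_update_lvtt() can be isolated into a small predicate. The sketch below is a standalone illustration, not kernel code: the macro and function names are hypothetical, the mode encoding (bits 18:17 of the LVT timer register) follows the SDM, and the guest is assumed to support TSC-deadline mode so that both mode bits are writable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Timer-mode field of the LVT timer register, bits 18:17. */
#define TOY_LVT_TIMER_ONESHOT      (0u << 17)
#define TOY_LVT_TIMER_PERIODIC     (1u << 17)
#define TOY_LVT_TIMER_TSCDEADLINE  (2u << 17)
#define TOY_LVT_TIMER_MODE_MASK    (3u << 17)

/*
 * Returns true when a write that changes the LVT timer register from
 * old_lvtt to new_lvtt crosses the TSC-deadline boundary, which is the
 * case in which the quoted code cancels the timer and clears its state.
 */
static bool toy_mode_change_disarms(uint32_t old_lvtt, uint32_t new_lvtt)
{
	uint32_t old_mode = old_lvtt & TOY_LVT_TIMER_MODE_MASK;
	uint32_t new_mode = new_lvtt & TOY_LVT_TIMER_MODE_MASK;

	if (old_mode == new_mode)
		return false;

	return (old_mode == TOY_LVT_TIMER_TSCDEADLINE) !=
	       (new_mode == TOY_LVT_TIMER_TSCDEADLINE);
}
```

Note that, in the quoted code, a switch between one-shot and periodic changes timer_mode but does not take the cancel branch; only a transition into or out of TSC-deadline mode does.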

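APIC_TMICT is the initial-count register. After its value is set, the timer is restarted; in TSC-deadline mode the write is ignored, as the apic_lvtt_tscdeadline() check shows.

In TSC-deadline mode the deadline is programmed through an MSR; see the following code: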


kvm_set_msr_common()/handle_fastpath_set_tscdeadline()
  -> kvm_set_lapic_tscdeadline_msr()
     ---
		hrtimer_cancel(&apic->lapic_timer.timer);
		apic->lapic_timer.tscdeadline = data;
		kvm_pv_update_tscdeadline(vcpu, data);
		start_apic_timer(apic);
	 ---

The value is stored in lapic_timer.tscdeadline.

For one-shot/periodic mode, the expiration time is computed in:

__start_apic_timer()
  -> set_target_expiration() // oneshot or period mode
	 ---
	u64 tscl = rdtsc();
	s64 deadline;

	now = ktime_get();
	apic->lapic_timer.period = tmict_to_ns(apic, kvm_lapic_get_reg(apic, APIC_TMICT));

	deadline = apic->lapic_timer.period;
	...
	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
		nsec_to_cycles(apic->vcpu, deadline);
	apic->lapic_timer.target_expiration = ktime_add_ns(now, deadline);
	 ---
  -> restart_apic_timer()

Note that lapic_timer.tscdeadline is set here as well.
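The arithmetic behind set_target_expiration() can be sketched in isolation. The code below is an illustration under stated assumptions, not the kernel implementation: all toy_* names are hypothetical, one APIC bus cycle is assumed to last 1 ns, and the ns-to-cycles conversion is simplified to ns * tsc_khz / 10^6 rather than the scaled multiplication the kernel may use. The divide-configuration decoding follows the SDM encoding (bits 0, 1 and 3).

```c
#include <stdint.h>

#define TOY_APIC_BUS_CYCLE_NS 1ULL  /* assumption: one bus cycle = 1 ns */

/* Decode the divide configuration register into a divisor (1..128). */
static uint32_t toy_divide_count(uint32_t tdcr)
{
	uint32_t v = (tdcr & 0x3) | ((tdcr & 0x8) >> 1);

	return 1u << ((v + 1) & 0x7);
}

/* Period in nanoseconds for a given initial count and divide config. */
static uint64_t toy_tmict_to_ns(uint32_t tmict, uint32_t tdcr)
{
	return (uint64_t)tmict * TOY_APIC_BUS_CYCLE_NS * toy_divide_count(tdcr);
}

/* Simplified conversion: cycles = ns * tsc_khz / 10^6. */
static uint64_t toy_nsec_to_cycles(uint64_t ns, uint32_t tsc_khz)
{
	return ns * tsc_khz / 1000000ULL;
}

/* The two expiration values that are kept side by side. */
struct toy_deadline {
	uint64_t tsc;  /* counterpart of lapic_timer.tscdeadline */
	uint64_t ns;   /* counterpart of lapic_timer.target_expiration */
};

static struct toy_deadline toy_set_target(uint64_t now_ns, uint64_t now_tsc,
					  uint32_t tmict, uint32_t tdcr,
					  uint32_t tsc_khz)
{
	uint64_t period = toy_tmict_to_ns(tmict, tdcr);
	struct toy_deadline d = {
		.tsc = now_tsc + toy_nsec_to_cycles(period, tsc_khz),
		.ns  = now_ns + period,
	};

	return d;
}
```

For example, an initial count of 1000 with a divide-by-16 configuration (0x3) gives a period of 16000 ns, which corresponds to 32000 TSC cycles on a 2 GHz guest TSC.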

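2.2 Timer Implementation Modes

The timer can be implemented in two ways:

  • hv timer, which uses the VMX preemption timer;
  • sw timer, which uses an hrtimer.

For details on the preemption timer, see the part on clock event virtualization in an earlier blog post:

KVM CPU虚拟化 (KVM CPU Virtualization), jianchwa's blog on CSDN

The following code shows how the hv timer is programmed and how it fires: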

start_hv_timer()
  -> static_call(kvm_x86_set_hv_timer)(vcpu, ktimer->tscdeadline, &expired)
     vmx_set_hv_timer()
	 ---
	 	vmx->hv_deadline_tsc = tscl + delta_tsc;
	 ---

vmx_vcpu_run()
  -> vmx_update_hv_timer(vcpu);
     ---
	if (vmx->hv_deadline_tsc != -1) {
		tscl = rdtsc();
		if (vmx->hv_deadline_tsc > tscl)
			delta_tsc = (u32)((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi);
		else
			delta_tsc = 0;

		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
	}
	 ---

vmx_vcpu_run()
  -> vmx_exit_handlers_fastpath()
	->  handle_fastpath_preemption_timer()
	  -> kvm_lapic_expired_hv_timer()
	    -> apic_timer_expired(apic, false);

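The timer value comes from lapic_timer.tscdeadline. After a series of conversions it is written to VMX_PREEMPTION_TIMER_VALUE; when it expires a VM exit is triggered, and apic_timer_expired() is eventually called to handle the expiration.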
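The conversion performed in vmx_update_hv_timer() can be sketched as a standalone function. This is an illustration, not kernel code: the function name and parameters are hypothetical, and rate_shift stands for the value the quoted code calls cpu_preemption_timer_multi (the preemption timer counts down once every 2^rate_shift TSC cycles). The sketch also makes explicit an overflow check that the quoted snippet does not show: the VMCS field is 32 bits wide, so a deadline that is too far away cannot be programmed.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert an absolute TSC deadline into a preemption-timer count.
 * Returns false if the resulting count does not fit in 32 bits.
 */
static bool toy_preemption_timer_value(uint64_t deadline_tsc, uint64_t now_tsc,
				       unsigned int rate_shift, uint32_t *out)
{
	uint64_t delta;

	if (deadline_tsc <= now_tsc) {
		*out = 0;  /* already expired: exit as soon as possible */
		return true;
	}

	delta = (deadline_tsc - now_tsc) >> rate_shift;
	if (delta > UINT32_MAX)
		return false;  /* too far in the future for a 32-bit field */

	*out = (uint32_t)delta;
	return true;
}
```

A caller that gets false back would need another way to arm the timer, which is one reason a software fallback is useful.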

The sw timer is programmed and fires as follows:

start_sw_period()
---
	hrtimer_start(&apic->lapic_timer.timer,
		apic->lapic_timer.target_expiration,
		HRTIMER_MODE_ABS_HARD);
---

apic_timer_fn()
  -> apic_timer_expired(apic, true);

The sw tscdeadline mode also uses this hrtimer, but through a different code path:
start_sw_tscdeadline()
---
	u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
	...
	now = ktime_get();
	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

	ns = (tscdeadline - guest_tsc) * 1000000ULL;
	do_div(ns, this_tsc_khz);

	if (likely(tscdeadline > guest_tsc) &&
	    likely(ns > apic->lapic_timer.timer_advance_ns)) {
		expire = ktime_add_ns(now, ns);
		expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
		hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_HARD);
	} else
		apic_timer_expired(apic, false);
---

The lapic_timer.timer hrtimer is used not only by one-shot and periodic modes but also by the sw tscdeadline mode; the only difference between them is how the expiration time is computed.
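The expiration computation in start_sw_tscdeadline() can be sketched as follows. This is a standalone illustration with hypothetical names, not kernel code; it mirrors the quoted arithmetic (TSC delta times 10^6 divided by the TSC frequency in kHz gives nanoseconds) and the rule that the hrtimer is armed early by timer_advance_ns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Nanoseconds until the guest TSC reaches tscdeadline (0 if already past). */
static uint64_t toy_tsc_delta_to_ns(uint64_t tscdeadline, uint64_t guest_tsc,
				    uint32_t tsc_khz)
{
	assert(tsc_khz != 0);  /* a zero frequency would divide by zero */

	if (tscdeadline <= guest_tsc)
		return 0;

	return (tscdeadline - guest_tsc) * 1000000ULL / tsc_khz;
}

/*
 * Returns true and stores the absolute host expiration time in *expire_ns
 * when an hrtimer should be armed; returns false when the deadline is
 * already due, in which case the caller treats the timer as expired.
 */
static bool toy_sw_deadline(uint64_t now_ns, uint64_t tscdeadline,
			    uint64_t guest_tsc, uint32_t tsc_khz,
			    uint64_t advance_ns, uint64_t *expire_ns)
{
	uint64_t ns = toy_tsc_delta_to_ns(tscdeadline, guest_tsc, tsc_khz);

	if (ns > advance_ns) {
		*expire_ns = now_ns + ns - advance_ns;
		return true;
	}

	return false;
}
```

With a 2 GHz guest TSC, a deadline 2,000,000 cycles ahead is 1 ms away; with an advance of 1000 ns the hrtimer would be set to fire 999,000 ns from now.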

Whether the hv timer or the sw timer is used depends on the following function:

start_hv_timer()
---
	if (!kvm_can_use_hv_timer(vcpu))
		return false;

	if (!ktimer->tscdeadline)
		return false;

	if (static_call(kvm_x86_set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
		return false;
---

kvm_can_use_hv_timer() mainly depends on two things:

  • whether the preemption timer is supported;
  • whether posted timer interrupts are enabled, which is explained in the next subsection.
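The decision can be condensed into a small sketch. It is only an illustration: the enum, struct and function names are hypothetical, and it assumes that when start_hv_timer() returns false the caller falls back to the sw timer, a step the quoted code does not show.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_backend { TOY_BACKEND_HV, TOY_BACKEND_SW };

/* Hypothetical snapshot of the inputs checked by start_hv_timer(). */
struct toy_timer_state {
	bool can_use_hv_timer;   /* result of kvm_can_use_hv_timer() */
	uint64_t tscdeadline;    /* 0 means no deadline is armed */
	bool hv_program_failed;  /* set_hv_timer hook reported an error */
};

static enum toy_backend toy_pick_backend(const struct toy_timer_state *s)
{
	if (!s->can_use_hv_timer)
		return TOY_BACKEND_SW;
	if (s->tscdeadline == 0)
		return TOY_BACKEND_SW;
	if (s->hv_program_failed)
		return TOY_BACKEND_SW;

	return TOY_BACKEND_HV;
}
```

The ordering mirrors the three early returns in the quoted start_hv_timer(): capability first, then whether a deadline exists, then whether the hardware accepted it.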

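So do one-shot and periodic modes use the hv timer as well?

Yes, they do: set_target_expiration() also computes apic->lapic_timer.tscdeadline.

The expiration values used by the two implementations are stored in:

  • hv timer: lapic_timer.tscdeadline
  • sw timer: lapic_timer.target_expiration

In periodic mode the timer must be re-armed after it expires. Both the sw timer and the hv timer implement this: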

In sw timer mode,
apic_timer_fn()
---
	if (lapic_is_periodic(apic)) {
		advance_periodic_target_expiration(apic);
		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
		return HRTIMER_RESTART;
	}
---

In hv timer mode,
kvm_lapic_expired_hv_timer()
---
	if (apic_lvtt_period(apic) && apic->lapic_timer.period) {
		advance_periodic_target_expiration(apic);
		restart_apic_timer(apic);
	}
---
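The body of advance_periodic_target_expiration() is not quoted above. The sketch below assumes that it moves the target forward by one period from the previous target, rather than from the current time, and then recomputes the TSC deadline from the remaining distance; all names are hypothetical and the ns-to-cycles conversion is simplified.

```c
#include <stdint.h>

/* Hypothetical mirror of the fields involved in periodic re-arming. */
struct toy_periodic {
	uint64_t target_ns;    /* counterpart of target_expiration */
	uint64_t period_ns;    /* counterpart of period */
	uint64_t tscdeadline;  /* counterpart of tscdeadline */
};

/*
 * Advance to the next period. Adding the period to the old target keeps
 * the handler's own latency from accumulating as drift.
 */
static void toy_advance_periodic(struct toy_periodic *t, uint64_t now_ns,
				 uint64_t now_tsc, uint32_t tsc_khz)
{
	uint64_t delta_ns;

	t->target_ns += t->period_ns;
	delta_ns = t->target_ns > now_ns ? t->target_ns - now_ns : 0;
	t->tscdeadline = now_tsc + delta_ns * tsc_khz / 1000000ULL;
}
```

If the handler runs 50 ns late for a target of 1000 ns with a 1000 ns period, the next target is still 2000 ns, so the new deadline is only 950 ns away.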

It is worth noting that once a vCPU enters the blocked state, the timer is switched from hv mode to sw mode; see the code:

static int vmx_pre_block(struct kvm_vcpu *vcpu)
{
	if (pi_pre_block(vcpu))
		return 1;

	if (kvm_lapic_hv_timer_in_use(vcpu))
		kvm_lapic_switch_to_sw_timer(vcpu);

	return 0;
}

static void vmx_post_block(struct kvm_vcpu *vcpu)
{
	if (kvm_x86_ops.set_hv_timer)
		kvm_lapic_switch_to_hv_timer(vcpu);

	pi_post_block(vcpu);
}

Normally a timer can wake a CPU from idle. In hv mode, the preemption timer is obviously useless for a vCPU that is not running, so sw mode is needed: an hrtimer on the host wakes the vCPU up.

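2.3 Interrupt Injection

Once the timer has expired, how is the corresponding vector injected into the guest?

There are two ways, corresponding to the hv timer and the sw timer. They differ in the context from which apic_timer_expired() is called:

  • hv timer, vCPU context. The relevant code is:
    not from the timer fn, namely the preemption timer VM-exit case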
    apic_timer_expired()
       ---
    	if (!from_timer_fn && vcpu->arch.apicv_active) {
    		kvm_apic_inject_pending_timer_irqs(apic);
              -> kvm_apic_local_deliver(apic, APIC_LVTT)
    	}
       ---
    
  • sw timer, interrupt context. The relevant code is:
    from the timer fn, namely the software-emulated timer
       ---
    	atomic_inc(&apic->lapic_timer.pending);
    	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
    	if (from_timer_fn)
    		kvm_vcpu_kick(vcpu);
       ---
    
    vcpu_run()
    ---
    	for (;;) {
    		...
    		if (kvm_vcpu_running(vcpu)) {
    			r = vcpu_enter_guest(vcpu);
    		} else {
    			r = vcpu_block(kvm, vcpu);
    		}
    
    		if (r <= 0)
    			break;
    
    		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
    		if (kvm_cpu_has_pending_timer(vcpu))
    			kvm_inject_pending_timer_irqs(vcpu);
    		...
    	}
    ---

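    In this case, lapic_timer.pending is set first and the interrupt is injected later, from the vCPU run loop after the VM exit. If the timer fires on the pCPU where the vCPU is running, kvm_vcpu_kick() does nothing; otherwise it sends an IPI to the pCPU where the vCPU is running.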
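The hand-off between the timer callback and the vCPU loop can be modelled with an atomic counter. This is a toy model with hypothetical names, written with C11 atomics rather than the kernel's atomic_t API. It assumes that all expirations recorded before the vCPU loop runs are delivered as a single interrupt, which the quoted code does not show.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy vCPU state; names are hypothetical. */
struct toy_vcpu {
	atomic_int pending;     /* expirations recorded but not yet injected */
	unsigned int injected;  /* interrupts delivered to the guest */
	unsigned int kicks;     /* times the vCPU thread was kicked */
};

/* Timer callback side (interrupt context): record the event and kick. */
static void toy_timer_fired(struct toy_vcpu *v)
{
	atomic_fetch_add(&v->pending, 1);
	v->kicks++;  /* stands in for kvm_vcpu_kick() */
}

/* vCPU loop side: inject one interrupt if anything is pending. */
static bool toy_inject_pending(struct toy_vcpu *v)
{
	/* Take and clear the counter in one step so no event is lost. */
	if (atomic_exchange(&v->pending, 0) <= 0)
		return false;

	v->injected++;
	return true;
}
```

The split matters because the callback runs in interrupt context and only records the event, while the injection itself happens in the vCPU thread, where it is safe to touch the guest's APIC state.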

apic_timer_expired() also contains the following injection path:

apic_timer_expired()
---
	if (kvm_use_posted_timer_interrupt(apic->vcpu)) {
		if (vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
		    vcpu->arch.apic->lapic_timer.timer_advance_ns)
			__kvm_wait_lapic_expire(vcpu);
		kvm_apic_inject_pending_timer_irqs(apic);
		return;
	}
---

See the commit that introduced it:

commit 0c5f81dad46c90792e6c3c4797131323c9e96dcd
Author: Wanpeng Li <wanpengli@tencent.com>
Date:   Sat Jul 6 09:26:51 2019 +0800

    KVM: LAPIC: Inject timer interrupt via posted interrupt
    
    Dedicated instances are currently disturbed by unnecessary jitter due
    to the emulated lapic timers firing on the same pCPUs where the
    vCPUs reside.  There is no hardware virtual timer on Intel for guest
    like ARM, so both programming timer in guest and the emulated timer fires
    incur vmexits.  This patch tries to avoid vmexit when the emulated timer
    fires, at least in dedicated instance scenario when nohz_full is enabled.
    
    In that case, the emulated timers can be offload to the nearest busy
    housekeeping cpus since APICv has been found for several years in server
    processors. The guest timer interrupt can then be injected via posted interrupts,
    which are delivered by the housekeeping cpu once the emulated timer fires.
    
    The host should tuned so that vCPUs are placed on isolated physical
    processors, and with several pCPUs surplus for busy housekeeping.
    If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
    ~3% redis performance benefit can be observed on Skylake server, and the
    number of external interrupt vmexits drops substantially.  Without patch
    
                VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
    EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )
    
    While with patch:
    
                VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
    EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
    
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Radim Krčmář <rkrcmar@redhat.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

This feature is related to CPU isolation. For details, see the SUSE blog:

CPU Isolation – Housekeeping and tradeoffs – by SUSE Labs (part 4) | SUSE Communities
https://www.suse.com/c/cpu-isolation-housekeeping-and-tradeoffs-part-4/

An excerpt:

On normal configurations, every CPU get its housekeeping duty share. On the opposite, nohz_full configurations implicitly move away all the housekeeping work outside the nohz_full set. This means that if you have 8 CPUs and you isolate CPUs 1,2,3,4,5,6,7:

nohz_full=1-7

Then CPU 0 will handle the housekeeping workload alone. These duties involve:

  • Unbound timer callbacks execution.
  • Unbound workqueues execution.
  • Unbound kthreads execution
  • Timekeeping updates (jiffies and gettimeofday())
  • RCU grace periods tracking
  • RCU callbacks execution on behalf of isolated CPUs
  • 1Hz residual offloaded timer ticks on behalf of isolated CPUs
  • Depending on your extended setting:
    • Hardware IRQs that could be affine
    • User tasks others than the isolated workload

Using posted interrupt delivery avoids the extra VM exit that kvm_vcpu_kick() would otherwise cause.
