Fixing a hang when installing Windows 10 on an oVirt VM with a thin-provisioned FCP disk

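Problem description:
The problem occurs only with FCP thin provisioning. In testing, the Windows 10 installation hangs at "Getting files ready for installation (13%)", and virsh shows that the VM has entered the paused state.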

The following errors appear in both log 1 and log 2:

2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onSuspend (vm:5085)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error (vm:4218)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,556+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,558+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,559+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,559+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,561+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,561+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,561+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,563+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,563+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,563+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,565+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,566+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,566+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,568+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)
2018-04-18 18:37:21,568+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') abnormal vm stop device scsi0-0-0-0 error enospc (vm:4218)
2018-04-18 18:37:21,568+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') CPU stopped: onIOError (vm:5085)
2018-04-18 18:37:21,570+0800 INFO (libvirt/events) [virt.vm] (vmId='13a5e5bd-f101-45df-bf0c-923da11bec67') No VM drives were extended (vm:4225)

In testing, the VM ends up suspended. On the vdsm side, the ENOSPC error reported by qemu is received and handled through the following flow:

extendDrivesIfNeeded -> extendDriveVolume

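The failure actually occurs in the check physical - alloc < drive.watermarkLimit inside _shouldExtendVolume. watermarkLimit is computed as self.VOLWM_FREE_PCT * self.volExtensionChunk / 100, which in turn depends on the vdsm.conf parameter volume_utilization_chunk_mb (default 1024); with the defaults the limit is 512 MB. When the VM disk needs to grow by more than this limit, the extension fails. Raising the conf parameter did not help, so I removed the restriction on disk extension in the code. I later found that this change defeats thin provisioning: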

Apr 21 14:34:07 Linx vdsmd: ----extend:[]
Apr 21 14:34:07 Linx vdsmd: ------out extend Drives
Apr 21 14:34:09 Linx vdsmd: ------in extend Drives
Apr 21 14:34:09 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:09 Linx vdsmd: ----ret:[(, 'ab0474bc-7b63-413a-8817-6f4fd4bdd871', 107374182400L, 0L, 10737418240L)]
Apr 21 14:34:09 Linx vdsmd: ----extend:[]
Apr 21 14:34:09 Linx vdsmd: ------out extend Drives
Apr 21 14:34:11 Linx vdsmd: ------in extend Drives
Apr 21 14:34:11 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:11 Linx vdsmd: ----ret:[(, 'ab0474bc-7b63-413a-8817-6f4fd4bdd871', 107374182400L, 0L, 10737418240L)]
Apr 21 14:34:11 Linx vdsmd: ----extend:[]
Apr 21 14:34:11 Linx vdsmd: ------out extend Drives
Apr 21 14:34:13 Linx vdsmd: ------in extend Drives
Apr 21 14:34:13 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:13 Linx vdsmd: ----ret:[(, 'ab0474bc-7b63-413a-8817-6f4fd4bdd871', 107374182400L, 0L, 10737418240L)]
Apr 21 14:34:13 Linx vdsmd: ----extend:[]
Apr 21 14:34:13 Linx vdsmd: ------out extend Drives
Apr 21 14:34:15 Linx vdsmd: ------in extend Drives
Apr 21 14:34:15 Linx vdsmd: blockInfo:[107374182400L, 0L, 10737418240L]
Apr 21 14:34:15 Linx vdsmd: ----ret:[(

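The main purpose of thin provisioning is to reduce the disk space a VM occupies. Once thin provisioning stops working, the VM's disk usage doubles after every snapshot. The debug prints show that ENOSPC events fire continuously while the alloc size stays at 0. The fix in the commit simply returns True, so the disk keeps being extended until it reaches its upper limit, which is exactly what defeats thin provisioning.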

Further investigation:

The log is as follows:

Apr 21 15:42:06 Linx vdsmd: blockInfo:[107374182400L, 0L, 536870912L]
Apr 21 15:42:06 Linx vdsmd: ----ret:[(, u'83cf08db-ea47-40be-9b6f-c39b8444d369', 107374182400L, 0L, 536870912L)]
Apr 21 15:42:06 Linx vdsmd: ----false3, physical:536870912; alloc:0; watermarkLimit:268435456;
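
The comparison behind that last log line can be replayed in a few lines. This is only a sketch of the arithmetic, not vdsm's code: the function names are hypothetical, and the 50% free-space figure used for the default limit is an assumption chosen so that a 1024 MB chunk yields the 512 MB limit mentioned earlier.

```python
MiB = 1024 * 1024


def watermark_limit(free_pct, chunk_mb):
    """Free space (in bytes) below which the volume should be extended."""
    return free_pct * chunk_mb * MiB // 100


def should_extend(physical, alloc, limit):
    """Same comparison as `physical - alloc < watermarkLimit`."""
    return physical - alloc < limit


# Assumed defaults: 50% free watermark, 1024 MiB chunk.
default_limit = watermark_limit(free_pct=50, chunk_mb=1024)
print(default_limit // MiB)  # 512

# Values copied from the log line above: alloc is reported as 0.
physical, alloc, limit = 536870912, 0, 268435456
print(should_extend(physical, alloc, limit))      # False

# With a non-zero allocation (400 MiB, made up) the check would fire.
print(should_extend(physical, 400 * MiB, limit))  # True
```

As long as the reported allocation is 0, the free space always looks as large as the whole volume, so the watermark check can never request an extension.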

On the vdsm side, disk extension is triggered through two call paths:

Flow 1:

DriveWatermarkMonitor._execute -> self._vm.extendDrivesIfNeeded()

Flow 2:

onIOError (callback) -> if reason == 'ENOSPC': self.extendDrivesIfNeeded()

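The two paths correspond to the following two situations:
Flow 1) A background monitor watches the disk watermark continuously and extends the disk when the threshold is reached.
Flow 2) If the disk hits the threshold between two monitoring cycles, qemu pauses the VM and raises ENOSPC; vdsm catches the error and extends the disk.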
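
A toy model of the two flows may make the failure mode easier to see. Everything here is hypothetical (class names, sizes, and the assumption that the VM resumes once a drive has been extended); it only mirrors the shape of the two call paths described above, not vdsm's implementation.

```python
MiB = 1024 * 1024


class Drive:
    def __init__(self, name, physical, alloc, limit, chunk):
        self.name = name
        self.physical = physical  # size of the backing volume
        self.alloc = alloc        # allocation as reported by blockInfo
        self.limit = limit        # watermark limit
        self.chunk = chunk        # amount added per extension


class Vm:
    def __init__(self, drives):
        self.drives = drives
        self.paused = False

    def extend_drives_if_needed(self):
        """Extend every drive whose free space is under its limit."""
        extended = []
        for drive in self.drives:
            if drive.physical - drive.alloc < drive.limit:
                drive.physical += drive.chunk
                extended.append(drive.name)
        return extended

    def on_monitor_tick(self):
        """Flow 1: the periodic watermark monitor."""
        return self.extend_drives_if_needed()

    def on_io_error(self, reason):
        """Flow 2: qemu paused the VM and reported an I/O error."""
        self.paused = True
        extended = []
        if reason == "ENOSPC":
            extended = self.extend_drives_if_needed()
        if extended:
            self.paused = False  # assumption: the VM resumes once extended
        return extended


def make_vm(reported_alloc):
    drive = Drive("sda", 512 * MiB, reported_alloc, 256 * MiB, 512 * MiB)
    return Vm([drive])


healthy = make_vm(reported_alloc=400 * MiB)
print(healthy.on_monitor_tick())    # ['sda']

stuck = make_vm(reported_alloc=0)   # blockInfo reports allocation 0
print(stuck.on_monitor_tick())      # []
print(stuck.on_io_error("ENOSPC"))  # []
print(stuck.paused)                 # True
```

With the allocation stuck at 0, neither flow extends anything and the VM stays paused, which matches the "No VM drives were extended" lines in the vdsm log.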

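Handling only the case where qemu has already paused the VM would leave the VM pausing intermittently during sustained disk writes, which the user sees as constant stuttering. What really needs to be fixed is that the libvirt call self._dom.blockInfo, used in vdsm's _getExtendCandidates, fails to return the disk's actual current size.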

Inside libvirt, self._dom.blockInfo follows the same call path as virsh domblkinfo. The test results are as follows:

virsh # domblkinfo --device sda --domain linx80-1
Capacity: 107374182400
Allocation: 0
Physical: 536870912

virsh # domstats --block --domain linx80-1
Domain: 'linx80-1'
block.count=2
block.0.name=hdc
block.0.rd.reqs=4
block.0.rd.bytes=152
block.0.rd.times=82981
block.0.wr.reqs=0
block.0.wr.bytes=0
block.0.wr.times=0
block.0.fl.reqs=0
block.0.fl.times=0
block.0.allocation=0
block.0.physical=0
block.1.name=sda
block.1.path=/rhev/data-center/a3ae667f-bd61-4b9e-903b-9f57b2e89080/572679d5-b080-425e-9e4f-f5d01988a6be/images/1796cd02-5b15-4d72-864d-4bc33ca7cc1c/8ad5f4b3-b3fd-43c5-8c47-8da9a24904e9
block.1.rd.reqs=13340
block.1.rd.bytes=404829696
block.1.rd.times=73699128640
block.1.wr.reqs=1182
block.1.wr.bytes=128020480
block.1.wr.times=128182811428
block.1.fl.reqs=157
block.1.fl.times=401870367
block.1.allocation=149946368
block.1.capacity=107374182400
block.1.physical=536870912

As the output shows, domblkinfo fails to report Allocation while domstats does report allocation, which places the problem on the libvirt side.
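
The same comparison can be scripted. The sketch below parses text in the two formats shown above (trimmed to the relevant fields) and flags the mismatch; the helper names are hypothetical, and in practice the text would come from running the two virsh commands.

```python
DOMBLKINFO = """\
Capacity:       107374182400
Allocation:     0
Physical:       536870912
"""

DOMSTATS = """\
block.1.name=sda
block.1.allocation=149946368
block.1.capacity=107374182400
block.1.physical=536870912
"""


def parse_domblkinfo(text):
    """Parse 'Key: value' lines into a dict of lower-cased keys."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip().lower()] = int(value)
    return info


def parse_domstats_block(text, device):
    """Return the numeric block.N.* fields for the given device name."""
    fields = dict(l.split("=", 1) for l in text.splitlines() if "=" in l)
    for key, value in fields.items():
        if key.endswith(".name") and value == device:
            prefix = key[: -len("name")]  # e.g. "block.1."
            return {
                k[len(prefix):]: int(v)
                for k, v in fields.items()
                if k.startswith(prefix) and v.isdigit()
            }
    raise KeyError(device)


blk = parse_domblkinfo(DOMBLKINFO)
stats = parse_domstats_block(DOMSTATS, "sda")
mismatch = blk["allocation"] == 0 and stats["allocation"] > 0
print(blk["allocation"], stats["allocation"], mismatch)
# 0 149946368 True
```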

The domblkinfo call path in libvirt is:

domblkinfo -> cmdDomblkinfo -> virDomainGetBlockInfo -> conn->driver->domainGetBlockInfo -> qemuDomainGetBlockInfo -> qemuMonitorGetAllBlockStatsInfo

The main problem lies in this part of qemuDomainGetBlockInfo:

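    /* entry holds the block stats reported by the qemu monitor. */
    if (entry->physical == 0 || info->allocation == 0 ||
        info->allocation == entry->physical) {
        /* Unconditional overwrite: if entry->physical is 0, allocation becomes 0. */
        info->allocation = entry->physical;
        if (info->allocation == 0)
            info->allocation = entry->physical;

        /* Re-read the physical size from disk->src. */
        if (qemuDomainStorageUpdatePhysical(driver, cfg, vm, disk->src) < 0)
            goto endjob;

        info->physical = disk->src->physical;
    } else {
        info->physical = entry->physical;
    }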

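When the physical size obtained from the monitor is 0, libvirt forcibly sets the allocation to 0 and then re-reads the disk size from disk->src. The intent is to guarantee that the allocation never exceeds the physical size. However, when the size read from disk->src is non-zero, the allocation is still left at 0. The actual change is in [1]: the allocation is set to 0 only when the size read from disk->src is also 0.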
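
The difference can be modelled outside libvirt. The two functions below are a Python paraphrase of the behaviour described in the text, not the patch in [1]; the function names are hypothetical. The sample inputs assume the monitor reported a physical size of 0 and the allocation that domstats showed.

```python
def block_info_before(entry_physical, entry_allocation, src_physical):
    """Model of the quoted branch; returns (allocation, physical)."""
    allocation = entry_allocation
    if (entry_physical == 0 or allocation == 0
            or allocation == entry_physical):
        allocation = entry_physical  # becomes 0 when the monitor says 0
        physical = src_physical      # refreshed from disk->src
    else:
        physical = entry_physical
    return allocation, physical


def block_info_after(entry_physical, entry_allocation, src_physical):
    """Model of the described fix: zero allocation only if disk->src is 0."""
    allocation = entry_allocation
    if (entry_physical == 0 or allocation == 0
            or allocation == entry_physical):
        physical = src_physical
        if physical == 0:
            allocation = 0
    else:
        physical = entry_physical
    return allocation, physical


# Assumed inputs: monitor physical = 0, monitor allocation as in domstats,
# and 536870912 bytes read back from disk->src.
print(block_info_before(0, 149946368, 536870912))  # (0, 536870912)
print(block_info_after(0, 149946368, 536870912))   # (149946368, 536870912)
```

The first result matches what domblkinfo printed (Allocation 0, Physical 536870912); the second keeps the allocation, so vdsm's watermark check has a real value to work with.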

Testing:
Reinstalled Windows 10 three times; disk allocation behaved correctly during every installation.
