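Background:
The physical host running the SSVM and CPVM crashed. Afterwards the SSVM and CPVM still showed a state of Running, and their host was still listed as the crashed machine. I booted the physical host successfully, logged in, and ran virsh list --all to check whether the SSVM and CPVM were actually running; they were not. I then checked every other physical host and still found no SSVM or CPVM virtual machines. The CloudStack UI, however, kept showing both as Running on that host, and naturally their IP addresses could not be pinged. I tried to delete the SSVM and CPVM, but that failed, and even the stop operation failed. Yet new instances could still be created without any trouble. Simply a BIG BUG!
Log information: /var/log/cloudstack/management/management-server.log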
2013-12-17 21:33:26,525 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-130:job-130) Executing org.apache.cloudstack.api.command.admin.systemvm.DestroySystemVmCmd for job-130
2013-12-17 21:33:26,527 DEBUG [cloud.api.ApiServlet] (catalina-exec-9:null) ===END=== 10.200.251.246 -- GET command=destroySystemVm&id=94576696-a734-459b-b697-9ade8d616e68&response=json&sessionkey=yY8M0StWM6ohsnSO3nhPZGj7xKk%3D&_=1387333995495
2013-12-17 21:33:26,612 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-130:job-130) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 1 new host id: 1 host id before state transition: 1
2013-12-17 21:33:26,618 WARN [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to stop vm, agent unavailable: com.cloud.exception.AgentUnavailableException: Resource [Host:1] is unreachable: Host 1: Host with specified id is not in the right state: Disconnected
2013-12-17 21:33:26,618 WARN [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to stop vm VM[SecondaryStorageVm|s-1-VM]
2013-12-17 21:33:26,628 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-130:job-130) VM state transitted from :Stopping to Running with event: OperationFailedvm's original host id: 1 new host id: 1 host id before state transition: 1
2013-12-17 21:33:26,628 DEBUG [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to stop the VM so we can't expunge it.
2013-12-17 21:33:26,628 DEBUG [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Unable to destroy the vm because it is not in the correct state: VM[SecondaryStorageVm|s-1-VM]
2013-12-17 21:33:26,628 INFO [cloud.vm.VirtualMachineManagerImpl] (Job-Executor-130:job-130) Did not expunge VM[SecondaryStorageVm|s-1-VM]
2013-12-17 21:33:26,640 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-130:job-130) Complete async job-130, jobStatus: 2, resultCode: 530, result: Error Code: 530 Error text: Fail to destroy system vm
2013-12-17 21:33:26,728 DEBUG [agent.transport.Request] (StatsCollector-1:null) Seq 15-1464552034: Received: { Ans: , MgmtId: 345051385634, via: 15, Ver: v1, Flags: 10, { GetHostStatsAnswer } }
2013-12-17 21:33:27,100 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-13:null) Ping from 8
2013-12-17 21:33:27,235 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-9:null) Ping from 14
2013-12-17 21:33:27,454 DEBUG [agent.transport.Request] (AgentManager-Handler-8:null) Seq 8-1342917711: Processing: { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"timeout","wait":0}}] }
2013-12-17 21:33:27,455 DEBUG [agent.transport.Request] (AgentManager-Handler-12:null) Seq 8-1342917712: Processing: { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details":"timeout","wait":0}}] }
2013-12-17 21:33:27,455 DEBUG [agent.transport.Request] (AgentTaskPool-3:null) Seq 8-1342917711: Received: { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, { Answer } }
2013-12-17 21:33:27,455 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-3:null) host (10.196.53.73) cannot be pinged, returning null ('I don't know')
2013-12-17 21:33:27,455 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-3:null) sending ping from (9) to agent's host ip address (10.196.53.73)
2013-12-17 21:33:27,455 DEBUG [agent.transport.Request] (AgentTaskPool-16:null) Seq 8-1342917712: Received: { Ans: , MgmtId: 345051385634, via: 8, Ver: v1, Flags: 10, { Answer } }
2013-12-17 21:33:27,455 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-16:null) host (10.196.53.74) cannot be pinged, returning null ('I don't know')
2013-12-17 21:33:27,455 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-16:null) sending ping from (9) to agent's host ip address (10.196.53.74)
2013-12-17 21:33:27,460 DEBUG [agent.transport.Request] (AgentTaskPool-3:null) Seq 9-241192500: Sending { Cmd , MgmtId: 345051385634, via: 9, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.196.53.73","wait":20}}] }
2013-12-17 21:33:27,461 DEBUG [agent.transport.Request] (AgentTaskPool-16:null) Seq 9-241192501: Sending { Cmd , MgmtId: 345051385634, via: 9, Ver: v1, Flags: 100011, [{"PingTestCommand":{"_computingHostIp":"10.196.53.74","wait":20}}] }
2013-12-17 21:33:27,585 DEBUG [agent.transport.Request] (StatsCollector-1:null) Seq 16-1532317381: Received: { Ans: , MgmtId: 345051385634, via: 16, Ver: v1, Flags: 10, { GetHostStatsAnswer } }
2013-12-17 21:33:27,890 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-1:null) Ping from 11
Key information:
Unable to destroy the vm because it is not in the correct state: VM[SecondaryStorageVm|s-1-VM]
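Picking such lines out of a busy management-server.log by eye is tedious. The following is a minimal sketch, not a CloudStack tool: it assumes every entry begins with a timestamp of the form shown above, and the function names split_entries and find_failures are made up for illustration.

```python
import re

# Assumption: every log entry starts with a timestamp such as
# "2013-12-17 21:33:26,618 " (date, time, comma, milliseconds, space).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")

# Phrases taken from the log excerpt in this post that signal a failed operation.
FAILURE_MARKERS = ("Unable to", "Did not expunge")


def split_entries(log_text):
    """Split raw log text into entries, even if several share one line."""
    starts = [m.start() for m in TIMESTAMP.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[a:b].strip() for a, b in zip(starts, ends)]


def find_failures(log_text, vm_name):
    """Return the entries that mention vm_name and contain a failure phrase."""
    return [
        entry
        for entry in split_entries(log_text)
        if vm_name in entry and any(marker in entry for marker in FAILURE_MARKERS)
    ]


if __name__ == "__main__":
    sample = (
        "2013-12-17 21:33:26,618 WARN [cloud.vm.VirtualMachineManagerImpl] "
        "(Job-Executor-130:job-130) Unable to stop vm VM[SecondaryStorageVm|s-1-VM] "
        "2013-12-17 21:33:27,100 DEBUG [agent.manager.AgentManagerImpl] "
        "(AgentManager-Handler-13:null) Ping from 8"
    )
    for entry in find_failures(sample, "s-1-VM"):
        print(entry)
```

Entries that report the failure without naming the VM, such as the AgentUnavailableException line, are not matched by this filter and would need a search on the job id instead.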
Database information
mysql> SELECT * FROM host WHERE name like '%s-1-VM%'\G // system VM record in the host table
*************************** 1. row ***************************
id: 21
name: s-1-VM
uuid: 986db967-13a9-48ca-815b-c41d6951a3f3
status: Disconnected
type: SecondaryStorageVM
private_ip_address: 10.196.53.74
private_netmask: 255.255.255.0
private_mac_address: 06:51:e0:00:00:07
storage_ip_address: 10.196.53.82
storage_netmask: 255.255.255.0
storage_mac_address: 06:51:e0:00:00:07
storage_ip_address_2: NULL
storage_mac_address_2: NULL
storage_netmask_2: NULL
cluster_id: NULL
public_ip_address: 10.196.53.76
public_netmask: 255.255.255.0
public_mac_address: 06:e0:2c:00:00:0e
proxy_port: NULL
data_center_id: 1
pod_id: 1
cpus: NULL
speed: NULL
url: NoIqn
fs_type: NULL
hypervisor_type: NULL
hypervisor_version: NULL
ram: 0
resource: NULL
version: 4.1.1
parent: NULL
total_size: NULL
capabilities: NULL
guid: s-1-VM-NfsSecondaryStorageResource
available: 1
setup: 0
dom0_memory: 0
last_ping: 1354828061
mgmt_server_id: 345051385634
disconnected: NULL
created: 2013-12-18 05:18:54
removed: NULL
update_count: 2
resource_state: Enabled
owner: NULL
lastUpdated: NULL
engine_state: Disabled
1 row in set (0.00 sec)
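mysql> SELECT * FROM vm_instance WHERE name like '%s-1-VM%'\G // system VM record in the VM instance table; the state that the CloudStack UI shows for instances and system VMs is read from the state field of this table.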
*************************** 1. row ***************************
id: 22
name: s-1-VM
uuid: 8bd3ab0c-a431-4dd2-85a7-013921427f6a
instance_name: s-1-VM
state: Running
vm_template_id: 3
guest_os_id: 15
private_mac_address: 06:51:e0:00:00:07
private_ip_address: 10.196.53.74
pod_id: 1
data_center_id: 1
host_id: 15
last_host_id: 15
proxy_id: 55
proxy_assign_time: 2013-12-18 05:20:52
vnc_password: VoRRPovUk7w7/+islEFf9Ai0tbTep0WOJJod0PLOJkU=
ha_enabled: 0
limit_cpu_use: 0
update_count: 3
update_time: 2013-12-18 05:18:59
created: 2013-12-18 05:17:04
removed: NULL
type: SecondaryStorageVm
vm_type: SecondaryStorageVm
account_id: 1
domain_id: 1
service_offering_id: 9
reservation_id: a2a55809-abfa-4b6e-92f8-105cf8bef2a8
hypervisor_type: KVM
disk_offering_id: NULL
cpu: NULL
ram: NULL
owner: NULL
speed: NULL
host_name: NULL
display_name: NULL
desired_state: NULL
1 row in set (0.01 sec)
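The crux of the problem
The key lies in one field from each of the two tables: status in the host table shows Disconnected, while state in the vm_instance table shows Running. The CloudStack UI likewise shows both system VMs as Running.
Solution:
Anyone familiar with these two virtual machines knows they are robust: once deleted, they are rebuilt automatically, and faults in them are usually fixed by deleting them and letting them be recreated. Since they cannot be deleted from the UI, modify the corresponding field in the database and set the state to Destroyed.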
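The same comparison can be made mechanically for every system VM rather than one at a time. This is only a sketch under stated assumptions: the rows of the two queries are assumed to be already loaded as dictionaries, matching on the name column follows what the two queries above do, and the helper name find_stale_system_vms is hypothetical.

```python
def find_stale_system_vms(vm_rows, host_rows):
    """Return names of VMs recorded as Running whose agent record is Disconnected.

    vm_rows   -- dicts with 'name' and 'state'  (rows from vm_instance)
    host_rows -- dicts with 'name' and 'status' (rows from host)
    """
    # Index the host table rows by name so each VM can be looked up directly.
    status_by_name = {row["name"]: row["status"] for row in host_rows}
    return [
        vm["name"]
        for vm in vm_rows
        if vm["state"] == "Running"
        and status_by_name.get(vm["name"]) == "Disconnected"
    ]


if __name__ == "__main__":
    vms = [{"name": "s-1-VM", "state": "Running"}]
    hosts = [{"name": "s-1-VM", "status": "Disconnected"}]
    print(find_stale_system_vms(vms, hosts))
```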
UPDATE vm_instance SET state='Destroyed' WHERE name='s-1-VM';
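Because this edits the management database directly, it may be worth guarding the statement so that it can only touch the one stuck row. The sketch below illustrates the idea and is not the procedure used in this post: an in-memory SQLite table stands in for the MySQL cloud database, the extra WHERE conditions and the function name mark_destroyed are assumptions, and with a MySQL driver the ? placeholders would typically become %s.

```python
import sqlite3


def mark_destroyed(conn, vm_name, vm_type="SecondaryStorageVm"):
    """Set state to 'Destroyed' for exactly one stuck VM, or change nothing."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE vm_instance SET state = 'Destroyed' "
            "WHERE name = ? AND type = ? "
            "AND state = 'Running' AND removed IS NULL",
            (vm_name, vm_type),
        )
        if cur.rowcount != 1:
            # Raising inside the with-block rolls the update back.
            raise RuntimeError(
                "expected exactly 1 matching row, got %d" % cur.rowcount
            )
    return cur.rowcount


if __name__ == "__main__":
    # Stand-in table with only the columns the statement touches.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vm_instance (name TEXT, state TEXT, type TEXT, removed TEXT)"
    )
    conn.execute(
        "INSERT INTO vm_instance VALUES ('s-1-VM', 'Running', 'SecondaryStorageVm', NULL)"
    )
    conn.commit()
    mark_destroyed(conn, "s-1-VM")
    print(conn.execute("SELECT name, state FROM vm_instance").fetchall())
```

The CPVM needs the same treatment with its own name and the type value stored in its vm_instance row.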