Resource Scheduling in vSphere and Nova Compute
In a previous post on vSphere integration with OpenStack Nova Compute, I gave an overview of the OpenStack Nova Compute project and how VMware vSphere integrates with that particular project. In this post, I want to review Compute resource scheduling, a.k.a. nova-scheduler in OpenStack and Distributed Resource Scheduler (DRS) in vSphere. I want to show how DRS compares with nova-scheduler and the impact of running both as part of an OpenStack implementation. Note that I can’t be completely exhaustive in this blog post, but would recommend everyone read the following:
- For DRS, the “bible” is “VMware vSphere 5 Clustering Technical Deepdive,” by Frank Denneman and Duncan Epping.
- For nova-scheduler, the most comprehensive treatment I’ve read is Yong Sheng Bong’s excellent blog post on “OpenStack nova-scheduler and its algorithm.” It’s a must read if you want to dive deep into the internals of the nova-scheduler.
OpenStack nova-scheduler
Nova Compute uses the nova-scheduler service to determine which compute node should be used when a VM instance is to be launched. The nova-scheduler, like vSphere DRS, makes this automatic initial placement decision using a number of metrics and policy considerations, such as available compute resources and VM affinity rules. Note that unlike DRS, the nova-scheduler does not do periodic load-balancing or power-management of VMs after the initial placement. The scheduler has a number of configurable options that can be accessed and modified in the nova.conf file, the primary file used to configure nova-compute services in an OpenStack cloud:
scheduler_driver=nova.scheduler.multi.MultiScheduler
compute_scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_default_filters=AvailabilityZoneFilter,RamFilter,ComputeFilter
least_cost_functions=nova.scheduler.least_cost.compute_fill_first_cost_fn
compute_fill_first_cost_fn_weight=-1.0
The two configuration options I want to focus on are "scheduler_default_filters" and "least_cost_functions." They represent the two algorithmic processes used by the Filter Scheduler to determine initial placement of VMs. (There are two other schedulers, the Chance Scheduler and the Multi Scheduler, that can be used in place of the Filter Scheduler; however, the Filter Scheduler is the default and should be used in most cases.) These two processes work together to balance workloads across all existing compute nodes at VM launch, much in the same way that Dynamic Entitlements and Resource Allocation Settings are used by DRS to balance VMs across an ESXi cluster.
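Before digging into each phase, here is a minimal sketch of the overall flow to frame the two passes. The hosts, filters and cost function below are invented stand-ins for illustration only; they are not the actual nova-scheduler classes or data structures.

```python
# Minimal sketch of the Filter Scheduler's two-pass flow (illustrative only).
hosts = [
    {"name": "node1", "free_ram_mb": 8192, "availability_zone": "az1"},
    {"name": "node2", "free_ram_mb": 2048, "availability_zone": "az2"},
    {"name": "node3", "free_ram_mb": 4096, "availability_zone": "az1"},
]
request = {"ram_mb": 4096, "availability_zone": "az1"}

def az_filter(host, req):
    # Stand-in for the AvailabilityZoneFilter.
    return host["availability_zone"] == req["availability_zone"]

def ram_filter(host, req, ram_allocation_ratio=1.5):
    # Stand-in for the RamFilter, with the default 1.5 overcommit ratio.
    return req["ram_mb"] <= host["free_ram_mb"] * ram_allocation_ratio

def fill_first_cost(host):
    # Stand-in for compute_fill_first_cost_fn: raw cost is free RAM.
    return host["free_ram_mb"]

# Pass 1: filtering builds the eligible host list.
eligible = [h for h in hosts if az_filter(h, request) and ram_filter(h, request)]

# Pass 2: costs and weights; the host with the lowest weighted cost wins.
weight = -1.0  # -1.0 favors the emptiest host; 1.0 would fill a host first
best = min(eligible, key=lambda h: weight * fill_first_cost(h))
print(best["name"])  # node1 -- in az1 with the most free RAM
```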
Filtering (scheduler_default_filters)
When a request is made to launch a new VM, the nova-compute service contacts the nova-scheduler to request placement of the new instance. The nova-scheduler uses the scheduler named in the nova.conf file, by default the Filter Scheduler, to determine that placement. First, a filtering process determines which hosts are eligible for consideration and builds an eligible host list; then a second algorithm, Costs and Weights (described later), is applied against that list to determine which compute node is optimal for fulfilling the request.
The Filtering process uses the scheduler_default_filters configuration option in nova.conf to determine which of the filters made available through scheduler_available_filters are applied to weed out ineligible compute nodes and create the eligible host list. By default, three filters are used:
- The AvailabilityZoneFilter filters out hosts that do not belong to the Availability Zone specified when a new VM launch request is made via the Horizon dashboard or from the nova CLI client.
- The RamFilter ensures that only nodes with sufficient RAM make it on to the eligible host list. If the RamFilter is not used, the nova-scheduler may over-provision a node with insufficient RAM resources. By default, the filter allows RAM to be overcommitted by 50 percent, i.e. the scheduler will allow a VM requiring 1.5 GB of RAM to be launched on a node with only 1 GB of available RAM (see the short sketch after this list). This setting is configurable by changing the "ram_allocation_ratio=1.5" setting in nova.conf.
- The ComputeFilter filters out any nodes that do not have sufficient capabilities to launch the requested VM as it corresponds to the requested instance type. It also filters out any nodes that cannot support properties defined by the image that is being used to launch the VM. These image properties include architecture, hypervisor and VM mode.
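To see how the default 1.5 overcommit ratio from the RamFilter description above plays out, here is the arithmetic the filter effectively performs; this is a simplified sketch, not the actual filter code.

```python
# Simplified RamFilter-style check: a host passes if the requested RAM fits
# within its available RAM scaled by the overcommit ratio.
ram_allocation_ratio = 1.5   # default value in nova.conf
free_ram_mb = 1024           # node with only 1 GB of available RAM
requested_ram_mb = 1536      # instance flavor asking for 1.5 GB

passes = requested_ram_mb <= free_ram_mb * ram_allocation_ratio
print(passes)  # True -- the node stays on the eligible host list
```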
As an example of how the Filtering process might work, hosts 2 and 4 in the environment above may have been filtered out for any combination of the following reasons, assuming the default filters were applied:
- The requested VM is to be in Availability Zone 1 while nodes 2 and 4 are in Availability Zone 2.
- The requested VM requires 4 GB of RAM and nodes 2 and 4 each have only 2 GB of available RAM.
- The requested VM has to run on vSphere and nodes 2 and 4 support KVM.
There are a number of other filters that can be used along with or in place of the default filters; some of them include:
- The CoreFilter ensures that only nodes with sufficient CPU cores make it on to the eligible host list. If the CoreFilter is not used, the nova-scheduler may over-provision a node with insufficient physical cores. By default, the filter is set to allow overcommitment based on a ratio of 16 virtual cores to one physical core. This setting is configurable by changing the "cpu_allocation_ratio=16.0" setting in nova.conf.
- The DifferentHostFilter ensures that the VM instance is launched on a different compute node from a given set of instances, as defined in a scheduler hint list. This filter is analogous to the anti-affinity rules in vSphere DRS.
- The SameHostFilter ensures that the VM instance is launched on the same compute node as a given set of instances, as defined in a scheduler hint list. This filter is analogous to the affinity rules in vSphere DRS.
The full list of filters is available in the Filters section of the "OpenStack Compute Administration Guide." The nova-scheduler is flexible enough that custom filters can be created and multiple filters can be applied simultaneously.
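To illustrate that flexibility, here is a sketch of what a custom filter could look like. It is modeled on the Grizzly-era filter interface (a BaseHostFilter subclass implementing host_passes); the class name and the "pinned_hosts" scheduler hint are purely illustrative and not part of OpenStack.

```python
# Illustrative custom filter: pass only hosts named in a hypothetical
# "pinned_hosts" scheduler hint supplied at boot time.
from nova.scheduler import filters


class PinnedHostFilter(filters.BaseHostFilter):
    """Accept only hosts listed in the 'pinned_hosts' scheduler hint."""

    def host_passes(self, host_state, filter_properties):
        hints = filter_properties.get("scheduler_hints") or {}
        pinned = hints.get("pinned_hosts")
        if not pinned:
            # No hint supplied -- do not restrict placement.
            return True
        return host_state.host in pinned
```

A filter like this would then be added to the scheduler_default_filters list in nova.conf alongside, or in place of, the defaults.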
Costs and Weights
Next, the Filter Scheduler takes the hosts that remain after the filters have been applied and applies one or more cost functions to each host to produce a numerical score per host. Each cost score is multiplied by a weighting constant specified in the nova.conf config file; the algorithm itself is covered in detail in the "OpenStack nova-scheduler and its algorithm" post I referenced previously. The weighting constant configuration option is the name of the cost function with the _weight string appended. Here is an example of specifying cost functions and their corresponding weights:
least_cost_functions=nova.scheduler.least_cost.compute_fill_first_cost_fn,nova.scheduler.least_cost.noop_cost_fn
compute_fill_first_cost_fn_weight=-1.0
noop_cost_fn_weight=1.0
There are three cost functions available; any of the functions can be used alone or in any combination with the others.
- The nova.scheduler.least_cost.compute_fill_first_cost_fn function calculates the amount of available RAM on the compute nodes and chooses which node is best suited for fulfilling a VM launch request based on the weight assigned to the function. A weight of 1.0 will configure the scheduler to "fill up" a node until there is insufficient RAM available. A weight of -1.0 will configure the scheduler to favor the node with the most available RAM for each VM launch request.
- The nova.scheduler.least_cost.retry_host_cost_fn function adds additional cost for retrying a node that was already used for a previous attempt. The intent of this function is to ensure that nodes which consistently encounter failures are used less frequently.
- The nova.scheduler.least_cost.noop_cost_fn function will cause the scheduler not to discriminate between any nodes. In practice this function is never used.
The Costs and Weights process is analogous to the Shares concept in vSphere DRS.
In the example above, if we choose to use only the nova.scheduler.least_cost.compute_fill_first_cost_fn function and set the weight to compute_fill_first_cost_fn_weight=1.0, we would expect the following results (sketched in code after the list):
- All nodes would be ranked according to amount of available RAM, starting with host 4.
- The nova-scheduler would favor launching VMs on host 4 until there is insufficient RAM available to launch a new instance.
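To make the weighting arithmetic concrete, here is a small sketch of the least-cost calculation for both weight settings; the host names and RAM figures are invented for illustration.

```python
# compute_fill_first_cost_fn effectively reports each host's free RAM; the
# scheduler then picks the host with the lowest weighted cost. The figures
# below are invented for illustration.
free_ram_mb = {"host1": 8192, "host2": 6144, "host3": 4096, "host4": 2048}

def weighted_cost(host, weight):
    return weight * free_ram_mb[host]

# Weight 1.0: least free RAM means lowest cost, so hosts fill up first.
ranking = sorted(free_ram_mb, key=lambda h: weighted_cost(h, 1.0))
print(ranking)  # ['host4', 'host3', 'host2', 'host1']

# Weight -1.0: most free RAM means lowest (most negative) cost, spreading VMs.
ranking = sorted(free_ram_mb, key=lambda h: weighted_cost(h, -1.0))
print(ranking)  # ['host1', 'host2', 'host3', 'host4']
```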
DRS with the Filter Scheduler
Now that we’ve reviewed how Nova compute schedules resources and how it works compared with vSphere DRS, let’s look at how DRS and the nova-scheduler work together when integrating vSphere into OpenStack. Note that I will not be discussing Nova with the ESXDriver and standalone ESXi hosts; in that configuration, the filtering works the same as it does with other hypervisors. I will, instead, focus on Nova with the VCDriver and ESXi clusters managed by vCenter.
Since vCenter abstracts the ESXi hosts from the nova-compute service, the nova-scheduler views the cluster as a single host with resources amounting to the aggregate resources of all ESXi hosts in the cluster. This has two effects:
- The nova-scheduler plays NO part in where VMs are hosted in the cluster once the decision is made to use the vSphere cluster to fulfill a launch request. When the appropriate vSphere cluster is chosen by the scheduler, nova-compute simply makes an API call to vCenter and hands off the request to launch a VM. vCenter then selects which ESXi host in the cluster will host the launched VM, based on Dynamic Entitlements and Resource Allocation settings. Any automatic load balancing or power management performed by DRS afterward is allowed, but is not visible to Nova Compute.
- In the example above, node 1 has 8 GB of available RAM, while the nova-scheduler sees node 2 as having 12 GB of RAM; again, the nova-scheduler does not take into account the RAM consumed by the vCenter Server or the RAM actually available on any individual ESXi host, only the aggregate RAM across all ESXi hosts in the cluster.
The latter effect can cause issues with how VMs are scheduled/distributed across a multi-compute node OpenStack environment. Let’s take a look at two use cases where the nova-scheduler may be impeded in how it attempts to schedule resources; we’ll focus on RAM resources, using the environment shown above, and assume we are allowing the default of 50 percent overcommitment on physical RAM resources.
- A user requests a VM with 10 GB of RAM. The nova-scheduler applies the Filtering algorithm and adds the vSphere compute node to the eligible host list even though neither ESXi host in the cluster has sufficient RAM resources, as defined by the RamFilter. If the vSphere compute node is chosen to launch the new instance, vCenter will then determine whether there are enough resources to fulfill the request based on DRS-defined rules (a numeric sketch follows this list).
- A user requests a VM with 4 GB of RAM. The nova-scheduler applies the Filtering algorithm and correctly adds all three compute nodes to the eligible host list. The scheduler then applies the Costs and Weights algorithm and favors the vSphere compute node, which it believes has the most available RAM. This creates an imbalanced system: the hypervisor/compute node with the lower amount of actual available RAM may be incorrectly assigned the lower cost and favored for launching new VMs.
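Here is a rough numeric sketch of the first use case; the cluster size and RAM figures are invented for illustration, but they show why the aggregate view defeats the RamFilter.

```python
# Why the aggregate view defeats the RamFilter (illustrative figures only).
esxi_hosts_free_ram_mb = [6144, 6144]                 # two ESXi hosts, 6 GB each
reported_free_ram_mb = sum(esxi_hosts_free_ram_mb)    # scheduler sees 12 GB total
requested_ram_mb = 10240                              # user asks for a 10 GB VM
ram_allocation_ratio = 1.5                            # default overcommit

# Check against the aggregate, as the nova-scheduler sees it -- passes:
print(requested_ram_mb <= reported_free_ram_mb * ram_allocation_ratio)  # True

# The same check against any single ESXi host -- fails even with overcommit:
print(any(requested_ram_mb <= free * ram_allocation_ratio
          for free in esxi_hosts_free_ram_mb))                          # False
```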
So how do we design around this issue? Unfortunately, the VCDriver is new to OpenStack and there does not seem to be a great deal of documentation or recommended practices available. I hope this changes shortly and I plan to help as best I can. I've been speaking with the VMware team working on OpenStack integration and I hope we'll be able to collaborate on documentation and recommended practices. VMware is also continuing to improve on the code it has already contributed, and I expect some of these issues will be addressed in future OpenStack releases; I also expect that additional functionality will be added to Nova Compute as OpenStack continues to mature.