SmartHarvest: Harvesting Idle CPUs Safely and Efficiently in the Cloud
{2021}, {Yawen Wang}, {EuroSys}
@inproceedings{wang2021smartharvest,
title={SmartHarvest: harvesting idle CPUs safely and efficiently in the cloud},
author={Wang, Yawen and Arya, Kapil and Kogias, Marios and Vanga, Manohar and Bhandari, Aditya and Yadwadkar, Neeraja J and Sen, Siddhartha and Elnikety, Sameh and Kozyrakis, Christos and Bianchini, Ricardo},
booktitle={Proceedings of the Sixteenth European Conference on Computer Systems},
pages={1--16},
year={2021}
}
Summary
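Fill this in last, after the rest of the notes are written: summarize the content of the paper, and read this paragraph first when consulting the notes later. Note: the summary must come from your own thinking and be written in your own words. Do not copy the original text with Ctrl + C.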
Research Objective(s)
We can increase the efficiency of public cloud datacenters by harvesting allocated but temporarily idle CPU cores from customer virtual machines (VMs) to run batch or analytics workloads. Even small efficiency gains translate into substantial savings, since provisioning and operating a datacenter costs hundreds of millions of dollars per year.
Background / Problem Statement
Most datacenters continue to operate at low resource utilization. Even when the cloud platform can dynamically scale the number of VMs, customers often leave plenty of spare capacity in case load increases faster than the platform can react. This overprovisioning prevents degradations in user experience but also massively underutilizes the platform’s resources.
Insights
- A primary core is conservatively considered busy if it has an active software thread running on it at the time of the query.
- Since SmartHarvest runs in a public cloud, it only has black-box access to the primary VMs, which severely restricts the types of features it can use. Besides static properties of the VMs, EVMAgent can only observe the external resource utilization of the VMs by calling the hypervisor. We focus on CPU utilization. Application performance metrics and service-level objectives are completely opaque to the agent. EVMAgent computes the following five features for training and prediction: the min, max, average, standard deviation, and median CPU usage. We identify these features using a feature selection technique that trains a decision tree on offline data to rank features according to their importance in deciding the predicted values.
Feature importances with a forest of trees
Idea(s)
What method or algorithm do the authors use to solve the problem? Does it build on prior methods? If so, which ones?
The approach is to dynamically harvest spare resources from regular VMs for a co-located ElasticVM. Spare resources are resources that are allocated to the primary VMs but are temporarily idle. The ElasticVM is a new type of low-priority VM allocated with a minimum set of resources. We also propose SmartHarvest, a system that dynamically manages the number of cores available to ElasticVMs in each fine-grained time window. SmartHarvest uses online learning to predict the core demand of the primary (customer) VMs and compute the number of cores that can be safely harvested. Based on this prediction, it reassigns cores.
3.3 SmartHarvest Architecture & Operation
SmartHarvest maintains an idle buffer of cores that is ready to immediately absorb load increases of the primary VMs. SmartHarvest tries to reserve only as many cores in the idle buffer as needed, so it periodically predicts the peak number of cores required by the primary VMs at a fine, sub-second time granularity.
The agent splits time into “learning windows”. During each window, the agent frequently polls the hypervisor for the number of busy primary cores and records this data for the learning algorithm to use. A primary core is conservatively considered busy if it has an active software thread running on it at the time of the query. If at any point during the window all primary cores are found to be busy, then SmartHarvest has run out of cores in the idle buffer, indicating that the learning algorithm may have underpredicted the peak number of needed cores. In this case, the agent immediately enforces a short-term safeguard for the next window by expanding the primary VMs’ assignment and contracting the ElasticVM’s assignment.
If the safeguard is not engaged, the agent runs the learning algorithm to predict the peak number of cores needed by the primary VMs for the new window. Then the agent assigns any remaining primary cores to the ElasticVM.
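The per-window decision described above can be sketched in Python. This is only an illustration under stated assumptions, not the paper's implementation: the function name `next_assignment`, the `safeguard_step` and `elastic_min` parameters, and the choice to expand the primary assignment by a fixed step are all hypothetical, and `predict_peak` stands in for the learning algorithm.

```python
from typing import Callable, List, Tuple


def next_assignment(
    total_cores: int,
    primary_cores: int,
    busy_samples: List[int],
    predict_peak: Callable[[List[int]], int],
    elastic_min: int = 1,      # assumed: the ElasticVM always keeps a minimum allocation
    safeguard_step: int = 2,   # assumed: how far the safeguard expands the primary VMs
) -> Tuple[int, int, bool]:
    """Return (primary cores, ElasticVM cores, safeguard engaged) for the next window."""
    max_primary = total_cores - elastic_min

    # If any poll saw every primary core busy, the idle buffer ran out:
    # expand the primary assignment and contract the ElasticVM.
    if any(busy >= primary_cores for busy in busy_samples):
        new_primary = min(max_primary, primary_cores + safeguard_step)
        return new_primary, total_cores - new_primary, True

    # Otherwise ask the learner for the predicted peak and give the rest away.
    predicted = predict_peak(busy_samples)
    new_primary = max(1, min(max_primary, predicted))
    return new_primary, total_cores - new_primary, False
```

For example, with 8 cores, 4 of them assigned to the primary VMs, and polls of `[1, 2, 4, 3]`, the third poll finds all primary cores busy, so this sketch engages the safeguard instead of calling the predictor.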
3.4 Harvesting with Smart Decisions
The five features used for training and prediction are the min, max, average, standard deviation, and median CPU usage.
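As a minimal sketch, the five features could be computed from the busy-core samples polled during one window as follows. The function name `cpu_usage_features` and the dictionary keys are illustrative, and using the population standard deviation (rather than the sample one) is an assumption; the notes do not say which variant the paper uses.

```python
import statistics
from typing import Dict, Sequence


def cpu_usage_features(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize one window of CPU usage samples into the five features."""
    if not samples:
        raise ValueError("need at least one CPU usage sample")
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": statistics.fmean(samples),
        # Assumption: population standard deviation over the window.
        "std": statistics.pstdev(samples),
        "median": statistics.median(samples),
    }
```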
Learning algorithm
In contrast, a cost-sensitive multi-class classification algorithm allows us to train a separate predictor for each core count (class) and select the class with the lowest cost during prediction. This accommodates our notion of differentiated costs, because it allows us to specify any cost value for each class without worrying about the relationship between these values. Since the costs are continuous, we can run a separate regression for each class with the cost as the predicted value. In general, we assign lower costs to classes that are equal to or slightly above the correct class, and higher costs to all other classes. This skews the learner towards making small overpredictions, and heavily penalizes it for underpredicting primary usage (which triggers the safeguard).
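The idea of one cost regressor per class can be sketched as below. Everything specific here is an assumption made for illustration: the cost values (`over_unit`, `under_penalty`), the linear model, the plain SGD update, and the names `class_costs` and `CostSensitiveClassifier` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def class_costs(true_peak: int, num_classes: int,
                over_unit: float = 1.0, under_penalty: float = 10.0) -> List[float]:
    """Cost of predicting each core count when `true_peak` cores were needed."""
    costs = []
    for cls in range(num_classes):
        if cls >= true_peak:
            # Equal or slightly above the true class: low cost.
            costs.append(over_unit * (cls - true_peak))
        else:
            # Underprediction would trigger the safeguard: heavy penalty.
            costs.append(under_penalty + over_unit * (true_peak - cls))
    return costs


class CostSensitiveClassifier:
    """One linear cost regressor per class; prediction picks the cheapest class."""

    def __init__(self, num_classes: int, num_features: int, lr: float = 0.01):
        # Each weight vector holds one weight per feature plus a trailing bias.
        self.weights = [[0.0] * (num_features + 1) for _ in range(num_classes)]
        self.lr = lr

    def _predict_cost(self, cls: int, x: Sequence[float]) -> float:
        w = self.weights[cls]
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))

    def predict(self, x: Sequence[float]) -> int:
        costs = [self._predict_cost(c, x) for c in range(len(self.weights))]
        return min(range(len(costs)), key=costs.__getitem__)

    def update(self, x: Sequence[float], true_peak: int) -> None:
        """One online SGD step on squared error for every class's regressor."""
        targets = class_costs(true_peak, len(self.weights))
        for cls, target in enumerate(targets):
            err = self._predict_cost(cls, x) - target
            w = self.weights[cls]
            for i, xi in enumerate(x):
                w[i] -= self.lr * err * xi
            w[-1] -= self.lr * err
```

Because each class has its own regressor, the cost table can take any shape, which is what allows underpredictions to be penalized far more heavily than small overpredictions.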
Evaluation
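How do the authors evaluate their method? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or anything worth borrowing?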
Conclusion
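What conclusions do the authors draw? Which are strong conclusions, and which are weak ones (that is, the authors provide no experimental evidence and only mention them in the discussion, or the experimental data does not give sufficient evidence)?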
Notes
(optional) Notes that do not fit under the headings above but need to be recorded.
References
(optional) List highly relevant literature so that it can be tracked later.