A Novice's Guide to Hyperparameter Optimization at Scale

Despite the tremendous success of machine learning (ML), modern algorithms still depend on a variety of free, non-trainable hyperparameters. Ultimately, our ability to select quality hyperparameters governs the performance of a given model. In the past, and sometimes even today, hyperparameters were hand-selected through trial and error. An entire field has been dedicated to improving this selection process; it is referred to as hyperparameter optimization (HPO). Inherently, HPO requires testing many different hyperparameter configurations and as a result can benefit tremendously from massively parallel resources like the Perlmutter system we are building at the National Energy Research Scientific Computing Center (NERSC). As we prepare for Perlmutter, we wanted to explore the multitude of HPO frameworks and strategies that exist on a model of interest. This article is a product of that exploration and is intended to provide an introduction to HPO methods and guidance on running HPO at scale, based on my recent experiences and results.

Disclaimer: this article contains plenty of general, non-software-specific information about HPO, but there is a bias toward free open-source software that is applicable to our systems at NERSC.

In this article, we will cover …

Scalable HPO with Ray Tune

Being able to leverage the power of modern compute resources to run HPO at scale is important to efficiently search hyperparameter space, especially in the time of deep learning (DL) where the size of neural networks continues to increase. Luckily for all of us, the folks at Ray Tune have made scalable HPO easy. Ray Tune is an open-source Python library for distributed HPO built on Ray; below is a graphic of the general procedure to run Ray Tune at NERSC. Some highlights of Ray Tune:

  • Supports any ML framework
  • Internally handles job scheduling based on the resources available
  • Integrates with external optimization packages (e.g. Ax, Dragonfly, HyperOpt, SigOpt)
  • Implements state-of-the-art schedulers (e.g. ASHA, AHB, PBT)

I have enjoyed using Ray Tune, but if you choose a different HPO framework, no worries; there is still plenty of general information in this article.

[Figure: general procedure for running Ray Tune at NERSC. Image by author.]

Schedulers vs Search Algorithms

One of the first distinctions I want to point out about HPO strategies is the difference between a scheduler and a search algorithm. The search algorithm governs how hyperparameter space is sampled and optimized (e.g. random search). From a practical standpoint, the search algorithm provides a mechanism to select hyperparameter configurations (i.e. trials) to test. A search algorithm is always necessary for HPO. Schedulers, in contrast, improve the overall efficiency of the HPO by terminating unpromising trials early. For example, if I use random search, some of the trials are expected to perform poorly, so it would be nice to have the ability to terminate those trials early, saving valuable compute resources. This is what a scheduler does. A scheduler is not strictly necessary for HPO, but schedulers massively improve performance.

Below are brief descriptions and references for the schedulers and the search algorithms I examine in this article.


Async Successive Halving Algorithm (ASHA — scheduler)

First, I want to define the successive halving algorithm (SHA). Instead of doing it myself, I will use the definition given in this paper, which I really like; they also include pseudocode for SHA and ASHA, if you are interested.

The idea behind SHA is simple: allocate a small budget to each configuration, evaluate all configurations and keep the top 1/η, increase the budget per configuration by a factor of η, and repeat until the maximum per-configuration budget of R is reached.


[Figure: illustration of successive halving. Image by author, adapted from an AutoML post.]
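To make the budget arithmetic concrete, here is a toy sketch in Python; the numbers (η = 4, R = 256 epochs, 256 starting configurations) are made up for illustration and are not from my experiments.

eta, R = 4, 256                  # halving rate and maximum per-configuration budget (epochs)
n_configs, budget = 256, 1       # illustrative starting population and initial budget
rung = 0
while budget <= R:
    print(f"rung {rung}: {n_configs} configs trained for {budget} epoch(s) each")
    n_configs = max(1, n_configs // eta)   # keep only the top 1/eta of configurations
    budget *= eta                          # give the survivors eta times more budget
    rung += 1

Each pass through the loop corresponds to one rung.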

The SHA does not parallelize well because all configurations need to be evaluated for a short time before the top 1/η can be selected. This creates a bottleneck at each rung (each successive halving is referred to as a rung). ASHA decouples trial promotion and rung completion, such that trials can be advanced to the next rung at any given time. If a trial cannot be promoted, additional trials can be added to the base rung so more promotions are possible.

A major assumption of SHA and ASHA is that if a trial performs well over an initial short time interval, it will perform well at longer time intervals. A classic example where this assumption can break down is tuning learning rates. Larger learning rates may outperform smaller learning rates at short times, causing the smaller learning rate trials to be erroneously terminated. In practice, I am honestly not sure how much this matters.

Async Hyperband (AHB — scheduler)

Hyperband (HB) is a scheduler designed to mitigate the SHA’s bias towards initial performance. HB essentially loops over the SHA with a variety of halving rates — attempting to balance early termination with providing more resources per trial regardless of initial performance. Each loop of the SHA is considered a bracket, which can have a number of rungs. See the figure below. AHB is identical to HB except it loops over ASHA. The AHB and ASHA implementation used in Ray Tune is described in this paper.


[Figure: Hyperband brackets and rungs. Image by author.]

Population Based Training (PBT — hybrid)

I call PBT a hybrid because it has aspects of both a scheduler and a search algorithm. It can also function as an HPO strategy and a trainer all-in-one; more on that in the Not All Hyperparameters Can Be Treated the Same section. At a high level, PBT is similar to a genetic algorithm. There is a population of workers, where each worker is assigned a random configuration of hyperparameters (a trial), and at set intervals a worker's hyperparameter configuration is replaced by that of a higher-performing worker in the population (exploitation) and randomly perturbed (exploration). The user can set the balance of exploitation vs exploration. Here are a couple of resources to learn more: a blog and a paper. A minimal Ray Tune configuration sketch follows the figure below.

[Figure: Population Based Training exploit and explore steps. Image by author.]
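As a sketch of what this looks like in Ray Tune, the PopulationBasedTraining scheduler takes a perturbation interval and a set of hyperparameter mutations for the exploit/explore steps. The metric name, interval, and mutation ranges below are placeholders rather than the settings from my experiments.

import random
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="validation_mae",           # placeholder metric name reported by the trainable
    mode="min",
    perturbation_interval=5,           # exploit/explore every 5 training iterations
    hyperparam_mutations={
        # only algorithm hyperparameters are mutated; the architecture stays fixed
        "lr": lambda: random.uniform(1e-4, 1e-1),
        "batch_size": [32, 64, 128],
    },
)
# pbt would then be passed to tune.run(..., scheduler=pbt)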

Random Search (RS — search algorithm)

When the hyperparameter space of interest is reasonably large (too large for a grid search), the default algorithm is random search. This is exactly what it sounds like: hyperparameter configurations, or trials, are randomly selected from the search space. Given enough compute time, RS works reasonably well.

Bayesian Optimization (BO — search algorithm)

BO provides an algorithmic approach to determining the optimal hyperparameters, instead of randomly searching. Because the objective function is unknown in HPO, a black-box optimizer like BO is necessary. In BO, a surrogate model approximates the objective function, and an acquisition function is used to sample new points, or new hyperparameter configurations in this case. Gaussian processes are typically used as the surrogate models in BO for HPO. Ideally, BO can converge towards the optimal hyperparameters much more efficiently than random search.

Not All Hyperparameters Can Be Treated the Same

There are two main types of hyperparameters in ML and they dictate what HPO strategies are possible.


Model Hyperparameters: Establish model architecture


  • Number of convolutional layers
  • Number of fully connected layers
  • etc.

Algorithm Hyperparameters: Involved in the learning process

  • Learning rates
  • Batch size
  • Momentum
  • etc.

The important takeaway is that not all HPO strategies can handle both model and algorithm hyperparameters. PBT is a good example. PBT was designed to evolve and inherit hyperparameters from other high-performing workers in the population; however, if workers have different network architectures, it is unclear to me how exactly that would work. There might be a way to do this with PBT, but it is not standard and does not work out-of-the-box with Ray Tune.

Time-to-Solution Study

To compare different HPO strategies, I decided to keep it simple and focus on the average time-to-solution, which is a metric that is relatively straightforward to interpret. There are a couple of caveats about my results:

  1. I did this work with a particular model and problem in mind (more on that below), so I do not expect these results to be completely general.
  2. There are many arbitrary choices that go into various HPO strategies that may alter the results.

HPO Strategies Investigated

  • Random Search (RS) — search algorithm, no scheduler
  • Async Successive Halving Algorithm/Random Search (ASHA/RS) — scheduler and search algorithm
  • Async Hyperband/Random Search (AHB/RS) — scheduler and search algorithm
  • Async Successive Halving Algorithm/Bayesian Optimization (ASHA/BO) — scheduler and search algorithm

Model Details

The model I was interested in optimizing hyperparameters for is a graph neural network used in the field of catalysis to predict adsorption energies. Specific details can be found here.


Hyperparameters Optimized

I examined six hyperparameters, listed below:

  • Learning Rate
  • Batch Size
  • Atom Embedding Size
  • Number of Graph Convolution Layers
  • Fully Connected Feature Size
  • Number of Fully Connected Layers

Pro Tip: When making decisions about the size of the hyperparameter space you want to search, consider memory usage. When tuning network architecture and batch size, I ran into memory issues on our 16GB GPUs at NERSC.

Questions Explored

  1. What is the impact of a scheduler?
  2. How much can a sophisticated search algorithm improve HPO?

The first question I wanted to investigate was the impact of using a scheduler. To address this question I compared the time-to-solution of ASHA/RS, AHB/RS, and RS using the same computational resources for each (4 Cori GPU Nodes for 8 hours). All three strategies use the same search algorithm with the addition of the ASHA and the AHB schedulers. The notation I am using is scheduler/search algorithm.


Going beyond a scheduler, I was curious how much a “smarter” search algorithm, such as BO, would improve HPO performance. To explore this question I compared the time-to-solution of ASHA/RS and ASHA/BO using the same computational resources for each (4 Cori GPU Nodes for 4 hours).


Results and Discussion

[Figure: average time-to-solution comparing ASHA/RS, AHB/RS, and RS given the same computational resources]

ASHA/RS clearly outperformed both AHB/RS and RS by reaching a lower average test MAE in a shorter period of time. ASHA/RS improved the time-to-solution by at least 5x compared to RS. I say at least 5x, because RS did not converge to the lower limit of the test MAE within the 8 hour limit. Additionally, more ASHA/RS trials were close to the mean, resulting in a smaller standard deviation. The top 6 trials were time averaged in all cases. I suspect the performance of ASHA/RS is largely because of the number of trials completed. ASHA/RS finished nearly 2x the trials of AHB/RS and over 8x the trials of RS. The number of trials finished can be seen in the top right corner. I should also mention that the number of ASHA/RS and AHB/RS trials is not at its upper limit because of the amount of checkpointing I was doing. Minimal checkpointing is critical for the performance of SHA-based HPO strategies. This is illustrated by the number of trials finished in the ASHA/RS experiment below that used less checkpointing: the same number of trials in half the time. The reduced checkpointing increased the time-to-solution improvement for ASHA/RS to approximately 10x compared to RS!

[Figure: average time-to-solution comparing ASHA/RS and ASHA/BO given the same computational resources]

It can be seen from the figure above that there is on average no benefit to adding BO for my particular model. My hypothesis is that the hyperparameter surface I was trying to optimize had a bunch of local minima (think egg carton) and no obvious global minimum, which would reduce the benefit of BO. Situations where I can see BO working well are large hyperparameter search spaces with a more well-defined global minimum, not that you can know this a priori. Overall, I think a good approach to HPO is building complexity as needed. One last note on BO: while there was no improvement on average, the single best trial I found used ASHA/BO. As a result, if I had to choose one configuration of hyperparameters, that is the one I would select.

The time delay between the ASHA/RS and the ASHA/BO curves is likely because the acquisition function used in BO needs to be conditioned with a certain amount of data before sampling new hyperparameter configurations.


Optimal Scheduling with PBT

One of the nice features of PBT is the ability to develop an ideal scheduling procedure. For instance, I can determine the ideal learning rate throughout training, which is usually quite important. In my case, I want a configuration of hyperparameters and a learning rate scheduler that I can use to train my model repeatedly. Most ML frameworks include learning rate schedulers (e.g. multistep, reduce on plateau, exponential decay, etc.) to reduce the learning rate as training progresses. Hence, a custom learning rate scheduler can be developed using PBT and incorporated into a given ML framework for subsequent training.


Alternatively, if repeated training is not necessary for your application, PBT can be used directly as a training procedure, and ideal schedules can be developed for all algorithm hyperparameters simultaneously.

Training with PBT is very efficient in terms of actual time; in fact, it uses roughly the same amount of time as your normal training procedure, but total computational time goes up because multiple workers are necessary, maybe 16–32 GPUs. In Ray Tune, workers can also be time-multiplexed if the number of workers exceeds the resource size.

Optimal Learning Rate — Results and Discussion

I wanted to experiment with PBT and find a learning rate schedule for my model (described above). Here are the results.


[Figure: top, Test MAE of the best trial in the population; bottom, learning rate as a function of training iteration]

The top plot shows the Test MAE for the best trial in the population. There are some jumps in the Test MAE where random perturbations were presumably attempted and, since they did not yield an improvement, were ultimately reversed. The lower plot displays the learning rate as a function of training iterations. It appears that my ideal learning rate could reasonably be modeled by a multistep scheduler.
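For reference, a schedule like this can be encoded with PyTorch's built-in multistep scheduler once the milestones are read off the PBT plot. The sketch below uses a stand-in model and placeholder milestones and decay factor, not the values from my run.

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(8, 1)                          # stand-in for the real graph network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# milestones/gamma are placeholders; in practice, read them off the PBT learning rate plot
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()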

Cheat Sheet for Selecting an HPO Strategy

Choosing an HPO strategy really depends on your particular application. For many of the chemistry and materials science applications that I am interested in, reasonably good hyperparameters that get us 85% of the way there will do just fine. Alternatively, some of you might be interested in squeezing out every last drop of performance for a given model. There is not a one-size-fits-all solution, but I’ve put together a little cheat sheet to help get the ideas flowing.


[Figure: cheat sheet for selecting an HPO strategy. Image by author.]

Technical Tips

Ray Tune

Ray Tune is very user friendly, and you only need to consider a few things when setting it up to run your model (I am not going to go in-depth here because Ray Tune's documentation and examples are great):

  1. Define a trainable API, either function or class based. I recommend the class option as it allows you to do much more.
  2. Write a script to run Tune via tune.run()

A minimal sketch of both steps follows.
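Here is that sketch, using the class-based API with the method names from the Ray release I was using at the time (_setup, _train, _save, _restore). The trainable is a dummy: the metric name, fake objective, and checkpoint stubs are placeholders, not my actual model code.

from ray import tune
from ray.tune.schedulers import ASHAScheduler

class MyTrainable(tune.Trainable):
    def _setup(self, config):
        # build the model and dataset here; this dummy just stores a hyperparameter
        self.lr = config["lr"]
        self.score = 10.0

    def _train(self):
        # one training iteration; return the metric(s) Tune should track
        self.score -= self.lr
        return {"validation_mae": abs(self.score)}

    def _save(self, checkpoint_dir):
        return checkpoint_dir      # a real trainable would serialize the model here

    def _restore(self, checkpoint_path):
        pass                       # and reload it here

analysis = tune.run(
    MyTrainable,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=20,
    scheduler=ASHAScheduler(metric="validation_mae", mode="min"),
    resources_per_trial={"cpu": 1},
    stop={"training_iteration": 10},
)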

General Tips


  • Check to ensure your model is being put on the right device; this sounds silly but it's worthwhile. Put a print statement in your _setup function, if you are using the class API, to double check

  • Ray Tune has a bunch of handy functions (e.g. tune.uniform) to generate random distributions (a small search-space sketch follows this list)
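As an illustration, here is a search space over the six hyperparameters listed earlier, defined with Tune's sampling helpers. The ranges loosely mirror the Dragonfly domain shown below, but treat them as an example rather than a recommendation.

from ray import tune

config = {
    "lr": tune.uniform(1e-3, 1e-1),
    "batch_size": tune.choice([40, 61, 79, 102, 120, 141, 163, 183, 201, 210, 225, 238]),
    "atom_embedding_size": tune.randint(1, 100),
    "num_graph_conv_layers": tune.randint(1, 40),
    "fc_feat_size": tune.randint(1, 150),
    "num_fc_layers": tune.randint(1, 40),
}
# passed to tune.run(..., config=config)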

Tune.run() Flags for Performance

  • checkpoint_at_end=False Default is False and I would leave it that way regardless of other checkpointing settings. True should never be used with SHA-based strategies

  • sync_on_checkpoint=False This can improve performance but maybe only marginally - it depends on how frequently you are checkpointing

  • fail_fast=True I like this flag because it kills a trial immediately after it fails; otherwise, the trial can go through all training iterations, failing each iteration

  • reuse_actors=True This flag can improve performance on both ASHA and PBT, but it requires you to add a reset_config function to your trainable class. In part, this flag can save resources by not reloading your dataset every time an old trial is terminated and a new trial begins (a combined sketch of these flags follows this list)
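Putting these flags together, a tune.run() call might look like the sketch below, with a reset_config method added to a dummy trainable so reuse_actors=True can recycle actors. All names and values are illustrative, not taken from my actual runs.

from ray import tune

class MyTrainable(tune.Trainable):
    def _setup(self, config):
        self.lr = config["lr"]        # a real trainable would also load the dataset here

    def _train(self):
        return {"validation_mae": self.lr}      # dummy metric

    def reset_config(self, new_config):
        # swap in new hyperparameters without reloading data or tearing down the actor
        self.lr = new_config["lr"]
        self.config = new_config
        return True                             # tell Tune the in-place reset succeeded

analysis = tune.run(
    MyTrainable,
    config={"lr": tune.uniform(1e-3, 1e-1)},
    num_samples=4,
    stop={"training_iteration": 2},
    checkpoint_at_end=False,
    sync_on_checkpoint=False,
    fail_fast=True,
    reuse_actors=True,
)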

Dragonfly — BO

I like Dragonfly for Bayesian Optimization because of its ability to work with both discrete and continuous variables. Many BO packages only work with continuous variables and you have to hack your way around that issue. Nevertheless, I did find it a bit tricky to actually define the hyperparameter space. Below is the code snippet I used to set up Dragonfly BO for use with Ray Tune.


param_list = [{"name": "atom_embedding_size",
                "type": "int",
                "min": 1,
                "max": 100},
              {"name": "num_graph_conv_layers",
                "type": "int",
                "min": 1,
                "max": 40},
              {"name": "fc_feat_size",
                "type": "int",
                "min": 1,
                "max": 150},
              {"name": "num_fc_layers",
                "type": "int",
                "min": 1,
                "max": 40},
              {"name": "lr",
                "type": "float",
                "min": 0.001,
                "max": 0.1},
              {"name": "batch_size",
               "type": "discrete_numeric",
               "items":"40-61-79-102-120-141-163-183-201-210-225-238"}]


param_dict = {"name": "BO_CGCNN", "domain": param_list}
domain_config = load_config(param_dict)
domain, domain_orderings = domain_config.domain, domain_config.domain_orderings


# define the hpo search algorithm BO
func_caller = CPFunctionCaller(None, domain, domain_orderings=domain_orderings)
optimizer = CPGPBandit(func_caller, ask_tell_mode=True)
bo_search_alg = DragonflySearch(optimizer, metric="validation_mae", mode="min")
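For completeness, here is a sketch of how the resulting search algorithm can be handed to Tune alongside the ASHA scheduler; the trainable, trial count, and resources are placeholders.

from ray import tune
from ray.tune.schedulers import ASHAScheduler

analysis = tune.run(
    MyTrainable,                      # e.g. the class-based trainable sketched earlier
    search_alg=bo_search_alg,         # Dragonfly suggests new hyperparameter configurations
    scheduler=ASHAScheduler(metric="validation_mae", mode="min"),
    num_samples=200,
    resources_per_trial={"gpu": 1},
)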

Slurm — Job Management

For those using Slurm, as we do at NERSC, here are the scripts that enable the use of Ray Tune. The start-head.sh and start-worker.sh files can be copied directly; only the submit script requires minor modifications to execute your code on the resource and in the environment of choice. If you run into an issue where worker nodes are not starting and you see an error like ValueError: The argument None must be a bytes object, extend the sleep time after starting the head node, found on this line. This is not a bug; the head node needs to set a variable, and sometimes it takes a while.
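On the Python side, the only change relative to a single-node run is attaching to the cluster that those scripts started before calling tune.run(). A minimal sketch, assuming the head node and workers are already up:

import ray
from ray import tune

ray.init(address="auto")    # connect to the existing Ray cluster started by the Slurm scripts
# ... then call tune.run(...) exactly as before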

TensorBoard — Logging/Visualization

Ray Tune logs to TensorBoard (TB) by default. A couple of thoughts about HPO with TB and Ray Tune:

  • TB allows you to easily filter your results, which is important when you run 1000s of trials using ASHA
  • Good visualizations with the HParams Dashboard
  • TB works great with SHA-based strategies in Ray Tune; my only complaint is the integration with PBT is not as good

For NERSC users, here is how I usually run TB. One downside is that you can only have one TB client open at a time.

Weights and Biases — Logging/Visualization

W&B has a logger that integrates with Ray Tune and I used it with the model I was testing. Clearly a lot of potential exists and in general I like the W&B platform, but at the time (March/April 2020) I had difficulties logging large-scale HPO campaigns with W&B. I believe some updates/upgrades are in progress.

W&B的记录器与Ray Tune集成在一起,我将其与正在测试的模型一起使用。 显然存在很多潜力,总的来说,我喜欢W&B平台,但是当时(2020年3月/ 2020年),我很难用W&B记录大规模的HPO活动。 我相信一些更新/升级正在进行中。

Key Takeaways

Findings

  1. The ASHA scheduler improved the time-to-solution for my model by at least 10x compared to random search alone
  2. BO may not always improve average HPO performance, but I was able to find my single best configuration of hyperparameters with ASHA/BO
  3. Using PBT, I found my optimal learning rate and it can be reasonably modeled with a multistep scheduler

Conclusions

  1. Ray Tune is a simple and scalable HPO framework
  2. Using a scheduler to improve HPO efficiency is essential
  3. More sophisticated search algorithms such as BO likely provide some benefit but are not always worth the investment
  4. PBT is great for developing ideal schedulers and for training if the model does not need to be retrained frequently
  5. There is no one-size-fits-all solution to HPO. Start simple and build complexity as needed — ASHA/RS is a reasonable default strategy

Acknowledgements: I want to thank Zachary Ulissi (CMU), Mustafa Mustafa (NERSC), and Richard Liaw (Ray Tune) for making this work possible.


Originally published at https://wood-b.github.io on August 31, 2020.


Towards Data Science version: https://towardsdatascience.com/a-novices-guide-to-hyperparameter-optimization-at-scale-bfb4e5047150
