

Bowen Baker, Otkrist Gupta, Ramesh Raskar, Nikhil Naik


Methods for neural network hyperparameter optimization and meta-modeling are computationally expensive because they must train a large number of model configurations. In this paper, we show that standard frequentist regression models can predict the final performance of partially trained model configurations using features based on network architectures, hyperparameters, and time-series validation performance data. We empirically show that our performance prediction models are much more effective than prominent Bayesian counterparts, are simpler to implement, and are faster to train. Our models can predict final performance in both visual classification and language modeling domains, are effective for predicting the performance of drastically varying model architectures, and can even generalize between model classes. Using these prediction models, we also propose an early stopping method that obtains a speedup of up to 6x in both hyperparameter optimization and meta-modeling. Finally, we empirically show that our early stopping method can be seamlessly incorporated into both reinforcement learning-based architecture selection algorithms and bandit-based search methods. Through extensive experimentation, we show that our performance prediction models and early stopping algorithm are state-of-the-art in terms of prediction accuracy and speedup achieved, while still identifying the optimal model configurations.



At present, significant human expertise and labor are required for designing high-performing neural network architectures and successfully training them for different applications. Ongoing research in two areas, meta-modeling and hyperparameter optimization, attempts to reduce the amount of human intervention required for these tasks. Hyperparameter optimization methods (e.g., Hutter et al. (2011); Snoek et al. (2015); Li et al. (2017)) focus primarily on obtaining good optimization hyperparameter configurations for training human-designed networks, whereas meta-modeling algorithms (Bergstra et al., 2013; Verbancsics & Harguess, 2013; Baker et al., 2017; Zoph & Le, 2017) aim to design neural network architectures from scratch. Both sets of algorithms must train a large number of neural network configurations to identify the right set of hyperparameters or the right network architecture, and are hence computationally expensive.


When sampling many different model configurations, it is likely that many subpar configurations will be explored. Human experts are quite adept at recognizing and terminating suboptimal model configurations by inspecting their partial learning curves. In this paper we seek to emulate this behavior and automatically identify and terminate subpar model configurations in order to speed up both meta-modeling and hyperparameter optimization methods. Our method parameterizes learning curve trajectories with simple features derived from model architectures, training hyperparameters, and early time-series measurements from the learning curve. We use these features to train a set of frequentist regression models that predict the final validation accuracy of partially trained neural network configurations using a small training set of fully trained curves from both image classification and language modeling domains. We use these predictions and uncertainty estimates obtained from small model ensembles to construct a simple early stopping algorithm that can speed up both meta-modeling and hyperparameter optimization methods.



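Example learning curves from the experiments considered in this paper. Note the diversity in convergence times and overall learning curve shape.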

While there is some prior work on neural network performance prediction using Bayesian methods (Domhan et al., 2015; Klein et al., 2017), our proposed method is significantly more accurate, accessible, and efficient. We hope that our work leads to inclusion of neural network performance prediction and early stopping in the practical neural network training pipeline.



Neural Network Performance Prediction: There has been limited work on predicting neural network performance during the training process. Domhan et al. (2015) introduce a weighted probabilistic model for learning curves and utilize this model for speeding up hyperparameter search in small convolutional neural networks (CNNs) and fully-connected networks (FCNs). Building on Domhan et al. (2015), Klein et al. (2017) train Bayesian neural networks for predicting unobserved learning curves using a training set of fully and partially observed learning curves. Both methods rely on expensive Markov chain Monte Carlo (MCMC) sampling procedures and handcrafted learning curve basis functions. We also note that Swersky et al. (2014) develop a Gaussian Process kernel for predicting individual learning curves, which they use to automatically stop and restart configurations.


Meta-modeling: We define meta-modeling as an algorithmic approach for designing neural network architectures from scratch. The earliest meta-modeling approaches were based on genetic algorithms (Schaffer et al., 1992; Stanley & Miikkulainen, 2002; Verbancsics & Harguess, 2013) or Bayesian optimization (Bergstra et al., 2013; Shahriari et al., 2016). More recently, reinforcement learning methods have become popular. Baker et al. (2017) use Q-learning to design competitive CNNs for image classification. Zoph & Le (2017) use policy gradients to design state-of-the-art CNNs and recurrent cell architectures. Several further methods for architecture search (Cortes et al., 2017; Negrinho & Gordon, 2017; Zoph et al., 2017; Brock et al., 2017; Suganuma et al., 2017) have been proposed this year, since the publication of Baker et al. (2017) and Zoph & Le (2017).


Hyperparameter Optimization: We define hyperparameter optimization as an algorithmic approach for finding optimal values of design-independent hyperparameters such as learning rate and batch size, along with a limited search through the network design space. Bayesian hyperparameter optimization methods include those based on sequential model-based optimization (SMAC) (Hutter et al., 2011), Gaussian processes (GP) (Snoek et al., 2012), TPE (Bergstra et al., 2013), and neural networks (Snoek et al., 2015). However, random search or grid search is most commonly used in practical settings (Bergstra & Bengio, 2012). Recently, Li et al. (2017) introduced Hyperband, an efficient random search technique based on multi-armed bandits that outperforms state-of-the-art Bayesian optimization methods.



We first describe our model for neural network performance prediction, followed by a description of the datasets used to evaluate our model, and finally present experimental results.








We experiment with small and very deep CNNs (e.g., ResNet, Cuda-Convnet) trained on image classification datasets and with LSTMs trained on Penn Treebank (PTB), a language modeling dataset. Figure 1 shows example learning curves from three of the datasets considered in our experiments. We provide a brief summary of the datasets below. Please see Appendix Section A for further details on the search space, preprocessing, hyperparameters, and training settings of all datasets.




Choice of Regression Method: We now describe our results for predicting final neural network performance. For all experiments, we train our sequential regression models (SRMs) on 100 randomly sampled neural network configurations. We obtain the best performing method using random hyperparameter search over 3-fold cross-validation. We then compute the regression performance over the remainder of the dataset using the coefficient of determination $R^2$. We repeat each experiment 10 times and report the results with standard errors. We experiment with a few different frequentist regression models, including ordinary least squares (OLS), random forests, and ν-support vector machine regression (ν-SVR). As seen in Table 1, ν-SVR with linear or RBF kernels performs the best on most datasets, though not by a large margin. For the rest of this paper, we use ν-SVR (RBF) unless otherwise specified.


Ablation Study on Feature Sets: In Table 2, we compare the predictive ability of different feature sets, training SVR (RBF) with time-series (TS) features obtained from 25% of the learning curve, along with architecture parameter (AP) and hyperparameter (HP) features. TS features explain the largest fraction of the variance in all cases. For datasets with varying architectures, AP are more important than HP; for hyperparameter search datasets, HP are more important than AP, as expected. AP features almost match TS on the ResNet (TinyImageNet) dataset, indicating that the choice of architecture has a large influence on accuracy for ResNets. Figure 2 shows the true vs. predicted performance for all test points in three datasets, trained with TS, AP, and HP features.
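To make the three feature groups concrete, the following is a minimal sketch of how one feature vector might be assembled from a partial learning curve. The function names, the dictionary keys, the toy numbers, and the use of first-order differences as extra TS features are all illustrative assumptions, not details given in the text above.

```python
from typing import Dict, List, Sequence


def time_series_features(curve: Sequence[float], fraction: float = 0.25) -> List[float]:
    """Observed prefix of the validation curve plus its first-order differences."""
    n_obs = max(2, int(len(curve) * fraction))  # keep at least two points
    prefix = list(curve[:n_obs])
    # First-order differences are an illustrative extra (an assumption).
    diffs = [b - a for a, b in zip(prefix, prefix[1:])]
    return prefix + diffs


def build_features(
    curve: Sequence[float],
    arch_params: Dict[str, float],
    hyperparams: Dict[str, float],
    fraction: float = 0.25,
) -> List[float]:
    """Concatenate TS, AP, and HP features in a fixed (sorted-key) order."""
    ts = time_series_features(curve, fraction)
    ap = [arch_params[k] for k in sorted(arch_params)]
    hp = [hyperparams[k] for k in sorted(hyperparams)]
    return ts + ap + hp


# Hypothetical configuration with an 8-epoch validation curve.
curve = [0.10, 0.25, 0.40, 0.48, 0.55, 0.60, 0.63, 0.65]
x = build_features(
    curve,
    arch_params={"num_layers": 8, "num_weights": 1.2e6},
    hyperparams={"learning_rate": 0.01},
)
```

Sorting the keys keeps the column order identical across configurations, which a regression model needs; dropping `ap` or `hp` from the concatenation would mimic the ablation described above.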



Generalization Between Depths: We also test whether SRMs can accurately predict the performance of out-of-distribution neural networks. In particular, we train SVR (RBF) with 25% of TS, along with AP and HP features, on the ResNets (TinyImageNet) dataset, using 100 models with fewer than $d$ layers, and test on models with more than $d$ layers, averaging over 10 runs. The value of $d$ varies from 14 to 110. For $d = 32$, $R^2$ is $80.66 \pm 3.8$. For $d = 62$, $R^2$ is $84.58 \pm 2.7$.
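The depth-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the `num_layers` key, the function name, and the toy list of configurations are hypothetical, and no model is trained here.

```python
import random
from typing import Dict, List, Tuple

Config = Dict[str, int]


def split_by_depth(
    configs: List[Config], d: int, n_train: int = 100, seed: int = 0
) -> Tuple[List[Config], List[Config]]:
    """Train on up to n_train models shallower than d, test on all deeper ones."""
    rng = random.Random(seed)
    shallow = [c for c in configs if c["num_layers"] < d]
    deep = [c for c in configs if c["num_layers"] > d]
    train = rng.sample(shallow, min(n_train, len(shallow)))
    return train, deep


# Toy population with depths 14, 20, ..., 110 (placeholder values).
configs = [{"num_layers": n} for n in range(14, 111, 6)]
train, test = split_by_depth(configs, d=32)
```

Models with exactly `d` layers fall into neither set, matching the strict "less than" and "greater than" wording of the paragraph.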


We now compare the neural network performance prediction ability of SRMs with three existing learning curve prediction methods: (1) Bayesian Neural Network (BNN) (Klein et al., 2017), (2) the learning curve extrapolation (LCE) method (Domhan et al., 2015), and (3) the last seen value (LastSeenValue) heuristic (Li et al., 2017). When training the BNN, we present it not only with the subset of fully observed learning curves but also with all other partially observed learning curves from the training set. While we do not present the partially observed curves to the ν-SVR SRM for training, we felt this was a fair comparison, as ν-SVR uses the entire partially observed learning curve during inference. Methods (2) and (3) do not incorporate prior learning curves during training. Figure 3 shows the $R^2$ obtained by each method for predicting the final performance versus the percent of the learning curve used for training the model. We see that in all neural network configuration spaces and across all datasets, either one or both SRMs outperform the competing methods. The LastSeenValue heuristic only becomes viable when the configurations are near convergence, and its performance is worse than that of an SRM for very deep models. We also find that the SRMs outperform the LCE method in all experiments, even after we remove a few extreme prediction outliers produced by LCE. Finally, while BNN outperforms the LastSeenValue and LCE methods when only a few iterations have been observed, it does worse than our proposed method. In summary, we show that our simple, frequentist SRMs outperform existing Bayesian approaches at predicting neural network performance on modern, very deep models in computer vision and language modeling tasks.
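As a small illustration of the evaluation metric and the simplest baseline, the sketch below computes the coefficient of determination and applies a LastSeenValue-style prediction to toy curves. The curves are made-up numbers for illustration only, and the function names are assumptions rather than anything from the compared implementations.

```python
from typing import List, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # must be non-zero
    return 1.0 - ss_res / ss_tot


def last_seen_value(curve: Sequence[float], fraction: float) -> float:
    """Predict final performance as the last value observed so far."""
    n_obs = max(1, int(len(curve) * fraction))
    return curve[n_obs - 1]


# Toy validation curves (illustrative numbers, not experimental data).
curves: List[List[float]] = [
    [0.20, 0.40, 0.50, 0.60],
    [0.10, 0.30, 0.60, 0.80],
    [0.30, 0.35, 0.38, 0.40],
]
final = [c[-1] for c in curves]

r2_half = r_squared(final, [last_seen_value(c, 0.5) for c in curves])
r2_full = r_squared(final, [last_seen_value(c, 1.0) for c in curves])
```

On these toy curves the heuristic yields a negative score at 50% of the curve and a perfect score once the whole curve is observed, which mirrors the observation that it only becomes viable near convergence.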


Since most of our experiments perform stepwise learning rate decay, it is conceivable that the performance gap between SRMs and both LCE and BNN results from the lack of a sharp jump in their basis functions. We therefore experimented with exponential learning rate decay (ELRD), which the basis functions in LCE are designed for. We trained 630 random nets with ELRD, drawn from the 1000 MetaQNN-CIFAR10 nets. Predicting from 25% of the learning curve, the $R^2$ is 0.95 for ν-SVR (RBF), 0.48 for LCE (with extreme outlier removal; negative without), and 0.31 for BNN. This comparison illuminates another benefit of our method: we do not require handcrafted basis functions to model new learning curve types.


Training and Inference Speed Comparison: Another advantage of our regression approach is speed. SRMs are much faster to train and to run inference with than the previously proposed Bayesian methods (Domhan et al., 2015; Klein et al., 2017). On 1 core of an Intel 6700k CPU, a ν-SVR (RBF) with 100 training points trains in 0.006 seconds, and each inference takes 0.00006 seconds. In comparison, on the same hardware each inference takes 60 seconds with the LCE code and 0.024 seconds with the BNN code.



To speed up hyperparameter optimization and meta-modeling methods, we develop an algorithm to determine whether to continue training a partially trained model configuration using our sequential regression models. If we would like to sample $N$ total neural network configurations, we begin by sampling and training $n \ll N$ configurations to create a training set $S$. We then train a model $f(x_f)$ to predict $y_T$. Now, given the current best performance observed $y_{\text{BEST}}$, we would like to terminate training a new configuration $x'$, given its partial learning curve $y'(t)$ observed from epoch 1 onward, if $f(x'_f) = \hat{y}_T \le y_{\text{BEST}}$, so as to not waste computational resources exploring a suboptimal configuration.
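One way to turn this comparison into a probabilistic termination rule is sketched below. It assumes that a small ensemble of regressors supplies several predictions of the final accuracy and that these are treated as samples from a Gaussian; the Gaussian treatment, the function names, the toy predictions, and the default threshold are assumptions made for illustration.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def prob_below_best(ensemble_preds: Sequence[float], y_best: float) -> float:
    """Estimate P(final accuracy <= y_best) under a Gaussian assumption.

    Requires at least two ensemble predictions to estimate the spread.
    """
    mu = mean(ensemble_preds)
    sigma = stdev(ensemble_preds)
    if sigma == 0.0:  # degenerate ensemble: fall back to a hard comparison
        return 1.0 if mu <= y_best else 0.0
    z = (y_best - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def should_terminate(
    ensemble_preds: Sequence[float], y_best: float, threshold: float = 0.99
) -> bool:
    """Stop training when the model is very unlikely to beat the current best."""
    return prob_below_best(ensemble_preds, y_best) >= threshold


# Hypothetical ensemble outputs for two partially trained configurations.
weak = [0.70, 0.71, 0.69, 0.70, 0.72]
borderline = [0.88, 0.91, 0.90, 0.89, 0.92]
stop_weak = should_terminate(weak, y_best=0.90)
stop_borderline = should_terminate(borderline, y_best=0.90)
```

A high threshold makes the rule conservative: the borderline configuration, whose predicted mean roughly equals the current best, keeps training, while the clearly weaker one is stopped.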





Baker et al. (2017) train a Q-learning agent to design convolutional neural networks. In this method, the agent samples architectures from a large, finite space by traversing a path from input layer to termination layer. However, the MetaQNN method uses 100 GPU-days to train 2700 neural architectures, and the similar experiment by Zoph & Le (2017) utilized 10,000 GPU-days to train 12,800 models on CIFAR-10. The amount of computing resources required makes these approaches prohibitively expensive for large datasets (e.g., ImageNet) and larger search spaces. The main computational expense of reinforcement learning-based meta-modeling methods is training each neural network configuration to T epochs (where T is typically a large number at which the network stabilizes to peak accuracy).

We now detail the performance of a ν-SVR (RBF) SRM in speeding up architecture search using sequential configuration selection. First, we take 1,000 random models from the MetaQNN (Baker et al., 2017) search space. We simulate the MetaQNN algorithm by taking 10 random orderings of each set and running our early stopping algorithm. We compare against the LCE early stopping algorithm (Domhan et al., 2015) as a baseline, which has a similar probability threshold termination criterion. Our SRM trains on the first 100 fully observed curves, while the LCE model trains from each individual partial curve and can begin early termination immediately. Despite this “burn in” time needed by an SRM, it is still able to significantly outperform the LCE model (Figure 4). In addition, fitting the LCE model to a learning curve takes between 1 and 3 minutes on a modern CPU due to expensive MCMC sampling, and it is necessary to fit a new LCE model each time a new point on the learning curve is observed. Therefore, on a full meta-modeling experiment involving thousands of neural network configurations, our method could be faster than LCE by several orders of magnitude, based on current implementations.
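The simulation protocol can be sketched roughly as follows: the first `burn_in` configurations are trained in full, later ones are observed for a fraction of their epochs and either stopped or trained to completion, and the speedup is the ratio of epochs a full search would use to epochs actually spent. The `should_stop` callable stands in for a trained SRM with a termination rule; the toy rule, the curves, and all names here are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def simulate_early_stopping(
    curves: Sequence[Sequence[float]],
    should_stop: Callable[[Sequence[float], float], bool],
    burn_in: int,
    observe_frac: float = 0.25,
) -> Tuple[float, float]:
    """Return (speedup, best final accuracy found) for one ordering of curves."""
    y_best = float("-inf")
    epochs_used = 0
    epochs_full = 0
    for i, curve in enumerate(curves):
        total = len(curve)
        epochs_full += total
        if i >= burn_in:  # the predictor is only available after burn-in
            tau = max(1, int(total * observe_frac))
            if should_stop(curve[:tau], y_best):
                epochs_used += tau  # terminated early
                continue
        epochs_used += total  # trained to completion
        y_best = max(y_best, curve[-1])
    return epochs_full / epochs_used, y_best


# Toy stand-in for an SRM-based rule: extrapolate by a fixed offset (assumption).
def toy_rule(partial: Sequence[float], y_best: float) -> bool:
    return partial[-1] + 0.3 < y_best


curves: List[List[float]] = [
    [0.10, 0.20, 0.30, 0.40, 0.45, 0.50, 0.50, 0.50],
    [0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.80],
    [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.40],
    [0.40, 0.60, 0.70, 0.80, 0.85, 0.88, 0.90, 0.90],
    [0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.60, 0.60],
]
speedup, best = simulate_early_stopping(curves, toy_rule, burn_in=2)
```

Repeating this over several random orderings of the same curves and averaging gives a speedup estimate of the kind reported above, and comparing `best` against the true maximum checks whether the top model was recovered.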


We furthermore simulate early stopping for ResNets trained on CIFAR-10. We found that only a probability threshold of 0.99 consistently recovered the top model. However, even with such a conservative threshold, the search was sped up by a factor of 3.4 over the baseline. While we do not have the computational resources to run the full experiment from Zoph & Le (2017), our method could provide similar gains in large-scale architecture searches.


It is not enough, however, to simply simulate the speedup, because meta-modeling algorithms typically use the observed performance to update an acquisition function that informs future sampling. In the reinforcement learning setting, the performance is given to the agent as a reward, so we also empirically verify that substituting $\hat{y}_T$ for $y_T$ does not cause the MetaQNN agent to converge to a subpar policy. Replicating the MetaQNN experiment on CIFAR-10 (see Figure 5), we find that integrating early stopping with the Q-learning procedure does not disrupt learning and results in a speedup of 3.8x with a probability threshold of 0.99. The speedup is relatively low because this threshold is conservative. After training the top models to 300 epochs, we also find that the resulting performance (just under 93%) is on par with the original results of Baker et al. (2017).


Recently, Li et al. (2017) introduced Hyperband, a random search technique based on multi-armed bandits that obtains state-of-the-art performance in hyperparameter optimization in a variety of settings. The Hyperband algorithm trains a population of models with different hyperparameter configurations and iteratively discards models below a certain percentile in performance among the population until the computational budget is exhausted or satisfactory results are obtained.
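The discard step described above can be sketched as a single successive-halving bracket. This is a simplified illustration rather than the full Hyperband algorithm, which runs several such brackets with different budgets; the `evaluate` function, the toy accuracy model, and the parameter defaults are assumptions.

```python
import math
from typing import Callable, List, Sequence


def successive_halving(
    configs: Sequence[float],
    evaluate: Callable[[float, int], float],
    min_epochs: int = 1,
    eta: int = 3,
    rounds: int = 3,
) -> List[float]:
    """Keep the top 1/eta of configurations each round, growing the budget by eta."""
    survivors = list(configs)
    epochs = min_epochs
    for _ in range(rounds):
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        if len(survivors) == 1:
            break
        epochs *= eta  # survivors earn a larger training budget
    return survivors


# Toy accuracy model (assumption): best at learning rate 1e-2, improves with epochs.
def toy_accuracy(learning_rate: float, epochs: int) -> float:
    closeness = 1.0 - abs(math.log10(learning_rate) + 2.0) / 4.0
    return closeness * (1.0 - 0.5 ** epochs)


learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 0.3, 1.0]
winners = successive_halving(learning_rates, toy_accuracy)
```

In this sketch every configuration in a round is trained for the full round budget before ranking; that per-round training is the cost an early stopping predictor would aim to cut.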



We present a Fast Hyperband (f-Hyperband) algorithm based on our early stopping scheme. During each iteration of successive halving, Hyperband trains $n_i$ configurations to $r_i$ epochs. In f-Hyperband, we train an SRM to predict $y_{r_i}$ and perform early stopping within each iteration of successive halving. We initialize f-Hyperband in exactly the same way as vanilla Hyperband, except that once we have trained 100 models to $r_i$ iterations, we begin early stopping for all future successive halving iterations that train to $r_i$ iterations. By doing this, we incur no initial slowdown relative to Hyperband due to a “burn-in” phase. We also introduce a parameter that denotes the proportion of the $n_i$ models in each iteration that must be trained to the full $r_i$ iterations. This is similar to setting the criterion based on the nth best model in the previous section. See Appendix Section C for an algorithmic representation of f-Hyperband.


We empirically evaluate f-Hyperband using Cuda-Convnet trained on the CIFAR-10 and SVHN datasets. Figure 6 shows that f-Hyperband evaluates the same number of unique configurations as Hyperband within half the compute time, while achieving the same final accuracy within standard error. When reinitializing hyperparameter searches, one can use a previously trained set of SRMs to achieve even larger speedups. Figure 8 in the Appendix shows that one can achieve up to a 7x speedup in such cases.



In this paper we introduce a simple, fast, and accurate model for predicting future neural network performance using features derived from network architectures, hyperparameters, and time-series performance data. We show that the performance of drastically different network architectures can be jointly learned and predicted on both image classification and language modeling tasks. Using our simple algorithm, we can speed up hyperparameter search techniques with complex acquisition functions, such as a Q-learning agent, by a factor of 3x to 6x, and Hyperband, a state-of-the-art hyperparameter search method, by a factor of 2x, without disturbing the search procedure. We outperform all competing methods for performance prediction in terms of accuracy, train and test time, and speedups obtained on hyperparameter search methods. We hope that the simplicity and success of our method will allow it to be easily incorporated into current hyperparameter optimization pipelines for deep neural networks. With the advent of large-scale automated architecture search (Baker et al., 2017; Zoph & Le, 2017), methods such as ours will be vital in exploring even larger and more complex search spaces.














