Machine Learning: Gradient-based Hyperparameter Optimization through Reversible Learning

Abstract:

  • Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.

Summary of the abstract:

  • Background: tuning the hyperparameters of learning algorithms is hard because gradients with respect to them are usually unavailable.
  • Method: the authors chain derivatives backwards through the entire training procedure to obtain exact gradients of cross-validation performance with respect to all hyperparameters; these hypergradients are computed by exactly reversing the dynamics of stochastic gradient descent with momentum.
  • Result: with these exact gradients, thousands of hyperparameters can be optimized, including step-size and momentum schedules, weight initialization distributions, rich regularization schemes, and architectural choices.

1: Introduction

  • Machine learning systems abound with hyperparameters.
  • Choosing the best hyperparameters is both crucial and frustratingly difficult.

Machine learning systems abound with hyperparameters, and choosing the best ones is both crucial and extremely difficult.

The current gold standard for hyperparameter selection is gradient-free model-based optimization (Snoek et al., 2012; Bergstra et al., 2011; 2013; Hutter et al., 2011). However, in general they are not able to effectively optimize more than 10 to 20 hyperparameters.

The current gold standard for hyperparameter selection is gradient-free, model-based optimization (work from 2011-2013), but in general these methods cannot effectively optimize more than 10 to 20 hyperparameters.

Why not use gradients? The problem with taking gradients with respect to hyperparameters is that computing the validation loss requires an inner loop of elementary optimization, which makes naïve reverse-mode differentiation infeasible from a memory perspective. Section 2 describes this problem and proposes a solution, which is the main technical contribution of this paper.

If we tried to take gradients with respect to hyperparameters, computing the validation loss would require an inner loop of elementary optimization, which makes naïve reverse-mode differentiation infeasible from a memory standpoint. Section 2 describes this problem and the proposed solution.
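
To see why memory is the bottleneck, here is a rough back-of-envelope estimate (a sketch with illustrative sizes of my own choosing, not numbers from the paper): naïve reverse-mode differentiation must store every intermediate weight vector visited during training.

```python
# Back-of-envelope memory cost of naively storing the training trajectory.
# The parameter and iteration counts below are illustrative assumptions.
num_params = 1_000_000        # hypothetical network size
num_iters = 10_000            # hypothetical number of SGD iterations
bytes_per_value = 4           # float32

trajectory_gb = num_params * num_iters * bytes_per_value / 1e9
print(f"weights alone: {trajectory_gb:.0f} GB")   # -> 40 GB, before velocities
```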

Gaining access to gradients with respect to hyperparameters opens up a garden of delights. Instead of straining to eliminate hyperparameters from our models, we can embrace them, and richly hyperparameterize our models. Just as having a high-dimensional elementary parameterization gives a flexible model, having a high-dimensional hyper-parameterization gives flexibility over model classes, regularization, and training methods. Section 3 explores these new opportunities.

Having access to hypergradients opens up a new way of thinking: instead of straining to eliminate hyperparameters from our models, we can embrace them and hyperparameterize our models richly. Just as a high-dimensional elementary parameterization gives a flexible model, a high-dimensional hyperparameterization gives flexibility over model classes, regularization, and training methods. Section 3 explores these new opportunities.

2: Hypergradients

  • Reverse-mode differentiation (RMD) has been an asset to the field of machine learning. The RMD method, known as “back-propagation” in the deep learning community, allows the gradient of a scalar loss with respect to its parameters to be computed in a single backward pass.
  • Applying RMD to hyperparameter optimization was proposed by Bengio (2000) and Baydin & Pearlmutter (2014), and applied to small problems by Domke (2012). However, the naïve approach fails for real-sized problems because of memory constraints.

RMD (reverse-mode differentiation) is the idea behind back-propagation (BP): it lets the gradient of a scalar loss with respect to all of its parameters be computed in a single backward pass.
Applying RMD to hyperparameter optimization was proposed by Bengio (2000) and Baydin & Pearlmutter (2014), and Domke (2012) applied it to small problems, but the naïve approach fails on real-sized problems because it consumes too much memory.
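
For reference, a minimal hand-rolled reverse-mode pass (my own toy example, not the paper's code): the forward pass stores its intermediates, and a single backward sweep then produces the gradient of the scalar loss. Those stored intermediates are exactly the memory cost, which is what explodes when the "forward pass" is an entire training run.

```python
# Tiny reverse-mode differentiation of L = 0.5 * (tanh(w * x) - y)^2.
import math

def forward(w, x, y):
    a = w * x                          # intermediate, saved for the backward pass
    h = math.tanh(a)                   # intermediate, saved for the backward pass
    loss = 0.5 * (h - y) ** 2
    return loss, (a, h)

def backward(x, y, saved):
    a, h = saved                       # one backward sweep, chain rule at each node
    dloss_dh = h - y
    dloss_da = dloss_dh * (1.0 - math.tanh(a) ** 2)
    return dloss_da * x                # dloss/dw

loss, saved = forward(0.3, 2.0, 1.0)
print(loss, backward(2.0, 1.0, saved))
```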

  • Imagine that we could exactly trace a training procedure backwards, starting from the trained parameter values and working back to the initial parameters. Then we could recompute the learning trajectory on the fly during the reverse pass of RMD rather than storing it in memory. This is not possible in general, but we will show that for the popular training procedure of stochastic gradient descent with momentum, we can do exactly this, storing a small number of auxiliary bits to handle finite precision arithmetic.

Suppose we could exactly trace a training procedure backwards, starting from the trained parameter values and working back to the initial parameters. Then, during the reverse pass of RMD, we could recompute the learning trajectory on the fly rather than storing it in memory. This is not possible in general, but for the popular training procedure of stochastic gradient descent with momentum it can be done exactly, storing only a small number of auxiliary bits to handle finite-precision arithmetic.

2.1: Reversible learning with exact arithmetic

Stochastic gradient descent (SGD) with momentum (Algorithm 1) can be seen as a physical simulation of a system moving through a series of fixed force fields indexed by time t. With exact arithmetic this procedure is reversible. This lets us write Algorithm 2, which reverses the steps in Algorithm 1, interleaved with computations of gradients. It outputs the gradient of a function of the trained weights f(w) (such as the validation loss) with respect to the initial weights w1, the learning-rate and momentum schedules, and any other hyperparameters which affect training gradients.

[Algorithm 1 and Algorithm 2 from the paper: SGD with momentum, and its exact reversal interleaved with hypergradient computations.]

Stochastic gradient descent with momentum (Algorithm 1) can be viewed as a physical simulation of a system moving through a series of fixed force fields indexed by time t. With exact arithmetic this procedure is reversible, which lets us write Algorithm 2: it reverses the steps of Algorithm 1, interleaved with gradient computations, and outputs the gradient of a function of the trained weights f(w) (such as the validation loss) with respect to the initial weights w1, the learning-rate and momentum schedules, and any other hyperparameters that affect the training gradients. The time complexity of reverse SGD is O(T), the same as forward SGD.
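
A minimal sketch of the reversibility claim, using Python's exact rational arithmetic (`fractions.Fraction`) and the momentum form v ← γv − (1 − γ)∇L(w), w ← w + αv on a toy quadratic loss. This shows only the forward/reverse dynamics; the hypergradient accumulation that Algorithm 2 interleaves with the reverse pass is omitted, and the exact form of the update is my assumption about the algorithm's details.

```python
# Reversing SGD with momentum under exact arithmetic: the final (w, v) alone
# determines the whole trajectory, so nothing needs to be stored.
from fractions import Fraction as F

def grad(w):                            # toy training loss L(w) = (w - 3)^2 / 2
    return w - F(3)

gamma, alpha, T = F(9, 10), F(1, 10), 50

def train(w, v):                        # forward pass (Algorithm-1 style)
    for _ in range(T):
        v = gamma * v - (1 - gamma) * grad(w)
        w = w + alpha * v
    return w, v

def untrain(w, v):                      # run the same dynamics backwards
    for _ in range(T):
        w = w - alpha * v
        v = (v + (1 - gamma) * grad(w)) / gamma
    return w, v

wT, vT = train(F(0), F(0))
print(untrain(wT, vT) == (F(0), F(0)))  # True: the initial state is recovered exactly
```

With floating point in place of `Fraction`, the same `untrain` loop drifts away from the initial state, which is exactly the problem Section 2.2 turns to.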

2.2: Reversible learning with finite precision arithmetic

Problem:

  • In practice, Algorithm 2 fails utterly due to finite numerical precision. The problem is the momentum decay term γ. Every time we apply step 8 to reduce the velocity, we lose information. Assuming we are using a fixed-point representation, each multiplication by γ < 1 shifts bits to the right, destroying the least significant bits. This is more than a pedantic concern. Attempting to carry out the reverse training requires repeated multiplication by 1/γ. Errors accumulate exponentially, and the reversed learning procedure ends far from the initial point (and usually overflows).
  • Do we need γ < 1? Unfortunately we do. γ > 1 results in unstable dynamics, and γ = 1 recovers the leapfrog integrator (Hut et al., 1995), a perfectly reversible set of dynamics, but one that does not converge.

In practice, Algorithm 2 fails completely because of finite numerical precision. The culprit is the momentum decay term γ: every time step 8 shrinks the velocity, information is lost. The three cases (a small fixed-point sketch follows this list):
γ < 1: each multiplication shifts bits to the right and destroys the least significant bits; reversing training then requires repeated multiplication by 1/γ, so errors accumulate exponentially and the reversed trajectory ends far from the initial point (and usually overflows).
γ > 1: the dynamics become unstable.
γ = 1: recovers the leapfrog integrator, a perfectly reversible set of dynamics, but one that does not converge.
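
The fixed-point sketch promised above (illustrative, not the paper's code): the decay discards the remainder of an integer division, and the attempted inverse multiplies the resulting error by 1/γ at every step.

```python
# Why naive reversal fails in fixed point: v <- floor(0.9 * v) loses the
# remainder, and each reverse step amplifies the resulting error by 10/9.
import math

SCALE = 2 ** 16                               # 16 fractional bits

def step(v, g):                               # forward: v <- floor(0.9*v) - g
    return (v * 9) // 10 - g

def unstep(v, g):                             # attempted inverse of step()
    return ((v + g) * 10) // 9                # the discarded bits are gone

grads = [int(math.sin(t) * SCALE) for t in range(200)]   # toy gradient sequence

v0 = 3 * SCALE
v = v0
for g in grads:                               # forward updates of the velocity
    v = step(v, g)
for g in reversed(grads):                     # attempted reverse pass
    v = unstep(v, g)

print(v0, v)   # the "recovered" initial velocity is wildly wrong
```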

If we want to reverse the entire dynamics, we have no choice but to store the extra bits that the multiplication by γ discards. But we can at least try to minimize the number of extra bits we store; that is what the next section addresses.

2.3: Optimal storage of discarded entropy

If γ = 0.5, we can simply store the single bit that falls off at each iteration, and if γ = 0.25 we could store two bits. But for fine-grained control over γ we need a way to store the information lost when we multiply by, say, γ = 0.9, which will be less than one bit on average. Here we give a procedure which achieves exactly this.

If γ = 0.5 we can simply store the single bit that falls off at each iteration, and if γ = 0.25 we can store two bits. But for fine-grained control over γ we need a way to store the information lost when multiplying by, say, γ = 0.9, which is less than one bit on average. The paper gives a procedure that achieves exactly this.

[Algorithm figure from the paper: the buffer-based procedure described above.]
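
A sketch of the kind of buffer the section describes (my reconstruction of the idea for non-negative fixed-point velocities, not the paper's literal algorithm): write γ = n/d. Before the lossy division by d, push the remainder into one big-integer buffer; after multiplying by n, the low-order base-n digit of v is guaranteed to be zero, so a digit can be popped back out of the buffer into it. Net buffer growth is log2(d) − log2(n) = log2(1/γ) bits per step, and every step is exactly invertible.

```python
# Exactly reversible v <- v * n // d for a non-negative fixed-point integer v,
# with the discarded information kept in a single big-integer buffer.
# (The gradient term of the velocity update is exact integer arithmetic and
# needs no buffer, so only the decay by gamma = n/d is handled here.)
import math

def decay(v, buf, n, d):
    buf = buf * d + (v % d)     # push the base-d digit that // would destroy
    v = (v // d) * n            # lossless now; v is a multiple of n
    v += buf % n                # pop a base-n digit from the buffer into v's free slot
    buf //= n
    return v, buf

def undecay(v, buf, n, d):      # exact inverse of decay()
    buf = buf * n + (v % n)
    v = (v // n) * d
    v += buf % d
    buf //= d
    return v, buf

v0, buf0 = 123_456_789, 0
v, buf = v0, buf0
for _ in range(1000):           # decay by gamma = 9/10, one thousand times
    v, buf = decay(v, buf, 9, 10)
bits = buf.bit_length()
print(f"buffer: {bits} bits, ~{bits / 1000:.3f} per step "
      f"(log2(10/9) = {math.log2(10 / 9):.3f})")
for _ in range(1000):           # and run it all backwards
    v, buf = undecay(v, buf, 9, 10)
print((v, buf) == (v0, buf0))   # True: nothing was lost
```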

We could also have used an arithmetic coding scheme for our information buffer (MacKay, 2003, Chapter 6). How much does this procedure save us? When γ = 0.98, we will have to store only 0.029 bits on average. Compared to storing a new 32-bit integer or floating-point number at each iteration, this reduces memory requirements by a factor of one thousand.

We could also use an arithmetic coding scheme for the information buffer (MacKay, 2003, Chapter 6). How much does this save? With γ = 0.98 we only have to store 0.029 bits per iteration on average; compared with storing a new 32-bit integer or floating-point number at each iteration, this cuts the memory requirement by roughly a factor of one thousand.
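
The 0.029 figure is just the information content of the decay, −log2(γ); a quick check using the paper's γ = 0.98:

```python
import math

gamma = 0.98
bits = -math.log2(gamma)                 # bits of information destroyed per iteration
print(f"{bits:.3f} bits per iteration")  # ~0.029
print(f"~{32 / bits:.0f}x smaller than storing a fresh 32-bit value each iteration")
```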

3: Experiments

This section shows several proof-of-concept experiments in which we can more richly parameterize training and regularization schemes in ways that would have been previously impractical to optimize.

This section presents several proof-of-concept experiments in which training and regularization schemes are parameterized far more richly, in ways that would previously have been impractical to optimize.

3.1: Gradient-based optimization of gradient-based optimization

To more directly shed light on good learning rate schedules, we jointly optimized separate learning rates for every single learning iteration of training of a deep neural network, as well as separately for weights and biases in each layer. Each meta-iteration trained a network for 100 iterations of SGD, meaning that the learning rate schedules were specified by 800 hyperparameters (100 iterations × 4 layers × 2 types of parameters). To avoid learning an optimization schedule that depended on the quirks of a particular random initialization, each evaluation of hypergradients used a different random seed. These random seeds were used both to initialize network weights and to choose mini batches. The network was trained on 10,000 examples of MNIST, and had 4 layers, of sizes 784, 50, 50, and 50.

To shed light more directly on good learning rate schedules, a separate learning rate was jointly optimized for every single training iteration of a deep neural network, and separately for the weights and biases of each layer. Each meta-iteration trained the network for 100 iterations of SGD, so the learning rate schedule is specified by 800 hyperparameters (100 iterations × 4 layers × 2 parameter types). To avoid learning a schedule that depends on the quirks of a particular random initialization, each hypergradient evaluation used a different random seed; these seeds were used both to initialize the network weights and to choose the mini-batches. The network was trained on 10,000 MNIST examples and had 4 layers, of sizes 784, 50, 50, and 50.
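
To make the hyperparameterization concrete, here is a hypothetical sketch of the outer loop (not the authors' code): the 800 schedule hyperparameters live in a (100, 4, 2) array, and `reverse_train_hypergrad` is a stand-in for the reverse-training procedure of Section 2; optimizing the schedule in log space and the meta step size are my own choices for the sketch.

```python
# Hypothetical outer loop over the 800 learning-rate hyperparameters
# (100 iterations x 4 layers x {weights, biases}).
import numpy as np

T, n_layers, n_types = 100, 4, 2
log_lr = np.zeros((T, n_layers, n_types))        # the 800 schedule hyperparameters

def reverse_train_hypergrad(log_lr, seed):
    """Stand-in for Section 2's procedure: train 100 SGD steps on MNIST with the
    given per-iteration, per-layer learning rates, then reverse training and
    return (training loss, d loss / d log_lr). Placeholder implementation."""
    rng = np.random.default_rng(seed)
    return 0.0, 1e-3 * rng.standard_normal(log_lr.shape)

meta_step = 0.05
for meta_iter in range(50):                      # hypergradient descent on the schedule
    loss, hg = reverse_train_hypergrad(log_lr, seed=meta_iter)  # fresh seed each time
    log_lr -= meta_step * hg
```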

Because learning schedules can implicitly regularize networks (Erhan et al., 2010), for example by enforcing early stopping, for this experiment we optimized the learning rate schedules on the training error rather than on the validation set error. Figure 2 shows the results of optimizing learning rate schedules separately for each layer of a deep neural network. When Bayesian optimization was used to choose a fixed learning rate for all layers and iterations, it chose a learning rate of 2.4.

Because learning schedules can implicitly regularize networks (Erhan et al., 2010), for example by enforcing early stopping, the learning rate schedules in this experiment were optimized against the training error rather than the validation error. Figure 2 shows the schedules optimized separately for each layer of the deep network. When Bayesian optimization was instead used to choose a single fixed learning rate for all layers and iterations, it chose a learning rate of 2.4.


  • Figure 2: learning rate schedules for the weights of each layer, optimized by hypergradient descent. The optimized schedule starts by taking large steps only in the topmost layer, then takes large steps in the first layer, and in the final 10 iterations takes small steps in every layer. The schedules for the biases and for the momentum (not shown) display less structure.