Abstract:
- Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
Based on the abstract:
- Background: tuning the hyperparameters of learning algorithms is hard because gradients with respect to them are usually unavailable.
- Technical approach: the authors compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure; these hypergradients are obtained by exactly reversing the dynamics of stochastic gradient descent with momentum (a sketch of differentiating through the unrolled training loop follows this list).
- What it enables: with these exact gradients, thousands of hyperparameters can be optimized, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
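A minimal sketch (not the authors' implementation) of what "chaining derivatives backwards through the entire training procedure" means: unroll a short training loop and let reverse-mode autodiff return the gradient of the validation loss with respect to the hyperparameters. The toy ridge-regression data, the hyperparameter vector (learning rate, momentum, L2 penalty), and the step count are all assumptions made for illustration.

```python
import jax
import jax.numpy as jnp

def train_loss(w, X, y, l2):
    # Training objective: mean squared error plus an L2 penalty.
    return jnp.mean((X @ w - y) ** 2) + l2 * jnp.sum(w ** 2)

def val_loss_after_training(hypers, w0, X_tr, y_tr, X_val, y_val, steps=50):
    lr, momentum, l2 = hypers[0], hypers[1], hypers[2]
    w, v = w0, jnp.zeros_like(w0)
    for _ in range(steps):                     # unrolled training loop
        g = jax.grad(train_loss)(w, X_tr, y_tr, l2)
        v = momentum * v - lr * g              # SGD-with-momentum update
        w = w + v
    return jnp.mean((X_val @ w - y_val) ** 2)  # validation performance

# Toy data and an initial hyperparameter vector (assumed for illustration).
key = jax.random.PRNGKey(0)
X_tr = jax.random.normal(key, (32, 5))
y_tr = X_tr @ jnp.ones(5)
X_val = jax.random.normal(key, (16, 5))
y_val = X_val @ jnp.ones(5)
w0 = jnp.zeros(5)
hypers = jnp.array([0.05, 0.9, 1e-3])          # learning rate, momentum, L2

# Reverse-mode gradient of the validation loss w.r.t. all hyperparameters.
hypergrad = jax.grad(val_loss_after_training)(hypers, w0, X_tr, y_tr, X_val, y_val)
print(hypergrad)
```

Note that this naïve unrolling keeps every intermediate weight vector alive for the backward pass, which is exactly the memory problem raised in the introduction; the paper sidesteps it by reversing the training dynamics instead of storing them.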
1: Introduction:
- Machine learning systems abound with hyperparameters.
- Choosing the best hyperparameters is both crucial and frustratingly difficult.
Machine learning systems contain a large number of hyperparameters, and choosing the best ones is both crucial and extremely difficult.
The current gold standard for hyperparameter selection is gradient-free model-based optimization (Snoek et al., 2012; Bergstra et al., 2011; 2013; Hutter et al., 2011). However, in general they are not able to effectively optimize more than 10 to 20 hyperparameters.
The current gold standard for hyperparameter selection is gradient-free model-based optimization, developed in work from 2011 to 2013 (Snoek et al., 2012; Bergstra et al., 2011; 2013; Hutter et al., 2011); however, these methods are generally unable to optimize more than 10 to 20 hyperparameters effectively.
Why not use gradients? The problem with taking gradients with respect to hyperparameters is that computing the validation loss requires an inner loop of elementary optimization, which makes naïve reverse-mode differentiation infeasible from a memory perspective. Section 2 describes this problem and proposes a solution, which is the main technical contribution of this paper.
If we were to take gradients with respect to hyperparameters, computing the validation loss would require an inner loop of elementary optimization; naïve reverse-mode differentiation would then have to store this entire inner training trajectory, exhausting memory and making the approach infeasible. Section 2 of the paper describes this problem and proposes a solution.
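The solution sketched in the abstract is to run the training dynamics backwards rather than store them. Below is a minimal illustration, written in JAX for consistency with the sketch above; it assumes the common momentum update v_{t+1} = gamma*v_t - lr*grad(w_t), w_{t+1} = w_t + v_{t+1}, and the toy loss, step size, momentum decay, and step count are illustrative assumptions. The paper's algorithm differs in details, notably in also storing the information lost to finite-precision arithmetic so that the reversal is exact.

```python
# Sketch: SGD with momentum is invertible, so the training trajectory can be
# reconstructed backwards instead of being stored for the backward pass.
import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum((w - 2.0) ** 2)         # toy training loss (assumed)

grad_loss = jax.grad(loss)
lr, gamma, steps = 0.1, 0.9, 20            # step size, momentum decay, #steps

def forward(w, v):
    for _ in range(steps):
        v = gamma * v - lr * grad_loss(w)  # v_{t+1} = gamma*v_t - lr*grad(w_t)
        w = w + v                          # w_{t+1} = w_t + v_{t+1}
    return w, v

def reverse(w, v):
    # Invert each step in the opposite order: recover w_t first, then v_t.
    for _ in range(steps):
        w = w - v                          # w_t = w_{t+1} - v_{t+1}
        v = (v + lr * grad_loss(w)) / gamma
    return w, v

w0, v0 = jnp.array([0.0, 5.0]), jnp.zeros(2)
wT, vT = forward(w0, v0)
w_back, v_back = reverse(wT, vT)
# Both differences are ~0 up to floating-point round-off.
print(jnp.max(jnp.abs(w_back - w0)), jnp.max(jnp.abs(v_back - v0)))
```

Because each (w_t, v_t) pair can be recomputed from (w_{t+1}, v_{t+1}), reverse-mode differentiation can rebuild the trajectory on the fly during the backward pass instead of storing it, at the cost of one extra gradient evaluation per step.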
Gaining access to gradients