Here’s a function: f(x). It’s expensive to calculate, not necessarily an analytic expression, and you don’t know its derivative.
Your task: find the global minimum.
This is, for sure, a difficult task, one more difficult than other optimization problems within machine learning. Gradient descent, for one, has access to a function’s derivatives and takes advantage of mathematical shortcuts for faster expression evaluation.
Alternatively, in some optimization scenarios the function is cheap to evaluate. If we can get hundreds of results for variants of an input x in a few seconds, a simple grid search can be employed with good results.
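For instance, if each evaluation takes only microseconds, something as blunt as the following sketch works; the objective here is a hypothetical stand-in:

```python
import numpy as np

def cheap_objective(x):
    # Hypothetical stand-in for a function that is fast to evaluate.
    return np.sin(3 * x) + 0.5 * x ** 2

# Evaluate hundreds of candidate inputs and keep the best one.
candidates = np.linspace(-3, 3, 500)
values = cheap_objective(candidates)
print(candidates[np.argmin(values)], values.min())
```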
Alternatively, a whole host of unconventional, gradient-free optimization methods can be used, such as particle swarm optimization or simulated annealing.
Unfortunately, the current task doesn’t have these luxuries. Our optimization is limited on several fronts, notably:
- It’s expensive to calculate. Ideally we would be able to query the function enough to essentially replicate it, but our optimization method must work with a limited sampling of inputs.
- The derivative is unknown. There’s a reason gradient descent and its variants remain the most popular methods for deep learning and, sometimes, for other machine learning algorithms: knowing the derivative gives the optimizer a sense of direction. We don’t have this.
- We need to find the global minimum, which is a difficult task even for a sophisticated method like gradient descent. Our model will somehow need a mechanism to avoid getting caught in local minima.
The solution: Bayesian optimization, which provides an elegant framework for problems that resemble this scenario: finding the global minimum in the smallest number of steps.
Let’s construct a hypothetical example of function c(x), or the cost of a model given some input x. Of course, what the function looks like will be hidden from the optimizer; this is the true shape of c(x). This is known in the lingo as the ‘objective function’.
Bayesian optimization approaches this task through a method known as surrogate optimization. For context, a surrogate mother is a woman who agrees to bear a child for another person; in that vein, a surrogate function is an approximation of the objective function.
The surrogate function is formed based on sampled points.
Based on the surrogate function, we can identify which points are promising minima. We decide to sample more from these promising regions and update the surrogate function accordingly.
Each iteration, we continue to look at the current surrogate function, learn more about areas of interest by sampling, and update the function. Note that the surrogate function is expressed mathematically in a way that is significantly cheaper to evaluate: for example, y = x can serve as an approximation to the more costly y = arcsin((1 - cos²x)/sin x) within a certain range.
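A quick numerical check of that example (a small sketch; on the open interval (0, π/2), the costly expression simplifies to arcsin(sin x) = x, so the cheap surrogate matches it there):

```python
import numpy as np

x = np.linspace(0.01, np.pi / 2 - 0.01, 200)
costly = np.arcsin((1 - np.cos(x) ** 2) / np.sin(x))  # simplifies to arcsin(sin x) = x on this range
cheap = x

print(np.max(np.abs(costly - cheap)))  # ~0: the cheap expression matches the costly one here
```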
After a certain number of iterations, we’re destined to arrive at the global minimum, unless the function’s shape is very bizarre (with large, wild up-and-down swings), in which case a better question than how to optimize should be asked: what’s wrong with your data?
Take a moment to marvel at the beauty of this approach. It doesn’t make any assumptions about the function (except that it is optimizable in the first place), doesn’t require information about derivatives, and is able to use common-sense reasoning through the ingenious use of a continually updated approximation function. The expensive evaluation of our original objective function is not a problem at all.
This is a surrogate-based approach towards optimization. So what makes it Bayesian, exactly?
The essence of Bayesian statistics and modelling is the updating of a prior (previous) belief in light of new information to produce an updated posterior (‘after’) belief. This is exactly what surrogate optimization in this case does, so it can be best represented through Bayesian systems, formulas, and ideas.
Let’s take a closer look at the surrogate function, which is usually represented by a Gaussian process. A Gaussian process can be thought of as a dice roll that returns entire functions fitted to the given data points (e.g. sin, log) instead of the numbers 1 to 6. The process returns several functions, each with a probability attached to it.
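As a rough illustration (a sketch assuming scikit-learn, which the article doesn’t prescribe), we can draw several plausible functions from a Gaussian process fitted to a handful of made-up points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few sampled (x, c(x)) points; the values are invented for illustration.
X = np.array([[0.0], [1.0], [2.5], [4.0]])
y = np.array([1.2, 0.3, 0.8, 2.0])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gp.fit(X, y)

# "Roll the dice": draw 5 plausible functions consistent with the data.
X_query = np.linspace(0, 4, 100).reshape(-1, 1)
samples = gp.sample_y(X_query, n_samples=5, random_state=0)
print(samples.shape)  # (100, 5): five candidate surrogate functions
```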
This article by Oscar Knagg gives good intuition on how GPs work.
There’s a good reason why Gaussian processes, and not some other curve-fitting method, are used to model the surrogate function: they are Bayesian in nature. A GP is a probability distribution, like a distribution over the end results of an event (e.g. a 1/2 chance for each side of a coin flip), but over all possible functions.
For instance, we may define the current set of data points as being 40% representable by function a(x), 10% by function b(x), etc. By representing the surrogate function as a probability distribution, it can be updated with new information through inherently probabilistic Bayesian processes. Perhaps when new information is introduced, the data is only 20% representable by function a(x). These changes are governed by Bayesian formulas.
This would be difficult or even impossible to do with, say, a polynomial regression fit to new data points.
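To make the update concrete, here is a small sketch (again assuming scikit-learn, with made-up numbers) of how the Gaussian process’s uncertainty at a point collapses once that point is actually observed:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[0.0], [1.0], [2.5], [4.0]])
y = np.array([1.2, 0.3, 0.8, 2.0])
x_test = np.array([[1.5]])

prior_gp = GaussianProcessRegressor(kernel=RBF()).fit(X, y)
mean_before, std_before = prior_gp.predict(x_test, return_std=True)

# New information arrives: we evaluate the objective at x = 1.5 (hypothetical cost 0.4).
X_new = np.vstack([X, [[1.5]]])
y_new = np.append(y, 0.4)
posterior_gp = GaussianProcessRegressor(kernel=RBF()).fit(X_new, y_new)
mean_after, std_after = posterior_gp.predict(x_test, return_std=True)

print(mean_before, std_before)  # the prior belief at x = 1.5
print(mean_after, std_after)    # uncertainty collapses after observing that point
```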
The surrogate function, represented as a probability distribution (the prior), is refined with the help of an ‘acquisition function’. This function is responsible for proposing new points to test, in an exploration and exploitation trade-off:
- Exploitation seeks to sample where the surrogate model predicts a good objective. This is taking advantage of known promising spots. However, if we have already explored a certain region enough, continually exploiting known information will yield little gain.
- Exploration seeks to sample in locations where the uncertainty is high. This ensures that no major region of the space is left unexplored; the global minimum may happen to lie there.
An acquisition function that encourages too much exploitation and too little exploration will lead the model to settle in the first minimum it finds (usually a local one: ‘going only where there is light’). An acquisition function that encourages the opposite will not stay in any minimum, local or global, in the first place. A delicate balance between the two yields good results.
The acquisition function, which we’ll denote a(x), must consider both exploitation and exploration. Common acquisition functions include expected improvement and maximum probability of improvement, both of which measure the probability that a specific input will pay off in the future, given what is known from the prior (the Gaussian process).
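Expected improvement, for instance, can be written in a few lines. Below is a minimal sketch for minimization; `gp` is assumed to be a fitted Gaussian process regressor (e.g. scikit-learn’s `GaussianProcessRegressor`) that returns a predictive mean and standard deviation, and `xi` is a small, hypothetical exploration bonus:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(X_candidates, gp, y_best, xi=0.01):
    """Expected improvement over the best observed value (for minimization)."""
    mu, sigma = gp.predict(X_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)   # avoid division by zero where the GP is certain
    improvement = y_best - mu - xi    # how much better than the incumbent we expect to do
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)
```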
Let’s put the pieces together. Bayesian optimization can be performed as such (a minimal end-to-end sketch follows the list):
- Initialize a Gaussian Process ‘surrogate function’ prior distribution.
- Choose several data points x such that the acquisition function a(x) operating on the current prior distribution is maximized.
- Evaluate the data points x in the objective cost function c(x) and obtain the results, y.
- Update the Gaussian Process prior distribution with the new data to produce a posterior (which will become the prior in the next step).
- Repeat steps 2–4 for several iterations.
- Interpret the current Gaussian Process distribution (which is very cheap to do) to find the global minimum.
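Here is a minimal end-to-end sketch of that loop, under assumed choices: a made-up one-dimensional objective, scikit-learn’s Gaussian process regressor as the surrogate, and the same expected-improvement formula as above for the acquisition step:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):
    # Pretend this is expensive to evaluate; its shape is hidden from the optimizer.
    return np.sin(3 * x) + 0.3 * (x - 2) ** 2

rng = np.random.default_rng(0)
bounds = (0.0, 5.0)

# Step 1: initialize the surrogate with a few random evaluations.
X = rng.uniform(*bounds, size=(3, 1))
y = objective(X).ravel()

for _ in range(20):                                   # step 5: repeat for several iterations
    # Steps 1/4: (re)fit the GP surrogate to everything observed so far.
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6).fit(X, y)
    # Step 2: pick the candidate that maximizes the acquisition (expected improvement).
    candidates = np.linspace(*bounds, 500).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)
    imp = y.min() - mu - 0.01
    z = imp / np.maximum(sigma, 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    # Step 3: evaluate the expensive objective at the chosen point.
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

# Step 6: read off the best point found (querying the surrogate is cheap).
print(X[np.argmin(y)], y.min())
```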
Bayesian optimization is all about putting probabilistic ideas behind the idea of surrogate optimization. The combination of these two ideas creates a powerful system with many applications, from pharmaceutical product development to autonomous vehicles.
Most commonly in machine learning, however, Bayesian optimization is used for hyperparameter optimization. For instance, if we’re training a gradient boosting classifier, there are dozens of parameters, from the learning rate to the maximum depth to the minimum impurity split value. In this case, x represents the hyperparameters of the model, and c(x) represents the performance of the model, given hyperparameters x.
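As a concrete sketch of what this can look like in code, here is a hypothetical tuning run using the HyperOpt library (mentioned in the summary below) and scikit-learn’s gradient boosting classifier; the dataset, search space, and evaluation budget are arbitrary illustrative choices, and HyperOpt’s default TPE algorithm is one flavor of this sequential, model-based approach rather than a GP surrogate:

```python
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def c(params):
    """The 'expensive' objective: build an ensemble and score it."""
    model = GradientBoostingClassifier(
        learning_rate=params["learning_rate"],
        max_depth=int(params["max_depth"]),
    )
    # Minimize negative accuracy, i.e. maximize accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

# Hypothetical search space over two of the many hyperparameters.
space = {
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
}

best = fmin(fn=c, space=space, algo=tpe.suggest, max_evals=25)
print(best)
```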
The primary motivation for using Bayesian optimization is in scenarios where the output is very expensive to evaluate. In the gradient boosting example, an entire ensemble of trees first needs to be built with the given parameters, and then it needs to run through several rounds of prediction, which is expensive for ensembles.
Arguably, evaluating a neural network’s loss for a given set of parameters is faster: it is simply repeated matrix multiplication, which is very fast, especially on specialized hardware. This is one of the reasons gradient descent, which makes repeated queries to understand where it is going, is used there.
In summary:
- Surrogate optimization uses a surrogate, or approximation, function to estimate the objective function through sampling.
- Bayesian optimization puts surrogate optimization in a probabilistic framework by representing surrogate functions as probability distributions, which can be updated in light of new information.
- Acquisition functions are used to evaluate the probability that exploring a certain point in space will yield a ‘good’ return given what is currently known from the prior, balancing exploration & exploitation.
- Use Bayesian optimization primarily when the objective function is expensive to evaluate, as is commonly the case in hyperparameter tuning. (There are many libraries like HyperOpt for this.)
Thanks for reading!
All images created by author unless stated otherwise.