Bayesian optimisation for smart hyperparameter search

Fitting a single classifier does not take long, but fitting hundreds takes a while. To find the best hyperparameters you need to fit a lot of classifiers. What to do?

This post explores the inner workings of an algorithm you can use to reduce the number of hyperparameter sets you need to try before finding the best one. The algorithm goes by the name of Bayesian optimisation. If you are looking for a production-ready implementation, check out MOE, the Metric Optimization Engine developed by Yelp.

Gaussian process regression is a useful tool in general and is used heavily here. Check out my post on Gaussian processes with george for a short introduction.

This post starts with an example where we know the true form of the scoring function. It then pits random grid search against Bayesian optimisation to find the best hyper-parameter for a real classifier.

As usual, first some setup and imports:

%matplotlib inline
import random

import numpy as np
np.random.seed(9)
from scipy.stats import randint as sp_randint
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context("talk")

By George!

Bayesian optimisation uses Gaussian processes to fit a regression model to the previously evaluated points in hyper-parameter space. This model is then used to suggest the next (best) point in hyper-parameter space at which to evaluate the classifier.
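To make "fitting a regression model to the evaluated points" concrete, here is a minimal NumPy sketch of a zero-mean Gaussian process posterior. It is not how george implements things internally; the squared-exponential kernel with unit length scale, the helper names (`exp_squared_kernel`, `gp_posterior`) and the two sample points are all assumptions made for illustration.

```python
import numpy as np

def exp_squared_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / length_scale**2)

def gp_posterior(x_train, y_train, y_err, x_test):
    """Posterior mean and standard deviation of a zero-mean GP (illustrative)."""
    # Covariance of the observed points, with measurement noise on the diagonal
    K = exp_squared_kernel(x_train, x_train) + np.diag(y_err**2)
    K_s = exp_squared_kernel(x_train, x_test)
    K_ss = exp_squared_kernel(x_test, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    # Clip tiny negative values caused by round-off before taking the root
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Two made-up evaluations of a score, each with an uncertainty of 0.2
x_train = np.array([2.0, 7.0])
y_train = np.array([-1.8, -4.6])
y_err = np.array([0.2, 0.2])

# Predict at the two sampled points and at one point in between
x_test = np.array([2.0, 4.5, 7.0])
mean, std = gp_posterior(x_train, y_train, y_err, x_test)
print(mean, std)
```

The predicted uncertainty is small at the points we already evaluated and grows in the gap between them, which is exactly the information the optimiser exploits when choosing where to look next.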

To choose the best point we need to define a criterion; in this case we use "expected improvement". As we only know the score to within a certain precision, we do not want to simply choose the point with the best predicted score. Instead we pick the point which promises the largest expected improvement. This allows us to incorporate the uncertainty about our estimate of the scoring function into the procedure. It leads to a mixture of exploitation and exploration of the parameter space.
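A small worked example may help show why expected improvement differs from "pick the best predicted score". The sketch below uses the standard closed form for a maximisation problem, EI = (mu - best) * Phi(Z) + sigma * phi(Z) with Z = (mu - best) / sigma. The function names and the numbers are made up for illustration, and the normal pdf and cdf are written with the standard library instead of scipy.stats to keep the snippet self-contained.

```python
import math

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Cumulative distribution of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement_max(mu, sigma, best):
    """Expected improvement at one candidate point when maximising a score."""
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

best_so_far = 1.0  # hypothetical best score observed so far

# Candidate A: predicted slightly better than the best, almost no uncertainty
ei_confident = expected_improvement_max(mu=1.05, sigma=0.01, best=best_so_far)
# Candidate B: predicted slightly worse than the best, but very uncertain
ei_uncertain = expected_improvement_max(mu=0.90, sigma=0.50, best=best_so_far)

print(ei_confident, ei_uncertain)
```

Candidate B has the larger expected improvement even though its predicted score is lower, because its large uncertainty leaves room for a big gain. This is the exploration side of the trade-off.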

Below we set up a toy scoring function (-x sin x), sample two points from it, and fit our Gaussian process model to them.

import george
from george.kernels import ExpSquaredKernel

score_func = lambda x: -x*np.sin(x)
x = np.arange(0, 10, 0.1)

# Generate some fake, noisy data. These represent
# the points in hyper-parameter space for which
# we already trained our classifier and evaluated its score
xp = 10 * np.sort(np.random.rand(2))
yerr = 0.2 * np.ones_like(xp)
yp = score_func(xp) + yerr * np.random.randn(len(xp))

# Set up a Gaussian process
kernel = ExpSquaredKernel(1)
gp = george.GP(kernel)
gp.compute(xp, yerr)

mu, cov = gp.predict(yp, x)
std = np.sqrt(np.diag(cov))

def basic_plot():
    fig, ax = plt.subplots()
    ax.plot(x, mu, label="GP median")
    ax.fill_between(x, mu-std, mu+std, alpha=0.5)
    ax.plot(x, score_func(x), '--', label=" True score function (unknown)")
    # explicit zorder to draw points and errorbars on top of everything
    ax.errorbar(xp, yp, yerr=yerr, fmt='ok', zorder=3, label="samples")
    ax.set_ylim(-9,6)
    ax.set_ylabel("score")
    ax.set_xlabel('hyper-parameter X')
    ax.legend(loc='best')
    return fig,ax

basic_plot()

The dashed green line represents the true value of the scoring function as a function of our hypothetical hyper-parameter X. The black dots (and their errorbars) represent points at which we evaluated our classifier and calculated the score. The solid blue line is our regression model's prediction of the score function. The shaded area represents the uncertainty on that prediction.

Next let's calculate the expected improvement at every value of the hyper-parameter X. We also build a multistart optimisation routine (next_sample) which uses the expected improvement to suggest which point to sample next.

from scipy.optimize import minimize
from scipy import stats

def expected_improvement(points, gp, samples, bigger_better=False):
    # are we trying to maximise a score or minimise an error?
    if bigger_better:
        best_sample = samples[np.argmax(samples)]

        mu, cov = gp.predict(samples, points)
        sigma = np.sqrt(cov.diagonal())

        Z = (mu-best_sample)/sigma

        ei = ((mu-best_sample) * stats.norm.cdf(Z) + sigma*stats.norm.pdf(Z))

        # want to use this as objective function in a minimiser so multiply by -1
        return -ei

    else:
        best_sample = samples[np.argmin(samples)]

        mu, cov = gp.predict(samples, points)
        sigma = np.sqrt(cov.diagonal())

        Z = (best_sample-mu)/sigma

        ei = ((best_sample-mu) * stats.norm.cdf(Z) + sigma*stats.norm.pdf(Z))

        # want to use this as objective function in a minimiser so multiply by -1
        return -ei

def next_sample(gp, samples, bounds=(0,10), bigger_better=False):
    """Find point with largest expected improvement"""
    best_x = None
    best_ei = 0
    # EI is zero at most values -> often get trapped
    # in a local maximum -> multistarting to increase
    # our chances to find the global maximum
    for rand_x in np.random.uniform(bounds[0], bounds[1], size=30):
        res = minimize(expected_improvement, rand_x,
                       bounds=[bounds],
                       method='L-BFGS-B',
                       args=(gp, samples, bigger_better))
        if res