BAYESIAN OPTIMIZATION
Consider the following problem of finding a global minimizer (or maximizer) of an unknown objective function:
$$\bm{x}^{\ast} = \arg\underset{\bm{x} \in \mathcal{X}}{\min} \; f(\bm{x})$$
Note that $f$ may be non-convex and that gradient information is unavailable. $\mathcal{X}$ is the (hyper-parameter) search space.
Since the objective function is unknown, the Bayesian strategy is to treat it as a random function and place a prior over it.
- The prior is a belief about how the function behaves.
- After collecting function evaluations, which serve as data, the prior is updated to form a posterior distribution over the objective function.
- The posterior is then used to construct an acquisition function that determines the next query point.
The widely used Gaussian prior:
$$f(\bm{x}_{1:t}) = [f(\bm{x}_{1}), \cdots, f(\bm{x}_{t})]^{T} \sim \mathcal{N}(\bm{0}, \bm{K}_t)$$
- $\bm{K}_t$ is the $t \times t$ kernel matrix with entries $\bm{K}_t(i,j) = k(\Vert \bm{x}_i - \bm{x}_j \Vert)$
Two popular kernels (common parameterizations are sketched after this list):
- the squared exponential (SE) kernel (the radial basis function kernel, also known as the Gaussian kernel)
- the Matérn kernel, whose definition involves:
  - $\Gamma(\cdot)$: the Gamma function
  - $B_{\nu}(\cdot)$: the $\nu$-th order Bessel function
  - $h$: a hyper-parameter
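Since the original formulas are not reproduced here, the following is a common parameterization of both kernels, assuming $h$ is a length-scale and writing $r = \Vert \bm{x}_i - \bm{x}_j \Vert$:

$$k_{\text{SE}}(r) = \exp\left( -\frac{r^{2}}{2h^{2}} \right), \qquad k_{\text{Matérn}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{h} \right)^{\nu} B_{\nu}\left( \frac{\sqrt{2\nu}\, r}{h} \right)$$

In the usual statement of the Matérn kernel, $B_{\nu}$ denotes the modified Bessel function of the second kind.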
In Bayesian optimization, at the $t$-th iteration:
- samples $\mathcal{D}_{1:t} = \left\{ \big( \bm{x}_i, f(\bm{x}_i) \big) \right\}_{i=1}^{t}$ have been collected
- at the next query point $\bm{x}_{t+1}$, we infer the value of $f(\bm{x}_{t+1})$
Under the Gaussian prior assumption:
- $\bm{k}_{t+1} = \left[ k(\Vert \bm{x}_{t+1} - \bm{x}_1 \Vert), \cdots, k(\Vert \bm{x}_{t+1} - \bm{x}_t \Vert) \right]^{T}$
- Since $\begin{bmatrix} f(\bm{x}_{1:t}) \\ f(\bm{x}_{t+1}) \end{bmatrix}$ is jointly Gaussian, the conditional distribution $f(\bm{x}_{t+1}) \vert f(\bm{x}_{1:t})$ must also be Gaussian, and we can use the standard formulas for the mean and variance of this conditional distribution.
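For a zero-mean GP prior, those standard conditional-Gaussian formulas read as follows (presumably these are the equations (6) and (7) referenced next):

$$\mu_{t+1}(\bm{x}_{t+1}) = \bm{k}_{t+1}^{T} \bm{K}_{t}^{-1} f(\bm{x}_{1:t}), \qquad \sigma_{t+1}^{2}(\bm{x}_{t+1}) = k(0) - \bm{k}_{t+1}^{T} \bm{K}_{t}^{-1} \bm{k}_{t+1}$$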
Reduce the complexity
Note that computing (6) and (7) is expensive, because the dimensions of the matrix and vectors grow with $t$.
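As a minimal NumPy sketch of this computation (assumptions not from the original: 1-D inputs, noise-free observations, SE kernel with length-scale `h`; `gp_posterior` and its arguments are illustrative names):

```python
import numpy as np

def se_kernel(r, h=1.0):
    # Squared exponential kernel evaluated on (arrays of) distances r.
    return np.exp(-r**2 / (2.0 * h**2))

def gp_posterior(X, y, x_new, h=1.0, jitter=1e-9):
    # Posterior mean and variance of f(x_new) given samples (X, y),
    # using the conditional-Gaussian formulas above (zero-mean prior).
    K_t = se_kernel(np.abs(X[:, None] - X[None, :]), h) + jitter * np.eye(len(X))
    k_new = se_kernel(np.abs(X - x_new), h)
    # Solving K_t a = k_{t+1} is the bottleneck: the cost grows as O(t^3).
    a = np.linalg.solve(K_t, k_new)
    mean = a @ y
    var = se_kernel(0.0, h) - k_new @ a
    return mean, var

# Toy usage on f(x) = sin(x): the posterior mean should be close to sin(1.5).
X = np.array([0.0, 1.0, 2.0, 3.0])
mean, var = gp_posterior(X, np.sin(X), x_new=1.5)
print(mean, var)
```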
Acquisition function
Now that we have a model of the function and its uncertainty, we will use this to choose which point to sample next.
- The acquisition function takes the posterior mean and variance at each point of the search space and computes a value indicating how desirable it is to sample next at that position.
- A good acquisition function should trade off exploration and exploitation.
Four popular acquisition functions:
- the upper confidence bound
- expected improvement
- probability of improvement
- Thompson sampling
Upper confidence bound
Direct balance between exploration and exploitation:
This acquisition function is defined as:
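The original equation is missing here; the standard GP-UCB form, with an assumed exploration weight $\kappa > 0$, is

$$\alpha_{\text{UCB}}(\bm{x}) = \mu_t(\bm{x}) + \kappa \, \sigma_t(\bm{x})$$

for maximization; for the minimization problem above one instead minimizes the lower confidence bound $\mu_t(\bm{x}) - \kappa \, \sigma_t(\bm{x})$. A larger $\kappa$ weights the uncertainty term more heavily, hence more exploration.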
Expected Improvement
- Perhaps the most widely used acquisition function.
- It is too greedy in some problems. It can be made more explorative by adding an 'explorative' parameter, as sketched below.
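A minimal sketch of expected improvement for the minimization setting above, assuming the 'explorative' parameter is a nonnegative offset `xi` subtracted from the current best value (the names `expected_improvement`, `xi`, and `f_best` are illustrative, not from the original):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # EI for minimization: expected amount by which f at a candidate
    # improves on the best observed value f_best. Larger xi -> more exploration.
    sigma = np.maximum(sigma, 1e-12)   # guard against zero posterior variance
    gain = f_best - mu - xi            # shifted improvement over the incumbent
    z = gain / sigma
    return gain * norm.cdf(z) + sigma * norm.pdf(z)

# Two candidates with equal posterior means but different uncertainty:
# the more uncertain one receives the larger EI.
print(expected_improvement(np.array([0.5, 0.5]), np.array([0.1, 1.0]), f_best=0.6))
```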
Reference: https://www.cnblogs.com/marsggbo/p/9866764.html