Scalable Global Optimization via Local Bayesian Optimization
概述
回顾一下高斯过程:
有一个数据集 D = { ( x i , y i ) } i = 1 n \mathcal D=\{(x_i,y_i)\}_{i=1}^n D={(xi,yi)}i=1n, x i x_i xi是第 i i i个data point, y i y_i yi是第 i i i个label:
GP Model : f ( x ) ∼ N ( μ n ( x ) , σ n ( x ) ) Likelihood : y ( x ) = f ( x ) + ϵ , ϵ ∼ N ( 0 , σ 2 I ) \text{GP Model}:f(x)\sim \mathcal N(\mu_n(x),\sigma_n(x))\\ \text{Likelihood}:y(x)=f(x)+\epsilon, \epsilon\sim \mathcal N(0,\sigma^2I) GP Model:f(x)∼N(μn(x),σn(x))Likelihood:y(x)=f(x)+ϵ,ϵ∼N(0,σ2I)
- 需要在Model中指定Kernel的类型即 σ n ( x ) \sigma_n(x) σn(x)和初始化超参
- 需要在Likelihood中指定噪声的类型以及初始化的超参数
- 训练的目的是学习适应数据Kernel以及噪声的超参
- 训练的目标(或损失)是Marginal LogLikelihood (MLL)
一、 GPR应用的简单例子
假设labels的真实生成过程如下:
y
(
x
)
=
s
i
n
(
2
π
x
)
+
ϵ
,
ϵ
∼
N
(
0
,
0.04
)
y(x)=sin(2\pi x)+\epsilon,\epsilon\sim \mathcal N(0, 0.04)
y(x)=sin(2πx)+ϵ,ϵ∼N(0,0.04)
其中 0.04 0.04 0.04不知道, s i n ( 2 π x ) sin(2\pi x) sin(2πx)的形式也不知道。
-
Kenel的选择。下面代码取RBF的kernel以及加一个outscale,公式如下:
k ( x 1 , x 2 ) = α exp ( − 1 2 ( x 1 − x 2 ) ⊤ ( x 1 − x 2 ) l 2 ) k(x_1,x_2)=\alpha\exp\left(-\frac{1}{2}\frac{(x_1-x_2)^\top(x_1-x_2)}{l^2}\right) k(x1,x2)=αexp(−21l2(x1−x2)⊤(x1−x2))其中 α \alpha α是outputscale, l l l是lengthscale,是Kernel的超参数
-
噪声的选择。Likelihood选择其噪声的形式正态分布,方差 σ \sigma σ是一个需要学习的超参数,公式如下:
ϵ ∼ N ( 0 , σ ) \epsilon \sim \mathcal N(0,\sigma) ϵ∼N(0,σ) -
计算MLL。给定数据 X , y X,y X,y,拟合的GP模型 f ∼ G P ( μ , K ) f\sim \mathcal{GP}(\mu,K) f∼GP(μ,K),公式如下:
p f ( y ∣ x ) = ∫ p ( y ∣ f ( x ) p ( f ( x ) ∣ x ) d f p_f(y|x)=\int p(y|f(x)p(f(x)|x)df pf(y∣x)=∫p(y∣f(x)p(f(x)∣x)df
下面给出一个可以直接跑的例子(请安装gpytorch和torch):
import torch
import gpytorch
import math
from matplotlib import pyplot as plt
# 准备数据 x
train_x = torch.linspace(0, 1, 100)
# 准备标签 y
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * math.sqrt(0.04)
# 设定训练次数
training_iter = 50
# 选择mean和covariance function即kernel
class ExactGPModel(gpytorch.models.ExactGP):
def __init__(self, train_x, train_y, likelihood):
super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
self.mean_module = gpytorch.means.ConstantMean()
self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
def forward(self, x):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
# 构建一个关于噪声的模型计算likelihood以及要学习的GP模型
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()
# 优化器
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# 计算损失
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for i in range(training_iter):
optimizer.zero_grad()
output = model(train_x)
loss = -mll(output, train_y)
loss.backward()
print('Iter %d/%d - Loss: %.3f outputscale:%.3f lengthscale: %.3f noise: %.3f' % (
i + 1, training_iter, loss.item(),
model.covar_module.outputscale.item(),
model.covar_module.base_kernel.lengthscale.item(),
model.likelihood.noise.item()
))
optimizer.step()
model.eval()
likelihood.eval()
# 构建测试数据 test_x
with torch.no_grad(), gpytorch.settings.fast_pred_var():
test_x = torch.linspace(0, 1, 51)
observed_pred = likelihood(model(test_x))
# 画图
with torch.no_grad():
f, ax = plt.subplots(1, 1, figsize=(16, 9))
lower, upper = observed_pred.confidence_region()
ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence'])
print的输出:
Iter 1/50 - Loss: 0.934 outputscale:0.693 lengthscale: 0.693 noise: 0.693
Iter 2/50 - Loss: 0.903 outputscale:0.744 lengthscale: 0.644 noise: 0.644
Iter 3/50 - Loss: 0.869 outputscale:0.798 lengthscale: 0.598 noise: 0.598
Iter 4/50 - Loss: 0.831 outputscale:0.855 lengthscale: 0.555 noise: 0.554
Iter 5/50 - Loss: 0.789 outputscale:0.913 lengthscale: 0.514 noise: 0.513
Iter 6/50 - Loss: 0.742 outputscale:0.974 lengthscale: 0.475 noise: 0.474
Iter 7/50 - Loss: 0.693 outputscale:1.038 lengthscale: 0.438 noise: 0.437
Iter 8/50 - Loss: 0.643 outputscale:1.105 lengthscale: 0.404 noise: 0.402
Iter 9/50 - Loss: 0.595 outputscale:1.173 lengthscale: 0.371 noise: 0.370
Iter 10/50 - Loss: 0.551 outputscale:1.243 lengthscale: 0.341 noise: 0.339
Iter 11/50 - Loss: 0.512 outputscale:1.309 lengthscale: 0.315 noise: 0.311
Iter 12/50 - Loss: 0.475 outputscale:1.368 lengthscale: 0.291 noise: 0.284
Iter 13/50 - Loss: 0.441 outputscale:1.418 lengthscale: 0.272 noise: 0.260
Iter 14/50 - Loss: 0.408 outputscale:1.455 lengthscale: 0.256 noise: 0.237
Iter 15/50 - Loss: 0.375 outputscale:1.481 lengthscale: 0.243 noise: 0.216
Iter 16/50 - Loss: 0.343 outputscale:1.494 lengthscale: 0.234 noise: 0.197
Iter 17/50 - Loss: 0.311 outputscale:1.496 lengthscale: 0.226 noise: 0.179
Iter 18/50 - Loss: 0.280 outputscale:1.488 lengthscale: 0.221 noise: 0.163
Iter 19/50 - Loss: 0.249 outputscale:1.471 lengthscale: 0.218 noise: 0.148
Iter 20/50 - Loss: 0.219 outputscale:1.447 lengthscale: 0.216 noise: 0.135
Iter 21/50 - Loss: 0.189 outputscale:1.415 lengthscale: 0.216 noise: 0.123
Iter 22/50 - Loss: 0.161 outputscale:1.379 lengthscale: 0.218 noise: 0.112
Iter 23/50 - Loss: 0.134 outputscale:1.337 lengthscale: 0.221 noise: 0.102
Iter 24/50 - Loss: 0.109 outputscale:1.293 lengthscale: 0.224 noise: 0.093
Iter 25/50 - Loss: 0.087 outputscale:1.245 lengthscale: 0.229 noise: 0.084
Iter 26/50 - Loss: 0.066 outputscale:1.197 lengthscale: 0.235 noise: 0.077
Iter 27/50 - Loss: 0.048 outputscale:1.147 lengthscale: 0.242 noise: 0.071
Iter 28/50 - Loss: 0.033 outputscale:1.097 lengthscale: 0.249 noise: 0.065
Iter 29/50 - Loss: 0.021 outputscale:1.048 lengthscale: 0.256 noise: 0.060
Iter 30/50 - Loss: 0.012 outputscale:1.000 lengthscale: 0.264 noise: 0.055
Iter 31/50 - Loss: 0.006 outputscale:0.955 lengthscale: 0.271 noise: 0.051
Iter 32/50 - Loss: 0.003 outputscale:0.912 lengthscale: 0.279 noise: 0.048
Iter 33/50 - Loss: 0.003 outputscale:0.873 lengthscale: 0.286 noise: 0.045
Iter 34/50 - Loss: 0.004 outputscale:0.839 lengthscale: 0.292 noise: 0.042
Iter 35/50 - Loss: 0.008 outputscale:0.809 lengthscale: 0.297 noise: 0.040
Iter 36/50 - Loss: 0.012 outputscale:0.785 lengthscale: 0.300 noise: 0.038
Iter 37/50 - Loss: 0.016 outputscale:0.767 lengthscale: 0.301 noise: 0.037
Iter 38/50 - Loss: 0.020 outputscale:0.755 lengthscale: 0.301 noise: 0.036
Iter 39/50 - Loss: 0.023 outputscale:0.748 lengthscale: 0.299 noise: 0.035
Iter 40/50 - Loss: 0.024 outputscale:0.746 lengthscale: 0.296 noise: 0.034
Iter 41/50 - Loss: 0.025 outputscale:0.747 lengthscale: 0.291 noise: 0.034
Iter 42/50 - Loss: 0.024 outputscale:0.751 lengthscale: 0.286 noise: 0.034
Iter 43/50 - Loss: 0.023 outputscale:0.757 lengthscale: 0.280 noise: 0.034
Iter 44/50 - Loss: 0.021 outputscale:0.764 lengthscale: 0.274 noise: 0.035
Iter 45/50 - Loss: 0.018 outputscale:0.771 lengthscale: 0.268 noise: 0.035
Iter 46/50 - Loss: 0.016 outputscale:0.777 lengthscale: 0.263 noise: 0.036
Iter 47/50 - Loss: 0.013 outputscale:0.782 lengthscale: 0.258 noise: 0.037
Iter 48/50 - Loss: 0.010 outputscale:0.785 lengthscale: 0.253 noise: 0.038
Iter 49/50 - Loss: 0.008 outputscale:0.786 lengthscale: 0.249 noise: 0.039
Iter 50/50 - Loss: 0.006 outputscale:0.785 lengthscale: 0.245 noise: 0.040
上述代码画出来的图如下:
为什么GP model要学习?
就是为了拟合与数据相关的超参数。
二、TuRBO的文章介绍
在上一篇关于BO的文章贝叶斯优化中提到,BO最常用的关于统计量建模用的是GP,然后利用acquisition function选择下一个候选点。而TuRBO的关键就是利用Trust Region Method来设计了一个strategy结合Thomson Sampling来选择下一个候选点。
为了阐述这个关键点,下面主要分为一个局部信赖域模型local trust region BO 和 global trust region BO来描述TuRBO
2.1 局部信赖域BO
出发点:
想要GP完全拟合好一个高维的
x
∈
R
d
x\in \mathcal R^d
x∈Rd可能比较困难,毕竟GP正态分布的分布结构不一定适合数据那一维度对应的空间,抑或是数据维度之间值变化太大了,GP捕捉拟合能力有限,术语即GP模型容量有限。因此,先拟合一个数据子域的模型总行吧,那这个子域怎么表示?怎样才算拟合好呢?
假设 x = ( x 1 , . . . , x d ) x=(x_1,...,x_d) x=(x1,...,xd),Trust Region来表示子域,用 L m i n , L m a x L_{min},L_{max} Lmin,Lmax代表最小的子域长度和最大的子域长度。所以对于点 x ( 0 ) x^{(0)} x(0),其最大子域为 x ( 0 ) ± L m a x x^{(0)}\pm L_{max} x(0)±Lmax。所以对一个局部信赖域BO关键的地方在于怎么迭代各维度的长度,从而GP对该子域拟合得不错,即可信赖的子域(信赖域)!
寻找“恰当”子域策略为:
- 初始化子域的基本长度 L ← L i n i t L\leftarrow L_{init} L←Linit
- 经过 τ s u c c \tau_{succ} τsucc次“成功”,扩张子域长度 L ← max { L m a x , 2 L } L\leftarrow \max\{L_{max}, 2L\} L←max{Lmax,2L}
- 经过 τ f a i l \tau_{fail} τfail次“失败”,收缩子域长度 L ← min { L m i n , L / 2 } L\leftarrow \min{\{L_{min},L/2\}} L←min{Lmin,L/2}
采集函数的具体流程:
- 同样从已有或初始的数据集 D 0 = { ( x i , y i ) } i = 1 N \mathcal D_0=\{(x_i,y_i)\}_{i=1}^N D0={(xi,yi)}i=1N中选出最大值 y ∗ ∈ D 0 y^*\in \mathcal D_0 y∗∈D0及其对应的 x ∗ ∈ R d x^*\in \mathcal R^d x∗∈Rd
- 在这个 x ∗ = ( x 1 ∗ , x 2 ∗ , . . . , x d ∗ ) x^*=(x_1^*,x_2^*,...,x_d^*) x∗=(x1∗,x2∗,...,xd∗)里确定一个Trust Region,假设基本长度值为 L ← L i n i t L\leftarrow L_{init} L←Linit,各个维度设定一个length scale λ i \lambda_i λi,通过rescaling来确定每一个维度的具体长度: L i = λ i L ( ∏ j = 1 d λ j ) 1 d L_i=\frac{\lambda_iL}{(\prod_{j=1}^d\lambda_j)^{\frac{1}{d}}} Li=(∏j=1dλj)d1λiL
- 通过上一步确定的Trust Region为 TR ( x ∗ ) = ( x 1 ∗ + L 1 , x 2 ∗ + L 2 , . . . , x d ∗ + L d ) \text{TR}(x^*)=(x_1^*+L_1,x_2^*+L_2,...,x^*_d+L_d) TR(x∗)=(x1∗+L1,x2∗+L2,...,xd∗+Ld),从中select出 q q q个候选点(candidates)
- 定义 τ s u c c \tau_{succ} τsucc和 τ f a i l \tau_{fail} τfail中的success和failure,一次“成功”为一个候选点的观测值大于 y ∗ y^* y∗,一次“失败”为一个候选点的观测值小于 y ∗ y^* y∗然后返回到GP model进行拟合
这就是一个single local trust region BO中采集函数的具体流程。(省略了GP的拟合)
2.2 全局信赖域BO(TuRBO)
TuRBO同时维持着m个trust regions, TR l w h e r e l ∈ { 1 , . . . , m } \text{TR}_l ~where~l\in\{1,...,m\} TRl where l∈{1,...,m}。从这m个trust regions选择第 i i i个候选点(candidate)的公式如下:
x ( i ) ∈ arg max l arg max x ∈ TR l f l ( i ) w h e r e f l ( i ) ∼ G P l ( μ l ( x ) , σ l ( x ) ) x^{(i)}\in \argmax_l\argmax_{x\in \text{TR}_l}f_l^{(i)}~where~f_l^{(i)}\sim \mathcal{GP}_l(\mu_l(x),\sigma_l(x)) x(i)∈largmaxx∈TRlargmaxfl(i) where fl(i)∼GPl(μl(x),σl(x))
先解释一下,后面第三章节直接上代码对应了。
- 第一个argmax,意味着先从 m m m个TR中任选一个,比如 l l l
- 选好了第 l l l个 TR l \text{TR}_l TRl,那里有一个回归好的高斯回归模型 G P l \mathcal GP_l GPl,从这个 TR l \text{TR}_l TRl中用伪随机数算法Sobel生成一些点 x x x,把这些点用 G P l \mathcal GP_l GPl来采样得到它们的function values即 f l ( x ) f_l(x) fl(x)
- 从这些 f l ( x ) f_l(x) fl(x)中argmax出最大值点值对即 ( x l , f ( x l ) ) (x_l,f(x_l)) (xl,f(xl)),然后每个TR都这么做,就有m个这样点对,从中选出最大的当作候选点 x l ( i ) x_l^{(i)} xl(i)
三、TuRBO代码分析
只分析turbo_m.py中最主要的代码:
第一部分主要是对n个trust regions进行初始化:
# 初始化n个信赖域的数据
for i in range(self.n_trust_regions):
# 以latin_hypercube的方式生成n_init个初始点
X_init = latin_hypercube(self.n_init, self.dim)
# 对X_init进行标准化
X_init = from_unit_cube(X_init, self.lb, self.ub)
# 对X_init进行评估/观测/计算/测量
fX_init = np.array([[self.f(x)] for x in X_init])
# 存放数据信息
self.X = np.vstack((self.X, X_init))
self.fX = np.vstack((self.fX, fX_init))
self._idx = np.vstack((self._idx, i * np.ones((self.n_init, 1), dtype=int)))
# count 评估次数
self.n_evals += self.n_init
第二部分
# 主循环的条件:最大评估次数max_evals
while self.n_evals < self.max_evals:
# 创建候选点的容器
X_cand = np.zeros((self.n_trust_regions, self.n_cand, self.dim))
y_cand = np.inf * np.ones((self.n_trust_regions, self.n_cand, self.batch_size))
# 对n个regions,各自生成候选点即_create_candiates这个函数
for i in range(self.n_trust_regions):
# 这里是一些对数据维度的处理细节
idx = np.where(self._idx == i)[0]
X = deepcopy(self.X[idx, :])
X = to_unit_cube(X, self.lb, self.ub)
fX = deepcopy(self.fX[idx, 0].ravel())
n_training_steps = 0 if self.hypers[i] else self.n_training_steps
# 在TR内利用sobolEngine生成随机数,然后利用GP采样得到function value,得到的候选点
X_cand[i, :, :], y_cand[i, :, :], self.hypers[i] =self._create_candidates(X, fX, length=self.length[i], n_training_steps=n_training_steps, hypers=self.hypers[i])
# 选出候选点,这一步对应第二章TuRBO理论介绍的
X_next, idx_next = self._select_candidates(X_cand, y_cand)
# 标准化还原
X_next = from_unit_cube(X_next, self.lb, self.ub)
# 评估候选点
fX_next = np.array([[self.f(x)] for x in X_next])
# 更新TR
for i in range(self.n_trust_regions):
idx_i = np.where(idx_next == i)[0]
if len(idx_i) > 0:
self.hypers[i] = {}
fX_i = fX_next[idx_i]
# 调整子域(信赖域)的长度
self._adjust_length(fX_i, i)