1. Why Do We Need Hyperparameter Search?
During neural network training there are many parameters that stay fixed for the whole run, i.e. the hyperparameters, for example:
- Network-structure parameters: the number of layers, the width of each layer, the activation function of each layer, etc.
- Training parameters: batch_size, the learning rate, the learning-rate decay schedule, etc.
Setting these parameters relies partly on experience, to pick reasonable ranges, and partly on reliable methods, to pin down the final values accurately. Even seasoned practitioners can find themselves stretched thin when tuning neural networks by hand. If you insist on purely manual trial and error, then unless your hands (really, your machines) are fast enough and your patience endless, you have just signed up for the long grind of hand-tuning.
2. Hyperparameter Search Strategies
Now that the importance of hyperparameter search is clear, let's walk through the common search strategies.
The classic hyperparameter search method in machine learning is Grid Search. However, as soon as there are more than a few parameters it runs into the curse of dimensionality: the number of parameter combinations grows exponentially. With m parameters, each taking n candidate values, the search costs O(n^m) model evaluations. Bengio et al., in "Random Search for Hyper-Parameter Optimization", proposed randomized search instead. They point out that most parameter spaces have "low effective dimensionality": some parameters strongly affect the objective function while others have almost no effect, and which parameters are the effective ones usually differs across datasets. In this setting Random Search tends to work well. The figure below is an example with only two parameters, where the green parameter matters a lot and the yellow one barely matters:
Grid Search evaluates every possible combination, so for the influential green parameter it explores only 3 values while spending most of its budget on the unimportant yellow one; Random Search, in contrast, explores 9 distinct values of the green parameter and is therefore more efficient: within the same time budget it usually (though not always) finds better hyperparameters. In addition, Random Search can sample from continuous spaces, whereas Grid Search is limited to discrete ones, and continuous hyperparameters such as the learning rate of a neural network or the gamma of an SVM are best described by continuous distributions.
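The exponential blow-up of Grid Search is easy to see directly; a minimal sketch (the parameter counts here are illustrative, not from the text):

```python
from itertools import product

# With m hyperparameters, each taking n candidate values, grid search must
# evaluate every one of the n**m combinations.
n_values = 4   # candidate values per parameter
m_params = 6   # number of hyperparameters
grids = [range(n_values)] * m_params
combinations = list(product(*grids))
print(len(combinations))     # 4096 == 4**6
```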
In practice, Grid Search only needs a list of candidate values for each parameter, while Random Search needs a probability distribution for each parameter from which values are sampled. For how to choose those distributions, see the earlier article "Several distributions for randomized hyperparameter search".
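The difference in how the two methods specify their search space can be sketched with sklearn's ParameterGrid and ParameterSampler (a minimal illustration; the parameter names are made up):

```python
from scipy.stats import reciprocal
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid Search: every parameter gets a finite list; all combinations are enumerated.
grid = list(ParameterGrid({
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "layer_size": [30, 50, 100],
}))
print(len(grid))   # 3 * 3 = 9 combinations

# Random Search: a parameter may be a continuous distribution; n_iter points are sampled.
samples = list(ParameterSampler({
    "learning_rate": reciprocal(1e-4, 1e-2),   # continuous, log-uniform
    "layer_size": [30, 50, 100],
}, n_iter=9, random_state=42))
print(len(samples))   # 9 sampled points
```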
3. Hyperparameter Search in Practice
- Data: the California Housing dataset (house-price prediction)
- Environment:
  - Ubuntu 16.04
  - matplotlib 2.1.2
  - numpy 1.19.1
  - pandas 0.22.0
  - sklearn 0.19.1
  - tensorflow 2.2.0
(1) Manual hyperparameter search
First, let's look at the most basic approach: a hand-written search loop. Import the packages the model needs and load the data:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import pickle
import tensorflow as tf
from tensorflow import keras

# Load the dataset from a local pickle file
with open(file='data/california_housing.pkl', mode='rb') as f:
    housing = pickle.load(f)
# Or download it over the network instead:
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
print(housing.DESCR)
print(housing.data.shape)
print(housing.target.shape)
Output:
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block
- HouseAge median house age in block
- AveRooms average number of rooms
- AveBedrms average number of bedrooms
- Population block population
- AveOccup average house occupancy
- Latitude house block latitude
- Longitude house block longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/
The target variable is the median house value for California districts.
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
(20640, 8)
(20640,)
Because I downloaded and saved the data ahead of time, it is read from a local file here; you can also fetch it directly over the network. The dataset contains 20640 samples, 8 predictor variables, and one target variable. Next we standardize the data; the normalization step is covered in an earlier article, and questions are welcome via the WeChat account 【瞧不死的AI】.
# Train/validation/test split
from sklearn.model_selection import train_test_split

x_train_all, x_test, y_train_all, y_test = train_test_split(
    housing.data, housing.target, random_state=7)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train_all, y_train_all, random_state=11)
print(x_train.shape, y_train.shape)
print(x_valid.shape, y_valid.shape)
print(x_test.shape, y_test.shape)

# Standardize the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform records the training set's mean and variance, so the same
# statistics can be reused on the validation and test sets; all three sets
# then share one distribution, which the model's training depends on.
x_train_scaled = scaler.fit_transform(x_train)
x_valid_scaled = scaler.transform(x_valid)
x_test_scaled = scaler.transform(x_test)
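The point about fit_transform versus transform can be verified on toy data (a small illustrative check, not part of the original code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# fit_transform learns the mean and std from the training data only;
# transform then reuses those same statistics on the other splits.
train = np.array([[1.0], [2.0], [3.0]])
valid = np.array([[2.0], [4.0]])
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # statistics computed here
valid_scaled = scaler.transform(valid)       # training statistics reused
print(scaler.mean_)          # [2.] -- the training mean, not the validation mean
print(valid_scaled[0, 0])    # 0.0 -- (2 - train_mean) / train_std
```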
Next, let's tune the learning rate hyperparameter by hand:
# Manual hyperparameter search over the learning rate
# learning_rate candidates: [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]
# Role of the learning rate in the gradient update: W = W - learning_rate * grad
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]
histories = []
for lr in learning_rates:
    model = keras.models.Sequential([
        keras.layers.Dense(30, activation='relu',
                           input_shape=x_train.shape[1:]),
        keras.layers.Dense(1),
    ])
    optimizer = keras.optimizers.SGD(lr)
    model.compile(loss="mean_squared_error", optimizer=optimizer)
    callbacks = [keras.callbacks.EarlyStopping(
        patience=5, min_delta=1e-2)]
    history = model.fit(x_train_scaled, y_train,
                        validation_data=(x_valid_scaled, y_valid),
                        epochs=100,
                        callbacks=callbacks)
    histories.append(history)
Six learning-rate values are tried; the training record of each run is kept in a history object, and all of them are collected in the histories list.
Part of the training output:
Epoch 1/100
363/363 [==============================] - 1s 4ms/step - loss: 4.6029 - val_loss: 3.8958
Epoch 2/100
363/363 [==============================] - 1s 3ms/step - loss: 3.0756 - val_loss: 2.6809
Epoch 3/100
363/363 [==============================] - 1s 3ms/step - loss: 2.1854 - val_loss: 1.9770
Epoch 4/100
363/363 [==============================] - 1s 3ms/step - loss: 1.6644 - val_loss: 1.5667
Epoch 5/100
363/363 [==============================] - 1s 3ms/step - loss: 1.3547 - val_loss: 1.3227
Epoch 6/100
363/363 [==============================] - 1s 3ms/step - loss: 1.1673 - val_loss: 1.1757
Epoch 7/100
......
Epoch 13/100
363/363 [==============================] - 1s 3ms/step - loss: 0.3596 - val_loss: 0.3932
Epoch 14/100
363/363 [==============================] - 1s 3ms/step - loss: 0.3623 - val_loss: 0.3668
Epoch 15/100
363/363 [==============================] - 1s 3ms/step - loss: 0.3658 - val_loss: 0.3756
Epoch 1/100
363/363 [==============================] - 1s 3ms/step - loss: nan - val_loss: nan
Epoch 2/100
363/363 [==============================] - 1s 3ms/step - loss: nan - val_loss: nan
Epoch 3/100
363/363 [==============================] - 1s 3ms/step - loss: nan - val_loss: nan
Epoch 4/100
363/363 [==============================] - 1s 3ms/step - loss: nan - val_loss: nan
Epoch 5/100
363/363 [==============================] - 1s 3ms/step - loss: nan - val_loss: nan
Visualize the loss curves of each run:
[Six loss-curve plots, one per learning rate: 0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03]
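The plotting code is not shown above; a helper along these lines would produce the six figures (a sketch, assuming the `learning_rates` and `histories` lists from the loop above):

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_learning_curves(history_dict, lr, max_loss=1.0):
    # history_dict is the History.history dict returned by model.fit,
    # e.g. {'loss': [...], 'val_loss': [...]}
    ax = pd.DataFrame(history_dict).plot(figsize=(8, 5))
    ax.grid(True)
    ax.set_ylim(0, max_loss)
    ax.set_title('learning rate: {}'.format(lr))
    return ax

# for lr, history in zip(learning_rates, histories):
#     plot_learning_curves(history.history, lr)
#     plt.show()
```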
The results show how sensitive training is to the learning rate: at 0.003 the model drives the loss below 0.4 within about 30 epochs and converges quickly; at the smaller value 0.0003, even 50 epochs of training cannot bring the loss below 0.4; and at the overly large value 0.03 the loss explodes (NaN) and the model cannot converge at all.
At the same time, manual loop-based search has some clear drawbacks:
- If there are many hyperparameters, say 20, you need that many nested loops, which is extremely inefficient.
- Loop-based search is strictly sequential: the next model cannot start training until the previous one has finished, so there is no distributed training, and parallelizing it by hand would add considerable complexity.
Next we implement hyperparameter search with the utilities already packaged in scikit-learn.
(2) Hyperparameter search with sklearn
To use sklearn's search utilities, the tf.keras model must be wrapped as an sklearn estimator; here we use sklearn's RandomizedSearchCV for parameter selection.
First wrap the keras model as an sklearn model; the data-processing code is the same as before.
# 1. Wrap the keras model as an sklearn estimator
def build_model(hidden_layers=1,
                layer_size=30,
                learning_rate=3e-3):
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(layer_size, activation='relu',
                                 input_shape=x_train.shape[1:]))
    for _ in range(hidden_layers - 1):
        model.add(keras.layers.Dense(layer_size,
                                     activation='relu'))
    model.add(keras.layers.Dense(1))
    optimizer = keras.optimizers.SGD(learning_rate)
    model.compile(loss='mse', optimizer=optimizer)
    return model

sklearn_model = keras.wrappers.scikit_learn.KerasRegressor(
    build_fn=build_model)
callbacks = [keras.callbacks.EarlyStopping(patience=5, min_delta=1e-2)]
history = sklearn_model.fit(x_train_scaled, y_train,
                            epochs=100,
                            validation_data=(x_valid_scaled, y_valid),
                            callbacks=callbacks)
The wrapper used here is tf.keras.wrappers.scikit_learn.KerasRegressor; for classification tasks, wrap the model with tf.keras.wrappers.scikit_learn.KerasClassifier instead.
Training the wrapped model gives:
Epoch 1/100
363/363 [==============================] - 1s 3ms/step - loss: 0.9231 - val_loss: 0.7247
Epoch 2/100
363/363 [==============================] - 1s 3ms/step - loss: 0.6358 - val_loss: 0.6265
Epoch 3/100
363/363 [==============================] - 1s 3ms/step - loss: 0.5599 - val_loss: 0.5749
Epoch 4/100
363/363 [==============================] - 1s 3ms/step - loss: 0.5305 - val_loss: 0.5459
Epoch 5/100
363/363 [==============================] - 1s 3ms/step - loss: 0.5260 - val_loss: 0.5237
Epoch 6/100
363/363 [==============================] - 1s 3ms/step - loss: 0.5040 - val_loss: 0.4979
Epoch 7/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4748 - val_loss: 0.4988
Epoch 8/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4559 - val_loss: 0.4744
Epoch 9/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4457 - val_loss: 0.4685
Epoch 10/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4471 - val_loss: 0.4616
Epoch 11/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4370 - val_loss: 0.4531
Epoch 12/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4294 - val_loss: 0.4609
Epoch 13/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4316 - val_loss: 0.4428
Epoch 14/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4269 - val_loss: 0.4462
Epoch 15/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4174 - val_loss: 0.4326
Epoch 16/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4131 - val_loss: 0.4298
Epoch 17/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4103 - val_loss: 0.4305
Epoch 18/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4074 - val_loss: 0.4270
Epoch 19/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4084 - val_loss: 0.4227
Epoch 20/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4020 - val_loss: 0.4164
Epoch 21/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4032 - val_loss: 0.4209
Epoch 22/100
363/363 [==============================] - 1s 3ms/step - loss: 0.4005 - val_loss: 0.4162
Epoch 23/100
363/363 [==============================] - 1s 3ms/step - loss: 0.3978 - val_loss: 0.4149
Epoch 24/100
363/363 [==============================] - 1s 3ms/step - loss: 0.3966 - val_loss: 0.4149
Epoch 25/100
363/363 [==============================] - 1s 3ms/step - loss: 0.3919 - val_loss: 0.4096
With the model wrapped, the next step is to define the parameter distributions and run the search:
# 2. Define the parameter distributions
# 3. Run the search
from scipy.stats import reciprocal
# reciprocal pdf: f(x) = 1 / (x * log(b/a)) for a <= x <= b
param_distribution = {
    "hidden_layers": [1, 2, 3, 4],
    "layer_size": np.arange(1, 100),
    "learning_rate": reciprocal(1e-4, 1e-2),
}

from sklearn.model_selection import RandomizedSearchCV

random_search_cv = RandomizedSearchCV(sklearn_model,
                                      param_distribution,
                                      n_iter=10,
                                      cv=3,
                                      n_jobs=1)
random_search_cv.fit(x_train_scaled, y_train, epochs=100,
                     validation_data=(x_valid_scaled, y_valid),
                     callbacks=callbacks)
# cross-validation: the training set is split into n folds; n-1 folds are
# used for training and the remaining fold for validation.
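The cv=3 splitting described in the comment above can be illustrated with sklearn's KFold (toy-sized data, for illustration only):

```python
import numpy as np
from sklearn.model_selection import KFold

# With cv=3 the training set is cut into 3 folds; each fold serves once as
# the validation set while the remaining 2 folds are used for training.
X = np.arange(12).reshape(12, 1)
fold_sizes = []
for train_idx, valid_idx in KFold(n_splits=3).split(X):
    fold_sizes.append((len(train_idx), len(valid_idx)))
print(fold_sizes)   # [(8, 4), (8, 4), (8, 4)]
```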
Here we search over the number of hidden layers, the width of each layer, and the learning rate. For the learning rate we use the reciprocal distribution, whose probability density is f(x) = 1 / (x · ln(b/a)) for a ≤ x ≤ b, i.e. it is log-uniform on [a, b]. For more on choosing distributions for hyperparameters, see the earlier article "Several distributions for randomized hyperparameter search".
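The log-uniform behavior of the reciprocal distribution can be checked by sampling (an illustrative sketch):

```python
import numpy as np
from scipy.stats import reciprocal

# reciprocal(a, b) is log-uniform on [a, b]: every decade between a and b is
# equally likely, so small learning rates are not crowded out by large ones.
samples = reciprocal(1e-4, 1e-2).rvs(size=10000, random_state=42)
print(samples.min() >= 1e-4, samples.max() <= 1e-2)   # True True
print(np.median(samples))   # close to sqrt(1e-4 * 1e-2) = 1e-3
```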
Inspect the best parameters found by the search:
print(random_search_cv.best_params_)
print(random_search_cv.best_score_)
print(random_search_cv.best_estimator_)
Output:
{'hidden_layers': 4, 'layer_size': 69, 'learning_rate': 0.0030034154155925037}
-0.3358098069826762
<tensorflow.python.keras.wrappers.scikit_learn.KerasRegressor object at 0x7f1878157c50>
The best values found are hidden_layers = 4, layer_size = 69, and learning_rate ≈ 0.003.
Evaluate the best model on the test set:
model = random_search_cv.best_estimator_.model
model.evaluate(x_test_scaled, y_test)
Test result:
162/162 [==============================] - 0s 2ms/step - loss: 0.3287
0.32874101400375366
Using sklearn's RandomizedSearchCV for hyperparameter search, the best model brings the test loss down to about 0.329.