After building a model, we still have to tune its hyperparameters before we get the best fit; even a decent model will produce poor results if its parameters are set badly. The usual approaches are grid search and random search: grid search sweeps the entire space and is therefore slow, while random search is fast but may miss important regions of the space and lose precision. Hyperopt is a tool that tunes parameters via Bayesian optimization, achieving good results quickly. Combined with MongoDB, Hyperopt can also distribute the search across machines to find good parameters fast. Note that you need to install the dev version to use simulated-annealing search; brute-force and random search strategies are supported as well.
(Bayesian optimization, also known as Sequential Model-Based Optimization (SMBO), is one of the most effective approaches to function optimization. Compared with standard strategies such as conjugate gradient descent, SMBO exploits smoothness without requiring gradients, handles real-valued, discrete, and conditional variables, and can optimize large numbers of variables in parallel.)
Let's go!!!
1 Installation
pip install hyperopt
Installing hyperopt also pulls in networkx. If you hit the error TypeError: 'generator' object is not subscriptable when calling it, downgrade networkx to version 1.11:
pip uninstall networkx
pip install networkx==1.11
2 Key Concepts
2.1 fmin
from hyperopt import fmin, tpe, hp

best = fmin(
    fn=lambda x: x,
    space=hp.uniform('x', 0, 1),
    algo=tpe.suggest,
    max_evals=100)
print(best)
Output: {'x': 0.0006154621520631152}
The fmin function first takes a function to minimize, passed as fn; here we supply lambda x: x. fn can be any function that returns a valid value, such as the mean absolute error in a regression. The next argument, space, specifies the search space; in this example it is the continuous range of numbers between 0 and 1, given by hp.uniform('x', 0, 1). hp.uniform is a built-in hyperopt function with three arguments: a name, x, and the lower and upper bounds of the range, 0 and 1.
The algo argument selects the search algorithm; here tpe stands for Tree of Parzen Estimators. The topic is beyond the scope of this post, but mathematically inclined readers can study the original paper. algo can also be set to hyperopt.random, which we do not cover here since random search is a well-known strategy.
Finally, max_evals sets the maximum number of evaluations fmin will perform. fmin returns a Python dict. If we raise max_evals to 1000, the output becomes {'x': 3.7023587264309516e-06}, noticeably closer to 0.
To build intuition, here is a slightly more complex example.
best = fmin(
    fn=lambda x: (x - 1)**2,
    space=hp.uniform('x', -2, 2),
    algo=tpe.suggest,
    max_evals=100)
print(best)
Output: {'x': 1.007633842139922}
2.2 space
Hyperopt provides several expressions for describing a variable's range and sampling distribution. Here is an example:
from hyperopt import hp
import hyperopt.pyll.stochastic

space = {
    'x': hp.uniform('x', 0, 1),
    'y': hp.normal('y', 0, 1),
    'name': hp.choice('name', ['alice', 'bob']),
}
print(hyperopt.pyll.stochastic.sample(space))
Output: {'y': -1.3901709472842074, 'x': 0.4335747017293238, 'name': 'bob'}
2.3 Capturing Information with Trials
Trials records, for every evaluation, which parameters were used and what was returned. For this, fn must return a dict containing a status in addition to the loss. A Trials object stores its data as BSON objects, which is what enables distributed runs backed by MongoDB.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from matplotlib import pyplot as plt

fspace = {
    'x': hp.uniform('x', -5, 5)
}

def f(params):
    x = params['x']
    val = x**2
    return {'loss': val, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=f, space=fspace, algo=tpe.suggest, max_evals=50, trials=trials)

print('best:', best)
print('trials:')
for trial in trials.trials[:2]:
    print(trial)
Returns with STATUS_OK have their loss counted, while returns with STATUS_FAIL are ignored. The output looks like this:
best: {'x': -0.0025882455372094326}
trials:
{'refresh_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 152000), 'book_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 152000), 'misc': {'tid': 0, 'idxs': {'x': [0]}, 'cmd': ('domain_attachment', 'FMinIter_Domain'), 'vals': {'x': [-2.511797855178682]}, 'workdir': None}, 'state': 2, 'tid': 0, 'exp_key': None, 'version': 0, 'result': {'status': 'ok', 'loss': 6.309128465280228}, 'owner': None, 'spec': None}
{'refresh_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 153000), 'book_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 153000), 'misc': {'tid': 1, 'idxs': {'x': [1]}, 'cmd': ('domain_attachment', 'FMinIter_Domain'), 'vals': {'x': [3.43836093884876]}, 'workdir': None}, 'state': 2, 'tid': 1, 'exp_key': None, 'version': 0, 'result': {'status': 'ok', 'loss': 11.822325945800927}, 'owner': None, 'spec': None}
Using these values you can plot a variable against the loss to see how well they match, or plot tid against a variable to see how the search converges (not in the strict mathematical sense) on a region.
A Trials object exposes:
- trials.trials - a list of dictionaries representing everything about the search
- trials.results - a list of dictionaries returned by 'objective' during the search
- trials.losses() - a list of losses (float for each 'ok' trial)
- trials.statuses() - a list of status strings
We can visualize the trials above in two ways: value vs. time, and loss vs. value.
f, ax = plt.subplots(1)
xs = [t['tid'] for t in trials.trials]
ys = [t['misc']['vals']['x'] for t in trials.trials]
ax.set_xlim(xs[0] - 10, xs[-1] + 10)
ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.75)
ax.set_title('$x$ $vs$ $t$ ', fontsize=18)
ax.set_xlabel('$t$', fontsize=16)
ax.set_ylabel('$x$', fontsize=16)
[Figure: scatter plot of $x$ vs $t$]
f, ax = plt.subplots(1)
xs = [t['misc']['vals']['x'] for t in trials.trials]
ys = [t['result']['loss'] for t in trials.trials]
ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.75)
ax.set_title('$val$ $vs$ $x$ ', fontsize=18)
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$val$', fontsize=16)
3 Applying Hyperopt
3.1 K-Nearest Neighbors
Note that since we are trying to maximize cross-validation accuracy while hyperopt only knows how to minimize a function, we must negate the accuracy: minimizing the negative of f is equivalent to maximizing f.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train(params):
    clf = KNeighborsClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space_knn = {'n_neighbors': hp.choice('n_neighbors', range(1, 100))}

def f(params):
    acc = hyperopt_train(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space_knn, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)
Output: best {'n_neighbors': 4}
f, ax = plt.subplots(1)  # , figsize=(10,10))
xs = [t['misc']['vals']['n_neighbors'] for t in trials.trials]
ys = [-t['result']['loss'] for t in trials.trials]
ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.5)
ax.set_title('Iris Dataset - KNN', fontsize=18)
ax.set_xlabel('n_neighbors', fontsize=12)
ax.set_ylabel('cross validation accuracy', fontsize=12)
Accuracy drops sharply once k exceeds 63. This comes down to the number of instances of each class in the dataset: there are only 50 instances of each of the three classes. So let's explore further by restricting 'n_neighbors' to smaller values:
'n_neighbors': hp.choice('n_neighbors', range(1, 50))
Rerunning with this space produces a plot in which we can clearly see the best value of k, 4.
3.2 Support Vector Machines (SVM)
Since this is a classification task, we use sklearn's SVC class. The code is as follows.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = SVC(**params)
    return cross_val_score(clf, X, y).mean()

space_svm = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
    'gamma': hp.uniform('gamma', 0, 20),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space_svm, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)

parameters = ['C', 'kernel', 'gamma']
cols = len(parameters)
f, axes = plt.subplots(nrows=1, ncols=cols, figsize=(20, 5))
cmap = plt.cm.jet
for i, val in enumerate(parameters):
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.25, c=cmap(float(i) / len(parameters)))
    axes[i].set_title(val)
    axes[i].set_ylim([0.9, 1.0])
Output: best: {'kernel': 3, 'C': 3.6332677642526985, 'gamma': 2.0192849151350796}
3.3 Decision Trees
We will try to optimize just a few of the decision tree's parameters. The code is as follows.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = DecisionTreeClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space_dt = {
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'max_features': hp.choice('max_features', range(1, 5)),
    'criterion': hp.choice('criterion', ["gini", "entropy"]),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space_dt, algo=tpe.suggest, max_evals=300, trials=trials)
print('best:', best)

parameters = ['max_depth', 'max_features', 'criterion']  # decision tree
cols = len(parameters)
f, axes = plt.subplots(nrows=1, ncols=cols, figsize=(20, 5))
cmap = plt.cm.jet

for i, val in enumerate(parameters):
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.25, c=cmap(float(i) / len(parameters)))
    axes[i].set_title(val)
    axes[i].set_ylim([0.9, 1.0])
Output: best: {'max_features': 1, 'criterion': 1, 'max_depth': 13}
3.4 Random Forests
Let's see what happens with the ensemble classifier, the random forest: a collection of decision trees trained on different partitions of the data, where each tree votes on the output class and the majority class becomes the prediction.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = RandomForestClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space4rf = {
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'max_features': hp.choice('max_features', range(1, 5)),
    'n_estimators': hp.choice('n_estimators', range(1, 20)),
    'criterion': hp.choice('criterion', ["gini", "entropy"]),
}

best = 0
def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=300, trials=trials)
print('best:', best)

parameters = ['n_estimators', 'max_depth', 'max_features', 'criterion']
f, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 5))
cmap = plt.cm.jet
for i, val in enumerate(parameters):
    print(i, val)
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    ys = np.array(ys)
    axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.25, c=cmap(float(i) / len(parameters)))
    axes[i].set_title(val)
Output: best: {'max_features': 3, 'n_estimators': 11, 'criterion': 1, 'max_depth': 2}
4 Tuning Across Multiple Models
Hyperopt can also search over many models and many parameters at once to find the best model together with its parameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    t = params['type']
    del params['type']
    if t == 'naive_bayes':
        clf = BernoulliNB(**params)
    elif t == 'svm':
        clf = SVC(**params)
    elif t == 'randomforest':  # this branch was missing in the original snippet
        clf = RandomForestClassifier(**params)
    elif t == 'dtree':
        clf = DecisionTreeClassifier(**params)
    elif t == 'knn':
        clf = KNeighborsClassifier(**params)
    else:
        return 0
    return cross_val_score(clf, X, y).mean()

space = hp.choice('classifier_type', [
    {
        'type': 'naive_bayes',
        'alpha': hp.uniform('alpha', 0.0, 2.0)
    },
    {
        'type': 'svm',
        'C': hp.uniform('C', 0, 10.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf']),
        'gamma': hp.uniform('gamma', 0, 20.0)
    },
    {
        'type': 'randomforest',
        # ('scale' flag dropped: RandomForestClassifier does not accept it)
        'max_depth': hp.choice('max_depth', range(1, 20)),
        'max_features': hp.choice('max_features', range(1, 5)),
        'n_estimators': hp.choice('n_estimators', range(1, 20)),
        'criterion': hp.choice('criterion', ["gini", "entropy"])
    },
    {
        'type': 'knn',
        'n_neighbors': hp.choice('knn_n_neighbors', range(1, 50))
    }
])

count = 0
best = 0
def f(params):
    global best, count
    count += 1
    acc = hyperopt_train_test(params.copy())
    if acc > best:
        print('new best:', acc, 'using', params['type'])
        best = acc
    if count % 50 == 0:
        print('iters:', count, ', acc:', acc, 'using', params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space, algo=tpe.suggest, max_evals=1500, trials=trials)
print('best:', best)
Output: best: {'kernel': 0, 'C': 1.4211568317201784, 'classifier_type': 1, 'gamma': 8.74017707300719}