After building a model, we still have to tune its hyperparameters before we get the best fit; even a decent model will produce poor results if its parameters are set badly. The usual approaches are grid search and random search: grid search sweeps the entire space and is therefore slow, while random search is fast but may miss important regions of the space and lose precision. Hyperopt is a tool that tunes parameters via Bayesian optimization, achieving good results quickly. Combined with MongoDB, Hyperopt can also distribute the search across machines to find good parameters fast. Note that you need to install the dev version to use simulated-annealing search; brute-force and random search strategies are supported as well.
(Bayesian optimization, also known as Sequential Model-Based Optimization (SMBO), is one of the most effective approaches to function optimization. Compared with standard strategies such as conjugate gradient descent, SMBO exploits smoothness without requiring gradients, handles real-valued, discrete, and conditional variables, and can optimize large numbers of variables in parallel.)
Let's go!!!
1 Installation
pip install hyperopt
Installing hyperopt also pulls in networkx. If you hit the error TypeError: 'generator' object is not subscriptable when calling it, downgrade networkx to version 1.11:
pip uninstall networkx
pip install networkx==1.11
2 Key Concepts
2.1 fmin
from hyperopt import fmin, tpe, hp

best = fmin(
    fn=lambda x: x,
    space=hp.uniform('x', 0, 1),
    algo=tpe.suggest,
    max_evals=100)
print(best)
Output: {'x': 0.0006154621520631152}
The fmin function first takes a function to minimize, passed as fn; here we supply lambda x: x. fn can be any function that returns a valid value, such as the mean absolute error in a regression. The next argument, space, specifies the search space; in this example it is the continuous range of numbers between 0 and 1, given by hp.uniform('x', 0, 1). hp.uniform is a built-in hyperopt function with three arguments: a name, x, and the lower and upper bounds of the range, 0 and 1.
The algo argument selects the search algorithm; here tpe stands for Tree of Parzen Estimators. The topic is beyond the scope of this post, but mathematically inclined readers can study the original paper. algo can also be set to hyperopt.random, which we do not cover here since random search is a well-known strategy.
Finally, max_evals sets the maximum number of evaluations fmin will perform. fmin returns a Python dict. If we raise max_evals to 1000, the output becomes {'x': 3.7023587264309516e-06}, noticeably closer to 0.
To build intuition, here is a slightly more complex example.
best = fmin(
    fn=lambda x: (x - 1)**2,
    space=hp.uniform('x', -2, 2),
    algo=tpe.suggest,
    max_evals=100)
print(best)
Output: {'x': 1.007633842139922}
2.2 space
Hyperopt provides several expressions for describing a variable's range and sampling distribution. Here is an example:
from hyperopt import hp
import hyperopt.pyll.stochastic

space = {
    'x': hp.uniform('x', 0, 1),
    'y': hp.normal('y', 0, 1),
    'name': hp.choice('name', ['alice', 'bob']),
}
print(hyperopt.pyll.stochastic.sample(space))
Output: {'y': -1.3901709472842074, 'x': 0.4335747017293238, 'name': 'bob'}
2.3 Capturing Information with Trials
Trials records, for every evaluation, which parameters were used and what was returned. For this, fn must return a dict containing a status in addition to the loss. A Trials object stores its data as BSON objects, which is what enables distributed runs backed by MongoDB.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from matplotlib import pyplot as plt

fspace = {
    'x': hp.uniform('x', -5, 5)
}

def f(params):
    x = params['x']
    val = x**2
    return {'loss': val, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=f, space=fspace, algo=tpe.suggest, max_evals=50, trials=trials)

print('best:', best)
print('trials:')
for trial in trials.trials[:2]:
    print(trial)
Returns with STATUS_OK have their loss counted, while returns with STATUS_FAIL are ignored. The output looks like this:
best: {'x': -0.0025882455372094326}
trials:
{'refresh_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 152000), 'book_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 152000), 'misc': {'tid': 0, 'idxs': {'x': [0]}, 'cmd': ('domain_attachment', 'FMinIter_Domain'), 'vals': {'x': [-2.511797855178682]}, 'workdir': None}, 'state': 2, 'tid': 0, 'exp_key': None, 'version': 0, 'result': {'status': 'ok', 'loss': 6.309128465280228}, 'owner': None, 'spec': None}
{'refresh_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 153000), 'book_time': datetime.datetime(2018, 12, 5, 3, 5, 43, 153000), 'misc': {'tid': 1, 'idxs': {'x': [1]}, 'cmd': ('domain_attachment', 'FMinIter_Domain'), 'vals': {'x': [3.43836093884876]}, 'workdir': None}, 'state': 2, 'tid': 1, 'exp_key': None, 'version': 0, 'result': {'status': 'ok', 'loss': 11.822325945800927}, 'owner': None, 'spec': None}
Using these values you can plot a variable against the loss to see how well they match, or plot tid against a variable to see how the search converges (not in the strict mathematical sense) on a region.
A Trials object exposes:
- trials.trials - a list of dictionaries representing everything about the search
- trials.results - a list of dictionaries returned by 'objective' during the search
- trials.losses() - a list of losses (float for each 'ok' trial)
- trials.statuses() - a list of status strings
We can visualize the trials above in two ways: value vs. time, and loss vs. value.
f, ax = plt.subplots(1)
xs = [t['tid'] for t in trials.trials]
ys = [t['misc']['vals']['x'] for t in trials.trials]
ax.set_xlim(xs[0] - 10, xs[-1] + 10)
ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.75)
ax.set_title('$x$ $vs$ $t$ ', fontsize=18)
ax.set_xlabel('$t$', fontsize=16)
ax.set_ylabel('$x$', fontsize=16)
[Figure: scatter plot of $x$ vs $t$]
f, ax = plt.subplots(1)
xs = [t['misc']['vals']['x'] for t in trials.trials]
ys = [t['result']['loss'] for t in trials.trials]
ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.75)
ax.set_title('$val$ $vs$ $x$ ', fontsize=18)
ax.set_xlabel('$x$', fontsize=16)
ax.set_ylabel('$val$', fontsize=16)
3 Applying Hyperopt
3.1 K-Nearest Neighbors
Note that since we are trying to maximize cross-validation accuracy while hyperopt only knows how to minimize a function, we must negate the accuracy: minimizing the negative of f is equivalent to maximizing f.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train(params):
    clf = KNeighborsClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space_knn = {'n_neighbors': hp.choice('n_neighbors', range(1, 100))}

def f(params):
    acc = hyperopt_train(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space_knn, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)
Output: best {'n_neighbors': 4}
f, ax = plt.subplots(1)  # , figsize=(10,10))
xs = [t['misc']['vals']['n_neighbors'] for t in trials.trials]
ys = [-t['result']['loss'] for t in trials.trials]
ax.scatter(xs, ys, s=20, linewidth=0.01, alpha=0.5)
ax.set_title('Iris Dataset - KNN', fontsize=18)
ax.set_xlabel('n_neighbors', fontsize=12)
ax.set_ylabel('cross validation accuracy', fontsize=12)
Accuracy drops sharply once k exceeds 63. This comes down to the number of instances of each class in the dataset: there are only 50 instances of each of the three classes. So let's explore further by restricting 'n_neighbors' to smaller values:
'n_neighbors': hp.choice('n_neighbors', range(1, 50))
Rerunning with this space produces a plot in which we can clearly see the best value of k, 4.
3.2 Support Vector Machines (SVM)
Since this is a classification task, we use sklearn's SVC class. The code is as follows.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = SVC(**params)
    return cross_val_score(clf, X, y).mean()

space_svm = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
    'gamma': hp.uniform('gamma', 0, 20),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space_svm, algo=tpe.suggest, max_evals=100, trials=trials)
print('best:', best)

parameters = ['C', 'kernel', 'gamma']
cols = len(parameters)
f, axes = plt.subplots(nrows=1, ncols=cols, figsize=(20, 5))
cmap = plt.cm.jet
for i, val in enumerate(parameters):
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.25, c=cmap(float(i) / len(parameters)))
    axes[i].set_title(val)
    axes[i].set_ylim([0.9, 1.0])
Output: best: {'kernel': 3, 'C': 3.6332677642526985, 'gamma': 2.0192849151350796}
3.3 Decision Trees
We will try to optimize just a few of the decision tree's parameters. The code is as follows.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = DecisionTreeClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space_dt = {
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'max_features': hp.choice('max_features', range(1, 5)),
    'criterion': hp.choice('criterion', ["gini", "entropy"]),
}

def f(params):
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space_dt, algo=tpe.suggest, max_evals=300, trials=trials)
print('best:', best)

parameters = ['max_depth', 'max_features', 'criterion']  # decision tree
cols = len(parameters)
f, axes = plt.subplots(nrows=1, ncols=cols, figsize=(20, 5))
cmap = plt.cm.jet

for i, val in enumerate(parameters):
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.25, c=cmap(float(i) / len(parameters)))
    axes[i].set_title(val)
    axes[i].set_ylim([0.9, 1.0])
Output: best: {'max_features': 1, 'criterion': 1, 'max_depth': 13}
3.4 Random Forests
Let's see what happens with the ensemble classifier, the random forest: a collection of decision trees trained on different partitions of the data, where each tree votes on the output class and the majority class becomes the prediction.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = RandomForestClassifier(**params)
    return cross_val_score(clf, X, y).mean()

space4rf = {
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'max_features': hp.choice('max_features', range(1, 5)),
    'n_estimators': hp.choice('n_estimators', range(1, 20)),
    'criterion': hp.choice('criterion', ["gini", "entropy"]),
}

best = 0
def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=300, trials=trials)
print('best:', best)

parameters = ['n_estimators', 'max_depth', 'max_features', 'criterion']
f, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 5))
cmap = plt.cm.jet
for i, val in enumerate(parameters):
    print(i, val)
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    ys = np.array(ys)
    axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.25, c=cmap(float(i) / len(parameters)))
    axes[i].set_title(val)
Output: best: {'max_features': 3, 'n_estimators': 11, 'criterion': 1, 'max_depth': 2}
4 Tuning Across Multiple Models
Hyperopt can also search over many models and many parameters at once to find the best model together with its parameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from hyperopt import hp, STATUS_OK, Trials, fmin, tpe
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    t = params['type']
    del params['type']
    if t == 'naive_bayes':
        clf = BernoulliNB(**params)
    elif t == 'svm':
        clf = SVC(**params)
    elif t == 'randomforest':  # this branch was missing in the original snippet
        clf = RandomForestClassifier(**params)
    elif t == 'dtree':
        clf = DecisionTreeClassifier(**params)
    elif t == 'knn':
        clf = KNeighborsClassifier(**params)
    else:
        return 0
    return cross_val_score(clf, X, y).mean()

space = hp.choice('classifier_type', [
    {
        'type': 'naive_bayes',
        'alpha': hp.uniform('alpha', 0.0, 2.0)
    },
    {
        'type': 'svm',
        'C': hp.uniform('C', 0, 10.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf']),
        'gamma': hp.uniform('gamma', 0, 20.0)
    },
    {
        'type': 'randomforest',
        # ('scale' flag dropped: RandomForestClassifier does not accept it)
        'max_depth': hp.choice('max_depth', range(1, 20)),
        'max_features': hp.choice('max_features', range(1, 5)),
        'n_estimators': hp.choice('n_estimators', range(1, 20)),
        'criterion': hp.choice('criterion', ["gini", "entropy"])
    },
    {
        'type': 'knn',
        'n_neighbors': hp.choice('knn_n_neighbors', range(1, 50))
    }
])

count = 0
best = 0
def f(params):
    global best, count
    count += 1
    acc = hyperopt_train_test(params.copy())
    if acc > best:
        print('new best:', acc, 'using', params['type'])
        best = acc
    if count % 50 == 0:
        print('iters:', count, ', acc:', acc, 'using', params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space, algo=tpe.suggest, max_evals=1500, trials=trials)
print('best:', best)
Output: best: {'kernel': 0, 'C': 1.4211568317201784, 'classifier_type': 1, 'gamma': 8.74017707300719}