Installing and Debugging Auto-sklearn on Ubuntu 14: A Summary

XianMing's blog

Article link: https://blog.csdn.net/xummgg/article/details/80274009

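1. Overview
Purpose of this exercise: my company needed a comparison of algorithms, and it was a good opportunity to learn the tool along the way. Since I had not updated this blog for a long time, this post concentrates on practical use; I will dig into the theory later when I have time.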
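2. Principles
2.1 What is auto-sklearn
Figure 1: Auto-sklearn framework architecture (taken from the 2015 paper, when only classification was supported; current versions also support regression)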

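Auto-sklearn is an automated machine learning framework whose structure is shown in Figure 1. The user only supplies the data and labels; the framework then performs data preprocessing, feature preprocessing, and (classification/regression) algorithm selection automatically, and the final model can be exported, stored, and reused.
auto-sklearn took first place in the machine learning blog contest held by KDnuggets. Other popular automated machine learning frameworks include auto_ml and TPOT.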

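2.2 A brief introduction to the principle
Auto-sklearn optimizes hyperparameters with Bayesian optimization, which iterates over the following steps:
1) Build a probabilistic model that captures the relationship between hyperparameter settings and machine learning performance.
2) Use this model to pick promising hyperparameter settings to try next, trading off exploration against exploitation. Exploration means probing regions of the space the model knows little about; exploitation means focusing on the parts of the known space that already perform well.
3) Run the machine learning algorithm with the chosen hyperparameter settings.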
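
The three steps above can be sketched as a small loop. This is only a toy illustration under my own assumptions, not auto-sklearn's or SMAC's implementation: `objective` is a hypothetical stand-in for "train a model and return its validation loss", the surrogate is a crude kernel-weighted average rather than a real probabilistic model, and the names `surrogate`, `optimize`, and `kappa` are invented for this example.

```python
import math
import random


def objective(x):
    # Hypothetical stand-in for "train with hyperparameter x and return the
    # validation loss"; the best value is at x = 0.3.
    return (x - 0.3) ** 2


def surrogate(x, observed, bandwidth=0.1):
    """Step 1: estimate the loss at x from past observations.

    Returns a kernel-weighted mean of the observed losses and a simple
    uncertainty measure (the distance to the nearest evaluated point).
    """
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi, _ in observed]
    total = sum(weights)
    if total < 1e-12:
        mean = sum(yi for _, yi in observed) / len(observed)
    else:
        mean = sum(w * yi for w, (_, yi) in zip(weights, observed)) / total
    uncertainty = min(abs(x - xi) for xi, _ in observed)
    return mean, uncertainty


def optimize(n_iter=20, n_candidates=200, kappa=1.0, seed=0):
    rng = random.Random(seed)
    # Start from a few random configurations.
    observed = [(x, objective(x)) for x in (rng.random() for _ in range(3))]
    for _ in range(n_iter):
        candidates = [rng.random() for _ in range(n_candidates)]

        def acquisition(x):
            # Step 2: a low predicted loss (exploitation) and a high
            # uncertainty (exploration) both make a candidate attractive.
            mean, uncertainty = surrogate(x, observed)
            return mean - kappa * uncertainty

        chosen = min(candidates, key=acquisition)
        # Step 3: evaluate the chosen setting and record the result.
        observed.append((chosen, objective(chosen)))
    return min(observed, key=lambda pair: pair[1])
```

Calling `optimize()` returns the best `(x, loss)` pair seen so far; raising `kappa` makes the search more exploratory.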

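The process is described in more detail below:
It amounts to jointly selecting the algorithm, the preprocessing method, and their hyperparameters. Concretely, the choice of classifier/regressor and the choice of preprocessing method are top-level categorical hyperparameters, and the hyperparameters of a method are activated only when that method is selected. This combined space is searched with Bayesian optimization, using a variant suited to high-dimensional conditional spaces: SMAC, which is based on random forests and is reported to work best for this kind of problem.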
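
To make "conditionally activated" concrete, here is a minimal sketch of sampling from such a space. The space itself (`SEARCH_SPACE`, its components and ranges) is a made-up, heavily simplified example, not auto-sklearn's real configuration space.

```python
import random

# Hypothetical search space: each top-level choice owns its hyperparameters.
SEARCH_SPACE = {
    "preprocessor": {
        "none": {},
        "pca": {"n_components": (2, 20)},
    },
    "classifier": {
        "svm": {"C": (0.01, 100.0)},
        "random_forest": {"n_estimators": (10, 200)},
    },
}


def sample_configuration(rng):
    """Draw one configuration; only the chosen methods' parameters appear."""
    config = {}
    for component, choices in SEARCH_SPACE.items():
        choice = rng.choice(sorted(choices))  # top-level categorical choice
        config[component] = choice
        # Hyperparameters of the selected method are activated; those of the
        # methods that were not selected never enter the configuration.
        for name, (low, high) in choices[choice].items():
            if isinstance(low, int):
                value = rng.randint(low, high)
            else:
                value = rng.uniform(low, high)
            config["{}:{}".format(choice, name)] = value
    return config
```

For example, `sample_configuration(random.Random(0))` yields a dictionary with a `preprocessor` and a `classifier` entry plus only the parameters belonging to those two choices.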
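In practical terms, Auto-sklearn is a drop-in replacement for a scikit-learn estimator, so a working scikit-learn installation is required to take advantage of it. Auto-sklearn also supports parallel execution by sharing data on a shared file system, and it can make use of scikit-learn's model persistence.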

References:
https://www.leiphone.com/news/201701/dKfVIWiDaWvdMqKu.html?winzoom=1&viewType=weixin

https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf

3. Installation
3.1 System requirements
• Linux operating system (for example Ubuntu),
• Python (>=3.5).
• C++ compiler (with C++11 support) and SWIG (version 3.0 or later)
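
Before installing anything, these requirements can be checked quickly from Python. This is a rough sketch of my own: the function name `check_requirements` and the executable names it looks for (`g++`, `c++`, `swig`, `swig3.0`) are assumptions, and it only checks that the tools are on the PATH, not their versions.

```python
import shutil
import sys


def check_requirements():
    """Report whether the basic prerequisites appear to be present."""
    return {
        # auto-sklearn needs Python 3.5 or later.
        "python_ok": sys.version_info >= (3, 5),
        # Path of a C++ compiler on the PATH, or None if none was found.
        "compiler": shutil.which("g++") or shutil.which("c++"),
        # Path of a SWIG executable on the PATH, or None if none was found.
        "swig": shutil.which("swig") or shutil.which("swig3.0"),
    }
```

Printing `check_requirements()` shows at a glance which of the three items still needs attention.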
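3.2 Installation steps
This test was done on Ubuntu 14.04 with gcc 4.8.4.
3.2.1 Upgrading to Python 3.5
The default Python on Ubuntu 14 is 2.7, so to meet the requirements in 3.1 it has to be upgraded to Python 3.5 or later.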

Add the PPA:

sudo add-apt-repository ppa:fkrull/deadsnakes
sudo apt-get update

Install Python 3.5:

sudo apt-get install python3.5
sudo apt-get install python3.5-dev
sudo apt-get install libncurses5-dev

Move the original Python 3.4 aside and link python3 to the new 3.5:

sudo mv /usr/bin/python3 /usr/bin/python3-old
sudo ln -s /usr/bin/python3.5 /usr/bin/python3

Install a current pip:

wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo pip3 install setuptools --upgrade
sudo pip3 install ipython[all]

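Move the original Python 2.7 aside and link python to the new 3.5 (system scripts that expect Python 2 under /usr/bin/python may stop working after this):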

sudo mv /usr/bin/python /usr/bin/python-old
sudo ln -s /usr/bin/python3.5 /usr/bin/python

Reference:
https://www.jianshu.com/p/4f4b2ed568f4

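3.2.2 Installing SWIG 3
Following the official instructions (sudo apt-get install build-essential swig) installs SWIG 2 by default, and the later installation of pyrfr then fails with "Can not install pyrfr, error: command 'swig' failed with exit". Following the answer to this error in https://github.com/automl/auto-sklearn/issues/314, install SWIG 3 instead: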

sudo apt-get install swig3.0
sudo ln -s /usr/bin/swig3.0 /usr/bin/swig

The first command installs SWIG 3; the second creates a symbolic link so that the swig command points to it.
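
To confirm that the swig command now resolves to version 3, its version banner can be parsed. A small sketch, assuming the banner contains a line of the form "SWIG Version X.Y.Z" (which is what `swig -version` prints); the helper names are my own.

```python
import re
import shutil
import subprocess


def parse_swig_version(output):
    """Extract (major, minor, patch) from the text printed by `swig -version`."""
    match = re.search(r"SWIG Version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_swig_version():
    """Return the version of the `swig` found on the PATH, or None."""
    if shutil.which("swig") is None:
        return None
    result = subprocess.run(
        ["swig", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_swig_version(result.stdout)
```

If `installed_swig_version()` returns a tuple smaller than `(3, 0, 0)`, the symbolic link still points at SWIG 2.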
3.2.3 Installing auto-sklearn

Install all of auto-sklearn's dependencies:

curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 sudo pip install

Install auto-sklearn:

sudo pip install auto-sklearn

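Note: after installation, calling fit to train a model fails with an error in rf_with_instances.py, so the source of rf_with_instances.py inside SMAC3 has to be modified. Reference:
https://github.com/automl/SMAC3/issues/298
The steps are as follows: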

cd /usr/local/lib/python3.5/dist-packages/smac/epm
sudo vim rf_with_instances.py

Change this line:

self.rf.fit(data, rng=self.rng)

to:

self.rf.fit(data, self.rng)
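
If you would rather not edit the file by hand, the same one-line change can be applied with a few lines of Python. This is just a convenience sketch: `patch_source` and `patch_file` are names I made up, the script keeps a `.bak` copy, and it must be run with enough permissions to write to the dist-packages directory shown above.

```python
OLD_CALL = "self.rf.fit(data, rng=self.rng)"
NEW_CALL = "self.rf.fit(data, self.rng)"


def patch_source(text):
    """Return the patched text and the number of replaced calls."""
    return text.replace(OLD_CALL, NEW_CALL), text.count(OLD_CALL)


def patch_file(path):
    """Patch the file in place, keeping a backup next to it."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    patched, count = patch_source(original)
    if count:
        with open(path + ".bak", "w", encoding="utf-8") as backup:
            backup.write(original)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(patched)
    return count
```

On the setup described here the call would be `patch_file("/usr/local/lib/python3.5/dist-packages/smac/epm/rf_with_instances.py")`; a return value of 0 means the line was not found or was already patched.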

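4. Debugging
Auto-sklearn currently supports only supervised learning, that is, classification and regression (the official site says support for deep learning and other areas is hoped for in the future).
I ran two experiments. The first applies auto-sklearn classification to handwritten digit recognition, see 4.1. The second applies auto-sklearn regression to IPTV user count prediction (forecasting only 10 minutes ahead), see 4.2.
4.1 Handwritten digit recognition (classification)
Run the following code (the official example) in ipython to train a handwritten digit recognizer (a classification problem). The accuracy reaches 99.3%. Running this example takes 1 hour.
(Image: code screenshot, not preserved.)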

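To shorten the run time, add the parameters time_left_for_this_task and per_run_time_limit. time_left_for_this_task is the total time budget for the task in seconds; the default is 3600, which is why the example runs for 1 hour. per_run_time_limit is the time limit in seconds for each individual algorithm run. In the reference code, time_left_for_this_task is set to 60 seconds and per_run_time_limit to 10 seconds:

(Image: code screenshot, not preserved.)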
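
Since the screenshots are not available, here is my reconstruction of what the shortened example looks like. It follows the structure of the official digits example as far as I remember it; the wrapper function `run_digits_example` and its argument names are my own, and I have not re-run it, so treat it as a sketch rather than the exact code from the screenshot.

```python
def run_digits_example(total_seconds=60, per_run_seconds=10):
    """Train auto-sklearn on the digits data set and return the accuracy."""
    # Imports are kept inside the function so that this file can be loaded
    # on a machine where auto-sklearn is not installed.
    import autosklearn.classification
    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=1
    )
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=total_seconds,  # total budget in seconds
        per_run_time_limit=per_run_seconds,     # limit for each single run
    )
    automl.fit(X_train, y_train)
    y_hat = automl.predict(X_test)
    return sklearn.metrics.accuracy_score(y_test, y_hat)
```

Calling `run_digits_example()` uses the 60/10 second budget described above; `run_digits_example(3600, 360)` corresponds to the one-hour default budget.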

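The default memory limit in the API is 3 GB and can be configured; see the API documentation: http://automl.github.io/auto-sklearn/stable/api.html
For real applications, the official recommendation is to let auto-sklearn run for 24 hours, and the longer the better.

In addition, auto-sklearn models are stored the same way as scikit-learn models, which is also used in 4.2: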

from sklearn.externals import joblib
…………
automl.fit(X_train, y_train)
joblib.dump(automl, '.\\model\\automl_train.m')
…………
automl_new = joblib.load('.\\model\\automl_train.m')
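
Two things are worth knowing about the snippet above: `sklearn.externals.joblib` was removed in scikit-learn 0.23 (newer setups use `import joblib` directly), and on Linux the backslashes in the path are part of the file name rather than directory separators. Below is a portable sketch of the same save/load round trip using the standard library's pickle, assuming the fitted estimator can be pickled like a scikit-learn model; `save_model` and `load_model` are illustrative names of my own.

```python
import os
import pickle


def save_model(model, path):
    """Serialize a fitted estimator, creating the target directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load an estimator written by save_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With this, `save_model(automl, os.path.join("model", "automl_train.m"))` followed by `load_model(...)` with the same path replaces the dump/load pair. As with any pickle file, only load files you created yourself.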

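4.2 IPTV user count prediction (regression)
Section 4.1 was a quick trial of the official example. This case is IPTV user count prediction (a regression problem): the data (internal company production data) is read from the MySQL database at 10.10.1.209 and fed to auto-sklearn for automated machine learning.

The core code is as follows:

(Image: code screenshot, not preserved.)
With only 1 minute of training, the predictions reach an MSE (mean squared error) of 0.0007; my earlier attempt with GBDT gave an MSE of 0.0006.
According to show_models(), two algorithms were tried in total: first SVR (the regression variant of the support vector machine), then GBDT.
With a longer run time, auto-sklearn can try more data preprocessing methods and feature processing methods, and cycle through more machine learning algorithms.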
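
For reference, the MSE quoted above is the mean of the squared differences between true and predicted values, which is what sklearn.metrics.mean_squared_error computes. A plain-Python sketch of the definition (the RMSE helper is added only to show the relationship between the two measures):

```python
import math


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))
```

Because the target in this experiment is a ratio close to 1, an MSE of 0.0007 corresponds to a typical error of a few percent in the ratio.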

The complete code (with 1 hour of training) is as follows:

# Using auto-sklearn
try:
    import joblib  # standalone package, needed with scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # bundled with older scikit-learn
import numpy as np
import pandas as pd
import MySQLdb

import autosklearn.regression
import sklearn.model_selection
import sklearn.metrics

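# Holiday weight per date: 1 for full holidays, smaller values for the days
# at the edges of a holiday period, 0 for every other day.
HOLIDAY_WEIGHTS = {
    '2017-04-01': 0.2,
    '2017-04-02': 1,
    '2017-04-03': 1,
    '2017-04-04': 0.8,
    '2017-04-28': 0.2,
    '2017-04-29': 1,
    '2017-04-30': 1,
    '2017-05-01': 0.8,
}

def isholiday(data):
    # Callers pass str(Timestamp), e.g. '2017-04-01 13:40:00', so only the
    # leading 'YYYY-MM-DD' part is used for the lookup.
    return HOLIDAY_WEIGHTS.get(str(data)[:10], 0)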

def getResult():
    # Read (time, online_number) rows for November 2017 from the internal database.
    db = MySQLdb.connect("10.10.1.209", "root", "123456", "db_iptv_10")
    try:
        cursor = db.cursor()
        sql = "select time, online_number from hh_online_smooth_gdbt_province_minute where SUBSTR(time, 1, 6)='201711';"
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        db.close()

def getFeatures(time):
    # Feature row: [10-minute slot of the day, ISO weekday, holiday weight].
    time = pd.to_datetime(time)
    time_index = time.hour * 6 + time.minute / 10
    workday = time.isoweekday()
    holiday = isholiday(str(time))
    return [[time_index, workday, holiday]]

def getCollection():
    rows = getResult()
    data_column = []
    time_column = []
    for row in rows:
        data_column.append(int(row[1]))
        time_column.append(pd.to_datetime(str(row[0])))
    data = pd.DataFrame(data_column, index=time_column)
    print('------------------')
    # Target at time t is online_number[t] / online_number[t + 1], i.e. the
    # ratio of the current value to the value one step (10 minutes) later.
    shift_data = (data.shift() / data).shift(-1)
    collections = []
    for time, online_number in zip(shift_data.index, shift_data.iloc[:, 0]):
        # 10-minute slot of the day (0 to 143)
        time_index = time.hour * 6 + time.minute / 10
        weekday = time.isoweekday()
        holiday = isholiday(str(time))
        collections.append([time_index, weekday, holiday, online_number])

    collections = np.array(collections)
    # Drop rows whose target is NaN (the last row has no successor).
    collections = collections[~np.isnan(collections[:, 3])]

    return collections

def train_model():
    collections = getCollection()
    print(collections)
    features = collections[:, :3]
    targets = collections[:, 3]  # 1-D target vector

    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(features, targets, random_state=1)
    # Total budget of one hour, at most 360 seconds per individual run.
    automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=360)
    automl.fit(X_train, y_train)
    # On Linux the backslashes are part of the file name, so this writes a
    # single file into the current directory.
    joblib.dump(automl, '.\\model\\automl_train.m')
    y_hat = automl.predict(X_test)
    print("model", automl.show_models())
    print("mean_squared_error", sklearn.metrics.mean_squared_error(y_test, y_hat))
    print("mean_absolute_error", sklearn.metrics.mean_absolute_error(y_test, y_hat))
    print("median_absolute_error", sklearn.metrics.median_absolute_error(y_test, y_hat))

def predict_model(time, number):
    # time: timestamp of the latest observation; number: its online user count.
    automl = joblib.load('.\\model\\automl_train.m')
    time = pd.to_datetime(time)
    time_index = time.hour * 6 + time.minute / 10
    weekday = time.isoweekday()
    holiday = isholiday(str(time))
    feature = [[time_index, weekday, holiday]]
    y_predict = automl.predict(np.array(feature))
    # The model predicts current / next, so next = current / predicted ratio.
    data = round(number / y_predict[0])
    print(feature, data)

if __name__ == '__main__':

    train_model()
    predict_model('201712131340', 495183)
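
The feature and target construction in the listing above can be tried without the database or auto-sklearn. The following is a simplified stand-alone sketch using only the standard library; it leaves out the holiday weight, and the function names and sample numbers are invented for illustration.

```python
from datetime import datetime


def make_features(moment):
    """[10-minute slot of the day (0 to 143), ISO weekday (1 = Monday)]."""
    slot = moment.hour * 6 + moment.minute // 10
    return [slot, moment.isoweekday()]


def ratio_targets(counts):
    """counts[t] / counts[t + 1] for every t that has a successor."""
    return [current / following for current, following in zip(counts, counts[1:])]


def next_count(current, predicted_ratio):
    """Invert the target definition: next = current / ratio."""
    return round(current / predicted_ratio)
```

For example, `make_features(datetime(2017, 12, 13, 13, 40))` gives slot 82 on a Wednesday, and feeding a ratio from `ratio_targets` back into `next_count` recovers the following count. Note that a count of zero would cause a division by zero, which the sketch does not handle.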