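A Fully Automated Machine Learning Tool: H2O AutoML

Introduction

Anyone who trains machine learning models regularly has probably racked their brains over cross-validation and model ensembling. In this post I introduce a tool that takes much of that work off your hands: H2O. After reading this article, you will:

  • Understand what H2O is and where it shines
  • Know how to install H2O and take your first steps with it
  • Be itching to install it and try it yourself (haha)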

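H2O Overview

H2O is an open-source, in-memory, distributed, fast, and scalable platform for machine learning and predictive analytics. It lets you build machine learning models on big data and makes it easy to put those models into production in an enterprise environment.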

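H2O's core code is written in Java. Inside H2O, a distributed key/value store is used to access and reference data, models, objects, and so on across all nodes and machines. The algorithms are implemented on top of H2O's distributed Map/Reduce framework and use the Java Fork/Join framework for multithreading. Data is read in parallel, distributed across the cluster, and stored in memory in a compressed, columnar format. H2O's data parser has built-in intelligence to guess the schema of an incoming dataset, and it supports ingesting data from multiple sources in a variety of formats.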
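To make the Map/Reduce idea above concrete, here is a toy sketch in plain Python. It is not H2O's code, and the function names (`split_into_chunks`, `map_chunk`, `reduce_pair`, `distributed_mean`) are made up for illustration. Threads stand in for cluster nodes: each one summarises only its own chunk of a column, and the partial results are then merged.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split a column into contiguous chunks, one per hypothetical node."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_chunk(chunk):
    """Map step: summarise one chunk as (sum, count) without seeing the others."""
    return (sum(chunk), len(chunk))


def reduce_pair(left, right):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, n_chunks=4):
    """Compute the mean of a non-empty column from per-chunk summaries."""
    chunks = split_into_chunks(values, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(reduce_pair, partials, (0.0, 0))
    return total / count


column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mean_value = distributed_mean(column)
print(mean_value)
```

The point of the pattern is that only the small `(sum, count)` pairs travel between workers, never the raw data, which is what makes this style of computation scale across machines.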

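H2O's REST API lets external programs and scripts access all of H2O's functionality via JSON over HTTP. The REST API is what H2O's web interface (Flow UI), the R binding (H2O-R), and the Python binding (H2O-Python) all use under the hood.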
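As a rough sketch of what "JSON over HTTP" means in practice, the snippet below asks a running H2O server for its cluster status using only the Python standard library. It assumes a server is listening on the default address http://127.0.0.1:54321 and uses the `/3/Cloud` endpoint; the response field names read in `summarize_cloud` (`version`, `cloud_size`, `cloud_healthy`) are my assumption about the payload, so they are read defensively with `.get`. The helper names are hypothetical.

```python
import json
import urllib.request


def cloud_status_url(base_url):
    """Build the URL of the cluster-status endpoint for a given H2O server."""
    return base_url.rstrip("/") + "/3/Cloud"


def fetch_cloud_status(base_url="http://127.0.0.1:54321", timeout=5):
    """GET the cluster status as a dict. Requires a running H2O server."""
    with urllib.request.urlopen(cloud_status_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_cloud(payload):
    """Pick a few fields out of the status payload (field names are assumed)."""
    return {
        "version": payload.get("version"),
        "cloud_size": payload.get("cloud_size"),
        "cloud_healthy": payload.get("cloud_healthy"),
    }


if __name__ == "__main__":
    try:
        print(summarize_cloud(fetch_cloud_status()))
    except OSError as exc:  # covers connection refused, timeouts, URLError
        print("No H2O server reachable:", exc)
```

In everyday use you would rarely call the REST API by hand, since the Python binding wraps it for you, but it is handy for health checks from scripts that should not depend on the `h2o` package.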

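The speed, quality, ease of use, and straightforward model deployment of its many supervised and unsupervised algorithms, such as Deep Learning, Tree Ensembles, and GLRM, have made H2O a very popular API for big-data data science.

Installing H2O and Using AutoML

Installing H2O (Python)

Scala, R, and Python are all optional for using H2O, but Java is always required: it must be installed, because the H2O server itself runs on the JVM. Below we walk through installing H2O in a Python environment.
First, install the dependencies: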

$ pip install requests
$ pip install tabulate
$ pip install scikit-learn

Next, download and install H2O:

$ pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

The package is a little over 100 MB.
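Before moving on, it can save some head-scratching to check that both prerequisites are in place. The following is a minimal sketch using only the standard library; `check_h2o_prerequisites` is a hypothetical helper, and it only checks that a `java` executable is on the PATH and that the `h2o` package can be found, not that their versions are compatible.

```python
import importlib.util
import shutil


def check_h2o_prerequisites():
    """Report whether Java is on the PATH and the h2o package is importable."""
    return {
        "java_on_path": shutil.which("java") is not None,
        "h2o_installed": importlib.util.find_spec("h2o") is not None,
    }


status = check_h2o_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If `java_on_path` comes back as missing, `h2o.init()` will not be able to start a local server, so install a JDK or JRE first.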

Using AutoML

Enter the following code:

import h2o
from h2o.automl import H2OAutoML

h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 30 seconds
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
          training_frame = train,
          leaderboard_frame = test)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.10.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/ora/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmphnkk6mvy
  JVM stdout: /tmp/tmphnkk6mvy/h2o_ora_started_from_python.out
  JVM stderr: /tmp/tmphnkk6mvy/h2o_ora_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime:         02 secs
H2O cluster version:        3.16.0.4
H2O cluster version age:    16 days
H2O cluster name:           H2O_from_python_ora_l1c8zv
H2O cluster total nodes:    1
H2O cluster free memory:    6.976 Gb
H2O cluster total cores:    4
H2O cluster allowed cores:  4
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://127.0.0.1:54321
H2O connection proxy:       None
H2O internal security:      False
H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             3.6.0 final
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
model_id                                               auc       logloss
StackedEnsemble_AllModels_0_AutoML_20180201_101807     0.787269  0.554504
StackedEnsemble_BestOfFamily_0_AutoML_20180201_101807  0.783812  0.557977
GBM_grid_0_AutoML_20180201_101807_model_0              0.779296  0.562086
GBM_grid_0_AutoML_20180201_101807_model_2              0.779109  0.560944
GBM_grid_0_AutoML_20180201_101807_model_1              0.775373  0.564924
GBM_grid_0_AutoML_20180201_101807_model_3              0.773419  0.567071
GBM_grid_0_AutoML_20180201_101807_model_4              0.755339  0.630771
DRF_0_AutoML_20180201_101807                           0.740823  0.605117
XRT_0_AutoML_20180201_101807                           0.735793  0.604911
GLM_grid_0_AutoML_20180201_101807_model_0              0.686224  0.634806
# The leader model is stored here
aml.leader
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_0_AutoML_20180201_101807
No model summary for this model


ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.10619313022292985
RMSE: 0.32587287432821077
LogLoss: 0.36728814169184465
Null degrees of freedom: 7993
Residual degrees of freedom: 7986
Null deviance: 11050.743244827558
Residual deviance: 5872.202809369212
AIC: 5888.202809369212
AUC: 0.9569241768110353
Gini: 0.9138483536220705
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4510434688048974:
       0       1       Error   Rate
0      3039.0  708.0   0.189   (708.0/3747.0)
1      240.0   4007.0  0.0565  (240.0/4247.0)
Total  3279.0  4715.0  0.1186  (948.0/7994.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold  value      idx
max f1                       0.4510435  0.8942200  221.0
max f2                       0.3725997  0.9326728  253.0
max f0point5                 0.6114940  0.9036081  158.0
max accuracy                 0.5056487  0.8855392  200.0
max precision                0.9380498  1.0        0.0
max recall                   0.1695174  1.0        349.0
max specificity              0.9380498  1.0        0.0
max absolute_mcc             0.5056487  0.7701297  200.0
max min_per_class_accuracy   0.5333867  0.8849746  190.0
max mean_per_class_accuracy  0.5333867  0.8852705  190.0
Gains/Lift Table: Avg response rate: 53.13 %
group  cumulative_data_fraction  lower_threshold  lift  cumulative_lift  response_rate  cumulative_response_rate  capture_rate  cumulative_capture_rate  gain  cumulative_gain
1  0.0100075  0.9118284  1.8822698  1.8822698  1.0  1.0  0.0188368  0.0188368  88.2269838  88.2269838
2  0.0200150  0.9032538  1.8822698  1.8822698  1.0  1.0  0.0188368  0.0376737  88.2269838  88.2269838
3  0.0300225  0.8976278  1.8822698  1.8822698  1.0  1.0  0.0188368  0.0565105  88.2269838  88.2269838
4  0.0400300  0.8922493  1.8822698  1.8822698  1.0  1.0  0.0188368  0.0753473  88.2269838  88.2269838
5  0.0500375  0.8878612  1.8822698  1.8822698  1.0  1.0  0.0188368  0.0941841  88.2269838  88.2269838
6  0.1000751  0.8646618  1.8822698  1.8822698  1.0  1.0  0.0941841  0.1883683  88.2269838  88.2269838
7  0.1499875  0.8402188  1.8775524  1.8807000  0.9974937  0.9991660  0.0937132  0.2820815  87.7552369  88.0699971
8  0.2000250  0.8133639  1.8446244  1.8716754  0.98  0.9943715  0.0923004  0.3743819  84.4624441  87.1675448
9  0.2999750  0.7490769  1.7951059  1.8461629  0.9536921  0.9808173  0.1794208  0.5538027  79.5105903  84.6162910
10  0.4000500  0.6641860  1.6281634  1.7916290  0.865  0.9518449  0.1629385  0.7167412  62.8163409  79.1628951
11  0.5  0.5605949  1.4158250  1.7165058  0.7521902  0.9119340  0.1415117  0.8582529  41.5824997  71.6505769
12  0.5999500  0.4321141  0.9258223  1.5847802  0.4918648  0.8419516  0.0925359  0.9507888  -7.4177664  58.4780151
13  0.7000250  0.3225593  0.3788068  1.4123751  0.20125  0.7503574  0.0379091  0.9886979  -62.1193195  41.2375098
14  0.7999750  0.2409276  0.0989428  1.2482731  0.0525657  0.6631744  0.0098893  0.9985872  -90.1057155  24.8273085
15  0.8999249  0.1645587  0.0141347  1.1112038  0.0075094  0.5903531  0.0014128  1.0  -98.5865308  11.1203781
16  1.0  0.0558224  0.0  1.0  0.0  0.5312735  0.0  1.0  -100.0  0.0
ModelMetricsBinomialGLM: stackedensemble
** Reported on validation data. **

MSE: 0.18783313364822057
RMSE: 0.4333972007849388
LogLoss: 0.555646918852381
Null degrees of freedom: 2005
Residual degrees of freedom: 1998
Null deviance: 2777.4964239309966
Residual deviance: 2229.2554384357527
AIC: 2245.2554384357527
AUC: 0.7876166353248658
Gini: 0.5752332706497316
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3557814003482815:
       0      1       Error   Rate
0      463.0  495.0   0.5167  (495.0/958.0)
1      128.0  920.0   0.1221  (128.0/1048.0)
Total  591.0  1415.0  0.3106  (623.0/2006.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold  value      idx
max f1                       0.3557814  0.7470564  271.0
max f2                       0.1919339  0.8565737  352.0
max f0point5                 0.6149732  0.7418069  157.0
max accuracy                 0.5108126  0.7228315  198.0
max precision                0.9245128  1.0        0.0
max recall                   0.1152450  1.0        383.0
max specificity              0.9245128  1.0        0.0
max absolute_mcc             0.5108126  0.4439970  198.0
max min_per_class_accuracy   0.5377255  0.7156489  187.0
max mean_per_class_accuracy  0.5108126  0.7216001  198.0
Gains/Lift Table: Avg response rate: 52.24 %
group  cumulative_data_fraction  lower_threshold  lift  cumulative_lift  response_rate  cumulative_response_rate  capture_rate  cumulative_capture_rate  gain  cumulative_gain
1  0.0104686  0.9059318  1.8229735  1.8229735  0.9523810  0.9523810  0.0190840  0.0190840  82.2973464  82.2973464
2  0.0204387  0.8987579  1.7227099  1.7740644  0.9  0.9268293  0.0171756  0.0362595  72.2709924  77.4064420
3  0.0304088  0.8917429  1.8184160  1.7886059  0.95  0.9344262  0.0181298  0.0543893  81.8416031  78.8605932
4  0.0403789  0.8877437  1.6270038  1.7487042  0.85  0.9135802  0.0162214  0.0706107  62.7003817  74.8704175
5  0.0503490  0.8811732  1.8184160  1.7625085  0.95  0.9207921  0.0181298  0.0887405  81.8416031  76.2508503
6  0.1001994  0.8536235  1.7227099  1.7427082  0.9  0.9104478  0.0858779  0.1746183  72.2709924  74.2708215
7  0.1500499  0.8239882  1.5887214  1.6915498  0.83  0.8837209  0.0791985  0.2538168  58.8721374  69.1549796
8  0.2003988  0.7921643  1.5540398  1.6570013  0.8118812  0.8656716  0.0782443  0.3320611  55.4039755  65.7001253
9  0.3000997  0.7256447  1.4068798  1.5739044  0.735  0.8222591  0.1402672  0.4723282  40.6879771  57.3904415
10  0.4002991  0.6465250  1.2760814  1.4993559  0.6666667  0.7833126  0.1278626  0.6001908  27.6081425  49.9355946
11  0.5  0.5432006  1.0814790  1.4160305  0.565  0.7397807  0.1078244  0.7080153  8.1479008  41.6030534
12  0.6001994  0.4459747  0.9046846  1.3306646  0.4726368  0.6951827  0.0906489  0.7986641  -9.5315408  33.0664642
13  0.6999003  0.3588501  0.7273664  1.2447247  0.38  0.6502849  0.0725191  0.8711832  -27.2633588  24.4724723
14  0.8000997  0.2856659  0.6094717  1.1651697  0.3184080  0.6087227  0.0610687  0.9322519  -39.0528275  16.5169675
15  0.8998006  0.2029565  0.4593893  1.0869669  0.24  0.5678670  0.0458015  0.9780534  -54.0610687  8.6966865
16  1.0  0.0667329  0.2190289  1.0  0.1144279  0.5224327  0.0219466  1.0  -78.0971099  0.0
ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.19021917932290436
RMSE: 0.4361412378151192
LogLoss: 0.5597680244722494
Null degrees of freedom: 7993
Residual degrees of freedom: 7986
Null deviance: 11053.314251577507
Residual deviance: 8949.571175262323
AIC: 8965.571175262323
AUC: 0.7816115854774708
Gini: 0.5632231709549416
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.39478250550345867:
       0       1       Error   Rate
0      2001.0  1746.0  0.466   (1746.0/3747.0)
1      689.0   3558.0  0.1622  (689.0/4247.0)
Total  2690.0  5304.0  0.3046  (2435.0/7994.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold  value      idx
max f1                       0.3947825  0.7450529  256.0
max f2                       0.1915353  0.8603172  349.0
max f0point5                 0.5904633  0.7343911  164.0
max accuracy                 0.5084678  0.7069052  201.0
max precision                0.9356320  1.0        0.0
max recall                   0.0908959  1.0        389.0
max specificity              0.9356320  1.0        0.0
max absolute_mcc             0.5756414  0.4173777  171.0
max min_per_class_accuracy   0.5307710  0.7045637  190.0
max mean_per_class_accuracy  0.5756414  0.7085280  171.0
Gains/Lift Table: Avg response rate: 53.13 %
group  cumulative_data_fraction  lower_threshold  lift  cumulative_lift  response_rate  cumulative_response_rate  capture_rate  cumulative_capture_rate  gain  cumulative_gain
1  0.0100075  0.9088610  1.8352131  1.8352131  0.975  0.975  0.0183659  0.0183659  83.5213092  83.5213092
2  0.0200150  0.9004900  1.7410996  1.7881563  0.925  0.95  0.0174241  0.0357900  74.1099600  78.8156346
3  0.0300225  0.8947453  1.8587415  1.8116847  0.9875  0.9625  0.0186014  0.0543913  85.8741465  81.1684719
4  0.0400300  0.8870404  1.7175712  1.7881563  0.9125  0.95  0.0171886  0.0715799  71.7571227  78.8156346
5  0.0500375  0.8800252  1.7881563  1.7881563  0.95  0.95  0.0178950  0.0894749  78.8156346  78.8156346
6  0.1000751  0.8521482  1.6987485  1.7434524  0.9025  0.92625  0.0850012  0.1744761  69.8748528  74.3452437
7  0.1499875  0.8207969  1.6133741  1.7001653  0.8571429  0.9032527  0.0805274  0.2550035  61.3374146  70.0165333
8  0.2000250  0.7892257  1.5293442  1.6574334  0.8125  0.8805503  0.0765246  0.3315281  52.9344243  65.7433353
9  0.2999750  0.7135059  1.3569304  1.5573075  0.7209011  0.8273561  0.1356251  0.4671533  35.6930446  55.7307489
10  0.4000500  0.6280223  1.2046527  1.4690887  0.64  0.7804878  0.1205557  0.5877090  20.4652696  46.9088654
11  0.5  0.5422533  1.0412557  1.3835649  0.5531915  0.7350513  0.1040735  0.6917824  4.1255655  38.3564869
12  0.5999500  0.4489542  0.9281781  1.3076987  0.4931164  0.6947456  0.0927714  0.7845538  -7.1821882  30.7698728
13  0.7000250  0.3627716  0.7740835  1.2314135  0.41125  0.6542173  0.0774664  0.8620202  -22.5916529  23.1413487
14  0.7999750  0.2811813  0.6384169  1.1573237  0.3391740  0.6148554  0.0638097  0.9258300  -36.1583071  15.7323691
15  0.8999249  0.1935312  0.5394741  1.0887024  0.2866083  0.5783987  0.0539204  0.9797504  -46.0525916  8.8702362
16  1.0  0.0550341  0.2023440  1.0  0.1075  0.5312735  0.0202496  1.0  -79.7655992  0.0
# If you need to generate predictions on a test set, you can make
# predictions directly on the `"H2OAutoML"` object, or on the leader
# model object directly

preds = aml.predict(test)

# or:
preds = aml.leader.predict(test)
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
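
If the wall of metrics above looks intimidating, a few of the numbers are tied together by simple arithmetic, and checking those ties is a good way to get comfortable reading the report. The sketch below re-derives three values from the training-data section using figures copied from the output: Gini = 2·AUC − 1, the class-0 error rate from the confusion matrix, and the residual deviance as 2·n·LogLoss. The helper names are hypothetical, and the deviance relation is the usual binomial one that these numbers happen to satisfy, not something taken from H2O's documentation.

```python
def gini_from_auc(auc):
    """Gini coefficient as reported next to AUC: Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


def error_rate(misclassified, total):
    """Per-row error in the confusion matrix: misclassified / total."""
    return misclassified / total


def binomial_deviance(logloss, n_rows):
    """Residual deviance of a binomial model: 2 * n * mean log loss."""
    return 2.0 * n_rows * logloss


# Values copied from the "Reported on train data" section above.
train_auc = 0.9569241768110353
train_logloss = 0.36728814169184465
train_rows = 7994

train_gini = gini_from_auc(train_auc)
class0_error = error_rate(708.0, 3747.0)
train_deviance = binomial_deviance(train_logloss, train_rows)

print(train_gini, class0_error, train_deviance)
```

The same three checks work on the validation and cross-validation sections. Comparing the training AUC (about 0.957) with the cross-validation AUC (about 0.782) is also a useful reminder that the training metrics are optimistic.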

As you can see, it all essentially comes down to the following two statements:

aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
          training_frame = train,
          leaderboard_frame = test)

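**That is all it takes to train our models. These two statements not only cap the run time at 30 seconds; by default they also perform cross-validation and, as the leaderboard above shows, train five GBMs, one DRF, one XRT, one GLM, and two stacked ensembles. The key point: it only takes two lines of code!**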
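
To see at a glance which kinds of models that 30-second run produced, you can tally the leaderboard by algorithm family. The sketch below works on the model ids copied from the leaderboard output above rather than on a live `aml.leaderboard` frame, so it runs without H2O; treating the text before the first underscore as the family name is my own convention for this example.

```python
from collections import Counter

# Model ids copied from the leaderboard printed earlier in this post.
leaderboard_ids = [
    "StackedEnsemble_AllModels_0_AutoML_20180201_101807",
    "StackedEnsemble_BestOfFamily_0_AutoML_20180201_101807",
    "GBM_grid_0_AutoML_20180201_101807_model_0",
    "GBM_grid_0_AutoML_20180201_101807_model_2",
    "GBM_grid_0_AutoML_20180201_101807_model_1",
    "GBM_grid_0_AutoML_20180201_101807_model_3",
    "GBM_grid_0_AutoML_20180201_101807_model_4",
    "DRF_0_AutoML_20180201_101807",
    "XRT_0_AutoML_20180201_101807",
    "GLM_grid_0_AutoML_20180201_101807_model_0",
]


def model_family(model_id):
    """Return the algorithm prefix of a model id, e.g. 'GBM' or 'DRF'."""
    return model_id.split("_", 1)[0]


family_counts = Counter(model_family(m) for m in leaderboard_ids)
for family, count in family_counts.most_common():
    print(f"{family}: {count}")
```

With a longer `max_runtime_secs` you would expect more entries per family, since AutoML keeps searching until the time budget is used up.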

Summary

Pretty impressive for such a small amount of code, isn't it? If this has caught your interest, go ahead and dig deeper on your own.
GitHub: H2OAutoML
