全自动机器学习神器：H2OAutoML

Roaring Kitty

已于 2022-05-10 16:28:38 修改

阅读量1.3w

点赞数 6

分类专栏： BigData Machine_learning 机器学习文章标签： bigdata machinelearning AI datamining

于 2018-02-01 10:35:32 首次发布

本文链接：https://blog.csdn.net/u010665216/article/details/79224972

版权

Machine_learning 同时被 3 个专栏收录

58 篇文章

订阅专栏

机器学习

51 篇文章

订阅专栏

BigData

4 篇文章

订阅专栏

引言

做机器学习的老铁们在平时训练模型时，对交叉验证、模型集成想必是绞尽了脑汁。现在我将给各位介绍一个神器。叫做H2O。在读了这篇文章后，你将会：

了解H2O是什么，在哪些地方大放异彩
H2O的安装与初步使用
迫不及待地去安装使用（哈哈哈）

H2O概述

H2O是一个开源的、内存、分布式、快速和可扩展的机器学习和预测分析平台，允许诸位在大数据上构建机器学习模型，并在企业环境中轻松实现这些模型的搭建。

H2O的核心代码是用Java编写的。在H2O中，使用分布式的Key/Value存储来访问和引用所有节点和机器上的数据、模型、对象等。这些算法是在H2O的分布式Map / Reduce框架之上实现的，并且利用Java Fork / Join框架来实现多线程。数据是并行读取的，并分布在整个集群中，并以压缩的方式以列状格式存储在内存中。 H2O的数据解析器具有内置的智能功能，可以猜测传入数据集的模式，并支持以多种格式从多个源获取数据。

H2O的REST API允许外部程序或脚本通过HTTP上的JSON访问H2O的所有功能。 Rest API使用H2O的Web界面（Flow UI），R binding（H2O-R）和Python binding（H2O-Python）。

深度学习，Tree Ensembles和GLRM等各种有监督和无监督算法的速度，质量，易用性和模型部署方便使得H2O成为大数据数据科学非常受欢迎的API。

H2O的安装及AutoML的使用

H2O的安装（python）

H2O对 Scala, R, and Python并没有硬性要求，但是Java是必须要会的。接下来我们就讲下在python环境中安装H2O。
首先安装依赖文件:

$ pip install requests
$ pip install tabulate
$ pip install scikit-learn

接下来下载安装H2O

$ pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

软件大小100多M。

AutoML的使用

输入以下代码

import h2o
from h2o.automl import H2OAutoML

h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 30 seconds
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
          training_frame = train,
          leaderboard_frame = test)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.10.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/ora/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmphnkk6mvy
  JVM stdout: /tmp/tmphnkk6mvy/h2o_ora_started_from_python.out
  JVM stderr: /tmp/tmphnkk6mvy/h2o_ora_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.

H2O cluster uptime:	02 secs
H2O cluster version:	3.16.0.4
H2O cluster version age:	16 days
H2O cluster name:	H2O_from_python_ora_l1c8zv
H2O cluster total nodes:	1
H2O cluster free memory:	6.976 Gb
H2O cluster total cores:	4
H2O cluster allowed cores:	4
H2O cluster status:	accepting new members, healthy
H2O connection url:	http://127.0.0.1:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	XGBoost, Algos, AutoML, Core V3, Core V4
Python version:	3.6.0 final

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%

model_id	auc	logloss
StackedEnsemble_AllModels_0_AutoML_20180201_101807	0.787269	0.554504
StackedEnsemble_BestOfFamily_0_AutoML_20180201_101807	0.783812	0.557977
GBM_grid_0_AutoML_20180201_101807_model_0	0.779296	0.562086
GBM_grid_0_AutoML_20180201_101807_model_2	0.779109	0.560944
GBM_grid_0_AutoML_20180201_101807_model_1	0.775373	0.564924
GBM_grid_0_AutoML_20180201_101807_model_3	0.773419	0.567071
GBM_grid_0_AutoML_20180201_101807_model_4	0.755339	0.630771
DRF_0_AutoML_20180201_101807	0.740823	0.605117
XRT_0_AutoML_20180201_101807	0.735793	0.604911
GLM_grid_0_AutoML_20180201_101807_model_0	0.686224	0.634806

# The leader model is stored here
aml.leader

Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_0_AutoML_20180201_101807
No model summary for this model


ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.10619313022292985
RMSE: 0.32587287432821077
LogLoss: 0.36728814169184465
Null degrees of freedom: 7993
Residual degrees of freedom: 7986
Null deviance: 11050.743244827558
Residual deviance: 5872.202809369212
AIC: 5888.202809369212
AUC: 0.9569241768110353
Gini: 0.9138483536220705
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4510434688048974:

	0	1	Error	Rate
0	3039.0	708.0	0.189	(708.0/3747.0)
1	240.0	4007.0	0.0565	(240.0/4247.0)
Total	3279.0	4715.0	0.1186	(948.0/7994.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.4510435	0.8942200	221.0
max f2	0.3725997	0.9326728	253.0
max f0point5	0.6114940	0.9036081	158.0
max accuracy	0.5056487	0.8855392	200.0
max precision	0.9380498	1.0	0.0
max recall	0.1695174	1.0	349.0
max specificity	0.9380498	1.0	0.0
max absolute_mcc	0.5056487	0.7701297	200.0
max min_per_class_accuracy	0.5333867	0.8849746	190.0
max mean_per_class_accuracy	0.5333867	0.8852705	190.0

Gains/Lift Table: Avg response rate: 53.13 %

group	cumulative_data_fraction	lower_threshold	lift	cumulative_lift	response_rate	cumulative_response_rate	capture_rate	cumulative_capture_rate	gain	cumulative_gain
1	0.0100075	0.9118284	1.8822698	1.8822698	1.0	1.0	0.0188368	0.0188368	88.2269838	88.2269838
2	0.0200150	0.9032538	1.8822698	1.8822698	1.0	1.0	0.0188368	0.0376737	88.2269838	88.2269838
3	0.0300225	0.8976278	1.8822698	1.8822698	1.0	1.0	0.0188368	0.0565105	88.2269838	88.2269838
4	0.0400300	0.8922493	1.8822698	1.8822698	1.0	1.0	0.0188368	0.0753473	88.2269838	88.2269838
5	0.0500375	0.8878612	1.8822698	1.8822698	1.0	1.0	0.0188368	0.0941841	88.2269838	88.2269838
6	0.1000751	0.8646618	1.8822698	1.8822698	1.0	1.0	0.0941841	0.1883683	88.2269838	88.2269838
7	0.1499875	0.8402188	1.8775524	1.8807000	0.9974937	0.9991660	0.0937132	0.2820815	87.7552369	88.0699971
8	0.2000250	0.8133639	1.8446244	1.8716754	0.98	0.9943715	0.0923004	0.3743819	84.4624441	87.1675448
9	0.2999750	0.7490769	1.7951059	1.8461629	0.9536921	0.9808173	0.1794208	0.5538027	79.5105903	84.6162910
10	0.4000500	0.6641860	1.6281634	1.7916290	0.865	0.9518449	0.1629385	0.7167412	62.8163409	79.1628951
11	0.5	0.5605949	1.4158250	1.7165058	0.7521902	0.9119340	0.1415117	0.8582529	41.5824997	71.6505769
12	0.5999500	0.4321141	0.9258223	1.5847802	0.4918648	0.8419516	0.0925359	0.9507888	-7.4177664	58.4780151
13	0.7000250	0.3225593	0.3788068	1.4123751	0.20125	0.7503574	0.0379091	0.9886979	-62.1193195	41.2375098
14	0.7999750	0.2409276	0.0989428	1.2482731	0.0525657	0.6631744	0.0098893	0.9985872	-90.1057155	24.8273085
15	0.8999249	0.1645587	0.0141347	1.1112038	0.0075094	0.5903531	0.0014128	1.0	-98.5865308	11.1203781
16	1.0	0.0558224	0.0	1.0	0.0	0.5312735	0.0	1.0	-100.0	0.0

ModelMetricsBinomialGLM: stackedensemble
** Reported on validation data. **

MSE: 0.18783313364822057
RMSE: 0.4333972007849388
LogLoss: 0.555646918852381
Null degrees of freedom: 2005
Residual degrees of freedom: 1998
Null deviance: 2777.4964239309966
Residual deviance: 2229.2554384357527
AIC: 2245.2554384357527
AUC: 0.7876166353248658
Gini: 0.5752332706497316
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3557814003482815:

	0	1	Error	Rate
0	463.0	495.0	0.5167	(495.0/958.0)
1	128.0	920.0	0.1221	(128.0/1048.0)
Total	591.0	1415.0	0.3106	(623.0/2006.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.3557814	0.7470564	271.0
max f2	0.1919339	0.8565737	352.0
max f0point5	0.6149732	0.7418069	157.0
max accuracy	0.5108126	0.7228315	198.0
max precision	0.9245128	1.0	0.0
max recall	0.1152450	1.0	383.0
max specificity	0.9245128	1.0	0.0
max absolute_mcc	0.5108126	0.4439970	198.0
max min_per_class_accuracy	0.5377255	0.7156489	187.0
max mean_per_class_accuracy	0.5108126	0.7216001	198.0

Gains/Lift Table: Avg response rate: 52.24 %

group	cumulative_data_fraction	lower_threshold	lift	cumulative_lift	response_rate	cumulative_response_rate	capture_rate	cumulative_capture_rate	gain	cumulative_gain
1	0.0104686	0.9059318	1.8229735	1.8229735	0.9523810	0.9523810	0.0190840	0.0190840	82.2973464	82.2973464
2	0.0204387	0.8987579	1.7227099	1.7740644	0.9	0.9268293	0.0171756	0.0362595	72.2709924	77.4064420
3	0.0304088	0.8917429	1.8184160	1.7886059	0.95	0.9344262	0.0181298	0.0543893	81.8416031	78.8605932
4	0.0403789	0.8877437	1.6270038	1.7487042	0.85	0.9135802	0.0162214	0.0706107	62.7003817	74.8704175
5	0.0503490	0.8811732	1.8184160	1.7625085	0.95	0.9207921	0.0181298	0.0887405	81.8416031	76.2508503
6	0.1001994	0.8536235	1.7227099	1.7427082	0.9	0.9104478	0.0858779	0.1746183	72.2709924	74.2708215
7	0.1500499	0.8239882	1.5887214	1.6915498	0.83	0.8837209	0.0791985	0.2538168	58.8721374	69.1549796
8	0.2003988	0.7921643	1.5540398	1.6570013	0.8118812	0.8656716	0.0782443	0.3320611	55.4039755	65.7001253
9	0.3000997	0.7256447	1.4068798	1.5739044	0.735	0.8222591	0.1402672	0.4723282	40.6879771	57.3904415
10	0.4002991	0.6465250	1.2760814	1.4993559	0.6666667	0.7833126	0.1278626	0.6001908	27.6081425	49.9355946
11	0.5	0.5432006	1.0814790	1.4160305	0.565	0.7397807	0.1078244	0.7080153	8.1479008	41.6030534
12	0.6001994	0.4459747	0.9046846	1.3306646	0.4726368	0.6951827	0.0906489	0.7986641	-9.5315408	33.0664642
13	0.6999003	0.3588501	0.7273664	1.2447247	0.38	0.6502849	0.0725191	0.8711832	-27.2633588	24.4724723
14	0.8000997	0.2856659	0.6094717	1.1651697	0.3184080	0.6087227	0.0610687	0.9322519	-39.0528275	16.5169675
15	0.8998006	0.2029565	0.4593893	1.0869669	0.24	0.5678670	0.0458015	0.9780534	-54.0610687	8.6966865
16	1.0	0.0667329	0.2190289	1.0	0.1144279	0.5224327	0.0219466	1.0	-78.0971099	0.0

ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.19021917932290436
RMSE: 0.4361412378151192
LogLoss: 0.5597680244722494
Null degrees of freedom: 7993
Residual degrees of freedom: 7986
Null deviance: 11053.314251577507
Residual deviance: 8949.571175262323
AIC: 8965.571175262323
AUC: 0.7816115854774708
Gini: 0.5632231709549416
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.39478250550345867:

	0	1	Error	Rate
0	2001.0	1746.0	0.466	(1746.0/3747.0)
1	689.0	3558.0	0.1622	(689.0/4247.0)
Total	2690.0	5304.0	0.3046	(2435.0/7994.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.3947825	0.7450529	256.0
max f2	0.1915353	0.8603172	349.0
max f0point5	0.5904633	0.7343911	164.0
max accuracy	0.5084678	0.7069052	201.0
max precision	0.9356320	1.0	0.0
max recall	0.0908959	1.0	389.0
max specificity	0.9356320	1.0	0.0
max absolute_mcc	0.5756414	0.4173777	171.0
max min_per_class_accuracy	0.5307710	0.7045637	190.0
max mean_per_class_accuracy	0.5756414	0.7085280	171.0

Gains/Lift Table: Avg response rate: 53.13 %

group	cumulative_data_fraction	lower_threshold	lift	cumulative_lift	response_rate	cumulative_response_rate	capture_rate	cumulative_capture_rate	gain	cumulative_gain
1	0.0100075	0.9088610	1.8352131	1.8352131	0.975	0.975	0.0183659	0.0183659	83.5213092	83.5213092
2	0.0200150	0.9004900	1.7410996	1.7881563	0.925	0.95	0.0174241	0.0357900	74.1099600	78.8156346
3	0.0300225	0.8947453	1.8587415	1.8116847	0.9875	0.9625	0.0186014	0.0543913	85.8741465	81.1684719
4	0.0400300	0.8870404	1.7175712	1.7881563	0.9125	0.95	0.0171886	0.0715799	71.7571227	78.8156346
5	0.0500375	0.8800252	1.7881563	1.7881563	0.95	0.95	0.0178950	0.0894749	78.8156346	78.8156346
6	0.1000751	0.8521482	1.6987485	1.7434524	0.9025	0.92625	0.0850012	0.1744761	69.8748528	74.3452437
7	0.1499875	0.8207969	1.6133741	1.7001653	0.8571429	0.9032527	0.0805274	0.2550035	61.3374146	70.0165333
8	0.2000250	0.7892257	1.5293442	1.6574334	0.8125	0.8805503	0.0765246	0.3315281	52.9344243	65.7433353
9	0.2999750	0.7135059	1.3569304	1.5573075	0.7209011	0.8273561	0.1356251	0.4671533	35.6930446	55.7307489
10	0.4000500	0.6280223	1.2046527	1.4690887	0.64	0.7804878	0.1205557	0.5877090	20.4652696	46.9088654
11	0.5	0.5422533	1.0412557	1.3835649	0.5531915	0.7350513	0.1040735	0.6917824	4.1255655	38.3564869
12	0.5999500	0.4489542	0.9281781	1.3076987	0.4931164	0.6947456	0.0927714	0.7845538	-7.1821882	30.7698728
13	0.7000250	0.3627716	0.7740835	1.2314135	0.41125	0.6542173	0.0774664	0.8620202	-22.5916529	23.1413487
14	0.7999750	0.2811813	0.6384169	1.1573237	0.3391740	0.6148554	0.0638097	0.9258300	-36.1583071	15.7323691
15	0.8999249	0.1935312	0.5394741	1.0887024	0.2866083	0.5783987	0.0539204	0.9797504	-46.0525916	8.8702362
16	1.0	0.0550341	0.2023440	1.0	0.1075	0.5312735	0.0202496	1.0	-79.7655992	0.0

# If you need to generate predictions on a test set, you can make
# predictions directly on the `"H2OAutoML"` object, or on the leader
# model object directly

preds = aml.predict(test)

# or:
preds = aml.leader.predict(test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%

大家可以发现，本质输入下面两行代码

aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
          training_frame = train,
          leaderboard_frame = test)

**就能训练好我们的模型，上面两行代码不仅指定了模型运行的时间：30s，而且默认进行交叉验证以及训练了7个GBM1个DRF模型，及两个集成模型。关键是只要两行代码！！！ **

总结

我给大家介绍的这个工具是不是很神奇?感兴趣的读者们，可以自行做进一步研究哈。
github：H2OAutoML