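1. Create and run an experiment with the default configuration:
The make_experiment utility quickly creates a runnable experiment object; calling that object's run method starts training and returns the trained model. Only the training data train_data is required; every other argument is optional. If the target column is not named y, specify it with the target parameter.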
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)
out[]:
Pipeline(steps=[('data_clean',
                 DataCleanStep(...)),
                ('estimator',
                 GreedyEnsemble(...))])
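As the output shows, training produces a Pipeline whose final estimator is an ensemble of several models.
If your training data is in CSV or Parquet format and the file extension is ".csv" or ".parquet", you can create the experiment directly from the file path; make_experiment loads the data into a DataFrame automatically, for example: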
from hypergbm import make_experiment
train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)
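Because loading from a path depends on the file extension, it can help to check the extension before creating the experiment. The following is a minimal sketch using only the two extensions named above; is_loadable_path is a hypothetical helper for illustration and is not part of HyperGBM.

```python
from pathlib import Path

# Extensions the text above says make_experiment can load directly from a path.
SUPPORTED_SUFFIXES = {".csv", ".parquet"}


def is_loadable_path(path):
    """Return True if the file extension is one of those listed as supported.

    Hypothetical helper for illustration; it is not part of HyperGBM.
    """
    return Path(path).suffix in SUPPORTED_SUFFIXES


print(is_loadable_path("/path/to/mydata.csv"))      # True
print(is_loadable_path("/path/to/mydata.parquet"))  # True
print(is_loadable_path("/path/to/mydata.txt"))      # False
```

For any other format, load the data into a DataFrame yourself and pass the DataFrame as train_data.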
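2. Set the maximum number of trials (max_trials):
By default, an experiment created by make_experiment stops searching after at most 10 trials (parameter combinations). In practice, a max_trials of 30 or more is recommended.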
from hypergbm import make_experiment
train_data = ...
experiment = make_experiment(train_data, max_trials=50)
...
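3. Cross-validation:
The cv parameter controls whether cross-validation is used. With cv=False, cross-validation is disabled and the model is trained with a classic train_test_split holdout; with cv=True (the default), cross-validation is enabled, and the number of folds is set with num_folds (default: 3).
Sample code with cross-validation enabled: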
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)
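To make the effect of num_folds concrete, here is a plain-Python sketch of how K-fold cross-validation partitions row indices. It only illustrates the general technique; kfold_indices is a hypothetical helper, and HyperGBM's own splitting (for example shuffling or stratification) may differ.

```python
def kfold_indices(n_rows, num_folds):
    """Split row indices 0..n_rows-1 into num_folds contiguous validation folds.

    Returns a list of (train_indices, valid_indices) pairs, one per fold.
    Illustrative only; no shuffling or stratification is applied.
    """
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_rows, num_folds)
    folds, start = [], 0
    for i in range(num_folds):
        size = base + (1 if i < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        folds.append((train, valid))
        start += size
    return folds


# 748 rows split into 5 folds, matching num_folds=5 in the example above.
splits = kfold_indices(748, 5)
print([len(valid) for _, valid in splits])  # prints [150, 150, 150, 149, 149]
```

Every row is used for validation exactly once and for training in the remaining folds, which is why cross-validation needs no separate evaluation set.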
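4. Specify the evaluation dataset (eval_data):
When cross-validation is disabled, training needs an evaluation dataset in addition to the training dataset. You can pass one to make_experiment through eval_data, for example: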
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split
train_data = dsutils.load_blood()
train_data, eval_data = train_test_split(train_data, test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)
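When cross-validation is disabled and eval_data is not given, the experiment splits part of train_data off as the evaluation set; the size of that split is set with eval_size, for example: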
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)
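5. Specify the evaluation metric:
For experiments created with make_experiment, the default metric is accuracy for classification tasks and rmse for regression tasks. Use the reward_metric parameter to choose a different one, for example: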
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)
The supported metrics are:
- accuracy
- auc
- f1
- logloss
- mse
- mae
- msle
- precision
- rmse
- r2
- recall
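Some of these metrics are better when larger and others when smaller, which matters if you build a searcher yourself and must pass optimize_direction, as in the MCTSSearcher example in section 7. The sketch below groups the listed metrics by direction. The helper optimize_direction_for is hypothetical, and the 'min' string is an assumption made by analogy with the 'max' value that appears in this document.

```python
# Metrics from the list above, grouped by whether a larger or smaller value is better.
HIGHER_IS_BETTER = {"accuracy", "auc", "f1", "precision", "recall", "r2"}
LOWER_IS_BETTER = {"logloss", "mse", "mae", "msle", "rmse"}


def optimize_direction_for(metric):
    """Return 'max' or 'min' for a metric name from the list above.

    Hypothetical helper; 'min' is an assumed counterpart of the 'max'
    value used in the MCTSSearcher example.
    """
    if metric in HIGHER_IS_BETTER:
        return "max"
    if metric in LOWER_IS_BETTER:
        return "min"
    raise ValueError(f"unknown metric: {metric!r}")


print(optimize_direction_for("auc"))   # max
print(optimize_direction_for("rmse"))  # min
```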
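6. Set the number of trials and the early stopping policy:
With make_experiment, the early stopping policy is configured through the parameters early_stopping_rounds, early_stopping_time_limit, and early_stopping_reward.
Sample code that limits the search time to at most 3 hours: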
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=300, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
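The three early stopping parameters suggest three independent stop conditions. The following toy tracker sketches one plausible reading of them: stop after a number of trials without a new best reward, after a time limit in seconds, or once a target reward is reached. The class EarlyStopper and its argument names are hypothetical, and the exact semantics inside HyperGBM are assumed rather than taken from its source.

```python
import time


class EarlyStopper:
    """Toy early-stopping tracker for trial rewards, where higher is better."""

    def __init__(self, max_no_improvement=None, time_limit=None, expected_reward=None):
        self.max_no_improvement = max_no_improvement  # trials without a new best
        self.time_limit = time_limit                  # seconds since creation
        self.expected_reward = expected_reward        # stop once this is reached
        self.best = None
        self.stale = 0
        self.start = time.monotonic()

    def update(self, reward):
        """Record one trial's reward; return the reason to stop, or None to continue."""
        if self.best is None or reward > self.best:
            self.best, self.stale = reward, 0
        else:
            self.stale += 1
        if self.expected_reward is not None and self.best >= self.expected_reward:
            return "reward"
        if self.max_no_improvement is not None and self.stale >= self.max_no_improvement:
            return "rounds"
        if self.time_limit is not None and time.monotonic() - self.start >= self.time_limit:
            return "time"
        return None


# Stop after 3 trials without improvement, or after 3 hours, whichever comes first.
stopper = EarlyStopper(max_no_improvement=3, time_limit=3600 * 3)
for trial_no, reward in enumerate([0.70, 0.74, 0.73, 0.74, 0.72, 0.71], start=1):
    reason = stopper.update(reward)
    if reason:
        print(f"stop after trial {trial_no}: {reason}")  # stop after trial 5: rounds
        break
```

In this run the best reward 0.74 is set at trial 2, and trials 3 to 5 bring no improvement, so the search stops at trial 5 well before max_trials would be reached.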
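7. Specify the search algorithm (searcher):
HyperGBM searches for parameters with the search algorithms in Hypernets: EvolutionSearcher (the default), MCTSSearcher, and RandomSearcher. Select one through the searcher parameter of make_experiment, given either as the searcher class or as the searcher's name (str).
Sample code: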
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher='random')
estimator = experiment.run()
print(estimator)
You can also create the searcher object yourself and pass it when creating the experiment, for example:
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from hypernets.tabular.datasets import dsutils
my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
                           max_node_space=20,
                           optimize_direction='max')
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)
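For readers new to the idea of a searcher, the sketch below shows the simplest strategy, random search, over a toy search space in plain Python. It illustrates the concept only: the search space, toy_reward, and random_search are all hypothetical and do not use or reproduce the Hypernets searcher classes.

```python
import random

# A toy search space: each hyperparameter has a finite list of candidate values.
SEARCH_SPACE = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200],
}


def toy_reward(params):
    """Stand-in for training and scoring a model; purely illustrative."""
    return -abs(params["learning_rate"] - 0.05) - abs(params["max_depth"] - 5) / 10


def random_search(space, reward_fn, max_trials, seed=0):
    """Sample max_trials random combinations and keep the one with the highest reward."""
    rng = random.Random(seed)
    best_params, best_reward = None, float("-inf")
    for _ in range(max_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        reward = reward_fn(params)
        if reward > best_reward:
            best_params, best_reward = params, reward
    return best_params, best_reward


best_params, best_reward = random_search(SEARCH_SPACE, toy_reward, max_trials=10)
print(best_params, best_reward)
```

Evolutionary and MCTS searchers differ from this in how they pick the next combination: they use the rewards of earlier trials instead of sampling independently each time.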
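8. Model ensembling:
To get better model performance, make_experiment enables ensembling by default and combines the 20 best models. The ensemble_size parameter sets how many models take part; setting ensemble_size to 0 disables ensembling.
Sample code that changes the number of ensembled models: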
train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
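The final estimator printed earlier is named GreedyEnsemble. As a rough illustration of what a greedy ensemble can look like, the sketch below implements generic greedy forward selection: models are added one at a time (repeats allowed), each time choosing the one that most improves the score of the averaged predictions. This is an assumption about the general technique, not HyperGBM's actual implementation; the function names and toy data are hypothetical.

```python
def accuracy(y_true, probs):
    """Fraction of rows where thresholding the probability at 0.5 matches the label."""
    return sum((p >= 0.5) == bool(y) for y, p in zip(y_true, probs)) / len(y_true)


def greedy_ensemble(y_true, model_probs, ensemble_size):
    """Pick models one at a time (repeats allowed) to maximize the averaged score.

    model_probs: one list of predicted probabilities per model, on held-out data.
    Returns (chosen model indices, score of the final averaged prediction).
    """
    chosen = []
    total = [0.0] * len(y_true)  # running sum of the chosen models' probabilities
    score = None
    for _ in range(ensemble_size):
        best_idx, best_score = None, -1.0
        for idx, probs in enumerate(model_probs):
            blended = [(t + p) / (len(chosen) + 1) for t, p in zip(total, probs)]
            candidate = accuracy(y_true, blended)
            if candidate > best_score:
                best_idx, best_score = idx, candidate
        chosen.append(best_idx)
        total = [t + p for t, p in zip(total, model_probs[best_idx])]
        score = best_score
    return chosen, score


y_true = [1, 0, 1, 1, 0, 0]
model_probs = [
    [0.9, 0.2, 0.4, 0.8, 0.6, 0.1],  # model 0: 4 of 6 correct on its own
    [0.6, 0.7, 0.9, 0.3, 0.2, 0.3],  # model 1: 4 of 6 correct on its own
    [0.4, 0.4, 0.6, 0.6, 0.4, 0.6],  # model 2: 4 of 6 correct on its own
]
print(greedy_ensemble(y_true, model_probs, ensemble_size=2))  # ([0, 1], 1.0)
```

Each toy model alone gets 4 of 6 rows right, but averaging models 0 and 1 classifies all 6 correctly, which is the kind of gain ensembling aims for.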
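9. Adjust the log level:
To see progress information during training, set the log level with log_level, which may be a str or an int. Python's logging package defines the available levels. Setting verbose to 1 additionally prints more detailed information.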
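Since log_level accepts either a str or an int, it is useful to know how the two forms correspond in Python's standard logging package. The sketch below uses only the standard library; to_level_number is a hypothetical helper, and how HyperGBM itself interprets the value is not shown here.

```python
import logging


def to_level_number(log_level):
    """Normalize a str or int log level to the numeric value used by logging."""
    if isinstance(log_level, int):
        return log_level
    # For a known level name, logging.getLevelName returns the numeric level.
    level = logging.getLevelName(log_level.upper())
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {log_level!r}")
    return level


print(to_level_number("INFO"))           # 20
print(to_level_number(logging.WARNING))  # 30
```

Under the standard logging definitions, 'INFO' and 20 name the same level.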
Sample code that sets the log level to INFO:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)
out[]:
14:24:33 I hypernets.tabular.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 -
Trial No:1
--------------------------------------------------------------
(1)estimator_options.hp_or: 0
(2)numeric_imputer_0.strategy: most_frequent
(3)numeric_scaler_optional_0.hp_opt: True
...
14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
                 DataCleanStep(...)),
                ('estimator',
                 GreedyEnsemble(...))])