0.Official Documentation
For the parameters of each function below, see the official documentation:
https://pycaret.org/regression/
1.Setting up Environment
The setup() function initializes the PyCaret environment and creates the transformation pipeline that prepares the data for modeling and deployment. setup() must be called before any other PyCaret function. It requires a dataset of type DataFrame and the name of the target column; all other parameters are optional.
#import regression module
from pycaret.regression import *
#initialize the setup (in Notebook env)
exp_reg = setup(boston, target = 'medv')
#initialize the setup (in Non-Notebook env)
exp_reg = setup(boston, target = 'medv', html = False)
#initialize the setup (remote runs like Kaggle or GitHub Actions)
exp_reg = setup(boston, target = 'medv', silent = True)
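Under the hood, setup() builds a preprocessing pipeline and a train/hold-out split. As a rough conceptual sketch, not PyCaret's actual implementation, the equivalent manual steps look like this (the toy frame below stands in for the boston data; setup() defaults to roughly a 70/30 split):

```python
# Conceptual sketch of what setup() automates (simplified; the real
# pipeline also handles encoding, outliers, feature types, etc.).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# toy stand-in for the boston dataframe used above
boston = pd.DataFrame({
    "rm":   [6.5, 5.9, 7.1, 6.0, 6.8, 5.5],
    "age":  [65.2, 78.9, 61.1, 45.8, 54.2, 90.0],
    "medv": [24.0, 21.6, 34.7, 20.1, 28.7, 16.5],
})

X = boston.drop(columns="medv")   # features
y = boston["medv"]                # target column, as in target='medv'

# hold-out split (setup() defaults to roughly a 70/30 train/test split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# simple numeric imputation, one of the steps setup() configures
X_train = pd.DataFrame(SimpleImputer().fit_transform(X_train),
                       columns=X.columns)
```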
2.Compare Models
- This function trains every model in the PyCaret model library and evaluates each one with K-fold cross-validation.
- To return the top N models, use the n_select parameter to specify how many models to return.
- When turbo is set to True, the 'kr', 'ard', and 'mlp' models are skipped, because they can take a long time to train. To include these three models, set turbo to False.
# return best model
best = compare_models()
# return best model based on MAPE
best = compare_models(sort = 'MAPE') #default is 'R2'
# compare specific models
best_specific = compare_models(whitelist = ['dt','rf','xgboost'])
# blacklist certain models
best_specific = compare_models(blacklist = ['catboost'])
# return top 3 models based on R2
top3 = compare_models(n_select = 3)
3. Create Model
- The create_model() function creates a single model and evaluates it with K-fold cross-validation (K defaults to 10).
- It returns a trained model object; setup() must be run before calling create_model().
# train linear regression model
lr = create_model('lr') #lr is the id of the model
# check the model library to see all models
models()
# train rf model using 5 fold CV
rf = create_model('rf', fold = 5)
# train svm model without CV
svm = create_model('svm', cross_validation = False)
# train xgboost model with max_depth = 10
xgboost = create_model('xgboost', max_depth = 10)
# train xgboost model on gpu
xgboost_gpu = create_model('xgboost', tree_method = 'gpu_hist', gpu_id = 0) #0 is gpu-id
# train multiple lightgbm models with n learning_rate
import numpy as np
lgbms = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1,1,0.1)]
# train custom model
from gplearn.genetic import SymbolicRegressor
symreg = SymbolicRegressor(generations = 50) #the gplearn parameter is 'generations'
sc = create_model(symreg)
When the model you need does not exist in the PyCaret model library, follow the last example above and pass a scikit-learn-compatible estimator object to create_model().
4.Tune Model
- The tune_model() function tunes the hyperparameters of the given model and evaluates the tuned model with K-fold cross-validation.
- It returns a trained model object.
# train a decision tree model with default parameters
dt = create_model('dt')
# tune hyperparameters of decision tree
tuned_dt = tune_model(dt)
# tune hyperparameters with increased n_iter
tuned_dt = tune_model(dt, n_iter = 50)
# tune hyperparameters to optimize MAE
tuned_dt = tune_model(dt, optimize = 'MAE') #default is 'R2'
# tune hyperparameters with custom_grid
params = {"max_depth": np.random.randint(1, int(len(boston.columns)*.85), 20),
          "max_features": np.random.randint(1, len(boston.columns), 20),
          "min_samples_leaf": [2,3,4,5,6],
          "criterion": ["mse", "friedman_mse", "mae"] #regression criteria; "gini"/"entropy" are classification-only
         }
tuned_dt_custom = tune_model(dt, custom_grid = params)
# tune multiple models dynamically
top3 = compare_models(n_select = 3)
tuned_top3 = [tune_model(i) for i in top3]
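tune_model() performs a random search over a hyperparameter grid with cross-validation. The same idea can be sketched directly with scikit-learn's RandomizedSearchCV (toy data; the search space below is illustrative, not PyCaret's actual grid):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.rand(100) * 0.1

# illustrative search space, analogous to tune_model's custom_grid
params = {"max_depth": [2, 3, 4, 5, None],
          "min_samples_leaf": [1, 2, 4, 8]}

search = RandomizedSearchCV(DecisionTreeRegressor(random_state=0),
                            param_distributions=params,
                            n_iter=10,     # cf. tune_model(dt, n_iter=50)
                            cv=5,          # K-fold CV, as in tune_model
                            scoring="r2",  # cf. optimize='R2'
                            random_state=0)
search.fit(X, y)
tuned_dt = search.best_estimator_   # the tuned model object
```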
5.Ensemble Model
Ensemble learning is a process of combination: several individual learners (base learners, i.e. weak learners) are trained on the training set and then combined, via some strategy, into a single strong learner with better performance.
The main ensemble methods are Boosting, Bagging, and Stacking.
- ensemble_model() combines individual learners; the method parameter selects the strategy: 'Bagging' (the default) or 'Boosting'.
- Stacking is handled separately by the stack_models() function.
- The estimator parameter specifies the individual learner (model) to ensemble.
# create a decision tree model
dt = create_model('dt')
# ensemble decision tree model with 'Bagging'
bagged_dt = ensemble_model(dt)
# ensemble decision tree model with 'Bagging' with 100 n_estimators
bagged_dt = ensemble_model(dt, n_estimators = 100)
# ensemble decision tree model with 'Boosting'
boosted_dt = ensemble_model(dt, method = 'Boosting')
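With the default 'Bagging' method, ensemble_model() wraps the estimator in a bagging meta-estimator: many copies of the model are trained on bootstrap samples of the data, and their predictions are averaged. A scikit-learn sketch of the same idea on toy data (not PyCaret's exact internals; BaggingRegressor's default base estimator is a decision tree):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.rand(200) * 0.1

# 10 bootstrapped trees, predictions averaged (cf. n_estimators=100 above)
bagged = BaggingRegressor(n_estimators=10, random_state=0).fit(X, y)
pred = bagged.predict(X)
```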
6.Blend Models
Given several trained weak learners, blending is the task of combining their predictions into a better final prediction.
- It combines individual learners by averaging their predictions to produce the final result.
- By default, the individual learners are all models in the model library (with turbo set to True, some models are skipped, as discussed earlier); you can also specify the individual learners by passing a list of trained models via the estimator_list parameter.
# train a voting regressor on all models in library
blender = blend_models()
# train a voting regressor on specific models
dt = create_model('dt')
rf = create_model('rf')
adaboost = create_model('ada')
blender_specific = blend_models(estimator_list = [dt,rf,adaboost])
# train a voting regressor dynamically
blender_top5 = blend_models(compare_models(n_select = 5))
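blend_models() builds a voting regressor: each individual model predicts, and the final output is the average of their predictions. scikit-learn's VotingRegressor shows the same idea on toy data (a sketch, not PyCaret's internals):

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=0)

# final prediction = unweighted mean of the individual models' predictions
blender = VotingRegressor([("lr", lr), ("dt", dt)]).fit(X, y)
pred = blender.predict([[2.5]])
```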
7.Stack Models
Stacking (sometimes called stacked generalization) trains a model to combine the outputs of several other models: first train multiple different models, then train a meta model that takes the earlier models' outputs as inputs and produces the final prediction.
- By default, the meta model is Linear Regression; use the meta_model parameter to specify a different one.
- The restacking parameter defaults to True, so the dataset's original features are also passed as inputs to the meta model.
- The function returns a container holding the individual learners and the meta model.
# train individual models for stacking
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')
# stack trained models
stacked_models = stack_models(estimator_list=[dt,rf,ada,ridge,knn])
# stack trained models dynamically
top7 = compare_models(n_select = 7)
stacked_models = stack_models(estimator_list = top7[1:], meta_model = top7[0])
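Stacking as described above, including the restacking behaviour of passing the original features to the meta model alongside the base models' predictions, maps onto scikit-learn's StackingRegressor (a toy sketch, not PyCaret's exact implementation):

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] + X[:, 1] + rng.rand(100) * 0.1

stack = StackingRegressor(
    estimators=[("dt", DecisionTreeRegressor(random_state=0)),
                ("ridge", Ridge())],
    final_estimator=LinearRegression(),  # cf. the default Linear Regression meta model
    passthrough=True,                    # cf. restacking=True: raw features also reach the meta model
    cv=5,
)
stack.fit(X, y)
pred = stack.predict(X)
```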
8.Create Stacknet
- Base-layer models are passed via the estimator_list parameter; layers are organized as sub-lists within the estimator_list object.
- The restacking parameter defaults to False; when set to True, the dataset's original features are also passed as inputs to the meta model.
# create models
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')
# create stacknet
stacknet = create_stacknet(estimator_list =[[dt,rf],[ada,ridge,knn]])
Although this function is provided, the developers themselves discourage its use:
WARNING : This function will be deprecated in future release of PyCaret 2.x.
9.Plot Model
Plots the results of a trained model: pass the model object and the desired plot type as arguments.
# create a model
lr = create_model('lr')
# plot a model
plot_model(lr)
The available plot types (e.g. 'residuals', the default, 'error', and 'feature') are listed in the plot_model section of the official documentation.
10.Evaluate Model
Presents all of the plots above in a user interface: click a plot name to render the corresponding figure.
# create a model
lr = create_model('lr')
# evaluate a model
evaluate_model(lr)
11.Interpret Model
This function only supports tree-based algorithms.
The following passage explains the principle behind it, for readers who want the details:
This function is implemented based on the SHAP (SHapley Additive exPlanations), which is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.
For more information : https://shap.readthedocs.io/en/latest/
# create a model
dt = create_model('dt')
# interpret overall model
interpret_model(dt)
# correlation shap plot
interpret_model(dt, plot = 'correlation')
# interactive reason plot
interpret_model(dt, plot = 'reason')
# reason plot at observation level
interpret_model(dt, plot = 'reason', observation = 1) #observation 1 for testset
Using this function may require installing additional dependency packages; follow the prompts to install them.
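The Shapley value underlying SHAP averages a feature's marginal contribution over all orderings in which features can be added. A minimal hand computation for two features, using a toy value function rather than the shap library, illustrates the idea:

```python
from itertools import permutations

# toy "value function": the model's output for each subset of features
v = {frozenset(): 0.0,
     frozenset({"A"}): 10.0,
     frozenset({"B"}): 20.0,
     frozenset({"A", "B"}): 40.0}

def shapley(feature, features):
    """Average marginal contribution of `feature` over all feature orderings."""
    total = 0.0
    orders = list(permutations(features))
    for order in orders:
        before = frozenset(order[:order.index(feature)])
        total += v[before | {feature}] - v[before]
    return total / len(orders)

phi_a = shapley("A", ["A", "B"])  # 0.5*(10-0) + 0.5*(40-20) = 15.0
phi_b = shapley("B", ["A", "B"])  # 0.5*(20-0) + 0.5*(40-10) = 25.0
# the contributions sum to the full model output: 15 + 25 = 40
```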
12.Predict Model
- Use predict_model() to generate predictions on new data, i.e. data the model never saw during training.
- estimator: the trained model
- data: the data to predict on
- Make sure the format of the dataset to be predicted matches the format of the dataset used to train the model.
# train linear regression model
lr = create_model('lr')
# predictions on hold-out set
lr_pred_holdout = predict_model(lr)
# predictions on new dataset
lr_pred_new = predict_model(lr, data = new_data) #new_data is pd dataframe
13.Finalize Model
Retrains the model on the complete dataset (training plus test/hold-out set) to prepare it for deployment.
# create a model
lr = create_model('lr')
# finalize model
lr_final = finalize_model(lr)
14.Deploy Model
Deploys the model to a server; currently only AWS is supported.
# create a model
lr = create_model('lr')
# deploy model
deploy_model(model = lr, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})
15.Save Model
Saves the model as a pickle file in the current working directory, containing the transformation pipeline and the trained model object. Pass the model object and a file name.
# create a model
lr = create_model('lr')
# save a model
save_model(lr, 'lr_model_23122019')
16.Load Model
Loads a previously saved transformation pipeline and trained model from the specified file into the current Python session; the file must be a pickle file.
saved_lr = load_model('lr_model_23122019')
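Conceptually, save_model()/load_model() serialize the whole pipeline-plus-model object and restore it later. The same round trip, sketched with the standard pickle module on a plain scikit-learn model (PyCaret additionally bundles its transformation pipeline into the saved object):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

model = LinearRegression().fit(X, y)

# save: serialize the fitted model to bytes (save_model writes a .pkl file)
blob = pickle.dumps(model)

# load: restore it in a fresh session and predict as before
restored = pickle.loads(blob)
pred = restored.predict([[4.0]])  # same predictions as the original model
```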