Python model deployment methods
Choosing the best model is a key step after feature selection in any data science project. It consists of applying the most suitable algorithms (supervised or unsupervised) to obtain the best predictions. Automated model selection methods for high-dimensional datasets include Libra and PyCaret. A unicorn data scientist needs to master the most advanced automated model selection methods. In this article, we review 2 of the best automated model selection methods used by Kaggle winners, both of which can be implemented in a few lines of Python.
For this article, we will analyze the sample chocolate bar rating dataset, which you can find here.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/81da67386ebf5fd308dbb59e6f2b15f8.png)
This is a challenging dataset: after feature selection, it keeps 20 of the original 3,400 features, the ones that correlate with the target feature ‘review date’.
1. Libra
The challenge is to find the best-performing combination of techniques so that you can minimize the error in your predictions. Libra provides out-of-the-box automated supervised machine learning that optimizes machine (or deep) learning pipelines, automatically searching for the best learning algorithms (neural network, SVM, decision tree, KNN, etc.) and the best hyperparameters in seconds. Click here to see a complete list of estimators/models available in Libra.
Here is an example that predicts the review_date feature of the chocolate rating dataset, a complex multiclass classification problem (labels: 2006–2020).

    # import libraries
    !pip install libra
    from libra import client

    # open the dataset
    a_client = client('../input/preprocess-choc/dfn.csv')
    print(a_client)

    # choose the model
    a_client.neural_network_query('review_date', epochs=20)
    a_client.analyze()

![Image for post](https://i-blog.csdnimg.cn/blog_migrate/dbb4ebd9cf2f1d5399241b88eb6631b9.png)
Libra produces a neural network with an accuracy of 0.796 before optimization and 0.860 after. Overfitting is reduced from train/test = 0.796 vs. 0.764 (a gap of 0.032) to train/test = 0.860 vs. 0.851 (a gap of 0.009), with the best number of neural network layers going from 3 to 6.
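The overfitting gap is simply the train accuracy minus the test accuracy. As a quick check, the sketch below recomputes it from the accuracies quoted above; the helper name `overfit_gap` is hypothetical and is not part of Libra.

```python
def overfit_gap(train_acc: float, test_acc: float) -> float:
    """Return the train-minus-test accuracy gap, rounded to 3 decimals."""
    return round(train_acc - test_acc, 3)


# Accuracies quoted in the text, before and after optimization.
gap_before = overfit_gap(0.796, 0.764)
gap_after = overfit_gap(0.860, 0.851)

print(f"gap before optimization: {gap_before}")
print(f"gap after optimization:  {gap_after}")
```

A smaller gap means the model scores almost as well on held-out data as on the data it was trained on.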
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/734b1ba8d8368b1c7b246a897f7b88f0.png)
2. PyCaret
PyCaret is a simple, easy-to-use sequential pipeline that integrates preprocessing functions with hyperparameter tuning and ensembling of trained models.

    # import libraries
    !pip install pycaret
    import pandas as pd
    from pycaret.classification import *

    # open the dataset
    dfn = pd.read_csv('../input/preprocess-choc/dfn.csv')

    # define target label and parameters
    exp1 = setup(dfn, target = 'review_date', feature_selection = True)

![Image for post](https://i-blog.csdnimg.cn/blog_migrate/9f01f6f36cbf05a66b8e05af3ce52a2d.png)
All the preprocessing steps are applied within setup(). It offers more than 40 features to prepare data for machine learning, including missing value imputation, categorical variable encoding, label encoding (converting yes or no into 1 or 0), and the train-test split, all performed automatically when setup() is initialized. For more details about PyCaret’s preprocessing abilities, click here.
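To make those steps concrete, here is a minimal plain-Python sketch of three of them: mean imputation, yes/no label encoding, and a train-test split. It is not PyCaret's implementation, and the column names (`cocoa`, `organic`, `year`) and the 20 % hold-out size are assumptions chosen only for illustration.

```python
import random
from statistics import mean

# Hypothetical toy rows; the real dataset has different columns.
rows = [
    {"cocoa": 70.0, "organic": "yes", "year": 2019},
    {"cocoa": None, "organic": "no", "year": 2020},
    {"cocoa": 85.0, "organic": "yes", "year": 2019},
    {"cocoa": 60.0, "organic": "no", "year": 2020},
    {"cocoa": 75.0, "organic": "yes", "year": 2020},
]

# 1. Missing value imputation: fill missing numeric values with the column mean.
cocoa_mean = mean(r["cocoa"] for r in rows if r["cocoa"] is not None)
for r in rows:
    if r["cocoa"] is None:
        r["cocoa"] = cocoa_mean

# 2. Label encoding: convert yes/no into 1/0.
for r in rows:
    r["organic"] = 1 if r["organic"] == "yes" else 0

# 3. Train-test split: shuffle with a fixed seed, then hold out 20 % of the rows.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
n_test = max(1, int(len(shuffled) * 0.2))
test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

print(len(train_rows), "training rows,", len(test_rows), "test row(s)")
```

Doing these steps by hand for every experiment is error-prone, which is the main argument for letting a single setup() call handle them.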
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/2295398b182b13bfb6ae9a1383a5e202.png)
PyCaret compares models in one line, returning a table of k-fold cross-validated scores for each algorithm and metric.

    compare_models(fold = 5, turbo = True)

![Image for post](https://i-blog.csdnimg.cn/blog_migrate/8f33e37cb8f4f8b948cd66ca6826760f.png)
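For readers unfamiliar with k-fold cross-validation, the following plain-Python sketch shows the idea behind the scores in that table: the data is cut into k folds, each fold is used once as the test set, and the per-fold accuracies are averaged. The toy majority-class "model" and all function names are assumptions made for illustration; PyCaret's internals differ.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_majority_class(train_x, train_y):
    """Toy model: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda _x: most_common


def cross_val_accuracy(features, labels, fit, k=5):
    """Average test accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predict = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(features[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)


# Hypothetical toy data: 8 reviews from 2019 followed by 2 from 2020.
features = list(range(10))
labels = [2019] * 8 + [2020] * 2
cv_score = cross_val_accuracy(features, labels, fit_majority_class, k=5)
print("mean 5-fold accuracy:", cv_score)
```

Averaging over folds gives a more stable estimate than a single train/test split, which is why model comparison tables are usually built on it.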
PyCaret has over 60 open-source, ready-to-use algorithms. Click here to see a complete list of estimators/models available in PyCaret.
The tune_model function automatically tunes the hyperparameters of a machine learning model. PyCaret uses random grid search over a predefined search space. This function returns a table with k-fold cross-validated scores.
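The idea of a random search over a predefined space can be sketched in a few lines of plain Python. This is an illustration of the general technique, not PyCaret's code: the search space, the `toy_score` function (a stand-in for a cross-validated accuracy), and all names are assumptions.

```python
import random


def random_search(score_fn, search_space, n_iter=10, seed=0):
    """Sample n_iter random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter from its predefined list.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a decision tree.
search_space = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}


def toy_score(params):
    """Stand-in for cross-validated accuracy; highest at depth 6, leaf size 2."""
    return (1.0
            - 0.02 * abs(params["max_depth"] - 6)
            - 0.01 * abs(params["min_samples_leaf"] - 2))


best_params, best_score = random_search(toy_score, search_space, n_iter=50)
print(best_params, best_score)
```

Unlike an exhaustive grid search, the cost is fixed by the number of iterations rather than by the size of the grid, which matters when many hyperparameters are tuned at once.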
The ensemble_model function is used for ensembling trained models. It takes only a trained model object and returns a table with k-fold cross-validated scores.

    # creating a decision tree model ('dt' is the estimator ID, passed as a string)
    dt = create_model('dt')

    # ensembling a trained dt model
    dt_bagged = ensemble_model(dt)

    # plot_model dt
    plot_model(estimator = dt, plot = 'learning')

    # plot_model dt_bagged
    plot_model(estimator = dt_bagged, plot = 'learning')

![Image for post](https://i-blog.csdnimg.cn/blog_migrate/b8e21349e5614e266895ae4b4ac75bb8.png)
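Bagging, the technique suggested by the dt_bagged name above, trains several copies of a base model on bootstrap samples of the data and lets them vote. The plain-Python sketch below illustrates that idea with a toy 1-nearest-neighbour base learner in place of a decision tree; the data, names, and the choice of 25 models are assumptions, and this is not how PyCaret implements it.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_1nn(sample):
    """Toy base learner: predict the label of the nearest training point."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def bagged_predict(models, x):
    """Majority vote over the predictions of all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature dataset of (value, label) pairs.
data = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
        (8.0, "high"), (8.5, "high"), (9.0, "high")]

rng = random.Random(0)
models = [fit_1nn(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagged_predict(models, 1.2), bagged_predict(models, 8.8))
```

Because each base model sees a slightly different sample, their individual errors tend to cancel out in the vote, which is why bagging often reduces the variance of unstable learners such as decision trees.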
Performance evaluation and diagnostics of a trained machine learning model can be done using the plot_model function.

    # hyperparameter tuning
    tuned_dt = tune_model(dt, optimize = "Accuracy", n_iter = 500)

    # evaluate model
    evaluate_model(estimator=tuned_dt)

    # plot tuned dt confusion matrix
    plot_model(tuned_dt, plot = 'confusion_matrix')

![Image for post](https://i-blog.csdnimg.cn/blog_migrate/e684f3d04af97097948d2bb45dc30ea8.png)
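A confusion matrix counts, for every true label, how often each label was predicted; the diagonal holds the correct predictions. The short plain-Python sketch below builds one by hand on hypothetical labels and predictions (not results from the chocolate dataset) to show how accuracy falls out of it.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Share of predictions on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels and predictions, for illustration only.
labels = [2018, 2019, 2020]
y_true = [2018, 2018, 2019, 2019, 2019, 2020]
y_pred = [2018, 2019, 2019, 2019, 2020, 2020]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", round(accuracy(matrix), 3))
```

Off-diagonal cells show which classes get confused with each other, which is more informative than a single accuracy number in a multiclass problem like this one.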
Finally, the predict_model function can be used to make predictions on an unseen dataset.
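
    # predict labels (called without a data argument, predict_model scores the hold-out set;
    # pass data=<new DataFrame> to predict on a new dataset)
    predictions = predict_model(dt)
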
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/18a90c94e293b90cf9e09f5d01f41817.png)
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/980488d83062d14842f8635ab0ed82e4.png)
Summing Up
Refer to these links for the complete algorithm selection for the chocolate bar review date estimation using these 2 methods:
https://jovian.ml/yeonathan/libra
https://jovian.ml/yeonathan/pycaret
This brief overview is a reminder of the importance of using the right algorithm selection methods in data science. The scope of this post is to cover 2 of the best Python automated algorithm selection methods for high-dimensional datasets and to share useful documentation.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/586e9ba474ab6b91702dc333e6c79d9d.png)
I hope you enjoyed it. Keep exploring!