Comprehensive Guide to Machine Learning (Part 3 of 3)

Latest recommended article published on 2022-09-07 14:14:38

weixin_26750481

Tags: machine learning, python, artificial intelligence

Original article: https://medium.com/analytics-vidhya/comprehensive-guide-to-machine-learning-part-3-of-3-907cd1dd41dd

Welcome to the 3rd and final part of the “Comprehensive Guide to Machine Learning” series. Over the course of this series, we have looked at several crucial concepts that play a significant role in developing a good machine learning model.

Concepts like data cleansing and EDA help a great deal in acquiring a deeper understanding of the data. Similarly, feature engineering and feature selection help ensure that only useful and relevant data is fed to the machine learning model. You can get a quick recap of these concepts by visiting the links below:

In this final post, we will look at actual model development and at how to perform model validation to ensure that the model behaves as expected and predicts correct results.

7) Baseline model building

The first step towards model building is to decide which machine learning framework or algorithm to use, based on the type of task at hand. Below are a few that work well in the majority of scenarios:

Neural Networks (TensorFlow / Keras / PyTorch)
XGBoost
LightGBM
CatBoost
Logistic Regression
Random Forest
Support Vector Machines (SVM)

The “Pet Adoption” dataset that I have been referring to throughout this series has categorical variables for the majority of its features. So I decided to build the final machine learning model using the “CatBoost” framework, since it works well with categorical features.

For the sake of experiment, I also tried out models based on Neural Networks, XGBoost and LightGBM, but CatBoost gave me the best results of all these frameworks. You can find the other codebase at the link below.

The image below shows the final CatBoost model I used for making predictions.
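A configuration along these lines can be sketched in Python as follows. Only the five settings discussed in this post are taken from the article; `build_model` is a hypothetical helper name introduced here for illustration, and it assumes the `catboost` package is installed.

```python
# Settings named in this post; everything else is left at CatBoost defaults.
catboost_params = {
    "objective": "MultiClass",            # multi-class classification
    "eval_metric": "TotalF1",             # metric reported during training
    "class_weights": [0.165, 0.185, 1],   # counteract class imbalance
    "learning_rate": 0.025,               # step size of each boosting round
    "reg_lambda": 0.009,                  # L2 regularisation strength
}


def build_model(params):
    """Hypothetical helper: create a CatBoost classifier from a parameter dict."""
    # Imported lazily so the dict above can be inspected without catboost installed.
    from catboost import CatBoostClassifier

    return CatBoostClassifier(**params)
```

Calling `build_model(catboost_params).fit(X_train, y_train, eval_set=(X_val, y_val))` would then train the model, where the `X_*` and `y_*` names stand for your own data splits.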

Let’s go over a few of the crucial model configurations to better understand the nuts and bolts of the CatBoost framework.

objective='MultiClass'

In the “Pet Adoption” dataset, we are classifying between 4 different breed categories and 3 different pet categories. Since each target has more than two classes, we need to set the model objective to “MultiClass”.

eval_metric='TotalF1'

We need to set an evaluation metric for the CatBoost model so that we can judge whether the model is behaving as expected and converging during training. For classification tasks, the commonly used evaluation metrics are:

Accuracy
F1 Score
Area Under the Curve (AUC)
Precision Score
Recall Score

Generally, for highly imbalanced datasets like “Pet Adoption”, it’s preferable to use F1 Score or AUC as the evaluation criterion, since accuracy alone can look high even when the minority classes are predicted poorly. You can learn more about these metrics by visiting the link below.
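To see why accuracy can mislead on imbalanced data, here is a small pure-Python sketch with made-up labels. It uses a plain macro-averaged F1 (the unweighted mean of per-class F1 scores), which is an assumption for illustration and may not match exactly how CatBoost’s `TotalF1` averages across classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_true = [0] * 90 + [1] * 10
# A useless model that always predicts the majority class.
y_pred = [0] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

The always-majority model scores 0.900 on accuracy but only about 0.474 on macro F1, because it never finds a single sample of the minority class.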

class_weights=[0.165, 0.185, 1]

With imbalanced datasets, machine learning models tend to become biased towards the majority class. To ensure that the model gives proper weight to all the prediction classes, we can set class weights.

The example below shows how to calculate the class weights for an imbalanced dataset.

Total Samples: 100

Class A: 60 samples

Class B: 30 samples

Class C: 10 samples

Class A %: (60/100) = 0.6

Class B %: (30/100) = 0.3

Class C %: (10/100) = 0.1

Class Weight = (Lowest Class %) / (Current Class %)

Class A weight: 0.1/0.6 = 0.1667

Class B weight: 0.1/0.3 = 0.3333

Class C weight: 0.1/0.1 = 1
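The formula above can be applied programmatically. The following is a minimal sketch; the function name `class_weights` and the dictionary layout are choices made here for illustration, not something taken from the original codebase.

```python
def class_weights(counts):
    """Weight each class by (smallest class share) / (its own share)."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    lowest = min(shares.values())
    return {name: round(lowest / share, 4) for name, share in shares.items()}


# Sample counts from the worked example: 60 / 30 / 10 out of 100.
counts = {"A": 60, "B": 30, "C": 10}
print(class_weights(counts))  # {'A': 0.1667, 'B': 0.3333, 'C': 1.0}
```

The rarest class always receives a weight of 1, and the more frequent a class is, the smaller its weight becomes.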

learning_rate=0.025

This hyper-parameter sets the speed at which the model converges towards the minimum of the loss function.

Learning rate is one of the most critical hyper-parameters in a machine learning model. Setting it too low will cause the model to train very slowly. On the other hand, setting it too high can cause the model to overshoot the minimum and result in poor predictions.
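The effect of a learning rate that is too low or too high can be illustrated with a toy example. The sketch below runs plain gradient descent on f(x) = x², whose minimum is at x = 0; it is a simplified stand-in, not CatBoost’s boosting procedure, and the three learning rates are arbitrary values picked for illustration.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                 # derivative of x**2
        x -= learning_rate * gradient    # move against the gradient
    return x


for lr in (0.01, 0.4, 1.1):
    print(f"learning_rate={lr}: final x = {descend(lr):.4g}")
```

With the smallest rate, x is still far from 0 after 20 steps; with the middle rate it is essentially at 0; with the largest rate every step overshoots the minimum and x ends up further away than where it started.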

I’d highly recommend going through the link below to get a deeper understanding of learning rate and its significance in machine learning models.

reg_lambda=0.009

This parameter sets the strength of L2 regularisation, which helps prevent model overfitting, a topic we already discussed in the 2nd post of this series.
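As a rough intuition for what an L2 penalty does, consider a toy one-variable linear model y = w·x fitted with a penalty of λ·w². This is a simplified illustration under that assumption, not how CatBoost applies `reg_lambda` to its trees; the data points are made up.

```python
def ridge_slope(xs, ys, reg_lambda):
    """Slope w minimising sum((y - w*x)**2) + reg_lambda * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg_lambda)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # made-up data lying exactly on y = 2x

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam}: slope = {ridge_slope(xs, ys, lam):.3f}")
```

The fitted slope shrinks from 2.000 to 1.867 to 1.167 as the penalty grows: stronger regularisation pulls the model towards simpler, smaller-magnitude solutions, which is the mechanism that limits overfitting.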

You can visit the link below for a better understanding of the rest of the CatBoost model parameters.

Also, I’d highly recommend going through the link below for a better understanding of ensemble machine learning models (CatBoost, XGBoost, LightGBM).

8) Hyper-parameters Tuning

Let’s first understand the difference between Model Parameters and Model Hyper-parameters.

Model Parameters: These are the parameters that the model determines on its own while training on the dataset provided. These are the fitted parameters.

Model Hyper-parameters: These are adjustable parameters that must be tuned, prior to model training, in order to obtain a model with optimal performance.

Hyper-parameters are important since they directly control the behaviour of the training process and have a significant impact on the performance of the resulting model.

For a CatBoost model, below are the critical hyper-parameters to fine-tune to achieve an optimally performing model.

Learning Rate: controls training speed
Regularisation Lambda: controls model overfitting
Subsample: controls the fraction of training rows sampled for each tree
Max Depth: controls depth of decision trees
Min Data in Leaf: controls model overfitting
Max Leaves: controls model complexity

I have used the “Optuna” Python library to automate the task of hyper-parameter tuning. You can visit the link below to get a better understanding of Optuna and how it works behind the scenes.

The image below shows the objective function created for Optuna, so that it can search for good hyper-parameters.

Once the objective function is set, we can use the commands below to let Optuna run free on the hyper-parameter search space.

Once Optuna has finished the number of trials provided, we can extract the best hyper-parameters by executing the commands below.
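The three steps described above (defining an objective, running the search, and reading back the best values) can be sketched as follows. This is not the author’s original code: `make_objective`, `run_search` and `score_fn` are hypothetical names, the search ranges are illustrative assumptions, and `run_search` assumes the `optuna` package is installed.

```python
def make_objective(score_fn):
    """Wrap a scoring function into an Optuna-style objective.

    `score_fn` is a hypothetical callable that trains a model with the given
    hyper-parameters and returns its validation F1-score.
    """
    def objective(trial):
        params = {
            # Search ranges below are illustrative assumptions.
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
            "max_depth": trial.suggest_int("max_depth", 4, 10),
        }
        return score_fn(params)

    return objective


def run_search(score_fn, n_trials=50):
    """Run the search and return the best hyper-parameters and best score."""
    import optuna  # imported lazily; requires the optuna package

    study = optuna.create_study(direction="maximize")  # higher F1 is better
    study.optimize(make_objective(score_fn), n_trials=n_trials)
    return study.best_params, study.best_value
```

In practice, `score_fn` would train a CatBoost model with the sampled values and return the F1-score on the validation split; the dictionary returned as the best parameters can then be merged into the final model configuration.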

9) Model validation

Once we have settled on our model and finished hyper-parameter tuning, the next step is to validate how the model performs on the “Validation” and “Test” datasets.

The scikit-learn Python library provides, among others, the two cross-validation strategies listed below.

K-Fold: For regression and balanced classification tasks
Stratified K-Fold: For imbalanced classification tasks

Since we are dealing with an imbalanced dataset, I chose “Stratified K-Fold” cross-validation with 5 splits. The images below show the model validation performed with it.
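To show what stratification means, here is a hand-rolled, pure-Python sketch that deals the samples of each class evenly across 5 folds. It is only meant to illustrate the idea; in a real project scikit-learn’s `StratifiedKFold` would be used instead, and the function name `stratified_folds` and the labels are made up for this example.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5):
    """Assign each sample index to a fold, keeping class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Made-up labels with a 60/30/10 class imbalance.
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10

for fold_number, val_indices in enumerate(stratified_folds(labels), start=1):
    class_counts = Counter(labels[i] for i in val_indices)
    print(f"fold {fold_number}: {dict(class_counts)}")
```

Every fold ends up with 12 samples of class A, 6 of class B and 2 of class C, so each validation fold mirrors the class distribution of the full dataset. A plain K-Fold split gives no such guarantee and could leave a fold with very few minority-class samples.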

The training results would look like the images below.

As we can see from the above image, the training F1-score is around 99% and the validation F1-score is around 93%. The gap between the two is modest, so the model does not appear to be underfitting or badly overfitting.

Next, let’s check the F1-score on the “Test” dataset.

As we can see, the test F1-score is around 90%, so the model also performs well on data it has not seen. We can further confirm this by checking the confusion matrix shown below.
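For readers unfamiliar with the confusion matrix, the following pure-Python sketch shows how one is built. The labels are made up for illustration and are not the results from this post; the layout (rows for true classes, columns for predicted classes) is a common convention assumed here.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    position = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(y_true, y_pred):
        matrix[position[truth]][position[guess]] += 1
    return matrix


# Made-up labels for three classes, purely for illustration.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

for row in confusion_matrix(y_true, y_pred, classes=[0, 1, 2]):
    print(row)
```

This prints `[2, 1, 0]`, `[0, 2, 0]` and `[1, 0, 2]`. The diagonal holds the correct predictions, and each off-diagonal cell shows which class was mistaken for which, something a single F1 number cannot reveal.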

I’d highly recommend going through the links below to acquire a better understanding of model validation techniques and the confusion matrix.

10) Making Predictions

Let’s quickly revisit everything we did up to this point.

We cleaned the data and performed EDA on it to gain a better understanding of it
We performed feature engineering to generate new insightful features for the model
We performed feature selection to discard irrelevant features
We performed train/validation/test split to prepare the datasets for model validation
We built the baseline model and performed hyper-parameter tuning to get the best set of hyper-parameters
We performed cross-validation to ensure that the model is behaving as expected

Phew! That’s a hell of a lot of steps to perform to build the final machine learning model. This brings us to the final showdown, which is making predictions on the test dataset.

I usually follow the same cross-validation approach: let the model train and predict on multiple K-fold splits, and then take the average of the predictions made across all the splits.

The image below shows the codebase for the same.
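The averaging step can be sketched in pure Python as below. It assumes that “averaging predictions” means averaging the class probabilities predicted by each fold’s model and then picking the most likely class; the function name `average_predictions` and the probabilities are made up for illustration and are not taken from the original codebase.

```python
def average_predictions(fold_probabilities):
    """Average class probabilities over folds, then pick the top class.

    `fold_probabilities[k][i][c]` is the probability that the model trained on
    fold k assigns to class c for test sample i.
    """
    n_folds = len(fold_probabilities)
    n_samples = len(fold_probabilities[0])
    n_classes = len(fold_probabilities[0][0])

    labels = []
    for i in range(n_samples):
        mean = [
            sum(fold[i][c] for fold in fold_probabilities) / n_folds
            for c in range(n_classes)
        ]
        labels.append(mean.index(max(mean)))  # index of the most likely class
    return labels


# Made-up probabilities from 3 fold models for 2 test samples and 3 classes.
fold_probabilities = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],
]
print(average_predictions(fold_probabilities))  # [0, 2]
```

For the second sample, one fold model favours class 1 while the other two favour class 2; averaging lets the majority view win, which tends to make the final predictions more stable than those of any single fold model.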

Concluding Remarks

This concludes the 3rd and final part of the comprehensive machine learning guide. I do hope that you gained insight into the nitty-gritty of machine learning model development, and I’m sure you’ll use the learnings from this series to build models of your own and make predictions on real-world problems.

As always, you can find the codebase for this post at the link below.

Do leave me your comments, feedback and challenges (if you’re facing any) and I’ll touch base with you individually to collaborate.

Also, please visit my blog (link below) to explore more on Machine Learning and Linux Computing.