Imbalanced Classification: A Complete Road Map


Imbalanced Classification

The very interesting problem of imbalanced classification is quite famous in articles and academic papers. Most of the work focuses on one part of the bigger picture, addressing a specific data set and discussing possible solutions. So eventually, you have to open more than 10 tabs in one browser to learn about the problem and its possible solutions. Here I have collected a complete road map so you can see the whole picture of all the steps you have to go through, from handling your data until you end up with an informative conclusion based on your question of interest.


Before starting this journey together, let's first talk about why imbalanced classification is important to industry people, not just the nerdy academics; which applications suffer from this problem by nature; and which other applications happen to have imbalanced classes due to customer behavior.


Imbalanced classification refers to having an unequal distribution of classes. In business terms, imagine you have released two products in the market and found that 90% of your customers prefer one product over the other. At some point, you will go back to your data team asking them to explain the customers' behavior based on customer characteristics, to understand this behavior and the potential change that would push them to pick the less-liked product, or to adjust that product based on the customers' preferences. There are many famous applications where imbalanced classification is expected by the nature of the problem, such as fraud detection, large claim losses in insurance applications, spam mail, hardware failure, etc. In other applications it just happens, due to unexpected customer behavior which you can't anticipate but have to deal with when it occurs.


In this article, I will go through the three general steps of imbalanced classification analysis, as previewed in the image below.


[Image by the author]

I will explain the details of the available options in each step, in addition to highlighting some pitfalls and tricks you need to be aware of when dealing with imbalanced data.


Data cleaning and preparation

This part of the process requires clear knowledge of your features and the interpretations you are targeting. Generally, you have to study the features of your data very well using some preliminary tools, such as descriptive statistics and a correlation matrix, to make sure you are not adding overlapping information to your model. In the case of highly correlated features, you can use principal component analysis, for example, to solve this problem.

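As a minimal sketch of this preliminary check (on synthetic data, so the features and thresholds here are purely illustrative), a correlation matrix exposes a redundant feature pair, and PCA then decorrelates the set:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 * 0.9 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# The correlation matrix reveals the redundant pair (x1, x2).
corr = np.corrcoef(X, rowvar=False)
print(corr[0, 1])  # close to 1 -> overlapping information

# PCA transforms the features into uncorrelated components.
X_pca = PCA(n_components=2).fit_transform(X)
print(np.corrcoef(X_pca, rowvar=False)[0, 1])  # close to 0
```

On real data you would inspect the full matrix (a heatmap helps) before deciding how many components to keep.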

It is important to spend a good amount of time on data cleaning and preparation; it eventually saves you a lot of effort in later steps.


The extra step you need in the data preparation for the machine learning model is to use a one-hot encoder in case you have categorical feature(s). Basically, this creates dummy variables; in other words, it turns the categories into features so you can observe their effect on the classification process. The final step here is to divide your data into training and test sets using the train_test_split function. A plain random split is not always appropriate.

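The two steps above can be sketched as follows; the data is synthetic and the column names are made up. Dummy variables are created here with pandas' `get_dummies` (one common equivalent of a one-hot encoder), and `stratify=y` makes the split preserve the class ratio, which matters precisely because a plain random split could leave the test set with almost no minority examples:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"] * 25,  # categorical feature
    "amount": range(100),                           # continuous feature
})
y = [0] * 90 + [1] * 10  # 90/10 class imbalance

# One indicator (dummy) column is created per category.
X = pd.get_dummies(df, columns=["color"])

# stratify=y keeps the 90/10 ratio in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(sum(y_test), "of", len(y_test), "test samples are minority class")
```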

Modeling

The two main schools of modeling are:


  • Engineer your data using preprocessing techniques, then use the models for balanced classification.

  • Use a model with a specially constructed cost function that penalizes misclassifying the minority class more heavily.


Preprocessing the data and then using traditional models for balanced data is the more common approach in articles and research papers. There are several preprocessing techniques, which fall into three main types: oversampling, undersampling, or a mixture of the two.


[Image by the author]

Almost all techniques work with continuous features, but not all of them are applicable if you have categorical feature(s). If you have only categorical features, you can use random oversampling or random undersampling. For a mixture of continuous and categorical features, the options are random oversampling, random undersampling, or SMOTE-NC.

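Random oversampling is simple enough to sketch with plain NumPy: minority rows are resampled with replacement until the classes are balanced. (In practice the imbalanced-learn package provides these techniques ready-made, e.g. RandomOverSampler, RandomUnderSampler, SMOTE, and SMOTENC; the sketch below is just to show the idea on synthetic data.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Draw minority indices with replacement until both classes match.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])

X_res, y_res = X[keep], y[keep]
print(np.bincount(y_res))  # both classes now have 90 samples
```

Note that because this works purely on row indices, it is equally valid for categorical features, which is why random over/undersampling remain the fallback options when SMOTE-style interpolation does not apply.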

The adapted models mentioned above can, to some extent, be tailored based on your chosen cost function.


But you need to be careful, as the effectiveness of these models heavily depends on the quality of the cost function.


There are some models that have built-in weighting, such as the AdaBoost classifier, and others where you can add weights using the class_weight argument, such as logistic regression and the ridge classifier.

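A minimal sketch of the class_weight option, on synthetic two-cluster data (the cluster means and sizes are made up for illustration): with class_weight="balanced", scikit-learn weights each class inversely to its frequency, which pushes the model to predict the minority class more often than the unweighted fit does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 90 majority points around (0, 0), 10 minority points around (2, 2).
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# The weighted model labels more points as the minority class.
print("plain:", plain.predict(X).sum(), "weighted:", weighted.predict(X).sum())
```

The same argument is accepted by RidgeClassifier, and boosting models such as AdaBoost achieve a similar effect through their sample reweighting.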

If the target of your model is to estimate the classification probability, then you need one more step: calibrating the resulting probability. Probability calibration is an important step when using models that don't have a probability-based structure, such as SVM, random forest, and gradient boosting. These types of models produce a probability-like score that needs to be calibrated. Other models, like logistic regression, don't need the extra calibration step, so you need to be aware of the structure of the model you are using.

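A short sketch of this calibration step using scikit-learn's CalibratedClassifierCV (here wrapping a linear SVM and using Platt scaling, method="sigmoid"; the data is synthetic and illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Imbalanced synthetic data: roughly 90/10 class split.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

svm = LinearSVC()  # exposes decision_function, but no predict_proba
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=3).fit(X, y)

# The wrapper maps the SVM's raw scores to calibrated probabilities.
proba = calibrated.predict_proba(X)[:, 1]
print(proba.min(), proba.max())  # all values lie in [0, 1]
```

method="isotonic" is the non-parametric alternative; it is more flexible but needs more data, which can be a concern when the minority class is small.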

Evaluation metric

Before going through this part, you should have a clear, concrete answer to the following question:


What is the target of the imbalanced-class analysis: label prediction or probability prediction?


The answer to this question decides precisely which metric you need to use. If your answer is label prediction, you can use the F0.5-score when false positives are more costly, the F1-score when false negatives and false positives are equally costly, and the F2-score when false negatives are more costly. If you are targeting probability prediction, you have two options: the log loss score and the Brier score.

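All of these metrics are available in scikit-learn; the F-family is one function, fbeta_score, where beta < 1 weights precision (false positives costly) and beta > 1 weights recall (false negatives costly). A tiny hand-made example, with one true positive and one false negative so precision is 1.0 and recall 0.5:

```python
from sklearn.metrics import brier_score_loss, fbeta_score, log_loss

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]              # 1 TP, 1 FN, 0 FP
y_prob = [0.1, 0.2, 0.1, 0.1, 0.3, 0.2, 0.1, 0.1, 0.8, 0.4]

# Label prediction: pick beta according to which error is more costly.
print(fbeta_score(y_true, y_pred, beta=0.5))  # F0.5: FP more costly
print(fbeta_score(y_true, y_pred, beta=1.0))  # F1: equal cost
print(fbeta_score(y_true, y_pred, beta=2.0))  # F2: FN more costly

# Probability prediction: score the predicted probabilities directly.
print(log_loss(y_true, y_prob))
print(brier_score_loss(y_true, y_prob))
```

Because precision exceeds recall here, F0.5 comes out highest and F2 lowest, matching the costs each variant emphasizes.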

Finally, to conclude the important remarks in this article: First, spend more time with your raw data and try to understand as much as you can. Second, there is no perfect model for all data; try different models and choose the best one for your data. Parameter tuning and cross-validation are also important. Third, be clear about your target so you can choose a suitable evaluation metric. Don't be tricked by metrics such as accuracy, which do not represent true performance in the case of imbalanced classes.


Thanks for reading; please feel free to share and start discussions in the comments.


Translated from: https://medium.com/@hannan.ahmed/imbalanced-classification-a-complete-road-map-9f88d16b092f
