Imbalanced Classification
Imbalanced classification is a well-known problem in articles and academic papers. Most of that work covers only one part of the big picture: it addresses a specific data set and discusses possible solutions for it. As a result, you end up with more than 10 browser tabs open just to learn about the problem and its possible solutions. Here I have put together a complete road map, so you can see every step you have to go through, from dealing with your data to reaching an informative conclusion about your question of interest.
Before we start this journey together, let's first talk about why imbalanced classification matters to people in industry, not just to academics. Which applications suffer from this problem by nature, and which others happen to have imbalanced classes because of customer behavior?
Imbalanced classification refers to a classification problem whose classes are unequally distributed. In business terms, imagine you have released two products, and you find that 90% of your customers prefer one over the other. At some point, you will go back to your data team and ask them to explain this behavior in terms of customer characteristics, so that you can understand it and identify the changes that would push customers toward the less popular product, or adjust that product to their preferences. Many well-known applications are imbalanced by nature, such as fraud detection, large claim losses in insurance, spam email, and hardware failure. Others become imbalanced because of customer behavior that you cannot anticipate but have to deal with when it happens.
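To make the two-product example concrete, here is a minimal sketch of how you might measure how imbalanced a set of labels is. The label names, the sample size of 1,000 customers, and the helper functions `class_distribution` and `imbalance_ratio` are all assumptions made for illustration; they do not come from any particular library or data set.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the data as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical data: 1,000 customers, 90% of whom prefer product A.
labels = ["product_a"] * 900 + ["product_b"] * 100

shares = class_distribution(labels)
ratio = imbalance_ratio(labels)

print(shares)  # share of customers per product
print(ratio)   # how many majority-class customers exist per minority-class customer
```

A ratio of 1 would mean perfectly balanced classes. In this sketch there are nine customers who prefer the popular product for every one who prefers the other.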
In this article, I will go through the three general steps of imbalanced classification analysis, as previewed in the image below.
I will explain the available options at each step in detail, and highlight some pitfalls and tricks you need to be aware of when dealing with imbalanced data.